Feature Selection for SVMs

J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, Tomaso Poggio, Vladimir Vapnik

Barnhill BioInformatics.com, Savannah, Georgia, USA; CBCL, MIT, Cambridge, Massachusetts, USA; AT&T Research Laboratories, Red Bank, USA; Royal Holloway, University of London, Egham, Surrey, UK.
Abstract

We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error; these bounds can be efficiently computed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA microarray data.
1 Introduction

In many supervised learning problems¹ feature selection is important for a variety of reasons: generalization performance, running time requirements, and constraints and interpretational issues imposed by the problem itself. In classification problems we are given ℓ data points x_i ∈ R^n labeled y_i ∈ {±1}, drawn i.i.d. from a probability distribution P(x, y). We would like to select a subset of features while preserving or improving the discriminative ability of a classifier. A brute force search over all possible subsets is known to be NP-complete [5], so the computational expense of an algorithm needs to be taken into account.

Support vector machines (SVMs) [13] have been extensively used as a classification tool with a great deal of success in a variety of areas, from object recognition [6, 11] to classification of cancer morphologies [10]. In this paper we introduce feature selection algorithms for SVMs which minimize generalization bounds via gradient descent, and are thus feasible to compute. Previous work on feature selection for SVMs has been limited to linear kernels [3] or linear probabilistic models [8]. Our approach can be applied to any kernel.

¹ In this article we restrict ourselves to the case of pattern recognition; however, the reasoning also applies to other domains.
The article is organized as follows. In section (2) we describe the feature selection problem, in section (3) we review SVMs and some of their generalization bounds, and in section (4) we introduce the new SVM feature selection method. Section (5) then describes results on toy and real-life data indicating the usefulness of our approach.
2 The Feature Selection Problem

The feature selection problem can be addressed in the following two ways: (1) given a fixed m ≪ n, find the m features with the smallest expected generalization error; or (2) for a fixed allowed factor of degradation, find the smallest number of features m whose expected generalization error is no more than that factor worse than the error obtained using all n features. In both of these problems the expected generalization error is of course unknown, and thus must be estimated. In this article we will consider problem (1); formulations for problem (2) are analogous.

Problem (1) is formulated as follows. Given a fixed set of functions y = f(x, α) we wish to find a preprocessing of the data x ↦ (x ∗ σ), σ ∈ {0, 1}^n (where ∗ denotes element-wise multiplication), and the parameters α of the function f, that give the minimum value of
τ(σ, α) = ∫ V(y, f((x ∗ σ), α)) dF(x, y)    (1)
subject to ||σ||_0 = m, where F(x, y) is unknown, V(·, ·) is a loss functional and ||σ||_0 is the L0 norm.

In the literature one distinguishes between two types of method for solving this problem: the so-called filter and wrapper methods [2]. Filter methods are defined as a preprocessing step to induction that can remove irrelevant attributes before induction occurs, and thus they aim to be valid for any set of functions f(x, α). For example, one popular filter method is to use Pearson correlation coefficients. The wrapper method, on the other hand, is defined as a search through the space of feature subsets using the estimated accuracy of an induction algorithm as the measure of goodness of a particular feature subset. Thus, one approximates τ(σ, α) by minimizing

τ_wrap(σ) = min_σ τ_alg(σ)    (2)

subject to σ ∈ {0, 1}^n, where τ_alg is the error estimate of a learning algorithm trained on data preprocessed with fixed σ. Wrapper methods can provide more accurate solutions than filter methods [9], but in general are more computationally expensive since the induction algorithm must be evaluated over each feature set (vector σ) considered, typically using performance on a hold-out set as a measure of goodness of fit.

In this article we introduce a feature selection algorithm for SVMs that takes advantage of the performance increase of wrapper methods whilst avoiding their computational complexity. In order to describe this method, we first review the SVM method and some of its properties.
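As a concrete illustration of the filter approach (a sketch we add here, not code from the paper; the function and variable names are our own), the following ranks features by absolute Pearson correlation with the labels and keeps the top m before training an SVM:

```python
# A minimal sketch of a correlation-based filter method (assumed helper names).
import numpy as np
from sklearn.svm import SVC

def pearson_filter(X, y, m):
    """Return the indices of the m features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:m]

# usage: features = pearson_filter(X_train, y_train, m=20)
#        clf = SVC(kernel="linear").fit(X_train[:, features], y_train)
```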
3 Support Vector Learning

Support Vector Machines [13] realize the following idea: map x ∈ R^n into a high (possibly infinite) dimensional space and construct an optimal hyperplane in this space. Different mappings x ↦ Φ(x) ∈ H construct different SVMs. The mapping Φ(·) is performed by a kernel function K(·, ·) which defines an inner product in H. The decision function given by an SVM is thus:

f(x) = w · Φ(x) + b = Σ_i α_i^0 y_i K(x_i, x) + b.    (3)
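As a small illustration (our own sketch; the names and the example kernel are assumptions, not from the paper), equation (3) can be evaluated directly from the support vectors, the multipliers α_i^0, the labels y_i and the bias b:

```python
# A minimal sketch of the SVM decision function in eq. (3):
#   f(x) = sum_i alpha_i^0 y_i K(x_i, x) + b
import numpy as np

def decision_function(x, support_vectors, alpha, y_sv, b, kernel):
    """Evaluate f(x) for one input x given the support-vector expansion."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha, y_sv, support_vectors)) + b

# example: a linear kernel K(u, v) = u . v
linear_kernel = lambda u, v: float(np.dot(u, v))
```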
The optimal hyperplane is the one with the maximal distance (in H space) to the closest image Φ(x_i) from the training data. For non-separable training data a generalization of this concept is used. This reduces to maximizing the following optimization problem:

W²(α) = Σ_{i=1}^ℓ α_i − (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j K(x_i, x_j)    (4)

under the constraints Σ_{i=1}^ℓ α_i y_i = 0 and C ≥ α_i ≥ 0, i = 1, ..., ℓ.

Suppose that the maximal distance (the margin) is equal to M and that the images Φ(x_1), ..., Φ(x_ℓ) of the training vectors x_1, ..., x_ℓ are within a sphere of radius R. Then the following theorem holds true [13].

Theorem 1 If images of training data of size ℓ belonging to a sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound

E P_err ≤ (1/ℓ) E{R²/M²} = (1/ℓ) E{R² W²(α^0)},    (5)

where the expectation is taken over sets of training data of size ℓ.

This theorem justifies the idea that the performance depends on the ratio E{R²/M²} and not simply on the large margin M, where R is controlled by the mapping function Φ(·).

Other bounds also exist; in particular Vapnik and Chapelle [4] derived an estimate using the concept of the span of support vectors.
Theorem 2 Under the assumption that the set of support vectors does not change when removing the example p,

E p_err^{ℓ−1} ≤ (1/ℓ) E Σ_{p=1}^ℓ Ψ( α_p^0 / (K_SV^{−1})_{pp} − 1 ),    (6)

where Ψ is the step function, K_SV is the matrix of dot products between support vectors, p_err^{ℓ−1} is the probability of test error for the machine trained on a sample of size ℓ − 1, and the expectations are taken over the random choice of the sample.
4 Feature Selection for SVMs

In the problem of feature selection we wish to minimize equation (1) over σ and α. The support vector method approximately minimizes the true risk over the set of functions f(x, α) = w · Φ(x) + b. We first enlarge the set of functions considered by the algorithm to f(x, α, σ) = w · Φ(x ∗ σ) + b. Note that the mapping Φ_σ(x) = Φ(x ∗ σ) can be represented by choosing the kernel function K_σ in equations (3) and (4):

K_σ(x, y) = K(x ∗ σ, y ∗ σ) = (Φ_σ(x) · Φ_σ(y))    (7)

for any kernel K.
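For instance (a sketch of ours, not code from the paper; the helper and its names are assumptions), the scaled kernel of equation (7) can be obtained by wrapping any base kernel so that both arguments are first multiplied element-wise by σ:

```python
# A minimal sketch of the scaled kernel in eq. (7): K_sigma(x, y) = K(x * sigma, y * sigma).
import numpy as np

def scaled_kernel(base_kernel, sigma):
    """Wrap base_kernel so that both inputs are first scaled element-wise by sigma."""
    return lambda x, y: base_kernel(x * sigma, y * sigma)

# example: second-order polynomial kernel with the third feature switched off
poly2 = lambda u, v: (np.dot(u, v) + 1.0) ** 2
K_sigma = scaled_kernel(poly2, sigma=np.array([1.0, 1.0, 0.0]))
```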
Thus for these kernels the bounds in Theorems (1) and (2) still hold. Hence, to minimize τ(σ, α) over σ and α we minimize the wrapper functional τ_wrap in equation (2), where τ_alg is given by the bound (5) or (6), with a fixed value of σ implemented by the kernel (7). Using equation (5) one minimizes over σ the criterion

R²W²(σ) = R²(σ) W²(α^0, σ),    (8)
where the radius R for the kernel K_σ can be computed by solving the following optimization problem [13]:

R²(σ) = max_β Σ_i β_i K_σ(x_i, x_i) − Σ_{i,j} β_i β_j K_σ(x_i, x_j)    (9)

subject to Σ_i β_i = 1, β_i ≥ 0, i = 1, ..., ℓ, and W²(α^0, σ) is defined as the maximum of functional (4) using kernel (7). Minimizing via the span bound is defined similarly.

Finding the minimum of R²W² over σ requires searching over all possible subsets of n features, which is a combinatorial problem. This is avoided by using search methods such as greedily adding or removing features (forward or backward selection) or hill climbing. All of these methods are expensive to compute if n is large. As an alternative to these approaches we suggest the following method: approximate the binary valued vector σ ∈ {0, 1}^n with a real valued vector σ ∈ R^n, and find the optimum value of σ by minimizing R²W², or some other criterion, by gradient descent. (We denote the k-th element of the n-dimensional vector σ by σ_k.) As explained in [4] the derivative of our criterion is:
∂(R²W²)(σ, α^0)/∂σ_k = R²(σ) ∂W²(α^0, σ)/∂σ_k + W²(α^0, σ) ∂R²(σ)/∂σ_k.    (10)

We estimate the minimum of τ(σ, α) by minimizing equation (8) in the space σ ∈ R^n using the gradients (10), with the following extra constraint which approximates integer programming:

R²W²(σ) + λ Σ_i (σ_i)^p    (11)

subject to Σ_i σ_i = m, σ_i ≥ 0, i = 1, ..., n. For large enough λ, as p → 0 only m elements of σ will be nonzero, approximating the optimization problem τ(σ, α). One can further simplify computations by considering a step-wise procedure to find m features. To do this one can minimize R²W²(σ) with σ unconstrained. One then constrains the q smallest values of σ to zero, and repeats the minimization until only m nonzero elements of σ remain.
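The following is a minimal sketch (our own illustration, not the authors' implementation) of this step-wise procedure for a linear kernel. It uses scikit-learn's SVC to evaluate W²(α^0, σ), a crude enclosing-sphere estimate in place of the exact R²(σ) of equation (9), and finite-difference gradients in place of the analytic derivatives (10); all function names and parameters are assumptions.

```python
# A rough sketch of the step-wise R^2 W^2 minimization (linear kernel, assumed helpers).
import numpy as np
from sklearn.svm import SVC

def r2w2(X, y, sigma, C=1.0):
    """Approximate R^2(sigma) * W^2(alpha^0, sigma) for the scaled data x * sigma."""
    Xs = X * sigma                                  # feature scaling x -> x * sigma
    svc = SVC(kernel="linear", C=C).fit(Xs, y)
    sv = Xs[svc.support_]
    coef = svc.dual_coef_.ravel()                   # alpha_i * y_i for the support vectors
    K = sv @ sv.T
    w2 = np.abs(coef).sum() - 0.5 * coef @ K @ coef  # dual objective, eq. (4)
    # crude stand-in for R^2 of eq. (9): squared radius of a sphere centred at the mean
    r2 = np.max(np.sum((Xs - Xs.mean(axis=0)) ** 2, axis=1))
    return r2 * w2

def select_features(X, y, m, q=1, steps=20, lr=1e-3, eps=1e-3):
    """Gradient steps on sigma, then zero the q smallest entries, until m features remain."""
    sigma = np.ones(X.shape[1])
    while np.count_nonzero(sigma) > m:
        active = np.where(sigma > 0)[0]
        for _ in range(steps):
            base = r2w2(X, y, sigma)
            grad = np.zeros_like(sigma)
            for k in active:                        # finite-difference surrogate for eq. (10)
                s = sigma.copy()
                s[k] += eps
                grad[k] = (r2w2(X, y, s) - base) / eps
            sigma[active] = np.clip(sigma[active] - lr * grad[active], 0.0, None)
        nonzero = np.where(sigma > 0)[0]            # constrain the q smallest values to zero
        drop = nonzero[np.argsort(sigma[nonzero])[:min(q, len(nonzero) - m)]]
        sigma[drop] = 0.0
    return np.where(sigma > 0)[0]
```

In practice one would use the analytic gradients from [4] and the exact radius of equation (9) rather than these surrogates; the sketch only illustrates the control flow of the step-wise procedure.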
5 Experiments

5.1 Toy data

We compared standard SVMs, our feature selection algorithms, and three classical filter methods used to select features, followed by SVM training. The three filter methods chose the m largest features according to: Pearson correlation coefficients, the Fisher criterion score², and the Kolmogorov-Smirnov test (as used in [1]³). The Pearson coefficients and Fisher criterion cannot model nonlinear dependencies. In the two following artificial datasets our objective was to assess the ability of the algorithm to select a small number of target features in the presence of irrelevant and redundant features.
Linear problem Six dimensions out of 202 were relevant. The probability of y = 1 or −1 was equal. The first three features {x_1, x_2, x_3} were drawn as x_i = y N(i, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3. The remaining three features {x_4, x_5, x_6} were drawn as x_i = y N(i − 3, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3. The remaining features are noise, x_i = N(0, 20), i = 7, ..., 202.

Nonlinear problem There were an equal number of points with y = 1 and y = −1. Two dimensions out of 52 were relevant. If y = −1 then {x_1, x_2} were drawn from N(μ_1, Σ) or N(μ_2, Σ) with equal probability, where μ_1 = {−3/4, −3}, μ_2 = {3/4, 3} and Σ = I; if y = 1 then {x_1, x_2} were drawn again from two normal distributions with equal probability, with μ_1 = {3, −3}, μ_2 = {−3, 3} and the same Σ as before. The rest of the features are noise, x_i = N(0, 20), i = 3, ..., 52.
In the linear problem the first six features have redundancy and the rest of the features are irrelevant. In the nonlinear problem all but the first two features are irrelevant. We used a linear SVM for the linear problem and a second order polynomial kernel for the nonlinear problem. For the filter methods and the SVM with feature selection we selected the 2 best features.

The results are shown in Figure 1 for various training set sizes, taking the average test error on 500 samples over 30 runs of each training set size. The Fisher score (not shown in the graphs due to space constraints) performed almost identically to correlation coefficients. Standard SVMs perform poorly for both problems. Our SVM feature selection methods outperformed the filter methods, with forward selection being marginally better than gradient descent. Among the filter methods only the Kolmogorov-Smirnov test improved performance in the nonlinear problem.

² F(r) = (μ_r^+ − μ_r^−)² / ((σ_r^+)² + (σ_r^−)²), where μ_r^± is the mean value of the r-th feature in the positive and negative classes and σ_r^± is the corresponding standard deviation.

³ KS_tst(r) = √ℓ sup( P̂{X ≤ f_r} − P̂{X ≤ f_r, y_r = 1} ), where f_r denotes the r-th feature from each training example and P̂ is the corresponding empirical distribution.
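As an illustration of the Fisher criterion score defined in footnote 2 (our own sketch, with assumed names, not the authors' code), the score can be computed per feature and the m highest-scoring features retained:

```python
# A minimal sketch of the Fisher criterion score from footnote 2.
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher criterion for labels y in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12   # avoid division by zero
    return num / den

# keep the m highest-scoring features:
# features = np.argsort(-fisher_score(X_train, y_train))[:m]
```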
Figure 1: A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points. The curves compare Span-Bound & Forward Selection, RW-Bound & Gradient, Standard SVMs, Correlation Coefficients, and the Kolmogorov-Smirnov Test.
5.2 Real-life Data

For the following problems we compared three selection algorithms: (1) minimizing R²W², (2) rank ordering features according to how much they change the decision boundary and removing those that cause the least change⁴, as used in [6], and (3) the Fisher criterion score. In these experiments Pearson's correlation was not used because of its poor performance; on the pedestrian and face detection problems it also resulted in exceedingly slow computation.
Face detection The face detection experiments described in this section are for the system introduced in [12, 6]. The training set consisted of 2,429 positive images of frontal faces of size 19×19 and 13,229 negative images not containing faces. The test set consisted of 105 positive images and 2 million negative images. A wavelet representation of these images [6] was used, which resulted in 1,740 coefficients for each image.

Performance of the system using all coefficients, 725 coefficients, and 120 coefficients is shown in the ROC curves in figure (2a). The best results were achieved using all features. Method (1) performed marginally better than method (2), and method (3) performed significantly worse than either method (1) or (2).
Pedestrian detection The pedestrian detection experiments described in this section are for the system introduced in [11]. The training set consisted of 924 positive images of people of size 128×64 and 10,044 negative images not containing people. The test set consisted of 124 positive images and 800,000 negative images. A wavelet representation of these images [6, 11] was used, which resulted in 1,326 coefficients for each image.
⁴ I(r) = | Σ_{i=1}^{N_sv} Σ_{j=1}^{N_sv} α_j y_j K^r(x_j, x_i) |,    (12)

where K^r(x_j, x_i) = K(x_j, x_i) x_i^r and x_i^r is the r-th element of x_i.
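The following is a rough sketch (ours, not the authors' code) of this ranking criterion for a linear kernel; the function name and its inputs (support vectors, multipliers α, labels) are assumptions:

```python
# A minimal sketch of eq. (12) for a linear kernel K(x_j, x_i) = x_j . x_i.
import numpy as np

def boundary_change(sv, alpha, y_sv):
    """I(r) = | sum_{i,j} alpha_j y_j K(x_j, x_i) x_i^r |, one score per feature r."""
    K = sv @ sv.T                      # kernel matrix between support vectors
    s = (alpha * y_sv) @ K             # s[i] = sum_j alpha_j y_j K(x_j, x_i)
    return np.abs(s @ sv)              # sum_i s[i] * x_i^r for every feature r

# features with the smallest I(r) change the decision boundary least and are removed first
```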
Performance of the system using all coefficients and 120 coefficients is shown in the ROC curve in figure (2b). The best results were achieved using all features. Method (1) performed marginally better than method (2), and method (3) performed significantly worse than either method (1) or (2).
Figure 2: The solid line is using all features, the solid line with a circle is using method (1), the dash-dotted line is using method (2), and the dotted line is using method (3). In both plots the x-axis is the false positive rate and the y-axis the detection rate. (a) Face detection: the top ROC curves are for 725 features and the bottom ones for 120 features. (b) Pedestrian detection: ROC curves using all features and 120 features.
Cancer morphology classification For DNA microarray data analysis one needs to determine the relevant genes in discrimination as well as discriminate accurately. We look at two leukemia discrimination problems [7, 10].

The first problem was classifying myeloid and lymphoblastic leukemias based on the expression of 7,129 genes. The training set consists of 38 examples and the test set of 34 examples. Using all genes a linear SVM makes 1 error on the test set. Using 20 genes, 0 errors are made for method (1), 5 errors for method (2), and 3 errors for method (3). Using 5 genes, 1 error is made for method (1), 7 errors for method (2), and 5 errors for method (3).

The second problem was discriminating B versus T cells for the lymphoblastic leukemias [7]. A standard linear SVM makes 1 error for this problem. For method (1), 0 errors are made using 5 genes. For method (2), 2 errors are made using 5 genes, and for method (3), 3 errors are made using 5 genes.
6 Conclusion

In this article we have introduced a (wrapper) method to perform feature selection for SVMs. This method is computationally feasible for high dimensional datasets compared to existing wrapper methods, and experiments on a variety of toy and real datasets show superior performance to the filter methods tried.
References

[1] S. Bengio and Y. Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. IEEE Transactions on Neural Networks, special issue on data mining and knowledge discovery, 2000.

[2] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 1997.

[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In International Conference on Machine Learning, San Francisco, 1998.

[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Machine Learning, 2000. Submitted.

[5] T. Evgeniou and T. Poggio. Sparse representation of multiple signals. AI Memo 1619, Massachusetts Institute of Technology, 1997.

[6] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Asian Conference on Computer Vision, 2000.

[7] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.

[8] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000.

[9] R. Kohavi. Wrappers for feature subset selection, 1995.

[10] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machine classification of microarray data. AI Memo 1677, Massachusetts Institute of Technology, 1999.

[11] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proc. Computer Vision and Pattern Recognition, pages 193-199, Puerto Rico, June 16-20, 1997.

[12] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, Bombay, India, January 1998.

[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.