On the Feature Selection Criterion Based on an Approximation of Multidimensional Mutual Information

Kiran S. Balagani and Vir V. Phoha
Abstract—We derive the feature selection criterion presented in [1] and [2] from the multidimensional mutual information between features and the class. Our derivation: 1) specifies and validates the lower-order dependency assumptions of the criterion and 2) mathematically justifies the utility of the criterion by relating it to the Bayes classification error.

Index Terms—Feature selection, entropy, mutual information, Bayes classification error, entropy estimation.
1 INTRODUCTION
Given an initial set of $n$ features, the goal of mutual information based feature selection is to select a subset of $m$ features that maximizes the multidimensional (joint) mutual information [3] between features and the class, given as

$$I(\mathbf{X};\omega) = I(X_1,\ldots,X_m;\omega) = \sum_{\omega}\sum_{X_1,\ldots,X_m} P(X_1,\ldots,X_m,\omega)\,\log\frac{P(X_1,\ldots,X_m,\omega)}{P(X_1,\ldots,X_m)\,P(\omega)}, \qquad (1)$$

where $\mathbf{X}$ is a feature vector, $X_i$ is a feature, and $\omega = \{\omega_1,\ldots,\omega_k\}$ is the class variable.

Hellman and Raviv's equivocation bound [4] shows that the Bayes classification error is upper bounded by $\frac{1}{2}H(\omega|\mathbf{X})$, where $H(\omega|\mathbf{X})$ is the class-conditional entropy. Because $H(\omega|\mathbf{X}) = H(\omega) - I(\mathbf{X};\omega)$ [5], maximizing (1) minimizes Hellman and Raviv's bound on the Bayes classification error, thus justifying its application as a feature selection criterion.

Histograms and continuous kernels are two popular nonparametric "plug-in" estimators [6] of mutual information. However, when the dimensionality is high, estimating (1) with histograms becomes impractical because of its complexity, which grows exponentially with the number of features. On the other hand, estimating (1) with a high-dimensional kernel (see [7]) often demands large training sample sizes, which may be unrealistic for the problem at hand.
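To make the plug-in approach and its exponential cost concrete, here is a minimal sketch (ours, not taken from [1], [2], or [6]) of a histogram estimator of (1); the function name, equal-width binning, and bin count are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def joint_mi_histogram(X, y, bins=4):
    """Histogram plug-in estimate of I(X_1, ..., X_m; omega) in (1).

    X: (n_samples, m) feature matrix; y: (n_samples,) class labels.
    The joint histogram has bins**m cells, so the estimate quickly
    becomes impractical as m grows -- the exponential complexity
    discussed above.
    """
    # Discretize each feature into equal-width bins.
    Xb = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), bins + 1)
        Xb[:, j] = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)

    n = len(y)
    p_xy = Counter(zip(map(tuple, Xb), y))  # counts for P(X_1..X_m, omega)
    p_x = Counter(map(tuple, Xb))           # counts for P(X_1..X_m)
    p_y = Counter(y)                        # counts for P(omega)

    # I(X; omega) = sum over (x, w) of p(x, w) log [ p(x, w) / (p(x) p(w)) ]
    return sum(c / n * np.log2(c * n / (p_x[x] * p_y[w]))
               for (x, w), c in p_xy.items())
```

With, say, 10 bins and 20 features the joint table already has $10^{20}$ cells, which is why the approximation in (2) below falls back on one- and two-dimensional terms.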
Considering these issues, Battiti [1] and Peng et al. [2] proposed a low-dimensional approximation to (1). The approximation selects features that maximize class-separability and simultaneously minimize dependencies between feature pairs. The approximation is given as

$$I(X_1;\omega) + \sum_{i=2}^{m}\left(I(X_i;\omega) - \sum_{j} I(X_i;X_j)\right), \qquad (2)$$

where $I(X_1;\omega)$ represents the selection of the first feature that maximizes the class-separability, and $\sum_{i=2}^{m}\left(I(X_i;\omega) - \sum_{j} I(X_i;X_j)\right)$ represents the selection of each remaining feature that maximizes class-separability while minimizing its dependency on the previously selected features (the inner sum over $j$ runs over the features selected before $X_i$).
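Read incrementally, (2) suggests a greedy procedure. The sketch below is our illustration, not code from [1] or [2]: it assumes pre-discretized integer features and gives the redundancy sum unit weight, which corresponds to Battiti's MIFS [1] with $\beta = 1$ (Battiti uses a tunable weight $\beta$ on that sum, and Peng et al.'s mRMR [2] averages it over the selected features instead).

```python
import numpy as np
from collections import Counter

def mi_pair(a, b):
    """Plug-in estimate of I(a; b) from two discrete 1-D arrays."""
    n = len(a)
    c_ab, c_a, c_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * np.log2(c * n / (c_a[x] * c_b[y]))
               for (x, y), c in c_ab.items())

def greedy_select(Xb, y, m):
    """Select m feature indices by maximizing the summand of (2) stepwise."""
    relevance = [mi_pair(Xb[:, i], y) for i in range(Xb.shape[1])]
    selected = [int(np.argmax(relevance))]  # first feature: max I(X_i; omega)
    while len(selected) < m:
        # Score each remaining feature by I(X_i; omega) - sum_j I(X_i; X_j),
        # with j running over the already-selected features.
        candidates = [i for i in range(Xb.shape[1]) if i not in selected]
        scores = [relevance[i] - sum(mi_pair(Xb[:, i], Xb[:, j]) for j in selected)
                  for i in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected
```

Each step requires only one- and two-dimensional histograms, so the cost grows quadratically in the number of features rather than exponentially in $m$.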