2010 International Conference on Pattern Recognition
Performance Evaluation of Automatic Feature Discovery Focused within Error Clusters

Sui-Yu Wang and Henry S. Baird
Computer Science and Engineering Dept., Lehigh University
19 Memorial Dr West, Bethlehem, PA 18017, USA
E-mail: [email protected], [email protected]
Abstract
We report a performance evaluation of our automatic feature discovery method on the publicly available Gisette dataset: a set of 29 features discovered by our method ranks 129th among all 411 current entries on the validation set. Our approach is a greedy forward selection algorithm guided by error clusters. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. This method assumes a "data-rich" problem domain and works well when a large amount of labeled data is available. The result on the Gisette dataset shows that our method is competitive with many current feature selection algorithms. We also provide analytical results showing that our method is guaranteed to lower the error rate on Gaussian distributions and that our approach may outperform the standard Linear Discriminant Analysis (LDA) method in some cases.

1. Introduction

Feature selection is a field in pattern recognition that investigates how to choose a well-performing subset when given a set of features. Cover and Van Campenhout showed that for any feature selection algorithm to find the optimal k-element subset, it has to search exhaustively through all k-element subsets for some distributions [3]. Van Horn and Martinez showed that the problem of finding the minimum subset of features without misclassifying any training sample is NP-complete [4].

In many real-world problems, the dimensionality of the original sample space is too large for most feature selection algorithms to work on. Most of them apply PCA [6, 5] to reduce the dimensionality to a manageable size before applying their respective algorithms. PCA assumes that dimensions with greater variance are more important and therefore discards dimensions with smaller variance. However, larger variance does not imply more useful information for discriminating data in different classes. Thus using PCA to reduce dimensionality before applying feature selection algorithms risks throwing away useful information. We propose a different approach: we attempt to automatically generate a sequence of features which are guaranteed to yield a sequence of classifiers with improving accuracy, without throwing information away beforehand. Our algorithm was first detailed in [7], and was proven superior to PCA and competitive with time-consuming manual search [8]. We briefly review it here. We assume a data-rich problem domain: our algorithm benefits from a large amount of labeled data, which helps it find features focused on discriminating samples in certain regions of the feature space. We assume that we are working on a two-class problem and that we are given an initial set of features on which a Nearest Neighbor (NN) classifier has been trained. Our algorithm finds tight clusters of errors of both types and uses them to guide the search for new features.

In order to compare our algorithm with current state-of-the-art algorithms, we conducted an experiment on a public dataset used in the NIPS 2003 Feature Selection Challenge [1], the Gisette dataset [2]. We started the algorithm with a manual feature that assigned equal weights to all original features. The first discovered feature dropped the error rate to 9.6%. A set of 28 discovered features together with the manual feature achieved an error rate of 1.8% on the validation set, putting us at number 129 among all 411 current entries as of December 4, 2009. We also give a proof showing that if the underlying distributions of both classes are Gaussian, the error rate is guaranteed to improve by adding features discovered by our algorithm.
The proof follows the same argument that proves LDA's effectiveness at selecting features. We also provide examples showing that our algorithm may outperform LDA in some cases.
2. Formal Definitions

We work on a two-class (ω1, ω2) problem. We assume that there is a source of labeled samples X = {x1, x2, ...} and a ground-truthing function ω : X → {ω1, ω2}. Each sample x is described by a large number D of real-valued features, i.e., x ∈ R^D. We assume D is too large for use as a classifier feature space. We also have a much smaller set of d real-valued features f^d = {f1, f2, ..., fd}, where d ≪ D; a Nearest Neighbor classifier h_d is trained on this d-dimensional feature space.

3. The Algorithm

Repeat:
1. Find tight clusters of errors of both types of the current classifier h_d, and select a tight cluster (judged, e.g., by the cluster's pairwise distances).
2. Check that the ratio of the two error types in the selected cluster, |{x : ω(x) = ω1, h_d(x) = ω2}| / |{x : ω(x) = ω2, h_d(x) = ω1}|, is in [1/10, 10]; otherwise skip the cluster.
3. Project the samples in the selected cluster back to the null space of the feature mapping.
4. Find a separating hyperplane in the null space.
5. Construct a new feature and examine its performance.
6. Add the best-performing feature to the feature set, and set d = d + 1.
Until the error rate is satisfactory to the user.
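To make the loop concrete, the following is a rough, simplified sketch in Python of one iteration, assuming linear features f(x) = Wx; the use of k-means for the error clusters, LDA for the separating hyperplane, SciPy's null_space for the projection, and a held-out evaluation split are illustrative choices for the sake of a runnable example, not the exact procedures used in the paper.

# Simplified sketch of one iteration of the error-cluster-guided feature search.
# Assumes linear features f(x) = W x; KMeans, LDA and the evaluation split are
# illustrative choices, not the paper's exact procedure.
import numpy as np
from scipy.linalg import null_space
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discover_feature(X_fit, y_fit, X_eval, y_eval, W, n_clusters=5):
    """X_*: samples in R^D, y_*: labels in {0, 1}, W: d x D current linear features.

    Returns W with one newly discovered feature appended, or None."""
    # Train the NN classifier on the current d-dimensional feature space
    # and collect its errors on held-out samples.
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_fit @ W.T, y_fit)
    pred = clf.predict(X_eval @ W.T)
    errors = pred != y_eval
    if errors.sum() < n_clusters:
        return None
    # Find tight clusters of errors in the current feature space.
    err_idx = np.where(errors)[0]
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_eval[err_idx] @ W.T)
    for c in range(n_clusters):
        idx = err_idx[km.labels_ == c]
        n12 = np.sum((y_eval[idx] == 0) & (pred[idx] == 1))  # true w1 labeled w2
        n21 = np.sum((y_eval[idx] == 1) & (pred[idx] == 0))  # true w2 labeled w1
        if n21 == 0 or not (0.1 <= n12 / n21 <= 10):         # skip unbalanced clusters
            continue
        # Project the cluster into the null space of the feature mapping.
        N = null_space(W)                                    # D x (D - d) basis
        Z = X_eval[idx] @ N
        # Find a separating hyperplane in the null space.
        lda = LinearDiscriminantAnalysis().fit(Z, y_eval[idx])
        # Construct the new feature direction back in R^D.
        w_new = N @ lda.coef_.ravel()
        return np.vstack([W, w_new / np.linalg.norm(w_new)])
    return None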
4. Sufficient Condition for the Algorithm

We prove a sufficient condition for our algorithm to work: if the data is Gaussian distributed and both classes have the same covariance matrix, then the error rate is guaranteed to improve by adding features discovered by our algorithm. The proof bounds the error of the classifier that uses d + 1 features by

    P_{d+1}(error) ≤ √(P(ω1) P(ω2)) e^{−µ(1/2)},

where µ(1/2) is the Bhattacharyya distance between the two class-conditional densities in the enlarged feature space.
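The closed-form Bhattacharyya distance for Gaussian class-conditional densities used in the snippet below is the standard textbook expression rather than anything given in this paper; the snippet is only a numerical sanity check that adding a discriminative feature (with equal covariance matrices) increases µ(1/2) and therefore tightens the bound above.

# Numerical check of the Bhattacharyya error bound for two Gaussian classes.
# The closed-form distance below is the standard one, assumed here for illustration.
import numpy as np

def bhattacharyya_distance(m1, m2, S1, S2):
    """mu(1/2) between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.atleast_1d(m1).astype(float), np.atleast_1d(m2).astype(float)
    S1, S2 = np.atleast_2d(S1).astype(float), np.atleast_2d(S2).astype(float)
    S = 0.5 * (S1 + S2)
    d = m2 - m1
    quad = 0.125 * d @ np.linalg.solve(S, d)
    logdet = 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return quad + logdet

def error_bound(p1, p2, mu):
    """Upper bound on the error: sqrt(P(w1) P(w2)) * exp(-mu(1/2))."""
    return np.sqrt(p1 * p2) * np.exp(-mu)

# One feature vs. the same feature plus a discriminative second one
# (equal covariance matrices, as in the sufficient condition).
mu_d  = bhattacharyya_distance([0.0], [1.0], [[1.0]], [[1.0]])
mu_d1 = bhattacharyya_distance([0.0, 0.0], [1.0, 0.8], np.eye(2), np.eye(2))
print(error_bound(0.5, 0.5, mu_d))    # looser bound with d features
print(error_bound(0.5, 0.5, mu_d1))   # tighter bound with d + 1 features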
5. Approximate NN Error Estimator

We propose an approach that computes an approximation of the error of a Nearest Neighbor classifier given a hyperplane. For a given sample x, we want the samples whose feature value falls within (x · w_i) ± δ to be mostly of the same class as x. We can approximate this process by first discretizing the range of x · w_i into bins δ1, δ2, ..., δm and then calculating the quantity

    q_j = min( Σ_x I(x ∈ ω1, x · w_i ∈ δ_j), Σ_x I(x ∈ ω2, x · w_i ∈ δ_j) ),

where I is the indicator function. A hyperplane w* that minimizes

    J(w*) = Σ_{j=1,...,m} q_j

has the minimum empirical error among all one-dimensional subspaces of the null space. An example where D = 2 is shown in Fig. 4. Assume we have two Gaussian distributions with means (0, 0) and (0, 0.1), and with diagonal covariance entries (1, 1) and (5, 1), respectively. In this extreme case, LDA finds a direction with little discriminability, because the displacement of the means does not lie along the direction where the variances differ. We analyze all potential 1-D hyperplanes in this example and show that the error rate differs by up to almost a factor of three.

Figure 4. The classification error obtained by using different hyperplanes as new features may differ by over 200% in the example given in Section 5.
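The following sketch implements the binned criterion J(w) described above and evaluates it on the two-Gaussian example; the number of bins, the sample sizes, and the random search over candidate directions are illustrative assumptions rather than the paper's procedure.

# Sketch of the approximate NN error criterion J(w): discretize the projection
# onto w into m bins and sum, over bins, the minority-class count.
import numpy as np

def approx_error(X, y, w, m=50):
    """Sum over bins of min(#class0, #class1); smaller is better."""
    proj = X @ w
    edges = np.linspace(proj.min(), proj.max(), m + 1)
    bins = np.clip(np.digitize(proj, edges) - 1, 0, m - 1)
    J = 0
    for j in range(m):
        in_bin = bins == j
        J += min(np.sum(in_bin & (y == 0)), np.sum(in_bin & (y == 1)))
    return J

def best_direction(X, y, n_candidates=500, seed=0):
    """Pick the random unit direction with the smallest approximate error."""
    rng = np.random.default_rng(seed)
    cands = rng.normal(size=(n_candidates, X.shape[1]))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    return min(cands, key=lambda w: approx_error(X, y, w))

# The D = 2 example from the text: means (0, 0) and (0, 0.1),
# diagonal covariances (1, 1) and (5, 1).
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.diag([1.0, 1.0]), 5000),
               rng.multivariate_normal([0, 0.1], np.diag([5.0, 1.0]), 5000)])
y = np.r_[np.zeros(5000, int), np.ones(5000, int)]
for name, w in [("mean-difference (LDA-like)", np.array([0.0, 1.0])),
                ("variance-difference", np.array([1.0, 0.0])),
                ("best random candidate", best_direction(X, y))]:
    print(name, approx_error(X, y, w) / len(y))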
6. Conclusion

We provide a performance evaluation of our automatic feature discovery method on the Gisette dataset: a set of 29 features ranks 129 among all 411 current entries. Compared to other, more complex models such as Bayesian networks, our approach uses a simple model that requires less engineering tuning. We also provide a proof of a sufficient condition: if the data is Gaussian distributed with equal covariance matrices for both classes, the error rate is guaranteed to improve by adding features discovered by our algorithm. We can also guarantee that the error rate drops for any feature set with mutually independent features. This proof is omitted from this paper because it follows the usual proof for such problems. Furthermore, we propose an approximate NN error estimator that can find good hyperplanes when LDA fails.

We also conjecture that this approach can deal with multi-class problems. However, instead of finding a simple separating hyperplane between two classes, the hyperplane should maximize the mutual distances between the various classes. Because of the linear-algebra and clustering-based nature of our algorithm, we have identified the type of problems that are likely to benefit from it. If a problem has the following three properties: 1) the original dimension D is so high that it is infeasible to apply exhaustive search; 2) linear combinations of the original features are likely to yield effective features; and 3) a large set of labeled data is available — then our method may be a good choice. Furthermore, the algorithm performs best when the data has the following properties: 1) the data tends to cluster in certain regions of the feature space, and 2) the data distributions tend to peak in these regions and decay rapidly outside them. For example, the algorithm can be a good fit for data that is Gaussian or binomially distributed.

References

[1] http://clopinet.com/isabelle/Projects/NIPS2003/challenge.
[2] http://archive.ics.uci.edu/ml/datasets/Gisette.
[3] T. M. Cover and J. M. Van Campenhout. On the possible orderings in the measurement selection problem. IEEE Transactions on Systems, Man and Cybernetics, SMC-7(9), 1977.
[4] K. S. Van Horn and T. Martinez. The minimum feature set problem. Neural Networks, 7(3), 1994.
[5] I. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2002.
[6] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.
[7] S.-Y. Wang and H. S. Baird. Feature selection focused within error clusters. In ICPR, pages 1–4, Tampa, FL, Dec. 8–11, 2008.
[8] S.-Y. Wang, H. S. Baird, and C. An. Document content extraction using automatically discovered features. In ICDAR, 2009.