
The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multiclass Classification with Applications to High-Dimensional Data

Qiang Cheng, Member, IEEE, Hongbo Zhou, and Jie Cheng

Abstract—Selecting features for multiclass classification is a critically important task for pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can only obtain local optima instead of the global optimum. Toward selecting the globally optimal subset of features efficiently, we introduce a new selector—which we call the Fisher-Markov selector—to identify those features that are the most useful in describing essential differences among the possible groups. In particular, in this paper we present a way to represent essential discriminating characteristics together with the sparsity as an optimization objective. With properly identified measures for the sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach for optimizing the measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. The effectiveness of our method is confirmed by experimental results. In pattern recognition and from a model selection viewpoint, our procedure shows that it is possible to select the most discriminating subset of variables by solving a very simple unconstrained objective function, which in fact can be obtained with an explicit expression.

Index Terms—Classification, feature subset selection, Fisher's linear discriminant analysis, high-dimensional data, kernel, Markov random field.

1 INTRODUCTION

In many important pattern recognition and knowledge discovery tasks, the number of variables or features is high; oftentimes, it may be much higher than the number of observations. Microarray data, for instance, often involve thousands of genes, while the number of assays is on the order of hundreds or less. In neurodevelopmental studies with functional MRI (fMRI) images, the number of experiments for the subjects is limited, whereas each fMRI image may have tens of thousands of features. For these data, feature selection is often used for dimensionality reduction. It aims at evaluating the importance of each feature and identifying the most important ones, which turns out to be critical to reliable parameter estimation, underlying group structure identification, and classification. With a reduced dimension, the classifier can be more resilient to data overfitting. By eliminating irrelevant features and noise, the time and space efficiencies become significantly better. Feature selection faces challenges from the following problems:


1. A small sample size with a large number of features. This has been acknowledged as an important challenge in contemporary statistics and pattern recognition. Selecting the best subset of features is, in general, of combinatorial complexity. Classifying high-dimensional data without dimensionality reduction is difficult and time-consuming, if not impossible. The irrelevant variables make no contribution to the classification accuracy of any classifier but, rather, have adverse effects on the performance. With too many irrelevant variables, the classification performance would degrade into that of random guesses, as empirically observed in [1], [2], and theoretically shown in [3].

2. Linearly nonseparable classes. These cases can be difficult for many feature selectors. In a high-dimensional space, the difficulty may be alleviated thanks to Cover's theorem [4].



3. Noisy features. High-dimensional data usually have significant noise. The noise or irrelevant components can easily lead to data overfitting, especially for any data with a large number of features but a small sample size. The main reason is that, for high-dimensional data, it is easy to find some noise components which may be different for each class given a small number of observations; however, they may not correctly represent the importance of features.

In this paper, we study choosing a desirable subset of features for supervised classification. Even though our main focus is on high-dimensional data, the proposed Fisher-Markov selector can be applied to general data with arbitrary dimensions, as shown with experiments in Section 5. The Fisher-Markov selector can be regarded as a filter method, meaning the selected features can be properly used by any available classifiers.

In the same spirit as Fisher's discriminant analysis (FDA) [5], we use the class separability as a criterion to select the best subset of features; that is, we maximize the between-class distance while minimizing the within-class distance—hence the first part of the name. The way we use the class separability, however, is different than the FDA (or its kernelized version KFDA), either the linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA). The FDA is used for extracting features through feature vector transformation (the KFDA does this in a kernel space); in contrast, the Fisher-Markov selector is a general variable (feature) selection strategy. The goal, formulations, and optimization techniques of the Fisher-Markov selectors are completely different from those of the FDA (or KFDA). In defining the class separability, we incorporate kernel tricks to map each original input to a higher dimensional kernel space. By doing so, we may define our feature selection problem in a broad setting, with the aim of subjugating the difficulty of the above-mentioned second challenge when the dimensionality of the input space is insufficiently high. To reduce the adverse effect from noise components, the Fisher-Markov selector employs an l0-norm to penalize the complexity. As experimentally shown in Section 5, the overfitting problem can be overcome neatly with a principled method.

There is of course a huge literature on feature selection, and it is well known that the best subset selection is generally of combinatorial complexity, which makes it literally infeasible for high-dimensional data. By taking advantage of some special kernel functions, the Fisher-Markov selector boils down the general subset selection to seeking an optimal configuration on Markov random fields (MRFs). For the MRF problem, the Fisher-Markov selector then constructs efficient ways to achieve the exact global optimization of the class separability—hence the second part of the name.

In this paper, scalars and vectors are denoted by lowercase letters, and a vector is clearly defined when it first appears. Matrices are denoted by capital letters. Script letters denote spaces or classes; in particular, $\mathbb{R}^k$ denotes the $k$-dimensional space of real numbers. The operation $\langle\cdot,\cdot\rangle$ denotes the euclidean inner product of two vectors. The transpose of a vector $x$ is denoted by $x^T$, and its $l_q$-norm by $\|x\|_q$, $0 \le q \le \infty$. We use the notation $(\cdot)_i$ to denote the $i$th element of a vector, or $(\cdot)_{ij}$ the $ij$-th element of a matrix. We use 0 to denote a scalar zero or a vector of zeros whose dimension is clear in the context.

The paper is organized in the following way: We begin by discussing related work in Section 2. We formulate a class of new selectors in Section 3 and explicitly construct the Fisher-Markov selector in Section 4. Section 5 presents experiments demonstrating that our approach is effective in practical applications. Finally, Section 6 summarizes our findings and their consequences for variable selection and supervised classification.

2 RELATED WORK

There is a huge literature on variable or feature selection, and many procedures motivated by a wide array of criteria have been proposed over the years. We review in the following only the methods that are closely relevant to our work. Subset selection methods in discriminant analysis for pattern recognition were initially studied by Dixon and Massey, and Kendall [6], [7]. These and other early work focused on normal-based linear discriminant function (NLDF) for two groups under the assumption of normality and homoscedasticity; see Devijver and Kittler [8], Fukunaga [9], and McLachlan [10] for excellent discussions. Traditional methods considered various selection criteria and optimization techniques. For example, Fu devised mathematical programming approaches to feature selection [11]. The Kullback-Leibler divergence (KLD) [12] and the Bhattacharyya distance [13] were used as selection criteria. Narendra and Fukunaga described the branch and bound algorithm that maximizes a criterion function over all possible subsets of a given size [14]. Rissanen characterized the stochastic complexity of the data by constructing the minimum description length (MDL) principle, and he proposed two methods for variable selection with the MDL principle [15]. These traditional methods are able to handle only large-sample cases, where the number of observations is much larger than that of features. For high-dimensional data, usually they are inapplicable. An exception is Kira and Rendall’s Relief [16], which weighed features to maximize a so-called margin. It can be used for high-dimensional data, though the results are far from optimal. The Lasso method was proposed by Tibshirani, which is an l1 penalized squared error minimization problem [17]. It has been shown to be effective in estimating sparse features. More recently, there has been a thrust of research activities in sparse representation of signals—among which are Donoho and Elad [18], Donoho [19], and Candes et al. [20]. These sparse representation techniques have a close relationship with feature selection, especially because the l1 minimization may approximate the sparsest representation of the signal and discard irrelevant variables. In this vein, Candes and Tao have proposed the Dantzig selector that is suitable for highdimensional data by employing the Lasso-type l1 regularization technique [21]. It appears to be ineffective for linearly nonseparable data. An interesting model-based approach to feature selection has been constructed by McLachlan et al. for high-dimensional data [22]. Peng et al. have proposed a method, mRMR, by combining max-relevance and minredundancy criteria [23]. Their method uses a greedy optimization of mutual information for univariate feature


selection. It handles low to mid-dimensional data effectively. Wang has presented a feature selection method using a kernel class separability, a lower bound of which is optimized with a gradient descent-type of optimization technique [24]. This method finds local optimum(s) iteratively and is applicable only to stationary kernels. The above-mentioned work chooses a subset of feature variables by optimizing feature selection criteria, including the probability of error, KLD, NLDF class separations, Bhattacharyya distance, MDL, etc. They select features independently of the subsequent classifiers, which in the literature are referred to as filter methods. Also often used are wrapper methods, which fix a classifier and choose features to maximize the corresponding classification performance. Weston et al.’s method and Guyon et al.’s recursive feature elimination (RFE), e.g., choose features to minimize the classification errors of a support vector machine (SVM) [25], [26]. Wrapper methods choose features in favor of a particular classifier, and they incur higher computational complexities by evaluating the classifier accuracy repeatedly. The feature selector proposed in this paper is mainly a filter method and its resultant subset of features is proper for any classifiers. Optimizing a variable selection criterion or classifier performance needs a searching strategy or algorithm to search through the space of feature subsets. Globally optimal (in the sense of a chosen criterion) multivariate selection methods usually have a combinatorial complexity [27]. A notable exception is Narendra and Fukunaga’s Branch and Bound method, whose optimality, however, can only be guaranteed under a monotonicity condition [14]. To alleviate the difficulty, many practically useful, though suboptimal, search strategies have been considered, including random search [28], [29], sequential search, etc. The sequential search methods select features one at a time and they are widely used for their efficiency [25], [26], [30]. The selector presented in this paper is an efficient multivariate selection method that may attain or approximate the global optimum of the identified discrimination measure. For instance, the resulting LFS (see Section 4) has a complexity linear in the number of features and quadratic in the number of observations. The feature selection problem with more than two groups or classes is usually more difficult than those with two groups. The Fisher-Markov selector is built directly for multiple-class classification, that is, the number of classes is greater than or equal to two. Also, there is a close tie between discriminant analysis and multiple regression. Fowlkes et al. [31] considered the variable selection in three contexts: multiple regression, discriminant analysis, and clustering. This paper only focuses on the problem of feature selection for supervised classification.

3 FORMULATION OF FISHER-MARKOV SELECTOR

In many research fields, feature selection is needed for supervised classification. Specifically, with training data $\{(x_k, y_k)\}_{k=1}^{n}$, where $x_k \in \mathbb{R}^p$ are $p$-dimensional feature vectors and $y_k \in \{\omega_1, \ldots, \omega_g\}$ are class labels, the most important features are to be chosen for the most discriminative representation of a multiple-class classification with $g$ classes $\mathcal{C}_i$, $i = 1, \ldots, g$. Each class $\mathcal{C}_i$ has $n_i$ observations. Given a set of new test observations, the selected features


are used to predict the (unknown) class label for each observation. This paper focuses on an important situation in which p is much greater than n, though, as readers will see later in Sections 4 and 5, the proposed method is applicable to general data, including the large-sample case with p less than n.

3.1 Selection in the Spirit of Fisher

Feature selection methods, especially those independent of the classifiers, need to set a selection criterion. Often used are the KLD [12], the Bhattacharyya distance [13], mutual information [23], etc. Fisher's FDA is well known [8], [9], [10]. It discriminates the classes by projecting high-dimensional input data onto low-dimensional subspaces with linear transformations, with the goal of maximizing interclass variations while minimizing intraclass variations. The LDA is simple, theoretically optimal (in a certain sense [10]), and routinely used in practice, for example, in face recognition, bankruptcy prediction, etc. [32]. When the number of observations is insufficient, however, it suffers from a rank deficiency or numerical singularity problem. Some methods have been proposed to handle such numerical difficulty, including Bickel and Levina's naive Bayesian method [33]. Even though Fisher's discriminant analysis performs no feature selection but projection onto subspaces through linear transformations [8], [9], [10], the spirit of maximizing the class separations can be exploited for the purpose of feature selection. This spirit induces the Fisher-Markov selector's class separability measure (see Section 3.2).

In the spirit of Fisher's class separation, to construct a maximally separable and, at the same time, practically useful feature selector that is applicable to general (including, e.g., both large-sample and high-dimensional) data sets, two main questions are yet to be addressed:

Q1. Can the selector manage to handle linearly nonseparable and highly noisy data sets with many irrelevant features?

Q2. Can the selector's class separation measure be optimized in an algorithmically efficient way?

The first question (Q1) can be approached by using kernel ideas, inspired by [34], [35], [36], [37], [24], and by using an l0-norm. The use of the l0-norm is to offer a model selection capability that determines how many features are to be chosen. In response to Q2, the feature selector must be constructed to admit efficient optimizations. A group of features may lead to the best classification results although each individual feature may not be the best one. To capture the group effect, it appears necessary to seek multivariate features simultaneously instead of sequentially or in a univariate fashion. The multivariate selection is essentially a combinatorial optimization problem for which exhaustive search becomes infeasible for a large number of features. To overcome this computational difficulty and for simultaneous selection, this paper exploits MRF-based optimization techniques to produce multivariate and fast algorithms.

3.2 Class Separability Measure

In the original input (or measurement) space, denote the within-class, between-class (or interclass), and total scatter matrices by $S_w$, $S_b$, and $S_t$:


$$S_w = \frac{1}{n}\sum_{j=1}^{g}\sum_{i=1}^{n_j}\big(x_i^{(j)} - \mu_j\big)\big(x_i^{(j)} - \mu_j\big)^T, \qquad (1)$$

$$S_b = \frac{1}{n}\sum_{j=1}^{g} n_j\,(\mu_j - \mu)(\mu_j - \mu)^T, \qquad (2)$$

$$S_t = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T = S_w + S_b, \qquad (3)$$

where $x_i^{(j)}$ denotes the $i$th observation in class $\mathcal{C}_j$, and $\mu_j$ and $\mu$ are the sample means for class $\mathcal{C}_j$ and the whole data set, respectively, i.e., $\mu_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_i^{(j)}$ and $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$. To measure the class separations, the traces or determinants of these scatter matrices may be used [9]. In this paper, to form a class separation criterion, the traces of these scatter matrices are employed. The above scatter matrices measure the data scattering using means and variances, based on an implicit assumption that the data in each class follow a normal distribution. This assumption works well for large-sample data. To address the linearly nonseparable problem (Q1), kernels will be incorporated into the above scatter matrices. Let $\phi(\cdot)$ be a possibly nonlinear mapping from the input space $\mathbb{R}^p$ to a kernel space $\mathbb{R}^D$:

$$\phi: \mathbb{R}^p \to \mathbb{R}^D, \qquad (4)$$
$$\phi(x) \mapsto z. \qquad (5)$$

With the kernel trick [35], the inner product in the kernel space becomes $\langle \phi(x_1), \phi(x_2)\rangle = k(x_1, x_2)$, with $k(\cdot,\cdot)$ being some kernel function. After mapping into the kernel space, the scatter matrices can be calculated in the kernel space, where the within, between-class, and total scatter matrices are denoted by $\tilde S_w$, $\tilde S_b$, and $\tilde S_t$. By simple algebra, it is not hard to find the traces of $\tilde S_w$, $\tilde S_b$, and $\tilde S_t$ to be

$$\mathrm{Tr}(\tilde S_w) = \frac{1}{n}\mathrm{Tr}(K) - \frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\mathrm{Sum}\big(K^{(i)}\big), \qquad (6)$$

$$\mathrm{Tr}(\tilde S_b) = \frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\mathrm{Sum}\big(K^{(i)}\big) - \frac{1}{n^2}\mathrm{Sum}(K), \qquad (7)$$

$$\mathrm{Tr}(\tilde S_t) = \frac{1}{n}\mathrm{Tr}(K) - \frac{1}{n^2}\mathrm{Sum}(K), \qquad (8)$$

where the operators $\mathrm{Sum}(\cdot)$ and $\mathrm{Tr}(\cdot)$ calculate, respectively, the summation of all elements and the trace of a matrix, and $K$ and $K^{(i)}$ are size $n\times n$ and $n_i\times n_i$ matrices defined by

$$\{K\}_{kl} = k(x_k, x_l), \qquad \big\{K^{(i)}\big\}_{uv} = k\big(x_u^{(i)}, x_v^{(i)}\big), \qquad (9)$$

for $k,l \in \{1,\ldots,n\}$, $u,v \in \{1,\ldots,n_i\}$, and $i = 1,\ldots,g$. The feature selector is denoted by $\gamma = [\gamma_1,\ldots,\gamma_p]^T \in \{0,1\}^p$, with $\gamma_k = 1$ indicating the $k$th feature is chosen or 0 not-chosen, $k = 1,\ldots,p$. The selected features from a feature vector $x$ are given by

$$x(\gamma) = x \circ \gamma, \qquad (10)$$

where the operator $\circ$ represents the Hadamard product (also known as the Schur product), which is an entrywise product. With feature selection, $K$ and $K^{(i)}$ become functions of $\gamma$:

$$\{K(\gamma)\}_{kl} = k(x_k \circ \gamma,\; x_l \circ \gamma), \qquad \big\{K^{(i)}(\gamma)\big\}_{uv} = k\big(x_u^{(i)} \circ \gamma,\; x_v^{(i)} \circ \gamma\big), \qquad (11)$$

for $k,l \in \{1,\ldots,n\}$, $u,v \in \{1,\ldots,n_i\}$, and $i = 1,\ldots,g$. Hence, so do the traces of the scatter matrices, which are denoted by $\mathrm{Tr}(\tilde S_w)(\gamma)$, $\mathrm{Tr}(\tilde S_b)(\gamma)$, and $\mathrm{Tr}(\tilde S_t)(\gamma)$. The goal of the feature selection is to maximize the class separations for the most discriminative capability of the variables. In the spirit of Fisher, we formulate the following unconstrained optimization for class separability:

$$\operatorname*{argmax}_{\gamma \in \{0,1\}^p}\; \mathrm{Tr}(\tilde S_b)(\gamma) - \lambda\, \mathrm{Tr}(\tilde S_t)(\gamma), \qquad (12)$$

where $\lambda$ is a free parameter whose suitable range of values will be discussed in Section 4. Now we first consider the effect of the sign of $\lambda$. When $\lambda > 0$, by using the Lagrangian multiplier technique [38], it is noted that the unconstrained optimization is equivalent to the following constrained optimization problem:

$$\operatorname*{argmax}_{\gamma \in \{0,1\}^p}\; \mathrm{Tr}(\tilde S_b)(\gamma) \qquad (13)$$
$$\text{subject to} \quad \mathrm{Tr}(\tilde S_t)(\gamma) \le \text{constant}. \qquad (14)$$

It maximizes the interclass scattering while keeping the total scattering bounded (so the intraclass scattering is automatically minimized); nonetheless, it may be sensitive to spurious features of the data in the high-dimensional case.

When $\lambda \le 0$, the unconstrained optimization in (12) actually maximizes not only the discriminativeness but also the expressiveness jointly. For high-dimensional data where $p \gg n$, there may be many irrelevant or highly noisy components, among which it is easy to find some spurious features that discriminate different groups quite well; for subsequent classifications, however, these spurious features are only misleading. To avoid degrading the generalization accuracy, it is imperative that we avoid spurious features and select those truly discriminative variables that represent the essential information or structure. That is, we should preserve the so-called expressive power of the data. A classical way of obtaining the expressiveness of the data is to use principal component analysis (PCA), which maximizes the quotient $\frac{v^T S_t v}{v^T v}$ over nonzero $v \in \mathbb{R}^p$—kernel PCA (KPCA) does this by using $\tilde S_t$ in the kernel space. Motivated by the KPCA, we use $\mathrm{Tr}(\tilde S_t)(\gamma)$ to describe the expressiveness, and we maximize it through our feature selector $\gamma$. The normalization by $v^T v = \|v\|_2^2$ in the PCA will also have a regularization counterpart in our formulation (which will be $\|\gamma\|_0$ and introduced subsequently). Now the objective function inspired by the KFDA is to maximize $\mathrm{Tr}(\tilde S_b)(\gamma) - \lambda_1\mathrm{Tr}(\tilde S_t)(\gamma)$, where $\lambda_1 > 0$ may be a Lagrangian multiplier, and the objective function inspired by the KPCA is to maximize $\mathrm{Tr}(\tilde S_t)(\gamma)$. To maximize both objectives jointly, we combine them using some $\lambda_2 > 0$, and we have $\mathrm{Tr}(\tilde S_b)(\gamma) - (\lambda_1 - \lambda_2)\mathrm{Tr}(\tilde S_t)(\gamma)$. Letting $\lambda = \lambda_1 - \lambda_2$ thus yields (12). Hence, when $\lambda \le 0$, by using a nonnegative


linear combination of $\mathrm{Tr}(\tilde S_b)(\gamma)$ and $\mathrm{Tr}(\tilde S_t)(\gamma)$, our feature selector maximizes simultaneously the discriminativeness and the expressiveness, which is particularly useful for high-dimensional data. Our analytical results and experiments have found that $\lambda \le 0$ often leads to a superior classification performance that is fairly stable for a range of $\lambda$ values.

Now we are in a position to contrast with the LDA or KFDA to highlight the differences. The LDA computes a linear transformation, often in the form of a matrix, to project onto a subspace of the original space (the KFDA does this in the kernel space), whereas our goal is to select variables using $\gamma \in \{0,1\}^p$, which is a highly nonlinear process. In terms of optimization strategies, the LDA or KFDA often uses generalized eigenvalue techniques [38], whereas our selectors make use of the MRF techniques that will be specified in Section 4.

In the presence of a large number of irrelevant or noise variables, jointly optimizing discriminativeness and expressiveness can help, but it alone is still insufficient: the classifier's performance may still be degraded, as observed empirically by Fidler et al. [39]. The reason is that overfitting becomes overly easy. To avoid the detrimental effect of overfitting, model selection has to be explicitly considered. There is a huge literature on model selection, and many procedures have been proposed, such as Akaike's AIC [40], Schwarz's BIC (also called the Schwarz Criterion) [41], and Foster and George's canonical selection procedure [42]. Foster and George's procedure makes use of an l0-norm, which is the number of nonzero elements of a predictor for multiple regression problems. The l0-norm has also been used by Weston et al. [43]. They formed essentially an l0-norm SVM, rather than the l2-norm [35], [37] or l1-norm SVM. The optimization of the l0-norm SVM, however, is not convex and can only be performed approximately.

To handle the noise robustness issue raised in (Q1), the l0-norm is utilized in our feature selector to regularize the discrimination measure. Hence, we obtain the following feature selection criterion:

$$\operatorname*{argmax}_{\gamma\in\{0,1\}^p}\; \big\{\mathrm{Tr}(\tilde S_b)(\gamma) - \lambda\,\mathrm{Tr}(\tilde S_t)(\gamma) - \mu\|\gamma\|_0\big\}, \qquad (15)$$

where $\mu$ is a constant factor for the l0-norm whose feasible values will be discussed in Section 4. More explicitly, plugging the expressions for the scatter matrices into (15), we have the following feature selector:

$$(\mathrm{FS}) \qquad \operatorname*{argmax}_{\gamma\in\{0,1\}^p}\; \Bigg[\frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\mathrm{Sum}\big(K^{(i)}(\gamma)\big) - \frac{\lambda}{n}\mathrm{Tr}\big(K(\gamma)\big) + \frac{\lambda-1}{n^2}\mathrm{Sum}\big(K(\gamma)\big) - \mu\|\gamma\|_0\Bigg]. \qquad (16)$$

The formulated feature selection (FS) process, in general, is a combinatorial optimization problem for many kernel functions, which is computationally feasible only when $p$ is not large. For some special kernel functions, surprisingly, the global optimum can be obtained efficiently for large $p$. To address the computational efficiency issue raised in Q2 and to obtain globally optimal solutions, we consider these special kernel functions for selecting a maximally separable feature subset via (FS).
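To make the criterion concrete, the following is a minimal NumPy sketch (not the authors' code) of how (15)/(16) can be evaluated for one candidate selector $\gamma$ with an arbitrary kernel; the function name `fs_objective` and the example kernel are illustrative only.

```python
import numpy as np

def fs_objective(X, y, gamma, kernel, lam, mu):
    """Evaluate the (FS) criterion of Eq. (15)/(16) for one candidate selector gamma.

    X : (n, p) data matrix (rows are samples), y : (n,) class labels,
    gamma : (p,) boolean selector, kernel : callable k(A, B) -> Gram matrix.
    """
    Xg = X * gamma                      # Hadamard selection x(gamma) = x o gamma
    n = X.shape[0]
    K = kernel(Xg, Xg)
    tr_b = -K.sum() / n ** 2            # Tr(S_b)(gamma), Eq. (7)
    for c in np.unique(y):
        Kc = kernel(Xg[y == c], Xg[y == c])
        tr_b += Kc.sum() / ((y == c).sum() * n)
    tr_t = np.trace(K) / n - K.sum() / n ** 2   # Tr(S_t)(gamma), Eq. (8)
    return tr_b - lam * tr_t - mu * gamma.sum()

# Example kernel: the polynomial kernel of Eq. (17) with d = 2
poly2 = lambda A, B: (1.0 + A @ B.T) ** 2
```

Scoring all $2^p$ candidates this way is exactly the combinatorial search the paper avoids; Section 4 shows how special kernels reduce the problem to an MRF whose optimum can be found directly.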

4 FEATURE SUBSET SELECTION USING FISHER-MARKOV SELECTOR

To obtain efficient and global optimization of (FS) for large $p$, some special kernels will be considered, including, particularly, the polynomial kernels. We consider the polynomial kernel [35], [36], [37]

$$k(x_1, x_2) = \big(1 + \langle x_1, x_2\rangle\big)^d, \qquad (17)$$

where $d$ is a degree parameter, or, alternatively,

$$k'(x_1, x_2) = \big(\langle x_1, x_2\rangle\big)^d. \qquad (18)$$

4.1 Fisher-Markov Selector with Linear Polynomial Kernel

Incorporating the feature selector $\gamma$, with $d = 1$ the kernel becomes

$$k_1(x_1, x_2)(\gamma) = 1 + \langle x_1 \circ \gamma,\, x_2 \circ \gamma\rangle = 1 + \sum_{i=1}^{p} x_{1i}x_{2i}\gamma_i, \qquad \text{or} \qquad k_1'(x_1, x_2)(\gamma) = \sum_{i=1}^{p} x_{1i}x_{2i}\gamma_i.$$

Now, plugging $k_1(\cdot,\cdot)$ (or $k_1'(\cdot,\cdot)$) into (15) or (16), and defining $\beta_j$ to be

$$\beta_j = \frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\sum_{u,v=1}^{n_i} x_{uj}^{(i)}x_{vj}^{(i)} - \frac{\lambda}{n}\sum_{i=1}^{n} x_{ij}^2 + \frac{\lambda-1}{n^2}\sum_{u,v=1}^{n} x_{uj}x_{vj}, \qquad (19)$$

we have the following special case of (FS):

Proposition 4.1. With a linear polynomial kernel of $d = 1$ and $\beta_j$ defined in (19), the Fisher-Markov feature selector, which maximizes the class separations, turns out to be

$$(\mathrm{LFS}) \qquad \operatorname*{argmax}_{\gamma\in\{0,1\}^p}\; \sum_{j=1}^{p}(\beta_j - \mu)\gamma_j.$$

For given $\lambda$ and $\mu$, the feature selector (LFS) has an optimum $\gamma^\star \in \{0,1\}^p$ such that

$$\beta_j > \mu \iff \gamma_j^\star = 1. \qquad (20)$$

The computational complexity for obtaining $\gamma^\star$ with the given training data is $O(n^2 p)$. The (LFS) is a special case of an MRF problem with binary labels in which there is no pairwise interaction term. For this particular maximization problem, $\gamma^\star$ is obviously the unique global optimum solution. When $\lambda$ and $\mu$ are given, the Fisher-Markov selector has a computational complexity linear in $p$ and quadratic in $n$. For high-dimensional data with $p \gg n$, for instance gene expression levels, the number of features may be thousands or tens of thousands, while the size of a training sample is at the order of hundred. The (LFS) is fast in terms of the number of features. For feature selection, all of the features, while not necessarily all of the observations, need to be considered; therefore, the LFS handles both large-sample and small-sample data efficiently. The coefficient $\beta_j$ indicates the importance of the $j$th feature: the greater the value of $\beta_j$, the more important the corresponding feature. As observed in Section 5, the Fisher-Markov selector is fairly insensitive to the value of $\lambda$ as long as it is in the feasible range of values that will be discussed later. The value of $\mu$, however, is important in determining the number of chosen features. It is clear from the above Proposition 4.1 that the regularization factor $\mu$ turns out to be a global threshold. This paper uses cross validation during training to set a proper value for $\mu$.

Now, some performance analysis on the (LFS) is in order. It is usually true that the observations $x_u$ and $x_v$ are statistically independent when $u \neq v$. Let us first consider simple cases where each feature variable follows a Gaussian distribution. If the $j$th variable contains no class discrimination information, it behaves like random noise across the classes. That is, the $x_{uj}^{(i)}$ are independently and identically distributed (i.i.d.) Gaussian white noise, $x_{uj}^{(i)} \sim_{\text{i.i.d.}} N(0, \sigma_j^2)$ for any $u = 1,\ldots,n_i$ and $i = 1,\ldots,g$. On the other hand, if the $j$th variable is an important feature for classification, it does contain class discrimination information, and thus we model it by $x_{uj}^{(i)} \sim_{\text{i.i.d.}} N(\mu_{i,j}, \sigma_j^2)$. Then, we have the following property of the (LFS):

Proposition 4.2. Using the (LFS), if the $j$th variable contains no class discrimination information, then

$$\beta_j \sim \frac{\sigma_j^2}{n}\Big[\chi^2_g - \lambda\,\chi^2_n + (\lambda-1)\,\chi^2_1\Big]; \qquad (21)$$

if the $j$th variable is an important feature for classification, then

$$\beta_j \sim (1-\lambda)\Bigg[\sum_{i=1}^{g}\pi_i\mu_{i,j}^2 - \Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2\Bigg] + 2\sigma_j\sqrt{\sum_{i=1}^{g}\Big(\frac{1}{n}+\lambda^2\pi_i\Big)\pi_i\mu_{i,j}^2 + \frac{(\lambda-1)^2}{n}\Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2}\; G_1 + \frac{\sigma_j^2}{n}\Big[\chi^2_g - \lambda\,\chi^2_n + (\lambda-1)\,\chi^2_1\Big], \qquad (22)$$

where $\chi^2_k$ is a Chi-square distribution with $k$ degrees of freedom (d.o.f.), $G_1$ is a standard normal distribution, and $\pi_i$ is defined to be $n_i/n$, thus $\sum_{i=1}^{g}\pi_i = 1$.

Proof. The proof is obtained from direct calculations of Gaussian random variables. When the $j$th feature variable contains no discrimination information, $\frac{1}{\sqrt{n_i}}\sum_{u=1}^{n_i} x_{uj}^{(i)}$ is Gaussian distributed with mean zero and variance $\sigma_j^2$. It follows that $\big(\frac{1}{\sqrt{n_i}}\sum_{u=1}^{n_i} x_{uj}^{(i)}\big)^2$ has a distribution of $\sigma_j^2\chi^2_1$. Thus, the first additive term in $\beta_j$ has a distribution of $\frac{\sigma_j^2}{n}\chi^2_g$. Similarly, we can calculate the second and third terms in $\beta_j$ and hence (21). When the $j$th variable is an important feature for classification, $\frac{1}{\sqrt{n_i}}\sum_{u=1}^{n_i} x_{uj}^{(i)} = \sqrt{n_i}\,\mu_{i,j} + \sigma_j G_1$. Then, we have the first term in $\beta_j$ as

$$\frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\Big(\sum_{u=1}^{n_i} x_{uj}^{(i)}\Big)^2 = \sum_{i=1}^{g}\pi_i\mu_{i,j}^2 + \frac{2\sigma_j}{\sqrt{n}}\sqrt{\sum_{i=1}^{g}\pi_i\mu_{i,j}^2}\; G_1 + \frac{\sigma_j^2}{n}\chi^2_g.$$

Similarly, we can derive the expressions for the second and third terms in $\beta_j$, and hence (22). $\square$

Though the above proposition is derived based on Gaussian distributions, the conclusion is still true for non-Gaussian distributions in the asymptotic regime, even when the observations have certain correlations—e.g., with the Lindeberg-Feller condition [44], or when a stationary random sequence or a random field satisfies some mixing conditions [45], [46]—provided the central limit theorem (CLT) holds. When the total number of observations $n$ becomes large and $n_i/n$ tends to a constant $\pi_i$, $\frac{1}{\sqrt{n_i}}\sum_{u=1}^{n_i} x_{uj}^{(i)}$ follows approximately a Gaussian distribution as a consequence of the CLT, and the rest of the proof of Proposition 4.2 remains the same.

As a consequence of the above proposition, the mean and variance of $\beta_j$ are, in the absence of discrimination information,

$$E(\beta_j) = -\frac{\sigma_j^2}{n}\big(n\lambda + 1 - g - \lambda\big), \qquad \mathrm{Var}(\beta_j) = \frac{2\sigma_j^4}{n^2}\big(g + n\lambda^2 + (\lambda-1)^2\big), \qquad (23)$$

and in the presence of discrimination information for classification,

$$E(\beta_j) = (1-\lambda)\Bigg[\sum_{i=1}^{g}\pi_i\mu_{i,j}^2 - \Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2\Bigg] - \frac{\sigma_j^2}{n}\big(n\lambda + 1 - g - \lambda\big), \qquad (24)$$

$$\mathrm{Var}(\beta_j) = 4\sigma_j^2\Bigg[\sum_{i=1}^{g}\Big(\frac{1}{n}+\lambda^2\pi_i\Big)\pi_i\mu_{i,j}^2 + \frac{(\lambda-1)^2}{n}\Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2\Bigg] + \frac{2\sigma_j^4}{n^2}\big(g + n\lambda^2 + (\lambda-1)^2\big). \qquad (25)$$
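As a quick numerical sanity check on the null-case mean in (23) as reconstructed above, the following Monte Carlo sketch (with arbitrary, hypothetical settings for $n$, $g$, $\lambda$, and $\sigma_j$) compares the empirical mean of $\beta_j$ for a pure-noise feature against the closed form.

```python
import numpy as np

# Monte Carlo check of the null-case mean in (23), assuming the reconstruction above:
# E(beta_j) = -(sigma_j^2 / n) * (n*lam + 1 - g - lam) for a pure-noise feature.
rng = np.random.default_rng(1)
n, g, lam, sigma = 60, 3, -0.5, 2.0
y = np.repeat(np.arange(g), n // g)
betas = []
for _ in range(20000):
    x = rng.normal(0.0, sigma, size=n)              # one noise feature, no class information
    t1 = sum((x[y == c].sum() ** 2) / (y == c).sum() for c in range(g)) / n
    betas.append(t1 - lam * (x ** 2).sum() / n + (lam - 1) * x.sum() ** 2 / n ** 2)
print(np.mean(betas), -(sigma**2 / n) * (n * lam + 1 - g - lam))
```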

A feature variable will be selected only when it contributes to distinguishing the classes. The means, in the absence or presence of discrimination information, have to be separated sufficiently far apart with respect to the variances. Specifically, we require that the difference between the means be larger than $\vartheta$ times the standard deviations, usually with $\vartheta \ge 2$. This leads to a separation condition for the $j$th feature to be selected, as stated in the following corollary:

Corollary 4.3. A condition for the $j$th feature variable to be chosen by the (LFS) is that $|1-\lambda|\,U$ is larger than both

$$\sqrt{2}\,\vartheta\,\frac{\sigma_j^2}{n}\sqrt{g + n\lambda^2 + (\lambda-1)^2}$$

and

$$2\vartheta\sigma_j\Bigg[\frac{U}{n} + \lambda^2 W^2 + \frac{1+(\lambda-1)^2}{n}\Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2 + \frac{\sigma_j^2}{2n^2}\big(g + n\lambda^2 + (\lambda-1)^2\big)\Bigg]^{1/2},$$

where

$$U = \sum_{i=1}^{g}\pi_i\mu_{i,j}^2 - \Big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\Big)^2 \qquad \text{and} \qquad W = \sqrt{\sum_{i=1}^{g}\pi_i^2\mu_{i,j}^2}.$$

In the above, the term $U$ is always nonnegative due to the Cauchy-Schwarz inequality. Corollary 4.3 can be used to design a proper $\lambda$ that separates the discriminative from nondiscriminative features well. In the case that $\sigma_j^4\vartheta^2 = o(n)$ and $\big[1 + (\lambda-1)^2\big]\big(\sum_{i=1}^{g}\pi_i\mu_{i,j}\big)^2 = o(n)$, the separation condition essentially becomes

$$|1-\lambda|\,U > 2\vartheta\sigma_j|\lambda|\,W. \qquad (26)$$

Then, the feasible range of $\lambda$ values can be chosen to be $\lambda_L < \lambda \le U/(U + 2\vartheta\sigma_j W)$, where $\lambda_L$ is a lower bound for $\lambda$ with $\lambda_L = U/(U - 2\vartheta\sigma_j W) < 0$ if $U < 2\vartheta\sigma_j W$, or $\lambda_L = -\infty$ if $U \ge 2\vartheta\sigma_j W$. The above condition provides insights into the feasible range of $\lambda$ values and why, oftentimes, a negative $\lambda$ may lead to a better performance than a positive one, as pointed out in Section 3. This condition may also guide us to choose proper values of $\mu$, which need to be chosen to cut off from the feature distributions in the absence of class discriminative information. As empirically observed in our experiments, there is usually a large range of $\lambda$ values satisfying the above condition, and the (LFS) is quite insensitive to $\lambda$ values. This paper uses cross validation to choose a suitable $\lambda$ given a classifier. The procedures of the LFS are illustrated in Algorithm 1 with chosen $\lambda$ and $\mu$.

Algorithm 1. Algorithm for LFS: Fisher-Markov Selector Using Polynomial Kernel with $d = 1$.
1: Input: A data matrix of training examples $[x_1, \ldots, x_n] \in \mathbb{R}^{p\times n}$ for $g$ classes, and a vector of class labels of the training examples $y = [y_1, \ldots, y_n]$, where $y_k \in \{\omega_1, \ldots, \omega_g\}$, $k = 1,\ldots,n$.
2: Compute the Markov coefficients $\beta_j$ by (19).
3: Solve the MRF maximization problem of (LFS) in Proposition 4.1 by (20).
4: Output: Estimated feature selector $\gamma^\star$.
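A compact NumPy sketch of Algorithm 1 follows. It implements (19) and (20) directly (computing the double sums in factored form, so it runs in $O(np)$ time); it is an illustration under the reconstruction above, not the authors' released implementation, and it takes the data matrix with samples as rows.

```python
import numpy as np

def lfs_coefficients(X, y, lam):
    """Markov coefficients beta_j of Eq. (19) for the linear-kernel selector (LFS).

    X : (n, p) array of observations (rows are samples), y : (n,) class labels,
    lam : the free parameter lambda of Eq. (12).
    """
    n, p = X.shape
    term1 = np.zeros(p)
    for c in np.unique(y):
        Xc = X[y == c]
        # (1/n_i) * (sum_u x_uj^(i))^2 equals the factored double sum over u, v
        term1 += (Xc.sum(axis=0) ** 2) / Xc.shape[0]
    term1 /= n
    term2 = -lam * (X ** 2).sum(axis=0) / n              # -(lambda/n) sum_i x_ij^2
    term3 = (lam - 1.0) * (X.sum(axis=0) ** 2) / n ** 2  # ((lambda-1)/n^2) (sum_u x_uj)^2
    return term1 + term2 + term3

def lfs_select(X, y, lam, mu):
    """Proposition 4.1 / Eq. (20): gamma_j = 1 iff beta_j > mu."""
    return lfs_coefficients(X, y, lam) > mu
```

Choosing $\mu$ as a quantile of the computed $\beta_j$ values, or by cross validation as the paper suggests, then fixes how many features are kept.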

4.2 Fisher-Markov Selector with Quadratic Polynomial Kernel

With $d = 2$, the polynomial kernel in (17) or (18) as a function of $\gamma$ becomes

$$k_2(x_1, x_2)(\gamma) = \big(1 + \langle x_1 \circ \gamma,\, x_2 \circ \gamma\rangle\big)^2 = \Big(1 + \sum_{i=1}^{p} x_{1i}x_{2i}\gamma_i\Big)^2, \qquad (27)$$

or

$$k_2'(x_1, x_2)(\gamma) = \Big(\sum_{i=1}^{p} x_{1i}x_{2i}\gamma_i\Big)^2. \qquad (28)$$

Now, plugging $\frac{1}{2}k_2(x_1,x_2)(\gamma)$ (or $\frac{1}{2}k_2'(x_1,x_2)(\gamma)$) into (15) or (16) and defining $\alpha_{jl}$ to be

$$\alpha_{jl} = \frac{1}{n}\sum_{i=1}^{g}\frac{1}{n_i}\sum_{u,v=1}^{n_i} x_{uj}^{(i)}x_{vj}^{(i)}x_{ul}^{(i)}x_{vl}^{(i)} - \frac{\lambda}{n}\sum_{i=1}^{n} x_{ij}^2 x_{il}^2 + \frac{\lambda-1}{n^2}\sum_{u,v=1}^{n} x_{uj}x_{vj}x_{ul}x_{vl}, \qquad 1 \le j, l \le p, \qquad (29)$$

we obtain the following special case of (FS):

Proposition 4.4. With a quadratic polynomial kernel $\frac{1}{2}k_2(x_1,x_2)(\gamma)$, the Fisher-Markov feature selector turns out to be

$$(\mathrm{QFS}) \qquad \operatorname*{argmax}_{\gamma\in\{0,1\}^p}\; \sum_{j=1}^{p}(\beta_j - \mu)\gamma_j + \frac{1}{2}\sum_{j=1}^{p}\sum_{l=1}^{p}\alpha_{jl}\gamma_j\gamma_l.$$

Alternatively, with a quadratic polynomial kernel $\frac{1}{2}k_2'(x_1,x_2)(\gamma)$, it is

$$(\mathrm{QFS'}) \qquad \operatorname*{argmax}_{\gamma\in\{0,1\}^p}\; \sum_{j=1}^{p}(-\mu)\gamma_j + \frac{1}{2}\sum_{j=1}^{p}\sum_{l=1}^{p}\alpha_{jl}\gamma_j\gamma_l.$$

Here, $\beta_j$ is defined in (19) and $\alpha_{jl}$ in (29). An exact global optimum can be obtained efficiently by optimizing the above binary-label MRF if and only if $\alpha_{jl} \ge 0$ for all $1 \le j < l \le p$.

Clearly, the (QFS) is an extension of the (LFS) with additional pairwise interaction terms, whereas the (QFS) and (QFS') have the same pairwise interactions. Both (QFS) and (QFS') seek optimal configurations for binary-label MRFs. A few comments on optimizing the MRFs are in order now. Ever since Geman and Geman's seminal work [47], MRFs have been widely used for vision, image processing, and other tasks [48], [49]. It is generally hard to find a global optimum, and the optimization can even be NP-hard [50]. In a seminal work [51], Picard and Ratliff computed a maximum-flow/minimum-cut on a graph to optimize the MRF energy. Ishikawa [52] showed that convexity of the difference of pairwise labels is a sufficient and necessary condition (even for more than two labels). Kolmogorov and Zabih [53] gave a sufficient and necessary condition for optimizing binary MRFs with pairwise interactions via s-t minimum-cut. For the MRFs that do not satisfy the conditions, significant progress has been made recently with mainly two types of methods to approximate the global optimum: those based on graph cuts [52], [54] and those based on belief propagation (also called message passing) [55], [56].

Proof. For the problem of (QFS), it is easily seen that

$$\max_{\gamma\in\{0,1\}^p}\; \sum_{j=1}^{p}(\beta_j-\mu)\gamma_j + \frac{1}{2}\sum_{j=1}^{p}\sum_{l=1}^{p}\alpha_{jl}\gamma_j\gamma_l \qquad (30)$$
$$= -\min_{\gamma\in\{0,1\}^p}\; \sum_{j=1}^{p}\Big[\mu - \beta_j - \frac{1}{4}\sum_{k=1}^{p}(\alpha_{jk}+\alpha_{kj})\Big]\gamma_j + \frac{1}{4}\sum_{j=1}^{p}\sum_{l=1}^{p}\alpha_{jl}(\gamma_j-\gamma_l)^2. \qquad (31)$$

For the problem of (QFS'), the same pairwise terms are obtained. As each pairwise interaction term is a convex function in terms of $(\gamma_j-\gamma_l)$ for the binary MRF energy minimization problem, invoking Ishikawa's result [52], the conclusion immediately follows. Alternatively, this can be shown by using Kolmogorov and Zabih's result


[53]. Defining $E^{jl}(\gamma_j, \gamma_l) = -\alpha_{jl}\gamma_j\gamma_l$, the sufficient and necessary condition for binary MRF energy minimization, which is the so-called regularity condition in [53], becomes $E^{jl}(0,0) + E^{jl}(1,1) \le E^{jl}(0,1) + E^{jl}(1,0)$ for $j < l$, that is, $\alpha_{jl} \ge 0$. The exact solution to the MRF can be sought using, e.g., the minimum-cut algorithm specified in [53]. $\square$

In the following, we provide a sufficient and necessary condition, in terms of $\lambda$, for attaining exact global optimal solutions:

Corollary 4.5. An equivalent condition, both sufficient and necessary, for guaranteeing an exact global optimum for solving (QFS) or (QFS') is

$$\lambda \le \frac{n}{B_{jl}}\Bigg[\sum_{i=1}^{g}\frac{1}{n_i}\Big(\sum_{u=1}^{n_i} x_{uj}^{(i)}x_{ul}^{(i)}\Big)^2 - \frac{1}{n}\Big(\sum_{u=1}^{n} x_{uj}x_{ul}\Big)^2\Bigg] \qquad \text{for } 1 \le j < l \le p, \qquad (32)$$

with the term $B_{jl}$ defined as

$$B_{jl} = n\sum_{i=1}^{n} x_{ij}^2 x_{il}^2 - \Big(\sum_{u=1}^{n} x_{uj}x_{ul}\Big)^2.$$

If $B_{jl} = 0$ for some $(j, l)$, then $1/B_{jl}$ is defined to be $+\infty$. In the above, $B_{jl} \ge 0$ for any pair $(j, l)$ due to the Cauchy-Schwarz inequality, and $B_{jl} = 0$ if and only if $x_{ij}x_{il}$ is a constant for all $i$, $1 \le i \le n$.

Given a pair of indexes $j, l$, the complexity for calculating a single $\alpha_{jl}$ is $O(n^2)$, and thus the complexity for calculating all $\alpha_{jl}$ is $O(n^2 p^2)$. The MRF solver has a complexity only in terms of $p$. For traditional large-sample problems where $p \ll n$, (QFS) or (QFS') may be solved with an efficient MRF solver for an optimal (or near-optimal) subset of features. In high-dimensional settings where $p \gg n$, the complexity of solving (QFS) or (QFS') may be prohibitive, and thus it is recommended that (LFS) be used instead. In the case that $p < n$, when the pairwise interaction terms satisfy the sufficient and necessary condition, e.g., by choosing proper $\lambda$ values, the graph-cut method can be utilized to find the exact, globally optimal configuration for (QFS) or (QFS'). When there is at least one pairwise interaction term that does not satisfy the condition, however, it has been shown for binary MRFs that the optimization is NP-hard [53]. In this case, globally optimal solutions to the MRFs can still be approximated efficiently by using either a graph-cut method or a message passing algorithm [55]. The procedures of the QFS are illustrated in Algorithm 2 with chosen $\lambda$ and $\mu$.

Algorithm 2. Algorithm for QFS: Fisher-Markov Selector Using Polynomial Kernel with $d = 2$.
1: Input: A data matrix of training examples $[x_1, \ldots, x_n] \in \mathbb{R}^{p\times n}$ for $g$ classes, and a vector of class labels of the training examples $y = [y_1, \ldots, y_n]$, where $y_k \in \{\omega_1, \ldots, \omega_g\}$, $k = 1,\ldots,n$.
2: Compute the Markov coefficients $\beta_j$ by (19) and $\alpha_{jl}$ by (29).
3: Solve the MRF maximization problem of (QFS) (or (QFS')) in Proposition 4.4 by using an MRF solver, e.g., the one specified in [53].
4: Output: Estimated feature selector $\gamma^\star$.

In a similar vein to obtaining (LFS), (QFS), and (QFS'), as we apply the polynomial kernel with $d = 3$, an MRF with triplewise interaction terms can be constructed. The resulting optimization problem is an extension of (QFS) or (LFS). However, the computation of all $\alpha_{jkl}$ would incur $O(n^2 p^3)$ complexity, and thus it is not suitable for applications with high-dimensional data. The derivations are nonetheless similar, and the sufficient and necessary conditions for exactly maximizing the triplewise MRF can also be studied in a similar way to (QFS).
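The sketch below illustrates Algorithm 2's coefficient computation (29) and the (QFS) objective. For simplicity it replaces the graph-cut MRF solver of [53] with exhaustive search, so it is only usable for very small $p$; `lfs_coefficients` refers to the LFS sketch in Section 4.1, and none of this is the authors' code.

```python
import itertools
import numpy as np

def qfs_coefficients(X, y, lam):
    """Pairwise coefficients alpha_jl of Eq. (29) for the quadratic kernel.

    Returns a (p, p) matrix A with A[j, l] = alpha_jl.
    """
    n, p = X.shape
    A = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        Sc = Xc.T @ Xc                       # (p, p): sum_u x_uj^(i) x_ul^(i)
        A += (Sc ** 2) / Xc.shape[0]         # factored double sum over u, v per class
    A /= n
    A -= lam * (X ** 2).T @ (X ** 2) / n     # -(lambda/n) sum_i x_ij^2 x_il^2
    S = X.T @ X
    A += (lam - 1.0) * (S ** 2) / n ** 2     # ((lambda-1)/n^2) (sum_u x_uj x_ul)^2
    return A

def qfs_select_bruteforce(X, y, lam, mu):
    """Exhaustive maximization of the (QFS) objective; feasible only for very small p."""
    beta = lfs_coefficients(X, y, lam)       # from the LFS sketch in Section 4.1
    A = qfs_coefficients(X, y, lam)
    p = X.shape[1]
    best, best_gamma = -np.inf, None
    for bits in itertools.product([0, 1], repeat=p):
        g = np.array(bits, dtype=float)
        val = ((beta - mu) * g).sum() + 0.5 * g @ A @ g
        if val > best:
            best, best_gamma = val, g.astype(bool)
    return best_gamma
```

When every off-diagonal $\alpha_{jl}$ is nonnegative (the condition of Proposition 4.4), the exhaustive search above could be replaced by the exact s-t minimum-cut construction of [53].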

5 EXPERIMENTS

5.1 Experiment Design

To verify the validity and efficiency of our proposed methods, we perform experiments on a number of synthetic and real-world data sets. We make comprehensive comparisons with several state-of-the-art feature selection methods such as mRMR [23], SVM-RFE [26], l0-norm SVMs [43], and Random Forests [29]. We use the standard RFE and l0-SVM provided by the Spider package,1 the Random Forest package of Breiman,2 as well as the mRMR package.3 Four types of classifiers are used in our experiments: Naive Bayesian (NB), C4.5, SVM, and K-SVM (SVM with an RBF kernel). Both NB and C4.5 are the Weka implementations.4 For SVM and K-SVM, we use the multiple-class LibSVM (extended from binary classification with a one-versus-all approach) in the MATLABArsenal package.5 We will denote our LFS Fisher-Markov selector by MRF (d = 1) or simply MRF, and QFS by MRF (d = 2) in the following sections. Also, we will denote the l0-SVM feature selection method by L0.

Table 1 shows the characteristics of the data sets used in our experiments. The smallest real-world one is Iris [5], which has four features; the largest one, Prostate Cancer, has 12,600 features. As a comprehensive evaluation, we consider various data sets: those from different applications ranging from optical handwritten digit recognition to gene analysis, those with different numbers of classes ranging from 2 to 10 classes, and those with a small training size and a large number of features. All data and programs used in our experiments are available online.6

5.2 Evaluation Criteria and Statistical Significance Test

For a given data set and a given classifier, we adopt the following criteria to compare these five feature selection methods:

1. If a feature selection method can produce a smaller classification error rate than the other four methods, then its performance is better than the other four.
2. If more than one feature selection method can give rise to the same error rate, the one using the smallest number of features in achieving this error rate is the best.

1. Downloaded from http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
2. Downloaded from http://www.stat.berkeley.edu/users/breiman/RandomForests.
3. Downloaded from http://penglab.janelia.org/proj/mRMR/.
4. Downloaded from http://www.cs.waikato.ac.nz/ml/weka/.
5. Downloaded from http://www.informedia.cs.cmu.edu/yanrong/MATLABArsenal/MATLABArsenal.htm.
6. Available at http://www.cs.siu.edu/~qcheng/featureselection/.


TABLE 1 Characteristics of the Data Sets Used in Our Experiments

For high-dimensional data sets, it is usually unnecessary to run through all features to arrive at meaningful conclusions. It is often sufficient to take the number of features from 1 to a maximum number of 60 (or 120) in our experiments. To measure the mean and variance of the performance, we take a multifold random split approach with the random splits usually repeated 20 times. Besides the comparison on any single data set, we also conduct statistical significance tests over multiple data sets; on the recommendation from [57], we use the Wilcoxon signed rank test to compare MRF with the other four methods.
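For reference, the Wilcoxon signed rank test is available in SciPy; a minimal example of such a paired comparison is shown below with made-up error rates (not the paper's results).

```python
from scipy.stats import wilcoxon

# err_mrf and err_other hold paired error rates of MRF and a competing selector
# over the same data-set/classifier combinations (hypothetical values).
err_mrf   = [2.1, 0.0, 5.3, 1.2, 8.7]
err_other = [3.4, 0.5, 6.1, 1.2, 10.2]
stat, p_value = wilcoxon(err_mrf, err_other)
print(p_value)   # a small p-value indicates a statistically significant paired difference
```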

5.3 Parameter Setting for MRF

We conduct experiments to examine the parameter settings of the proposed MRF (d = 1) method. Regardless of the setting for classifiers, there are two parameters, $\lambda$ and $\mu$, for MRF. The parameter $\mu$ controls the number of features to be selected; a proper value of $\mu$ can be found during the training stage. To investigate the relationship between $\lambda$ and the performance of the MRF method, we designed an experiment using the Optical Pen data set. We let $\lambda$ vary from $-10$ through $10$ with a step size of 0.5. For each $\lambda$ value, we run MRF with the number of selected features ranging from 1 up to 64. The smallest classification errors attained by the four classifiers are shown in Fig. 1. It illustrates that the performance of MRF is quite stable for a large range of $\lambda$ values, as discussed in Section 4.1.

5.4 Summary of the Results

A summary of comparison results is reported here for these five methods with four classifiers, and further details will be provided subsequently. We have four tables:

1. Table 2 is a summary of performance evaluation for various feature selection methods on several data sets,
2. Table 3 is a summary of their computational efficiency,
3. Table 4 summarizes the multifold random split performance evaluation, and
4. Table 5 shows the results from a statistical significance test over multiple data sets.

From Table 2, we observe that MRF attains the smallest classification error rates (underlined results) over all six data sets and, especially on high-dimensional data sets such as Prostate Cancer and Lung Cancer, MRF outperforms the other four methods. Other methods also produce good results on some data sets, but they are more time-consuming than MRF, as shown in Table 3 for time efficiency comparisons. From the multifold random split test results in Table 4, we can see that MRF often outperforms the other four methods. On several data sets, other method(s) may also achieve the lowest average error rates (those underlined in Table 4); in this case, MRF has smaller variances. Finally, we conduct the Wilcoxon signed rank test according to the recommendation from [57]. The rank test is between MRF and each of the other four methods based on the results in Tables 2 and 4. The testing results in terms of p-values are shown in Table 5, and it is evident that the differences between MRF and the other four methods are statistically significant.

Fig. 1. The relationship between $\lambda$ and the smallest classification error rates by using the MRF method on the Optical Pen data set. Four different classifiers are used.


TABLE 2 Performance Evaluation on Some Real-World Data Sets

The values in the fourth to eighth columns are the best performances (the smallest error rate, in percentage) achieved by MRF, Random Forest, mRMR, RFE, and L0, respectively, by using different numbers of features from 1 up to Max# (given in the second column). For each data set, the performance of Random Forest is averaged over 10 random runs (the other four methods are deterministic). For each classifier, the bold results are the best among these five methods. For each data set, the underlined results are the best over all classifiers and all five methods.

TABLE 3 Computational Efficiency Evaluation (in Seconds) on Several Real-World Data Sets

The time cost of Random Forest is averaged over 10 runs. It should be noted that mRMR is not scalable and its time cost increases quickly with respect to the number of selected features.

5.5 Experiments on Synthetic Data Sets

5.5.1 Three-Dimensional Linearly Nonseparable Problem

The first synthetic data set, Synthetic-1, is mainly designed for visual inspection in a three-dimensional euclidean space. On the X-Y plane, we first generate two classes of data. For the first class, we draw 982 points from a uniform distribution on a unit circle centered at (0, 0); for the second class, we draw 250 points from a Gaussian distribution with a variance of 0.2 and a mean of (0, 0). These two classes are linearly separable in a transformed space induced by the following kernel:


TABLE 4 Random Splits Performance Evaluation on Real-World Data Sets (Prostate Cancer and Lung Cancer Are Not Included Here as They Have Independent Testing Data Sets)

The third to seventh columns give the best performances (each entry a(b)%: a% = the average of the smallest error rates and b% = the variance, both in percentage) achieved by using different numbers of features from 1 up to Max#. Each class of each data set is randomly divided into #Folds, with one fold for testing and the rest for training. For each classifier, the bold values are the best results among these five methods. For each data set, the underlined values are the best results over all classifiers and five methods. For large data sets, we terminated mRMR after it ran over three hours: the mRMR is not scalable, as shown in Table 3.

$$K(x, y) = \big(x_1^2 - 0.001x_1\big)\big(y_1^2 - 0.001y_1\big) + \big(x_2^2 - 0.001x_2\big)\big(y_2^2 - 0.001y_2\big). \qquad (33)$$

Subsequently, for every two-dimensional point $(x, y)$, a third dimension $z$ is added as noise drawn from a uniform distribution on $[-0.2, 0.2]$. Fig. 2 illustrates the data points in the original and transformed spaces. We perform tests using the MRF (d = 1) and MRF (d = 2) methods on this data set. For each class, three quarters of the data are uniformly sampled for training and the rest are used for testing. Since the third dimension $z$ is independent of the

class label, the meaningful features in this trivial problem are the first two variables. The coefficient distribution is illustrated in Fig. 3a, from which we can readily determine the importance order (y, x, z) for this data set. Figs. 3b and 3c show the classification error rates of MRF (d = 1) and mRMR in the original space, respectively. With exactly the same settings, MRF (d = 1) correctly selects (y), (y, x), and (y, x, z), while mRMR selects (y), (y, z), and (y, z, x). It shows that mRMR cannot work in the original space for this linearly nonseparable problem.
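A small NumPy sketch of the Synthetic-1 construction and the kernel of (33) is given below; the sampling details (uniform angles on the circle, the random seed, the class labeling) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class 1: 982 points on the unit circle; Class 2: 250 Gaussian points at the origin
# with variance 0.2, as described for Synthetic-1.
theta = rng.uniform(0.0, 2.0 * np.pi, 982)
c1 = np.column_stack([np.cos(theta), np.sin(theta)])
c2 = rng.normal(0.0, np.sqrt(0.2), size=(250, 2))
X = np.vstack([c1, c2])
z = rng.uniform(-0.2, 0.2, size=(X.shape[0], 1))    # irrelevant third dimension
X = np.hstack([X, z])
y = np.array([1] * 982 + [2] * 250)

def k_synth1(x, v):
    """Kernel of Eq. (33); linear in the transformed features (t^2 - 0.001 t)."""
    return ((x[0]**2 - 0.001*x[0]) * (v[0]**2 - 0.001*v[0])
            + (x[1]**2 - 0.001*x[1]) * (v[1]**2 - 0.001*v[1]))
```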

TABLE 5 Statistical Significance Test Using the Wilcoxon Signed Rank Test [57] over Multiple Data Sets

The rank test is performed between MRF and each of the other four methods based on Tables 2 and 4.


Fig. 4. The performance of MRF (d = 2) on the Synthetic-1 data set. (a) The relationship between the number of selected features and the parameter $\mu$ with $\lambda = 0.5$. (b) The classification error rate with respect to the number of selected features. A two-class SVM with a polynomial kernel of order 2 is used as the classifier in this experiment.

Fig. 2. Illustration of the data points of the Synthetic-1 data set in the original and transformed spaces. The red “+” represents the points of the first class and blue “*” of the second class. (a) A three-dimensional view of the original space and (b) the projection of (a) onto the X-Y plane. (c) The data distribution in the transformed space using the kernel function given in (33), and (d) the projection of (c) onto the X-Y plane.

We further perform tests using MRF (d = 2) in the original space, and the results are shown in Fig. 4. This method correctly selects features in the order of (y), (y, x), and (y, x, z). For other types of kernels, directly solving (16) might not be computationally efficient or even feasible when $p$ is large. In this case, a potentially useful approach might be to use Taylor or other types of expansions to approximate or bound the objective function with linear or quadratic polynomial kernels, and we leave this as a line of future research due to the space limit.

Fig. 5. The distribution of the coefficients $\beta_j$ for the Synthetic-2 data set by using MRF with $\lambda = 0.5$. The first two features admit much higher coefficients than the 50 noise features.

5.5.2 52-Dimensional Linearly Nonseparable Problem

To further verify the performance of our proposed MRF method in higher dimensional situations, we perform experiments on a set of 52-dimensional synthetic data, Synthetic-2, originally used by Weston et al. [43]. This

Fig. 6. MRF (d = 1) classification error rates w.r.t. the number of selected features for the Synthetic-2 data set. An SVM with a polynomial kernel of order 2 is used. When noise features are included, the classification error rates go higher. When only the two informative features are selected ($x_{i1}$ and $x_{i2}$), we have the lowest classification error rate.

Fig. 3. The MRF (d = 1) coefficient distribution and the comparison between MRF (d = 1) and mRMR on the Synthetic-1 data set. (a) The distribution of the coefficients $\beta_j$ using the MRF (d = 1) method ($\lambda = 0.5$); the coefficient value of $\beta_j$ indicates the importance of the $j$th feature. (b) The classification error rate by using MRF (d = 1). (c) The classification error by using mRMR. A two-class linear SVM classifier is used in this experiment.

linearly nonseparable classification problem consists of 1,000 examples assigned to two classes, $y = 1$ and $y = -1$; each example $e_i$ admits 52 features, $e_i = (x_{i1}, x_{i2}, \ldots, x_{i52})$. The data are generated in the following way:

Step 1. The probability of $y = 1$ or $-1$ is equal.

Step 2. If $y = 1$, then $(x_{i1}, x_{i2})^T$ are drawn from $N(\mu_1, \Sigma)$ or $N(\mu_2, \Sigma)$ with equal probabilities, $\mu_1 = (-\tfrac{3}{4}, -3)^T$, $\mu_2 = (\tfrac{3}{4}, 3)^T$, and $\Sigma = I$; if $y = -1$, then $(x_{i1}, x_{i2})^T$ are drawn from $N(\mu_3, \Sigma)$ or $N(\mu_4, \Sigma)$ with equal probabilities, $\mu_3 = (3, -3)^T$ and $\mu_4 = (-3, 3)^T$.


Step 3. The rest of the features $x_{ij}$ are noise randomly generated from $N(0, 20)$, $j \in \{3, \ldots, 52\}$, $i \in \{1, \ldots, 1{,}000\}$.

For each class, we uniformly sample three-quarters of the data as training examples and the rest as test examples. The distribution of the coefficients $\beta_j$ is shown in Fig. 5. From this distribution, we can clearly determine the (most) important features $x_{i1}$ and $x_{i2}$; and the other features are distinct from these two. The classification error rates with respect to (w.r.t.) the number of features are illustrated in Fig. 6. It becomes evident that more features may not always induce lower classification errors. Actually, more incorrectly selected features may result in larger classification errors. In Fig. 7, we also report the results by using MRF (d = 2), which works well on this data set.

Fig. 7. Performance evaluation of using MRF (d = 2) on the Synthetic-2 data set. (a) The selected number of features w.r.t. the value of $\mu$ with $\lambda = 0.5$. (b) The classification error rate w.r.t. the number of selected features. An SVM with a polynomial kernel of order 2 is used. When only two features are selected ($x_{i1}$ and $x_{i2}$), we have the lowest classification error rate in (b).
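A sketch of the generation procedure in Steps 1-3, with the mixture means as reconstructed above and treating the 20 in $N(0, 20)$ as a variance, is given below for illustration only.

```python
import numpy as np

def make_synthetic2(n=1000, seed=0):
    """Generate Synthetic-2 data following Steps 1-3 (means as reconstructed above)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                     # Step 1: equal class probabilities
    X = rng.normal(0.0, np.sqrt(20.0), size=(n, 52))    # Step 3: features 3..52 are N(0, 20) noise
    mu_pos = np.array([[-0.75, -3.0], [0.75, 3.0]])     # Step 2: mixture means for class +1
    mu_neg = np.array([[3.0, -3.0], [-3.0, 3.0]])       #         mixture means for class -1
    comp = rng.integers(0, 2, size=n)                   # pick one mixture component per example
    means = np.where((y == 1)[:, None], mu_pos[comp], mu_neg[comp])
    X[:, :2] = means + rng.normal(size=(n, 2))          # Sigma = I for the first two features
    return X, y
```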

Fig. 8. The distribution of the per-feature coefficients for the Iris data set by using MRF with the tuning parameter set to 0.5. From this figure, we can readily determine the importance of the features: the third feature, “Petal Length,” has the highest coefficient and thus is of the greatest importance.

5.6 Experiments on Real-World Data Sets with Small or Medium Feature Size

5.6.1 Iris
As one of the classical examples, Fisher’s Iris data set [5] has three classes: Setosa, Versicolor, and Virginica. Each observation consists of four ordered features: “Sepal Length,” “Sepal Width,” “Petal Length,” and “Petal Width.” Nine-tenths of the observations in each class are used as training examples and the rest as test examples. The distribution of the per-feature coefficients is illustrated in Fig. 8. As seen from this figure, the third feature, “Petal Length,” has the highest MRF coefficient value. While there is no big difference among the remaining three features, we can still obtain an importance order: “Petal Width” > “Sepal Length” > “Sepal Width” (varying the tuning parameter within its feasible range can highlight the difference). These results are in accordance with previous conclusions on the Iris data set in the literature.
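As a rough cross-check of this ranking, a classical per-feature Fisher score (between-class scatter over within-class scatter) can be computed in a few lines; this is only an illustrative stand-in for the selector’s coefficients, which are obtained from the MRF objective rather than from this formula:

```python
import numpy as np
from sklearn.datasets import load_iris

def fisher_scores(X, y):
    """Classical Fisher score per feature: sum_c n_c (mu_cj - mu_j)^2 divided
    by sum_c n_c var_cj.  An illustrative stand-in, not the paper's coefficient."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den

iris = load_iris()
scores = fisher_scores(iris.data, iris.target)
for j in np.argsort(scores)[::-1]:          # print features from best to worst
    print(f"{iris.feature_names[j]:<20s}{scores[j]:8.2f}")
```

On Iris, the two petal measurements typically dominate such univariate scores, which is consistent with the ordering reported above.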

Fig. 9. Comparison of mRMR, RFE, L0, and MRF on the Iris data. L0 and RFE overlap in this figure and so do mRMR and MRF. Both L0 and RFE fail in this example, whereas mRMR and MRF produce the correct feature subsets. C4.5 is used as the classifier in this example.

Fig. 10. The coefficient distribution for the Optical Pen data set by using MRF with the tuning parameter set to 0.5.

Fig. 12. Comparison of mRMR, Random Forest (100 trees), RFE, L0, and MRF on the Leukemia-S3 data. A linear SVM classifier is used. MRF outperforms the other four methods. Including more noise features leads to larger classification error rates for all five feature selection methods.

Fig. 11. Comparison of mRMR, Random Forest (100 trees), MRF, RFE, and L0 on the Optical Pen data set. An SVM with an RBF kernel is used as the classifier. MRF achieves the lowest classification error rate when about 39 features are selected.

We compare several feature selection methods on this data set, and the results are illustrated in Fig. 9. In the same settings and using C4.5 as the classifier, both mRMR and our MRF method correctly determine the importance of the features, while L0 and RFE fail on this data set.

5.6.2 Optical Pen
To test its performance on multiclass data sets, we perform experiments on the Optical Pen data set from the UCI repository [58]. This medium-sized data set (64 features) has 10 classes, one per digit [59]. We uniformly sample three-quarters of the observations of each class as training examples and the rest as test examples. Fig. 10 shows the coefficient distribution obtained by the MRF method. From Fig. 11, we can see that MRF achieves the lowest classification error rate when 39 features are selected. We also perform experiments on this data set using different classifiers, with the comprehensive comparison results provided in Tables 2 and 4.
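The error-versus-feature-count curves reported throughout this section follow a common protocol: rank the features, keep the top k, train a classifier on the training portion, and measure the error on the held-out portion as k grows. The sketch below outlines that protocol with scikit-learn; the feature ranking is a placeholder argument, since in the paper it is produced by the MRF coefficients (Figs. 10 and 11), and the 75/25 stratified split mirrors the sampling described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def error_vs_num_features(X, y, ranking, ks, seed=0):
    """Stratified 75/25 split, then test error of an RBF-kernel SVM trained
    on the top-k ranked features, for each k in `ks`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.75, stratify=y, random_state=seed)
    errors = {}
    for k in ks:
        top = ranking[:k]                       # indices of the k best-ranked features
        clf = SVC(kernel="rbf").fit(X_tr[:, top], y_tr)
        errors[k] = 1.0 - clf.score(X_te[:, top], y_te)
    return errors

# Usage (hypothetical): `ranking` would come from a feature selector's scores.
# ranking = np.argsort(scores)[::-1]
# print(error_vs_num_features(X, y, ranking, ks=[5, 10, 20, 39, 64]))
```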



Fig. 13. Comparison of mRMR, Random Forest (100 trees), RFE, L0, and MRF on the NCI9 data. A linear SVM classifier is used. Both MRF and mRMR methods achieve the lowest error rate, but MRF requires only 12 features, while mRMR requires more than 34 features.

5.7 Experiments on Real-World High-Dimensional Data Sets

5.7.1 Leukemia
The Leukemia-S3 data set, originally from Golub et al. [60], has 7,070 features (it can be downloaded from http://penglab.janelia.org/proj/mRMR/). We run MRF and the other four methods to select a maximum of 120 features, and the classification results using a linear SVM classifier are illustrated in Fig. 12. A proper feature selection method is clearly vital for this data set, as only a handful of features are meaningful for class discrimination. Moreover, as more noise features are included, the classification error rates increase for all feature selection methods. The comprehensive comparison results are listed in Tables 2, 3, and 4.


Fig. 14. The coefficient distribution for the Prostate Cancer data set (tuning parameter set to 0.5). There are only a handful of distinct features (around 20).

Fig. 16. The coefficient distribution for the Lung Cancer data set (tuning parameter set to 0.5). Only a small fraction of the features are highly discriminative.

Fig. 15. Comparison of mRMR, Random Forest (100 trees), RFE, L0, and MRF on the Prostate Cancer data set. The 60 most important features are selected by each method. A linear SVM classifier is used. MRF outperforms the other four methods.

Fig. 17. Comparison of mRMR, Random Forest (100 trees), RFE, L0, and MRF on Lung Cancer data. The first 60 most important features are selected for each method. A linear SVM classifier is used. MRF outperforms the other four methods.

5.7.2 NCI9
The NCI9 data set [61], [62] has 60 observations, 9,703 features, and 9 types of cancers (it can be downloaded from http://penglab.janelia.org/proj/mRMR/). The training set is generated by uniformly sampling three-quarters of the observations of each class, and the rest serves as the test set. This data set is challenging: the lowest classification error rate when selecting up to 60 features is 36.84 percent, achieved by both MRF and mRMR with a linear SVM classifier. The classification error rates of the five methods with a linear SVM are shown in Fig. 13. The comprehensive comparison results are shown in Tables 2, 3, and 4.

5.7.3 Prostate Cancer
The Prostate Cancer data set [63], [64] has 12,600 features. We use the data from [63] as the training set and the data from [64] as an independent test set. The coefficient distribution is shown in Fig. 14; from this figure, we can see that there are only a handful of distinct features (around 20). We perform classification tests on the first 60 most significant features, and the resulting error rates are shown in Fig. 15. MRF outperforms the other four methods. The comprehensive comparison results are provided in Tables 2 and 3.
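Unlike the randomly split data sets above, the Prostate Cancer experiment trains on one cohort [63] and tests on an independent cohort [64]. A minimal sketch of that evaluation is given below; the array names and the feature ranking are placeholders (the paper’s selector supplies the actual ranking), and it assumes both cohorts have been preprocessed to share the same feature columns in the same order.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cross_cohort_error(X_train, y_train, X_test, y_test, ranking, k=60):
    """Train on one cohort, test on an independent one, using the top-k
    ranked features (k = 60 as in Fig. 15).  Assumes identical feature
    ordering across cohorts."""
    top = ranking[:k]
    clf = LinearSVC().fit(X_train[:, top], y_train)
    return 1.0 - clf.score(X_test[:, top], y_test)

# Usage (hypothetical arrays loaded from the Singh et al. and Welsh et al. data):
# err = cross_cohort_error(X_singh, y_singh, X_welsh, y_welsh, ranking, k=60)
```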

5.7.4 Lung Cancer
The coefficient distribution for the Lung Cancer data set (more than 12,000 features; downloaded from http://www.chestsurg.org) is shown in Fig. 16. The classification error rates are shown in Fig. 17. MRF achieves the best performance on this data set. The comprehensive comparison results using different classifiers are given in Tables 2 and 3.

6 CONCLUSIONS

In this paper, we have studied selecting the maximally separable subset of features for supervised classification from among all subsets of variables. A class of new feature selectors is formulated in the spirit of Fisher’s class separation criterion of maximizing between-class separability and minimizing within-class variations. The formulation particularly addresses three challenges encountered in pattern recognition and classification: a small sample size with a large number of features, linearly nonseparable classes, and noisy features. By choosing specific kernel functions, we construct the Fisher-Markov selectors, which reduce the general subset selection problem (simultaneous selection) to seeking optimal configurations on MRFs. For the resulting MRF problem, the Fisher-Markov selectors admit efficient algorithms that attain the global optimum of the class separability criterion without fitting the noise components. The resulting multivariate feature selectors are capable of selecting features simultaneously and have efficient computational complexity; for example, the LFS is linear in the number of features and quadratic in the number of observations. The Fisher-Markov selectors can be applied to general data with arbitrary dimensions and multiple classes, and the selected feature subsets are useful for general classifiers. We have conducted extensive experiments on synthetic data as well as standard real-world data sets, and the results confirm the effectiveness of the Fisher-Markov selectors.

ACKNOWLEDGMENTS

The authors thank the editor and the anonymous referees for their insightful comments and helpful suggestions, which helped improve the quality and composition of this paper. This work was partially supported by US National Science Foundation (NSF) award 0855221 and a Southern Illinois University Doctoral Fellowship. Q.C. and J.C. produced the formulation, algorithms, and analyses of this work. H.Z. performed the experiments and wrote the Experiments portion of this paper. Q.C. wrote the overall paper.

REFERENCES

[1] P. Domingos and M. Pazzani, “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss,” Machine Learning, vol. 29, pp. 103-130, 1997. [2] S. Dudoit, J. Fridlyand, and T. Speed, “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data,” J. Am. Statistical Assoc., vol. 97, pp. 77-87, 2002. [3] J. Fan and Y. Fan, “High Dimensional Classification Using Features Annealed Independence Rules,” Annals of Statistics, vol. 36, pp. 2232-2260, 2008. [4] T.M. Cover, “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition,” IEEE Trans. Electronic Computers, vol. 14, no. 3, pp. 326-334, June 1965. [5] R.A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, vol. 7, pp. 179-188, 1936. [6] W.J. Dixon and F.J. Massey, Introduction to Statistical Analysis, second ed. McGraw-Hill, 1957. [7] M.G. Kendall, A Course in Multivariate Analysis. Griffin, 1957. [8] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, 1982. [9] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990. [10] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, 2004. [11] K.-S. Fu, Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968. [12] K.-S. Fu, P.J. Min, and T.J. Li, “Feature Selection in Pattern Recognition,” IEEE Trans. Systems Science and Cybernetics, vol. 6, no. 1, pp. 33-39, Jan. 1970. [13] C.H. Chen, “On a Class of Computationally Efficient Feature Selection Criteria,” Pattern Recognition, vol. 7, pp. 87-94, 1975.


[14] P. Narendra and K. Fukunaga, “A Branch and Bound Algorithm for Feature Subset Selection,” IEEE Trans. Computers, vol. 26, no. 9, pp. 917-922, Sept. 1977. [15] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Company, 1989. [16] K. Kira and L.A. Rendall, “A Practical Approach to Feature Selection,” Proc. Int’l Conf. Machine Learning, pp. 249-256, 1992. [17] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” J. Royol Statistical Soc. Series B: Methodological, vol. 58, pp. 267-288, 1996. [18] D.L. Donoho and M. Elad, “Optimally Sparse Representation in General (Nonorthogonal) Dictionaries via l1 Minimization,” Proc. Nat’l Academy of Sciences USA, vol. 100, pp. 2197-2202, 2003. [19] D.L. Donoho, “Compressed Sensing,” IEEE Trans. Information Theory, vol. 52, no. 4, pp. 1289-1306, Apr. 2006. [20] E.J. Candes, J. Romberg, and T. Tao, “Stable Signal Recovery from Incomplete and Inaccurate Measurements,” Comm. Pure and Applied Math., vol. 59, pp. 1207-1223, 2006. [21] E. Candes and T. Tao, “The Dantzig Selector: Statistical Estimation When p Is Much Larger Than n,” Annals of Statistics, vol. 35, no. 6, pp. 2313-2351, 2007. [22] G.J. McLachlan, R.W. Bean, and D. Peel, “A Mixture Model-Based Approach to the Clustering of Microarray Expression Data,” Bioinformatics, vol. 18, pp. 413-422, 2002. [23] H. Peng, F. Long, and C. Ding, “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005. [24] L. Wang, “Feature Selection with Kernel Class Separability,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1534-1546, Sept. 2008. [25] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature Selection for SVMs,” Advances in Neural Information Processing Systems, T.K. Leen, T.G. Dietterich, and V. Tresp, eds., pp. 668-674, MIT Press, 2000. [26] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, vol. 46, nos. 1-3, pp. 389-422, 2002. [27] A. Webb, Statistical Pattern Recognition, second ed. Wiley, 2002. [28] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1999. [29] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. [30] D. Koller and M. Sahami, “Toward Optimal Feature and Subset Selection Problem,” Proc. Int’l Conf. Machine Learning, pp. 284-292, 1996. [31] E.B. Fowlkes, R. Gnanadesikan, and J.R. Kettenring, “Variable Selection in Clustering and Other Contexts,” Design, Data, and Analysis, C.L. Mallows, ed., pp. 13-34, Wiley, 1987. [32] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley-Interscience, 2000. [33] P. Bickel and E. Levina, “Some Theory of Fisher’s Linear Discriminant Function, ‘Naive Bayes,’ and Some Alternatives Where There Are Many More Variables Than Observations,” Bernoulli, vol. 10, pp. 989-1010, 2004. [34] S. Mika, G. Ratsch, and K.-R. Muller, “A Mathematical Programming Approach to the Kernel Fisher Algorithm,” Advances in Neural Information Processing Systems, vol. 13, pp. 591-597, MIT Press, 2001. [35] V.N. Vapnik, Statistical Learning Theory. Wiley, 1998. [36] B. Scholkopf and A.J. Smola, Learning with Kernels. MIT Press, 2002. [37] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1999. [38] S. Boyd and L. 
Vandenberghe, Convex Optimization. Cambridge Univ. Press, 2004. [39] S. Fidler, D. Slocaj, and A. Leonardis, “Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 337-350, Mar. 2006. [40] H. Akaike, “A New Look at the Statistical Model Identification,” IEEE Trans. Automatic Control, vol. 19, no. 6, pp. 716-723, Dec. 1974. [41] G. Schwarz, “Estimating the Dimension of a Model,” Annals of Statistics, vol. 6, pp. 361-379, 1978. [42] D.P. Foster and E.I. George, “The Risk Inflation Criterion for Multiple Regression,” Annals of Statistics, vol. 22, pp. 1947-1975, 1994.


[43] J. Weston, A. Elisseeff, B. Schlkopf, and M.E. Tipping, “Use of the Zero-Norm with Linear Models and Kernel Methods,” J. Machine Learning Research, vol. 3, pp. 1439-1461, 2003. [44] P.E. Greenwood and A.N. Shiryayev, Contiguity and the Statistical Invariance Principle. Gordon and Breach, 1985. [45] M. Rosenblatt, Gaussian and Non-Gaussian Linear Time Series and Random Fields. Springer, 2000. [46] D. Bosq, Nonparametric Statistics for Stochastic Processes. Springer, 1998. [47] S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721-741, Nov. 1984. [48] G. Winkler, Image Analysis, Random Fields and Dynamic Monte Carlo Methods: A Mathematical Introduction, third ed. Springer-Verlag, 2006. [49] S. Dai, S. Baker, and S.B. Kang, “An MRF-Based Deinterlacing Algorithm with Exemplar-Based Refinement,” IEEE Trans. Image Processing, vol. 18, no. 5, pp. 956-968, May 2009. [50] D.S. Hochbaum, “An Efficient Algorithm for Image Segmentation, Markov Random Fields and Related Problems,” J. ACM, vol. 48, no. 2, pp. 686-701, 2001. [51] J.P. Picard and H.D. Ratliff, “Minimum Cuts and Related Problem,” Networks, vol. 5, pp. 357-370, 1975. [52] H. Ishikawa, “Exact Optimization for Markov Random Fields with Convex Priors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1333-1336, Oct. 2003. [53] V. Kolmogorov and R. Zabih, “What Energy Can Be Minimized via Graph Cuts?” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147-159, Feb. 2004. [54] Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001. [55] M. Wainwright, T. Jaakkola, and A. Willsky, “MAP Estimation via Agreement on (Hyper)Trees: Message-Passing and Linear Programming,” IEEE Trans. Information Theory, vol. 51, no. 11, pp. 3697-3717, Nov. 2005. [56] J. Yedidia, W. Freeman, and Y. Weiss, “Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms,” IEEE Trans. Information Theory, vol. 51, no. 7, pp. 2282-2312, July 2004. [57] J. Demsar, “Statistical Comparisons of Classifiers over Multiple Data Sets,” J. Machine Learning Research, vol. 7, pp. 1-30, 2006. [58] C.L. Blake, D.J. Newman, S. Hettich, and C.J. Merz, UCI Repository of Machine Learning Databases, http://www.ics. uci.edu/mlearn/MLRepository.html, 1998. [59] M.D. Garris et al., NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994. [60] T. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, vol. 286, pp. 531-537, http://www.broad.mit.edu/cgibin/cancer/datasets.cgi, 1999. [61] D.T. Ross et al., “Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines,” Nature Genetics, vol. 24, no. 3, pp. 227-234, 2000. [62] U. Scherf et al., “A cDNA Microarray Gene Expression Database for the Molecular Pharmacology of Cancer,” Nature Genetics, vol. 24, no. 3, pp. 236-244, 2000. [63] D. Singh et al., “Gene Expression Correlates of Clinical Prostate Cancer Behavior,” Cancer Cell, vol. 1, pp. 203-209, http:// www.broad.mit.edu/cgi-bin/cancer/datasets.cgi, 2002. [64] J.B. Welsh et al., “Analysis of Gene Expression Identifies Candidate Markers and Pharmacological Targets in Prostate Cancer,” Cancer Research, vol. 61, pp. 5974-5978, 2001.


Qiang Cheng received the BS degree in mathematics and the MS degree in applied mathematics and computer science from Peking University, China, and the PhD degree from the Department of Electrical and Computer Engineering at the University of Illinois, UrbanaChampaign (UIUC). Currently, he is an assistant professor in the Department of Computer Science at Southern Illinois University Carbondale. During the summer of 2000, he worked at the IBM T.J. Watson Research Center, Yorktown Heights, New York. In June 2002, he joined the Electrical and Computer Engineering Department at Wayne State University, Detroit, Michigan, as an assistant professor. From May to August 2005, he was an AFOSR Faculty fellow at the Air Force Research Laboratory, Wright-Patterson, Ohio. From August 2005 to June 2007, he was a senior researcher/ research scientist at Siemens Medical Solutions, Siemens Corporate Research, Siemens Corp., Princeton, New Jersey. His research interests include pattern recognition, machine learning theory, signal and image processing, and their applications to biomedicine and engineering. He was the recipient of various awards and privileges, including an Outstanding Leadership Award, the 12th IEEE International Conference on Computational Science and Engineering (CSE-09) and the IEEE International Conference on Embedded and Ubiquitous Computing (EUC-09), Vancouver, Canada, September 2009, the Achievement Awards from Siemens Corporation, 2006, the University Research/Enhancement Awards from Wayne State University in 2003, 2004, and 2005, the Invention Achievement Award from the IBM T.J. Watson Research Center, 2001, a University Fellowship at UIUC from 1997 to 1998, a Guanghua University Fellowship at Peking University, China, from 1995 to 1996, the Antai Award for Academic Excellence from Peking University, China, 1995, and other awards and scholarships. He has a number of international (US, Germany, and China) patents issued or filed with the IBM T.J. Watson Research Laboratory, Yorktown Heights, and Siemens Medical, Princeton, New Jersey. He is a member of the IEEE and the IEEE Computer Society. Hongbo Zhou received the bachelor’s degree in physics and the master’s degree in mathematics from the Beijing University of Aeronautics and Astronautics, China. He is currently a PhD candidate in the Computer Science Department, Southern Illinois University Carbondale. His research interests include sparse coding and various robust signal representations, approximation inference and optimization in large-scale probability graphical models, and he has published more than 10 papers on these subjects. He is a member of the IEEE Communications and Information Security Technical Committee. Jie Cheng received the BS and MS degrees in computer science from Southern Illinois University Carbondale (SIUC) in 2003 and 2005, respectively, and the PhD degree in electrical and computer engineering from SIUC in 2009. She is an assistant professor in the Department of Computer Science at the University of Hawaii at Hilo. Prior to her current position, she was an assistant professor for one year in the Department of Mathematics and Computer Science at Bemidji State University, Minnesota. She held a doctoral fellowship and a number of scholarships and awards at SIUC. Her research interests includes pattern recognition, bioinformatics, high-performance computing, and visualization.
