Pattern Recognition 42 (2009) 409–424


Performance of feature-selection methods in the classification of high-dimension data

Jianping Hua a, Waibhav D. Tembe b, Edward R. Dougherty a,c,∗

a Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
b High Performance Bio-Computing Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
c Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA

ARTICLE INFO

Article history:
Received 1 October 2007
Received in revised form 13 June 2008
Accepted 1 August 2008

Keywords:
Classification
Feature selection
Microarray

ABSTRACT

Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performance of feature-selection algorithms for different distribution models and classifiers. Both the classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test perform better than or similarly to wrapper methods on harder problems, but this improved performance is usually accompanied by significant peaking. Wrapper methods have better performance when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, performs worse than the univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Contemporary high-throughput technologies used in molecular biology offer the ability to simultaneously measure vast numbers of biological variables (features), thereby providing enormous amounts of multivariate data with which to build classifiers. The most common scenario uses data from gene-expression microarrays, where thousands of genes constitute the features and the expression values on a single microarray chip form a sample point. Applications include disease diagnosis [1,2] and prognosis prediction [3]. In a typical microarray study, each microarray chip normally contains 20,000 or more probes (features), while the sample size (number of microarrays) is usually at most a few hundred, and in most cases less than 100.

∗ Corresponding author at: Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA. Tel.: +1 979 8628154. E-mail addresses: [email protected] (J. Hua), [email protected] (W.D. Tembe), [email protected] (E.R. Dougherty).

0031-3203/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2008.08.001

This setting constitutes an extreme case of high dimensionality, meaning there is a very large number of available features relative to the sample size. In such circumstances, it is commonplace for a classification rule to overfit the data, meaning that the rule yields a classifier that separates the sample data well but does poorly on the general distribution, so that average performance on future data is poor. Relative to the number of features, overfitting manifests itself in the peaking phenomenon, whereby the error of the designed classifier first declines as more features are used and then begins to increase [4–9]. Consequently, feature selection is inevitably part of classifier design.

From the perspective of classifier design, a feature-selection algorithm is part of the classification rule, with feature selection followed by a standard classifier applied to the selected subset of features. If the feature selection is conducted independently of the classifier, as with the t-test [10], it is normally referred to as a filter method. If feature selection uses the classifier to evaluate the performance of each candidate subset, as in sequential floating forward selection (SFFS) [11], it is normally referred to as a wrapper method.


Feature selection can also be a combination of both, which is essentially the generalized wrapper method. In some cases, feature selection is strongly coupled with classifier design, as in boosting or recursive ridge regression [12], which is referred to as the embedded method. More complete reviews of feature-selection methods and their applications to expression-based classification are available in Jain and Zongker [13] and Saeys et al. [14]. In any case, feature selection is part of the overall classification rule and, relative to this rule, the number of features is the number of data measurements, not the final number used in the designed classifier. Feature selection results in a subfamily of the original family of classifiers, and thereby constitutes a form of constraint. It does yield a reduction in the dimensionality of the feature space relative to design.

Numerous feature-selection algorithms have been proposed during the last few decades and several extensive comparative studies have been conducted [13,15,16]. However, these studies were carried out under scenarios dramatically different from those encountered in contemporary high-dimensional biological settings, the salient difference being the number of features considered. In Ref. [13], the scale of the feature sets is 20 (synthetic data) and 18 (SAR images). In Refs. [15,16], the largest feature size is only 125. In Table 6 of Ref. [16], the recommendation of feature-selection algorithms is broken down according to the size of the full feature set, and any feature size larger than 100 is categorized as "very large". One cannot expect the results of these studies to satisfactorily describe settings involving 20,000 features.

Recently, comparative studies have also been done on expression-based classification [17–21]. For instance, one study compares the performance of several feature-selection algorithms with regard to finding genes that provide good performance for a specific classification algorithm [20]. These studies use only real data, which in most cases have sample sizes less than 100. The lack of ground truth in real data limits the basis of judgment regarding error rate. One possibility is potential mislabeling. Even if labeling is correct, a much greater problem in the case of small samples, when the training data must be used for error estimation, arises from the fact that cross-validation and bootstrap error estimators are poorly correlated with the true error, which, compounded with their substantial variance in small-sample settings, renders them unreliable [22]. In addition, most studies treat the classification rule as a whole, taking all factors into consideration, rather than focusing on the general trends and properties of feature selection itself.

In this paper we directly compare feature-selection methods as they apply in different circumstances, with emphasis on contemporary high-dimensional settings. From a practical standpoint, it is impossible to conduct a comprehensive study of all existing feature-selection methods. To avoid unnecessary confounding factors, we fix the classifier and focus on the trends of the feature-selection algorithms given that classifier. We consider several popular feature-selection algorithms from the filter and wrapper families.
The choice of algorithms is not meant to indicate that algorithms excluded from our study, e.g., the embedded methods, are inferior to the ones we have chosen; rather, we have chosen what we believe is a representative sample with which to examine feature selection for high-dimensional data. For comparison purposes, our study is conducted on both synthetic and real data. The synthetic data are generated from various feature-label distribution models that emulate situations in microarray-based problems, such as multiple subtypes, different chip designs, complicated correlation structure among the features, and different sample sizes. Our intention is not simply to compare feature-selection procedures, but also to define distribution models that characterize the kinds of feature relations one sees in contemporary experiments. In the case of the real data sets, to ensure reliable error estimation, all data sets selected have sample sizes larger than 180.

Owing to the scale of the simulation, not all results can be included in the paper. We provide illustrative results in the paper, and the complete results are available at the companion website (http://compbio.tgen.org/paper_supp/fs_highdim/).

2. Methods

2.1. Feature selection

The objective of our study is to compare the classification performance of feature-selection methods when training sample sizes are limited and the number of features is huge. We consider several popular feature-selection algorithms. There are eight filter (classifier-independent) methods: information gain, twoing rule, sum minority, max minority, gini index, sum of variances, t-test, and ReliefF. The first six algorithms are implemented based on the code from the RankGene software [23]. ReliefF is an extension of the popular Relief algorithm [24,25]; it can handle multiple classes and is more robust to noise. In this study, we choose three nearest neighbors (NNs) when searching for the hits and misses in ReliefF. All of these methods are univariate except ReliefF, which is multivariate.

Besides the filter-based algorithms, there are two wrapper (classifier-specific) methods: sequential forward selection (SFS) and SFFS [11]. Descriptions of these feature-selection methods are provided at the companion website. Both SFS and SFFS take an iterative approach, and the computation cost quickly becomes prohibitive for large-scale problems like expression-based classification. Since our study also intends to cover 180 distribution models over a wide range of parameters, we choose a two-stage feature-selection scheme to allow our simulation to be completed in a reasonable time. It has been reported by Kudo and Sklansky [15] that a two-stage feature-selection method can significantly reduce the computational cost and possibly find an even better feature subset than directly applying a classifier-specific feature-selection algorithm to the full feature set.

In the first stage of a two-stage design, a classifier-independent feature-selection algorithm is used to remove most of the non-informative features. In the second stage, a classifier-specific feature-selection algorithm is applied to further refine the feature set from the first stage. In sum, during feature selection, the candidate feature set is filtered down to a moderate size in the first stage and passed to the second stage for another round of feature selection, which returns the feature subsets up to a designated size. Any of the eight filter methods considered in this paper can be used in the first stage; however, exhausting all possible combinations would enlarge the simulation enormously and generate an overwhelming amount of results. Hence, for our study we pick two typical filter methods, the t-test and ReliefF, for the first stage, to represent the univariate and multivariate methods, respectively. For the second stage, SFS and SFFS are used. Note that the common feature-selection algorithms can be viewed as special cases of the two-stage scheme by allowing "no feature selection" as a choice in either stage: if there is no feature selection in the second stage, then the overall scheme is just the t-test, ReliefF, or one of the other six filter methods. The same choice can be applied to the first stage if one wants to apply SFS or SFFS to the full feature set, which is not considered in this study.
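To make the two-stage scheme concrete, here is a minimal sketch assuming a t-test first stage and an SFS second stage. All names (t_test_scores, two_stage_selection, the error_estimate callback, D2, d) are ours for illustration, not from the paper; in the paper the wrapper stage scores candidate subsets with the bolstered error estimators described in Section 2.3.

```python
import numpy as np

def t_test_scores(X, y):
    """Absolute two-sample t-statistic per feature (univariate filter)."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = X0.mean(axis=0) - X1.mean(axis=0)
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) + X1.var(axis=0, ddof=1) / len(X1))
    return np.abs(diff / (se + 1e-12))

def two_stage_selection(X, y, error_estimate, D2=1000, d=5):
    """Stage 1: keep the D2 top-ranked features by t-test.
    Stage 2: SFS wrapper -- greedily add the feature whose inclusion
    gives the lowest estimated error for the chosen classifier."""
    stage1 = np.argsort(t_test_scores(X, y))[::-1][:D2]
    selected, remaining = [], list(stage1)
    while len(selected) < d and remaining:
        errs = [error_estimate(X[:, selected + [f]], y) for f in remaining]
        selected.append(remaining.pop(int(np.argmin(errs))))
    return selected
```

Because the greedy order is nested, the prefix of length m of the returned list serves as the selected feature set of size m, which is how error-versus-feature-size curves are produced.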
Throughout this paper, filter methods are denoted by their names, e.g., t-test and ReliefF. The terms SFS and SFFS refer to the wrapper schemes, where the first stage can be either the t-test or ReliefF (which will be specified for particular applications).
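Since ReliefF is the only multivariate filter among the eight, a brief two-class sketch of its weighting idea may also be useful. This is our own simplified illustration, not the paper's implementation; the full ReliefF of [24,25] additionally handles multiple classes and missing values. The three nearest neighbors match the setting used in this study.

```python
import numpy as np

def relieff_weights(X, y, n_neighbors=3):
    """Two-class ReliefF-style feature weights (assumes pre-scaled features)."""
    n, D = X.shape
    w = np.zeros(D)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)          # L1 distance to every point
        dist[i] = np.inf                             # exclude the point itself
        hits = np.argsort(np.where(y == y[i], dist, np.inf))[:n_neighbors]
        misses = np.argsort(np.where(y != y[i], dist, np.inf))[:n_neighbors]
        w -= np.abs(X[hits] - X[i]).mean(axis=0)     # near-hit differences penalize
        w += np.abs(X[misses] - X[i]).mean(axis=0)   # near-miss differences reward
    return w / n
```

Features whose values differ mainly across classes receive large positive weights; features that vary as much within a class as between classes are driven toward zero.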

2.2. Distribution models for synthetic data

In this section, we describe how the distribution models for synthetic data generation are constructed and explain how our choice of model structure fits various observations made in microarray expression-based studies.

[Fig. 1: a SAMPLES × FEATURES diagram showing, for class 0 and class 1 (sub-classes 0 and 1), the four feature types: global markers, heterogeneous markers, high-variance non-markers, and low-variance non-markers.]
Fig. 1. A demonstration of four feature types in constructing the synthetic data.

We have constructed a series of distribution models, and every synthetic data set is generated by its own distribution model. Each distribution model emulates a full feature-label distribution of feature size D = 20,000. Two-class models (class 0 vs. class 1) of equal likelihood are considered. For class 1, we further assume that there are c sub-classes, which emulate cases like the subtypes or stages of a certain disease. These sub-classes are mutually exclusive, so that each sample point in class 1 belongs to one and only one sub-class. We further assume that all c sub-classes of class 1 are of equal likelihood and that each sub-class has its own distribution. The D = 20,000 features are of two types, markers and non-markers, which are further divided into four finer types: global markers, heterogeneous markers, high-variance non-markers, and low-variance non-markers. Fig. 1 provides a symbolic demonstration of how a synthetic data set is constructed.

2.2.1. Markers and distribution model types

Markers are the features that have different class-conditional distributions in the two classes. We define two types of markers in our models:

• Global marker: There are altogether Dgm global markers and they are homogeneous in each class. The class-conditional distributions are Dgm-dimension Gaussians: N(μ0^gm, Σ0^gm) for class 0 and N(μ1^gm, Σ1^gm) for class 1, where μ0^gm and μ1^gm are the mean vectors of classes 0 and 1, respectively, and Σ0^gm and Σ1^gm the covariance matrices.

• Heterogeneous marker: Each sub-class of class 1 is associated with Dhm heterogeneous markers distributed as a Dhm-dimension Gaussian N(μ1^hm, Σ1^hm), where μ1^hm is the mean vector and Σ1^hm the covariance matrix. The sample points in the other sub-classes and in class 0 have the Gaussian distribution N(μ0^hm, Σ0^hm) over the same Dhm features, where μ0^hm is the mean vector and Σ0^hm the covariance matrix. Like the sample points of the sub-classes, the heterogeneous markers are mutually exclusive: one heterogeneous marker can only be associated with one sub-class. Thus, there are altogether c × Dhm heterogeneous markers.

Introduction of sub-classes and the corresponding heterogeneous markers results in more complicated, nonlinear decision boundaries. Fig. 2 shows the class density functions and decision boundaries of several two-feature distributions. When both features are global markers with equal variance, the decision boundary is linear, as shown in Fig. 2(a). By changing one feature to a heterogeneous marker, as in Fig. 2(b), the decision boundary becomes nonlinear. When both features are heterogeneous markers, the decision boundary forms a corner shape, as in Fig. 2(c). Another common way to introduce nonlinearity is by changing the variance; the resulting boundary is a smooth quadratic curve, as in Fig. 2(d). Combining both heterogeneous markers and unequal variance leads to more nonlinear boundaries like those in Figs. 2(e) and (f). For a general high-dimension distribution, one can see that the decision boundary becomes more nonlinear with fewer global markers and larger differences in variance.

We use a block-based structure for the marker covariance matrices. Since global markers and heterogeneous markers share the same structure, in the remainder of the paper we will use {Σ0, Σ1} to represent both {Σ0^gm, Σ1^gm} and {Σ0^hm, Σ1^hm} when there is no confusion. The markers are divided into equal-size blocks, with each block containing k markers. Markers of different blocks are uncorrelated, and markers of the same block possess the same pairwise correlation ρ. It should be noted that one would not expect to observe such a simplified relationship in real data, but the results gleaned from these synthetic data can hopefully serve as a reference to help make better decisions in real data applications. In our distribution models, we let Σ0 and Σ1 have identical structure except for the variance, namely, Σ0 = σ0²R and Σ1 = σ1²R, where R is the block-diagonal matrix

$$
R = \begin{bmatrix}
B & 0 & \cdots & 0 \\
0 & B & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & B
\end{bmatrix},
\qquad
B = \begin{bmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \cdots & \rho \\
\vdots & \vdots & \ddots & \vdots \\
\rho & \rho & \cdots & 1
\end{bmatrix}_{k \times k},
$$

with 1 on the diagonal, ρ within each k × k block, and 0 elsewhere.
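The block-based covariance is straightforward to realize in code. The following is a minimal sketch under the paper's parameterization; the function and argument names are ours:

```python
import numpy as np

def block_covariance(n_markers, k=5, rho=0.8, sigma=0.4):
    """Sigma = sigma^2 * R: R is block-diagonal with 1 on the diagonal,
    rho within each k-feature block, and 0 across blocks."""
    R = np.zeros((n_markers, n_markers))
    for start in range(0, n_markers, k):
        blk = slice(start, min(start + k, n_markers))
        R[blk, blk] = rho                 # fill the whole block with rho
    np.fill_diagonal(R, 1.0)              # then restore the unit diagonal
    return sigma ** 2 * R
```

With k = 5 and ρ = 0.8, each marker is strongly correlated with the four other markers of its block and uncorrelated with the rest.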

[Fig. 2: six surface plots of two-feature class-conditional densities, equal-variance cases (a)–(c) and unequal-variance cases (d)–(f), over axes labeled global marker and heterogeneous marker.]
Fig. 2. Effects of heterogeneous markers on the decision boundary. The distribution contains two features. (a) Both features are global markers, equal variance. (b) One feature is a global marker, the other a heterogeneous marker, equal variance. (c) Both features are heterogeneous markers, equal variance. (d) Both features are global markers, unequal variance. (e) One feature is a global marker, the other a heterogeneous marker, unequal variance. (f) Both features are heterogeneous markers, unequal variance.

As mentioned earlier in this section, setting different variance pairs {σ0, σ1} is a common way to introduce nonlinearity. Next, by choosing different mean vectors, we define three basic distribution model types for our study. Again, since global markers and heterogeneous markers share the same structure, we will use {μ0, μ1} to represent both {μ0^gm, μ1^gm} and {μ0^hm, μ1^hm}:

• Redundant model type: In this model type, we let μ0 = (0, 0, ..., 0) and μ1 = (1, 1, ..., 1). Since the correlated features have identical mean values, the performance gain from extra correlated features decreases for larger ρ, and the extra features become totally redundant when ρ = 1. The left plot of Fig. 3 shows a contour plot of the density functions of the redundant model type, assuming the full distribution contains two global markers, with block size k = 2, σ0 = σ1 = 0.4, and ρ = 0.8. The case shown has a Bayes error of approximately 0.2.

• Synergetic model type: Again we let μ0 = (0, 0, ..., 0). For μ1, the markers are divided into groups of size k according to the block structure of the covariance matrix. For the k markers of the same block, the mean vector is (1, (k − 1)/k, ..., 1/k). This structure favors multivariate selection methods over univariate methods. The middle plot of Fig. 3 shows a contour plot of the synergetic model type with the same general settings as those of the redundant model type. Although the centers of the two classes are closer compared to the redundant model type, due to the high correlation between features the two classes are still well separated and the Bayes error drops to approximately 0.19. However, feature two is now less separated than feature one, and hence is harder to detect by a univariate feature-selection algorithm such as the t-test.

• Marginal model type: Again we let μ0 = (0, 0, ..., 0) and divide the markers of μ1 into groups of size k according to the block structure of the covariance matrix. For the k markers in each block, the mean vector is set to (1, 0, ..., 0). This is an extreme case of the synergetic effect, because the benefit of extra features is marginal. The right plot of Fig. 3 shows a typical contour plot of the marginal model type, with the same general settings as the other two model types. Feature two now has zero discriminating power by itself, which makes it literally impossible for a univariate algorithm to detect. However, the high correlation between features makes the two classes more separated, and this example has a much lower Bayes error, approximately 0.09, indicating the need for multivariate feature-selection methods.

More discussion of the synergetic effects among features and their impact on feature selection can be found in a study by Dougherty and Brun [26]. (A sketch of a data generator for these models follows the non-marker definitions below.)

2.2.2. Non-markers

Non-markers are features that provide no discriminating power between the two classes. In our study, we define two types of non-markers:

• High-variance non-markers: These features are totally uncorrelated. Each such feature follows a mixture of Gaussians, pN(0, σ0) + (1 − p)N(1, σ1), where σ0 and σ1 are the same values used in the markers. The probability p is drawn from a uniform distribution over [0, 1] once for each high-variance non-marker and is applied to all sample points of both classes. In a microarray study, these features can be viewed as genes regulated by mechanisms unrelated to the one that regulates the class-0 and class-1 phenotypes. The number of high-variance non-markers is denoted by Dhv.

• Low-variance non-markers: These features are also uncorrelated. Each such feature is distributed as N(0, σ0). The number of low-variance non-markers is Dlv = D − Dgm − c × Dhm − Dhv.
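Putting the pieces of Sections 2.2.1 and 2.2.2 together, the sketch below draws sample points from a redundant-type model with all four feature types, reusing block_covariance() from the earlier sketch. Names and defaults are illustrative, and D is scaled down from 20,000 only to keep the example fast; the synergetic and marginal types would differ only in the per-block mean vectors.

```python
import numpy as np

def sample_synthetic(n, D_gm=20, D_hm=40, c=2, k=5, rho=0.8,
                     sigma0=0.7, sigma1=0.7, D_hv=200, D=2000, seed=None):
    """Draw n sample points from a redundant-type model
    (mu0 = 0 and mu1 = 1 on every marker)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)              # equally likely classes
    sub = rng.integers(0, c, size=n)            # sub-class within class 1
    D_lv = D - D_gm - c * D_hm - D_hv           # low-variance non-markers
    X = np.empty((n, D))
    S0 = block_covariance(D_gm, k, rho, sigma0)
    S1 = block_covariance(D_gm, k, rho, sigma1)
    T0 = block_covariance(D_hm, k, rho, sigma0)
    T1 = block_covariance(D_hm, k, rho, sigma1)
    p = rng.uniform(size=D_hv)                  # one mixing probability per feature
    for i in range(n):
        # global markers: mean 0 in class 0, mean 1 in class 1
        mu = np.full(D_gm, float(y[i]))
        X[i, :D_gm] = rng.multivariate_normal(mu, S1 if y[i] else S0)
        col = D_gm
        # heterogeneous markers: mean 1 only on the sample's own sub-class block
        for j in range(c):
            own = (y[i] == 1 and sub[i] == j)
            mu = np.full(D_hm, 1.0 if own else 0.0)
            X[i, col:col + D_hm] = rng.multivariate_normal(mu, T1 if own else T0)
            col += D_hm
        # high-variance non-markers: class-independent mixture p*N(0,.) + (1-p)*N(1,.)
        pick = rng.uniform(size=D_hv) < p
        X[i, col:col + D_hv] = np.where(pick, rng.normal(0.0, sigma0, D_hv),
                                        rng.normal(1.0, sigma1, D_hv))
        col += D_hv
        # low-variance non-markers
        X[i, col:] = rng.normal(0.0, sigma0, D_lv)
    return X, y
```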

[Fig. 3: three contour plots (redundant, synergetic, and marginal model types), each showing N(μ0, Σ0) and N(μ1, Σ1) over feature 2 (horizontal) vs. feature 1 (vertical).]
Fig. 3. Example contour plots of the density functions of the three distribution model types. The full distribution contains two global markers, with block size k = 2, σ0 = σ1 = 0.4, and ρ = 0.8. The Bayes errors are 0.202 for the redundant model type, 0.189 for the synergetic model type, and 0.094 for the marginal model type.

2.2.3. Observable feature set

Although our full distribution is designed with a total feature size D, we also consider cases where only a subset of the features is available, because in practice one often does not have all genes of interest covered in an experiment. This can be due to the trade-off between cost and coverage, the limitations of microarray chip-design technology, or many other practical reasons. For example, Affymetrix's Human HG-Focus Target Array represents slightly over 8500 genes, while its Human Genome U133 Plus 2.0 Array provides coverage of over 47,000 transcripts. Since all genes represented on the HG-Focus Target Array are fully covered by the Human Genome U133 Plus 2.0 Array, Affymetrix suggests that by using the HG-Focus Target Array one can run a cost-efficient preliminary study while keeping full compatibility with any future data obtained through the larger chips.

In our study, we assume the size of the observable feature set available for feature selection and classifier design is some value D′ ≤ D. If D′ = D, then there is no hidden feature space and the full distribution is presented. If D′ < D, then we consider two ways of choosing the D′ features, based on how the markers are chosen (a small code sketch of the two designs is given below):

• Optimal design: All markers are kept among the D′ features, with the non-markers randomly selected from the full distribution. This could represent a well-designed chip covering all related genes.
• Random design: The D′ features are randomly chosen from the D features.

In this way, for each distribution, the various choices of D′ plus the optimal/random design provide a series of distributions of different complexities for feature selection. In real microarray experiments, since chips are usually designed for a more general purpose, the data should fall somewhere between the two designs.

Before we complete our description of the distribution models and move to the simulation set-up, we note that the models designed here are not intended to cover every possible real scenario. For example, we deliberately exclude extremely nonlinear cases, such as XOR, where univariate filter methods will certainly fail and SFFS/SFS will very likely be ineffective.¹

¹ A simple example of a two-dimension XOR distribution occurs when the class-conditional distributions of both classes are mixtures of two equiprobable Gaussians. The two mean vectors of class 0 are μ00 = (1, −1) and μ01 = (−1, 1); the two mean vectors of class 1 are μ10 = (1, 1) and μ11 = (−1, −1). The covariance matrices of all four Gaussians are identical. A similar example in real biological studies can be found in the review paper by Carlborg and Haley [27].
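As referenced above, a minimal sketch of forming the observable feature set under the two designs follows. The names are ours; marker_idx stands for the index set of the true markers, which is known only because the data are synthetic.

```python
import numpy as np

def observable_subset(D, D_prime, marker_idx, optimal=True, seed=0):
    """Indices of the D' observable features under the two designs."""
    rng = np.random.default_rng(seed)
    marker_idx = np.asarray(marker_idx)
    if optimal:
        # optimal design: keep every marker, fill up with random non-markers
        non_markers = np.setdiff1d(np.arange(D), marker_idx)
        fill = rng.choice(non_markers, size=D_prime - len(marker_idx), replace=False)
        return np.sort(np.concatenate([marker_idx, fill]))
    # random design: any D' of the D features, markers included only by chance
    return np.sort(rng.choice(D, size=D_prime, replace=False))
```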

Table 1
Distribution model parameters used to create various distributions of the synthetic data

Parameters                    | Values/descriptions
Distribution model type       | Redundant / synergetic / marginal
Variances                     | σ0 = 0.4, σ1 = 0.4 (equal and small variance); σ0 = 0.7, σ1 = 0.7 (equal and large variance); σ0 = 0.3, σ1 = 0.5 (unequal and small variance); σ0 = 0.5, σ1 = 0.8 (unequal and large variance)
Feature block size            | k = 5
Feature block correlation     | ρ = 0.8
Sub-classes                   | c = 2
Markers                       | Dgm = 0, Dhm = 50; Dgm = 20, Dhm = 40; Dgm = 40, Dhm = 30
Non-markers                   | Dhv = 2000
Observable feature size       | D′ = 5000 / 10,000 / 20,000
Observable feature set type   | Optimal / random design

Altogether there are 180 different observable distributions.

We have designed our models to cover a range of distributions of various complexity levels for popular feature-selection schemes, and we hope that the simulation results will provide a valuable reference to help make better decisions in real applications. Certainly, one could always proceed to consider more models; however, as it stands, given the enormous amount of computation involved, this study covers a large number of combinations of distribution models, feature-selection methods, and classifiers.

2.3. General simulation set-up for synthetic data

The parameters and values selected to generate the various distributions for the synthetic data are shown in Table 1. Recall that the total number of features is D = 20,000, so for D′ = 20,000 the two ways of choosing the D′ features, i.e., optimal design and random design, are equivalent. Hence, there are a total of five combinations of D′ and observable feature set type, and altogether 180 different observable distributions for the synthetic data.

The simulation on synthetic data is conducted iteratively. In each iteration, a random sample of size N is drawn from a chosen distribution as the training data. The two-stage feature-selection scheme is then applied to the training data, first reducing the features to D′′ in the first stage, and then to final feature sets of every size from 1 up to a designated size d in the second stage. Classifiers are then constructed for the d feature sets based on the training data and tested on Nt independent sample points drawn from the same distribution. The simulation is repeated Itr times, and the results are averaged to obtain the average error rates. Besides the error rates, we also count the average numbers of markers found by the feature-selection schemes.
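One cell of this simulation can be sketched as follows, with sample() standing for a draw from one observable distribution (e.g., sample_synthetic above) and select() for a two-stage feature-selection scheme; a plain 1-NN classifier stands in for the classifiers of Table 2. All names are illustrative.

```python
import numpy as np

def nn_predict(Xtr, ytr, Xte):
    """1-NN prediction with Euclidean distance."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    return ytr[np.argmin(d2, axis=1)]

def run_cell(sample, select, N=60, Nt=100, Itr=100, d=20, seed=0):
    """Average error-vs-feature-size curve for one distribution cell."""
    rng = np.random.default_rng(seed)
    err = np.zeros(d)
    for _ in range(Itr):
        Xtr, ytr = sample(N, rng)          # training sample of size N
        Xte, yte = sample(Nt, rng)         # independent testing sample
        feats = select(Xtr, ytr)           # ranked features, length >= d
        for m in range(1, d + 1):
            pred = nn_predict(Xtr[:, feats[:m]], ytr, Xte[:, feats[:m]])
            err[m - 1] += np.mean(pred != yte)
    return err / Itr                       # curve as in Figs. 4-9
```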


Table 2
General set-ups for the simulation of synthetic data

Classifier | First stage       | D′′  | Second stage             | d  | Training sample size | Testing sample size | Iterations
LDA        | t-test, ReliefF   | 1000 | SFS, SFFS, no selection  | 30 | 60, 120, 180         | 500                 | 500
NN         | t-test, ReliefF   | 100  | SFS, no selection        | 20 | 60, 120, 180         | 1000                | 100
LSVM       | t-test, ReliefF   | 100  | SFS, no selection        | 20 | 60, 120, 180         | 100                 | 300

Three classifiers are considered in our simulation study: NN, linear discriminant analysis (LDA), and linear support vector machine (LSVM). We have used NN instead of the more popular 3-NN classifier to achieve significant savings in simulation time. The branch-and-bound algorithm [28] is used to enable fast NN search. For LSVM, we use the code provided by LIBSVM 2.4 [29].

To improve feature-selection accuracy, within SFS and SFFS, bolstered error estimation [30] is chosen as the error-estimation criterion. Bolstered error estimation reduces the variance of the error estimate by applying a density kernel to the training data as the empirical distribution. This approach distinguishes points near the decision boundary from points far from it: for a point close to the boundary, a significant portion of its kernel mass falls on the other side of the boundary, so when the kernel masses are integrated, points close to the boundary contribute more to the error. In our study, we use semi-bolstered resubstitution error estimation for the NN classifier and bolstered resubstitution error estimation for the LDA and LSVM classifiers, for best performance [31].

To ensure that the simulation can be finished in an acceptable time frame, we have different simulation set-ups for the different classifiers. The general set-ups of the simulation are shown in Table 2. Owing to the overall size of the simulations and the long computation time for SFFS with NN and LSVM, SFFS is only applied to the LDA classifier (in fact, SFFS runs faster with LDA than SFS with either NN or LSVM). The target feature size d and iteration count Itr are larger for LDA because we can search up to 30 features with 500 iterations for LDA and still use less computation time than for NN or LSVM. The first-stage feature size D′′ and iteration count Itr are smallest for NN because it incurs the longest computation time. To obtain smoother curves, extra iterations are conducted on selected cases.
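Bolstered resubstitution can be approximated by Monte Carlo integration. The sketch below is our own illustration with a fixed spherical kernel width sigma, whereas in [30] the kernel width is estimated from the data, and semi-bolstered resubstitution skips the bolstering of incorrectly classified points.

```python
import numpy as np

def bolstered_resub(predict, X, y, sigma=0.1, n_mc=20, seed=0):
    """Monte-Carlo bolstered resubstitution error for a trained classifier."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    err = 0.0
    for i in range(n):
        # spread a spherical Gaussian kernel around each training point
        pts = X[i] + sigma * rng.standard_normal((n_mc, D))
        # the kernel mass on the wrong side of the boundary is the error mass
        err += np.mean(predict(pts) != y[i])
    return err / n
```

A function like this could serve as the error_estimate callback of the two-stage sketch in Section 2.1, with predict obtained by training the chosen classifier on the candidate feature subset.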

2.4. Simulation set-up for real data

In addition to the simulation study on synthetic data, we also conduct a set of experiments on real data. We have carefully chosen four real data sets for our comparison study. To ensure reliable error estimation, all data sets picked have sample sizes larger than 180. Here we give a brief description of these data sets:

• Multiple myeloma data set (MM data set): This data set has been obtained from a study on multiple myeloma (MM) and monoclonal gammopathy of undetermined significance (MGUS) [32,33]. The data contain four subtypes: MM (559 sample points), MGUS (44 sample points), smoldering MM (SMM, 12 sample points), and healthy donors with normal plasma cells (NPC, 22 sample points). The data were collected using Affymetrix's Human Genome U133 Plus 2.0 Array (Santa Clara, CA) and are publicly available at the NIH Gene Expression Omnibus (GEO) under accession numbers GSE5900 and GSE2658. This microarray chip contains 54,613 probe sets (features) covering all kinds of gene transcripts and variants. In our study, no further processing has been applied to the downloaded data. In general, MM is a cancer of the bone marrow characterized by the clonal expansion of malignant plasma cells, while MGUS represents a condition in which an insignificant amount of monoclonal paraprotein is detected. Although many genetic lesions in MM are also present in MGUS and its advanced phase SMM, only 1% of MGUS and SMM cases progress to MM every year. Hence, it is an open question whether there exists a molecular signature that can discriminate MM from the others. Here we have labeled the data into two classes: one containing the MM sample points and the other containing the MGUS, SMM, and NPC sample points (78 sample points). Since the number of MM patients is overwhelming and can have significant effects on the efficiency of feature selection and the accuracy of error estimation, we randomly selected 156 sample points from among the 559 MM sample points and paired them with the 78 sample points of MGUS/SMM/NPC. Hence, the total sample size is 234. As for the scenarios of different observable feature sets, we have formatted the data into the form of two other Affymetrix arrays: the Human HG-Focus Target Array and the Human Genome U133A 2.0 Array. The HG-Focus array represents slightly over 8700 probe sets, while the U133A 2.0 array comprises 22,215 probe sets. The probe sets of the three arrays are fully nested: all probe sets represented on the HG-Focus array are identically contained on the U133A 2.0 array, and all probe sets on the U133A 2.0 array are contained on the U133 Plus 2.0 array. In addition, the same controlling probe sets used for normalization and scaling are identically replicated on all three arrays to achieve "data equity" across arrays. Thus, one can extract the 22,215 probe sets from the U133 Plus 2.0 array to form a U133A-2.0-array-equivalent data set and be quite confident that it represents what one would get if genuine U133A 2.0 array chips were actually used. The same is true for the HG-Focus array. With this approach, the MM data set actually yields three data sets.

• Acute lymphoblastic leukemia data set (ALL data set): This data set has been obtained from a study on pediatric acute lymphoblastic leukemia (ALL) [34]. ALL is a complicated disease containing several subtypes. Data points are labeled into six subtypes: T-ALL (43 sample points), E2A-PBX1 (27 sample points), TEL-AML1 (79 sample points), BCR-ABL (15 sample points), MLL (20 sample points), and hyperdiploid with >50 chromosomes (64 sample points). The data were collected using Affymetrix's Human Genome HG_U95Av2 array (Santa Clara, CA) and are publicly available at http://www.stjuderesearch.org/data/ALL1. This microarray chip contains 12,000 probe sets (features). We removed the features for which less than 1% of the sample points have a present call or more than 10% of the sample points have missing values, which reduced the total feature size to 5077. The missing values were filled by averaging across all sample points.
For the comparison study, we labeled the data into two classes: one containing the T-ALL, E2A-PBX1, and TEL-AML1 subtypes (149 sample points), the other containing the BCR-ABL, MLL, and hyperdiploid >50 subtypes (99 sample points). Hence, the total sample size is 248. We constructed one observable feature set for the ALL data set.

• Drug and toxicant response on rats data set (drug response data set): This data set has been obtained from a study characterizing the gene-expression response to different drugs and toxicants in live rats [19]. Altogether, 22 drugs and toxicants were fed to male Sprague–Dawley rats for several durations, and up to 12 tissues were harvested.

[Fig. 4: error rate vs. feature size (up to 30) for t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, and ReliefF+SFS; rows: observable feature size D′ = 5000 with random design, D′ = 20,000, and D′ = 5000 with optimal design; columns: training sample sizes N = 60, 120, 180.]
Fig. 4. Synthetic data results of LDA classifier on different feature sizes and sample sizes: t-test/ReliefF and SFS/SFFS, redundant model type, equal and large variance, global marker Dgm = 20.

The treatments correspond to four categories: fibrates (36 sample points), statins (31 sample points), azoles (53 sample points), and toxicants (61 sample points). The data are publicly available at the NIH GEO under accession number GSE2187. The data were collected on cRNA microarray chips containing 8565 probes (features). We removed the features for which more than 10% of the sample points have missing values, which reduced the total feature size to 8491. The missing values were filled by averaging across all sample points. For the comparison study, we labeled the data into two classes: one containing the toxicants (61 sample points), the other containing the remaining three categories (120 sample points). Hence, the total sample size is 181. We constructed one observable feature set for the drug response data set.

• Acute myeloid leukemia data set (AML data set): This data set has been obtained from a study on the prognostic profiling of acute myeloid leukemia [35]. The data and the associated clinical information are publicly available at the NIH GEO under accession number GSE1159. The data were collected using Affymetrix's Human Genome U133A Array (Santa Clara, CA), which contains 22,215 probe sets (features). The missing values were filled by averaging across all sample points. For the comparison study, we labeled the data into two classes according to karyotype: one containing the normal-karyotype samples (116 sample points), the other containing the abnormal-karyotype samples (147 sample points). Hence, the total sample size is 263. Because the Human HG-Focus Target Array's probe sets are fully contained in the Human Genome U133A array used here, as with the MM data, we constructed another observable feature set for the AML data set. Hence, for the AML data set we have two observable data sets of different sizes.

[Fig. 5: error rate vs. feature size (up to 30) for the eight filter methods (information gain, twoing rule, sum minority, max minority, gini index, sum of variances, t-test, ReliefF); rows: D′ = 5000 with random design, D′ = 20,000, and D′ = 5000 with optimal design; columns: N = 60, 120, 180.]
Fig. 5. Synthetic data results of LDA classifier on different feature sizes and sample sizes: eight filter methods, redundant model type, equal and large variance, global marker Dgm = 20.

[Fig. 6: error rate vs. feature size (up to 30) for t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, and ReliefF+SFS; columns: N = 240, 300, 360.]
Fig. 6. Synthetic data results of LDA classifier on different sample sizes: t-test/ReliefF and SFS/SFFS, redundant model type, equal and large variance, global marker Dgm = 20, observable feature size D′ = 5000, random design.

[Fig. 7: error rate vs. feature size (up to 30) for t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, and ReliefF+SFS; rows: equal and large variance, unequal and large variance, equal and small variance, unequal and small variance; columns: Dgm = 0, 20, 40.]
Fig. 7. Synthetic data results of LDA classifier on different global marker sizes and variances: t-test/ReliefF and SFS/SFFS, redundant model type, training sample size N = 60, observable feature size D′ = 20,000. Note that the plots are in different scales.

With limited sample size, one cannot apply the same technique used in the synthetic data simulation. Instead, we choose a hold-out-based scheme.

In each iteration, we randomly draw N sample points from the 234 sample points as the training data. The same three classifiers, LDA, NN, and LSVM, are used. The obtained classifiers are tested on the remaining 234 − N sample points. The procedure is repeated and the average error rates are calculated.
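A minimal sketch of this hold-out scheme follows. The names are ours; design_and_predict stands for the whole classification rule, i.e., two-stage feature selection followed by classifier design on the training split.

```python
import numpy as np

def holdout_error(X, y, N, design_and_predict, Itr=100, seed=0):
    """Average hold-out error over Itr random N/(n - N) splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errs = []
    for _ in range(Itr):
        tr = rng.choice(n, size=N, replace=False)       # training indices
        te = np.setdiff1d(np.arange(n), tr)             # hold-out indices
        predict = design_and_predict(X[tr], y[tr])      # selection + classifier
        errs.append(np.mean(predict(X[te]) != y[te]))
    return float(np.mean(errs))
```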

[Fig. 8: error rate vs. feature size (up to 30) for t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, and ReliefF+SFS; rows: synergetic model type (equal and large variance; equal and small variance) and marginal model type (equal and large variance; equal and small variance); columns: Dgm = 0, 20, 40.]
Fig. 8. Synthetic data results of LDA classifier on different distribution models: t-test/ReliefF and SFS/SFFS, synergetic and marginal model types, training sample size N = 60, observable feature size D′ = 20,000. Note that the plots are in different scales.


Table 3
Number of markers found, part 1

LDA, redundant model type, equal and large variance, D′ = 20,000
                  Dgm = 0                      Dgm = 20
Sample size       60       120      180       60        120      180
t-test            NA/13    NA/28    NA/30     19/5.5    20/9.7   20/10
t-test + SFFS     NA/6.4   NA/16    NA/21     8.4/2.2   11/6.5   12/9.7
t-test + SFS      NA/5.5   NA/16    NA/21     7.9/1.9   11/6     12/8.8
ReliefF           NA/9.7   NA/21    NA/27     13/5      18/9.2   19/10
ReliefF + SFFS    NA/7.7   NA/18    NA/23     8.6/3.1   11/8.4   12/11
ReliefF + SFS     NA/7     NA/18    NA/24     8.4/2.6   11/7.5   12/11

LDA, synergetic model type, equal and large variance, D′ = 20,000
t-test            NA/6     NA/17    NA/26     9.3/4     13/11    13/15
t-test + SFFS     NA/2.8   NA/9.7   NA/15     4.6/1.4   6.2/4.8  6.2/8.9
t-test + SFS      NA/2.4   NA/9.3   NA/15     4.3/1.2   6.1/4    6.4/8
ReliefF           NA/3     NA/7.2   NA/11     5.6/2.1   8.4/4.6  9.9/7
ReliefF + SFFS    NA/2.9   NA/10    NA/15     4.8/1.7   6.3/5.8  6.3/10
ReliefF + SFS     NA/2.8   NA/10    NA/15     4.7/1.4   6.2/5.2  6.3/9.6

LDA, marginal model type, equal and large variance, D′ = 20,000
t-test            NA/3.5   NA/11    NA/16     3.9/2.8   4/8.7    4/13
t-test + SFFS     NA/1.5   NA/8     NA/13     2.9/0.72  4/3.8    4.1/7.9
t-test + SFS      NA/1.5   NA/7.6   NA/13     2.7/0.79  4.2/3.4  4.6/6.6
ReliefF           NA/1.7   NA/4.1   NA/6.2    3.1/1.2   3.9/3    4/4.9
ReliefF + SFFS    NA/2.1   NA/7.6   NA/12     3.2/1.1   4/4.6    4.1/7.9
ReliefF + SFS     NA/1.9   NA/7.4   NA/12     3.1/1     4.3/4    4.6/7.2

The results are averaged based on the feature sets of size d = 30. In each cell, the left number denotes the average number of global markers found (not applicable for the Dgm = 0 cases) and the right number the average number of heterogeneous markers found.

Table 4
Number of markers found, part 2

LDA, redundant model type, equal and large variance, D′ = 20,000
                  Dgm = 0                      Dgm = 20
Sample size       60       120      180       60        120      180
Max minority      NA/6.7   NA/19    NA/27     15/3.4    20/7.5   20/9.6
Sum minority      NA/9     NA/24    NA/29     17/4.2    20/8.9   20/9.9
Sum of variance   NA/9.7   NA/26    NA/30     17/4.5    20/9.4   20/10
Information gain  NA/9.8   NA/25    NA/30     17/4.8    20/9.4   20/10
Gini index        NA/10    NA/26    NA/30     17/4.7    20/9.3   20/10
Twoing            NA/9.8   NA/26    NA/30     17/4.5    20/9.2   20/10

LDA, synergetic model type, equal and large variance, D′ = 20,000
Max minority      NA/2.8   NA/9.1   NA/16     6.7/2.1   11/6     12/10
Sum minority      NA/3.6   NA/12    NA/21     7.4/2.6   11/8     13/13
Sum of variance   NA/4.2   NA/14    NA/22     7.7/2.9   11/8.7   13/14
Information gain  NA/4.1   NA/13    NA/21     7.4/2.8   11/8.6   12/14
Gini index        NA/4.2   NA/14    NA/22     7.6/2.9   11/9     13/14
Twoing            NA/4.2   NA/14    NA/22     7.5/2.9   11/9.1   13/14

LDA, marginal model type, equal and large variance, D′ = 20,000
Max minority      NA/1.6   NA/5.6   NA/10     3.3/1.2   4/4.2    4/7.7
Sum minority      NA/2.2   NA/7.8   NA/13     3.6/1.6   4/6      4/11
Sum of variance   NA/2.6   NA/8.9   NA/15     3.7/1.8   4/6.9    4/12
Information gain  NA/2.5   NA/8.8   NA/14     3.6/1.9   4/7      4/11
Gini index        NA/2.5   NA/9     NA/15     3.7/1.9   4/7.1    4/12
Twoing            NA/2.5   NA/9     NA/15     3.7/1.8   4/7.2    4/12

The results are averaged based on the feature sets of size d = 30. In each cell, the left number denotes the average number of global markers found (not applicable for the Dgm = 0 cases) and the right number the average number of heterogeneous markers found.

We have tested three different training sample sizes, N = 60, 90, and 120. Other aspects, such as the feature-selection schemes, iteration count, first-stage feature size D′′, and target feature size d, are identical to those used in the synthetic data simulation. Because the true markers are unavailable, only the error rates are reported in the real data study.

3. Results and discussion

Although exact performance varies from case to case, there are general trends in the simulation study of both synthetic and real data:

• Filter methods have very similar error trends, although exact performances differ.

• Filter methods, like the t-test, perform better than or comparably to SFS and SFFS on harder cases, but this performance is generally accompanied by significant peaking.
• For sufficiently large samples, SFS and SFFS perform better when used in two-stage feature selection.
• ReliefF, the classifier-independent multivariate filter method, usually performs worse than the t-test and most other univariate filter methods, but ReliefF-based SFS and SFFS show performance comparable to their t-test-based counterparts.

3.1. Results on synthetic data

Altogether, 180 different distributions were used in our simulation study and over 500 figures were generated.

[Fig. 9: error rate vs. feature size (up to 20) for t-test, t-test+SFS, ReliefF, and ReliefF+SFS; rows: NN classifier and LSVM classifier; columns: N = 60, 120, 180.]
Fig. 9. Synthetic data results of NN and LSVM classifiers on different sample sizes and feature sizes: t-test/ReliefF and SFS, redundant model type, equal and large variance, global marker Dgm = 20, observable feature size D′ = 5000, random design. Note that the NN and LSVM plots are in different scales.

For clarity, we include here only a few plots showing the results. We have noticed that the major trends are quite consistent across all classifiers and different distribution models. Therefore, in this section we focus on the results of one classifier, LDA, with brief comments on the other classifiers at the end. The complete results are provided on the companion website.

Figs. 4 and 5 show the performance of the LDA classifier at different feature sizes and sample sizes for the redundant model type. For clarity, the results for t-test/ReliefF and SFS/SFFS are shown in Fig. 4, while the results of all eight filter methods (again including the t-test and ReliefF) are shown in Fig. 5. In each figure, the plots in the first row are based on the (D′ = 5000, random) design; they have the smallest observable feature size, D′ = 5000, and the smallest marker size, about 25. In the second row, the percentage of markers is kept the same; however, since all D = 20,000 features of the full distribution are available, there are 100 markers. The plots of the third row again have D′ = 5000 observable features, with all 100 markers included. Intuitively, feature selection starts from the hardest problem at the top and moves to easier ones at the bottom. A similar arrangement holds for the columns: the left column has the smallest training sample size, N = 60, while the right column has the largest, N = 180. Thus, from left to right, with increasing sample size, feature selection should be more reliable.

The results show that the easier the problem becomes, the better the more elaborate two-stage feature-selection methods work. On the other hand, on harder problems, although the t-test shows significant peaking, it outperforms SFS and SFFS at its best performance. When the sample size is small, N = 60, the t-test outperforms the other feature-selection methods in yielding the best performance. A codicil to this observation is that, for the (D′ = 5000, optimal) design, the curves for SFFS and SFS have not flattened, so it is possible that the

performances of SFFS and SFS surpass the best of the t-test at larger feature sizes; however, one should keep in mind that selecting more features requires considerably longer search time. Although the t-test has the best performance, it also peaks very early and its performance then drops quickly. When the sample size is large, N = 180, the curves of SFFS and SFS outperform both the t-test and ReliefF from very small feature sizes in almost all cases. For N = 180, only in the hardest case, the (D′ = 5000, random) design, does the t-test outperform the others.

Although based on a different selection criterion, ReliefF has trends similar to the t-test, but with poorer performance in almost all cases. In comparison, ReliefF-based SFFS and SFS have slightly better performance than the two-stage methods based on the t-test. This suggests that ReliefF finds more markers among the D′′ = 1000 features passed on by the first-stage feature selection for SFFS and SFS, although the ranking provided by ReliefF is not as good as that of the t-test. As for the other six filter methods, their trends are very close to those of the t-test and ReliefF. Among all methods, information gain and the t-test have relatively better performance across all cases, while ReliefF, max minority, and sum minority are the worst. Similar trends are observed in all other synthetic data studies (results not shown here), although the exact differences in error can vary. So in the remainder of this section we focus on the t-test and ReliefF as representatives of the filter methods, and refer interested readers to our companion website for the full results.

To further verify the effects of sample size, we focus on the first row of Fig. 4, the (D′ = 5000, random) design, where the t-test still outperforms the other methods at N = 180. We extend the simulation to include larger sample sizes: 240, 300, and 360. The results are shown in Fig. 6, where one sees that, as the sample size grows, SFFS and SFS gradually narrow the gap between their curves and the curve of the t-test, and outperform the t-test at small feature sizes. Similar

[Fig. 10: error rate vs. feature size (up to 30) for t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, and ReliefF+SFS; rows: U133 Plus 2.0 array feature set, U133A 2.0 array feature set, HG-Focus array feature set; columns: N = 60, 120.]
Fig. 10. Real data results: MM data set, LDA classifier, t-test/ReliefF and SFS/SFFS methods.

simulations are conducted on the (D′ = 5000, random) design for the other distribution models and classifiers, and consistent trends are observed. The full results are provided on the companion website.

Fig. 7 shows the performance of the LDA classifier on the redundant distribution model type, but with different numbers of global markers and both large and small variances. The training sample size is fixed at N = 60 and the observable feature size is D′ = 20,000. The figure shows that the presence of global markers makes a significant difference in the trends of the error curves. For Dgm = 20 and 40, the curves are very similar. For large variance and fewer global markers, the t-test experiences significant peaking while outperforming SFS and SFFS. With more global markers and smaller variance, SFFS and SFS have better performance and quickly surpass the t-test, which meanwhile shows less peaking. ReliefF again has trends similar to the t-test, but never performs better.

For the case where Dgm = 0 in the distribution model, the decision boundary becomes more nonlinear and all methods have

poor error rates at small sample sizes. In this case, without the presence of any global marker, the t-test and ReliefF experience no significant peaking, especially in the small-variance cases, and are outperformed by SFFS and SFS. Again, ReliefF-based SFFS and SFS have the best performance.

Moving to the synergetic and marginal model types, we expected that the difficulty intentionally built in for the univariate methods would lead to relatively poorer performance of the t-test and perhaps give SFS and SFFS an edge. As can be seen from the left column of Fig. 8, the performance of the t-test does drop significantly; however, the performance of SFS and SFFS drops even more. Although in both the synergetic and marginal model types there are markers that can, and should, be picked up by methods like SFFS and SFS, these wrapper algorithms cannot always take advantage of this feature structure at such small sample sizes. Only when the variance is small do SFFS and SFS start to outperform the t-test and ReliefF.


[Fig. 11 here. Panels: ALL data set (LDA classifier), drug response data set (LSVM classifier), AML data set with U133A array feature set (NN classifier), and AML data set with HG-Focus array feature set (NN classifier); left column N = 60, right column N = 120; axes: error rate vs. feature size; curves: t-test, t-test+SFFS, t-test+SFS, ReliefF, ReliefF+SFFS, ReliefF+SFS.]

Fig. 11. Real data results: t-test/ReliefF and SFS/SFFS, ALL data set, drug response data set, and AML data set. Note that plots of different data sets use different scales.

As shown in the middle and right columns of Fig. 8, for cases where global markers are available, the relationship between the curves is almost the same as in the redundant model type; differences lie in details such as the error rates and the peaking point. Although the t-test still outperforms the others, it suffers more severe peaking. Even in the small-variance model, SFFS and SFS cannot exploit the extra information in the correlated features to surpass the t-test, indicating that they need larger samples in these cases.


Fig. 7 also shows the corresponding unequal-variance cases. These are very similar to the equal-variance cases, and this similarity is consistent throughout the experiments.

We have also counted the average number of markers found. Selected results are shown in Tables 3 and 4, with full results on the companion website. These results match the error-rate results well. In general, the t-test finds the most markers and is closely followed by the other univariate filter methods except max minority. Even in the cases where SFS and SFFS outperform the t-test, the t-test still finds more markers. This indicates that the t-test is the most efficient at finding individual features that are truly correlated with the class label; however, as a univariate selection method, it ranks the features by individual scores and fails to take advantage of any interaction between features. In contrast, SFS and SFFS have the merit of finding potentially intricate interactions among features, as long as their results have not been corrupted by a significant number of false positives.

The simulations using other classifiers show similar general trends. Fig. 9 shows some typical results for the NN and LSVM classifiers. No SFFS-based simulation is carried out for these two classifiers owing to the long simulation time. The distribution models are the same as in Fig. 4, although only the (D = 5000, random) design is shown here. For both classifiers, the peaking phenomenon is not as severe as in the LDA results. Instead, in most cases the improvement in performance stops rather abruptly at a certain feature size, and the curve becomes almost flat afterwards. Nevertheless, when this happens, the t-test has relatively better performance, as the general trends indicate. When the distribution becomes easier for classification, the results of SFS quickly catch up with those of the t-test and ReliefF, although we lack a quantitative measure of distributional complexity and severity of peaking. Again, ReliefF-based SFS performs slightly better than t-test-based SFS in most cases, whereas ReliefF itself closely follows the trends of the t-test with much worse performance in most cases.

3.2. Results on real data

As with the synthetic data results, we can include only a few plots here and refer readers to our companion website for the full results. Most results bear a remarkable resemblance to the typical synthetic-data results, especially in the cases where global markers are available.

The simulation results for the MM data set on different array sizes for the LDA classifier are shown in Fig. 10. Only the t-test/ReliefF and SFS/SFFS methods are compared here; the results of the other filter methods, as in Fig. 5, are very close to those of the t-test and ReliefF and are omitted. One can see clear peaking at small feature sizes, and the t-test has better performance when N = 60. When N increases to 120, the peaking becomes less significant for the t-test and ReliefF, and SFFS and SFS catch up. Although ReliefF itself has inferior performance, ReliefF-based SFS performs very close to t-test-based SFS. As for the effects of different array sizes, there is no significant improvement or degradation in classification performance; however, we do notice that for the smallest array, the HG-Focus array, the performances of SFFS and SFS still lag behind at larger sample sizes.
This could indicate that the U133 Plus 2.0 and U133A 2.0 arrays bring in enough extra markers for SFFS and SFS to take advantage of. Selected results for the other three data sets are shown in Fig. 11: LDA results for the ALL data set, LSVM results for the drug response data set, and NN results for the AML data set. Even though the three data sets present three different scenarios, the general trends still hold.
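Since ReliefF recurs throughout these comparisons, it may help to recall its criterion with a stripped-down two-class sketch of the basic Relief idea [24]. This is an illustration only; the ReliefF variant [25] used in our experiments additionally averages over k nearest hits and misses, which this sketch omits.

    import numpy as np

    def relief_weights(X, y, n_iter=100, seed=0):
        # Reward features that separate a sample from its nearest
        # opposite-class neighbor (miss) and penalize features that
        # differ from its nearest same-class neighbor (hit).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0) + 1e-12  # per-feature scale
        w = np.zeros(d)
        for _ in range(n_iter):
            i = rng.integers(n)
            dist = np.abs(X - X[i]).sum(axis=1)  # L1 distances to sample i
            dist[i] = np.inf                     # exclude the sample itself
            hit = np.argmin(np.where(y == y[i], dist, np.inf))
            miss = np.argmin(np.where(y != y[i], dist, np.inf))
            w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
        return w / n_iter

Features are then ranked by descending weight, just as the t-test ranks by descending |t|, which is why the two scorers can be slotted interchangeably into the filter stage of the two-stage methods.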


The ALL data set shows trends similar to the MM data set. At N = 60, the t-test shows moderate peaking, and SFS and SFFS are only slightly behind; their performance catches up at N = 120. In the drug response data set, no method shows an obvious sign of peaking. SFS is already slightly better than the t-test at the small sample size and increases its advantage at the large sample size. In both the ALL and drug response data sets, ReliefF is worse than the t-test, but ReliefF-based SFS is almost indistinguishable from its t-test counterpart. The AML data set looks very similar to the drug response data set, except that SFS is worse at the small sample size; again, the gap narrows when the sample size grows to N = 120. In this case, ReliefF performs better than the t-test, and the same holds for the corresponding SFS method. As with the MM data set, we find that the improvement of SFS on the smaller chip, the HG-Focus array, is less significant when the sample size increases.

4. Conclusion

The explosion in high-dimensional data has encouraged an explosion in the number of proposed feature-selection algorithms. For the most part, the conditions under which these algorithms perform successfully have not been determined and, in particular, their performance in high-dimensional small-sample settings, precisely the environment for which they purportedly have been proposed, has not been validated. On the contrary, there is strong evidence that feature-selection algorithms cannot be expected to find close-to-optimal feature sets in such settings, nor does their failure to do so say anything about the existence of good-performing feature sets [36]. This paper proposes feature-label distribution models in which the number of available features is of the same order as that faced in real-world experiments and in which the relations among the features reflect real-world situations. Under these models, we can observe trends in the behavior of feature-selection algorithms relative to specific model conditions, such as sample size and the numbers of global and heterogeneous markers. If one is going to propose a feature-selection algorithm for extremely high-dimensional settings, then it is scientifically incumbent that the algorithm's performance be characterized in relevant settings. Obviously, our distribution models are not restricted to the methods compared here and can be used to investigate the performance of other feature-selection methods and classification schemes. Moreover, the models can be straightforwardly extended to multi-class distributions. With regard to these issues, we plan further simulations in the future, thereby enhancing the information available to the research community.

Acknowledgments

We would like to acknowledge the support of the National Science Foundation (CCF-0634794 and CCF-0514644). We would also like to thank Edward Suh and James Lowey of the Translational Genomics Research Institute for providing high-performance computing support, which was critical in successfully completing the large simulations.

References

[1] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, E. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.
[2] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, I. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, V. Sondak, N. Hayward, J. Trent, Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406 (2000) 536–540.
[3] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, S.H. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536.
[4] G. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Trans. Inf. Theory 14 (1) (1968) 55–63.
[5] G.V. Trunk, A problem of dimensionality: a simple example, IEEE Trans. Pattern Anal. Mach. Intell. 1 (3) (1979) 306–307.
[6] I.J. Raudys, Determination of optimal dimensionality in statistical pattern classification, Pattern Recognition 11 (1979) 263–270.
[7] A.K. Jain, W.G. Waller, On the optimal number of features in the classification of multivariate Gaussian data, Pattern Recognition 10 (1978) 365–374.
[8] J. Hua, Z. Xiong, J. Lowey, E. Suh, E. Dougherty, Optimal number of features as a function of sample size for various classification rules, Bioinformatics 21 (8) (2005) 1509–1515.
[9] J. Hua, Z. Xiong, E. Dougherty, Determination of the optimal number of features for quadratic discriminant analysis via the normal approximation to the discriminant distribution, Pattern Recognition 38 (3) (2005) 403–421.
[10] C.T. Le, Introductory Biostatistics, Wiley, Hoboken, NJ, 2003.
[11] P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection, Pattern Recognition Lett. 15 (1994) 1119–1125.
[12] F. Li, Y. Yang, Analysis of recursive gene selection approaches from microarray data, Bioinformatics 21 (19) (2005) 3741–3747.
[13] A.K. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 153–158.
[14] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (19) (2007) 2507–2517.
[15] M. Kudo, J. Sklansky, Classifier-independent feature selection for two-stage feature selection, in: A. Amin, D. Dori, P. Pudil, H. Freeman (Eds.), Advances in Pattern Recognition, Lecture Notes in Computer Science, vol. 1451, Springer, Berlin, 1998, pp. 548–554.
[16] M. Kudo, J. Sklansky, Comparison of algorithms that select features for pattern classifiers, Pattern Recognition 33 (2000) 24–41.
[17] T. Li, C. Zhang, M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20 (15) (2004) 2429–2437.
[18] J.W. Lee, J.B. Lee, M. Park, S.H. Song, An extensive comparison of recent classification tools applied to microarray data, Comput. Stat. Data Anal. 48 (4) (2005) 869–885.
[19] G. Natsoulis, L. El Ghaoui, G.R.G. Lanckriet, A.M. Tolley, F. Leroy, S. Dunlea, B.P. Eynon, C.I. Pearson, S. Tugendreich, K. Jarnagin, Classification of a large microarray data set: algorithm comparison and analysis of drug signatures, Genome Res. 15 (5) (2005) 724–736.
[20] P. Silva, R. Hashimoto, S. Kim, J. Barrera, L. Brandao, E. Suh, E. Dougherty, Feature selection algorithms to find strong genes, Pattern Recognition Lett. 26 (10) (2005) 1444–1453.
[21] I.B. Jeffery, D.G. Higgins, A.C. Culhane, Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data, BMC Bioinformatics 7 (2006) 359.
[22] B. Hanczar, J. Hua, E.R. Dougherty, Decorrelation of the true and estimated classifier errors in high-dimensional settings, EURASIP J. Bioinformatics Syst. Biol. (2007) 38473.
[23] Y. Su, T.M. Murali, V. Pavlovic, M. Schaffer, S. Kasif, RankGene: identification of diagnostic genes based on expression data, Bioinformatics 19 (12) (2003) 1578–1579.
[24] K. Kira, L. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Proceedings of the Tenth National Conference on Artificial Intelligence, MIT Press, 1992, pp. 129–134.
[25] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in: Machine Learning: ECML-94, Lecture Notes in Computer Science, vol. 784, Springer, Berlin, Heidelberg, 1994, pp. 171–182.
[26] E. Dougherty, M. Brun, On the number of close-to-optimal feature sets, Cancer Inf. 2 (2006) 189–196.
[27] O. Carlborg, C.S. Haley, Epistasis: too often neglected in complex trait studies?, Nat. Rev. Genet. 5 (8) (2004) 618–625.
[28] K. Fukunaga, P. Narendra, A branch and bound algorithm for computing k-nearest neighbors, IEEE Trans. Comput. C-24 (1976) 750–753.
[29] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[30] U. Braga-Neto, E. Dougherty, Bolstered error estimation, Pattern Recognition 37 (2004) 1267–1281.
[31] C. Sima, S. Attoor, U. Braga-Neto, J. Lowey, E. Suh, E. Dougherty, Impact of error estimation on feature-selection algorithms, Pattern Recognition 38 (12) (2005) 2472–2482.
[32] F. Zhan, Y. Huang, S. Colla, J. Stewart, I. Hanamura, S. Gupta, J. Epstein, S. Yaccoby, J. Sawyer, B. Burington, K. Hollmig, M. Pineda-Roman, G. Tricot, F. van Rhee, R. Walker, M. Zangari, J. Crowley, B. Barlogie, J. Shaughnessy, The molecular classification of multiple myeloma, Blood 108 (2006) 2020–2028.
[33] F. Zhan, B. Barlogie, V. Arzoumanian, Y. Huang, D. Williams, K. Hollmig, M. Pineda-Roman, G. Tricot, F. van Rhee, M. Zangari, M. Dhodapkar, J. Shaughnessy, Gene-expression signature of benign monoclonal gammopathy evident in multiple myeloma is linked to good prognosis, Blood 109 (2007) 1692–1700.
[34] E.-J. Yeoh, M.E. Ross, S.A. Shurtleff, W.K. Williams, D. Patel, R. Mahfouz, F.G. Behm, S.C. Raimondi, M.V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui, W.E. Evans, C. Naeve, L. Wong, J.R. Downing, Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell 1 (2) (2002) 133–143.
[35] P.J.M. Valk, R.G.W. Verhaak, M.A. Beijen, C.A.J. Erpelinck, S. Barjesteh van Waalwijk van Doorn-Khosrovani, J.M. Boer, H.B. Beverloo, M.J. Moorhouse, P.J. van der Spek, B. Lowenberg, R. Delwel, Prognostically useful gene-expression profiles in acute myeloid leukemia, N. Engl. J. Med. 350 (16) (2004) 1617–1628.
[36] C. Sima, E. Dougherty, What should be expected from feature selection in small-sample settings, Bioinformatics 22 (19) (2006) 2430–2436.

About the Author—JIANPING HUA received the B.S. and M.S. degrees in Electrical Engineering from Tsinghua University, Beijing, China, in 1998 and 2000, respectively, and the Ph.D. degree in Electrical Engineering from Texas A&M University in 2004. Currently, he is an associate investigator in the Computational Biology Division at the Translational Genomics Research Institute (TGen) in Phoenix, AZ. His main research interests lie in bioinformatics, genomic signal processing, signal and image processing, image and video coding, and statistical pattern recognition.

About the Author—WAIBHAV TEMBE obtained his Ph.D. in Computer Science and Engineering from the University of Cincinnati, Cincinnati, OH, USA in 2004. His areas of research are statistical pattern recognition, scientific computing, and data-mining algorithms as applied to various fields, such as computational biology and natural language processing. In addition, he specializes in the application of high-performance parallel computing to computationally intensive algorithms and data analysis. Currently, he is a Senior Scientific Programmer at the Translational Genomics Research Institute (TGen), Phoenix, AZ, USA.

About the Author—EDWARD R. DOUGHERTY is a Professor in the Department of Electrical and Computer Engineering at Texas A&M University in College Station, TX, where he holds the Robert M. Kennedy Chair and is Director of the Genomic Signal Processing Laboratory. He is also the Director of the Computational Biology Division of the Translational Genomics Research Institute in Phoenix, AZ. He holds a Ph.D. in Mathematics from Rutgers University and an M.S. in Computer Science from Stevens Institute of Technology, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology in Finland. He is a fellow of the International Society for Optical Engineering (SPIE), has received the SPIE President's Award, and served as the editor of the SPIE/IS&T Journal of Electronic Imaging. At Texas A&M he has received the Association of Former Students Distinguished Achievement Award in Research, been named a Fellow of the Texas Engineering Experiment Station, and been named Halliburton Professor of the Dwight Look College of Engineering. Prof. Dougherty is the author of 14 books, editor of five others, and author of more than 200 journal papers. He has contributed extensively to the statistical design of nonlinear operators for image processing and the consequent application of pattern recognition theory to nonlinear image processing. His research in genomic signal processing is aimed at diagnosis and prognosis based on genetic signatures and at using gene regulatory networks to develop therapies based on the disruption or mitigation of aberrant gene function contributing to the pathology of a disease.