

Biomarker discovery using 1-norm regularization for multiclass earthworm microarray gene expression data

Xiaofei Nan (a), Nan Wang (b), Ping Gong (c), Chaoyang Zhang (b), Yixin Chen (a), Dawn Wilkins (a)

(a) Department of Computer and Information Science, University of Mississippi, University, MS 38677, USA
(b) School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA
(c) SpecPro Inc., Environmental Services, Vicksburg, MS 39180, USA

Neurocomputing 92 (2012) 36–43

Available online 22 February 2012

Abstract

Novel biomarkers can be discovered by mining high-dimensional microarray datasets with machine learning techniques. Here we propose a novel recursive gene selection method which handles the multiclass setting effectively and efficiently. The selection is performed iteratively: in each iteration, a linear multiclass classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero, and those zero-weight features are eliminated in the next iteration. The empirical results demonstrate that the selected features (genes) have very competitive discriminative power, and the selection process has a fast rate of convergence.

Keywords: Biomarker discovery; 1-norm regularization; Multiclass classification; Microarray gene expression data

1. Introduction

Discovery of novel biomarkers is one of the most important impetuses driving many biological studies, including biomedical research. In this post-genomics era, many high-throughput technologies such as microarrays have been applied to measure biological systems ranging from cells to tissues to whole animals. In the last decade, environmental scientists, particularly ecotoxicologists, have increasingly applied omics technologies in the hunt for biomarkers that display both high sensitivity and high specificity. However, it remains a major challenge to sift through high-dimensional data sets in search of biomarker candidates that meet high standards and withstand experimental validation. Previously, we developed an integrated statistical and machine learning (ISML) pipeline to analyze a multiclass earthworm gene expression microarray dataset [15]. As a continuation of this biomarker discovery effort, we develop here a new feature selection method based on 1-norm regularization.

In machine learning, feature selection is the technique of seeking the most representative subset of features. It is a focus of research in applications whose datasets have very large numbers of variables, e.g., text processing and gene expression data analysis. When applied to gene expression array analysis, the technique detects the influential genes by which biological researchers can discriminate normal instances from abnormal ones and, therefore, facilitate further biological research or judgments.


We focus only on supervised learning in this paper, which means that a label is given for each instance; unsupervised and semi-supervised approaches can be found elsewhere in the literature [25,16,22].

Feature selection algorithms roughly fall into two categories, variable ranking (or feature ranking) and variable subset selection [9]. The latter is essentially divided into wrapper, filter, and embedded methods. Variable ranking acts as a preprocessing step or an auxiliary selection mechanism because of its simplicity and scalability. It ranks individual features by a metric, e.g., correlation, and eliminates features that do not exceed a given threshold. Variable ranking is computationally efficient because it only computes feature scores. Nevertheless, this method focuses only on the predictive power of individual features and is prone to selecting redundant features. Variable subset selection, on the other hand, attempts to select subsets of features that, jointly, produce good prediction performance. Filter methods use a feature subset relevance criterion to yield a reduced subset of features which may be used for future prediction. Wrapper methods [13] search through the feature subset space; each subset is applied to a certain machine learning model and assessed by the learning performance, so the learning models act as black boxes. Embedded approaches [14] implement feature selection within the process of learning. While wrapper methods search the space of all feature subsets, the searching step in embedded methods is guided by the learning algorithm. This guidance can be obtained by estimating the changes in the objective function caused by adding or removing features.


For example, Guyon et al. [10] proposed the Support Vector Machine Recursive Feature Elimination (SVM-RFE) algorithm, which recursively classifies the instances with an SVM classifier and eliminates the feature(s) with the least weight(s). The number of features to be eliminated in each iteration is ad hoc, and there is no firm conclusion about when to terminate the recursive steps.

Most feature selection methods in the literature are designed for binary problems. When they are extended to real-life multiclass tasks, combining several binary classifiers is typically suggested, for example one-versus-all and one-versus-one [23]. For a situation with k classes, one-versus-all constructs k binary classifiers, each of which is trained with all the instances of a certain class as positive examples and all other instances as negative examples. It is computationally expensive and produces highly imbalanced data for each binary classifier. The one-versus-one method, on the other hand, constructs k(k-1)/2 binary classifiers for all pairs of classes, and an instance is assigned to the class with the majority vote. Similar to the one-versus-all approach, the one-versus-one approach carries a heavy computational burden. Platt et al. [20] proposed the directed acyclic graph SVM (DAGSVM) algorithm, whose training phase is the same as one-versus-one in that it solves k(k-1)/2 binary problems; however, DAGSVM uses a rooted acyclic graph to make a decision from the k(k-1)/2 prediction results. Some researchers have proposed methods that solve multiclass tasks in one step, building a piecewise separation of the k classes in a single optimization. This idea is comparable to the one-versus-all approach in that it constructs k classifiers, each separating one class from the remaining classes, but all classifiers are obtained by solving a single optimization problem. Weston and Watkins [27] proposed a formulation of the SVM that enables a multiclass problem, but solving the multiclass problem in one step results in a much larger optimization problem. Crammer and Singer [5] decomposed the dual problem into multiple optimization problems of reduced size and solved them with a fixed-point algorithm. A comparison of different methods for multiclass SVM was given by Hsu and Lin [12].

A multiclass optimization cost function typically comprises two parts, the empirical error and the model complexity. The model complexity is usually approximated by a regularizer, e.g., the 2-norm or the 1-norm [3]. The use of the 1-norm has been advocated in many applications, such as multi-instance learning [4], ranking [19] and boosting [6], because of its sparsity-favoring property. Several studies have discussed the multiclass problem based on 1-norm regularization and various loss functions for the empirical error. For example, Friedman et al. [8] introduced the 1-norm into multinomial logistic regression, which is capable of handling multiclass classification problems. Bi et al. [2] chose the ε-insensitive loss function. Liu and Shen [17] defined a specific loss function, the ψ-loss, that replaces the convex SVM loss function by a nonconvex function. Other works mainly used the hinge loss with different variations [24,26]. In this paper, the hinge loss function we apply is similar to that in [27], but it has not been used in any previous 1-norm multiclass work. Feature selection under the framework of 1-norm multiclass regularization is achieved by discarding the least significant features, i.e., features with zero weights.
The sparsity of the weights is determined by a regularization parameter that controls the trade-off between the empirical error and the model complexity. However, the selection of a proper regularization parameter is a challenging problem. We only know in which direction to tune the parameter to make the number of selected features smaller or larger, but it is difficult to associate a parameter value with a particular feature subset while at the same time achieving high learning performance, unless the entire regularization path is computed. Because the 1-norm is non-differentiable (and so is the hinge loss), calculating the exact regularization path is difficult (some other loss functions, such as the logistic loss, have well-defined gradients).


Even though the regularization path is piecewise linear, path-following methods are slow for large-scale problems. Instead of computing an approximate regularization path, we introduce an iterative 1-norm multiclass feature selection method that selects a small number of features with high performance. In this paper, we propose a multiclass 1-norm regularization feature selection method, L1MR (Linear 1-norm Multiclass Regularization), and a simple variation, SL1MR, each of which solves a single linear program. An iterative feature elimination framework is proposed to obtain a minimal feature subset. The sparsity-favoring property of 1-norm regularization enables fast convergence of the iterative feature elimination process; in our empirical studies, the algorithm typically converges in no more than ten iterations.

The remainder of the paper is organized as follows. Section 2 proposes the 1-norm multiclass regularization. Section 3 describes the iterative feature elimination process. Section 4 demonstrates the experimental results. Conclusions are presented in Section 5, along with a discussion of future work.

2. Learning a multiclass linear classifier via 1-norm regularization

Consider a set of l instances (X, Y) drawn from an unknown fixed distribution, where X \in R^n is the earthworm microarray gene expression data and the output Y is the class label. In a k-category classification task, y is coded as {1, ..., k}. For the earthworm data studied in this article, k = 3 (control, TNT, RDX), n = 869 and l = 248. Given k linear decision functions f_1, ..., f_k, where f_c corresponds to class c, each decision function is defined as

f_c(x) = w_c^T x + b_c, \quad c = 1, \ldots, k,

where the parameters are w_c = [w_{c,1}, \ldots, w_{c,n}]^T \in R^n and b_c \in R. We consider the winner-takes-all classification rule

F(x) = \arg\max_c f_c(x),

which assigns input x to the class F(x) with the highest decision value. If the instances are separable, there exist decision functions that satisfy

f_{y_i}(x_i) \ge f_c(x_i), \quad c \ne y_i.

The above inequality is equivalent to

(w_{y_i} - w_c) \cdot x_i + (b_{y_i} - b_c) \ge 1, \quad c \ne y_i.

To handle non-separable cases, we introduce slack variables into the model, i.e.,

(w_{y_i} - w_c) \cdot x_i + (b_{y_i} - b_c) \ge 1 - \xi_{y_i,c}, \quad c \ne y_i,    (1)

where \xi_{y_i,c} \ge 0 is a slack variable.

The learning of the classifier can be formulated as an optimization problem: we seek f that minimizes the sum of the empirical error and the model complexity. The empirical error can be computed as the proportion of non-separable instances, i.e., errors on the training data, or approximated using various loss functions, such as the hinge loss or the logistic loss. These loss functions were originally defined for binary classes, and there are different ways of extending them to the multiclass case. For example, [26] utilized the hinge loss \sum_{c \ne y_i} [f_c(x_i) + 1]_+, where (\cdot)_+ \equiv \max(\cdot, 0). In this paper, we apply the hinge loss

\sum_{c \ne y_i} [1 - (f_{y_i}(x_i) - f_c(x_i))]_+,

which was introduced by Weston and Watkins [27] and has not previously been used in the multiclass setting with a 1-norm penalty; it is equivalent to the slack variable defined in (1).
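To make the decision rule and this loss concrete, the following minimal sketch (an illustrative Python example, not part of the authors' Matlab implementation; the function names are our own) computes the winner-takes-all prediction F(x) and the Weston–Watkins hinge loss for a single labelled instance, assuming a weight matrix W whose rows are the vectors w_c and a bias vector b.

```python
import numpy as np

def predict(W, b, x):
    """Winner-takes-all rule: F(x) = argmax_c f_c(x), with f_c(x) = w_c . x + b_c.
    W is a (k, n) matrix whose c-th row is w_c; b is a length-k bias vector."""
    return int(np.argmax(W @ x + b))

def ww_hinge_loss(W, b, x, y):
    """Weston-Watkins hinge loss for one instance (x, y):
    sum over c != y of [1 - (f_y(x) - f_c(x))]_+ ."""
    f = W @ x + b
    margins = 1.0 - (f[y] - f)      # one term per class c
    margins[y] = 0.0                # the c == y term is excluded from the sum
    return float(np.sum(np.maximum(margins, 0.0)))
```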

Model complexity is approximated using the 1-norm. Zhu et al. [29] argued that 1-norm regularization yields sparse results, i.e., only a small number of the feature weights w_{c,j} are non-zero, hence facilitating adaptive feature selection. For the purpose of feature selection, we propose the 1-norm multiclass regularization, L1MR, as

\min_{w_c, b_c, \xi_{y_i,c}} \;\; \lambda \sum_{c=1}^{k} \|w_c\|_1 + \sum_{i=1}^{l} \sum_{c=1, c \ne y_i}^{k} \xi_{y_i,c}
\text{s.t.} \;\; (w_{y_i} - w_c) \cdot x_i + (b_{y_i} - b_c) \ge 1 - \xi_{y_i,c},
\;\; \xi_{y_i,c} \ge 0, \quad i = 1, \ldots, l, \; c = 1, \ldots, k, \; c \ne y_i,    (2)

where \sum_{c=1}^{k} \|w_c\|_1 = \sum_{c=1}^{k} \sum_{j=1}^{n} |w_{c,j}| is the 1-norm penalty, \sum_{i=1}^{l} \sum_{c=1, c \ne y_i}^{k} \xi_{y_i,c} can be viewed as an upper bound on the training error, and \lambda is a cost parameter that controls the trade-off between training accuracy and sparsity of the solution.

The optimization problem (2) can be turned into a linear program by rewriting w_{c,j} = u_{c,j} - v_{c,j}, where u_{c,j}, v_{c,j} \ge 0. If either u_{c,j} or v_{c,j} equals zero, then |w_{c,j}| = u_{c,j} + v_{c,j}. Therefore, (2) is equivalent to

\min_{u_c, v_c, b_c, \xi_{y_i,c}} \;\; \lambda \sum_{c=1}^{k} \sum_{j=1}^{n} (u_{c,j} + v_{c,j}) + \sum_{i=1}^{l} \sum_{c=1, c \ne y_i}^{k} \xi_{y_i,c}
\text{s.t.} \;\; (u_{y_i} - u_c - v_{y_i} + v_c) \cdot x_i + (b_{y_i} - b_c) \ge 1 - \xi_{y_i,c},
\;\; \xi_{y_i,c} \ge 0, \; u_{c,j} \ge 0, \; v_{c,j} \ge 0,
\;\; i = 1, \ldots, l, \; c = 1, \ldots, k, \; c \ne y_i.    (3)

The number of slack variables in L1MR is (k-1)l. Motivated by [5], we shrink the number of slack variables to l by introducing a new slack variable

\xi_i = \Big[ \max_{c \ne y_i} \big( f_c(x_i) + 1 - f_{y_i}(x_i) \big) \Big]_+.

Compared with the slack variables in (3), which capture the gaps between each pair of decision functions, the new slack variable considers the maximum gap. A simplified version of L1MR, SL1MR, is then formulated as

\min_{u_c, v_c, b_c, \xi_i} \;\; \lambda \sum_{c=1}^{k} \sum_{j=1}^{n} (u_{c,j} + v_{c,j}) + \sum_{i=1}^{l} \xi_i
\text{s.t.} \;\; (u_{y_i} - u_c - v_{y_i} + v_c) \cdot x_i + (b_{y_i} - b_c) \ge 1 - \xi_i,
\;\; \xi_i \ge 0, \; u_{c,j} \ge 0, \; v_{c,j} \ge 0,
\;\; i = 1, \ldots, l, \; c = 1, \ldots, k, \; c \ne y_i.    (4)

Experimental results for both L1MR and SL1MR are shown in Section 4.
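Before turning to the recursive elimination procedure, the sketch below shows how formulation (4) could be assembled and solved with an off-the-shelf LP solver. This is an illustrative Python/SciPy example only: the authors' implementation used Matlab and CPLEX, and the function name sl1mr_fit and the variable layout are our own assumptions.

```python
# Minimal sketch of the SL1MR linear program (4).
# Variable layout: z = [u (k*n), v (k*n), b (k), xi (l)].
import numpy as np
from scipy.optimize import linprog

def sl1mr_fit(X, y, lam):
    """X: (l, n) data matrix; y: (l,) labels in {0, ..., k-1}; lam: cost parameter."""
    l, n = X.shape
    k = int(y.max()) + 1
    n_uv = k * n                       # size of u (and of v)
    n_var = 2 * n_uv + k + l           # u, v, b, xi

    # Objective: lam * sum(u + v) + sum(xi); b has zero cost.
    c = np.zeros(n_var)
    c[:2 * n_uv] = lam
    c[2 * n_uv + k:] = 1.0

    # One inequality per (i, c) with c != y_i:
    # (u_{y_i} - u_c - v_{y_i} + v_c) . x_i + b_{y_i} - b_c + xi_i >= 1
    rows, b_ub = [], []
    for i in range(l):
        yi = int(y[i])
        for cc in range(k):
            if cc == yi:
                continue
            a = np.zeros(n_var)
            a[yi * n:(yi + 1) * n] = X[i]                    # u_{y_i}
            a[cc * n:(cc + 1) * n] = -X[i]                   # -u_c
            a[n_uv + yi * n:n_uv + (yi + 1) * n] = -X[i]     # -v_{y_i}
            a[n_uv + cc * n:n_uv + (cc + 1) * n] = X[i]      # +v_c
            a[2 * n_uv + yi] = 1.0                           # b_{y_i}
            a[2 * n_uv + cc] = -1.0                          # -b_c
            a[2 * n_uv + k + i] = 1.0                        # xi_i
            rows.append(-a)                                  # linprog solves A_ub @ z <= b_ub
            b_ub.append(-1.0)
    A_ub = np.vstack(rows)

    # u, v, xi >= 0; b unbounded.
    bounds = [(0, None)] * (2 * n_uv) + [(None, None)] * k + [(0, None)] * l
    res = linprog(c, A_ub=A_ub, b_ub=np.array(b_ub), bounds=bounds, method="highs")

    u = res.x[:n_uv].reshape(k, n)
    v = res.x[n_uv:2 * n_uv].reshape(k, n)
    W = u - v                                                # weight vectors w_c as rows
    b = res.x[2 * n_uv:2 * n_uv + k]
    return W, b
```

The features whose k weights in W are all numerically zero are exactly the candidates eliminated by the recursive procedure of Section 3.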

3. Recursive feature elimination via 1-norm regularization

It has frequently been observed that 1-norm regularization drives many feature weights to be exactly zero. This makes it a natural feature selection process, in which features with zero weight can be discarded without risk. In this paper, we say a feature j has zero weight if all k values w_{1,j}, w_{2,j}, ..., w_{k,j} in the k weight vectors are zero; otherwise, the feature has non-zero weight.

The purpose of feature selection is to choose a small subset of features while achieving good classification performance. Achieving these two goals simultaneously is difficult. Although the solutions of the above linear programs are in general sparse, the number of selected features can still be quite large, especially when the dimension of the original feature space is high. By the nature of the 1-norm penalty, increasing the value of the parameter λ in (3) and (4) reduces the number of non-zero-weight features. Fig. 1 shows the relationship between the number of features selected (non-zero-weight features) and the parameter λ on the earthworm dataset. It suggests that, by properly tuning the value of λ, different numbers of features can be selected.

Fig. 1. The relationship between λ and the number of non-zero-weight features (two panels, L1MR and SL1MR; x-axis: log(λ), from 10^-5 to 10^1; y-axis: number of non-zero-weight features).
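As a concrete illustration of this relationship, the short sketch below (Python, illustrative only; it reuses the hypothetical sl1mr_fit helper from the sketch in Section 2) counts the non-zero-weight features over a grid of λ values such as the 10^-5 to 10^1 range shown in Fig. 1.

```python
import numpy as np

def sweep_lambda(X, y, lambdas, tol=1e-8):
    """For each lambda, fit SL1MR and count features with any non-zero class weight."""
    counts = []
    for lam in lambdas:
        W, _ = sl1mr_fit(X, y, lam)                          # (k, n) weight matrix
        kept = int(np.any(np.abs(W) > tol, axis=0).sum())    # non-zero-weight features
        counts.append(kept)
    return counts

# Example grid matching the range of Fig. 1:
# lambdas = np.logspace(-5, 1, 13)
```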

Finding a "good" value of λ is a challenging problem. Increasing λ reduces the size of the selected feature subset but does not necessarily improve the learning performance of the associated model. An accurate solution could be obtained by computing the entire regularization path of L1MR or SL1MR. However, neither the hinge loss nor the 1-norm penalty has a well-defined gradient in the multiclass scenario [21], which makes computing the whole solution path extremely difficult. Inspired by the idea of RFE, we propose a recursive feature elimination approach based on 1-norm regularization. At each recursive step, an L1MR or SL1MR model is built on the features remaining from the previous step, and the zero-weight features are eliminated. The recursive process stops when no zero-weight feature is left, and the final feature subset is chosen as the one whose associated model has the highest learning accuracy among all the steps. Compared with other embedded feature selection methods, such as SVM-RFE, 1-norm regularized multinomial logistic regression and other single multiclass classifiers, the proposed feature selection method has the following properties:

- The recursive process narrows down the feature subset step by step, hence minimizing the number of features selected; this is more aggressive than 1-norm regularized multinomial logistic regression and other single multiclass classifiers.
- The number of features to be eliminated at each step is automatically determined, i.e., the features with zero weights. In SVM-RFE, by contrast, the number of features to be eliminated has to be predetermined, e.g., deleting the feature with the least weight at each step.
- The recursive process normally involves a smaller number of iterations than SVM-RFE. We have observed that the number of features eliminated at each step varies greatly: a dramatic reduction occurs in the first (few) step(s), and the reduction shrinks as the process approaches the optimal feature subset. Convergence is typically very fast.
- It has a natural stopping criterion: stop when the weights of all remaining features are non-zero.
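A minimal sketch of this recursive elimination loop is given below (illustrative Python; train_model and evaluate are placeholders for fitting L1MR or SL1MR with the inner parameter-selection loop described in the next paragraph and for the associated validation accuracy — they are not functions from the original implementation).

```python
import numpy as np

def recursive_elimination(X, y, train_model, evaluate, tol=1e-8):
    """Iteratively fit a 1-norm multiclass model and drop zero-weight features.
    train_model(X_sub, y) -> (k, d) weight matrix; evaluate(X_sub, y) -> accuracy.
    Returns the feature subset (column indices) whose model scored best."""
    features = np.arange(X.shape[1])                 # start with all features
    best_features, best_acc = features, -np.inf
    while True:
        W = train_model(X[:, features], y)
        acc = evaluate(X[:, features], y)
        if acc > best_acc:                           # keep the best-scoring subset seen so far
            best_features, best_acc = features, acc
        nonzero = np.any(np.abs(W) > tol, axis=0)    # feature kept if any class weight is non-zero
        if nonzero.all():                            # no zero-weight feature left: stop
            break
        features = features[nonzero]                 # eliminate zero-weight features
    return best_features, best_acc
```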

To obtain a stable variable selection result, the instances are resampled multiple times and we choose the variables that occur in a large portion of the multiple selection results. This idea is similar to stability selection [18] in the sense that repeated random sampling reduces the variance of the result. However, stability selection examines every regularization (cost) parameter under consideration, whereas our approach includes a parameter selection procedure. As emphasized in [28,1], feature selection steps external to the resampling procedure may severely bias the evaluation in favor of the learned method due to "information leak". In our method, the feature selection is therefore placed inside the resampling. Because L1MR and SL1MR include the tuning of a cost parameter, a double loop [23] is used: the inner loop selects the best parameter, and the outer loop estimates, on an independent test set, the performance of the learning model built with the previously found best parameter. The recursive feature selection procedure based on 1-norm regularization is described in detail as follows:

1. Resample the data set by k random splits (twenty random splits are used in our experiment).
2. Perform recursive feature selection within each split F_j:
   (a) At iteration i, build an L1MR or SL1MR model with the current feature set S_{i,F_j}. Initially, S_{1,F_j} consists of all the gene features. This sub-step includes an inner cross-validation loop (10-fold cross-validation, as recommended in [1]) or bootstrap to decide the best parameter, i.e., the one with which the learning model yields the best performance on the inner validation set.
   (b) Remove the zero-weight features from the old feature set to obtain a new one, S_{i+1,F_j}.
   (c) Set i = i + 1. Repeat from sub-step (a) until S_{i,F_j} does not change (in other words, there are no zero-weight features).
3. For each split, record the optimal feature subset S*_{F_j} with the maximum accuracy rate Acc*_{F_j}.
4. Aggregate the results from the different splits. In the outer loop, each split generates a different subset of features. These subsets usually overlap with each other significantly; nevertheless, we need to combine the results. In [28], Zhang et al. utilized a frequency-based selection method that ignores the ordering of features. Because each list S*_{F_j} can be viewed as a ranked list in which the gene features are sorted in descending order of importance, Borda's method is used in this paper for rank aggregation [7]. Suppose there are k ranked lists. Borda's method first assigns to each feature a k-dimensional vector of the positional scores of the feature in the k lists, and then sorts the features by the sum of these vectors. A major advantage of using Borda's method is its low computational complexity, which is linear in the size of the list. The method can also be extended to partial lists by apportioning the excess scores equally to all unranked features. The number of features selected is the floor of the average size of the selected subsets over the splits, i.e.,

N = \left\lfloor \frac{\sum_{j=1}^{k} |S_{F_j}|}{k} \right\rfloor,    (5)

with k = 20 in our experiment, where |S_{F_j}| is the size of the selected feature subset in split F_j. After Borda's method, the top N features are selected from the aggregated ranking list. The estimated accuracy is calculated as the average of the k split accuracies, i.e.,

\frac{\sum_{j=1}^{k} Acc^{*}_{F_j}}{k},    (6)

where Acc^{*}_{F_j} denotes the cross-validation accuracy within split F_j.
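As an illustration of step 4, the following sketch (Python, for exposition only; the helper names are our own and the partial-list extension is omitted) aggregates the per-split ranked lists with Borda's method and computes N as in Eq. (5).

```python
from collections import defaultdict

def borda_aggregate(ranked_lists, n_select):
    """ranked_lists: one list of feature ids per split, best feature first.
    Each feature receives a positional score (list length minus position) in every
    list that contains it; features are sorted by their total score."""
    scores = defaultdict(float)
    for lst in ranked_lists:
        m = len(lst)
        for pos, feat in enumerate(lst):
            scores[feat] += m - pos            # earlier positions earn higher scores
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:n_select]

def average_subset_size(ranked_lists):
    """N from Eq. (5): floor of the average subset size over the k splits."""
    return sum(len(lst) for lst in ranked_lists) // len(ranked_lists)

# Usage sketch: top_n = borda_aggregate(lists, average_subset_size(lists))
```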

4. Experimental results

In this section, we first present the microarray experiment design and then demonstrate the results.

4.1. Microarray experimental design

Adult earthworms (Eisenia fetida) were exposed in soil spiked with TNT (0, 6, 12, 24, 48, or 96 mg/kg) or RDX (8, 16, 32, 64, or 128 mg/kg) for 4 or 14 days. The 4-day treatment was repeated with the same TNT concentrations and with RDX concentrations of 2, 4, 8, 16, or 32 mg/kg soil. Each treatment originally had 10 replicate worms, with 8–10 survivors at the end of exposure, except for the two highest TNT concentrations: at 96 mg TNT/kg, no worms survived the original 4-day and 14-day exposures, whereas at 48 mg TNT/kg, all 10 worms died in the original 4-day exposure. Total RNA was isolated from the surviving worms as well as from the Day 0 worms (worms sampled immediately before the experiments). A total of 248 worm RNA samples (eight replicate worms for each of 31 treatments) were hybridized to a custom-designed oligo array using Agilent's one-colour Low RNA Input Linear Amplification Kit. The array contained 15,208 non-redundant 60-mer probes (GEO platform accession number GPL9420), each targeting a unique E. fetida transcript. After hybridization and scanning, gene expression data were acquired using Agilent's Feature Extraction Software (v.9.1.3). In the current study, the 248-array dataset was divided into three worm groups: 32 untreated controls, 96 TNT-treated, and 120 RDX-treated. The dataset has been deposited in NCBI's Gene Expression Omnibus and is accessible through GEO Series accession number GSE18495 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18495).

4.2. Results

We applied L1MR and SL1MR to the earthworm microarray gene expression data, which is categorized into three classes. L1MR and SL1MR were implemented in Matlab, the 1-norm multiclass regularization was transformed into a standard optimization problem format, and the resulting problem was solved by CPLEX (an optimization software package). Tables 1–4 show the detailed results of the L1MR and SL1MR methods. The number of iterations is less than ten in each split, which suggests fast convergence of our methods and greatly alleviates the computational burden posed by the traditional RFE method.


Table 1
Classification and feature selection results of L1MR (Part I).

Split      I1    I2    I3    I4    I5    I6    I7    I8    I9
Split 1    144   122   103   70    63    59    58    57    55
Split 2    53    50    –     –     –     –     –     –     –
Split 3    75    65    63    59    53    52    –     –     –
Split 4    69    63    61    60    56    54    52    –     –
Split 5    144   122   103   70    63    59    58    57    55
Split 6    63    59    57    –     –     –     –     –     –
Split 7    61    59    –     –     –     –     –     –     –
Split 8    78    –     –     –     –     –     –     –     –
Split 9    138   110   97    93    85    83    80    70    67
Split 10   65    58    50    49    –     –     –     –     –
Split 11   121   101   78    76    –     –     –     –     –
Split 12   133   130   112   79    63    59    58    57    55
Split 13   75    67    61    53    49    –     –     –     –
Split 14   144   75    63    59    52    –     –     –     –
Split 15   77    73    69    63    61    –     –     –     –
Split 16   137   117   103   97    83    70    67    –     –
Split 17   63    59    –     –     –     –     –     –     –
Split 18   67    63    61    –     –     –     –     –     –
Split 19   127   103   93    87    83    70    63    –     –
Split 20   69    67    63    59    57    –     –     –     –

Fsize: the number of features selected. I1–I9: from the first iteration to the ninth iteration.

Table 2
Classification and feature selection results of L1MR (Part II).

Split      I1 (%)  I2 (%)  I3 (%)  I4 (%)  I5 (%)  I6 (%)  I7 (%)  I8 (%)  I9 (%)
Split 1    74.0    74.0    86.0    90.0    78.0    90.0    86.0    86.0    86.0
Split 2    72.8    72.8    –       –       –       –       –       –       –
Split 3    85.3    89.5    77.0    77.0    81.2    81.2    –       –       –
Split 4    78.0    86.0    78.0    86.0    78.0    78.0    82.0    –       –
Split 5    74.0    74.0    86.0    90.0    78.0    90.0    86.0    86.0    86.0
Split 6    89.5    93.7    89.8    –       –       –       –       –       –
Split 7    82.0    86.0    –       –       –       –       –       –       –
Split 8    94.3    –       –       –       –       –       –       –       –
Split 9    86.0    98.0    94.0    94.0    94.0    94.0    98.0    98.0    98.0
Split 10   62.0    58.0    70.0    58.0    –       –       –       –       –
Split 11   77.5    77.5    77.7    77.7    –       –       –       –       –
Split 12   73.3    75.0    77.5    79.0    81.0    82.3    78.0    86.0    73.0
Split 13   74.0    72.8    78.0    89.5    83.0    –       –       –       –
Split 14   82.0    86.0    78.0    92.0    89.0    –       –       –       –
Split 15   79.2    82.0    91.5    78.0    82.0    –       –       –       –
Split 16   69.8    71.5    73.0    78.0    71.5    73.3    72.0    –       –
Split 17   92.1    89.0    –       –       –       –       –       –       –
Split 18   79.0    84.6    82.3    –       –       –       –       –       –
Split 19   81.0    82.1    79.5    77.0    82.1    79.0    79.0    –       –
Split 20   75.8    74.0    79.5    81.2    79.5    –       –       –       –

Accuracy: the classification accuracy. I1–I9: from the first iteration to the ninth iteration.

Table 3
Classification and feature selection results of SL1MR (Part I).

Split      I1    I2    I3    I4    I5    I6    I7    I8
Split 1    65    55    54    52    –     –     –     –
Split 2    53    50    –     –     –     –     –     –
Split 3    68    61    58    –     –     –     –     –
Split 4    54    50    –     –     –     –     –     –
Split 5    33    31    30    –     –     –     –     –
Split 6    51    50    –     –     –     –     –     –
Split 7    73    66    63    60    56    54    52    51
Split 8    85    76    73    72    66    64    62    61
Split 9    43    38    –     –     –     –     –     –
Split 10   45    41    39    –     –     –     –     –
Split 11   65    54    52    –     –     –     –     –
Split 12   55    54    50    –     –     –     –     –
Split 13   37    35    –     –     –     –     –     –
Split 14   44    41    –     –     –     –     –     –
Split 15   53    50    44    41    –     –     –     –
Split 16   78    75    73    66    63    56    –     –
Split 17   66    63    59    52    –     –     –     –
Split 18   33    30    –     –     –     –     –     –
Split 19   79    73    70    68    65    60    56    51
Split 20   47    41    37    –     –     –     –     –

Fsize: the number of features selected. I1–I8: from the first iteration to the eighth iteration.

Table 4
Classification and feature selection results of SL1MR (Part II).

Split      I1 (%)  I2 (%)  I3 (%)  I4 (%)  I5 (%)  I6 (%)  I7 (%)  I8 (%)
Split 1    94.0    86.0    90.0    90.0    –       –       –       –
Split 2    72.8    72.8    –       –       –       –       –       –
Split 3    85.3    85.3    85.3    –       –       –       –       –
Split 4    86.0    74.0    –       –       –       –       –       –
Split 5    90.0    94.0    98.0    –       –       –       –       –
Split 6    89.5    93.7    89.8    –       –       –       –       –
Split 7    90.0    90.0    98.0    90.0    90.0    90.0    86.0    78.0
Split 8    90.5    98.2    90.5    86.6    86.6    90.5    94.3    90.5
Split 9    94.0    94.0    –       –       –       –       –       –
Split 10   70.0    62.0    66.0    –       –       –       –       –
Split 11   92.1    90.0    92.1    –       –       –       –       –
Split 12   88.0    88.0    84.5    –       –       –       –       –
Split 13   90.0    89.8    –       –       –       –       –       –
Split 14   86.5    90.6    –       –       –       –       –       –
Split 15   77.8    75.4    78.0    77.8    –       –       –       –
Split 16   79.2    81.0    88.0    83.3    82.0    83.3    –       –
Split 17   94.5    94.5    92.0    92.0    –       –       –       –
Split 18   88.0    84.5    –       –       –       –       –       –
Split 19   86.0    85.5    89.8    90.0    88.5    89.8    86.0    88.5
Split 20   90.0    88.5    86.4    –       –       –       –       –

Accuracy: the classification accuracy. I1–I8: from the first iteration to the eighth iteration.

In all of our tests, the first iteration in both methods always achieves the largest feature elimination. Sometimes the process reaches an optimal solution after just one iteration. In most scenarios, the accuracy follows a bell-shaped curve along the selection steps.

We compared L1MR and SL1MR with two well-known statistical methods from the literature, correlation-based feature selection (CFS) and information gain feature selection (IG), and with two embedded multiclass feature selection methods, SVM-RFE and L1 regularized multinomial logistic regression. CFS [11] offers a heuristic evaluation of individual features for predicting the class labels, whereas IG measures the expected reduction in entropy. SVM-RFE [10] trains a multiclass SVM classifier by the one-versus-one method and eliminates the feature with the least weight one at a time. L1 regularized multinomial logistic regression can handle multiclass classification problems [8], and feature selection is implemented by eliminating zero-weight features.

SVM-RFE was coded in Matlab with libsvm, a free library for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). L1 regularized multinomial logistic regression was implemented with a free Matlab package offered by Mark Schmidt (http://www.cs.ubc.ca/~schmidtm/Software/L1General.html). We used the CFS and IG packages in Weka directly; CFS was combined with the BestFirst search algorithm and IG was associated with the Ranker search method. Most of the settings were the Weka defaults, except that the number of features to select in IG was set to 64 to match the results of L1MR and SL1MR. In order to make fair comparisons, the same 20 random splits were also used for CFS, IG, SVM-RFE, and L1 regularized multinomial logistic regression, with the data splits identical to those used in L1MR and SL1MR.


Because CFS and IG are pure feature selection approaches, an additional model-building step is needed to evaluate the learning performance on the selected feature subsets. An outline of the CFS and IG processes is given below:

1. Use the same setting to divide the data set into 20 splits.
2. For each split, run CFS and IG feature selection separately. For either CFS or IG, aggregate the 20 feature selection results by Borda's method. CFS selects 60 features after the aggregation step.
3. Generate a new data set by eliminating any feature that is not on the aggregation list.
4. Train a classifier by the L1MR method and estimate the classification accuracy. This learning step implements double-loop cross-validation as in the L1MR and SL1MR learning, with the same outer-loop data partition.

The comparison results are given in Table 5. Because of the 20 random splits, the average accuracy and standard deviation are reported. From here on, we use the abbreviation "L1-RMLR" for "L1 regularized multinomial logistic regression". Except for L1-RMLR, all approaches selected a small number of features. Although L1MR and SL1MR achieved higher average accuracies than all the comparison methods, the high standard deviations make it hard to claim the superiority of our methods. Therefore, we ran a t-test to compare the difference between our methods (L1MR or SL1MR) and each comparison method (L1-RMLR, SVM-RFE, CFS, or IG). For each pair, e.g. L1MR and SVM-RFE, the t value was computed based on their learning performances on the 20 random splits. These performance values are shown in Fig. 2.

Table 5
Comparison of L1MR, SL1MR, L1-RMLR, SVM-RFE, CFS, and IG.

Method     Fsize   Accuracy
L1MR       64      86.25% ± 7.58%
SL1MR      52      88.57% ± 7.56%
L1-RMLR    682     83.72% ± 5.35%
SVM-RFE    35      85.62% ± 5.58%
CFS        64      77.98% ± 5.22%
IG         64      78.16% ± 4.39%

Fsize and Accuracy as in Tables 1 and 2; accuracy is reported as mean ± standard deviation over the 20 random splits. CFS: correlation-based feature selection. IG: information gain feature selection.


The calculated t values are shown in Table 6. According to the t table with 2 × 20 − 2 = 38 degrees of freedom, the tabled value is 2.0244 for the threshold 0.05 and 2.7116 for the threshold 0.01. Our calculated values are mostly larger than the tabled value at threshold 0.01, except for the pair of L1MR and SVM-RFE. We therefore conclude that SL1MR outperformed all the comparison methods, and that L1MR achieved better performance than L1-RMLR, CFS and IG; the difference between L1MR and SVM-RFE is not statistically significant. In all, SL1MR shows the best feature selection result, selecting a small number of features and generating the highest classification accuracy.

Table 6
The t values for L1MR/SL1MR versus each comparison method.

Method   L1-RMLR   SVM-RFE   CFS       IG
L1MR     3.1724    0.7826    10.4234   10.5488
SL1MR    6.0317    3.6335    13.2457   13.4653

Table 7 lists the top 20 earthworm genes selected by SL1MR and the position ranks of these genes given by the other algorithms. Due to the page limit, the remaining 32 genes (ranked between 21st and 52nd by SL1MR) are not shown. Together, these 52 genes are identified as the classifier genes that are highly discriminative in separating the earthworm samples into the control, TNT-treated and RDX-treated categories. Among the top 20 genes, 19 genes (95%) have meaningful annotations with a wide range of biological functions spanning from detoxification (glutathione S-transferase pi and cadmium-metallothionein) to spermatogenesis (valosine containing peptide-2, or VCP-2) and signal transduction (signal peptides). VCP-2, a gene expressed specifically in the anterior segments of sexually mature earthworms [30], is identified by three algorithms and ranked in the top 10, suggesting that both TNT and RDX may affect spermatogenesis. Elongation factor 2, or EF2, a putative gene targeted by two probes (TA1-057226 and TA2-006089), is one of the proteins that facilitate translational elongation, the steps in protein synthesis from the formation of the first peptide bond to the formation of the last one [31]. This strongly indicates that protein synthesis may be targeted by both TNT and RDX. Nevertheless, more work should be devoted to exploring the biological functions and interactions of the identified top classifier genes that may lead, or be linked, to toxicological effects or biochemical endpoints.
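For reference, the t values in Table 6 correspond to a standard pooled-variance two-sample t-test over the 20 per-split accuracies (hence 2 × 20 − 2 = 38 degrees of freedom). The snippet below is an illustrative Python computation of such a statistic, not the authors' code; acc_a and acc_b stand for the 20 split accuracies of the two methods being compared.

```python
import numpy as np
from scipy import stats

def two_sample_t(acc_a, acc_b):
    """Pooled-variance (equal-variance) two-sample t statistic and p-value."""
    return stats.ttest_ind(np.asarray(acc_a), np.asarray(acc_b), equal_var=True)

# Usage sketch: t, p = two_sample_t(acc_l1mr_per_split, acc_svmrfe_per_split)
```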

Fig. 2. Learning accuracies for the 20 random splits using different feature selection methods (x-axis: the 20 random splits; y-axis: accuracy (%); curves: L1MR, SL1MR, L1-RMLR, SVM-RFE, CFS, IG).


Table 7
Top 20 genes selected by SL1MR. Columns give the rank in SL1MR, the probe name, the rank assigned by L1MR, L1-RMLR, SVM-RFE, CFS and IG, and the target gene annotation.

SL1MR  Probe name   L1MR  L1-RMLR  SVM-RFE  CFS  IG   Target gene annotation
1      TA1-153745   6     17       8        –    –    Succinate dehydrogenase Iron–sulfur protein
2      TA1-057226   2     4        5        23   2    Elongation factor-2
3      TA1-058194   4     9        7        –    –    Cytoplasmic
4      TA2-010658   9     2        4        –    –    40S ribosomal protein S8
5      TA2-095861   14    36       19       –    –    Tropomyosin
6      TA2-006089   3     1        6        –    35   Elongation factor 2
7      TA1-167854   13    57       26       13   –    Cytoplasmic
8      TA1-194525   8     23       3        –    –    Valosine containing peptide-2
9      TA1-056351   12    74       20       60   –    Unknown
10     TA1-166421   18    15       13       –    –    40S ribosomal protein S30
11     TA2-164073   17    39       9        –    –    Signal peptide
12     TA1-016697   23    55       16       –    –    Peptidase-C13
13     TA1-003758   19    76       –        –    –    MT-EISFO Cadmium-metallothionein
14     TA2-111531   25    130      –        –    –    Peptidase-C13
15     TA2-179329   11    49       –        –    –    Cytoplasmic
16     TA2-065306   5     13       33       –    –    Signal peptide
17     TA2-131920   28    61       15       –    –    Glutathione S-transferase pi
18     TA1-030037   1     3        1        7    –    Lumbricus rubellus mRNA for 40S ribosomal protein S10
19     TA2-026889   34    155      –        –    –    Rattus norvegicus similar to Finkel–Biskis–Reilly murine
20     TA2-113782   15    86       28       10   –    Signal peptide

5. Conclusion and future work

In this paper, we propose a novel multiclass joint feature selection and classification method, L1MR, and its simplified variation, SL1MR. Both methods formulate multiclass classification as an optimization problem using 1-norm regularization. Because the 1-norm penalty tends to yield sparse solutions, the proposed formulation has an embedded feature selection property. Combined with the idea of recursive feature selection, L1MR and SL1MR identify a small subset of discriminative features effectively and efficiently. The experimental results show that the iterative elimination process typically completes in fewer than ten iterations. Compared with the results obtained from the correlation-based feature selection and information gain methods, L1MR and SL1MR achieve better performance in terms of both the size of the selected feature subset and the subset's discriminative power. Compared with support vector machine recursive feature elimination, our methods have shorter running times and higher accuracy. As a continuation of this work, we would like to validate the biological significance of the identified genes, because our main purpose was to discover novel gene biomarkers sensitive and specific to environmental toxicants. In the future, we would also like to perform experiments comparing these algorithms on a wider range of data sets.

Acknowledgment

Xiaofei Nan, Nan Wang, Chaoyang Zhang, Yixin Chen, and Dawn Wilkins were supported in part by the US National Science Foundation under award number EPS-0903787. Ping Gong was supported by the U.S. Army Environmental Quality Technology Research Program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation. Permission to publish this information was granted by the U.S. Army Chief of Engineers.

References

[1] C. Ambroise, G.J. McLachlan, Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Natl. Acad. Sci. USA 99 (2002) 6562–6566.
[2] J. Bi, K.P. Bennett, M. Embrechts, C. Breneman, M. Song, Dimensionality reduction via sparse support vector machines, J. Mach. Learn. Res. 3 (2003) 1229–1243.
[3] O. Chapelle, S.S. Keerthi, Multi-class feature selection with support vector machines, in: Proceedings of the American Statistical Association, 2008.
[4] Y. Chen, J. Bi, J.Z. Wang, MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 1931–1946.
[5] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res. 2 (2001) 265–292.
[6] J. Duchi, Y. Singer, Boosting with structural sparsity, in: ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 297–304.
[7] C. Dwork, R. Kumar, M. Naor, D. Sivakumar, Rank aggregation methods for the web, in: Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 613–622.
[8] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, J. Statist. Software 33 (1) (2010) 1–22.
[9] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[10] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[11] M.A. Hall, Correlation-Based Feature Selection for Machine Learning, Ph.D. Thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1998.
[12] C. Hsu, C. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Networks 13 (2002) 415–425.
[13] R. Kohavi, G. John, Wrappers for feature selection, Artif. Intell. 97 (1–2) (1997) 273–324.
[14] T.N. Lal, O. Chapelle, J. Weston, A. Elisseeff, Embedded methods, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Feature Extraction, Foundations and Applications, Springer-Verlag, 2006, pp. 137–165.
[15] Y. Li, N. Wang, E.J. Perkins, C. Zhang, P. Gong, Identification and optimization of classifier genes from multi-class earthworm microarray dataset, PLoS One 5 (10) (2010) e13715.
[16] B. Liu, C. Wan, L. Wang, An efficient semi-unsupervised gene selection method via spectral biclustering, IEEE Trans. Nano-Biosci. 5 (2) (2006) 110–114.
[17] Y. Liu, X. Shen, Multicategory psi-learning, J. Am. Statist. Assoc. 101 (474) (2006) 500–509.
[18] N. Meinshausen, P. Bühlmann, Stability selection, J. R. Statist. Soc. Ser. B 72 (2010) 417–473.
[19] X. Nan, Y. Chen, X. Dang, D. Wilkins, Learning to rank using 1-norm regularization and convex hull reduction, in: Proceedings of ACM Southeast 2010, 2010.


[20] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: Advances in Neural Information Processing Systems, 2000, pp. 547–553.
[21] S. Rosset, Tracing curved regularized optimization solution paths, in: Advances in Neural Information Processing Systems, 2004.
[22] S. Salem, L. Jack, A. Nandi, Investigation of self-organizing oscillator networks for use in clustering microarray data, IEEE Trans. Nano-Biosci. 7 (2008) 65–79.
[23] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, S. Levy, A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics 21 (5) (2006) 631–643.
[24] S. Szedmak, J. Shawe-Taylor, C.J. Saunders, D.R. Hardoon, Multiclass classification by L1 norm support vector machine, in: Pattern Recognition and Machine Learning in Computer Vision Workshop, Grenoble, France, May 2004.
[25] V.S. Tseng, C.P. Kao, Efficiently mining gene expression data via a novel parameterless clustering method, IEEE/ACM Trans. Comput. Biol. Bioinf. 2 (2005) 355–365.
[26] L. Wang, X. Shen, Multi-category support vector machines, feature selection and solution path, Statist. Sin. 16 (2006) 617–633.
[27] J. Weston, C. Watkins, Multi-class support vector machines, in: Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), Bruges, 1999.
[28] X. Zhang, X. Lu, Q. Shi, X. Xu, H.E. Leung, L.N. Harris, J.D. Iglehart, A. Miron, J.S. Liu, W.H. Wong, Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data, BMC Bioinf. 7 (2006) 197–210.
[29] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-Norm Support Vector Machines, Technical Report, Stanford University, Stanford.
[30] T. Suzuki, M. Honda, S. Matsumoto, S. Sturzenbaum, S. Gamou, Valosine-containing proteins (VCP) in an annelid: identification of a novel spermatogenesis related factor, Gene 362 (2005) 11–18.
[31] R. Jorgensen, A. Merrill, G.R. Andersen, The life and death of translation elongation factor 2, Biochem. Soc. Trans. 34 (2006) 1–6.

Xiaofei Nan is a Ph.D. student in Computer and Information Science at the University of Mississippi. Her main research interests are machine learning and artificial intelligence. She received her Bachelor's degree in Automatic Control and her Master's degree in Pattern Recognition and Intelligent Systems from Northeastern University, China.

Nan Wang is an Assistant Professor in the School of Computing at the University of Southern Mississippi. Her research in computational biology focuses on several areas: (1) microarray data analysis, including biomarker identification, generation of reference gene regulatory networks for non-model organisms, and discovery of changes in pathways under different chemical conditions; (2) glycan classification, with the purpose of identifying the glycan molecules and the structural features in these glycans that determine canine influenza infection; (3) development of new methods for microRNA identification based on genome sequence; and (4) biological database construction, with a primary goal of developing a platform for biologists in the EPSCoR community.


Ping Gong is a senior Army contract scientist. He received his Ph.D. in environmental toxicology and has applied bioinformatics, genomics, molecular and computational biology to decode toxicological mechanisms and discover novel biomarkers. His current interests include toxicant effects on mRNA/microRNA expression and copy number variation, systems/synthetic biology, genome re-sequencing and assembly, gene regulatory networks inference, and bioinformatic/computational tool development.

Chaoyang Zhang joined the Department of Computer Science at the University of Southern Mississippi as an assistant professor in 2003. He received his Ph.D. from Louisiana Tech University in 2001 and was a research assistant professor in the Department of Computer Science at the University of Vermont from 2001 to 2003. Currently he is the Director of the School of Computing. His research interests include high-performance computing (parallel, distributed, and grid computing applications and algorithms), computational biology and bioinformatics (microarray data analysis, classification, and gene network reconstruction), information technology (Web-based information retrieval, machine learning, and data mining), imaging and visualization (3D image reconstruction, information visualization, and inverse problems), and data analysis and modeling.

Yixin Chen received B.S. degree (1995) from the Department of Automation, Beijing Polytechnic University, the M.S. degree (1998) in control theory and application from Tsinghua University, and the M.S. (1999) and Ph.D. (2001) degrees in electrical engineering from the University of Wyoming. In 2003, he received the Ph.D. degree in computer science from The Pennsylvania State University. He had been an Assistant Professor of computer science at University of New Orleans. He is now an Associate Professor at the Department of Computer and Information Science, the University of Mississippi. His research interests include machine learning, data mining, computer vision, bioinformatics, and robotics and control. Dr. Chen is a member of the ACM, the IEEE, the IEEE Computer Society, the IEEE Neural Networks Society, and the IEEE Robotics and Automation Society.

Dawn Wilkins received Bachelors and Masters degrees in Mathematical Systems from Sangamon State University (now University of Illinois–Springfield) in 1981 and 1983, respectively. She received the Ph.D. in Computer Science from Vanderbilt University in 1995. Currently she is an Associate Professor in the Computer and Information Science department at the University of Mississippi, where she has been a faculty member since 1995. Her primary research interests are in the areas of Machine Learning, Computational Biology, Bioinformatics and Database Systems.