A Two-Stage Feature Selection Method for Gene Expression Data

OMICS: A Journal of Integrative Biology, Volume 13, Number 2, 2009. © Mary Ann Liebert, Inc. DOI: 10.1089/omi.2008.0083

A Two-Stage Feature Selection Method for Gene Expression Data

Li-Yeh Chuang,1 Chao-Hsuan Ke,2 Hsueh-Wei Chang,3,4,5 and Cheng-Hong Yang2

Abstract

Microarray data referencing gene expression profiles provide valuable answers to a variety of problems and contribute to advances in clinical medicine. Gene expression data typically have a high dimension and a small sample size, and generally only a relatively small number of the profiled genes are strongly correlated with a certain phenotype. To analyze gene expression profiles correctly, feature (gene) selection is therefore crucial for classification. Feature (gene) selection has certain advantages, such as effective extraction of the genes that influence classification accuracy, elimination of irrelevant genes, and improvement of classification accuracy. In this paper, we propose a two-stage feature selection method, which uses information gain to implement a gene-ranking process and combines an improved particle swarm optimization with the K-nearest neighbor method and support vector machine classifiers to calculate the classification accuracy. The experimental results show that the proposed method can effectively select relevant gene subsets and achieves higher classification accuracy than previous studies.

Introduction

DNA microarray technology allows for the simultaneous monitoring and measurement of thousands of gene expression activation levels in a single experiment, and is universally used in medical diagnosis and genetic analysis. Many microarray analysis research projects focus on clustering analysis and classification accuracy (Famili et al., 2004; Statnikov et al., 2005). In clustering analysis studies, the purpose of clustering is to group genes whose expression data are correlated, and to provide insight into gene–gene interactions and gene functions (Wang et al., 2005b). In classification accuracy studies, the purpose of classification is to discriminate between classes of samples and to predict the relative importance of each gene for sample classification (Lee and Lee, 2003). The classification task when using gene expression data lies in differentiating between diseased tissue samples and normal samples, or in classifying tissue samples into different classes of diseases. However, gene expression data typically have certain characteristics, one of them being that the number of features (genes) greatly exceeds the number of instances (tissue samples). This poses a major problem when gene expression data have to be classified. In general, only a relatively small number of gene expression data show a strong correlation with a certain phenotype compared to the total number of genes investigated, which means that of the thousands of genes investigated, only a small number show significant correlation with the phenotype in question.

Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process. Several methods for data reduction, or specifically for feature selection, exist in the context of microarray data analysis; these can be classified into two major groups: filter and wrapper approaches (Kohavi and John, 1997). The filter approach processes the data before the classification step and calculates feature weight values, so that features that better represent the original data set can be identified; however, the filter approach does not account for interactions among the features. Methods in this category include the t-test, information gain (IG) (Quinlan, 1986), mutual information (MI) (Dudoit et al., 2002), entropy-based methods (Wang et al., 2005a), and the chi-squared (χ²) method (Li et al., 2004). The wrapper approach depends on the addition or deletion of features to compose subset features, and uses an evaluation function together with a learning algorithm to estimate the feature subsets. This kind of approach is similar to using optimization algorithms to search for optimal solutions in a high-dimensional space. The wrapper approach usually conducts a search for a good subset using an optimization algorithm, and then employs a classification algorithm to evaluate the subset.

1Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, Republic of China.
2Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China.
3Faculty of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Taiwan, Republic of China.
4Graduate Institute of Natural Products, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China.
5Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China.


There are many commonly used optimizing algorithms, such as genetic algorithms (GA) (Raymer et al., 2000), particle swarm optimization (PSO) (Huang and Dun, 2008), and tabu search (Zhang and Sun, 2002).

Particle swarm optimization is a population-based stochastic optimization technique developed by Kennedy and Eberhart in 1995. PSO simulates the social behavior of organisms, such as birds in a flock or fish in a school, to describe an automatically evolving system. In PSO, each single candidate solution can be considered "an individual bird of the flock," that is, a particle in the search space. Each particle makes use of its own memory and of knowledge gained by the swarm as a whole to find the best (optimal) solution. All of the particles have fitness values, which are evaluated by an optimized fitness function, and velocities that direct their movement. During movement, each particle adjusts its position according to its own experience and according to the experience of a neighboring particle, thus making use of the best position encountered by itself and its neighbor. The particles move through the problem space by following the current optimum particles. The process is reiterated a predefined number of times or until a minimum error is achieved. PSO was originally developed to solve continuous-valued, real-number problems. In order to deal with discrete problems, the authors proposed a discrete PSO in 1997 (Kennedy and Eberhart, 1997), which they called binary particle swarm optimization (BPSO). BPSO is similar to other evolutionary algorithms that are capable of parallel search (Huang and Dun, 2008).

In this paper, we propose an improved binary particle swarm optimization (IBPSO), which avoids the commonly encountered problem of particles getting trapped in a local optimum associated with BPSO. We propose a two-stage feature selection approach to select useful gene subsets for classification of the gene expression data. The first stage calculates the information gain value of each feature and selects features that are suited to differentiating the different classes. In the second stage, IBPSO is used to select features from the first stage again and to evaluate the influence of the selected features on classification accuracy by using both the K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. The experimental results show that the application of the two-stage feature selection approach results in fewer features being selected and higher classification accuracy compared to other results published in the literature.

Methods

Information gain

Quinlan (1986) proposed a classification algorithm called ID3, which introduced the concept of information gain. Information gain is simply the reduction of entropy of a classification based on the observation of a particular variable, and is used in machine learning by decision trees to calculate the significance of attributes. Each feature obtains an information gain value on the basis of which it is either selected or deleted. Therefore, a threshold value for selecting features must be established first; a feature is selected when its information gain value is bigger than the threshold value, otherwise it is deleted. Let S be the set of n instances, and let C be the set of k classes. Let P(C_i, S) be the fraction of the examples in S that have class C_i. Then, the expected information from this class membership is given by:

Info(S) = -\sum_{i=1}^{k} P(C_i, S) \log P(C_i, S)    (1)

If a particular attribute A has v distinct values, the expected information required for the decision tree with A as the root is the weighted sum of the expected information of the subsets of A according to the distinct values. Let S_i be the set of instances whose value of attribute A is A_i:

Info_A(S) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} Info(S_i)    (2)

Then, the difference between Info(S) and Info_A(S) gives the information gained by partitioning S according to a test on A:

Gain(A) = Info(S) - Info_A(S)    (3)

If the information gain is high, the chance of obtaining pure classes in the target is also high when splitting on the variable with the highest gain.
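To make the ranking step concrete, the following Python sketch (our illustration, not the authors' code; the toy data and function names are invented) computes Gain(A) for one discretized gene against the class labels:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Info(S): expected information of the class distribution in `labels`."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels):
    """Gain(A) = Info(S) - Info_A(S) for one (discretized) feature."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Toy example: one gene discretized into "low"/"high" expression for six samples.
gene = ["low", "low", "high", "high", "high", "low"]
cls  = ["ALL", "ALL", "AML", "AML", "ALL", "ALL"]
print(information_gain(gene, cls))  # genes scoring above the threshold are kept
```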

Particle swarm optimization

Continuous PSO. PSO is a population-based stochastic optimization technique. PSO is initialized with a population of random solutions and searches for an optimal solution by updating generations. In PSO, a potential solution is called a particle. Each particle makes use of its own memory and of knowledge gained by the swarm as a whole to find the best (optimal) solution in a d-dimensional search space. A velocity is attributed to each particle, which directs its movement. Each particle can be represented as x_i = (x_{i1}, x_{i2}, ..., x_{id}), where d is the dimension number. The velocity of the ith particle can be written as v_i = (v_{i1}, v_{i2}, ..., v_{id}) and is limited by Vmax, a user-defined variable. The optimal previous position of the ith particle (the position giving the best fitness value) is recorded and represented by p_i = (p_{i1}, p_{i2}, ..., p_{id}), a value called pBest_i. The best pBest_i over the whole swarm is g = (g_1, g_2, ..., g_d) and is called gBest. At each iteration, a particle is updated according to the following equations:

v_{id}^{new} = w \cdot v_{id}^{old} + c_1 \cdot rand_1 \cdot (pBest_{id} - x_{id}^{old}) + c_2 \cdot rand_2 \cdot (gBest_d - x_{id}^{old})    (4)

x_{id}^{new} = x_{id}^{old} + v_{id}^{new}    (5)

In these equations, w is the inertia weight, c_1 and c_2 are acceleration (learning) factors, and rand_1 and rand_2 are random numbers. The velocities v_{id}^{new} and v_{id}^{old} are those of the new and old particle, respectively, x_{id}^{old} is the current particle position (solution), and x_{id}^{new} is the updated particle position.
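As a concrete reading of Equations (4) and (5), the sketch below (our own; the swarm data and parameter values are illustrative) performs one velocity and position update with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.8, c1=2.0, c2=2.0, v_max=6.0):
    """One continuous PSO update following Eqs. (4) and (5)."""
    r1 = rng.random(x.shape)          # rand1, drawn per particle and dimension
    r2 = rng.random(x.shape)          # rand2
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v_new = np.clip(v_new, -v_max, v_max)   # velocities limited by Vmax
    x_new = x + v_new
    return x_new, v_new

# Five particles in a three-dimensional search space.
x = rng.random((5, 3)); v = np.zeros((5, 3))
pbest = x.copy(); gbest = x[0].copy()
x, v = pso_step(x, v, pbest, gbest)
```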

BPSO. Although PSO was originally introduced as an optimization technique for real-number problems, many optimization problems are set in a space featuring discrete or qualitative distinctions between variables. In 1997, Kennedy and Eberhart introduced BPSO, which can be applied to discrete binary variables. Two characteristics distinguish BPSO from the original PSO. First, each particle is composed of binary variables, each deciding on "yes" or "no," "true" or "false," that is, {1} or {0}. Second, the velocity is transformed into a change of probability, namely the chance of the binary variable taking the value 1. This probability must lie within the range [0.0, 1.0]; in order to map the real-valued velocity into this range, the sigmoid function is used:

S(v_{pd}^{new}) = \frac{1}{1 + e^{-v_{pd}^{new}}}    (6)

if (rand() < S(v_{pd}^{new})) then x_{pd}^{new} = 1; else x_{pd}^{new} = 0    (7)

Here, S(v_{pd}^{new}) denotes the probability that bit x_{pd}^{new} takes the value 1, and rand() is a random number drawn from a uniform distribution over [0.0, 1.0]. To avoid S(v_{pd}^{new}) approaching 0 or 1, a constant Vmax is used to limit v_{pd}^{new}, with the range of the maximum velocity being [-Vmax, Vmax].

IBPSO. In BPSO, each particle adjusts its position based on the two fitness values pBest and gBest. Entrapment in a local optimum can to some extent be avoided by fine-tuning the inertia weight. pBest is a local fitness value, whereas gBest constitutes a global fitness value. However, if the gBest value is itself trapped in a local optimum, each particle will limit its search to the same local area, preventing the swarm from searching for potentially better solutions in other regions of the search space. This behavior severely limits the usefulness of the classification results. To avoid this, we propose a method that resets gBest under such circumstances and call it IBPSO. By resetting gBest, entrapment in a local optimum can be avoided, and superior classification results can be achieved with a reduced number of selected genes.
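A compressed sketch of the BPSO bit update and the IBPSO reset follows. The code is our illustration: the paper does not specify the exact trigger or reset value for gBest, so the stagnation counter and the reset to an empty solution below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    """Eq. (6): map a real-valued velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def bpso_step(x, v, pbest, gbest, w=0.8, c1=2.0, c2=2.0, v_max=6.0):
    """Binary PSO update: Eq. (4) for the velocity, Eqs. (6)-(7) for the bits."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = np.clip(w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x), -v_max, v_max)
    x = (rng.random(x.shape) < sigmoid(v)).astype(int)   # Eq. (7)
    return x, v

def ibpso_reset(gbest, gbest_fitness, stale_iters, limit=3):
    """IBPSO-style reset: if gBest has stagnated, retire it so the swarm can
    search other regions (the stagnation limit and reset value are illustrative)."""
    if stale_iters >= limit:
        return np.zeros_like(gbest), -np.inf, 0   # forget gBest and its fitness
    return gbest, gbest_fitness, stale_iters

# Thirty particles over 100 candidate genes (placeholder sizes).
x = rng.integers(0, 2, size=(30, 100))
v = np.zeros(x.shape); pbest = x.copy(); gbest = x[0].copy()
x, v = bpso_step(x, v, pbest, gbest)
```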

Classification algorithm

K-nearest neighbor. The K-nearest neighbor (KNN) method was first introduced by Fix and Hodges (1951), and is one of the most popular nonparametric methods. The purpose of the algorithm is to classify a new object based on attributes and training samples. KNN is a supervised learning algorithm in which the result of a new instance query is classified based on the majority of the K-nearest-neighbor categories. The classifier does not fit any model and is purely memory based: the K nearest neighbors are determined by the minimum distance from the query instance to the training samples. Any tied results are resolved by a random procedure. In KNN, a large category tends to have a small classification error, while the classification error for minority classes is usually rather large, a fact that lowers the performance of KNN under such circumstances. In this paper, the leave-one-out cross-validation (LOOCV) method was used. When there are n data to be classified, the data are divided into one testing sample and n − 1 training samples at each iteration of the evaluation process, and a classifier is constructed by training on the n − 1 samples. The category of the testing sample is then judged by this classifier. In this paper, 1-NN with leave-one-out cross-validation served as the classifier used to calculate classification accuracies.
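A minimal sketch of the 1-NN/LOOCV evaluation, assuming scikit-learn is available (the data below are random placeholders):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 50))       # 30 samples, 50 selected genes (placeholder data)
y = rng.integers(0, 2, size=30)     # two tissue classes

knn = KNeighborsClassifier(n_neighbors=1)            # 1-NN classifier
scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())              # accuracy of this gene subset
```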

Support vector machine. The SVM is a new and promising technique for data classification and regression. In the past 10 years, it has become an important part of machine learning and pattern recognition. The original idea behind SVMs is to use a linear separating hyperplane that maximizes the distance between two classes to create a classifier. Given training instance-label pairs (x_i, y_i), i = 1, 2, ..., m, where x_i \in R^n and y_i \in \{+1, -1\}, the linear SVM tries to find the largest-margin hyperplane f(x) = w^T x + b separating the two classes:

\min_{w, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i    (8)

y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, ..., m    (9)

Here, w \in R^n is a vector of weights of the training instances, C is the penalty parameter on the training error, and the \xi_i are nonnegative slack variables. Through the penalty term C \sum_{i=1}^{m} \xi_i in the objective function, training errors are allowed. The constraints given in (9) allow training data to lie on the wrong side of the separating hyperplane w^T x + b while the training error \sum_{i=1}^{m} \xi_i is minimized in the objective function. Hence, if the penalty parameter C is large enough, the data can be separated correctly. Using the SVM to solve for the optimal hyperplane is a quadratic programming (QP) problem. It can be transformed into a dual problem by introducing the Lagrange multipliers \alpha_i (Hsu and Lin, 2002):

\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
s.t. \; 0 \le \alpha_i \le C, \; i = 1, ..., m, \quad \sum_{i=1}^{m} \alpha_i y_i = 0    (10)

In order to find the hyperplane that maximizes the distance between the two classes, the nonnegative \alpha_i are calculated under the constraints \sum_{i=1}^{m} \alpha_i y_i = 0 and 0 \le \alpha_i \le C by using the dual Lagrangian L_D(\alpha). After the \alpha_i are obtained, the other hyperplane parameters, w and b, can also be obtained, and the optimal hyperplane decision function f(x) = sgn(w \cdot x + b) can be written as:

f(x) = sgn\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \right)    (11)

However, not all input data can be linearly separated in the real world. In order to solve this problem, a kernel mapping \Phi is introduced into the decision function. The kernel function maps the input data x_i into a higher-dimensional feature space \Phi(x_i), so that the input data can be separated by applying the linear SVM formulation. Several different kernel functions can be used, for example, the radial basis function (RBF) e^{-r \|x_i - x_j\|^2}, the polynomial kernel (x_i^T x_j / r + \gamma)^d, and the sigmoid kernel \tanh(r \, x_i^T x_j + \gamma), where r, d, and \gamma are kernel parameters. After the kernel function is introduced, the nonlinear SVM classifier can be rewritten as:

f(x) = sgn\left( \sum_{i=1}^{m} \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) + b \right)    (12)
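For reference, a soft-margin SVM with the RBF kernel discussed above can be sketched with scikit-learn; the values of C and the kernel parameter are illustrative, not the settings used in this study:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 50))    # placeholder expression profiles
y_train = rng.integers(0, 2, size=40)
X_test = rng.normal(size=(5, 50))

# Soft-margin SVM (penalty C) with an RBF kernel exp(-gamma * ||x - x'||^2).
svm = SVC(C=1.0, kernel="rbf", gamma="scale")
svm.fit(X_train, y_train)
print(svm.predict(X_test))
```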

FIG. 1. Flowchart of the proposed method.

SVM techniques were originally designed for binary classification; how to extend them effectively to multiclass classification is still an ongoing research issue. Several methods have been proposed, in which a multiclass classifier is typically constructed by combining several binary classifiers. The one-versus-rest (OVR) method is used for multiclass SVM classification in this paper. OVR assembles classifiers that distinguish one class from all other classes. For each i, 1 ≤ i ≤ k, a binary classifier separating class i from the rest is built. To predict the class label of a given data point, the output of each of the k classifiers is obtained. If there is a unique class label, say j, that is consistent with all k predictions, the data point is assigned to class j. Otherwise, one of the k classes is selected randomly. In practice, a situation in which no consistent class assignment exists arises quite often.
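The OVR rule described above (assign the unique consistent label, otherwise pick a class at random) can be sketched as follows; this is our own illustration built from k binary SVMs rather than a library OVR implementation:

```python
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    """Train one binary SVM per class: class i versus the rest."""
    classes = np.unique(y)
    models = [SVC(kernel="rbf", gamma="scale").fit(X, (y == c).astype(int))
              for c in classes]
    return classes, models

def ovr_predict(classes, models, x, rng=np.random.default_rng(4)):
    """Assign the unique class voted 'positive'; otherwise choose randomly."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    positives = [c for c, vote in zip(classes, votes) if vote == 1]
    if len(positives) == 1:
        return positives[0]
    return rng.choice(classes)          # no unique consistent label

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 30)); y = rng.integers(0, 3, size=60)
classes, models = ovr_fit(X, y)
print(ovr_predict(classes, models, X[0]))
```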

TABLE 1. CANCER-RELATED HUMAN GENE EXPRESSION DATA SETS

| Data set name | Diagnostic task | Samples | Genes | Classes | Reference |
| --- | --- | --- | --- | --- | --- |
| 9_Tumors | Nine various human tumor types | 60 | 5,726 | 9 | Staunton et al. (2001) |
| 11_Tumors | Eleven various human tumor types | 174 | 12,533 | 11 | Su et al. (2001) |
| 14_Tumors | Fourteen various human tumor types and 12 normal tissue types | 308 | 15,009 | 26 | Ramaswamy et al. (2001) |
| Brain_Tumor1 | Five human brain tumor types | 90 | 5,920 | 5 | Pomeroy et al. (2002) |
| Brain_Tumor2 | Four malignant glioma types | 50 | 10,367 | 4 | Nutt et al. (2003) |
| Leukemia1 | Acute myelogenous leukemia (AML), acute lymphoblastic leukemia (ALL) B-cell, and ALL T-cell | 72 | 5,327 | 3 | Golub et al. (1999) |
| Leukemia2 | AML, ALL, and mixed-lineage leukemia (MLL) | 72 | 11,225 | 3 | Armstrong et al. (2001) |
| Lung_Cancer | Four lung cancer types and normal tissues | 203 | 12,600 | 5 | Bhattacharjee et al. (2001) |
| SRBCT | Small, round blue cell tumors of children | 83 | 2,308 | 4 | Khan et al. (2001) |
| Prostate_Tumor | Prostate tumor and normal tissue | 102 | 10,509 | 2 | Singh et al. (2002) |
| DLBCL | Diffuse large B-cell lymphomas and follicular lymphomas | 77 | 5,469 | 2 | Shipp et al. (2002) |

TABLE 2. ACCURACY OF CLASSIFICATION FOR GENE EXPRESSION DATA USING DIFFERENT METHODS

The first two columns report accuracy without feature selection; the remaining columns report accuracy with feature selection.

| Data set | KNN(a) | MC-SVM(b) | IG + KNN(c) | IG + SVM(c) | IG + BPSO/KNN(d) | IG + BPSO/SVM(d) | IG + IBPSO/KNN(e) | IG + IBPSO/SVM(e) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9_Tumors | 43.90 | 65.10 | 66.67 | 75.00 | 85.00 | 91.67 | 90.00 | 90.00 |
| 11_Tumors | 78.51 | 94.68 | 83.33 | 97.76 | 93.68 | 98.35 | 95.40 | 98.35 |
| 14_Tumors | 50.40 | 74.98 | 56.82 | 71.70 | 65.26 | 80.02 | 69.16 | 81.11 |
| Brain_Tumor1 | 87.94 | 91.67 | 88.89 | 91.11 | 93.33 | 93.33 | 96.67 | 95.56 |
| Brain_Tumor2 | 68.67 | 77.00 | 78.00 | 84.00 | 84.00 | 88.00 | 92.00 | 90.00 |
| Leukemia1 | 83.57 | 97.50 | 93.06 | 95.71 | 98.61 | 100.00 | 100.00 | 100.00 |
| Leukemia2 | 87.14 | 97.32 | 91.67 | 97.14 | 95.83 | 98.57 | 100.00 | 98.57 |
| Lung_Cancer | 89.64 | 96.05 | 90.15 | 95.12 | 94.58 | 96.07 | 96.06 | 97.00 |
| SRBCT | 86.90 | 100.00 | 98.80 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Prostate_Tumor | 85.09 | 92.00 | 89.22 | 93.00 | 95.10 | 96.00 | 98.04 | 99.00 |
| DLBCL | 86.96 | 97.50 | 93.51 | 97.14 | 100.00 | 100.00 | 100.00 | 100.00 |
| Average | 77.16 | 89.44 | 84.56 | 90.70 | 91.40 | 94.73 | 94.30 | 95.42 |

(a) KNN: K-nearest neighbor.
(b) MC-SVM: multiclass support vector machines.
(c) IG + KNN, IG + SVM: only information gain is used to select the number of genes, and KNN and SVM are used to calculate classification accuracy directly.
(d) IG + BPSO/KNN, IG + BPSO/SVM: information gain and binary particle swarm optimization are combined with KNN and SVM to evaluate accuracy.
(e) IG + IBPSO/KNN, IG + IBPSO/SVM: information gain and an improved binary particle swarm optimization are combined with KNN and SVM to evaluate accuracy.

Two-Stage Feature Selection Experiments

Framework of experiments

A number of filter approaches for feature selection of microarray data have been proposed in the literature, such as

the t-test (Xiong et al., 2001), the ratio of between-groups to within-groups sum of squares (BSS/WSS) (Dudoit et al., 2002), minimal-redundancy-maximal-relevance (mRMR) (Hanchuan et al., 2005), and signal-to-noise (S2N) (Golub et al., 1999). The advantage of filter models is the fast selection of useful gene subsets. However, they do not take gene–gene interaction into account (Wang et al., 2005b; Zhang and Deng, 2007), which can result in lower classification accuracy.

TABLE 3. SELECTED NUMBER OF GENES AND PERCENTAGE OF SELECTED GENES FOR GENE EXPRESSION DATA USING DIFFERENT METHODS

Percentages of the original number of genes are given in parentheses. The IG column is the single-stage filter; the remaining columns are two-stage feature selection.

| Data set | Original genes(a) | IG(b) | IG + BPSO/KNN(c) | IG + BPSO/SVM(c) | IG + IBPSO/KNN(d) | IG + IBPSO/SVM(d) |
| --- | --- | --- | --- | --- | --- | --- |
| 9_Tumors | 5,726 | 165 (2.88) | 49 (0.86) | 68 (1.19) | 34 (0.59) | 35 (0.61) |
| 11_Tumors | 12,533 | 3,181 (25.34) | 1,370 (10.93) | 1,618 (12.91) | 482 (3.85) | 1,002 (7.99) |
| 14_Tumors | 15,009 | 1,986 (13.23) | 1,132 (7.54) | 1,064 (7.09) | 254 (1.69) | 922 (6.14) |
| Brain_Tumor1 | 5,920 | 1,612 (27.23) | 474 (8.01) | 409 (6.91) | 108 (1.82) | 71 (1.20) |
| Brain_Tumor2 | 10,367 | 4,465 (43.07) | 1,855 (17.89) | 1,503 (14.50) | 342 (3.30) | 601 (5.80) |
| Leukemia1 | 5,327 | 848 (15.92) | 186 (3.49) | 160 (3.00) | 34 (0.64) | 31 (0.58) |
| Leukemia2 | 11,225 | 4,596 (40.94) | 459 (4.09) | 1,049 (9.35) | 220 (1.96) | 185 (1.65) |
| Lung_Cancer | 12,600 | 9,561 (75.88) | 3,643 (28.91) | 4,215 (33.45) | 683 (5.42) | 828 (6.57) |
| SRBCT | 2,308 | 669 (28.99) | 158 (6.85) | 107 (4.64) | 37 (1.60) | 49 (2.12) |
| Prostate_Tumor | 10,509 | 2,016 (19.18) | 654 (6.22) | 532 (5.06) | 201 (1.91) | 152 (1.45) |
| DLBCL | 5,469 | 882 (16.13) | 252 (4.61) | 200 (3.66) | 28 (0.51) | 27 (0.49) |

(a) Number of original (nonselected) genes.
(b) IG: the number of genes selected using information gain alone.
(c) IG + BPSO/KNN, IG + BPSO/SVM: number of genes selected using information gain combined with binary particle swarm optimization and the KNN and SVM classification algorithms, respectively.
(d) IG + IBPSO/KNN, IG + IBPSO/SVM: number of genes selected using information gain combined with improved binary particle swarm optimization and the KNN and SVM classification algorithms, respectively.

FIG. 2. The number of iterations versus classification accuracy.

FIG. 3. Graphic comparison of KNN classification accuracies obtained via different methods.

In order to solve this problem, wrapper model approaches have been proposed. Commonly used wrapper model approaches are sequential forward selection, sequential backward selection, beam search, and genetic algorithms (Saeys et al., 2007). A wrapper model approach generally yields higher classification accuracy. However, because a classifier has to be incorporated into the feature selection process, the computation time and computational load increase as well. In order to utilize the advantages of both the filter model and the wrapper model, and to avoid their respective disadvantages, we propose a two-stage method for gene expression data that uses information gain to implement a gene-ranking process and combines it with an improved particle swarm optimization to select optimal gene subsets from the gene expression data. KNN and support vector machine classifiers are used to calculate the classification accuracy. A flowchart of the proposed method is depicted in Figure 1.

FIG. 4. Graphic comparison of SVM classification accuracies obtained via different methods.

Encoding of candidate solution

In the first stage, in order to achieve effective selection of gene subsets, a threshold value for information gain needs to be set. We chose a threshold value of 0: if the weight value of a gene exceeds 0, it is selected, otherwise it is discarded. A higher weight value indicates that the gene discriminates better between the classes, meaning that the feature can be used to effectively calculate classification results. In the second stage, we focused on the feature genes selected during the first stage, and then used IBPSO, in which the two classification algorithms served as evaluators, to implement feature (gene) selection again. The procedure can be summed up by the following example: initially, the position of each particle is represented in binary string form over the feature genes retained in the first stage, and the particles are generated randomly. The bit value {1} represents a selected feature, whereas the bit value {0} represents a nonselected feature.
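A sketch of this encoding, assuming the first stage has kept n_kept genes (the initialization scheme and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

n_kept = 848          # e.g., genes surviving the IG filter for Leukemia1 (Table 3)
n_particles = 30      # swarm size used in this paper

# Each particle is a binary string over the IG-filtered genes:
# bit 1 = the gene is selected, bit 0 = it is not.
particles = rng.integers(0, 2, size=(n_particles, n_kept))

def selected_genes(particle):
    """Decode a particle into the indices of the genes it selects."""
    return np.flatnonzero(particle)

print(selected_genes(particles[0])[:10])   # first few selected gene indices
```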

Fitness function

Classification accuracy is used as the criterion to evaluate the fitness of a selected gene subset and to design the fitness function. In order to evaluate the performance of the IBPSO search capability, the value calculated by the fitness function is used as a guide for the search operation. In our study, we expected to obtain maximal classification accuracy with a minimal number of selected genes. Huang et al. (2007) and Huang and Chang (2007) considered the number of selected genes a penalty parameter used to reduce the number of selected genes. However, we have not included the number of selected genes in the fitness function, so that we could establish the minimum number of genes needed to obtain maximum classification accuracy.

Parameter settings

We set the number of particles in IBPSO to 30, and the whole procedure was repeated until either the fitness (classification accuracy) of a particle reached 1.0 or the number of iterations reached 100 (the maximum number of iterations). The random factors rand1, rand2, and the rand() of Eq. (7) were random numbers in [0.0, 1.0], whereas c1 and c2 were learning factors, with c1 = c2 = 2 (Jiang et al., 2007). The parameters for the BPSO were taken from Shi and Eberhart (1998). We set [-Vmax, Vmax] = [-6, 6], which yields a probability range of [0.0025, 0.9975] under the sigmoid limiting transformation (Kennedy and Eberhart, 1997, 2001).
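The parameter settings and the fitness criterion described above can be summarized in a short sketch (assuming scikit-learn; this is our illustration of "fitness = LOOCV accuracy of the selected subset, with no penalty on the number of genes"):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# IBPSO settings as reported in the text (the inertia weight is not listed here).
SETTINGS = dict(n_particles=30, max_iterations=100, c1=2.0, c2=2.0,
                v_max=6.0, target_fitness=1.0)

def fitness(particle, X, y):
    """Fitness of a particle: LOOCV accuracy of 1-NN on the selected genes.
    The number of selected genes is deliberately not penalized."""
    idx = np.flatnonzero(particle)
    if idx.size == 0:
        return 0.0          # an empty gene subset cannot classify anything
    knn = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(knn, X[:, idx], y, cv=LeaveOneOut()).mean()
```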


Results and Discussion

In this study, the gene expression data were obtained with the oligonucleotide technique, except in the case of SRBCT, which was obtained by continuous image analysis. Our data sets consisted of 11 gene expression profiles, which were downloaded from http://www.gems-system.org. They comprise tumor, brain tumor, leukemia, lung cancer, cDNA microarray, and prostate tumor samples. The data are summarized in Table 1, which lists the data set name, the number of samples, the number of genes, and the number of classes.

Microarray gene expression data can be effectively used for gene identification, cell differentiation, pharmaceutical development, cancer classification, disease diagnosis, and disease prediction. However, microarray data analysis warrants special strategies because gene expression data sets have several important characteristics, one of which is the fact that relatively few samples with a large dimension of feature genes are used. The proper selection of genes relevant for gene expression classification is a common challenge in bioinformatics. Attempts to analyze the entire microarray data set without a reduction of the dimension may not be appropriate, because microarray data provide a massive amount of gene information simultaneously. Theoretically, feature selection problems are NP-hard. Conducting an exhaustive search over the entire solution space is not feasible, because this would require a prohibitive amount of computing time and cost. Using feature selection as a pretreatment prior to the actual classification of gene expression data can effectively reduce the calculation time, because only a relatively small number of features is selected. This does not affect the classification accuracy negatively; on the contrary, classification accuracy can even be improved.

In general, feature selection is based on two aspects: one is to obtain a set of genes that have similar functions and a close relationship, the other is to find the smallest set of genes that can provide meaningful diagnostic information for disease prediction without trading off accuracy. Some hybrid feature selection methods currently exist, for instance, methods where only information gain is used with two different classifiers; these are called IG + KNN and IG + SVM. Methods combining information gain and BPSO with KNN and SVM are IG + BPSO/KNN and IG + BPSO/SVM. Similarly, we termed the proposed methods IG + IBPSO/KNN and IG + IBPSO/SVM. The information gain values in this study were calculated using the Weka software package (Frank et al., 2004). A weight value was subsequently calculated for each gene, and a fixed number of genes was selected.

Table 2 compares the classification accuracy achieved by our method with the results of previously published studies (Statnikov et al., 2005). The average classification accuracy achieved with IG (KNN, SVM), IG + BPSO (KNN, SVM), and IG + IBPSO (KNN, SVM) was (84.56, 90.70), (91.40, 94.73), and (94.30, 95.42), respectively. The highest average classification accuracy in Statnikov et al. was 89.44, achieved with SVM. The classification accuracy achieved by only using IG with KNN and SVM is much lower than the accuracy obtained by IG with BPSO/KNN (BPSO/SVM) and IBPSO/KNN (IBPSO/SVM). This shows that the proposed method effectively selects feature genes relevant for classification accuracy. The classification accuracy obtained by IG + IBPSO with KNN and SVM is superior to the results obtained by IG + BPSO with KNN and SVM; it is almost 6% higher than the best results in Statnikov et al. The proposed methods IG + IBPSO (KNN) and IG + IBPSO (SVM) obtained the highest classification accuracy in 6 and 7 of the 11 gene expression data sets, respectively. These numbers add up to more than 11 because IG + IBPSO (KNN) reached 100% classification accuracy four times out of the six cases where it had the highest classification accuracy, and IG + IBPSO (SVM) reached 100% classification accuracy three times; these instances are counted for both methods, because 100% classification accuracy is the highest value that can be achieved. Even though IG + IBPSO (KNN) reached 100% more often than IG + IBPSO (SVM), the results obtained with SVM as a classifier are generally superior (or equal) to the results obtained with KNN.

Table 3 shows the number of genes selected after implementing feature selection by the different methods. It clearly demonstrates that the number of genes selected by the proposed methods is much lower than for any other selection method from the literature. The average number of genes selected could be reduced to 28.07, (9.74, 9.25), and (2.12, 2.15) percent for IG, IG + BPSO (KNN, SVM), and IG + IBPSO (KNN, SVM), respectively, while the corresponding average classification accuracy was (84.56, 90.70), (91.40, 94.73), and (94.30, 95.42) for IG (KNN, SVM), IG + BPSO (KNN, SVM), and IG + IBPSO (KNN, SVM). This means that many genes in microarray data are irrelevant for classification. A robust feature selection method should ignore these irrelevant genes and thus increase classification accuracy and efficiency.

For several data sets the classification accuracy still reached 100% with IG + IBPSO, even though a large number of genes had been removed; this was the case for the Leukemia1, SRBCT, and DLBCL data sets.

Inza et al. (2004) and Xiong et al. (2001) used a wrapper approach to implement feature selection and selected feature subsets for classification accuracy. Nevertheless, if only a wrapper approach is used to implement feature selection, it can be difficult to find optimal solutions due to the vastness of the search space. We therefore employed information gain to select feature subsets in the first stage before using the wrapper approach to implement feature selection in the second stage. This reduces the size of the search space, and optimal solutions can thus be identified more easily. Using evolutionary algorithms to implement feature selection is analogous to searching for the best combination of solutions in a large search space; each particle represents a possible solution. Elbeltagi et al. (2005) indicate that PSO is superior to evolutionary algorithms in terms of search capability. Therefore, we used a binary PSO version (BPSO) instead of the GAs used in other studies to select feature (gene) subsets. However, BPSO has a critical disadvantage. In general, after several generations, each particle is influenced by pBest and gBest as it continues to change its position in search of other possible solutions. Therefore, if gBest is not changed continuously, all particles tend to cluster together around gBest, and the superior search ability of BPSO is negated. Hence, we proposed an improved binary particle swarm optimization (IBPSO), in which gBest is reset under these circumstances. Resetting gBest prevents BPSO from getting trapped in a local optimum, and thus enhances the search performance.

Figure 2a–k shows the number of iterations versus classification accuracy for the 11 microarray gene expression data sets we analyzed. At each generation, we record the gBest accuracy (the best of all particles), which represents the best combination of solutions. The curves for four different methods, IG + IBPSO/SVM, IG + IBPSO/KNN, IG + BPSO/SVM, and IG + BPSO/KNN, are plotted. Figures 3 and 4 show graphic comparisons of the KNN and SVM classification accuracies obtained via the different methods, respectively.

Conclusion

We propose a two-stage feature selection method that sequentially combines feature (gene) ranking by information gain and an IBPSO method to select gene subsets relevant for the classification of gene expression data. Both the KNN method and support vector machines were used as classifiers to evaluate the classification accuracy. The conducted experiments demonstrate that the proposed method can reduce the number of genes selected and increase the classification accuracy. Overall, 95.42% classification accuracy was achieved by the proposed method, even when the number of feature subsets was significantly reduced. Aside from gene expression data, the proposed method could also be used for other data sets with high-dimensional features. The proposed method constitutes an ideal preprocessing tool that helps optimize the feature selection process, because it increases the classification accuracy and, at the same time, keeps computational resources to a minimum.

Acknowledgments

This work is partly supported by the National Science Council in Taiwan under grants NSC96-2622-E-151-019-CC3, NSC96-2622-E-214-004-CC3, NSC95-2221-E-151-004-MY3, NSC95-2221-E-214-087, NSC95-2622-E-214-004, NSC94-2622-E-151-025-CC3, and KMU-EM-97-2.1a.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Armstrong, S.A., Staunton, J.E., Silverman, L.B., et al. (2001). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet 30, 41–47.
Bhattacharjee, A., Richards, W.G., Staunton, J., et al. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98, 13790–13795.
Dudoit, S., Fridlyand, J., and Speed, T.P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97, 77–87.
Elbeltagi, E., Hegazy, T., and Grierson, D. (2005). Comparison among five evolutionary-based optimization algorithms. Adv Eng Informatics 19, 43–53.
Famili, A.F., Liu, G., and Liu, Z. (2004). Evaluation and optimization of clustering in gene expression data analysis. Bioinformatics 20, 1535–1545.
Fix, E., and Hodges, J.L. (1951). Discriminatory analysis, nonparametric discrimination: consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, TX.
Frank, E., Hall, M., Trigg, L., et al. (2004). Data mining in bioinformatics using Weka. Bioinformatics 20, 2479–2481.
Golub, T., Slonim, D., Tamayo, P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Hanchuan, P., Fuhui, L., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Machine Intell 27, 1226–1238.
Hsu, C.-W., and Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learn 46, 291–314.
Huang, C.L., and Dun, J.F. (2008). A distributed PSO-SVM hybrid system with feature selection and parameter optimization. Appl Soft Comput (in press).
Huang, H.L., and Chang, F.L. (2007). ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data. Biosystems 90, 516–528.
Huang, H.-L., Lee, C.-C., and Ho, S.-Y. (2007). Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. Biosystems 90, 78–86.
Inza, I., Larranaga, P., Blanco, R., and Cerrolaza, A.J. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 31, 91–103.
Jiang, M., Luo, Y.P., and Yang, S.Y. (2007). Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm. Inform Process Lett 102, 8–16.

Kennedy, J., and Eberhart, R. (1995). Particle swarm optimization. Proceedings of the IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948.
Kennedy, J., and Eberhart, R. (1997). A discrete binary version of the particle swarm algorithm. Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics, 'Computational Cybernetics and Simulation', pp. 4104–4108.
Kennedy, J., and Eberhart, R. (2001). Swarm Intelligence. San Mateo, CA: Morgan Kaufmann.
Khan, J., Wei, J., Ringner, M., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7, 673–679.
Kohavi, R., and John, G.H. (1997). Wrappers for feature subset selection. Artif Intell 97, 273–324.
Lee, Y., and Lee, C.-K. (2003). Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19, 1132–1139.
Li, T., Zhang, C., and Ogihara, M. (2004). A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437.
Nutt, C.L., Mani, D.R., Betensky, R.A., et al. (2003). Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63, 1602–1607.
Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learn 1, 81–106.
Ramaswamy, S., Tamayo, P., Rifkin, R., et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98, 15149–15154.
Raymer, M.L., Punch, W.F., Goodman, E.D., et al. (2000). Dimensionality reduction using genetic algorithms. IEEE Trans Evolut Comput 4, 164–171.
Saeys, Y., Inza, I., and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517.
Shi, Y., and Eberhart, R. (1998). Parameter selection in particle swarm optimization. Proceedings of the 7th International Conference on Evolutionary Programming, vol. VII, pp. 591–600.


Shipp, M.A., Ross, K.N., Tamayo, P., et al. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8, 68–74.
Singh, D., Febbo, P.G., Ross, K., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.
Statnikov, A., Aliferis, C.F., and Tsamardinos, I. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21, 631–643.
Staunton, J.E., Slonim, D.K., Coller, H.A., et al. (2001). Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 98, 10787–10792.
Su, A.I., Welsh, J.B., Sapinoso, L.M., et al. (2001). Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61, 7388–7393.
Wang, Y., Makedon, F.S., Ford, J.C., et al. (2005a). HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 21, 1530–1537.
Wang, Y., Tetko, I.V., Hall, M.A., et al. (2005b). Gene selection from microarray data for cancer classification—a machine learning approach. Comput Biol Chem 29, 37–46.
Xiong, M., Fang, X., and Zhao, J. (2001). Biomarker identification by feature wrappers. Genome Res 11, 1878–1887.
Zhang, H., and Sun, G. (2002). Feature selection using tabu search method. Pattern Recog 35, 701–711.
Zhang, J.-G., and Deng, H.-W. (2007). Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 8, 370.

Address reprint requests to:
Dr. Cheng-Hong Yang
Department of Electronic Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan 807, R.O.C.

E-mail: [email protected]