Comparison of hybrid feature selection models on gene expression data

2010 Eighth International Conference on ICT and Knowledge Engineering

Patharawut Saengsiri
Department of Information Technology, Faculty of Information Technology, KMUTNB, Bangkok, Thailand
[email protected]

Sageemas Na Wichian
Department of Applied Science and Social, College of Industrial Technology, KMUTNB, Bangkok, Thailand
[email protected]

Phayung Meesad
Department of Teacher Training in Electrical Engineering, Faculty of Technical Education, KMUTNB, Bangkok, Thailand
[email protected]

Unger Herwig
Department of Communication Network, Faculty of Mathematics and Computer Science, Fern University in Hagen, Germany
[email protected]

Abstract—Microarray data contain thousands of genes that are used to evaluate expression levels. However, most of these genes are not associated with cancer, which leads to the curse of dimensionality. The key challenge with microarray data is therefore feature selection: searching for subsets of informative genes. Current techniques focus on filter and wrapper approaches to discover subsets of genes. The filter approach is faster than the wrapper approach; on the contrary, the accuracy of the wrapper approach is higher than that of the filter approach. It is more beneficial to reduce processing time and increase accuracy simultaneously when searching for subsets of genes. Thus, this paper proposes a comparison of hybrid feature selection models on gene expression datasets, consisting of four steps: 1) filter subgroups of genes using Correlation-based Feature Selection (CFS), Gain Ratio (GR), and Information Gain (INFO); 2) transfer the output of each filter method into a wrapper approach based on the Support Vector Machine (SVM) classifier and two heuristic searches, Greedy Search (GS) and Genetic Algorithm (GA); 3) generate the hybrid feature selection models CFSSVMGA, CFSSVMGS, GRSVMGA, GRSVMGS, INFOSVMGA, and INFOSVMGS; 4) compare performance using precision, recall, F-measure, and accuracy rate. The experimental results show that the CFSSVMGA model outperformed the other models on three public gene expression datasets.

Keywords—gene expression, feature selection, support vector machine

I. INTRODUCTION

At the moment, gene expression levels are evaluated using microarray techniques, which measure thousands of gene expressions in a single experiment. This technique applies nucleic acid hybridization to confirm many types of gene expression data at the same time [1]. Nevertheless, most of the genes are not connected with others, so biologists often spend considerable time searching for discriminative genes. Thus, constructing a subset of informative genes from microarray data is very important for recognizing genes that have discriminant power. Co-expressed genes in microarray data represent some characteristics of gene function by merging gene identification, and specific regulated genes can uncover duplicate network handles. Gene selection is a basic technique to search for genes that have classification power. Current feature selection focuses on filter and wrapper approaches. The filter approach ranks genes by individual discriminative power without involving an induction algorithm; examples are information gain, gain ratio, and correlation. Other methods connect with induction algorithms to determine the correctness of a selected subset of genes, such as Forward Stepwise Feature Selection (FSFS) or Backward Stepwise Feature Selection (BSFS). The filter method consumes less time than the wrapper approach; in contrast, the accuracy rate of the wrapper approach is higher than that of the filter approach. However, ranking alone is not enough to gather a subset of informative genes. Feature transformation methods such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Independent Component Analysis (ICA) are not suitable for gene selection because they do not remove any dimension, they retain irrelevant features, and the meaning of the transformed genes is difficult to interpret [2].

As mentioned above, the high dimensionality of gene expression data is challenging because it is time consuming and leads to high misclassification. Many algorithms are not suitable for high-dimensional data, and only a few methods combine the filter and wrapper approaches. Therefore, this research proposes a comparison of hybrid feature selection models on gene expression data. The experimental results show better performance: reduced subsets of genes and increased accuracy. This paper is organized as follows: Section 2 summarizes the literature review. Section 3 explains the proposed hybrid feature selection framework for gene expression data in detail. The comparison between the new

978-1-4244-9875-8/10/$26.00 ©2010 IEEE


methods and other related algorithms is shown in Section 4. Finally, the conclusion is presented in Section 5.

II. LITERATURE REVIEW

A. Feature Selection on Gene Expression Data

The basic technique for feature selection on gene expression data starts with ranking methods such as the t-statistic and the difference of means. The key characteristic of the t-statistic is that the sample variances are assumed to be equal; in that case, this method is better than others. The difference of means, however, performs better when the variances are assumed to be unequal. Significant Analysis of Microarray (SAM) is proposed by [3]. SAM can operate over a range of thresholds that recognize genes; it scores each gene based on the change in gene expression relative to the standard deviation of repeated measurements.

At present, feature selection tends to move towards the filter and wrapper approaches. One popular filter technique is the correlation method, which is widely used in gene selection processes. For example, genes are merged into a group when the correlation value is higher than a threshold, choosing the best gene from the top-ranked genes; better genes are then indicated by correlation values below the threshold. This technique uses fuzzy clustering for classification [4]. Some gene selection techniques consist of two steps: first, genes are chosen using CFS; then, the selected input from the first step is iterated by a binary particle swarm optimization (BPSO) algorithm, which is similar to the wrapper approach [5]. A forward variable selection method (FSM) is proposed by [6]. This method is based on the Mahalanobis distance and the F-value, and searches for a subset of genes; its measurements are compared with the Simples and S2N techniques.

Integrating the filter and wrapper methods based on a gene boosting technique is proposed by [7]. First, a subset of genes is selected from the top ranks using a filter method. Then, a subset of genes is chosen using a wrapper method based on the induction algorithm, which creates a new subset of genes. The process finishes when the expressions converge, but the accuracy rate is no better than on the training set. Gene expression patterns have high dimensionality and small sample sizes, so two algorithms are integrated, BPSO and K-nearest neighbor (K-NN), evaluated using leave-one-out cross-validation (LOOCV). Nevertheless, the high dimensionality of genes creates local minima, which is the primary problem of BPSO. The impact of gene selection on imbalanced microarray data [8] is studied using 11 public microarray datasets and five feature selection techniques (CFS, Chi-Square, IG, ReliefF, and Symmetrical Uncertainty) combined with four supervised learning techniques (C30-NN, SVM, RF-100, and PART). According to [8], the SVM technique has higher efficiency than the others, but the bias of SVM leads to low performance when the subsets contain fewer than 10 genes. A hybrid model [9] proposes gene selection based on performance comparison using Naive Bayes, an instance-based learner (IB1), and a decision tree (C4.5). Genes are evaluated one by one and divided into gene subsets, which increases classification accuracy. The problem with this technique is that it cannot determine the correlation of each gene.

B. Gene Classification

A discriminant gene is specified by a large gene expression value. Classification of microarrays to nearest centroid (CalNC) is proposed by [10], based on Linear Discriminant Analysis (LDA) and the t-statistic. Later, an applicability score is added to CalNC [11]; the distance between each sample and the centroids is measured using this score, and a gene close to a centroid is chosen using the Mahalanobis distance.

In the case of classification, genetic algorithms are very useful. A multi-objective strategy based on a genetic algorithm to find subsets of genes is proposed by [12], because GASVM has only one objective and is limited to fewer than 1,000 genes. Thus, multi-objective optimization (MOO) focuses on the relation between the varied objectives and a class; the method is called MOGASVM. A multi-class problem reduces classification performance, so [13] introduce a stochastic method called the Optimal Feature Weighting algorithm (OFW). This technique merges SVM and CART, and is evaluated against the filter and wrapper approaches on public microarray datasets. Ranked genes are evaluated using Pareto-front analysis [14], which estimates the class-wise between-class sum of squares. This method shows high performance on four multi-class cancer gene datasets.

III. METHODS

A. Information Gain (INFO)

The main idea for selecting the best split is to measure node impurity; common measures are the GINI index, entropy, and misclassification error [15]. INFO is based on the reduction in entropy caused by a split. The entropy at a given node t is given in (1):

Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)    (1)

where p(j|t) is the relative frequency of class j at node t. The information gain of a split is shown in (2): the parent node p is split into k partitions, and n_i is the number of records in partition i.

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)    (2)

Nevertheless, this measure is biased toward splits that produce a large number of partitions.
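As a concrete illustration, the measures in (1) and (2), together with the split-information correction that the Gain Ratio section below describes, can be sketched in a few lines of Python. This is a stdlib-only sketch for intuition, not the WEKA implementation used in the paper:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), eq. (1)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(i), eq. (2)
    n = len(parent_labels)
    children = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - children

def split_info(partitions):
    # SplitINFO penalizes splits into many small partitions, eq. (3)
    n = sum(len(p) for p in partitions)
    return -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions if p)

def gain_ratio(parent_labels, partitions):
    # GainRatio = information gain / SplitINFO, eq. (4)
    si = split_info(partitions)
    return information_gain(parent_labels, partitions) / si if si else 0.0
```

For a balanced binary node split perfectly into two pure partitions, both the information gain and the gain ratio equal 1 bit.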

B. Gain Ratio (GR)

The GR technique addresses the bias problem of INFO. The structure of the method is created using a top-down design. GR was developed by Quinlan in 1986 and is based on information theory: in general, given the probability P(v_i) of answer v_i, the information I of the answer is given by [16]. SplitINFO, presented in (3), resolves the bias in INFO:

SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}    (3)

In (4), INFO is adjusted by the entropy of the partitioning (SplitINFO), so splits with higher partitioning entropy are penalized:

GainRatio = \frac{\Delta_{info}}{SplitINFO}    (4)

C. Correlation-based Feature Selection (CFS)

The CFS method was developed by Hall in 1999 and focuses on the relationship between attributes and the class based on heuristic estimation [17]. Its measurement process favors subsets whose features are highly correlated with the class while having low correlation among themselves. Therefore, irrelevant features are reduced and powerful features are chosen by this algorithm. The merit of a subset is given in (5):

M_S = \frac{k \bar{r}_{cf}}{\sqrt{k + k(k-1)\bar{r}_{ff}}}    (5)

where M_S is the heuristic "merit" of feature subset S, which is a set containing k features; \bar{r}_{cf} is the average feature-class correlation (f \in S); and \bar{r}_{ff} is the mean feature-feature inter-correlation.

D. Support Vector Machine (SVM)

Vapnik introduced the SVM technique in 1995. The objective of SVM is to reduce error and maximize the margin, which differentiates it from other general algorithms such as Artificial Neural Networks (ANN) and LDA. Define the experimental data D = {(x_i, y_i); i = 1, 2, ..., n}, where x_i = (x_{i1}, x_{i2}, ..., x_{in}) \in R^n is the input data and y_i \in {+1, -1} represents the class. The decision functions for the linear and non-linear cases are shown in (6) and (7), respectively:

f(x) = sign(w \cdot x + b)    (6)

f(x) = sign\left(\sum_{j=1}^{n} w_j \phi_j(x) + b\right)    (7)

Here \phi(x) = [\phi_1(x), \phi_2(x), ..., \phi_n(x)]^T represents the classification function and the transformation of non-linear input data into linearly separable data [18]; w_j is the weighting that links the feature space to the output space, and b is the bias threshold, as shown in (8):

f(x) = \sum_{j=1}^{n} w_j \phi_j(x) + b    (8)

Therefore, the values of y, -1 and +1, are defined using equations (9) and (10), which combine into (11):

w \cdot x_i + b \geq +1  for  y_i = +1    (9)

w \cdot x_i + b \leq -1  for  y_i = -1    (10)

y_i (w \cdot x_i + b) - 1 \geq 0  \forall i    (11)

E. Greedy Search (GS)

Many sequential search techniques are based on greedy methods. Greedy search does not guarantee global optimality but is acceptable for finding local groups of genes [19]. For instance, ordered searches include forward and backward selection. Sequential backward selection was developed by Marill and Green in 1963; it starts from the full dimensionality and removes one dimension at a time based on the objective function. On the other hand, Whitney in 1971 developed sequential forward selection, which starts from an empty set and adds one feature at a time. However, ordered forward and backward search techniques are not only computationally expensive, with complexity O(N^2), but also cannot undo earlier operations such as deleting or inserting features.

F. Genetic Algorithm (GA)

The GA was created by John Holland in 1975. This technique is based on evolutionary theory and random search; randomness is added to the search process to avoid local optima. In addition, the search is enhanced for high-dimensional data by measuring each dimension in accordance with expression, so that features are correctly recognized and top-ranked [19]. The GA process is as follows:

1. Reproduction or fitness evaluation
2. Crossover
3. Mutation

IV. EXPERIMENT AND RESULT

All of the algorithms in this paper were run using WEKA version 3.7. First, the three filter approaches CFS, GR, and INFO were used to select a subset of genes. The output of this first step was then transferred into the wrapper approach based on SVM and the heuristic searches GS and GA. This research selects the Radial Basis Function (RBF) kernel because it is more efficient than other kernel functions [5][18]. In addition, the evaluation focuses on precision, recall, F-measure, and accuracy rate. Figure 1 represents the experimental design for the comparison of hybrid feature selection models.

The detailed procedure is as follows:

1. Datasets: DLBCL (240x7400), leukemia (72x7130), and colon cancer (62x2001). Table I presents the details of all datasets.

2. Data preprocessing: missing values are replaced with the mean, and data are normalized into the range -1 to +1.


3. Filtering approach: CFS, GR, and INFO select the top-ranked genes according to their individual discriminative power, without involving any induction algorithm, as shown in Table II.

4. Wrapper approach: extracts a set of genes useful for classification by repeating a process consisting of combinatorial gene selection and discrimination by the classifier. The result of each filter approach is refined using SVM+GA and SVM+GS. Table III presents the output of this step.

5. Evaluation: CFSSVMGA, CFSSVMGS, GRSVMGA, GRSVMGS, INFOSVMGA, and INFOSVMGS are evaluated with five-fold cross-validation, measuring precision, recall, F-measure, and accuracy rate. K-fold cross-validation (CV) separates the data into k subsets; one subset is withheld from the dataset as test data, the remaining data are used for training, and the process is repeated k times. The result of the evaluation step is shown in Table IV.
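The filter-then-wrapper procedure above can be sketched end to end in Python. This is a minimal stand-in, not the paper's implementation: it ranks genes by a simple difference-of-class-means filter instead of CFS/GR/INFO, and it uses a nearest-centroid classifier in place of the RBF-kernel SVM; the sequential forward search plays the role of GS:

```python
import random
import statistics

def filter_rank(X, y, k):
    # Filter step (stand-in for CFS/GR/INFO): score each gene by the absolute
    # difference of its class means, then keep the k top-ranked gene indices.
    def score(j):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        return abs(statistics.mean(pos) - statistics.mean(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict_centroid(train_X, train_y, test_row, subset):
    # Stand-in induction algorithm (the paper uses an RBF-kernel SVM):
    # predict the class whose centroid over the chosen genes is nearest.
    dist = {}
    for c in set(train_y):
        rows = [row for row, label in zip(train_X, train_y) if label == c]
        centroid = [statistics.mean(row[j] for row in rows) for j in subset]
        dist[c] = sum((test_row[j] - m) ** 2 for j, m in zip(subset, centroid))
    return min(dist, key=dist.get)

def cv_accuracy(X, y, subset, k=5, seed=0):
    # Five-fold cross-validation accuracy used to score a candidate subset.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            correct += predict_centroid(train_X, train_y, X[i], subset) == y[i]
    return correct / len(X)

def greedy_wrapper(X, y, candidates):
    # Wrapper step (GS): sequential forward selection over the filtered genes,
    # adding a gene only while it improves cross-validated accuracy.
    chosen, best = [], 0.0
    while candidates:
        acc, gene = max((cv_accuracy(X, y, chosen + [g]), g) for g in candidates)
        if acc <= best:
            break
        chosen.append(gene)
        candidates.remove(gene)
        best = acc
    return chosen, best
```

A GA-based wrapper would replace `greedy_wrapper` with a population of candidate subsets evolved by fitness evaluation, crossover, and mutation, scoring each subset with the same `cv_accuracy`.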

TABLE I. DETAILS OF THE GENE EXPRESSION DATASETS

Datasets     | Instances | Attributes | Class Values
Colon Cancer | 61        | 2001       | Positive and Negative
DLBCL        | 240       | 7399       | Dead and Alive
Leukemia     | 76        | 7129       | ALL and AML

TABLE II. RESULTS OF THE THREE FILTERING APPROACHES (DIMENSION REDUCTION)

Datasets | Attributes | CFS | GR  | INFO
Colon    | 2001       | 26  | 135 | 135
DLBCL    | 7399       | 26  | 36  | 57
Leukemia | 7129       | 76  | 874 | 874

The features of the three gene expression datasets are reduced by the hybrid feature selection models as shown in Table III. The GRSVMGA model retains more features than the others, while GRSVMGS selects the fewest. In Table IV, the CFSSVMGA model shows higher accuracy than the others, while the INFOSVMGS model shows the lowest. Precision, recall, and F-measure, presented in the same table, also reveal that CFSSVMGA has the best performance while INFOSVMGS has the lowest efficiency.

The experimental results show that the CFSSVMGA model provided higher precision, recall, F-measure, and accuracy rate on gene expression data than the other models, indicating that this model is well suited to hybrid feature selection. In contrast, joining INFO, SVM, and greedy search (INFOSVMGS) gave the lowest measurement values when selecting a subset of genes. However, all of the hybrid models reduce the number of features and still classify better than the original feature selection techniques. The results are shown in Tables III and IV.

Figure 2 shows the results of the hybrid models on the colon cancer dataset; CFSSVMGA outperforms the other hybrid models. Figure 3 presents the outcome of these models on the DLBCL dataset, in which the accuracy rates of CFSSVMGA and INFOSVMGA are equal, although the gene subset selected by INFOSVMGA is larger than that of CFSSVMGA. As depicted in Figure 4, most of the hybrid models reach nearly one hundred percent on the Leukemia dataset, except GRSVMGA and INFOSVMGS.

Figure 1. Experiment design for the comparison of hybrid feature selection models

Figure 2. Comparing performance of hybrid models on the colon cancer dataset

Figure 3. Comparing performance of hybrid models on the DLBCL dataset
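The precision, recall, F-measure, and accuracy rate reported in Table IV can be computed as below. This is a stdlib sketch for a single designated positive class, not tied to the WEKA implementation the paper used:

```python
def binary_metrics(y_true, y_pred, positive):
    # Counts with respect to the chosen positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f_measure, accuracy
```

In the five-fold setup of step 5, these metrics are accumulated over the predictions of all five test folds.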


Figure 4. Comparing performance of hybrid models on the Leukemia dataset

TABLE III. SUBSETS OF GENE EXPRESSION DATA SELECTED BY THE HYBRID MODELS

Datasets     | CFSSVMGA | CFSSVMGS | GRSVMGA | GRSVMGS | INFOSVMGA | INFOSVMGS
Colon Cancer | 5        | 2        | 46      | 2       | 47        | 2
DLBCL        | 18       | 8        | 36      | 5       | 26        | 6
Leukemia     | 18       | 4        | 77      | 3       | 77        | 3

TABLE IV. COMPARING PRECISION (P), RECALL (R), F-MEASURE (F), AND ACCURACY RATE (AR) OF THE HYBRID FEATURE SELECTION MODELS (%)

Datasets     |    | CFSSVMGA | CFSSVMGS | GRSVMGA | GRSVMGS | INFOSVMGA | INFOSVMGS
Colon Cancer | P  | 90.30    | 87.30    | 88.60   | 87.30   | 88.60     | 87.30
             | R  | 90.30    | 87.10    | 88.70   | 87.10   | 88.70     | 87.10
             | F  | 90.30    | 87.20    | 88.65   | 87.20   | 88.65     | 87.20
             | AR | 90.32    | 87.09    | 88.71   | 87.09   | 88.70     | 87.09
DLBCL        | P  | 74.90    | 71.90    | 70.10   | 70.60   | 75.00     | 74.10
             | R  | 75.00    | 72.10    | 70.00   | 70.80   | 75.00     | 74.20
             | F  | 74.95    | 72.00    | 70.50   | 70.70   | 75.00     | 74.15
             | AR | 75.00    | 72.08    | 70.00   | 70.83   | 75.00     | 74.16
Leukemia     | P  | 100      | 100      | 98.60   | 100     | 100       | 94.90
             | R  | 100      | 100      | 98.60   | 100     | 100       | 94.40
             | F  | 100      | 100      | 98.60   | 100     | 100       | 94.65
             | AR | 100      | 100      | 98.61   | 100     | 100       | 94.44

V. CONCLUSION

DNA microarray is a technique that applies nucleic acid hybridization for the validation of gene expression data at the same time. However, although many genes can be found, most of them are not related to cancer types. The focus of this research is a comparison of hybrid feature selection models created from filter and wrapper approaches (CFSSVMGA, CFSSVMGS, GRSVMGA, GRSVMGS, INFOSVMGA, and INFOSVMGS) on three public gene expression datasets: Colon Cancer, DLBCL, and Leukemia. In terms of dimension reduction, the hybrid feature selection models are better than existing methods because they reduce the dimensionality below that of traditional feature selection techniques. For instance, the feature subsets from CFS (26, 26, 76), GR (135, 36, 874), and INFO (135, 57, 874) are reduced to 2, 8, and 4 by CFSSVMGS; to 2, 5, and 3 by GRSVMGS; and to 2, 6, and 3 by INFOSVMGS. The experimental results conclude that the CFSSVMGA hybrid feature selection model, which combines CFS, SVM, and GA, had higher efficiency than the other models. This hybrid model reduces the dimensionality of the three gene expression datasets, Colon Cancer (2001), DLBCL (7399), and Leukemia (7129), to 5, 18, and 18 features, with accuracy rates of 90.32%, 75%, and 100%, respectively. However, accuracy tends to drop when the hybrid models choose very few features. For example: I) CFSSVMGS, GRSVMGS, and INFOSVMGS select the same number of attributes on the Colon Cancer dataset, so their accuracy rate is 87.09%, which is lower than the other hybrid models; II) INFOSVMGS selects 3 attributes and achieves a lower accuracy rate (94.44%) than the other methods on the Leukemia dataset.

REFERENCES

[1] Hongbo Xie, Uros Midic, Slobodan Vucetic, and Zoran Obradovic, "Handbook of Applied Algorithms," John Wiley & Sons, 2008, pp. 116-117.
[2] P. Lance, H. Ehtesham, and L. Huan, "Subspace Clustering for High Dimensional Data: A Review," SIGKDD Explorations Newsletter, vol. 6, 2004, pp. 90-105.
[3] S. Mukherjee and S. J. Roberts, "A Theoretical Analysis of Gene Selection," Proceedings of the Computational Systems Bioinformatics Conference (CSB 2004), 2004, pp. 131-141.
[4] J. Jaeger, R. Sengupta, and W. L. Ruzzo, "Improved Gene Selection for Classification of Microarrays," Pacific Symposium on Biocomputing 8, 2003, pp. 53-64.
[5] Cheng-San Y., C. Li-Yeh, et al., "A Hybrid Approach for Selecting Gene Subsets Using Gene Expression Data," Soft Computing in Industrial Applications (SMCia '08), IEEE Conference, 2008, pp. 159-164.
[6] Hikaru Mitsubayashi, Seiichiro Aso, Tomomasa Nagashima, and Yoshifumi Okada, "Accurate and Robust Gene Selection for Disease Classification Using a Simple Statistic," Bioinformation 3(2), 2008, pp. 68-71.
[7] Jin-Hyuk H. and C. Sung-Bae, "Cancer Classification with Incremental Gene Selection Based on DNA Microarray Data," Computational Intelligence in Bioinformatics and Computational Biology, IEEE Symposium, 2008, pp. 70-74.
[8] Kamal A., X. Zhu, A. Pandya, S. Hsu, and M. Shoaib, "The Impact of Gene Selection on Imbalanced Microarray Expression Data," Bioinformatics and Computational Biology, 2009, pp. 259-269.
[9] R. Ruiz, et al., "Incremental Wrapper-based Gene Selection from Microarray Data for Cancer Classification," Pattern Recognition, vol. 39, 2006, pp. 2383-2392.
[10] R. Dabney, "Classification of Microarrays to Nearest Centroids," Bioinformatics, vol. 21(22), 2005, pp. 4148-4154.
[11] Q. Shen, W.-m. Shi, and W. Kong, "New Gene Selection Method for Multiclass Tumor Classification by Class Centroid," Journal of Biomedical Informatics, vol. 42, 2009, pp. 59-65.
[12] M. Mohamad, S. Omatu, S. Deris, M. Misman, and M. Yoshioka, "A Multi-Objective Strategy in Genetic Algorithms for Gene Selection of Gene Expression Data," Artificial Life and Robotics, vol. 13, 2009, pp. 410-413.
[13] K.-A. L. Cao, A. Bonnet, and S. Gadat, "Multiclass Classification and Gene Selection with A Stochastic Algorithm," Computational Statistics & Data Analysis, vol. 53, 2009, pp. 3601-3615.
[14] P. Mundra and J. Rajapakse, "F-score with Pareto Front Analysis for Multiclass Gene Selection," Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, 2009, pp. 56-67.
[15] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006, pp. 150-163.
[16] J. R. Quinlan, "Induction of Decision Trees," Machine Learning 1(1), 1986, pp. 81-106.
[17] Mark A. Hall, "Correlation-based Feature Selection for Machine Learning," Ph.D. thesis, Department of Computer Science, The University of Waikato, New Zealand, 1999, pp. 69-71.
[18] D. Tammasiri and P. Meesad, "Credit Scoring using Data Mining based on Support Vector Machine and Grid," The 5th National Conference on Computing and Information Technology, 2009, pp. 249-257.
[19] Huan Liu and Hiroshi Motoda, Computational Methods of Feature Selection, Chapman & Hall/CRC, 2008.