Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1–9

Quality Assessment of Gene Selection in Microarray Data

Cheong Hee Park, Moongu Jeon, Panos Pardalos, and Haesun Park

(Received 00 Month 200x; in final form 00 Month 200x)

Abstract: In microarray data, gene selection can make data analysis efficient, and biological interpretation of the selected genes can be very useful. However, microarray data typically has several thousand genes but only tens of samples, a situation referred to as the small sample size problem. In this paper, we discuss some problems in gene selection that can arise from a small sample size: whether gene selection relying on such an extremely small number of samples is reliable and meaningful. Experimental comparisons of three well-known gene selection methods show that classification performance can be very sensitive to the training samples and the preprocessing steps. We also measure consistency in gene ranking under changes of training samples and under different selection criteria.

Keywords: classification, feature selection, gene selection, microarray data

2000 Mathematics Subject Classifications: 68U01, 68W99, 65K99, 62P10

1 Introduction

Feature selection refers to the problem of selecting the subset of features that is best suited for an intended task such as classification, clustering, or regression [1–3]. When the goal is classification performance, the optimal features are those that discriminate the classes well; hence one needs an optimization criterion by which good features can be distinguished from bad ones. The optimization criterion for feature selection can be independent of the classifier that operates on the selected features, or a classifier can be used in both the selection process and the classification stage. Recently, microarray data has drawn much attention in bioinformatics research. Microarray data records the expression levels of thousands of genes measured under various experimental settings [4]. Microarray

Cheong Hee Park: Dept. of Computer Science and Engineering, Chungnam National University, 220 Gung-dong, Yuseong-gu, Daejeon 305-764, South Korea ([email protected]).
Moongu Jeon: Dept. of Mechatronics, Gwangju Institute of Science and Technology, 1 Oryong-dong, Buk-gu, Gwangju 500-712, South Korea ([email protected]).
Panos Pardalos: Dept. of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA ([email protected]).
Haesun Park: College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332, USA ([email protected]), and The National Science Foundation, 4201 Wilson Boulevard, Arlington, VA 22230, USA ([email protected]).


data typically has several thousand genes (features) but only tens of experiments (samples), a situation referred to as the small sample size problem. Selecting a small subset from thousands of genes can make data analysis efficient, and biological interpretation of the selected genes can be very useful. Many approaches, from traditional statistical data analysis to newly developed data mining algorithms, have been applied to gene selection problems, showing that a subset of fewer than several hundred genes can be effective for classification [5–8]. In [5], the discriminating power of individual genes in separating the classes is measured using the means and standard deviations of the gene expression levels. In recursive feature elimination, a Support Vector Machine (SVM) is trained and the feature with the smallest ranking criterion is removed; repeating this process produces a gene ranking backwards [6]. A heuristic algorithm of sequential forward feature selection with Fisher Discriminant Analysis (FDA) was applied to a colon cancer classification problem [7]. These methods can be considered representatives of different methodologies: forward feature selection [7] vs. backward feature elimination [6], independent gene ranking [5] vs. sequential gene selection [6, 7], and maximizing margins between classes [6] vs. optimizing the between-class and within-class scatters [5, 7]. In this paper, we discuss some fundamental but important problems in gene selection for microarray data. Through experimental comparisons of the above three methods, we address whether gene selection relying on an extremely small number of samples is reliable and meaningful. We show that classification performance can be very sensitive to the training samples and the preprocessing steps applied. We also measure the consistency of the selected genes under changes of the training samples, within each method and between different methods.

2 Gene Selection Methods

Microarray data can be represented as an m × n matrix

$$
A = \begin{bmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & \ddots & \vdots \\ g_{m1} & \cdots & g_{mn} \end{bmatrix}, \qquad (1)
$$

where a column vector $s_j = [g_{1j}, \ldots, g_{mj}]^T$ denotes the j-th sample and $g_i = [g_{i1}, \ldots, g_{in}]$ represents the expression levels of the i-th gene across the n samples. Let $n_i$ (i = 1, 2) be the number of samples in each class, with $n = n_1 + n_2$. Classification refers to the problem of assigning a class label to new samples based on the given information. In cancer classification it corresponds to distinguishing a cancer class from a normal class, or to distinguishing different cancer classes. It is expected that some genes are better than others at discriminating the two classes, either independently or in combination with other genes. Gene selection aims at selecting those genes, improving classification accuracy, and identifying the genes most responsible for discriminating the classes. In this section, the three gene selection methods [5–7] used in the experimental comparisons of the next section are briefly reviewed.

2.1 Gene Ranking Using Correlation

The method proposed by Golub et al. first orders genes by discriminating power, measured from the class means and standard deviations [5]. For the i-th gene expression $g_i$, let $\mu_{ij}$ and $\sigma_{ij}$ (j = 1, 2) be the mean and standard deviation in each class, respectively. The class separability of the gene expression $g_i$ is defined as

$$
p(g_i) = \frac{\mu_{i1} - \mu_{i2}}{\sigma_{i1} + \sigma_{i2}},
$$

where a positive (negative) sign of $p(g_i)$ corresponds to the gene being highly expressed in class 1 (class 2). Let $p_1, \ldots, p_k$ be the index list of the k most informative genes, composed of the k/2 genes with the largest $p(g_i)$ and the k/2 genes with the largest $-p(g_i)$. Each of the k selected genes serves independently as a class predictor. Let $x = [x_1, \ldots, x_m]^T$ be a new sample to be classified. The class predictor based on the gene expression $g_i$ is

$$
v_i(x_i) = p(g_i)\left(x_i - \frac{\mu_{i1} + \mu_{i2}}{2}\right),
$$

where a positive value casts a vote for class 1 and a negative value casts a vote for class 2. The final prediction is made by summing all votes,

$$
\sum_{i=1}^{k} v_{p_i}(x_{p_i}).
$$

The prediction strength of the k most informative genes was defined in [5] as

$$
PS = \sum_{i=1}^{k} v_{p_i}(x_{p_i}) \Bigg/ \sum_{i=1}^{k} \left| v_{p_i}(x_{p_i}) \right|,
$$

where PS reflects the relative strength of the prediction. In [5], the classification decision was left uncertain when the prediction strength fell below a threshold; see [5] for details. The number of selected genes, k, needs to be determined; we return to the choice of k in Section 3.
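As a concrete illustration, the ranking and voting scheme above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the function and variable names are our own, genes are stored as rows of expression values, and class labels are taken to be 1 and 2.

```python
from statistics import mean, pstdev

def snr(expr, labels):
    """Signal-to-noise score p(g) = (mu1 - mu2) / (sigma1 + sigma2) for one gene."""
    c1 = [x for x, y in zip(expr, labels) if y == 1]
    c2 = [x for x, y in zip(expr, labels) if y == 2]
    return (mean(c1) - mean(c2)) / (pstdev(c1) + pstdev(c2))

def select_informative(genes, labels, k):
    """Indices of the k/2 genes with largest p and the k/2 with largest -p."""
    scores = [(snr(g, labels), i) for i, g in enumerate(genes)]
    up = sorted(scores, reverse=True)[: k // 2]    # highly expressed in class 1
    down = sorted(scores)[: k // 2]                # highly expressed in class 2
    return [i for _, i in up + down]

def vote(genes, labels, idx, x):
    """Weighted-vote predictor: positive vote total -> class 1, negative -> class 2."""
    total = 0.0
    for i in idx:
        c1 = [v for v, y in zip(genes[i], labels) if y == 1]
        c2 = [v for v, y in zip(genes[i], labels) if y == 2]
        total += snr(genes[i], labels) * (x[i] - (mean(c1) + mean(c2)) / 2)
    return 1 if total > 0 else 2
```

The prediction strength PS would be obtained the same way, dividing the signed vote total by the sum of the absolute votes.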

2.2 Recursive Backward Feature Elimination by SVM

While the method in Section 2.1 selects the best genes based on the discriminating power of individual genes, Guyon et al. applied the Support Vector Machine (SVM) to backward feature elimination [6]. SVM is recognized as a state-of-the-art data mining method for binary classification. It finds a hyperplane that maximizes the margin between classes, in the original data space or in a nonlinearly transformed feature space, by identifying the data points on the class boundaries called support vectors [9]. Most practical SVM algorithms solve the dual of the soft-margin optimization problem,

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \, s_i \cdot s_j
\quad \text{subject to} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \;\; 0 \le \alpha_i \le C, \qquad (2)
$$

where the $s_i$ are the training samples, the $y_i$ are the corresponding class labels (−1 or 1), the $\alpha_i$ are the Lagrange multipliers to be computed, and C is the soft-margin parameter supplied by the user. Solving the optimization problem (2) yields the $\alpha_i$ and a linear classifier

$$
y = w \cdot x + b, \qquad (3)
$$

where $w = \sum_{i=1}^{n} y_i \alpha_i s_i$ and $b = y_i - w \cdot s_i$ for a support vector $s_i$. The weight vector $w = [w_1, \ldots, w_m]^T$ is used as a ranking criterion for genes: the gene (or chunk of genes) with the smallest weight $|w_i|$ is eliminated, and SVM is trained again on the remaining gene features. When genes are eliminated one at a time, a complete gene ranking is obtained backwards.
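The elimination loop can be sketched as follows. This is an illustrative sketch, not the implementation of [6]: we use a Pegasos-style subgradient solver as a simple stand-in for training the linear SVM of (2)–(3), the hyperparameters are arbitrary, and all names are our own.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent for a linear soft-margin SVM.
    X: (n_samples, n_features); y in {-1, +1}. Returns the weight vector w."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)           # decaying step size
            w *= (1 - eta * lam)            # shrinkage from the regularizer
            if y[i] * X[i].dot(w) < 1:      # hinge-loss subgradient step
                w += eta * y[i] * X[i]
    return w

def svm_rfe(X, y, n_keep=1):
    """Recursive feature elimination: drop the gene with smallest |w_i|,
    retrain on the rest, repeat. Returns gene indices, weakest first."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        w = train_linear_svm(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))
        eliminated.append(remaining.pop(worst))
    return eliminated + remaining   # the last entries are the strongest genes
```

Eliminating chunks of genes per iteration, as [6] does for efficiency, only changes how many indices are popped per retraining.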

2.3 Forward Feature Selection by FDA

Fisher Discriminant Analysis (FDA) is a statistical dimension reduction method [10]. It finds a linear transformation that optimizes class separability in the reduced-dimensional space, where separability is measured by the between-class and within-class scatters: the transformation maximizes the between-class scatter while minimizing the within-class scatter, projecting the original data onto a one-dimensional space. In recursive backward feature elimination, a classifier is trained, the feature with the smallest ranking criterion is removed, and the process is repeated [6]. Forward feature selection, in contrast, adds at each step the one feature that gives the best prediction accuracy in combination with the previously selected features. Forward feature selection using FDA proceeds as follows [7]:

(i) After k features have been selected, perform steps (a)–(c) for each remaining candidate feature:
  (a) add the candidate to the set of k features already selected;
  (b) perform FDA on the k + 1 features and project the data samples;
  (c) measure prediction accuracy in the projected space.
(ii) Choose as the (k + 1)-th feature the candidate that gives the highest prediction accuracy.
(iii) Repeat steps (i) and (ii), producing a gene ranking forwards.

Since FDA is a dimension reduction method, not a classifier, a classifier must be applied in the reduced-dimensional space; simple methods such as the centroid-based classifier or the nearest neighbor classifier work well. When the number of features exceeds the number of samples, FDA is not applicable due to the singularity of the within-class scatter matrix. In that case LDA/GSVD, which generalizes FDA via the generalized singular value decomposition (GSVD), can be applied [11]. To produce a complete gene ranking, we applied LDA/GSVD whenever the number of features exceeded the number of samples.
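The steps above can be sketched for the two-class case, where the FDA direction reduces to $w = S_W^{-1}(\mu_1 - \mu_2)$. This is a simplified sketch under our own assumptions: we score candidates by resubstitution accuracy of a centroid classifier in the projected space (rather than the leave-one-out accuracy used in the experiments), and a small ridge term stands in for the LDA/GSVD treatment of a singular scatter matrix.

```python
import numpy as np

def fisher_direction(X, y):
    """1-D Fisher direction for two classes: w = Sw^{-1} (mu1 - mu2)."""
    X1, X2 = X[y == 1], X[y == 2]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter = population covariance of each class times its size
    Sw = np.cov(X1.T, bias=True) * len(X1) + np.cov(X2.T, bias=True) * len(X2)
    Sw = np.atleast_2d(Sw) + 1e-6 * np.eye(X.shape[1])   # ridge for singular Sw
    return np.linalg.solve(Sw, mu1 - mu2)

def projected_accuracy(X, y):
    """Centroid-classifier accuracy after projecting onto the FDA direction."""
    z = X @ fisher_direction(X, y)
    c1, c2 = z[y == 1].mean(), z[y == 2].mean()
    pred = np.where(np.abs(z - c1) < np.abs(z - c2), 1, 2)
    return (pred == y).mean()

def forward_select(X, y, n_select):
    """Greedily add the gene whose inclusion maximizes projected accuracy."""
    selected, candidates = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        best = max(candidates,
                   key=lambda j: projected_accuracy(X[:, selected + [j]], y))
        selected.append(best)
        candidates.remove(best)
    return selected
```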

3 Quality Assessment in Gene Selection and Experimental Results

The high dimensionality of microarray gene expression data makes feature selection an important problem, but at the same time the small sample size poses difficulties of its own. Since the number of samples is extremely small (usually under a hundred) compared with the number of features (up to several thousand), overfitting can occur both when determining optimal parameters and when ranking genes. In this section we discuss some problems that are important but occasionally ignored in gene selection. To illustrate our arguments, we conduct experimental comparisons of the three methods of Section 2 on two microarray data sets: the ALL (acute lymphoblastic leukemia) / AML (acute myeloid leukemia) cancer data, and the colon cancer data. The leukemia data set has 7129 genes and 72 samples (47 ALL and 25 AML) [5]; the colon cancer data set has 2000 genes and 62 samples (22 normal tissues and 40 colon cancer tissues) [12]. Both data sets have been widely used as benchmarks for data mining algorithms on microarray data.

3.1 Optimal Number of Genes to Be Selected

In data analysis, a common experimental strategy is to divide the original data into training and test sets. An algorithm is run on the training set to determine optimal parameter values, and the test set is used to evaluate performance. The experiment can be repeated several times with random splits into training and test sets, with the mean and standard deviation used for performance evaluation. Selection of the parameter value, i.e., the number of genes, can be performed by the leave-one-out method on the training set: leaving one sample out for validation, the algorithm is run on the remaining samples and the left-out sample is classified. This process is repeated until every sample in the training set has been used for validation, and the average prediction accuracy is computed. The parameter value giving the highest average leave-one-out prediction accuracy is chosen as optimal. For the leukemia data, as in [5], 27 ALL and 11 AML samples were assigned to the training set and the rest to the test set. After ranking genes based on the training set, performance was measured using the top k genes, both on the training set (by leave-one-out) and on the test set. The three panels of Figure 1 show the prediction accuracies (%) of the three gene selection methods. The horizontal axis denotes the top 100 genes in ranking order, and prediction accuracy (%) is shown on the vertical axis. The figures show that the highs and lows of the prediction accuracy curve on the training set do not agree with those on the test set. The mismatch is worst when the feature selection process is combined with a classifier, as in the third panel, where leave-one-out prediction accuracy on the training set reaches 100% with only the top two genes.
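The leave-one-out protocol just described can be sketched generically (standard library only; the predictor interface and all names are our own illustration, not tied to any one of the three methods):

```python
def loo_accuracy(samples, labels, train_and_predict):
    """Leave-one-out accuracy: hold out each sample, fit on the rest, predict it."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += train_and_predict(train_x, train_y, samples[i]) == labels[i]
    return hits / len(samples)

def choose_k(samples, labels, predictor_for_k, k_values):
    """Pick the number of genes k maximizing leave-one-out accuracy."""
    return max(k_values, key=lambda k: loo_accuracy(
        samples, labels, lambda X, y, x: predictor_for_k(k, X, y, x)))
```

Here `predictor_for_k(k, X, y, x)` would rank genes on the reduced training set `X`, keep the top `k`, and classify `x`; the point of Section 3.1 is that the k chosen this way need not generalize to the test set.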
Beyond that point, gene selection based on leave-one-out prediction accuracy on the training set does not appear to select genes effectively. To test the effect of preprocessing on microarray data, two different preprocessing steps, gene normalization and sample normalization, were applied. Gene normalization transforms each gene expression to zero mean and unit standard deviation, while sample normalization does the same for each sample. The colon cancer data was split randomly into training and test sets of equal size. The three figures in the first column of Figure 2 show the prediction accuracies for the colon cancer data preprocessed by gene normalization, while the figures in the second column were produced with sample normalization followed by gene normalization. From the top, the gene selection method of Golub et al., recursive backward feature elimination by SVM, and forward feature selection by FDA were applied, respectively. Figure 3 shows the results of a different random split under the same experimental conditions as Figure 2. It is quite difficult to choose the optimal number of genes by the leave-one-out method on the training set: while the training-set prediction accuracies quickly reach their highest points and stay stable in most cases, the test-set accuracies fluctuate. Comparing the corresponding panels of Figures 2 and 3, different splits into training and test sets give quite different pictures, and the preprocessing steps can also greatly affect classification performance.

3.2 Consistency in Gene Ranking

In this section we address another question: are the selected genes consistent under changes in the training samples, different preprocessing steps, or different selection criteria? To measure consistency in gene ranking, we define the following measure. Let A and B be the sets of the top k genes in two gene ranking lists, obtained, for example, from different training sets or from different selection methods. The ratio of the genes common to A and B to the genes belonging to either A or B,

$$
\gamma = \frac{|A \cap B|}{|A \cup B|}, \qquad (4)
$$

measures how many of the top k genes the two ranking lists share; in (4), |·| denotes the number of elements in a set. If γ is zero, then A and B are completely disjoint; γ increases up to one, which indicates total agreement in the selected genes. Figure 4 compares the consistency in gene ranking, up to the top 100 genes, between the different methods on the leukemia data. Forward feature selection by FDA shares less than 10% of its top 100 genes with either of the other two methods. Figures 5 and 6 compare the consistency in gene ranking on the colon cancer data. For this experiment we used gene normalization preprocessing and split the colon cancer data randomly into training and test sets of equal size. Repeating the experiment with ten random splits gives ten gene ranking lists for each method. The γ values for each pair of the 10 ranking lists were computed, and their averages used as a measure of the consistency of gene ranking under changes in the training samples. Figure 5 shows the average γ values for each method, and Figure 6 compares the consistency between different methods. Among the top 200 genes, fewer than 40% agree when different training sets are used, and fewer than 20% when different selection methods are applied. Considering that one purpose of gene selection is the biological interpretation of the selected genes, it is notable that the selected genes depend on the training samples and the selection criteria. This dependence on the training data poses a dilemma between the necessity of feature selection, due to the high dimensionality, and its reliability, due to the extremely small sample size.
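The consistency measure (4) and its pairwise averaging translate directly into code (a straightforward sketch; the function names are ours):

```python
from itertools import combinations

def gamma(rank_a, rank_b, k):
    """Consistency measure (4): Jaccard overlap of the top-k genes of two rankings."""
    A, B = set(rank_a[:k]), set(rank_b[:k])
    return len(A & B) / len(A | B)

def average_gamma(rankings, k):
    """Mean gamma over all pairs of ranking lists, as averaged for Figure 5."""
    pairs = list(combinations(rankings, 2))
    return sum(gamma(a, b, k) for a, b in pairs) / len(pairs)
```

Note that because |A ∪ B| grows as the lists diverge, γ penalizes disagreement more sharply than the simple fraction of shared genes: two top-k lists sharing half their genes give γ = k/2 over 3k/2, i.e. 1/3, not 1/2.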

4 Discussion

We have discussed some issues in gene selection for microarray data: whether choosing an optimal number of genes from a gene ranking is reliable and meaningful given the small sample sizes of microarray data, and whether the selected genes are consistent under changes in the training set and across different methods. Gene selection in microarray data is an attractive and important problem; however, the problems caused by small sample sizes should not be treated lightly, and experimental results on microarray data should be examined carefully.

References

[1] A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1):4–37, 2000.
[2] D.W. Hosmer Jr. and S. Lemeshow. Applied Logistic Regression. Wiley, New York, 1989.
[3] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[4] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95:14863–14868, 1998.
[5] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[6] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
[7] L. Wuju and X. Momiao. Tclass: tumor classification system based on gene expression profile. Bioinformatics, 18(2):325–326, 2002.
[8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[10] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley-Interscience, New York, 2001.
[11] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Trans. Pattern Anal. Mach. Intell., 26(8):995–1006, 2004.
[12] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96:6745–6750, 1999.

Figure 1. Prediction accuracies in the training set (by leave-one-out) and in the test set for the leukemia data. From the top left, gene selection by Golub et al., recursive feature elimination by SVM, and forward feature selection by FDA were applied.

Figure 2. Prediction accuracies in the colon cancer data. Three figures in each column were obtained by the three methods respectively. For the left three figures gene normalization preprocessing was applied and the right three figures were from sample normalization followed by gene normalization preprocessing.

Figure 3. Prediction accuracies in the colon cancer data. The figures show the results of a different random splitting, under the same experimental conditions as in Figure 2.

Figure 4. Consistency in gene ranking between different selection methods in the leukemia data.

Figure 5. Consistency in gene ranking obtained by using different training samples in the colon cancer data.

Figure 6. Consistency in gene ranking between different methods in the colon cancer data.
