Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis

Giorgio Valentini
DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Italy
INFM, Istituto Nazionale di Fisica della Materia, Italy
Email: [email protected]

Marco Muselli
IEIIT, Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Italy
Email: [email protected]

Francesca Ruffino
DIMA, Dipartimento di Matematica, Università di Genova, Italy
Abstract— Extracting information from gene expression data is a difficult task, as these data are characterized by very high dimensionality, small sample sizes, and a large degree of biological variability. However, a possible way of dealing with the curse of dimensionality is offered by feature selection algorithms, while variance problems arising from small samples and biological variability can be addressed through ensemble methods based on resampling techniques. These two approaches have been combined to improve the accuracy of Support Vector Machines (SVM) in the classification of malignant tissues from DNA microarray data. Proper measures have been introduced to assess the accuracy and the confidence of the predictions. The presented results show that bagged ensembles of SVM are more reliable and achieve equal or better classification accuracy with respect to single SVM, whereas feature selection methods can further enhance classification accuracy.
I. INTRODUCTION

DNA microarray technology provides fundamental insights into the mRNA levels of large sets of genes, offering in such a way an approximate picture of the proteins of a cell at a given time [13]. The large amount of gene expression data produced requires statistical and machine learning methods to analyze and extract significant knowledge from DNA microarray experiments. Typical problems arising from this analysis range from the prediction of malignancies [15], [17] (a classification problem from a machine learning point of view), to the functional discovery of new classes or subclasses of diseases [1] (an unsupervised learning problem), to the identification of groups of genes responsible for or correlated with malignancies or polygenic diseases [11] (a feature selection problem).

Several supervised methods have been applied to the analysis of cDNA microarrays and high-density oligonucleotide chips. These methods include decision trees, Fisher linear discriminant, Multi-Layer Perceptrons (MLP), Nearest-Neighbour classifiers, linear discriminant analysis, Parzen windows and others [5], [8], [10], [12], [14]. In particular, Support Vector Machines (SVM) have recently been applied to the analysis of DNA microarray gene expression data in order to classify functional groups of genes, normal and malignant tissues, and multiple tumor types [5], [9], [17]. Other works pointed out the importance of feature selection methods to reduce the high dimensionality of the input space and to select the most relevant genes associated with specific functional classes [11].
0-7803-7898-9/03/$17.00 ©2003 IEEE
Furthermore, ensembles of learning machines are well-suited for gene expression data analysis, as they can reduce the variance due to the low cardinality of the available training sets, and the bias due to specific characteristics of the learning algorithm [7]. Indeed, in recent works, combinations of binary classifiers (one-versus-all and all-pairs) and Error Correcting Output Coding (ECOC) ensembles of MLP, as well as ensemble methods based on resampling techniques, such as bagging and boosting, have been applied to the analysis of DNA microarray data [8], [15], [17].

In this work we show that the combination of feature selection methods and bagged ensembles of SVM can enhance the accuracy and the reliability of predictions based on gene expression data. In the next section the standard technique for training SVM with soft margin is presented, together with a description of the considered method for feature selection. Then, the procedure for bagging SVM is introduced, examining different possible choices for the combination of classifiers. Finally, proper measures are employed to evaluate the performance of the proposed approach on two data sets available on-line, concerning tumor detection based on gene expression data produced by DNA microarrays.

II. SVM TRAINING AND FEATURE SELECTION

We can represent the output of a single experiment with a DNA microarray as a pair (x, y), being x ∈ R^d a vector containing the expression levels of d selected genes and y ∈ {−1, +1} a binary variable determining the classification of the considered cell. As an example, y = +1 can be used to denote a tumoral cell and y = −1 a normal cell. It is then evident that in our analysis every cell is associated with an input vector x containing the gene expression levels. When n different experiments are performed, we obtain a collection of n pairs T = {(x_j, y_j) : j = 1, ..., n} (training set); suppose, without loss of generality, that the first n^+ pairs have y_j = +1, whereas the remaining n^- = n − n^+ possess a negative output y_j = −1.

The target of a machine learning method is to construct from the pairs {(x_j, y_j)}_{j=1}^n a classifier, i.e. a decision function h : R^d → {−1, +1}, that gives the correct classification y = h(x) for every cell (determined by x). To achieve this target, many available techniques generate a discriminant function f : R^d → R from the sample T at hand
and build h by employing the formula

    h(x) = \mathrm{sign}(f(x))    (1)

where the function sign(z) gives as output +1 if z ≥ 0 and −1 otherwise. Among these techniques, SVM [6] turn out to be a promising approach, due to their theoretical motivations and their practical efficiency. They employ the following expression for the discriminant function

    f(x) = b + \sum_{j=1}^{n} \alpha_j y_j K(x_j, x)    (2)
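As a concrete illustration of Eqs. (1)-(2), the sketch below evaluates the discriminant and decision functions in plain Python. The coefficients α_j, the bias b and the support patterns used here are hypothetical placeholders standing in for the output of the training procedure described next.

```python
def linear_kernel(u, v):
    # K(u, v) = u . v, the simplest admissible kernel
    return sum(ui * vi for ui, vi in zip(u, v))

def discriminant(x, support, alphas, labels, b, kernel=linear_kernel):
    # Eq. (2): f(x) = b + sum_j alpha_j * y_j * K(x_j, x)
    return b + sum(a * y * kernel(xj, x)
                   for a, y, xj in zip(alphas, labels, support))

def classify(x, *args, **kwargs):
    # Eq. (1): h(x) = sign(f(x)), with sign(0) = +1
    return 1 if discriminant(x, *args, **kwargs) >= 0 else -1

# Toy (hypothetical) trained quantities: two support vectors on the x-axis
support = [[1.0, 0.0], [-1.0, 0.0]]
alphas = [1.0, 1.0]
labels = [1, -1]
b = 0.0
```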
where the scalars α_j are obtained, in the soft margin version, through the solution of the following quadratic programming problem: minimize the cost function

    W(\alpha) = \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k K(x_j, x_k) - \sum_{j=1}^{n} \alpha_j

subject to the constraints

    \sum_{j=1}^{n} \alpha_j y_j = 0 ,    0 \le \alpha_j \le C  for  j = 1, ..., n

being C a regularization parameter. The symmetric function K(·, ·) must be chosen among the kernels of Reproducing Kernel Hilbert Spaces [16]; three possible choices are:
• Linear kernel: K(u, v) = u · v
• Polynomial kernel: K(u, v) = (u · v + 1)^γ
• Gaussian kernel: K(u, v) = exp(−‖u − v‖²/σ²)

Since the point α of minimum of the quadratic programming problem can have several null components α_j = 0, the sum in Eq. 2 receives the contribution of a subset V of patterns x_j in T, called support vectors. The bias b in the SVM classifier is usually set to

    b = \frac{1}{|V|} \sum_{x \in V} \sum_{j=1}^{n} \alpha_j y_j K(x_j, x)

where |V| denotes the number of elements of the set V.

The accuracy of a classifier is affected by the dimension d of the input vector; roughly, the greater is d, the lower is the probability of correctly classifying a pattern x. For this reason, feature selection methods are employed to choose a subset of relevant inputs (genes) for the problem at hand, so as to reduce the number of components x_i. A simple feature selection method, originally proposed in [10], associates with every gene expression level x_i a quantity c_i given by

    c_i = \frac{\mu_i^+ - \mu_i^-}{\sigma_i^+ + \sigma_i^-}    (3)

where μ_i^+ and μ_i^- are the mean values of x_i across all the input patterns in T with positive and negative output, respectively:

    \mu_i^+ = \frac{1}{n^+} \sum_{j=1}^{n^+} x_{ji} ,    \mu_i^- = \frac{1}{n^-} \sum_{j=n^+ + 1}^{n} x_{ji}

having denoted with x_{ji} the ith component of the input vector x_j. Similarly, σ_i^+ and σ_i^- are the standard deviations of x_i computed in the set of pairs with positive and negative output, respectively. Then, the genes are ranked according to their c_i value, and the first m and the last m genes are selected, thus obtaining a set of 2m inputs. The main problem of this approach is the underlying independence assumption on the expression patterns of each gene: indeed, it fails in detecting the role of coordinately expressed genes in carcinogenic processes. Eq. 3 can also be used to compute the weights for weighted gene voting [10], a minor variant of diagonal linear discriminant analysis [8].

III. BAGGED ENSEMBLES OF SVM

The low cardinality of the available data and the large degree of biological variability in gene expression suggest to apply variance-reduction methods, such as bagging, to these tasks. Denote with {T_b}_{b=1}^B a set of B (bootstrapped) samples, whose elements are drawn with replacement from the training set T according to a uniform probability distribution. Let f_b be the discriminant function obtained by applying the soft-margin SVM learning algorithm on the bootstrapped sample T_b. The corresponding decision function h_b is computed as usual through Eq. 1. The generalization ability of the classifiers h_b (base learners) can be improved by aggregating them through the standard formula (for two-class classification problems) [3]:

    h_{st}(x) = \mathrm{sign}\left( \sum_{b=1}^{B} h_b(x) \right)    (4)

In this way the decision function h_{st}(x) of the bagged ensemble selects the most voted class among the B classifiers h_b. Other choices of discriminant function for the bagged ensemble are possible, some of which lead to the above standard decision function h_{st}(x) through Eq. 1. The following three expressions also allow to evaluate the quality of the classification offered by the bagged ensemble:

    f_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)

    f_{win}(x) = \frac{1}{|B^*|} \sum_{b \in B^*} f_b(x)

    f_{max}(x) = h_{st}(x) \cdot \max_{b \in B^*} |f_b(x)|

where the set B^* = {b : h_b(x) = h_{st}(x)} contains the indices b of the base learners that vote for the class h_{st}(x). Note that f_{avg}(x) is the average of the f_b(x), whereas f_{win}(x) and f_{max}(x) are, respectively, the average of the discriminant functions of the classifiers having indices in B^* and the signed maximum of their absolute values.
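To make the combination rules concrete, here is a minimal pure-Python sketch (not from the paper) that, given assumed values f_b(x) of the base discriminant functions on a single pattern x, computes h_st(x) of Eq. 4 together with f_avg, f_win and f_max.

```python
def sign(z):
    # sign(0) = +1, as in Eq. (1)
    return 1 if z >= 0 else -1

def combine(f_values):
    """f_values: list of f_b(x) for b = 1..B (assumed nonempty).
    Returns (h_st, f_avg, f_win, f_max) for this single pattern x."""
    B = len(f_values)
    votes = [sign(f) for f in f_values]
    h_st = sign(sum(votes))                        # Eq. (4): majority vote
    f_avg = sum(f_values) / B                      # average of all discriminants
    # B* = base learners voting for the winning class (always nonempty)
    winners = [f for f, h in zip(f_values, votes) if h == h_st]
    f_win = sum(winners) / len(winners)            # average over B*
    f_max = h_st * max(abs(f) for f in winners)    # signed maximum over B*
    return h_st, f_avg, f_win, f_max
```

For example, three base SVMs returning f_b(x) = 0.5, −0.2, 0.9 vote 2-to-1 for the positive class, so h_st = +1, f_avg = 0.4, f_win = 0.7 and f_max = 0.9.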
Fig. 1. Results obtained with single SVM for different numbers of selected genes. Colon data set: (a) Success and acceptance rate; (b) Extremal and median margin. Leukemia data set: (c) Success and acceptance rate; (d) Extremal and median margin.
The corresponding decision functions are given by

    h_{avg}(x) = \mathrm{sign}(f_{avg}(x))
    h_{win}(x) = \mathrm{sign}(f_{win}(x)) = h_{st}(x)
    h_{max}(x) = \mathrm{sign}(f_{max}(x)) = h_{st}(x)

While h_{win}(x) and h_{max}(x) are equivalent to the standard choice h_{st}(x), h_{avg}(x) selects the class associated with the average of the discriminant functions computed by the base learners. Thus, the decision of each classifier in the ensemble is weighted via its prediction strength, measured by the value of the discriminant function f_b; on the contrary, in the decision function h_{st}(x) each base learner receives the same weight.

IV. ASSESSMENT OF CLASSIFIER QUALITY

Besides the success rate

    Succ = \frac{1}{2n} \sum_{j=1}^{n} |y_j + h(x_j)|

which is an estimate of the generalization accuracy, several alternative measures can be used to assess the quality of classifiers producing a discriminant function f(x). These measures can then be directly applied to evaluate the confidence of the classification performed by single SVM and bagged ensembles of SVM. By generalizing a definition introduced in [10], [11], a first choice is the extremal margin M_ext, defined as
    M_{ext} = \frac{\theta^+ - \theta^-}{\max_{1 \le j \le n} f(x_j) - \min_{1 \le j \le n} f(x_j)}    (5)
Fig. 2. Comparison of results obtained with single and bagged SVM on the Leukemia data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.
where the quantities θ^+ and θ^- are given by

    \theta^+ = \min_{1 \le j \le n^+} f(x_j) ,    \theta^- = \max_{n^+ + 1 \le j \le n} f(x_j)

It can be easily seen that the larger the value of M_ext, the more confident is the classifier; note that if there are no classification errors, M_ext is positive. An alternative measure, less sensitive to outliers, is the median margin M_med, defined as

    M_{med} = \frac{\lambda^+ - \lambda^-}{\max_{1 \le j \le n} f(x_j) - \min_{1 \le j \le n} f(x_j)}    (6)

where λ^+ and λ^- are the median values of f(x) for the positive and negative class, respectively:

    \lambda^+ = \min\{\lambda \in R : |J_\lambda^+| \ge n^+/2\}
    \lambda^- = \max\{\lambda \in R : |J_\lambda^-| \ge n^-/2\}

The sets J_λ^+ (resp. J_λ^-) contain the indices j of the input patterns x_j in the training set where the discriminant function f(x_j) is greater (resp. lower) than λ:

    J_\lambda^+ = \{j : f(x_j) > \lambda\} ,    J_\lambda^- = \{j : f(x_j) < \lambda\}

Finally, the acceptance rate Acc measures the fraction of samples that are correctly classified with high confidence. It is defined by the expression

    Acc = \frac{|J_\theta^+| + |J_{-\theta}^-|}{n}    (7)

where θ = max{|θ^+|, |θ^-|} defines the smallest symmetric rejection zone needed to get zero error. It is important to remark that the acceptance rate is highly sensitive to the presence of outliers.
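The measures above can be sketched in a few lines of Python. This is our own illustration, not the authors' code; in particular we take λ^+ and λ^- directly as the per-class medians of f, which we assume is the intended reading of the formal definition.

```python
from statistics import median

def quality_measures(f_vals, y_vals):
    """Succ and Eqs. (5)-(7) from discriminant values f(x_j) and labels y_j in {-1,+1}.
    Assumes both classes are present and f is not constant."""
    n = len(f_vals)
    pos = [f for f, y in zip(f_vals, y_vals) if y == 1]
    neg = [f for f, y in zip(f_vals, y_vals) if y == -1]
    spread = max(f_vals) - min(f_vals)
    # Success rate: |y + sign(f)| is 2 on a correct prediction, 0 otherwise
    succ = sum(abs(y + (1 if f >= 0 else -1))
               for f, y in zip(f_vals, y_vals)) / (2 * n)
    theta_p, theta_n = min(pos), max(neg)            # extremal values per class
    m_ext = (theta_p - theta_n) / spread             # Eq. (5)
    m_med = (median(pos) - median(neg)) / spread     # Eq. (6), medians per class
    theta = max(abs(theta_p), abs(theta_n))          # smallest symmetric rejection zone
    acc = (sum(1 for f in f_vals if f > theta)
           + sum(1 for f in f_vals if f < -theta)) / n   # Eq. (7)
    return succ, m_ext, m_med, acc
```

For instance, f = (0.8, 0.6, −0.3, −0.9) with labels (+1, +1, −1, −1) yields Succ = 1, M_ext = 0.9/1.7 and Acc = 0.5, since only the two extreme samples fall outside the rejection zone θ = 0.6.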
Fig. 3. Comparison of results obtained with single and bagged SVM on the Colon data set, when varying the number of selected genes: (a) Success rate (b) Acceptance rate (c) Extremal margin (d) Median margin.
V. NUMERICAL EXPERIMENTS

Here we present the results of classifying DNA microarray data with the proposed techniques. We applied linear SVM classifiers to separate normal and malignant tissues, with and without feature selection. Then we compared the results obtained with single and bagged SVM, using in all cases the filter method for feature selection described in Sec. II.

A. Data sets

The proposed approach has been tested on DNA microarray data available on-line. In particular, we used the Colon cancer data set [2], constituted by 62 samples including 22 normal and 40 colon cancer tissues. The data matrix contains the expression values of 2000 genes and has been preprocessed by taking the logarithm of all values and by normalizing the feature (gene) vectors. This has been performed by subtracting the mean over all training values, dividing by the corresponding standard deviation, and finally passing the result through a squashing arctan function to diminish the importance of outliers. The whole data set has been randomly split into a training and a test set of equal size, each one with the same proportion of normal and malignant examples.

We also compared the different classifiers on the Leukemia data set [10]. It is composed of two variants of leukemia, ALL and AML, for a total of 72 examples, split into a training set of 38 samples and a test set of 34 samples, with 7129 different genes.
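The preprocessing described above can be sketched as follows. This is a pure-Python illustration under our reading of the text (per-gene standardization of the log-values followed by arctan squashing); the function and variable names are our own.

```python
import math

def preprocess(expr_matrix):
    """expr_matrix: rows = samples, columns = genes, positive expression values.
    Returns the log-transformed, per-gene standardized, arctan-squashed matrix."""
    logged = [[math.log(v) for v in row] for row in expr_matrix]
    n_genes = len(logged[0])
    out = [[0.0] * n_genes for _ in logged]
    for i in range(n_genes):
        col = [row[i] for row in logged]
        mu = sum(col) / len(col)
        sd = (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        for j, row in enumerate(logged):
            # arctan squashing damps the influence of outlying expression values
            out[j][i] = math.atan((row[i] - mu) / sd)
    return out
```

Note that the squashed values are bounded in (−π/2, π/2), which is what limits the leverage of outliers on the subsequent SVM training.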
B. Results

Fig. 1 summarizes the results with single SVM, obtained by varying the number of genes selected with the filter method described in Sec. II and by using the measures for classifier assessment introduced in Sec. IV. With the Colon data set, the accuracy does not change significantly when the feature selection method is applied; however, the prediction is more reliable, as attested by the higher values of Acc and M_med (Fig. 1a and 1b), when the number of inputs lies beyond 256. On the contrary, we obtain the highest success rate on the Leukemia data set with only 16 selected genes; the corresponding acceptance rate is also significantly high (Fig. 1c). The extremal margin is negative but very close to 0, thus showing that the Leukemia data set is nearly linearly separable, with a relatively high confidence (Fig. 1d).

Fig. 2 and 3 compare the results obtained through the application of bagged ensembles of SVM (for different choices of the decision function) with those achieved by single SVM. On the Leukemia data set, bagging seems not to improve the success rate, even if the predictions are more reliable, especially when a small number of selected genes is used (Fig. 2). On the contrary, bagging significantly improves the success rate scored on the Colon data set, both with and without feature selection (Fig. 3a). Considering the acceptance rate, there are no significant differences between bagged SVM employing f_avg or f_win and single SVM, whereas bagged SVM adopting f_max achieve the highest values of Acc if the number of genes is less than or equal to 512; for higher values the opposite situation occurs (Fig. 3b). While bagged SVM (especially when f_max is used) show better values of the extremal margin with respect to single SVM when small numbers of genes are selected, we observe the opposite behavior if the number of considered genes is relatively large (Fig. 3c). Finally, bagged ensembles show clearly larger median margins with respect to single SVM, confirming a better overall reliability (Fig. 3d).

Summarizing, bagged ensembles seem to be more accurate and confident in their predictions with respect to single SVM. The simple gene selection method adopted is effective with the Leukemia data set, both when single and bagged SVM are used, while the accuracy on the Colon data set seems to be independent of the application of feature selection. The results obtained with single SVM are comparable to those presented in [11]; however, the application of the recursive feature elimination method allows to achieve better results than those obtained with bagged ensembles of SVM, at least for the Leukemia data set. Anyway, it is difficult to establish whether a statistically significant difference between the two approaches exists, given the small size of the available samples.

VI. CONCLUSIONS
The results show that bagged ensembles of SVM are more reliable than single SVM in classifying DNA microarray data. Moreover, they obtain an equivalent or better accuracy in separating normal from malignant tissues, at least with the Colon and Leukemia data sets. In fact, bagging is a variance-reduction method which is able to improve the stability of classifiers [4], especially when the training set at hand has small size and large dimensionality, as in the present case.

Despite its simplicity, the application of the feature selection method we used in our experiments allows to achieve better values of the success rate. However, it does not take into account the interactions between the expression levels of different genes. In order to manage this effect, we plan to employ more refined gene selection methods [11], in combination with bagging, to further improve the accuracy and the reliability of predictions based on DNA microarray data.

ACKNOWLEDGMENT

This work was partially funded by INFM, unità di Genova.

REFERENCES

[1] A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000.
[2] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96:6745–6750, 1999.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[4] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
[5] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 97(1):262–267, 2000.
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
[7] T.G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli, editors, Multiple Classifier Systems. First International Workshop, MCS 2000, Cagliari, Italy, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 2000.
[8] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. JASA, 97(457):77–87, 2002.
[9] T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
[10] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3):389–422, 2002.
[12] J. Khan et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673–679, 2001.
[13] D.J. Lockhart and E.A. Winzeler. Genomics, gene expression and DNA arrays. Nature, 405:827–836, 2000.
[14] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Fifth International Conference on Computational Molecular Biology, 2001.
[15] G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26(3):283–306, 2002.
[16] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.
[17] C. Yeang et al. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, pages 316–322, Copenhagen, Denmark, 2001. Oxford University Press.