
Multicategory Classification Using an Extreme Learning Machine for Microarray Gene Expression Cancer Diagnosis

Runxuan Zhang, Guang-Bin Huang, Narasimhan Sundararajan, and P. Saratchandran

Abstract—In this paper, the recently developed Extreme Learning Machine (ELM) is used for direct multicategory classification problems in the cancer diagnosis area. ELM avoids problems like local minima, improper learning rate, and overfitting commonly faced by iterative learning methods and completes the training very fast. We have evaluated the multicategory classification performance of ELM on three benchmark microarray data sets for cancer diagnosis, namely, the GCM data set, the Lung data set, and the Lymphoma data set. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to artificial neural network methods like conventional back-propagation ANN and Linder's SANN, and Support Vector Machine methods like SVM-OVO and Ramaswamy's SVM-OVA. ELM also achieves better accuracies for classification of individual categories.

Index Terms—Extreme learning machine, gene expression, microarray, multicategory, classification, SVM.

R. Zhang is with the Systems Biology Unit, Department of Genomes and Genetics, Institut Pasteur, 25-28 Rue du Dr. Roux, Paris, 75724, France. E-mail: [email protected].
G.-B. Huang, N. Sundararajan, and P. Saratchandran are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798. E-mail: {egbhuang, ensundara, epsarat}@ntu.edu.sg.

Manuscript received 28 May 2005; revised 13 Nov. 2005; accepted 8 May 2006; published online 10 Jan. 2007. Digital Object Identifier no. 10.1109/TCBB.2007.1012. © 2007 IEEE.

1 INTRODUCTION

In the gene expression profiling-based classification area for cancer diagnosis, binary classification problems have been studied extensively, and only a small amount of work has been done on direct multicategory classification problems. Studies also indicate that direct multiclass classification is much more difficult than binary classification and that classification accuracy may drop dramatically as the number of classes increases [1]. Instead of dealing with multicategory problems directly, many classification methods for multicategory problems actually use a combination of binary classifiers on a One-Versus-All (OVA) or a One-Versus-One (OVO) comparison basis [1], [2], [3], [4], [5]. However, this style of implementation combines many binary classifiers and thus increases system complexity. It also causes a greater computational burden and longer training time. For instance, consider the well-known Support Vector Machine (SVM). SVM as a binary classifier tries to map the data from a lower-dimensional input space to a higher-dimensional feature space so as to make the data linearly separable into two classes [6]. When using the one-versus-all approach to make binary classifiers applicable to multicategory problems, c (the number of classes) binary classifiers must be built for SVM to distinguish one class from all of the rest. Similarly, when using the one-versus-one comparison approach, c(c - 1)/2 binary classifiers must be built for SVM to distinguish between every pair of classes; for the 14-class GCM data set, for example, this means 14 OVA classifiers or 14 x 13/2 = 91 OVO classifiers. Thus, as the number of classes c increases, the complexity of the overall classifier also increases.

Ramaswamy et al. [2] have used SVM-OVA as a multiclass classifier for the microarray gene expression GCM data set. In their study, 144 training samples and 46 primary testing samples were used. The training and testing data sets were combined (190 samples, 14 classes) and these data were then randomly split into 100 training and testing sets, each with 144 training samples and 46 testing samples, in a class-proportional manner. The mean classification accuracy was calculated over these 100 splits and compared with other methods like k-NN OVA (k-NN One-Versus-All) in [2].

Artificial neural network (ANN) methods provide an attractive alternative to the above approach for direct multicategory classification [3], [5], [7]. Neural networks can map the input data into different classes directly with one network. Besides, neural network methods can easily accommodate nonlinear features of the gene expression data [7]. Neural networks can also be easily adapted to produce continuous variables instead of discrete class labels, which is useful when one needs to predict the level of a medical indicator rather than classify samples into binary categories [6]. However, conventional neural networks usually produce lower classification accuracy than SVM [8]. Recently, a new neural network-based algorithm, called the Subsequent ANN (SANN), was developed for multicategory microarray analysis [5]. Using the benchmark GCM data set, SANN's classification performance has been compared with the conventional back-propagation ANN method. The SANN method performs a preselection by a simple ANN at the first stage and narrows down the decision scope by selecting the two most preferred classes with the highest activities at the corresponding output nodes. A second ANN is then used for the final decision on these two selected classes at the second stage. In essence, the second stage of SANN uses a pairwise comparison. Linder et al. [5] have evaluated the SANN method on the GCM data set using different numbers of genes and compared their results with the SVM-OVA and ANN methods. However, they carried out this study using only the 144 training samples, performing a 5-fold cross validation. The resulting validation accuracy was then compared with the testing accuracy reported by Ramaswamy et al. [2]. As pointed out by Linder et al. [5], the results of [5] cannot be directly compared with those of [2], since [5] reports the validation accuracy on the 144 training samples, whereas [2] reports the testing accuracy on the 144 + 46 samples.

Neural network schemes in the literature usually adopt gradient-based learning methods [3], [5], which are susceptible to local minima and long training times. To overcome these difficulties, in this paper, we propose using a neural network training algorithm called the Extreme Learning Machine (ELM) [9], [10], [11], [12] and evaluate it for multicategory microarray gene expression cancer diagnosis problems. We have evaluated the performance of the ELM algorithm on three multicategory microarray gene expression cancer diagnosis benchmark data sets, namely, the GCM data set, the Lung data set, and the Lymphoma data set. The number of categories of the GCM data set is 14, while the numbers of categories for the Lung and the Lymphoma data sets are 5 and 3, respectively. For the GCM data set, we have compared the performance of ELM with that of Linder et al. [5] and Ramaswamy et al. [2] separately, using conditions similar to theirs. We have also used SVM-OVO in our comparison study. For the Lung and the Lymphoma data sets, the comparison is done between ELM and SVM-OVO using an approach similar to that of Ramaswamy et al. [2].

For the GCM data set, the results indicate that ELM can perform direct classification for such multicategory microarray problems in a fast and efficient manner. ELM produces higher classification accuracies than those obtained by SANN, SVM-OVO, and SVM-OVA, with a more compact network structure and shorter training time. Studies on the other two data sets indicate that the total training time for ELM is always lower than that of SVM-OVO. In terms of classification accuracy, however, ELM and SVM-OVO perform similarly when the number of classes is five, and SVM-OVO performs better than ELM when the number of classes is three. Even here, ELM achieves accuracies of 96 percent or more.

2 APPROACH

Generally, in a feedforward ANN training scheme, the parameters (weights and biases) of all of the layers need to be tuned by the learning algorithm. Over the last two decades, gradient descent-based methods and their variations, such as Back-Propagation (BP), have formed the backbone of most learning algorithms for feedforward neural networks. However, these gradient descent-based learning methods are generally slow due to improper learning steps and may converge to local minima. They also need many iterative learning epochs to obtain good performance.

Recently, Huang et al. [9], [10], [11] have proposed a new learning algorithm called the Extreme Learning Machine (ELM) for Single-hidden Layer Feedforward neural Networks (SLFNs). In ELM, one may randomly choose (according to any continuous sampling distribution) and fix all of the hidden node parameters and then analytically determine the output weights of the SLFN [9]. After the hidden node parameters are chosen randomly, the SLFN can be considered a linear system and the output weights can be determined analytically through a generalized inverse operation on the hidden layer output matrix. Studies have shown [9] that ELM has good generalization performance and can be implemented easily. Many nonlinear activation functions can be used in ELM, such as sigmoid, sine, and hard-limit functions [12], radial basis functions [10], [11], and complex activation functions [13]. The activation functions used in ELM may be nondifferentiable or even discontinuous. A brief overview of ELM is given below.

2.1 Mathematical Description of the Unified SLFN

The output of an SLFN with $\tilde{N}$ hidden nodes (additive or RBF nodes) can be represented by

$$f_{\tilde{N}}(\mathbf{x}) = \sum_{i=1}^{\tilde{N}} \beta_i \, G(\mathbf{a}_i, b_i, \mathbf{x}), \qquad \mathbf{x} \in \mathbb{R}^n, \ \mathbf{a}_i \in \mathbb{R}^n, \tag{1}$$

where $\mathbf{a}_i$ and $b_i$ are the learning parameters of the hidden nodes and $\beta_i$ is the weight connecting the $i$th hidden node to the output node. $G(\mathbf{a}_i, b_i, \mathbf{x})$ is the output of the $i$th hidden node with respect to the input $\mathbf{x}$. For an additive hidden node with activation function $g(x)\colon \mathbb{R} \to \mathbb{R}$ (e.g., sigmoid or threshold), $G(\mathbf{a}_i, b_i, \mathbf{x})$ is given by

$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i \cdot \mathbf{x} + b_i), \qquad b_i \in \mathbb{R}, \tag{2}$$

where $\mathbf{a}_i$ is the weight vector connecting the input layer to the $i$th hidden node and $b_i$ is the bias of the $i$th hidden node; $\mathbf{a}_i \cdot \mathbf{x}$ denotes the inner product of the vectors $\mathbf{a}_i$ and $\mathbf{x}$ in $\mathbb{R}^n$. For an RBF hidden node with activation function $g(x)\colon \mathbb{R} \to \mathbb{R}$ (e.g., Gaussian), $G(\mathbf{a}_i, b_i, \mathbf{x})$ is given by

$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g(b_i \, \lVert \mathbf{x} - \mathbf{a}_i \rVert), \qquad b_i \in \mathbb{R}^+, \tag{3}$$

where $\mathbf{a}_i$ and $b_i$ are the center and impact factor of the $i$th RBF node and $\mathbb{R}^+$ denotes the set of all positive real values. The RBF network is a special case of the SLFN with RBF nodes in its hidden layer. Each RBF node has its own centroid and impact factor, and its output is given by a radially symmetric function of the distance between the input and the center.
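As a small illustration of (2) and (3), both node types reduce to a few lines of NumPy. This is a minimal sketch; the function names and the particular Gaussian form chosen for the RBF case are our own.

```python
import numpy as np

def additive_node(a, b, x, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Additive hidden node, Eq. (2): g(a . x + b), sigmoid by default."""
    return g(np.dot(a, x) + b)

def rbf_node(a, b, x, g=lambda z: np.exp(-z ** 2)):
    """RBF hidden node, Eq. (3): g(b * ||x - a||) with b > 0 (Gaussian here)."""
    return g(b * np.linalg.norm(x - a))
```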

2.2 Extreme Learning Machine

In supervised batch learning, the learning algorithms use a finite number of input-output samples for training. Here, we consider $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i) \in \mathbb{R}^n \times \mathbb{R}^m$,


where $\mathbf{x}_i$ is an $n \times 1$ input vector and $\mathbf{t}_i$ is an $m \times 1$ target vector. If an SLFN with $\tilde{N}$ hidden nodes can approximate these $N$ samples with zero error, it then implies that there exist $\beta_i$, $\mathbf{a}_i$, and $b_i$ such that

$$f_{\tilde{N}}(\mathbf{x}_j) = \sum_{i=1}^{\tilde{N}} \beta_i \, G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{t}_j, \qquad j = 1, \dots, N. \tag{4}$$

Equation (4) can be written compactly as

$$\mathbf{H} \boldsymbol{\beta} = \mathbf{T}, \tag{5}$$

where

$$\mathbf{H}(\mathbf{a}_1, \dots, \mathbf{a}_{\tilde{N}}, b_1, \dots, b_{\tilde{N}}, \mathbf{x}_1, \dots, \mathbf{x}_N) = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_N) \end{bmatrix}_{N \times \tilde{N}}, \tag{6}$$

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \vdots \\ \boldsymbol{\beta}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}. \tag{7}$$

$\mathbf{H}$ is called the hidden layer output matrix of the network [14]; the $i$th column of $\mathbf{H}$ is the $i$th hidden node's output vector with respect to the inputs $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$, and the $j$th row of $\mathbf{H}$ is the output vector of the hidden layer with respect to the input $\mathbf{x}_j$.

In real applications, the number of hidden nodes, $\tilde{N}$, will always be less than the number of training samples, $N$, and, hence, the training error cannot be made exactly zero but can approach a nonzero training error $\epsilon$. The hidden node parameters $\mathbf{a}_i$ and $b_i$ (input weights and biases, or centers and impact factors) of the SLFN need not be tuned during training and may simply be assigned random values according to any continuous sampling distribution [9], [10], [11]. Equation (5) then becomes a linear system and the output weights $\boldsymbol{\beta}$ are estimated as

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger} \mathbf{T}, \tag{8}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse [15] of the hidden layer output matrix $\mathbf{H}$. The ELM algorithm,^1 which consists of only three steps, can then be summarized as follows.

ELM Algorithm: Given a training set $\aleph = \{(\mathbf{x}_i, \mathbf{t}_i) \mid \mathbf{x}_i \in \mathbb{R}^n, \mathbf{t}_i \in \mathbb{R}^m, i = 1, \dots, N\}$, an activation function $g(x)$, and a hidden node number $\tilde{N}$:

1) Assign random hidden nodes by randomly generating the parameters $(\mathbf{a}_i, b_i)$ according to any continuous sampling distribution, $i = 1, \dots, \tilde{N}$.
2) Calculate the hidden layer output matrix $\mathbf{H}$.
3) Calculate the output weight $\boldsymbol{\beta}$: $\boldsymbol{\beta} = \mathbf{H}^{\dagger} \mathbf{T}$. (9)

The universal approximation capability of ELM has been analyzed by Huang et al. [16] using an incremental method, and it has been shown that single SLFNs with randomly generated additive or RBF nodes and a wide range of activation functions can universally approximate any continuous target function in any compact subset of the Euclidean space $\mathbb{R}^n$. In this paper, the activation function used in ELM is the sigmoidal function $g(x) = \frac{1}{1 + e^{-x}}$.

^1 The source codes of ELM can be downloaded from http://www.ntu.edu.sg/home/egbhuang/.
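To make the three-step procedure concrete, below is a minimal NumPy sketch of an ELM classifier with sigmoid additive hidden nodes, in the spirit of (1)-(8). The class name, the one-hot target encoding, and the way the gain parameter is applied to the sigmoid are our illustrative choices, not the authors' Matlab code.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine with sigmoid additive hidden nodes.

    Hidden parameters (a_i, b_i) are drawn at random and never tuned;
    only the output weights beta are solved for via the pseudoinverse,
    as in Eq. (8): beta = pinv(H) @ T.
    """

    def __init__(self, n_hidden=50, gamma=1.0, seed=None):
        self.n_hidden = n_hidden
        self.gamma = gamma          # gain: larger gamma -> flatter sigmoid
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # H[j, i] = g(a_i . x_j + b_i), the N x N~ matrix of Eq. (6)
        Z = X @ self.A + self.b
        return 1.0 / (1.0 + np.exp(-Z / self.gamma))

    def fit(self, X, y):
        n_features = X.shape[1]
        self.classes_ = np.unique(y)
        # Step 1: random input weights in [-1, 1] and biases in [0, 1]
        self.A = self.rng.uniform(-1.0, 1.0, (n_features, self.n_hidden))
        self.b = self.rng.uniform(0.0, 1.0, self.n_hidden)
        # One-hot target matrix T (N x m), one output node per class
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Step 2: hidden layer output matrix H
        H = self._hidden(X)
        # Step 3: output weights via the Moore-Penrose pseudoinverse
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        # The output node with the largest activation gives the class
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


# Tiny usage example on synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 30))
    y = rng.integers(0, 3, size=120)
    model = ELM(n_hidden=40, gamma=10.0, seed=1).fit(X[:100], y[:100])
    print(model.predict(X[100:]))
```

Note how training involves no iteration at all: a single pseudoinverse solve replaces the epoch-by-epoch weight updates of gradient-based methods, which is the source of ELM's speed advantage discussed throughout the paper.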

3 EXPERIMENTS AND RESULTS

In order to evaluate the performance of the ELM algorithm for multicategory cancer diagnosis, three benchmark microarray data sets, namely, the GCM data set, the Lung data set, and the Lymphoma data set, are used in this paper. All of the studies were carried out in a Matlab environment on a Pentium IV, 2.4 GHz PC with 512 MB memory.

3.1 Experiment 1: Microarray Benchmark Data Set (GCM)

The GCM data set is a collection of microarray data for snap-frozen human tumor and normal tissue specimens, spanning 14 different tumor classes obtained from six institutions and hospitals. This is the first single reference database that covers cancer diagnosis across all of the common malignancies [2]. Ramaswamy et al. [2] have made the data available at http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=61. The training data is from the file named GCM_Training.res, which can be downloaded from the above-mentioned Web site. This file contains expression profiles of 16,063 genes for 144 primary tumor samples spanning 14 common tumor types. There is also a file named GCM_Test.res which contains 46 primary testing samples.

3.1.1 Gene Selection Method

For the purpose of comparison, we use the same gene selection method as used in [5] and [2], namely, the recursive feature elimination method. As introduced in [2], for microarray data with n genes, each SVM-OVA classifier produces a hyperplane w, which is a vector of n elements, each corresponding to the expression of a particular gene. The absolute magnitude of each element of w can be considered a measure of the importance of the corresponding gene. Each SVM-OVA classifier is first trained with all of the genes; then the genes corresponding to the bottom 10 percent of |w_i| are removed, and each classifier is retrained. This process is repeated iteratively, yielding a ranking of all of the genes based on their statistical significance for each class. The most significant 14, 28, 42, 56, 70, 84, and 98 genes selected by this method can be found in the file OVA_MARKERS.xls from the above-mentioned Web site.
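As a rough illustration of this recursive elimination loop, the sketch below uses scikit-learn's LinearSVC in a one-versus-rest setting and repeatedly drops the 10 percent of genes with the smallest aggregate |w|. The use of LinearSVC, the summation of weights across the per-class hyperplanes, and the stopping size are our assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, keep=98, drop_frac=0.10):
    """Recursive feature elimination with a linear one-vs-rest SVM.

    Repeatedly trains on the surviving genes and removes the bottom
    drop_frac of them, ranked by the summed absolute weight |w_i|
    across all one-vs-rest hyperplanes, until `keep` genes remain.
    Returns the indices of the selected genes (columns of X).
    """
    surviving = np.arange(X.shape[1])
    while len(surviving) > keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, surviving], y)
        # coef_ holds one weight vector per class (one-vs-rest);
        # score each gene by its total importance across classes
        importance = np.abs(clf.coef_).sum(axis=0)
        n_drop = max(1, int(drop_frac * len(surviving)))
        n_drop = min(n_drop, len(surviving) - keep)
        order = np.argsort(importance)   # ascending: least important first
        surviving = np.delete(surviving, order[:n_drop])
    return surviving
```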

3.1.2 Comparison with the SANN Algorithm

Linder et al. [5] performed the classification task on the data set GCM_Training.res (144 samples) using their newly developed neural network-based method called SANN. Five-fold cross validation was carried out on these 144 samples and the validation accuracy was reported. For the purpose of comparison, we carried out the classification on the same data using experimental methods similar to [5]: five-fold cross validations are carried out on the 144 samples using ELM and SVM-OVO.

In ELM, the number of output nodes is equal to the number of classes of the problem. The index of the output node with the highest output indicates the class of the corresponding input. For the microarray data set used here, there are 14 classes and, thus, ELM has 14 output nodes. The activation function used in ELM is the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$. Experimental results show that a flatter sigmoid function gives better generalization performance when the ratio between the input dimension and the number of training samples per class is high. Therefore, we need to choose two parameters: the gain parameter $\gamma$, which decides the flatness of the sigmoid function, and the number of hidden nodes, $\tilde{N}$. The best combination of these two parameters is obtained by a grid search over the validation accuracy for each number of genes. We study the validation accuracy using different combinations of the hidden node number $\tilde{N}$ and gain parameter $\gamma$: $\tilde{N} \in \{10, 15, \dots, 90, 95, 100\}$ and $\gamma \in \{10^1, 10^2, \dots, 10^9, 10^{10}\}$. The maximum number of hidden nodes for ELM is set to 100 because there are only 115 training samples available (since we use a 5-fold cross validation on 144 samples). Therefore, for each problem, we try $19 \times 10 = 190$ combinations of parameters $(\tilde{N}, \gamma)$ for ELM. The best combination of parameters for each number of genes is selected according to the validation accuracy; these are shown in Table 1.

We have also carried out the simulations using the One-Versus-One SVM (SVM-OVO) classifier from the toolbox at http://www.ece.osu.edu/~maj/osu_svm/. This toolbox provides a MATLAB interface to SVM classifiers implemented in C++ based on the LIBSVM algorithm.^2 The cost parameter $C$ and kernel parameter $\gamma$ of SVM-OVO are obtained by grid search for each gene number used in this experiment. We estimate the classification accuracy using different combinations of the cost parameter $C$ and kernel parameter $\gamma$: $C \in \{2^{-12}, 2^{-11}, \dots, 2^1, 2^2\}$ and $\gamma \in \{2^{-4}, 2^{-3}, \dots, 2^9, 2^{10}\}$. Therefore, for each gene selection, we try $15 \times 15 = 225$ combinations of parameters $(C, \gamma)$ for SVM. The best parameters for each number of genes are shown in Table 1.

TABLE 1 Optimal Parameters for the ELM and SVM-OVO Algorithms for Selected Genes

^2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
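A compact sketch of this kind of model selection appears below, pairing a k-fold split with an exhaustive sweep over the two ELM parameters. The use of scikit-learn's KFold and accuracy_score is our choice of scaffolding, and `ELM` refers to the illustrative class sketched after Section 2.2, not the authors' Matlab code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def grid_search_elm(X, y, n_hidden_grid, gamma_grid, n_folds=5, seed=0):
    """Pick (n_hidden, gamma) by k-fold cross-validation accuracy.

    `ELM` is the illustrative class from the Section 2.2 sketch.
    Returns the best (n_hidden, gamma, mean_accuracy) triple.
    """
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    best = (None, None, -1.0)
    for n_hidden in n_hidden_grid:
        for gamma in gamma_grid:
            scores = []
            for train_idx, val_idx in kf.split(X):
                model = ELM(n_hidden=n_hidden, gamma=gamma, seed=seed)
                model.fit(X[train_idx], y[train_idx])
                scores.append(accuracy_score(y[val_idx],
                                             model.predict(X[val_idx])))
            acc = float(np.mean(scores))
            if acc > best[2]:
                best = (n_hidden, gamma, acc)
    return best

# The grids used in this experiment: 19 x 10 = 190 combinations
n_hidden_grid = range(10, 105, 5)              # 10, 15, ..., 100
gamma_grid = [10.0 ** k for k in range(1, 11)]  # 10^1, ..., 10^10
```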


TABLE 2 Validation Accuracy (%) of Different Algorithms

For the SVM-OVO and ELM algorithms, all of the input attributes are scaled to the range $[0, 1]$. The hidden node parameters $\mathbf{a}_i$ and $b_i$ of ELM are randomly generated in the ranges $[-1, 1]$ and $[0, 1]$, respectively.

Validation accuracy. The validation accuracy of the different algorithms on the benchmark GCM data set using 144 samples is presented in Table 2 and Fig. 1. These algorithms include ELM, SVM-OVO, SANN, and ANN. All of the results for the SANN and ANN algorithms are quoted from [5]. As observed from Table 2 and Fig. 1, the ELM, SVM-OVO, and SANN algorithms achieve much higher classification accuracy than ANN. For the ELM, SVM-OVO, and SANN algorithms, the classification accuracy tends to grow with the number of genes selected. It can be noted that, for all of these gene selections, ELM achieves the highest classification accuracy.

Fig. 1. Comparison of the classification accuracies of ELM, SVM-OVO, SANN, and ANN.

Training time and network complexity. The total training time, including the cross-validation time, for ELM and SVM-OVO under 5-fold cross validation for each gene number is shown in Table 3. The time given for ELM is based on Matlab, while that for SVM-OVO is based on a C++ implementation; it should be noted that a C++ implementation usually runs 10-50 times faster than Matlab. In spite of this, ELM takes a much smaller training time than SVM, especially when the number of genes selected is high. ELM also takes a significantly lower training time than SANN: as reported in [5], SANN takes 150 minutes to train the data with 14 genes on a Pentium IV double-processor (2 GHz) machine, a platform similar to that of our studies.

TABLE 3 Training Time (s) for the ELM and SVM-OVO Algorithms for 5-Fold Cross Validation on the GCM Data Set

Table 1 also presents the number of hidden nodes for ELM and the number of support vectors of SVM-OVO corresponding to the best classification performance for each gene selection. It can be seen that the number of hidden nodes for ELM is always smaller than the number of support vectors for SVM-OVO, indicating a more compact network realized by ELM. If one compares ELM with SANN [5], it can be seen that, in SANN, there are one ANN and up to 91 subsequent ANNs to be trained for each experiment, with each network consisting of five modules of 10 hidden nodes. This means that, for each experiment, up to 4,600 hidden nodes are needed for the training process of SANN, while the ELM network needs fewer than 50 hidden nodes.

Classification performance on individual categories. For a multicategory classifier, although the overall classification performance is important, one may also have to look at the classification performance on individual classes. A good classifier is one that produces good overall classification performance along with good individual class performance. To assess this, the classification power of all of the studied algorithms on each tumor class has been investigated. Figs. 2 and 3 show the classification results for each category for the different algorithms. The figures show the number of data points for each category and the number of successful classifications, called hits, for each of the algorithms. The hits shown in the figures are based on the simulation results of the 5-fold cross validation of 144 samples using 84 genes. As observed from Figs. 2 and 3, the ELM algorithm has higher or similar hits for most of the tumor categories compared with the other algorithms. It should be noted that, for the samples of the Ovary class, ANN and SANN fail to correctly classify more than 80 percent of the samples. This indicates that, for this multicategory classification with 14 classes, SANN has shown some preference for patterns in other classes over Ovary, whereas the ELM and SVM-OVO algorithms possess a better balance among all of the classes.

Fig. 2. Comparison of individual category classification: Breast, Prostate, Lung, Colorectal, Lymphoma, Bladder, and Melanoma.

Fig. 3. Comparison of individual category classification: Uterus_Adeno, Leukemia, Renal, Pancreas, Ovary, Mesothelioma, and CNS.

Robustness comparison of ELM and SVM-OVO. The robustness of the ELM and SVM-OVO algorithms with respect to the number of genes selected is further evaluated in this subsection. Here, we first choose the number of genes and then, for this gene selection, the best parameters $(C, \gamma)$ for SVM-OVO and $(\tilde{N}, \gamma)$ for ELM are selected. These parameters are then frozen and used for all of the other gene selections, and the resulting classification accuracy is evaluated. If the performance for different numbers of genes does not vary drastically, it indicates that the algorithm is robust to the number of genes selected. For example, when the number of genes selected is 14, we tune the parameters to get the best results. Then, we apply the same parameter values to all other gene numbers, namely, 28, 42, 56, 70, 84, and 98, and compare the performance of the two algorithms. We repeat this process for each of the selected numbers of genes.

Using the above method for robustness evaluation, the classification performance of the two algorithms using the best parameters for 84 genes is presented in Fig. 4. Similar performance figures can also be obtained for the other choices of the number of genes. As observed from Fig. 4, ELM produces a flatter curve than SVM-OVO, which indicates that the performance of ELM is not sensitive to the number of genes selected, i.e., it is more robust. It can also be seen that the accuracies produced by ELM are generally higher than those of SVM-OVO. In these applications, a more robust algorithm implies that, when one wants to try out a different number of genes, the algorithm parameters need not be retuned every time in order to obtain the best performance, saving a large amount of time and effort.
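The robustness protocol itself fits in a few lines. This sketch reuses the illustrative `ELM` class and `grid_search_elm` helper from the earlier code sketches (our own constructions): parameters are tuned once at a reference gene count, then frozen and scored at every other gene count. The dictionary-of-matrices input format is a placeholder of ours.

```python
def robustness_curve(X_by_genes, y, tune_at=84, n_folds=5):
    """Freeze the (n_hidden, gamma) tuned at one gene count and
    evaluate the same setting at every other gene count.

    X_by_genes maps a gene count (14, 28, ...) to the data matrix
    restricted to the top-ranked genes of that size.
    """
    n_hidden, gamma, _ = grid_search_elm(X_by_genes[tune_at], y,
                                         n_hidden_grid, gamma_grid, n_folds)
    curve = {}
    for n_genes, X in sorted(X_by_genes.items()):
        # Singleton grids score the frozen parameters by cross-validation
        _, _, acc = grid_search_elm(X, y, [n_hidden], [gamma], n_folds)
        curve[n_genes] = acc   # a flat curve means robust to gene count
    return curve
```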

3.1.3 Comparison with the SVM-OVA Algorithm

It should be noted that the results obtained by Linder et al. [5] and Ramaswamy et al. [2] cannot be compared directly, in the sense that Linder et al. [5] reported 5-fold cross-validation results over the 144 primary training samples, while Ramaswamy et al. [2] reported testing results over the 46 primary testing samples after the SVM was trained on the 144 primary training samples. In order to have a direct comparison with [2], in the experimental studies of ELM and SVM-OVO, we also combined the 144 training samples and 46 primary testing samples and then randomly split these 190 samples into 100 splits of training and testing sets of 144 and 46 samples, in a class-proportional manner. The average testing accuracies of ELM and SVM-OVO over the 100 splits, together with the results of SVM-OVA from [2], are shown in Table 4 and Fig. 5.

In order to assess whether the better accuracy of ELM over SVM-OVO is statistically significant, we performed the McNemar Test described by Salzberg [17] for the case of 98 genes. The test was performed under the null hypothesis that the ELM and SVM-OVO algorithms have the same testing accuracy. The obtained p value is 4.934e-4, which is far lower than the commonly used significance level of p = 0.01. This implies that we can reject the null hypothesis, and thus that the superior classification accuracy obtained by ELM is statistically significant compared to that of SVM-OVO.
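For reference, a McNemar test of the flavor recommended by Salzberg [17] can be computed from the two classifiers' per-sample correctness as sketched below. The exact binomial form used here is one common variant, and the function is our illustration rather than the authors' script.

```python
from scipy.stats import binom

def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar test on paired classifier outcomes.

    correct_a / correct_b are boolean sequences, True where classifier
    A (resp. B) labels the corresponding test sample correctly. Only
    discordant pairs (one right, the other wrong) carry information.
    """
    n_ab = sum(a and not b for a, b in zip(correct_a, correct_b))  # A right, B wrong
    n_ba = sum(b and not a for a, b in zip(correct_a, correct_b))  # B right, A wrong
    n = n_ab + n_ba
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant outcomes are Binomial(n, 1/2)
    return min(1.0, 2.0 * binom.cdf(min(n_ab, n_ba), n, 0.5))
```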

Fig. 4. Comparison of the classification accuracies of ELM and SVM-OVO using the best parameters for 84 genes.


TABLE 4 Testing Accuracy (%) of Different Algorithms

Training time and network complexity. The training times for ELM and SVM-OVO for the 100 splits of training and test sets of 144 and 46 samples are shown in Table 5, where, for each split, the training set is further randomly partitioned into 10 train/validation splits. Obviously, ELM spends much less training time and parameter-selection time than SVM, especially when the number of genes selected is high. For each split of training and test sets, the parameters are selected based on the validation accuracy within the training set, the selected parameters are then used on the whole training set, and the resulting model is finally applied to the test set to obtain the classification accuracy. The averaged selected hidden node numbers over 100 splits for ELM and the averaged numbers of support vectors for SVM-OVO for each gene number are also listed in Table 5. We can see that ELM always achieves a much more compact network structure than the SVM-OVO algorithm.

Fig. 5. Comparison of the testing accuracies of ELM, SVM-OVO, and SVM-OVA.

TABLE 5 Training Time (s) and Averaged Number of Hidden Nodes (Support Vectors) for the ELM and SVM-OVO Algorithms for 100 Splits of the Training and Test Set of 144 and 46 Samples on the GCM Data Set

Classification performance on individual categories. The confusion matrices for the ELM and SVM-OVO algorithms are shown in Table 6 and Table 7, respectively. These are based on the average of 100 trials on different shuffles of the training and testing data with 98 selected genes. The diagonal elements of the confusion matrix indicate the correct classification percentages and the off-diagonal elements give the misclassification percentages. It can be seen that the classification accuracies for both ELM and SVM-OVO are higher when the tumor category has a large number of patterns, such as Lymphoma, Leukemia, and CNS. This indicates that, with more tumor samples, the classification accuracy of ELM and SVM-OVO can be improved. Regarding misclassification, a careful comparison of the two tables shows many cases where both algorithms happen to misclassify the same cases with a similar probability. For example, the probability of a Bladder tumor being categorized as a Melanoma tumor or an Ovary tumor is quite high for both algorithms. This indicates that these cases are difficult for both algorithms to classify.
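A percentage confusion matrix of this kind (rows as true classes, diagonal as per-class accuracy) can be produced as sketched below; the helper uses scikit-learn and is our illustration, not the authors' code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_percent(y_true, y_pred, labels):
    """Row-normalized confusion matrix in percent: entry (i, j) is the
    percentage of class-i samples predicted as class j, so the diagonal
    holds the per-class correct-classification rates."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.where(row_sums == 0, 1, row_sums)
```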

3.2 Experiment 2: Lung Data Set

The Lung data set consists of 73 samples spanning five different classes, each sample with 918 genes. These samples include 41 adenocarcinoma (AC) samples, 17 squamous cell carcinoma (SCC) samples, four large cell lung cancer (LCLC) samples, five small cell lung cancer (SCLC) samples, and six normal lung samples. This data set is available at http://genome-www.stanford.edu/lung_cancer/adeno/. The 73 samples are randomly split into 59 training samples and 14 testing samples at each trial, and the average performance is obtained over 100 trials for both ELM and SVM-OVO. The BSS/WSS method [18] is used for gene selection; it sorts the genes in descending order by the ratio of their between-group to within-group sums of squares. Ten choices of the number of genes, from 10 to 100 in increments of 10, are selected and used in the study of both the ELM and SVM-OVO algorithms. The average testing accuracies for ELM and SVM-OVO are shown in Table 8.

TABLE 8 Testing Accuracy (%) of the ELM and SVM-OVO Algorithms on the Lung Data Set
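The BSS/WSS criterion of Dudoit et al. [18] ranks each gene by the ratio of its between-class to within-class sum of squares. A direct NumPy rendering of the published formula is sketched below (our implementation, with an epsilon guard of our own for constant genes).

```python
import numpy as np

def bss_wss_ranking(X, y):
    """Rank genes (columns of X) by the BSS/WSS ratio of Dudoit et al. [18].

    For gene j:
      BSS_j = sum_k n_k * (mean_kj - mean_j)^2          (between-class)
      WSS_j = sum_k sum_{i in class k} (x_ij - mean_kj)^2  (within-class)
    Returns column indices sorted by decreasing BSS/WSS.
    """
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        bss += len(Xk) * (mk - overall) ** 2
        wss += ((Xk - mk) ** 2).sum(axis=0)
    ratio = bss / np.where(wss == 0, np.finfo(float).eps, wss)
    return np.argsort(ratio)[::-1]

# e.g., the top 10 genes: top = bss_wss_ranking(X_train, y_train)[:10]
```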


TABLE 6 The Confusion Matrix Obtained by ELM for the GCM Testing Data

TABLE 7 The Confusion Matrix Obtained by SVM-OVO for the GCM Testing Data

The total training times (including model selection time) for both ELM and SVM-OVO are shown in Table 9. The total time taken by ELM is smaller than that of SVM, especially when the number of selected genes is high. For each trial, the hidden node number is selected based on the validation accuracy. The average hidden node numbers over 100 trials for ELM and the average numbers of support vectors for SVM-OVO for each gene number are also listed in Table 9. It can be seen from the table that ELM achieves a more compact network structure than SVM-OVO.

TABLE 9 Training Time (s) and Average Number of Hidden Nodes (Support Vectors) for the ELM and SVM-OVO Algorithms for 100 Splits of Training and Test Set on the Lung Data Set

In order to assess the statistical significance of the classification accuracies obtained by these two algorithms, we also performed the McNemar Test [17] for the case of 100 genes. The test was performed under the null hypothesis that the ELM and SVM-OVO algorithms have the same testing accuracy. The obtained p value of 0.64 indicates that we cannot reject the null hypothesis, which means that the two algorithms have similar testing accuracy on this data set.

3.3 Experiment 3: Lymphoma Data Set

The Lymphoma data set covers the three most prevalent adult lymphoid malignancies. It contains 62 samples of 4,026 genes each, spanning three classes: 42 Diffuse Large B-Cell Lymphoma (DLBCL) samples, nine Follicular Lymphoma (FL) samples, and 11 B-cell Chronic Lymphocytic Leukemia (B-CLL) samples. The data set can be found at http://genome-www.stanford.edu/lymphoma/. The 62 samples are randomly split into 50 training samples and 12 testing samples at each trial, and the average performance is obtained over 100 trials for both ELM and SVM-OVO. The BSS/WSS method [18] is again used for gene selection. Ten different numbers of genes, from 10 to 100 in intervals of 10, are selected and used in the simulation of both the ELM and SVM-OVO algorithms. The average testing accuracies over 100 trials are shown in Table 10.

TABLE 10 Testing Accuracy (%) for the ELM and SVM-OVO Algorithms on the Lymphoma Data Set

For this problem, we performed the McNemar Test for the case of 100 genes. Here, the null hypothesis is that the ELM and SVM-OVO algorithms have the same testing accuracy. We obtained a p value of 5.61e-6. This value is significantly lower than the commonly used significance level of 0.01. Hence, we can reject the null hypothesis with high confidence, which implies that the better accuracy of SVM-OVO over the ELM algorithm is statistically significant.

As in the earlier case, the total training time and the average number of hidden nodes/support vectors are given in Table 11. From the table, it can be seen that ELM takes less training time with a more compact network.

TABLE 11 Training Time (s) and Averaged Number of Hidden Nodes (Support Vectors) for the ELM and SVM-OVO Algorithms for 100 Splits of Training and Test Set on the Lymphoma Data Set

3.4 Overall Comparison/Remarks

Overall Comparison/Remarks

3.4.1 Classification Accuracy

From the results on the three data sets, we find that, compared to SVM-OVO, ELM achieves better classification results (in a statistical sense) on the 14-category classification task (GCM), similar performance on the five-category Lung data set, and lower performance on the three-category Lymphoma data set. Even for Lymphoma, the accuracy achieved by ELM is 95 percent or more. This is consistent with our hypothesis that ELM performs better in multicategory classification applications where the number of classes is large.

3.4.2 Training Time and Network Complexity

For all three data sets, ELM takes much less total training time than the SVM-OVO algorithm. As mentioned before, the SVM-OVO algorithm has to build c(c - 1)/2 binary classifiers to distinguish between every pair of classes. Across the three data sets, as the number of categories decreases, the training-time difference between ELM and SVM-OVO also decreases. Further, compared with SANN [5], ELM takes a significantly lower training time.

For all three data sets, the number of hidden nodes for ELM is always smaller than the number of support vectors for SVM-OVO, indicating a more compact network realized by ELM. For the GCM data set, if one compares ELM with SANN [5], there are one ANN and up to 91 subsequent ANNs to be trained for each experiment, with each network consisting of five modules of 10 hidden nodes. This means that, for each experiment, up to 4,600 hidden nodes are needed for the training process, while the ELM network is much more compact, with fewer than 50 hidden nodes.

4 CONCLUSION

In this paper, a fast and efficient classification method, the ELM algorithm, is presented for multicategory cancer diagnosis problems based on microarray data. Its performance has been compared with other methods such as the ANN, SANN, and SVM algorithms. Multicategory classification with SVM is done by extending its binary classification method on a one-versus-all or one-versus-one comparison basis. This inevitably involves more classifiers, greater system complexity and computational burden, and longer training time. ELM can perform the multicategory classification directly, without any such modification. The study results are consistent with our hypothesis that, when the number of categories in the classification task is large, the ELM algorithm achieves higher classification accuracy than the other algorithms with less training time and a smaller network structure. It can also be seen that ELM achieves better and more balanced classification on individual categories as well. Theoretical investigation of these observations is currently under way.

ACKNOWLEDGMENTS

The authors wish to thank Dr. Roland Linder (Institute for Medical Informatics, Medical University of Luebeck, Germany) for his clarifications on SANN. They also wish to thank Dr. S. Suresh and S. Saraswathi for many helpful discussions on the study of the GCM data set.

REFERENCES

[1] T. Li, C. Zhang, and M. Ogihara, "A Comparative Study of Feature Selection and Multiclass Classification Methods for Tissue Classification Based on Gene Expression," Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004.
[2] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub, "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures," Proc. Nat'l Academy of Sciences, USA, vol. 98, no. 26, pp. 15149-15154, 2001.
[3] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, "A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis," Bioinformatics, vol. 21, no. 5, pp. 631-643, 2005.
[4] C.-H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R.M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub, "Molecular Classification of Multiple Tumor Types," Bioinformatics, vol. 17, pp. S316-S322, 2001.
[5] R. Linder, D. Dew, H. Sudhoff, D. Theegarten, K. Remberger, S.J. Poppl, and M. Wagner, "The 'Subsequent Artificial Neural Network' (SANN) Approach Might Bring More Classificatory Power to ANN-Based DNA Microarray Analyses," Bioinformatics, vol. 20, no. 18, pp. 3544-3552, 2004.
[6] M. Ringner, C. Peterson, and J. Khan, "Analyzing Array Data Using Supervised Methods," Pharmacogenomics, vol. 3, no. 3, pp. 403-415, 2002.
[7] J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and S. Meltzer, "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001.
[8] J.W. Lee, J.B. Lee, M. Park, and S.H. Song, "An Extensive Comparison of Recent Classification Tools Applied to Microarray Data," Computational Statistics and Data Analysis, vol. 48, pp. 869-885, 2005.
[9] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," Proc. Int'l Joint Conf. Neural Networks (IJCNN '04), July 2004.
[10] G.-B. Huang and C.-K. Siew, "Extreme Learning Machine: RBF Network Case," Proc. Eighth Int'l Conf. Control, Automation, Robotics, and Vision (ICARCV '04), Dec. 2004.
[11] G.-B. Huang and C.-K. Siew, "Extreme Learning Machine with Randomly Assigned RBF Kernels," Int'l J. Information Technology, vol. 11, no. 1, 2005.
[12] G.-B. Huang, Q.-Y. Zhu, K.Z. Mao, C.-K. Siew, P. Saratchandran, and N. Sundararajan, "Can Threshold Networks Be Trained Directly?" IEEE Trans. Circuits and Systems II, vol. 53, no. 3, pp. 187-191, 2006.
[13] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, "Fully Complex Extreme Learning Machine," Neurocomputing, vol. 68, pp. 306-314, 2005.
[14] G.-B. Huang, "Learning Capability and Storage Capacity of Two-Hidden-Layer Feedforward Networks," IEEE Trans. Neural Networks, vol. 14, no. 2, pp. 274-281, 2003.
[15] D. Serre, Matrices: Theory and Applications. Springer-Verlag, 2002.
[16] G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
[17] S.L. Salzberg, "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach," Data Mining and Knowledge Discovery, vol. 1, pp. 317-328, 1997.
[18] S. Dudoit, J. Fridlyand, and T.P. Speed, "Comparison of Discrimination Methods for Classification of Tumors Using Gene Expression Data," J. Am. Statistical Assoc., vol. 97, no. 457, pp. 77-87, 2002.


Runxuan Zhang received the BE degree in electronic and information engineering and the BA degree in English in 2002 from the Dalian University of Technology, Dalian, China. From 2002 to 2005, he was a PhD candidate in the area of pattern classification using artificial neural network methods at Nanyang Technological University, Singapore. Currently, he is a postdoctoral researcher at the Institut Pasteur, Paris, France. His current area of research interest is in the field of high-throughput protein identification and quantification. He is a student member of the IEEE.

Guang-Bin Huang received the BSc degree in applied mathematics and the MEng degree in computer engineering from Northeastern University, China, in 1991 and 1994, respectively, and the PhD degree in electrical engineering from Nanyang Technological University, Singapore, in 1999. During the undergraduate period, he was also concurrently studying in the Wireless Communication Department at Northeastern University, China. From June 1998 to May 2001, he worked as a research fellow at the Singapore Institute of Manufacturing Technology (formerly known as the Gintic Institute of Manufacturing Technology), where he led/implemented several key industrial projects. Since May 2001, he has been an assistant professor in the School of Electrical and Electronic Engineering, Nanyang Technological University. His current research interests include extreme learning machines, machine learning, bioinformatics, and networking. He is an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics-Part B and Neurocomputing. He is a senior member of the IEEE.


Narasimhan Sundararajan received the BE degree in electrical engineering with first class honors from the University of Madras in 1966, the MTech degree from the Indian Institute of Technology, Madras, in 1968, and the PhD degree in electrical engineering from the University of Illinois, Urbana-Champaign, in 1971. From 1972 to 1991, he worked in the Indian Space Research Organization, Trivandrum, India, starting as a control system designer and progressing to director of the Launch Vehicle Design Group, contributing to the design and development of the Indian satellite launch vehicles. He also worked as an NRC Research Associate at NASA-Ames in 1974 and as a senior research associate at NASA Langley in 1981-1986. Since February 1991, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, first as an associate professor (from 1991 to August 1999) and presently as a professor. He was a "Prof. I.G. Sarma Memorial ARDB Professor" (an endowed visiting professor) during November 2002-February 2003 at the School of Computer Science and Automation, Indian Institute of Science, Bangalore, India. His research interests are in the areas of aerospace control, neural networks, and parallel implementations of neural networks, and he has published more than 130 papers and four books, titled Fully Tuned Radial Basis Function Neural Networks for Flight Control (Kluwer Academic, 2001), Radial Basis Function Neural Networks with Sequential Learning (World Scientific, 1999), Parallel Architectures for Artificial Neural Networks (IEEE CS Press, 1998), and Parallel Implementations of Backpropagation Neural Networks (World Scientific, 1996). Dr. Sundararajan is a fellow of the IEEE, an associate fellow of the AIAA, and a fellow of the Institution of Engineers (IES), Singapore. He was an associate editor for the IEEE Transactions on Control Systems Technology, the IFAC Journal on Control Engineering Practice (CEP), IEEE Robotics and Automation Magazine, and Control Theory and Advanced Technology (C-TAT), Japan. He was also a member of the Board of Governors (BoG) of the IEEE Control Systems Society (CSS) for 2005. He has contributed as a program committee member for a number of international conferences and was the general chairman of the Sixth International Conference on Automation, Robotics, Control, and Computer Vision (ICARCV '00) held in Singapore in December 2000. He is listed in Marquis Who's Who in Science and Engineering and Men of Achievement, International Biographical Centre, Cambridge, United Kingdom.

P. Saratchandran received the PhD degree in the area of control engineering from Oxford University, United Kingdom. He is an associate professor with Nanyang Technological University, Singapore. He has several publications in refereed journals and conferences and has authored four books, Fully Tuned RBF Networks for Flight Control (Kluwer Academic, 2002), Radial Basis Function Neural Networks with Sequential Learning (World Scientific, 1999), Parallel Architectures for Artificial Neural Networks (IEEE CS Press, 1998), and Parallel Implementations of Backpropagation Neural Networks (World Scientific, 1996). He is an editor of the journal Neural Parallel and Scientific Computations. His interests are in neural networks, bioinformatics, and adaptive control. He is listed in Marquis Who's Who in the World and in Leaders in the World of the International Biographical Centre, Cambridge, United Kingdom.
