Optimized Kernel Machine for Cancer Classification Using Gene Expression Data

Huilin Xiong and Xue-Wen Chen
Information and Telecommunication Technology Center
Department of Electrical Engineering and Computer Science
The University of Kansas

Abstract— Cancer classification using gene expression data has proven very useful for cancer diagnosis and prediction. However, the very high dimensionality and relatively small sample size of gene expression data make the classification task quite challenging. In this paper, we present a new approach, based on optimizing the kernel function, to improve the performance of classifiers on gene expression data. To increase the class separability of the data, we use a flexible kernel model, the data-dependent kernel, as the objective kernel to be optimized. The experimental results show that using the optimized kernel usually yields a substantial improvement for the K-nearest-neighbor (KNN) and support vector machine (SVM) classifiers on gene expression data.
I. INTRODUCTION

DNA microarray technology is designed to measure the expression levels of tens of thousands of genes simultaneously. As an important application of this technology, gene expression data are used to determine and predict the state of tissue samples, which has proven very helpful in clinical oncology. The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. Combined with pattern classification techniques, gene expression data can provide more reliable approaches to diagnosing and predicting various types of cancers than traditional clinical methods.

Gene expression data are typically characterized by high dimensionality and small sample size. In the literature, various methods have been developed for cancer classification using microarray data [2], [3], [4], [21], [22], [23]. Generally, because of the high dimensionality and small sample size, gene expression samples are assumed to be linearly separable in the high dimensional feature space; therefore, linear classifiers, e.g., the support vector machine (SVM) with a linear kernel function and linear discriminant analysis (LDA) [9], are often favored. On the other hand, as shown in a recent benchmark study [6], for gene expression data, even in the case of small sample size, a nonlinear support vector machine with a well-tuned RBF kernel is never worse, and is sometimes statistically significantly better, than its linear counterpart. Furthermore, kernel machines using nonlinear kernel functions are capable of exploring nonlinear discriminant information in the microarray data and hence of producing more precise classification results. This will be especially true as more patient samples become available in the future.

In this paper, we investigate the efficiency of a kernel approach, based on optimizing a data-dependent kernel model, for the task of gene expression data classification. Employing the data-dependent kernel model, our kernel optimization scheme is capable of further improving the performance of classifiers such as KNN and SVM on gene expression data. The experimental results on several gene expression data sets demonstrate the efficiency of the proposed scheme.

II. DATA-DEPENDENT KERNEL AND ITS OPTIMIZATION
Kernel machines, such as the support vector machine (SVM), kernel principal component analysis (KPCA), and kernel Fisher discriminant (KFD), work by mapping the input data $X$ into a high or infinite dimensional feature space $F$ via a map $\Phi : X \to F$, and then building linear algorithms in the feature space to implement nonlinear counterparts in the input space. The map $\Phi$, rather than being given in an explicit form, is specified implicitly by a kernel function $k(x, y)$ that gives the inner product between each pair of mapped points in the feature space, that is, $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$. Different kernel functions create different spatial distributions of the data in the feature space and lead to different class discrimination. Since there is no general kernel function suitable for all data sets, in this paper we adopt a data-dependent kernel model as the objective kernel to be optimized.

A. Data-dependent kernel model
Let $\{x_i, \lambda_i\}$ $(i = 1, 2, \ldots, m)$ be the $m$ $d$-dimensional training samples of the given gene expression data, where $\lambda_i = \pm 1$ represent the class labels of the samples. The so-called "conformal transformation of a kernel" [1] is used as our data-dependent kernel,
$$k(x, y) = q(x)\, q(y)\, k_0(x, y) \qquad (1)$$

where $x, y \in \mathbb{R}^d$; $k_0(x, y)$, called the basic kernel, is an ordinary kernel such as a Gaussian or a polynomial kernel function; and $q(\cdot)$, the factor function, takes the form

$$q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i\, k_1(x, x_i) \qquad (2)$$
in which $k_1(x, x_i) = e^{-\gamma_1 \|x - x_i\|^2}$, and the $\alpha_i$'s are the combination coefficients. It is easy to see that the data-dependent kernel satisfies the Mercer condition for a kernel function [19]. Let the kernel matrices corresponding to $k(x, y)$ and $k_0(x, y)$ be $K$ and $K_0$. Obviously, $K = [q(x_i)\, q(x_j)\, k_0(x_i, x_j)]_{m \times m} = Q K_0 Q$, where $Q$ is a diagonal matrix whose diagonal elements are $q(x_1), q(x_2), \ldots, q(x_m)$. Denoting the vectors $(q(x_1), q(x_2), \ldots, q(x_m))^T$ and $(\alpha_0, \alpha_1, \ldots, \alpha_m)^T$ by $q$ and $\alpha$ respectively, we have $q = K_1 \alpha$, where $K_1$ is the $m \times (m+1)$ matrix

$$K_1 = \begin{pmatrix} 1 & k_1(x_1, x_1) & \cdots & k_1(x_1, x_m) \\ 1 & k_1(x_2, x_1) & \cdots & k_1(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & k_1(x_m, x_1) & \cdots & k_1(x_m, x_m) \end{pmatrix} \qquad (3)$$
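As a concrete illustration of Eqs. (1)-(3), the following Python/NumPy sketch builds the data-dependent kernel matrix K = Q K0 Q from a Gaussian basic kernel and a given coefficient vector alpha. The function names and the use of a Gaussian basic kernel are our choices for illustration, not a prescription from the paper.

    import numpy as np

    def gaussian_kernel(X, Y, gamma):
        """Gaussian kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
        sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
        return np.exp(-gamma * sq)

    def factor_matrix(X, gamma1):
        """K1: m x (m+1) matrix [1, k1(x_i, x_j)] used in q = K1 @ alpha (Eq. 3)."""
        m = X.shape[0]
        return np.hstack([np.ones((m, 1)), gaussian_kernel(X, X, gamma1)])

    def data_dependent_kernel(X, alpha, gamma0, gamma1):
        """Data-dependent kernel matrix K = Q K0 Q (Eq. 1), with q(x) from Eq. (2)."""
        K0 = gaussian_kernel(X, X, gamma0)        # basic kernel matrix
        q = factor_matrix(X, gamma1) @ alpha      # q = K1 alpha
        return (q[:, None] * K0) * q[None, :]     # equivalent to Q @ K0 @ Q

For test points, $q(x)$ is evaluated with the same $\alpha$ by replacing the training-versus-training $k_1$ matrix with the test-versus-training one.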
B. Optimizing the data-dependent kernel

Our goal is to optimize the data-dependent kernel in Eq. (1) so that the data in the feature space possess relatively large class separability. This requires optimizing the combination coefficient vector $\alpha$. A Fisher scalar measuring the class separability of the training data in the feature space is adopted as the criterion for our kernel optimization,

$$J = \frac{\mathrm{tr}\, S_b}{\mathrm{tr}\, S_w} \qquad (4)$$

where $S_b$ represents the "between-class scatter matrix" and $S_w$ the "within-class scatter matrix".
Suppose the training data are grouped according to their class labels, i.e., the first $m_1$ samples belong to class $C_1$ and the remaining $m_2$ samples belong to class $C_2$ ($m_1 + m_2 = m$). Then the basic kernel matrix $K_0$ can be partitioned as
$$K_0 = \begin{pmatrix} K_{11}^0 & K_{12}^0 \\ K_{21}^0 & K_{22}^0 \end{pmatrix}$$

where the sizes of the submatrices $K_{11}^0$, $K_{12}^0$, $K_{21}^0$, and $K_{22}^0$ are $m_1 \times m_1$, $m_1 \times m_2$, $m_2 \times m_1$, and $m_2 \times m_2$, respectively. A close relation between the class separability measure $J$ and the kernel matrices has been established [24]:

$$J = \frac{q^T B_0\, q}{q^T W_0\, q} \qquad (5)$$

where

$$B_0 = \begin{pmatrix} \frac{1}{m_1} K_{11}^0 & 0 \\ 0 & \frac{1}{m_2} K_{22}^0 \end{pmatrix} - \frac{1}{m} K_0, \qquad W_0 = \mathrm{diag}(k_{11}^0, k_{22}^0, \ldots, k_{mm}^0) - \begin{pmatrix} \frac{1}{m_1} K_{11}^0 & 0 \\ 0 & \frac{1}{m_2} K_{22}^0 \end{pmatrix}.$$

Furthermore, using the relation $q = K_1 \alpha$, we have

$$J(\alpha) = \frac{\alpha^T M_0\, \alpha}{\alpha^T N_0\, \alpha} \qquad (6)$$

where $M_0 = K_1^T B_0 K_1$ and $N_0 = K_1^T W_0 K_1$. Obviously, the optimal $\alpha$ that maximizes the class separability measure $J$ should be the eigenvector corresponding to the maximum eigenvalue of the eigenstructure

$$M_0\, \alpha = \lambda N_0\, \alpha \qquad (7)$$

provided that the matrix $N_0$ is nonsingular. Unfortunately, for gene expression data, the matrix $N_0$ is always singular due to the high dimensionality of the data. This makes the eigendecomposition-based approach inapplicable. To avoid using the eigenvector solution, an updating algorithm based on the standard gradient approach is developed. The algorithm is summarized below, where the learning rate $\eta(n)$ is adopted in a gradually decreasing form,

$$\eta(n) = \eta_0 \left(1 - \frac{n}{N}\right) \qquad (8)$$

where $\eta_0$ represents the initial learning rate.
1) Group the data according to their class labels. Calculate $K_0$ and $K_1$ first, then $B_0$ and $W_0$, and then $M_0$ and $N_0$.
2) Initialize $\alpha^{(0)}$ to the vector $(1, 0, \ldots, 0)^T$, and set $n = 0$.
3) Calculate $q = K_1 \alpha^{(n)}$.
4) Calculate $J_1 = q^T B_0\, q$, $J_2 = q^T W_0\, q$, and $J = J_1 / J_2$.
5) Update $\alpha^{(n)}$:
$$\alpha^{(n+1)} = \alpha^{(n)} + \eta(n) \left( \frac{1}{J_2} M_0 - \frac{J}{J_2} N_0 \right) \alpha^{(n)}$$
and normalize $\alpha^{(n+1)}$ so that $\|\alpha^{(n+1)}\| = 1$.
6) If $n$ reaches a pre-specified number $N$, stop. Otherwise, set $n = n + 1$ and go to step 3.

[Fig. 1 here: panels "Training Embedding before Kernel Optimization" and "Test Embedding before Kernel Optimization" (a); "Training Embedding after Kernel Optimization" and "Test Embedding after Kernel Optimization" (b).]

Fig. 1. Class separability for the Prostate training and test data. (a) 2-dimensional projection of the training and test data before the kernel optimization. (b) 2-dimensional projection of the training and test data after the kernel optimization.
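To make the procedure concrete, here is a minimal Python/NumPy sketch of the gradient-based optimization in steps 1-6, assuming binary labels with the first m1 training samples in class 1; the function names, and the way B0 and W0 are assembled, follow our reading of Eqs. (5)-(8) rather than any reference implementation.

    import numpy as np

    def scatter_matrices(K0, m1):
        """Between-class (B0) and within-class (W0) kernel scatter matrices, Eq. (5).
        Assumes the first m1 rows/columns of K0 belong to class 1."""
        m = K0.shape[0]
        D = np.zeros_like(K0)
        D[:m1, :m1] = K0[:m1, :m1] / m1          # (1/m1) * K11
        D[m1:, m1:] = K0[m1:, m1:] / (m - m1)    # (1/m2) * K22
        return D - K0 / m, np.diag(np.diag(K0)) - D

    def optimize_alpha(K0, K1, m1, eta0=1e-3, n_iter=1000):
        """Maximize J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha) by gradient ascent."""
        B0, W0 = scatter_matrices(K0, m1)
        M0, N0 = K1.T @ B0 @ K1, K1.T @ W0 @ K1
        alpha = np.zeros(K1.shape[1])
        alpha[0] = 1.0                                       # alpha^(0) = (1, 0, ..., 0)
        for n in range(n_iter):
            eta = eta0 * (1.0 - n / n_iter)                  # decreasing learning rate, Eq. (8)
            q = K1 @ alpha
            J1, J2 = q @ B0 @ q, q @ W0 @ q
            J = J1 / J2
            alpha = alpha + eta * ((M0 @ alpha) / J2 - J * (N0 @ alpha) / J2)
            alpha = alpha / np.linalg.norm(alpha)            # keep ||alpha|| = 1
        return alpha

The returned alpha can then be plugged into the earlier kernel-construction sketch to obtain the optimized kernel matrix K = Q K0 Q.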
C. The effect of the kernel optimization

In this section, we illustrate that the proposed kernel optimization algorithm can substantially improve the class separability, not only for the training data but for the test
data as well. A microarray data set, Prostate Cancer, adopted from http://www.broad.mit.edu/, is used to test the proposed kernel optimization algorithm. The Prostate Cancer data set contains the expression levels of 12600 genes for 52 prostate tumor samples and 50 normal prostate samples. We select the 100 most discriminatory genes and then normalize the data set to a distribution with zero mean and unit variance. The data set is randomly partitioned into two equal disjoint subsets, one used as the training set and the other as the test set. The parameters in the optimization algorithm are set as $\gamma_0 = 1.0\mathrm{e}{-06}$ and $\gamma_1 = 1.0\mathrm{e}{-03}$. The initial learning rate $\eta_0$ and the iteration number $N$ are set to 0.001 and 1000, respectively.

Fig. 1(a) shows the projections of the training and test data onto the first two significant dimensions before the kernel optimization, and Fig. 1(b) shows the corresponding projections after the kernel optimization. Fig. 2 illustrates the increasing nature of the class separability measure $J$ with respect to the iterations of the algorithm, not only for the training data but also for the test data. It can be seen that the class separability measures for the training data and for the test data increase in a similar manner during the kernel optimization.

[Fig. 2 here: the class separability measure plotted against the number of iterations, with one curve for the training data and one for the test data.]

Fig. 2. The class separability measure J of the Prostate training and test data as a function of the number of iterations.

Although the kernel optimization usually improves the class separability for both the training and test data, when the sample size is too small the kernel optimization may cause overfitting, meaning that the subsequent classification works very well on the training data but worse on the test data. To handle this problem, we adopt a strategy, similar to the technique of bootstrap resampling, to increase the sample size of the training data. Given the high dimensionality of gene expression data, the training samples are sparsely distributed in the high dimensional space, and it is reasonable to assume that points near a training sample have the same class characteristic as that sample. Suppose $\{x_i, \lambda_i\}$ $(i = 1, 2, \ldots, m)$ are the training data. We construct a new set of training data $\{y_i, \tilde{\lambda}_i\}$ $(i = 1, 2, \ldots, 3m)$, where

$$y_i = \begin{cases} x_i & \text{if } 1 \le i \le m \\ x_r + \varepsilon & \text{if } m < i \le 3m \end{cases} \qquad (9)$$

in which $x_r$ is a sample randomly selected from $\{x_i\}$ with replacement and $\varepsilon$ denotes a normal random disturbance, that is, $\varepsilon \sim N(0, \sigma^2)$. The class labels are determined as

$$\tilde{\lambda}_i = \begin{cases} \lambda_i & \text{if } 1 \le i \le m \\ \lambda_r & \text{if } m < i \le 3m. \end{cases}$$

Empirically, with the extended training data, we can effectively ameliorate the overfitting and reduce possible numerical instability.
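The training-set extension of Eq. (9) is straightforward to implement; the sketch below is a minimal Python/NumPy version, assuming a label array aligned with the rows of X and a user-chosen disturbance level sigma (selected by cross-validation in our experiments). The helper name is ours.

    import numpy as np

    def extend_training_set(X, labels, sigma, rng=None):
        """Bootstrap-like extension of Eq. (9): returns 3m samples, the m originals
        plus 2m resampled points perturbed by N(0, sigma^2) noise."""
        rng = np.random.default_rng() if rng is None else rng
        m, d = X.shape
        idx = rng.integers(0, m, size=2 * m)                # sample with replacement
        noise = rng.normal(0.0, sigma, size=(2 * m, d))     # epsilon ~ N(0, sigma^2)
        X_ext = np.vstack([X, X[idx] + noise])
        y_ext = np.concatenate([labels, labels[idx]])       # perturbed points keep the source label
        return X_ext, y_ext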
III. EXPERIMENTAL RESULTS AND ANALYSIS

From a pattern classification point of view, since our kernel optimization scheme increases the class separability of the data in the feature space, the performances of the kernel machines should be improved. In this section, we conduct experiments on six microarray data sets to demonstrate that, using the optimized kernel, the performances of the K-nearest-neighbor (KNN) and support vector machine (SVM) classifiers can be further improved.

A. Data sets

Six publicly available microarray data sets are chosen to test our algorithm. The basic information about these data sets is summarized in Table I.

1) ALL-AML Leukemia Data: This data set, taken from http://sdmc.lit.org.sg/GEDatasets/, contains 72 samples of human acute leukemia; 47 samples belong to acute lymphoblastic leukemia (ALL) and the other 25 to acute myeloid leukemia (AML). Each sample records the expression levels of 7129 genes. For detailed information, one can refer to [4].

2) Breast Cancer Data: The data are available at http://mgm.duke.edu/genome/dna micro/work/. The expression matrix monitors 7129 genes in 49 breast tumor samples. There are two response variables, describing the estrogen receptor (ER) status and the lymph node (LN) status, respectively. For the ER status, 25 samples are ER+ and the remaining 24 samples are ER-. For the LN variable, there are 25 positive samples and 24 negative samples. Detailed information about this data set can be found in [23].

3) Colon Tumor Data: This data set is adopted from http://sdmc.lit.org.sg/GEDatasets/. It contains 62 samples collected from colon-cancer patients; 40 samples are from tumors and 22 are normal biopsies from healthy parts of the colons of the same patients. The expression levels of 2000 selected genes are measured. One can refer to [2] for details.

4) Lung Cancer Data: This data set is taken from http://sdmc.lit.org.sg/GEDatasets/. It contains 181 tissue samples, which are classified into two classes: malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA). Each sample is described by 12533 genes. More information about this data set can be found in [15].

5) Prostate Cancer: This data set, adopted from http://www.broad.mit.edu/, contains the gene expression levels of 12600 genes for 52 prostate tumor samples and 50 normal prostate samples. One can refer to [21] for details about this data set.

TABLE I
THE BASIC INFORMATION ABOUT THE GENE EXPRESSION DATA SETS
Data set     Sample size    Number of genes
ALL-AML      72             7129
Breast-ER    49             7129
Breast-LN    49             7129
Colon        62             2000
Lung         181            12533
Prostate     102            12600
B. Gene selection

Usually, in microarray data, many genes are expressed non-differentially; such genes are not helpful, or are even harmful, for class discrimination. Removing them not only reduces the computational burden but can also substantially improve the performance of many classifiers. In this paper, we use a class separability measure to select genes. For a gene $j$, its expression levels across the training samples form a one-dimensional data set $\{x_i(j), \lambda_i\}$, $i = 1, 2, \ldots, m$, where the $\lambda_i$ denote the class labels. The class separability measure for gene $j$ is then calculated as

$$g(j) = \frac{\sum_{k=1}^{2} m_k \left( \bar{x}_k(j) - \bar{x}(j) \right)^2}{\sum_{k=1}^{2} \sum_{i \in C_k} \left( x_i(j) - \bar{x}_k(j) \right)^2}$$

where $C_k$ denotes the index set of the $k$-th class ($k = 1, 2$), $m_k$ is the number of samples in $C_k$ ($m_1 + m_2 = m$), and $\bar{x}_k(j)$ and $\bar{x}(j)$ represent the average expression levels of gene $j$ across the $k$-th class and across the whole training set, respectively.
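For clarity, a small Python/NumPy sketch of this ranking criterion is given below; the helper names are ours, and the second function simply scores every gene with g(j) and keeps the top-scoring ones.

    import numpy as np

    def class_separability_scores(X, labels):
        """g(j) for every gene j: between-class over within-class variation.
        X is (m samples x d genes); labels hold the two class values."""
        overall_mean = X.mean(axis=0)
        between = np.zeros(X.shape[1])
        within = np.zeros(X.shape[1])
        for c in np.unique(labels):
            Xc = X[labels == c]
            class_mean = Xc.mean(axis=0)
            between += Xc.shape[0] * (class_mean - overall_mean) ** 2
            within += ((Xc - class_mean) ** 2).sum(axis=0)
        return between / (within + 1e-12)   # small epsilon guards constant genes

    def select_genes(X, labels, n_genes):
        """Indices of the n_genes most discriminatory genes (largest g(j))."""
        scores = class_separability_scores(X, labels)
        return np.argsort(scores)[::-1][:n_genes]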
C. Experimental settings

Two classification algorithms, the K-nearest-neighbor (KNN) and the support vector machine (SVM), are used to classify the gene expression data. We investigate the improvements in their performances when the optimized kernel is applied to gene expression data. Each data set is first normalized to a distribution with zero mean and unit variance in every feature direction; then the samples of each class are randomly partitioned into two equal disjoint subsets, one used to construct the training data and the other the test data. We only consider the Gaussian kernel function for the SVM. For the KNN classifier, we always set the single parameter $k$ to 5, and we investigate the improvement of the KNN algorithm when the optimized kernel metric is adopted to determine the "neighbors" of a given test sample. To capture the effect of the feature dimension, the parameters $\gamma_0$ and $\gamma_1$ in the kernel optimization are set to $10^{-5}/\sqrt{N_f}$ and $10^{-2}/\sqrt{N_f}$, where $N_f$ denotes the number of selected genes.
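The paper leaves the kernel metric for KNN implicit; a natural reading, used in the hedged sketch below, is the kernel-induced feature-space distance $d^2(x, y) = k(x, x) + k(y, y) - 2k(x, y)$ with the optimized data-dependent kernel plugged in. The helper names and the voting scheme are our own illustration.

    import numpy as np

    def kernel_distances(K_test_train, k_test_diag, k_train_diag):
        """Squared feature-space distances d^2(x, y) = k(x,x) + k(y,y) - 2 k(x,y)."""
        return k_test_diag[:, None] + k_train_diag[None, :] - 2.0 * K_test_train

    def knn_predict(K_test_train, k_test_diag, k_train_diag, train_labels, k=5):
        """KNN using the (optimized) kernel-induced metric instead of Euclidean distance."""
        d2 = kernel_distances(K_test_train, k_test_diag, k_train_diag)
        neighbors = np.argsort(d2, axis=1)[:, :k]        # k nearest training samples
        votes = train_labels[neighbors]                  # their labels, coded as +1 / -1
        return np.sign(votes.sum(axis=1) + 1e-12)        # majority vote; epsilon breaks 0-ties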
We use cross-validation to choose the disturbance parameter $\sigma$ in Eq. (9) from the set {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. For the SVM classifier with the ordinary Gaussian kernel, two parameters $(C, \gamma_0)$ need to be set, where $C$ is the regularization parameter. For the sake of computational simplicity, we set the parameter $\gamma_0$ to the value used in the kernel optimization and choose the regularization parameter $C$ by cross-validation.

D. Experimental results and analysis

For each data set, we calculate the average test accuracy over 100 random partitions of the gene expression data. Table II compares the performances of the ordinary KNN scheme and the optimized kernel-based KNN scheme, denoted "oknn", in terms of the average test accuracy (%) and the standard deviation (in parentheses), for different numbers of selected genes. Table III shows the corresponding results for the SVM algorithm.

We see that using the optimized kernel significantly improves the performance of the KNN classifier in almost all cases. For the SVM classifier, using the optimized kernel can still further improve the performance in many cases; however, the improvement is not as significant as for the KNN classifier. Moreover, in some cases, using the optimized kernel can degrade the performance of the SVM classifier. The reason is that for a sophisticated classifier such as the SVM, increasing the class separability of the data in the feature space does not necessarily lead to a significant improvement in its performance. More importantly, due to the lack of enough training samples, the more delicate classifier is more prone to overfitting; although we adopt the disturbance strategy to ameliorate the situation, we cannot essentially avoid the overfitting. Nevertheless, in the cases with relatively more samples, e.g., the Colon, Lung, and Prostate data sets, the kernel optimization-based SVM, denoted "oksvm", achieves better results.
TABLE II
THE IMPROVEMENTS OF THE PERFORMANCES OF KNN AFTER USING THE OPTIMIZED KERNEL

#genes  Method  ALL-AML       Breast-ER     Breast-LN     Colon         Lung          Prostate
50      knn     94.29 (2.35)  93.32 (4.29)  84.48 (7.90)  86.13 (4.16)  98.64 (1.13)  93.78 (2.59)
        oknn    95.54 (1.82)  92.40 (4.20)  90.72 (5.68)  87.00 (4.06)  99.00 (0.81)  92.76 (2.74)
100     knn     94.43 (2.33)  91.76 (5.02)  85.56 (6.84)  87.03 (4.35)  98.33 (1.08)  93.00 (2.51)
        oknn    96.11 (1.40)  93.80 (4.63)  91.64 (4.06)  87.81 (3.68)  99.15 (0.67)  93.02 (2.51)
200     knn     96.76 (1.21)  89.18 (5.24)  85.92 (7.11)  84.48 (4.85)  97.97 (1.14)  92.47 (2.41)
        oknn    97.03 (0.81)  93.04 (5.23)  89.00 (4.63)  87.35 (4.71)  99.60 (0.67)  93.78 (2.54)
400     knn     95.81 (2.38)  87.96 (6.14)  83.64 (7.44)  84.16 (5.11)  97.54 (1.52)  90.12 (3.37)
        oknn    97.24 (0.38)  94.00 (5.16)  87.16 (4.97)  86.77 (4.50)  99.33 (0.66)  94.20 (3.23)
600     knn     96.46 (2.13)  89.40 (5.43)  78.76 (8.35)  82.39 (5.22)  97.17 (1.51)  89.61 (3.94)
        oknn    97.30 (0.00)  95.04 (3.46)  84.56 (6.28)  86.71 (4.79)  99.16 (0.70)  93.63 (3.18)
800     knn     95.05 (2.74)  88.96 (5.86)  76.80 (7.48)  81.16 (5.06)  97.21 (1.69)  89.84 (3.58)
        oknn    97.30 (0.00)  94.64 (4.56)  83.40 (6.64)  85.94 (4.99)  98.87 (0.83)  93.47 (3.56)
1000    knn     94.24 (3.16)  87.64 (6.15)  75.60 (7.45)  78.74 (5.91)  97.45 (1.47)  89.53 (3.79)
        oknn    97.24 (0.38)  93.56 (4.99)  81.92 (5.95)  84.68 (5.73)  98.81 (0.97)  93.53 (3.01)
2000    knn     90.24 (4.46)  87.94 (6.20)  70.56 (8.63)  70.26 (6.70)  96.97 (1.65)  86.86 (4.10)
        oknn    94.95 (2.09)  92.92 (5.30)  74.96 (7.68)  79.94 (6.41)  97.73 (1.49)  92.12 (3.64)
TABLE III
THE IMPROVEMENTS OF THE PERFORMANCES OF SVM AFTER USING THE OPTIMIZED KERNEL

#genes  Method  ALL-AML       Breast-ER     Breast-LN     Colon         Lung          Prostate
50      svm     95.27 (1.70)  93.87 (3.79)  85.12 (5.74)  86.84 (5.59)  98.57 (0.81)  91.53 (3.11)
        oksvm   95.70 (1.81)  92.93 (4.17)  88.52 (6.89)  87.13 (4.19)  98.92 (0.81)  92.82 (3.34)
100     svm     96.03 (1.56)  93.56 (4.25)  90.20 (4.60)  87.10 (4.94)  99.05 (0.83)  93.35 (2.67)
        oksvm   95.95 (1.61)  93.07 (4.73)  89.82 (5.48)  87.29 (3.91)  99.12 (0.81)  94.39 (3.33)
200     svm     96.59 (1.19)  93.60 (5.02)  90.64 (5.12)  85.39 (5.42)  99.25 (0.71)  93.96 (2.94)
        oksvm   96.15 (1.41)  92.64 (5.33)  90.18 (5.39)  86.55 (4.96)  99.33 (0.62)  94.27 (3.56)
400     svm     96.85 (0.91)  93.92 (5.29)  88.76 (5.71)  84.48 (6.24)  99.24 (0.68)  93.55 (3.55)
        oksvm   96.88 (0.97)  91.77 (5.86)  87.86 (6.61)  85.61 (6.21)  99.37 (0.66)  94.04 (4.13)
600     svm     97.27 (0.27)  94.60 (4.20)  87.96 (6.15)  83.71 (5.73)  99.24 (0.68)  94.29 (3.20)
        oksvm   97.30 (0.00)  92.81 (5.13)  85.24 (6.53)  86.19 (4.70)  99.16 (0.70)  93.78 (4.33)
800     svm     97.30 (0.00)  94.44 (4.56)  87.24 (5.87)  82.58 (6.40)  99.26 (0.65)  93.22 (2.72)
        oksvm   97.30 (0.00)  93.24 (4.93)  84.28 (7.65)  84.32 (5.27)  98.87 (0.83)  93.53 (3.32)
1000    svm     97.28 (0.27)  93.86 (4.46)  88.04 (6.61)  82.13 (5.45)  99.32 (0.59)  92.82 (3.02)
        oksvm   97.22 (0.46)  92.56 (4.99)  85.32 (7.61)  83.84 (6.17)  98.91 (0.82)  92.75 (3.41)
IV. CONCLUSION

We have developed and applied a kernel optimization scheme to classify gene expression data. Employing a data-dependent kernel model, the proposed scheme can increase the class separability of the data in the feature space. It has been shown that the performances of kernel machines applied to gene expression data classification can be further improved by using the optimized kernel.
V. ACKNOWLEDGEMENT

This investigation was based upon work supported by the National Science Foundation under Grant No. EPS-0236913 and matching support from the State of Kansas through Kansas Technology Enterprise Corporation, and by the University of Kansas General Research Fund allocations #2301770-003 and #2301478-003.

REFERENCES

[1] S. Amari and S. Wu. Improving support vector machine classifiers by modifying kernel functions. Neural Networks, vol. 12, no. 6, pp. 783-789, 1999.
[2] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, vol. 96, pp. 6745-6750, June 1999.
[3] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology, vol. 7, pp. 559-584, 2000.
[4] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc., vol. 97, pp. 77-87, 2002.
[5] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, vol. 286, pp. 531-537, 1999.
[6] A. Ruiz and P. E. Lopez-de-Teruel. Nonlinear kernel-based statistical pattern analysis. IEEE Trans. on Neural Networks, vol. 12, no. 1, pp. 16-32, January 2001.
[7] N. Pochet, F. D. Smet, J. A. K. Suykens, and B. L. R. D. Moor. Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, vol. 20, no. 17, pp. 3185-3195, 2004.
[8] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[9] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, Cambridge, MA: MIT Press, 1998.
[10] J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 181-190, 2004.
[11] G. C. Cawley. MATLAB support vector machine toolbox [http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.
[12] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target alignment. In Proceedings of Neural Information Processing Systems (NIPS'01), pp. 367-373, MIT Press.
[13] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge Univ. Press, Cambridge, UK, 2000.
[14] J. H. Friedman. Flexible metric nearest neighbor classification. Technical report, Dept. of Statistics, Stanford University, 1994.
[15] K. Fukunaga. Introduction to Statistical Pattern Recognition. San Diego: Academic, 1990.
[16] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res., vol. 62, pp. 4963-4967, 2002.
[17] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks, vol. 12, no. 2, pp. 181-201, March 2001.
[18] E. Pekalska, P. Paclik, and R. P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, vol. 2, pp. 175-211, 2001.
[19] V. Roth and V. Steinhage. Nonlinear discriminant analysis using kernel functions. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA: MIT Press, 2000, pp. 568-574.
[20] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks, vol. 10, pp. 1000-1017, September 1999.
[21] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, vol. 10, pp. 1299-1319, 1998.
[22] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, vol. 1, pp. 203-209, 2002.
[23] L. J. van 't Veer, H. Dai, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, vol. 415, pp. 530-536, 2002.
[24] M. West, C. Blanchette, H. Dressman, E. Huang, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. USA, vol. 98, pp. 11462-11467, 2001.
[25] H. Xiong, M. N. S. Swamy, and M. O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks, vol. 16, pp. 460-474, March 2005.
[26] A. B. A. Graf, A. J. Smola, and S. Borer. Classification in a normalized feature space using support vector machines. IEEE Trans. on Neural Networks, vol. 14, no. 3, pp. 597-605, May 2003.