An Ensemble Feature Extraction Classifier for the Analysis of Integrated Data of HCV-HCC Related DNA Microarray

Salwa Eid, College of Engineering and Technology, Arab Academy for Science and Technology, Cairo, Egypt ([email protected])
Aliaa Youssif, College of Engineering and Technology, Arab Academy for Science and Technology, Cairo, Egypt ([email protected])
Samar Kassim, Faculty of Science, Ain Shams University, Cairo, Egypt ([email protected])
Waleed Fakhr, College of Engineering and Technology, Arab Academy for Science and Technology, Cairo, Egypt ([email protected])
Abstract—Hepatocellular carcinoma (HCC) is one of the leading causes of cancer-related deaths worldwide. In most cases, patients are first infected with the hepatitis C virus (HCV), which then progresses to HCC. HCC is usually diagnosed in its advanced stages, when it is more difficult to treat or cure. Early diagnosis increases the survival rate, as treatment options are available for early stages. Therefore, accurate biomarkers for early HCC diagnosis are needed. DNA microarray technology has been widely used in cancer research: scientists study DNA microarray gene expression data to identify cancer gene signatures, which helps in early cancer diagnosis and prognosis. Most studies are done on single datasets, and the resulting biomarkers fit only those datasets; when tested on other datasets, classification is poor. In this paper, we combined four different datasets of liver tissue samples (100 HCV-cirrhotic tissues and 61 HCV-cirrhotic tissues from patients with HCC). Differentially expressed genes were studied using high-density oligonucleotide arrays. By analyzing the data, an ensemble feature extraction classifier was constructed and used to distinguish HCV samples from HCV-HCC related samples. We identified a generic gene signature that predicts whether an HCV-cirrhotic tissue is also affected by HCC.

Keywords-DNA microarray, HCV, HCC, integrative analysis, feature selection, ensemble, classifiers, SVM-RFE, LASSO regression
I. INTRODUCTION
It was estimated that in 2002 liver cancer was the sixth most common cancer type, with 626,162 cases, and the third leading cause of cancer death, with an estimated 598,321 deaths in that year [1]. Egypt was reported to account for about 4.7% of HCC patients [2]. HCV and HBV are the main risk factors for the development of HCC ([3]-[5]). Egypt has the highest prevalence of HCV worldwide, and up to 90% of HCC cases in Egyptian patients are caused by HCV [6]. HCC's only curative treatments are surgical resection and liver
transplantation ([7]-[9]). These treatments are only applicable to candidates in the early stages of HCC. Detecting HCC earlier and following up with appropriate treatment can reduce the number of deaths caused by these tumours [10]. Serum α-fetoprotein determination and ultrasonography are used for HCC diagnosis. Although serum α-fetoprotein determination is cheaper, it is not always a reliable biomarker for HCC, especially for HCV-related HCC. In a series of 606 HCC patients, normal serum α-fetoprotein was observed in 40.4% of patients with small HCC tumours, in 24.1% of patients with tumours 2 to 3 cm in diameter, and in 27.5% of patients with 3 to 5 cm tumours [11]. Ultrasonography, for its part, has been described as highly user-dependent [12]. Therefore, improved biomarkers for early and accurate detection of HCC are needed. Several studies have been done on HCV and HCV-HCC related samples, and gene sets were identified that could be useful as diagnostic tools. Most studies were assessed on single datasets via cross-validation or tested on a small test set. It has been noticed that each study conducted using DNA microarrays on HCV and HCV-HCC related samples resulted in a different gene signature, with no overlap, or only a small one, among the genes reported as informative. These results make it difficult to identify the most predictive genes for detecting HCC in HCV patients. This is due to differences in array platforms, experiments and experimental conditions, and to the underlying biological heterogeneity of the disease. Therefore, validation on a large combined dataset is still missing. The present study combined four datasets and focused on finding biomarkers that can be used for the early detection of HCC caused by HCV. We investigated whether gene classifiers derived from two datasets using different array platforms could be independently validated by application to the other datasets.
In this paper we concentrated on finding new biomarkers that would classify samples into two classes: HCV and HCV-HCC related. When creating a classifier, two aspects are important: selecting features that are informative and choosing a classifier that performs well. In Section 2 we briefly discuss data integration, which is followed by a description of existing algorithms in Section 3. In Section 4 we propose a novel ensemble feature extraction classifier. The experiment and results are presented in Sections 5 and 6, respectively. Conclusions and suggested future work are then presented in Section 7.
II. DATA INTEGRATION
Due to the differences in research results on individual datasets, analysis of integrated datasets is needed. The integration of multiple datasets promises to yield more reliable and accurate results: the probability of the biomarkers being overfitted to a single dataset is reduced when the research covers multiple datasets. There are two methods to combine inter-study microarray data at different levels ([13]-[16]). The first method is meta-analysis, which combines the results from individual datasets; it avoids the direct comparison of gene expression values while increasing the power to identify significantly expressed genes among them. The second method is direct integration of the common expression values among the datasets after dataset-specific transformation and normalization. The new gene expression values from each dataset are combined to increase the sample size, and the analysis is then done on the merged data.
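As an illustration of the second (direct integration) method, the sketch below merges several independently normalized expression matrices on their common probe sets. It is a minimal Python/pandas sketch, not the exact procedure used in this study, and the variable names (d1..d4) are hypothetical.

```python
import pandas as pd

def integrate_datasets(datasets):
    """Directly integrate several expression matrices (probe sets x samples).

    Each matrix is assumed to be already normalized within its own study;
    only the probe sets common to every dataset are kept before merging.
    """
    # Intersect probe-set identifiers across all studies
    common = set(datasets[0].index)
    for d in datasets[1:]:
        common &= set(d.index)
    common = sorted(common)

    # Concatenate samples column-wise over the shared probe sets
    return pd.concat([d.loc[common] for d in datasets], axis=1)

# Hypothetical usage: d1..d4 are DataFrames of log2 expression values
# merged = integrate_datasets([d1, d2, d3, d4])
```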
III. FEATURE EXTRACTION AND CLASSIFIERS
There are two phases in building a prediction model to differentiate between two conditions: gene reduction and choosing the most appropriate classifier. It can be considered an optimization problem: selecting the best features with the best classifier so as to give the minimum prediction error.

A. Feature Extraction

Logistic regression models are commonly used when working with HCV and HCV-HCC classes but should not be used when the number of predictor variables (p) exceeds the sample size (n) [17]. Three linear regression algorithms, Least Angle Regression (LAR), the Least Absolute Shrinkage and Selection Operator (LASSO) and Average Linear Regression (ALM), were evaluated for the prediction of classes on high-dimensional gene expression data by Yingdong Zhao [18]. It was demonstrated that LAR and LASSO perform quite well, in a similar manner to each other and better than ALM, on data without noise, and that LASSO performed best on data with noise. In such cases, penalized methods are thought to give better results ([19], [20]). The LASSO is a penalized method for estimating a logistic regression model when p > n [21]. The LASSO model is estimated using maximum likelihood [21]. Regression coefficients are calculated for all genes to minimize a weighted average of the mean squared prediction error for the training set plus the sum of the absolute values of all regression coefficients. It tends to assign zero coefficients to genes that are less informative. The weighting factor is then optimized by cross-validation [21]. LASSO algorithms attempt to avoid the over-fitting characteristic of least-squares linear regression when the number of variables is large compared to the number of cases [22].

Support Vector Machine Recursive Feature Elimination (SVM-RFE) is one of the methods used for feature selection in tumour problems. Isabelle Guyon illustrated that genes chosen by SVM-RFE performed better in prediction than those chosen by correlation techniques [23]. This method uses an SVM classifier trained on the data. Genes are then ranked according to their contribution to the prediction performance. The SVM algorithm uses a weighted linear combination of the gene expressions as a discriminator between the two classes. Genes that have a low absolute weight in the linear combination are removed, and a new SVM classifier is created with the remaining genes. The process is repeated until the desired number of genes is reached.

B. Classifiers

Hedenfalk used a compound covariate classifier to predict the BRCA1 and BRCA2 mutation status of breast cancer specimens [24]. The Compound Covariate Predictor is a weighted linear combination of log-ratios for genes that are univariately significant at a specified level. A two-sample t-test is performed; that is, differentially expressed genes whose log-expression ratios best discriminate between the two classes and meet the specified level α are selected. A single compound covariate is then constructed from the differentially expressed genes for class prediction. The two-sample t-statistic of each differentially expressed gene acts as its weight in the compound covariate. Thus, the value of the compound covariate for sample i is

Ci = Σj tj xij    (1)
where tj is the t-statistic for the two-group comparison of labels with respect to gene j, xij is the log-ratio measured in sample i for gene j, and the sum is over all differentially expressed genes. After the value of the compound covariate is computed for each sample in the training set, a classification threshold Ct is calculated:

Ct = (C1 + C2) / 2    (2)

where C1 and C2 are the mean values of the compound covariate for training samples with class label 1 and class label 2, respectively; Ct is the midpoint of the means of the two classes. A new specimen is predicted to be of class 1 if its compound covariate is closer to C1 and of class 2 if its value is closer to C2 [25].
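A minimal sketch of equations (1) and (2), assuming samples-by-genes log-ratio matrices for the two classes; this illustrates the published rule rather than any particular tool's exact implementation.

```python
import numpy as np
from scipy import stats

def compound_covariate_fit(X1, X2, alpha=0.001):
    """Fit the compound covariate predictor of equations (1) and (2).

    X1, X2: (samples x genes) log-ratio matrices for class 1 and class 2.
    Returns the t-statistic weights, the selected gene indices, the
    threshold Ct, and the per-class means C1 and C2.
    """
    t, p = stats.ttest_ind(X1, X2, axis=0)   # two-sample t-test per gene
    selected = np.where(p < alpha)[0]        # univariately significant genes
    w = t[selected]                          # t-statistics act as weights

    c1 = (X1[:, selected] @ w).mean()        # mean compound covariate, class 1
    c2 = (X2[:, selected] @ w).mean()        # mean compound covariate, class 2
    ct = (c1 + c2) / 2.0                     # threshold: midpoint of the means
    return w, selected, ct, c1, c2

def compound_covariate_predict(x, w, selected, c1, c2):
    """Assign a new sample to the class whose mean covariate is closer."""
    c = x[selected] @ w                      # Ci = sum_j tj * xij
    return 1 if abs(c - c1) < abs(c - c2) else 2
```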
Diagonal Linear Discriminant Analysis is similar to the Compound Covariate Predictor. It has the same prediction rule but assumes equal prior probabilities for the two classes [26]. It also ignores the correlations among the genes to avoid overfitting the data. Dudoit et al. reported that simple classifiers such as linear discriminant analysis and the nearest neighbour performed as well as much more sophisticated methods such as aggregated classification trees [27]. The K-Nearest Neighbour Predictor depends on majority voting among the k nearest neighbours: a sample is assigned to the class that receives the most votes among the k training samples closest to its expression profile. Euclidean distance is used as the distance metric.
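For concreteness, a short scikit-learn sketch of the k-nearest-neighbour vote with a Euclidean metric; the toy data here are random and purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 38))            # toy data: 20 samples x 38 genes
y_train = np.repeat(["HCV", "HCV-HCC"], 10)    # class labels

# Majority vote among the k nearest training samples, Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict(rng.normal(size=(2, 38))))   # predicted classes for 2 new samples
```

The same module's NearestCentroid (with its shrink_threshold parameter) gives a shrunken-centroid rule close to the PAM method discussed next.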
The Nearest Centroid Predictor calculates the centroid of each class, which is the average expression of each gene in that class divided by the within-class standard deviation for that gene. When predicting a new sample, the gene expression of that sample is compared to all of the class centroids, and the class whose centroid is closest in squared distance is the predicted class for that sample. Prediction Analysis of Microarrays (PAM) is another method; it uses the shrunken centroid algorithm developed by Tibshirani [28]. It is a modification of the Nearest Centroid method: each class centroid is shrunk toward the overall centroid of all classes by an amount set by a user-specified threshold. The new sample is then classified by the usual nearest centroid rule, but using the shrunken class centroids. The Support Vector Machine (SVM) is another effective prediction method from machine learning. We used SVMs with linear kernel functions: a linear function of the log-intensities that best differentiates the data, subject to penalty costs on the number of misclassified samples. The random forest is an alternative class predictor. It is an ensemble learner constructed from many classifiers and was developed by Breiman [29]. It combines the results of all the classifiers used, is robust against overfitting and usually performs better than individual classifiers. It consists of multiple random tree classifiers that all vote on the classification of a given input; an input is assigned to the class that receives the most votes, and the features are selected randomly. In a study by Mas et al., the random forest was used for classification and feature selection [30]; fifteen probe sets were identified to classify between HCV and HCV-HCC related samples.

IV. PROPOSED ALGORITHM

In the current study, we aimed to generate a large enough dataset to create a highly generalizable set of minimum signatures that would be able to differentiate between the HCV
and HCV-HCC classes. We integrated the datasets and used an ensemble feature extraction classifier. The creation of predictive signatures consisted of two steps: eliminating the less informative genes, and selecting a predictive model with high performance in correctly predicting the class of an HCV or an HCV-HCC related sample. Each dataset was read and normalized independently, and the datasets were then merged together. We extracted the features using two different methods in two stages. Using the LASSO regression model, we determined the differentially expressed genes between the two classes. We then took those differentially expressed genes and used SVM-RFE to reduce their number, ending with the genes that have the highest discriminatory ability to correctly classify a sample. Cross-validation was performed at each stage to validate the differentially expressed genes obtained. We then used class prediction methods to determine whether the genes selected in the feature extraction phase accurately identified new HCV and new HCV-HCC related samples; cross-validation was also performed here. The algorithm is summarized in Fig. 1, and a code sketch of the two-stage pipeline follows it.
FIGURE 1. PROPOSED MODEL: (1) read the .CEL files of each dataset; (2) normalize and filter each dataset; (3) merge the datasets into one combined dataset; (4) feature selection, stage 1 (informative genes subset 1); (5) feature selection, stage 2 (informative genes subset 2); (6) classifier.
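The pipeline in Fig. 1 could be sketched with scikit-learn roughly as follows. This is an assumption-level approximation rather than the exact tooling used in this study: an L1-penalized logistic regression stands in for the LASSO stage, and recursive feature elimination over a linear SVM stands in for SVM-RFE.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Stage 1: L1-penalized (LASSO-style) logistic regression keeps only the
# genes that receive non-zero coefficients.
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0))

# Stage 2: SVM-RFE recursively drops the genes with the smallest absolute
# weight in a linear SVM until the desired number (here 20) remains.
svm_rfe = RFE(SVC(kernel="linear"), n_features_to_select=20, step=1)

# Final classifier: a random forest, the best performer in our experiments.
model = Pipeline([
    ("lasso", lasso),
    ("svm_rfe", svm_rfe),
    ("clf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
# Hypothetical usage: model.fit(X_train, y_train); model.predict(X_test)
```

In practice, the gene subset produced by each stage would be validated by cross-validation, as described in the experiments below.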
V. EXPERIMENT AND RESULTS
A. Data Sets

A summary of the four datasets is given in Table I. The four datasets were obtained from the Gene Expression Omnibus (GEO) database, publicly available over the internet [31], each from a different institution. The raw data (.CEL files) of each dataset were imported into BRB-ArrayTools [32]. Only the HCV samples were retrieved from the first dataset. From the second dataset we used only the HCV samples and the early stage and very early stage HCC samples. For the third dataset we read the whole data, which represented HCV and HCV-HCC related samples only. From the fourth dataset we retrieved only the HCV-HCC related samples.
DATA SETS SUMMARY
D1
D2
D3
D4
GEO ID #
GSE14323
GSE6764
GSE17967
GSE19665
Institution
Virginia Commonw ealth University
Mount Sinai School of Medicine
Virginia Commonw ealth University
University of Tokyo
Pubmed ID *
19098997
17393520
19861515
20345479
Chip type
Affymetrix Human Genome U133A Array
Affymetrix Human Genome U133 Plus 2.0 Array
Affymetrix Human Genome U133A 2.0 Array
Affymetrix Human Genome U133 Plus 2.0 Array
58
31
63
5
Number of Samples
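The four series in Table I can also be fetched programmatically; below is a sketch using the third-party GEOparse package. This is an alternative, assumption-level route: the study itself imported the raw .CEL files into BRB-ArrayTools rather than the processed series matrices that GEOparse returns.

```python
import GEOparse  # third-party package for fetching GEO records

# Fetch each series listed in Table I (series-matrix level)
accessions = ["GSE14323", "GSE6764", "GSE17967", "GSE19665"]
series = {acc: GEOparse.get_GEO(geo=acc, destdir="./data") for acc in accessions}

# Probe-set x sample expression table for one series
expr = series["GSE14323"].pivot_samples("VALUE")
print(expr.shape)
```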
B. Normalization and Filtering

Each dataset was normalized independently using the GC-RMA algorithm ([33], [34]). Arrays in each dataset are normalized to a reference array, which is the array whose median log-intensity value is the median over all median log-intensity values for the set of arrays. GC-RMA adjusts for background intensities that include optical noise and non-specific binding: a background correction on the perfect match (PM) probes is done, followed by quantile normalization, and the probe-set summaries are then obtained using Tukey's median polish algorithm [35]. Each dataset was filtered so that the Affymetrix control probe sets were excluded. The 22,215 RMA probe-set expression summaries common across the four datasets were merged manually.

C. Simulation

Six experiments were done; details are shown in Table II. The first three are individual analyses and the other three are integrative analyses. We applied the LASSO to the training data in each experiment. We tried different values of k for the cross-validation and chose k = 10, as it gave us the lowest prediction error. Each time, 10% of the samples are omitted and the model is built using the remaining 90% of the samples; the prediction errors are recorded for the withheld samples. This is done 10 times, omitting each of the 10 subsets in turn, and the errors for the samples in each subset are totalled into an overall error. We avoided leave-one-out cross-validation so as not to overfit the model to the training data. The predicted class was HCV-HCC related if the fitted probability was greater than or equal to 0.5, and HCV otherwise. Exp4 gave the best results, and the genes included in the LASSO model for Exp4, along with their coefficients and a percentage of cross-validation (CV) support, which indicates the stability of gene selection across the loops of the cross-validation, are shown in Table III. We therefore continued to work with Exp4. We applied SVM-RFE to the 38 genes obtained from the LASSO model on Exp4. We set the number of features to 20 and, using these features, constructed the following classifiers: PAM, compound covariate predictor, diagonal linear discriminant analysis, nearest neighbour predictor for k = 1 and k = 3, nearest centroid predictor, support vector machine predictor and random forest. We then repeated the process for 10, 7 and 5 genes and compared the results.
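A sketch of the 10-fold cross-validation just described, assuming numpy arrays with class labels coded 0 for HCV and 1 for HCV-HCC related, and using an L1-penalized logistic regression as a stand-in for the LASSO fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def cv_error(X, y, k=10):
    """10-fold CV: hold out 10% of samples, fit on the remaining 90%, and
    call a sample HCV-HCC related when the fitted probability is >= 0.5."""
    errors = 0
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train, test in folds.split(X, y):
        model = LogisticRegression(penalty="l1", solver="liblinear")
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[test])[:, 1]   # P(class = HCV-HCC)
        errors += np.sum((prob >= 0.5).astype(int) != y[test])
    return errors / len(y)                           # overall error rate
```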
TABLE II. SUMMARY OF LASSO MODEL RESULTS FOR ALL EXPERIMENTS

Exp    Training  Testing      Cross validation  Correct Classification  Number of genes
Exp1   D1        D2, D3, D4   88%               70%                     14
Exp2   D2        D1, D3, D4   83%               73%                     19
Exp3   D3        D1, D2, D4   100%              74%                     11
Exp4   D2, D3    D1, D4       100%              95%                     38
Exp5   D1, D2    D3, D4       88%               86%                     8
Exp6   D1, D3    D2, D4       87%               94%                     6

VI. RESULTS
A. Comparison of integrative analysis with individual analysis for feature selection

The three integrative-analysis experiments gave better results than the three individual-analysis experiments. The sensitivity obtained was 38.5%, 40%, 13.5%, 91%, 78% and 95% for Exp1, Exp2, Exp3, Exp4, Exp5 and Exp6, respectively, and the specificity obtained was 90%, 98%, 100%, 98%, 100% and 94%, respectively. The sensitivity for the individual analyses was quite low but was high for the integrative analyses. We also calculated the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for four of the experiments as an evaluation. The ROC curve is a graphical plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity); it illustrates the quality of a classifier's performance. The greater the AUC, the better the classifier performs, and an AUC of 1 means a perfect classifier. The AUC values were 0.642, 0.691, 0.566 and 0.942 for Exp1, Exp2, Exp3 and Exp4, respectively. As observed, Exp4 has the highest AUC, reflecting that it is the best model.
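This ROC/AUC evaluation can be reproduced with scikit-learn; the labels and scores below are illustrative placeholders, not our experimental values.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: 1 = HCV-HCC related, 0 = HCV; y_score: fitted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.35, 0.2, 0.7, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # 1-specificity, sensitivity
print("AUC =", roc_auc_score(y_true, y_score))      # 1.0 would be a perfect classifier
```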
TABLE III. GENES OBTAINED FROM LASSO MODEL (EXP4)

Probe Set     Symbol     Entrez ID  Coefficient  % CV Support
201262_s_at   BGN        633        0.00115      50
202053_s_at   ALDH3A2    224        -0.04653     20
202598_at     S100A13    6284       0.47828      90
202910_s_at   CD97       976        0.526        90
202917_s_at   S100A8     6279       0.07493      60
202953_at     C1QB       713        0.13903      40
203016_s_at   SSX2IP     117178     -0.27354     90
203582_s_at   RAB4A      5867       -0.09397     10
203717_at     DPP4       1803       -0.34647     100
203900_at     KIAA0467   23334      -0.27724     30
204304_s_at   PROM1      8842       0.60033      100
204417_at     GALC       2581       0.13688      50
204513_s_at   ELMO1      9844       -0.10681     40
204811_s_at   CACNA2D2   9254       -0.18859     50
204840_s_at   EEA1       8411       -0.28478     90
205114_s_at   NA         NA         0.60932      100
205171_at     PTPN4      5775       -0.26679     60
205917_at     ZNF264     9422       -0.35938     100
206561_s_at   AKR1B10    57016      -0.24597     100
207836_s_at   RBPMS      11030      1.29121      100
208695_s_at   RPL39      6170       -2.63319     100
208733_at     NA         NA         -0.2691      80
209481_at     SNRK       54861      -0.21413     30
209488_s_at   RBPMS      11030      0.22174      90
209777_s_at   SLC19A1    6573       -0.03816     20
210054_at     HAUS3      79441      -0.16743     50
212663_at     FKBP15     23307      0.06006      20
212742_at     RNF115     27246      -0.08253     70
212758_s_at   ZEB1       6935       -0.3381      90
212953_x_at   CALR       811        0.75398      80
213508_at     C14orf147  171546     -0.01653     20
214315_x_at   CALR       811        0.06269      70
214972_at     MGEA5      10724      -1.31245     90
218632_at     HECTD3     79654      0.07716      10
219065_s_at   NA         NA         -0.22339     40
219267_at     GLTP       51228      0.12276      40
219848_s_at   ZNF432     9668       -0.117       30
221928_at     ACACB      32         0.11383      40
B. Comparison of different classifiers using the selected features

As mentioned before, we set different numbers of features for the SVM-RFE and compared them by applying several predictors. The results are shown in Fig. 2: the 20-feature model gives the lowest misclassification error among all the classifiers when compared with the 10-, 7- and 5-gene models, and the random forest has the lowest misclassification error across the 20-, 10-, 7- and 5-gene models. We also plotted the ROC curves for all the predictors using the 20 genes. As Fig. 3 shows, the performance is high: the AUC obtained for every classifier is above 0.9, with the random forest scoring the highest, an AUC of 0.942.

FIGURE 2. COMPARISON OF DIFFERENT CLASSIFIERS (20, 10, 7 AND 5 GENES). Percentage correctly classified (70%-100%) against number of genes (20, 10, 7, 5) for PAM, random forest, compound covariate predictor, diagonal linear discriminant analysis, 1-nearest neighbour, 3-nearest neighbour, nearest centroid and support vector machine.

FIGURE 3. ROC CURVES FOR THE DIFFERENT CLASSIFIERS (20 GENES).

VII. CONCLUSION AND FUTURE WORK

Our aim was to develop new, valid biomarkers for HCV cirrhotic patients with or without HCC. Most studies have been done on single datasets, and the resulting classifiers and selected features were overfitted to those datasets. Here, our study combined different datasets from different institutions, and an ensemble feature extraction classifier was constructed. We filtered out the less informative genes in two stages: we reduced the 22,215 genes to 38 genes in the first stage using the LASSO regression model, and then reduced the 38 genes to 20, 10, 7 and 5 genes using SVM-RFE in the second stage. We evaluated the signatures by applying the different classifiers to validate the genes for the four different models. The 20-feature model performed best overall, with an average of 93.5% correct classification. We assessed the classifiers using the ROC curve and the AUC; the AUC showed that the random forest was the best classifier, with a value of 0.942. The random forest also did quite well using the 10, 7 and 5 genes. Most studies have reached biomarkers that perform well on certain datasets but perform poorly when tested on different datasets. Our biomarkers were tested on two other datasets and the misclassification rate was very low, so the signatures identified in this research are general and not overfitted to a single dataset. The 38 features, in combination with the classifiers used in this paper and other classifiers, could be tested on other datasets. These genes could be tested clinically and turned into real diagnostic biomarkers. It can be of great importance to identify signatures that indicate the presence of HCC in cirrhotic tissues, which could help in the early diagnosis of cancer, giving a better chance of treatment and survival.

REFERENCES
[1] Parkin DM, Bray F, Ferlay J, Pisani P. Estimating the world cancer burden: Globocan 2000. Int. J. Cancer 2001;94:153-156.
[2] El-Zayadi AR, Abaza H, Shawky S, Mohamed MK, Selim OE, Badran HM. Prevalence and epidemiological features of hepatocellular carcinoma in Egypt: a single center experience. Hepatol Res 2001;19(2):170-179.
[3] Block TM, Mehta AS, Fimmel CJ, Jordan R. Molecular viral oncology of hepatocellular carcinoma. Oncogene 2003;22:5093-5107.
[4] Buendia MA. Hepatitis B viruses and cancerogenesis. Biomed Pharmacother 1998;52:34-43.
[5] Colombo M. The role of hepatitis C virus in hepatocellular carcinoma. Recent Results Cancer Res 1998;154:337-344.
[6] Goldman R, Ressom HW, Abdel-Hamid M, et al. Candidate markers for the detection of hepatocellular carcinoma in low-molecular-weight fraction of serum. Carcinogenesis 2007;28(10):2149-2153.
[7] Marsh JW, Dvorchik I. Liver organ allocation for hepatocellular carcinoma: are we sure? Liver Transpl 2003;9:693-696.
[8] Fisher RA, et al. Is hepatic transplantation justified for primary liver cancer? J Surg Oncol 2007;95:674-679.
[9] Barnett CC Jr, Curley SA. Ablative techniques for hepatocellular carcinoma. Semin Oncol 2001;28:487-496.
[10] Llovet JM, Burroughs A, Bruix J. Hepatocellular carcinoma. Lancet 2003;362:1907-1917.
[11] Nomura F, Ohnishi K, Tanabe Y. Clinical features and prognosis of hepatocellular carcinoma with reference to serum alpha-fetoprotein levels: analysis of 606 patients. Cancer 1989;64:1700-1707.
[12] Beale G, Chattopadhyay D, Gray J, et al. AFP, PIVKAII, GP3, SCCA-1 and follisatin as surveillance biomarkers for hepatocellular cancer in non-alcoholic and alcoholic fatty liver disease. BMC Cancer 2008;8:200.
[13] Ghosh D, et al. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Functional & Integrative Genomics 2003;3:180-188.
[14] Rhodes DR, et al. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 2002;62(15):4427-4433.
[15] Choi JK, et al. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 2003;19(Suppl. 1):i84-i90.
[16] Stevens J, Doerge RW. Combining Affymetrix microarray results. BMC Bioinformatics 2005;6(1):57.
[17] Archer KJ, Mas VR, David K, Maluf DG, Bornstein K, Fisher RA. Identifying genes for establishing a multigenic test for hepatocellular carcinoma surveillance in hepatitis C virus-positive cirrhotic patients. Cancer Epidemiol Biomarkers Prev 2009;18(11):2929-2932.
[18] Zhao Y, Simon R. Development and validation of predictive indices for a continuous outcome using gene expression profiles. Cancer Informatics 2010;9:105-114.
[19] Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 2005;21:3001-3008.
[20] Ma S, Song X, Huang J. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics 2007;8:60.
[21] Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med 1997;16:385-395.
[22] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics 2004;32(2):407-499.
[23] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning 2002;46(1-3):389-422.
[24] Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, et al. Gene-expression profiles in hereditary breast cancer. N Engl J Med 2001;344:539-548.
[25] Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2002;9:505-512.
[26] Morrison DF. Multivariate Statistical Methods. New York: McGraw-Hill, 1967, pp. 130-133.
[27] Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Amer Statist Assoc 2002;97:77-87.
[28] Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002;99:6567-6572.
[29] Breiman L. Random forests. Machine Learning 2001;45:5-32.
[30] Mas VR, Maluf DG, Archer KJ, Yanek K, Kong X, Kulik L, Freise CE, Olthoff KM, Ghobrial RM, McIver P, Fisher R. Genes involved in viral carcinogenesis and tumor initiation in hepatitis C virus-induced hepatocellular carcinoma. Mol Med 2009;15:85-94.
[31] U.S. National Library of Medicine, National Center for Biotechnology Information, GEO DataSets. [Online]. Available: http://www.ncbi.nlm.nih.gov/gds/ [Accessed: 30 Sept. 2011].
[32] Richard Simon, National Cancer Institute, Biometric Research Branch, BRB-ArrayTools, 2002. [Online]. Available: http://linus.nci.nih.gov/BRB-ArrayTools.html#content [Accessed: 30 Sept. 2011].
[33] Simon R, Lam A, Li MC, Ngan M, Menenzes S, et al. Analysis of gene expression data using BRB-Array Tools. Cancer Inform 2007;3:11-17.
[34] Wu Z, Irizarry RA. Preprocessing of oligonucleotide array data. Nat Biotechnol 2004;22:656-658.
[35] Irizarry RA, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 2003;31(4):e15.