Improving prediction accuracy of tumor classification by reusing genes ...

BMC Genomics

BioMed Central

Open Access

Research

Improving prediction accuracy of tumor classification by reusing genes discarded during gene selection

Jack Y Yang1, Guo-Zheng Li*2,3, Hao-Hua Meng2, Mary Qu Yang4 and Youping Deng5

Address: 1Harvard Medical School, Harvard University, Cambridge, Massachusetts 02140-0888 USA, 2School of Computer Engineering & Science, Shanghai University, Shanghai 200072, China, 3Institute of Systems Biology, Shanghai University, Shanghai 200072, China, 4National Human Genome Research Institute, National Institutes of Health, U.S. Department of Health and Human Services, Bethesda, MD 20892, USA and 5Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, MS 39406, USA

Email: Jack Y Yang - [email protected]; Guo-Zheng Li* - [email protected]; Hao-Hua Meng - [email protected]; Mary Qu Yang - [email protected]; Youping Deng - [email protected]

* Corresponding author

from The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07)
Las Vegas, NV, USA. 25-28 June 2007

Published: 20 March 2008

BMC Genomics 2008, 9(Suppl 1):S3

doi:10.1186/1471-2164-9-S1-S3

The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07)

Jack Y Jang, Mary Qu Yang, Mengxia (Michelle) Zhu, Youping Deng and Hamid R Arabnia

This article is available from: http://www.biomedcentral.com/1471-2164/9/S1/S3

© 2008 Yang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Since the high dimensionality of gene expression microarray data sets degrades the generalization performance of classifiers, feature selection, which selects relevant features and discards irrelevant and redundant features, has been widely used in the bioinformatics field. Multitask learning is a novel technique to improve prediction accuracy of tumor classification by using information contained in such discarded redundant features, but which features should be discarded or used as input or output remains an open issue.

Results: We demonstrate a framework for automatically selecting features to be input, output, and discarded by using a genetic algorithm, and propose two algorithms: GA-MTL (Genetic algorithm based multi-task learning) and e-GA-MTL (an enhanced version of GA-MTL). Experimental results demonstrate that this framework is effective at selecting features for multitask learning, and that GA-MTL and e-GA-MTL perform better than other heuristic methods.

Conclusions: Genetic algorithms are a powerful technique to select features for multi-task learning automatically; GA-MTL and e-GA-MTL are shown to improve generalization performance of classifiers on microarray data sets.
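As a rough illustration of the idea stated in the abstract, the following minimal sketch shows one way a genetic algorithm could represent the assignment of each gene to one of three roles (input, extra output, or discarded). The three-valued encoding, the role codes, and all function names are assumptions made for illustration and are not taken from the paper; the fitness evaluation (training a multitask learner and scoring it on held-out samples) is deliberately left out.

```python
import random

# Hypothetical role codes; the encoding used by GA-MTL / e-GA-MTL may differ.
DISCARD, INPUT, OUTPUT = 0, 1, 2
ROLES = (DISCARD, INPUT, OUTPUT)


def random_chromosome(n_genes, rng):
    """Assign each gene one of the three roles uniformly at random."""
    return [rng.choice(ROLES) for _ in range(n_genes)]


def decode(chromosome):
    """Split gene indices into (inputs, extra_outputs, discarded)."""
    inputs = [i for i, role in enumerate(chromosome) if role == INPUT]
    outputs = [i for i, role in enumerate(chromosome) if role == OUTPUT]
    discarded = [i for i, role in enumerate(chromosome) if role == DISCARD]
    return inputs, outputs, discarded


def mutate(chromosome, rate, rng):
    """Redraw each gene's role at random with probability `rate`."""
    return [rng.choice(ROLES) if rng.random() < rate else role
            for role in chromosome]


def crossover(parent_a, parent_b, rng):
    """One-point crossover of two equal-length chromosomes (length >= 2)."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]
```

In a full search loop, each decoded chromosome would be scored by training a multitask model that takes the `inputs` genes as features and predicts the tumor class together with the `outputs` genes as auxiliary targets; that scoring step depends on the learner and is not sketched here.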

Background

Tumor classification is performed on microarray data collected by DNA microarray experiments from tissue and cell samples [1-3]. The wealth of such data for different stages of the cell cycle aids in the exploration of gene interactions and in the discovery of gene functions. Moreover, genome-wide expression data from tumor tissues gives insight into the variation of gene expression across tumor types, thus providing clues for tumor classification of individual samples. The output of a microarray experiment is summarized as a p × n data matrix, where p is the number of tissue or cell samples and n is the number of genes. Here n is always much larger than p, which degrades the generalization performance of most classification methods. To overcome this problem, feature selection methods are applied to reduce the dimensionality from n to k with k
