
Supervised Classification of Viral Genomes based on Restriction Site Distribution

Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo
Department of Computer Science, Université du Québec à Montréal, P.O. Box 8888 Downtown Station, Montreal, Quebec, H3C 3P8, Canada. [email protected]

Abstract. Over the last decade, advances in sequencing technologies have led to a better knowledge of the genomic and taxonomic characteristics of viruses. Given the volume of newly sequenced genomes in metagenomic and viral multi-infection data, it is important to provide efficient methods to genotype and classify the viruses involved (identify the type, class, species and/or genus). Molecular biology techniques such as the one based on Restriction Fragment Length Polymorphism (RFLP) [1] are powerful, but too limited and expensive to be applied to thousands of genomes. Here, we modelled the RFLP technique to fit a computational framework. We propose an original genotyping approach that applies supervised machine learning methods to restriction site fragment distributions. To this end, we combined a set of 516 different types of attributes on the restriction site distributions for 3 viral datasets (papillomaviruses (PV), hepatitis B viruses (HBV) and human immunodeficiency viruses (HIV)), containing more than 3000 whole genomes. We assessed the approach with 7 kinds of supervised classifiers, including decision-tree-based algorithms, SVM and KNN, using 10-fold cross-validation. The classification performance on divergent viral sequences (inter-viral) and conserved viral sequences (inter-genus and inter-species) shows correct prediction rates of 96% for inter-species classification, and 99% for inter-genus as well as inter-viral classification, in PV genomes. Similar trends were found in HBV and HIV. With its high prediction rates, robustness and speed, such an approach will be essential in all large-scale viral studies.
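To make the idea of restriction site fragment distributions concrete, the following is a minimal sketch of an in-silico digestion that turns a genome into a small feature vector. It is not the authors' implementation: the three enzymes, the four summary statistics, the function names and the toy sequence are all illustrative assumptions, and the genome is treated as linear (PV and HBV genomes are circular, which a fuller model would have to handle). The 516 attribute types mentioned in the abstract are not specified here and are not reproduced.

```python
from statistics import mean

# Recognition sequences and cut offsets of three well-known enzymes.
# Which enzymes to use is an assumption made for illustration only.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # cuts G^AATTC
    "BamHI": ("GGATCC", 1),    # cuts G^GATCC
    "HindIII": ("AAGCTT", 1),  # cuts A^AGCTT
}


def cut_positions(seq, site, offset):
    """Return the cut coordinate for every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)  # step by 1 to allow overlapping sites
    return positions


def fragment_lengths(seq, site, offset):
    """Digest a linear sequence and return fragment lengths in 5'->3' order."""
    bounds = [0] + cut_positions(seq, site, offset) + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


def restriction_features(seq):
    """Summarise the fragment distribution of each enzyme (hypothetical features)."""
    seq = seq.upper()
    features = {}
    for name, (site, offset) in ENZYMES.items():
        frags = fragment_lengths(seq, site, offset)
        features[f"{name}_n_sites"] = len(frags) - 1
        features[f"{name}_min_frag"] = min(frags)
        features[f"{name}_max_frag"] = max(frags)
        features[f"{name}_mean_frag"] = mean(frags)
    return features


# Toy sequence, not a real viral genome.
toy_genome = "ATGAATTCCGGGATCCTTAAGCTTGAATTCAA"
print(restriction_features(toy_genome))
```

Feature dictionaries of this kind, computed for each labelled genome, could then be passed to any supervised classifier and scored with 10-fold cross-validation, as the abstract describes.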

References
[1] Saiki, R.K., Scharf, S., Faloona, F., Mullis, K.B., Erlich, H.A., Arnheim, N. (1985): Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230 (4732): 1350-1354.

Keywords: CLASSIFICATION, VIRUS GENOMES, GENOTYPING, SUPERVISED LEARNING, KNOWLEDGE DISCOVERY
