
BIOINFORMATICS APPLICATIONS NOTE

Vol. 23 no. 6 2007, pages 783–784 doi:10.1093/bioinformatics/btl682

Databases and ontologies

AphidBase: a database for aphid genomic resources

Jean-Pierre Gauthier1, Fabrice Legeai2, Alain Zasadzinski3, Claude Rispe1 and Denis Tagu1,*

1INRA, Agrocampus Rennes, UMR 1099 BiO3P (Biology of Organisms and Populations applied to Plant Protection), F-35653 LE RHEU, France, 2INRA, URGI - Genoplante Info, Infobiogen, 523 place des Terrasses, F-91000 Evry, France and 3CNRS INIST, 2, allée du Parc de Brabois, 54500 Vandoeuvre-lès-Nancy, France

Received and revised on October 13, 2006; accepted on January 5, 2007
Advance Access publication January 19, 2007
Associate Editor: Alvis Brazma

ABSTRACT

AphidBase aims to (i) store recently acquired genomic resources on aphids and (ii) compare them with other insect resources as functional annotation tools. To this end, the Drosophila melanogaster genome has been loaded into the database using the GMOD open source software, for comparison with the 17 069 pea aphid unique transcripts (contigs) and the 13 639 gene transcripts of Anopheles gambiae. Links to the FlyBase and A.gambiae Entrez databases allow rapid characterization of the putative functions of the aphid sequences. Text mining of the D.melanogaster literature was performed to construct a network of co-cited gene or protein names, which should facilitate functional annotation of homologous aphid sequences. AphidBase is one of the first genomic databases for a hemipteran insect.
Availability: http://w3.rennes.inra.fr/AphidBase
Contact: [email protected], [email protected]

1 INTRODUCTION

Since the sequencing of the Drosophila melanogaster genome, several other insect genomes have been or are currently being sequenced. These new genomic resources require the development of databases, not only to store and analyse each insect genome separately but also to support interspecies comparisons. Aphids are plant-sucking insects that feed on the fluids circulating in the plant phloem. These hemipterans diverged from other insects 300 million years ago (Grimaldi and Engels, 2005) and are responsible for considerable damage worldwide to cultivated and ornamental plants. Genomic tools have recently been developed for aphids, mainly for the pea aphid Acyrthosiphon pisum. In the last 3 years, a large collection of ESTs (more than 60 000) has been developed, and the complete genome sequence is currently being released. AphidBase is a project of the International Aphid Genomic Consortium, intended as a resource for users to analyse, compare, retrieve and annotate aphid sequences in comparison with D.melanogaster and Anopheles gambiae.

*To whom correspondence should be addressed.

2 DATABASE STRUCTURE

We used the Generic Model Organism Database (GMOD; http://www.gmod.org/home) open source project for the administration of genomic databases. Among the GMOD tools, Chado (a database architecture on PostgreSQL; http://www.gmod.org/apollo-chado) and Gbrowse (a genomic data browser; Stein et al., 2002) were installed to set up and develop AphidBase. The complete D.melanogaster chromosome sequences (FlyBase version r4.2.1; Drysdale et al., 2005), 13 639 A.gambiae gene transcripts (MOZ2a) and 17 069 A.pisum putative unique transcribed sequences (or contigs; see Sabater-Muñoz et al., 2006) were retrieved. These 17 069 A.pisum contigs were assembled from 53 190 ESTs, forming the updated ‘v5’ version of the contigs (Sabater-Muñoz et al., 2006 and unpublished data). To compare A.pisum and A.gambiae sequences with those of D.melanogaster, tblastx comparisons were performed between the A.pisum contigs or A.gambiae gene transcripts and the D.melanogaster genomic sequence. A.pisum or A.gambiae sequences matching fly sequences were displayed on the D.melanogaster genome. For A.pisum, 12 280 (72%) contigs had no match to D.melanogaster sequences (e-value = 10^-6) and 4789 (28%) were homologous to D.melanogaster sequences. In addition, 5551 (32%) A.pisum contigs were homologous to A.gambiae sequences, and 2922 (17%) matched both D.melanogaster and A.gambiae sequences. This low level of homology reflects the divergence between hemipterans and dipterans, as already discussed in Sabater-Muñoz et al. (2006).
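As a rough illustration of how the tblastx comparison described for the database structure can be turned into per-species match counts, the sketch below classifies contigs from tabular BLAST hits. It is a minimal sketch under stated assumptions, not the pipeline used for AphidBase: the 12-column tabular layout, the function names and the contig identifiers are all hypothetical, and only the 10^-6 e-value cutoff is taken from the text.

```python
E_VALUE_CUTOFF = 1e-6  # cutoff quoted in the text for D.melanogaster matches


def contigs_with_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs having at least one hit at or below the cutoff.

    Assumption: each line uses the common 12-column BLAST tabular layout,
    with the query ID in column 1 and the e-value in column 11.
    """
    matched = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= cutoff:
            matched.add(fields[0])
    return matched


def summarize(all_contigs, fly_hits, mosquito_hits):
    """Return {category: (count, percentage of all contigs)}."""
    total = len(all_contigs)
    counts = {
        "fly": len(fly_hits),
        "no_fly": len(set(all_contigs) - fly_hits),
        "mosquito": len(mosquito_hits),
        "both": len(fly_hits & mosquito_hits),
    }
    return {key: (n, round(100.0 * n / total, 1)) for key, n in counts.items()}


def hit(query, subject, e_value):
    """Build one toy tabular line; only query, subject and e-value matter here."""
    return "\t".join([query, subject, "50.0", "60", "30", "0",
                      "1", "180", "1000", "1179", e_value, "90.0"])


# Toy data with hypothetical contig and subject identifiers.
all_contigs = ["c1", "c2", "c3", "c4", "c5"]
fly_lines = [hit("c1", "2L", "1e-20"), hit("c1", "3R", "1e-8"),
             hit("c2", "X", "0.5"), hit("c3", "2R", "1e-6")]
mosquito_lines = [hit("c3", "t1", "1e-10"), hit("c4", "t2", "1e-7")]

fly_hits = contigs_with_hits(fly_lines)
mosquito_hits = contigs_with_hits(mosquito_lines)
summary = summarize(all_contigs, fly_hits, mosquito_hits)
print(summary)
```

A contig is counted once however many hits it has, which mirrors the way the text reports contigs rather than individual alignments.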

3 FUNCTIONAL ANNOTATION TOOLBOX

To facilitate functional annotation of the pea aphid sequences, for which very little data is available on gene and protein functions, several descriptions are proposed for each of the 4789 A.pisum contigs homologous to D.melanogaster and/or A.gambiae sequences. A direct link to the FlyBase report for each D.melanogaster gene was activated, to take advantage of the full molecular and genetic description of the fly genes homologous to aphid genes. In parallel, a similar direct link was set up for the A.gambiae gene transcripts annotated in the Ensembl Mosquito Transcript Report.

Finally, a contig report was created for each of the A.pisum sequences, containing a list of several features. First, the sequence of each contig was indicated, together with its EST composition and the cDNA libraries of origin. Second, the translation in the six reading frames was displayed, as well as the largest putative ORF identified by FrameD analysis (Schiex et al., 2003). Third, as an aphid transcript could have several matches to D.melanogaster genomic sequences, only the first hit was displayed on Gbrowse for the sake of clarity; the complete tblastx report was therefore included in the contig report. A major fraction (2872) of the 4789 pea aphid contigs homologous to D.melanogaster sequences had a single match (the one visualized by Gbrowse), but 1827 contigs still had between 2 and 7 different matches on the D.melanogaster genome (see ‘Global Report’). Fourth, the result of the Uniprot annotation (e-value = 10^-5; Sabater-Muñoz et al., 2006) was displayed.

Finally, a text mining analysis was set up, based on the large D.melanogaster literature. For this, a thesaurus of the D.melanogaster bibliography (about 37 000 bibliographical records from the Medline database) was constructed (i) by automatic extraction of gene names from the abstracts, (ii) by automated pattern recognition and natural language processing methods for names of genes and their products, and (iii) by compiling literature annotations available in various databases such as FlyBase, SwissProt or Entrez Gene. Co-citation clusters and networks were also constructed by automatic clustering approaches, and annotated with biological and functional information from Medline records (indexing keywords) and with the information available for D.melanogaster genes and their products in databases and ontologies. Each of the 4487 pea aphid contigs homologous to a D.melanogaster gene was thus linked to literature records, clusters and networks through the names of its D.melanogaster homologs. This bibliographic network of co-occurrence of cited genes in the literature is a quick and efficient tool for inferring the biological functions in which a given pea aphid contig might be involved.

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
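The idea of a co-citation network of gene names can be illustrated with a small sketch. It is a deliberately simplified stand-in, assuming plain word matching against a fixed list of names, whereas the text describes pattern recognition, natural language processing and automatic clustering. The abstracts, the contig identifier and the function names below are hypothetical toy values.

```python
import re
from collections import Counter
from itertools import combinations


def cocitation_counts(abstracts, gene_names):
    """Count how often each pair of gene names occurs in the same abstract.

    Simplification: names are matched as lower-cased whole words, which
    ignores synonyms and the case-sensitivity of real gene symbols.
    """
    names = {name.lower(): name for name in gene_names}
    pair_counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[a-z0-9\-]+", text.lower()))
        mentioned = sorted(names[t] for t in tokens if t in names)
        for a, b in combinations(mentioned, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def neighbours(pair_counts, gene, min_count=1):
    """Return (co-cited gene, count) pairs for `gene`, most frequent first."""
    linked = []
    for (a, b), n in pair_counts.items():
        if n >= min_count and gene in (a, b):
            linked.append((b if a == gene else a, n))
    return sorted(linked, key=lambda item: (-item[1], item[0]))


# Toy abstracts and a hypothetical contig-to-homolog mapping.
abstracts = [
    "hedgehog signalling requires patched and smoothened",
    "wingless and hedgehog pattern the segment",
    "patched restricts hedgehog movement",
]
gene_names = ["hedgehog", "patched", "wingless", "smoothened"]
contig_to_fly_homolog = {"contig_0001": "patched"}

counts = cocitation_counts(abstracts, gene_names)
for contig, homolog in contig_to_fly_homolog.items():
    print(contig, homolog, neighbours(counts, homolog))
```

Looking up a contig then amounts to taking the name of its fly homolog and listing the genes most often cited alongside it, which is the kind of functional hint the network is meant to provide.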

4 FUTURE

As soon as the sequence and assembly of the 530 Mb genome of A.pisum are available, gameXML files of specific regions of interest will be easily extracted for loading into genome annotation editors. Comparisons with (i) other aphid EST or cDNA sequences (e.g. Hunter et al., 2003), (ii) the D.melanogaster and A.gambiae genomes or (iii) other insect genomes under annotation (e.g. the honey bee) will support the decisions of expert curators. AphidBase will also be extended with new functional annotation modules, mainly for transcript and protein profiling. AphidBase will thus provide the necessary infrastructure for curation, archiving and functional annotation of an aphid genome, one of the first sequenced hemipteran genomes.

ACKNOWLEDGEMENTS

S. Cain (GMOD), O. Chenedé (INRA Rennes), J.P. Gaultier (INRA Jouy en Josas), C. Pommier (INRA, URGI), J.C. Simon, C. Soster (INRA Rennes) and L. Stein (CSHL, USA) are acknowledged for their technical support, advice and discussions. Financial support was received from Rennes Metropole and ANR Exdisum.

Conflict of Interest: none declared.

REFERENCES

Drysdale,R. et al. (2005) FlyBase: genes and gene models. Nucleic Acids Res., 33, D390–D395.
Grimaldi,D. and Engels,M.S. (2005) Evolution of the Insects. Cambridge University Press, New York.
Hunter,W.B. et al. (2003) Aphid biology: expressed genes from the alate Toxoptera citricida, the brown citrus aphid. J. Ins. Sci., 3, 23.
Sabater-Muñoz,B. et al. (2006) Large-scale gene discovery in the pea aphid Acyrthosiphon pisum (Hemiptera). Genome Biol., 7, R21.
Schiex,T. et al. (2003) FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res., 31, 3738–3741.
Stein,L.D. et al. (2002) The generic genome browser: a building block for a model organism system database. Genome Res., 12, 1599–1610.
