BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 1 2004, pages 131–132 DOI: 10.1093/bioinformatics/btg387
The Genomic Threading Database
Liam J. McGuffin∗, Stefano Street, Søren-Aksel Sørensen and David T. Jones
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
Received on May 22, 2003; accepted on July 22, 2003
INTRODUCTION
To benefit from the wealth of information contained in recently sequenced genomes, it is essential to compile comprehensive resources of reliable structural annotations of the encoded proteins. The Genomic Threading Database (GTD) is a resource of accurately annotated proteomes of key organisms, produced using the GenTHREADER method for genomic-scale fold recognition (Jones, 1999a; McGuffin and Jones, 2003). A simple web interface allows users to query the database to find genes within selected organisms that encode proteins with particular three-dimensional folds.
∗ To whom correspondence should be addressed.
Bioinformatics 20(1) © Oxford University Press 2004; all rights reserved.
ANNOTATION SYSTEM
GenTHREADER is a fast and powerful protein fold recognition method that can be applied to whole proteomes, as in the GTD implementation, as well as to individual protein sequence submissions, as in the PSIPRED server implementation (McGuffin et al., 2000). The original GenTHREADER protocol devised by Jones (1999a) has recently been updated so that it now makes use of structural alignment profiles in the construction of the fold library. PSI-BLAST profiles (Altschul et al., 1997), bidirectional scoring and secondary structures predicted by PSIPRED (Jones, 1999b) have also been incorporated into the modified protocol (McGuffin and Jones, 2003). These changes increase the sensitivity of the method and the accuracy of its alignments, but they also make it considerably slower than the standard GenTHREADER method. We are therefore prototyping grid technology to speed up the computationally intensive process of reliably annotating whole proteomes. Currently the annotation system uses a fast version of GenTHREADER, which is distributed across processing clusters and their associated servers. The system uses Globus Toolkit 2 to connect the hosts in the member clusters, although a Globus Toolkit 3 service is also under development. Scheduling is carried out using the Sun Grid Engine or Condor. Clusters of processors from University College, London and Imperial College, London are currently incorporated into the service. The processing clusters consist of a mixture of Intel Pentium 3, Pentium 4 and AMD Athlon based machines running the Linux operating system. The associated servers are SunFire 880s running the Solaris operating system. A demo of the distributed GenTHREADER system may be viewed at http://bioinf.cs.ucl.ac.uk/GTD/demo.avi
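To make the idea of distributing a proteome across cluster nodes concrete, the following is a minimal sketch of how sequences might be split into batches before being handed to a scheduler. It is not the GTD implementation: the function names `read_fasta` and `balance_batches` are hypothetical, and balancing batches by total residue count is an assumption made purely for illustration.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    ident, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            ident, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


def balance_batches(records: List[Tuple[str, str]], n_batches: int) -> List[List[str]]:
    """Greedy partition: give the longest remaining sequence to the lightest batch.

    Assumption (not from the paper): job cost grows with sequence length,
    so batches are balanced by total residue count.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    heap = [(0, i) for i in range(n_batches)]  # (current load, batch index)
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    for ident, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        load, idx = heapq.heappop(heap)
        batches[idx].append(ident)
        heapq.heappush(heap, (load + len(seq), idx))
    return batches
```

Each resulting batch could then be submitted as one job to whichever scheduler a site runs; the submission step itself is deliberately left out because it depends on the local grid configuration.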
DATABASE IMPLEMENTATION
The proteome annotations for each organism are loaded into tables within a MySQL relational database (http://www.mysql.com). Currently, the GTD contains searchable proteome annotations for Homo sapiens, Mus musculus, Anopheles gambiae, Drosophila melanogaster, Oryza sativa, Caenorhabditis elegans, Fugu rubripes, Saccharomyces cerevisiae and Schizosaccharomyces pombe, as well as 101 prokaryotes. Full statistics on the coverage and confidence of annotations may be found at http://bioinf.cs.ucl.ac.uk/summaryView.html; as an example, around 75% of proteins from Homo sapiens have been assigned to known structures with a P-value < 0.01. A simple HTML form is used to submit queries to the GTD. Users select the genomes they wish to search and then provide keywords or search phrases from the gene descriptions/identifiers, Protein Data Bank codes/descriptions (Berman et al., 2000), or Structural Classification of Proteins (SCOP) codes/descriptions (Murzin et al., 1995).
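The kind of keyword query described above can be sketched against a small relational table. The example below is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table name `annotation`, its columns, the helper names and the demo rows (including the placeholder PDB and SCOP codes) are all invented here rather than taken from the GTD schema.

```python
import sqlite3
from typing import List, Sequence, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               genome TEXT, gene_id TEXT, gene_description TEXT,
               pdb_code TEXT, scop_code TEXT, p_value REAL)"""
    )
    # Made-up rows for illustration; none of these are real GTD entries.
    rows = [
        ("Homo sapiens", "GENE0001", "example kinase", "0aaa", "x.1.1.1", 1e-5),
        ("Homo sapiens", "GENE0002", "example transporter", "0bbb", "x.2.1.1", 0.02),
        ("Mus musculus", "GENE0003", "example kinase-like", "0aaa", "x.1.1.1", 0.003),
        ("Oryza sativa", "GENE0004", "example kinase", "0ccc", "x.1.1.1", 1e-4),
    ]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, genomes: Sequence[str], keyword: str) -> List[Tuple]:
    """Return hits in the selected genomes whose description or codes match keyword."""
    if not genomes:
        return []
    placeholders = ",".join("?" for _ in genomes)
    like = f"%{keyword}%"
    sql = f"""SELECT genome, gene_id, pdb_code, scop_code, p_value
              FROM annotation
              WHERE genome IN ({placeholders})
                AND (gene_description LIKE ? OR gene_id LIKE ?
                     OR pdb_code LIKE ? OR scop_code LIKE ?)
              ORDER BY p_value ASC"""
    # Values are bound as parameters; only the '?' placeholders are interpolated.
    return conn.execute(sql, [*genomes, like, like, like, like]).fetchall()
```

Calling `search(build_demo_db(), ["Homo sapiens", "Mus musculus"], "kinase")` returns the matching rows ordered by ascending P-value. Binding the user's keyword as a query parameter, rather than pasting it into the SQL string, is the design choice worth carrying over to any real web form.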
Downloaded from bioinformatics.oxfordjournals.org at Hospital Miguel Servet on May 23, 2011
ABSTRACT
Summary: The Genomic Threading Database currently contains structural annotations for the genomes of over 100 recently sequenced organisms. Annotations are carried out by using our modified GenTHREADER software and through implementing grid technology.
Availability: http://bioinf.cs.ucl.ac.uk/GTD
Contact: [email protected]
Supplementary information: http://bioinf.cs.ucl.ac.uk/GTD/demo.avi, http://bioinf.cs.ucl.ac.uk/GTD/summaryView.html, http://bioinf.cs.ucl.ac.uk/GTD/figure1.pdf, http://www.e-protein.org
THE GTD AND e-PROTEIN
The GTD is part of the e-Protein project (e-protein.org), a multi-site initiative that aims to develop a distributed pipeline for proteome annotation by utilizing grid technology. The e-Protein project includes researchers from University College, London; Imperial College, London; and the European Bioinformatics Institute in Cambridge, who are currently depositing structural and functional annotations in relational databases, such as the GTD, which are hosted at each site. It is intended that these databases will eventually be connected through a single interface using the Distributed Annotation System (Hubbard and Birney, 2000). A further long-term goal of the project is to distribute all annotation jobs, using a variety of methods, across processors at each site using grid technology.
ACKNOWLEDGEMENTS
This work was supported by the Biotechnology and Biological Sciences Research Council and the Department of Trade and Industry.
REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Hubbard,T. and Birney,E. (2000) Open annotation offers a democratic solution to genome sequencing. Nature, 403, 825.
Jones,D.T. (1999a) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol., 287, 797–815.
Jones,D.T. (1999b) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Laskowski,R.A. (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res., 29, 221–222.
McGuffin,L.J., Bryson,K. and Jones,D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404–405.
McGuffin,L.J. and Jones,D.T. (2003) Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics, 19, 874–881.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Users also have the option to limit the number of hits displayed per chromosome per genome. Hits are initially ranked by a P-value, which relates to their significance, but they can also be re-ranked according to output score, threading potentials and alignment scores. On submission of the search parameters the user is linked to a results page of tabulated data on the top hits found (http://bioinf.cs.ucl.ac.uk/GTD/figure1.pdf). The results table is divided into sub-tables displaying the hits per chromosome per selected genome. Each chromosome sub-table consists of several columns showing the similarity scores, structure information and graphics of the matched templates. The ‘Confidence’ column gives an indication of the strength of each hit, ranging from ‘Certain’ to ‘Guess’. Each of these cells is also colour coded for ease of use. Users may display the full alignment of the target protein to the fold template for a particular hit by clicking on the relevant cell in the table. Links are also provided to allow the user to view the full SCOP and PDBsum (Laskowski, 2001) information for the aligned fold template.
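The grouping, ranking and per-chromosome limit described above can be sketched in a few lines. This is an illustrative sketch rather than the GTD code: the `Hit` record, its field names and the `top_hits` helper are hypothetical, and it assumes that smaller P-values and larger scores indicate better hits.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one fold-recognition hit."""
    genome: str
    chromosome: str
    gene_id: str
    p_value: float
    score: float


def top_hits(
    hits: Iterable[Hit], limit: int, rank_by: str = "p_value"
) -> Dict[Tuple[str, str], List[Hit]]:
    """Group hits by (genome, chromosome), rank each group, keep the top `limit`."""
    if rank_by not in ("p_value", "score"):
        raise ValueError(f"unsupported ranking key: {rank_by}")
    # Assumption: low P-values are best, high scores are best.
    descending = rank_by == "score"
    groups: Dict[Tuple[str, str], List[Hit]] = defaultdict(list)
    for hit in hits:
        groups[(hit.genome, hit.chromosome)].append(hit)
    return {
        key: sorted(group, key=lambda h: getattr(h, rank_by), reverse=descending)[:limit]
        for key, group in groups.items()
    }
```

Each key of the returned dictionary corresponds to one sub-table in the results page, so re-ranking by a different column amounts to calling `top_hits` again with another `rank_by` value.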