Vol. 17 no. 7 2001 Pages 642–645
BIOINFORMATICS
Semi-automated update and cleanup of structural RNA alignment databases
J. Gorodkin 1,∗,†, C. Zwieb 2 and B. Knudsen 1,∗
1 Department of Ecology and Genetics, The Institute of Biological Sciences, University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus C, Denmark and 2 The University of Texas Health Science Center, 11937 US Highway 271, PO Box 2003, Tyler, TX 75708, USA
Received on November 10, 2000; revised on February 6, 2001; accepted on March 6, 2001
ABSTRACT
Summary: We have developed a series of programs which assist in maintenance of structural RNA databases. A main program BLASTs the RNA database against GenBank and automatically extends and realigns the sequences to include the entire range of the RNA query sequences. After manual update of the database, other programs can examine base pair consistency and phylogenetic support. The output can be applied iteratively to refine the structural alignment of the RNA database. Using these tools, the number of potential misannotations per sequence was reduced from 20 to 3 in the Signal Recognition Particle RNA database.
Availability: A quick-server and programs are available at http://www.bioinf.au.dk/rnadbtool/
Contact: [email protected]
∗ These authors contributed equally to this work.
† To whom correspondence should be addressed.
© Oxford University Press 2001

MOTIVATION
Maintenance of RNA databases, which contain structural alignments and secondary structure assignments, is an elaborate task. Considerable time is required to identify and incorporate new sequences, and to decide where and how to revise and improve the existing alignment. Hence, it is desirable to automate parts of the procedure to increase accuracy and reduce the time spent on manual intervention.

Currently, the preferred approach for finding new sequences to include in an existing structural alignment is to use Stochastic Context-Free Grammars (SCFGs), e.g. as in Lowe and Eddy (1997, 1999). Although these methods require a model for each family of RNA sequences, they can be extremely useful for finding related RNAs (Omer et al., 2000), even if not used in conjunction with BLAST. Models could, for example, be built by recent software such as RNACAD (Brown, 2000), an approach that combines hidden Markov models and SCFGs. That approach was used to perform an automated update of a 16S rRNA structural alignment. An alternative might be to use pattern search tools such as patscan (Overbeek and Price, 1998) (for simpler motifs), and then apply an SCFG to refine the selection. However, these search tools are considerably slower than BLAST searches (Altschul et al., 1997). Therefore it might be desirable to let BLAST provide a subset of sequences for further investigation, which might involve the more sophisticated methods mentioned above. Alternatively, one can apply BLAST to provide a list of near-identical hits, as was done for the human genome (International Human Genome Sequencing Consortium, 2001). Here we construct such a wrapper, which in principle could be used for any type of database-assisted update.

We also make available several tools to determine structural consistency in (manually updated) RNA secondary structure databases. Using the consensus structural annotation for the entire alignment, the programs can detect inconsistencies such as pairings other than A–U, C–G, or G–U (inside a pairing mask), and whether some stems should be extended. The latter in particular can result in minor rearrangements of the alignment. The checks are very simple: are the alignment and its overall structural annotation in agreement? Once such disagreements have been pointed out by the programs, corrections can be made right away. Interestingly, sequence editors such as DCSE, SeqPup, and BioEdit (De Rijk and De Wachter, 1993; Gilbert, 1999; Hall, 2001), which can assist in numerous ways in building the structural alignment, are apparently not able to perform the same range of consistency checks, though DCSE seems to be able to detect potential helix extension. The ability to check assignment consistency is especially useful once a core version of the alignment has been constructed.

Other approaches can be used to make the first version of the alignment, for example those using information measures, energy folding, and maximum weighted matching (Zuker and Stiegler, 1981; Gutell et al., 1992; Gorodkin et al., 1997, 1999; Tabaska et al., 1998; Page, 2000). However, some of the sequence editors are able to detect patterns of covariance through mutual information calculations.

The programs presented here work on unix/linux platforms. The programs, manual pages, and a worldwide web server with basic consistency checks are available at http://www.bioinf.au.dk/rnadbtool/.
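The covariance detection through mutual information mentioned in this section can be illustrated with a short example. The following is a minimal Python sketch of the standard mutual information between two alignment columns; it is not taken from any of the cited editors or from the programs presented here. The function name `mutual_information` is hypothetical, and skipping sequences with a gap in either column is an assumption of the sketch.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equally long alignment columns.

    Each argument is a string holding one character per sequence.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    # Assumption of this sketch: sequences with a gap in either column are skipped.
    observed = [
        (x, y)
        for x, y in zip(col_x.upper(), col_y.upper())
        if "-" not in (x, y)
    ]
    n = len(observed)
    if n == 0:
        return 0.0
    joint = Counter(observed)
    freq_x = Counter(x for x, _ in observed)
    freq_y = Counter(y for _, y in observed)
    # sum over observed pairs of f(xy) * log2( f(xy) / (f(x) * f(y)) )
    return sum(
        (count / n) * math.log2(count * n / (freq_x[x] * freq_y[y]))
        for (x, y), count in joint.items()
    )


if __name__ == "__main__":
    # Two columns that covary perfectly, and two fully conserved columns.
    print(mutual_information("GCAU", "CGUA"))
    print(mutual_information("GGGG", "CCCC"))
```

Two columns that change together score high, while conserved columns score zero even when they pair, which is why such measures are better suited to finding candidate pairs than to checking an existing pairing mask.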
METHOD
The search tool only requires that the RNA database exists on the local platform. All other information can be retrieved over the Internet from NCBI servers. The program uses the NETBLAST program (Altschul et al., 1997) by default, but a command line option allows replacement by a local version of BLAST. GenBank entries (Benson et al., 2000) are presently retrieved solely over the Internet, thus freeing the user from the burden of maintaining the database locally.

The first step in a database update is to search for new sequences using the wrapper getseqs. Besides the BLAST filtering, the hits can automatically be filtered for already existing GenBank entries. The remaining sequences can be globally aligned to appropriate extended regions of query sequences using align0 (Myers and Miller, 1988; Pearson, 1990). In addition we incorporate a novel SCFG approach (qrna) which can scan for common structure among the sequences (Rivas and Eddy, 2000). User intervention is then required to make an additional selection, and a first attempt to manually align the new sequences to the database can be made. Running getseqs requires that the programs mentioned above are already installed.

In the following step, the putative alignment is submitted to a number of programs which readily report inconsistencies and suggest improvements. These programs check for consistent base pair assignment (stdpair) and potential stem extensions (extendstem), and report whether a sufficient number of sequences support a pairing mask (support). A number of other programs are also available.

The alignment is accompanied by a pairing mask that annotates which positions in the structural alignment form pairs. Positions in the alignment are annotated as base pairs if a sufficient number of sequences base pair at those positions. The program stdpair detects whether non-standard base pairs, that is, pairs other than A–U, C–G, or G–U, are assigned in such a pairing mask.
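The kind of check that stdpair performs can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not read the column format. It assumes the pairing mask has already been converted into a list of paired column indices, and the function name `nonstandard_pairs`, the toy alignment, and the decision to skip gapped positions are all hypothetical.

```python
# Hypothetical sketch of a stdpair-style check; not the authors' code.
# Assumption: `pairs` lists 0-based column indices (i, j) that the pairing
# mask annotates as base paired, and all sequences have the same length.

STANDARD_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def nonstandard_pairs(alignment, pairs):
    """Return (name, i, j, bases) for each masked pair that is not A-U, C-G or G-U."""
    findings = []
    for name, seq in alignment.items():
        for i, j in pairs:
            bases = (seq[i] + seq[j]).upper()
            if "-" in bases:
                # Assumption: gapped positions are not judged in this sketch.
                continue
            if bases not in STANDARD_PAIRS:
                findings.append((name, i, j, bases))
    return findings


if __name__ == "__main__":
    toy_alignment = {
        "seq1": "GGCAAAGCC",
        "seq2": "GGGAAAGCC",  # columns 2 and 6 hold G and G
        "seq3": "GUC-AAGGC",
    }
    toy_pairs = [(0, 8), (1, 7), (2, 6)]
    for finding in nonstandard_pairs(toy_alignment, toy_pairs):
        print(finding)
```

In the toy alignment only seq2 is reported, for the G and G at columns 2 and 6. Keeping the mask as an explicit list of index pairs makes the check independent of how the mask is written in the database.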
The program extendstem indicates possible stem extensions for each sequence in the alignment. If a sufficient number of sequences throughout the alignment can be extended at the same positions, that is, if there is support for such a change, the pairing mask can be extended. Typically, observations from extendstem result in minor rearrangements of the alignment, such as switching a base with a number of gaps so that it becomes adjacent to the beginning of one stem instead of the end of another. Some bases can participate in the extension of either of two adjacent stems. Such cases are also detected by extendstem, but where rearrangements of the alignment cannot resolve the assignment they are left unchanged. Bases for which an extension is not supported are likewise left unchanged. Thus the number of such inconsistencies depends strongly on the assigned pairing mask.

The program support detects whether there is support for a particular pairing mask. The rule that was applied for updating the SRP RNA and tmRNA databases was to assign pairing masks one pair further than was supported in the previous alignment. Such support was assigned if more than 2/3 of the other sequences base pair at those alignment positions.

The approach can be summarized as follows:
(1) Run getseqs (this includes NETBLAST).
(2) Align selected sequences onto the existing database.
(3) Run programs like stdpair, extendstem, support. Go to 2 if the alignment is not complete.

Most of the programs work in a new data format, the column format, which lists the data in columns. Each entry (sequence) contains a header that specifies the number and types of columns. The header has a fixed syntax which is used to specify the function of each column. This data structure significantly simplifies data processing via the command line. For details, see http://www.bioinf.au.dk/colformat/.
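The two-thirds support rule described above can be sketched in a few lines of Python. This is a hedged illustration, not the support or extendstem programs. The names `is_supported` and `extend_once` are hypothetical, and the sketch simplifies the rule in three ways that are assumptions: it counts all sequences in the alignment, it treats gaps as non-pairing, and it only tries the next pair outward from a given pair.

```python
# Hypothetical sketch of the two-thirds support rule; not the authors' code.
from fractions import Fraction

STANDARD_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def can_pair(a, b):
    """True if the two characters form an A-U, C-G or G-U pair (gaps never pair)."""
    return (a + b).upper() in STANDARD_PAIRS


def is_supported(sequences, i, j, threshold=Fraction(2, 3)):
    """True if more than `threshold` of the sequences can pair columns i and j."""
    if not sequences:
        return False
    pairing = sum(1 for seq in sequences if can_pair(seq[i], seq[j]))
    # Exact fractions avoid floating point trouble right at the 2/3 boundary.
    return Fraction(pairing, len(sequences)) > threshold


def extend_once(sequences, i, j):
    """Return the next outward pair (i-1, j+1) if it is supported, else None."""
    ni, nj = i - 1, j + 1
    if ni < 0 or nj >= len(sequences[0]):
        return None
    return (ni, nj) if is_supported(sequences, ni, nj) else None


if __name__ == "__main__":
    toy = ["AGGCAAAGCCU", "UGGCAAAGCCA", "CGGCAAAGCCG", "GGGCAAAGCCG"]
    # The masked stem is assumed to end at columns (1, 9); try one pair further.
    print(extend_once(toy, 1, 9))
```

In the toy data three of four sequences can pair columns 0 and 10, which exceeds two thirds, so the extension is accepted. With exactly two of three sequences pairing it would be rejected, because the rule requires more than 2/3.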
RESULTS
The getseqs program described above was used in the update of two structural RNA databases (Knudsen et al., 2001; Gorodkin et al., 2001). The two databases were updated with sequences from multiple sources, including GenBank, where hundreds of hits were initially found. Along with a later BLAST search on a subset of representative sequences, candidates were screened by eye to look for conserved features and similarity to close relatives. The screenings were eased by the automatically generated output from the align0 program, which realigned extended regions of each BLAST hit to the database query sequence. Eventually, attempts were made to align the sequences. At each step, sequences could be rejected or included according to certain criteria. In the selection process other tools were used: mfold (Mathews et al., 1999; Zuker et al., 1999), patscan (Overbeek and Price, 1998), and for SRP, a computational tool that can identify helix 8 (Samuelsson, personal communication). The final number of GenBank sequences that entered the databases was 10 for the SRP database and 11 for the tmRNA database. For the tmRNAs in particular, many sequences came from other sources, and that database was updated with 26 sequences in total. SRP sequences came solely from GenBank. The qrna program (Rivas and Eddy, 2000) was not used because it became available after the updates were made. Making use of the consistency checks, the final alignments were obtained by iterating between 5 and 10 times on successively updated versions.

An example of a consistency check performed during the update of the tmRNA database (Knudsen et al., 2001) is shown in Figure 1. The consistency check marked a number of nucleotides involved in irregular base pairing relative to the database pairing mask (overall stem assignment). The previous version of the tmRNA database, which contained 26 fewer sequences, had 653 'inconsistencies', or potential misannotations, highlighted by the programs. The first alignment attempt on including the 26 sequences had 1203 such inconsistencies. However, when iterating using the tools this number was reduced to 858. The remaining inconsistencies are mainly due to individual sequences that can have some of their stems extended, but where the overall support for extending the pairing mask is not present in the alignment. In addition, a few bases that can merge into either of two adjacent stems were also marked. The relative increase of 'inconsistencies' from the old to the new version depends especially on the criteria for extending stems. If such criteria were applied to a representative set, this number could probably be reduced even more. In fact the average number of inconsistencies per sequence is 9.8 and 10.5 for the new and old versions respectively. The total number of assigned pairs other than A–U, C–G or G–U was reduced from 81 to 27.

[Figure 1: alignment excerpts not reproduced. Each panel lists the rows pairing_mask, Bor.bur., Thi.fer.pa, Pse.aer., Kle.pne., Sal.par., Sal.typ. and support; the top panel shows the initial alignment and the bottom panel the updated alignment.]

Fig. 1. A simplified example of the update of the tmRNA database (Knudsen et al., 2001). Upper case letters form pairs as specified in the pairing mask (each letter indicates a different helix). For the initial alignment (top) two regions are shown: alignment positions 431–449 and 471–539. Individual bases are highlighted when involved in an inconsistency or a potential stem extension. For example, the 'g' in the black box in the sequence 'Bor.bur.' can be part of two extensions (the 's' or the 't' stem), as it can either pair with the u at position 436 (assigned to the 'r' stem) or, together with the adjacent 'a', pair with the two u's at positions 530 and 531. A potential extension of the 'u' stem is indicated in Thi.fer.pa. The two highlighted G's in sequence Sal.par. are a non-standard base pair. The last line, 'support', shows positions where base pairs are supported by two thirds of the sequences ('S'). Some columns have inconsistent pairings outside the part of the alignment that has been shown ('N'). Upon updating (bottom) the alignment length and base pair assignment have both changed, and thereby a number of lower case letters have been changed to upper case letters (and vice versa), which resolves most of the inconsistencies. The potential stem extension of the c–g pair in sequence Thi.fer.pa is not supported by the remaining sequences in the alignment, as stem extensions have only been performed on neighboring supported pairs.

The usefulness of consistency checks is even more evident for the SRP RNA database (Gorodkin et al., 2001). The old version had 2336 nucleotides highlighted by the consistency check. This number was reduced to 407 in the updated version, which includes 11 new sequences. Hence the average number of such inconsistencies was reduced from 20 to 3 per sequence using the tools. Such a reduction is possible in particular because the database is divided into subclasses (eukaryotes, archaea, and bacteria), each with its own pairing mask. The number of assigned pairs other than A–U, C–G, or G–U was reduced from 322 to 25, indicating that most of these were originally erroneous assignments. Such a reduction clearly shows the usefulness of the tools presented here.
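The size of the reductions reported in this section can be re-derived with simple arithmetic. The Python sketch below is only an illustration: the counts are copied from the text, while the function name `relative_reduction` and the dictionary labels are hypothetical.

```python
# Illustrative arithmetic on the counts reported in the text.


def relative_reduction(before, after):
    """Fraction by which a count dropped from `before` to `after`."""
    return (before - after) / before


# (before, after) counts quoted in the Results section.
REPORTED_COUNTS = {
    "SRP highlighted nucleotides": (2336, 407),
    "SRP non-standard pairs": (322, 25),
    "tmRNA non-standard pairs": (81, 27),
}

if __name__ == "__main__":
    for label, (before, after) in REPORTED_COUNTS.items():
        change = relative_reduction(before, after)
        print(f"{label}: {before} -> {after} ({change:.1%} reduction)")
```

The tmRNA totals are deliberately left out, because the number of sequences changed between versions and the text compares those on a per-sequence basis instead.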
FINAL REMARKS
We have presented a collection of computational tools that aid the manual update of structural RNA databases, assist in the selection of BLAST hits, and point out misannotations in the databases. The tools were successfully applied to two recently published databases (Knudsen et al., 2001; Gorodkin et al., 2001). In both cases the number of inconsistencies in base pair assignments was reduced significantly. That number might be reduced even more for the tmRNA database, provided it is split into appropriate subsets, each with its own pairing mask. In fact, an analysis using the qrna program reveals a subclass of sequences that fits poorly into the overall alignment. Future work includes optimizing the use of SCFGs as an integrated part of the system, as well as searching in databases other than GenBank.
ACKNOWLEDGEMENTS
This work was supported by the Danish Technical Research Council (J.G.). J.G. thanks G. D. Stormo for support at Washington University Medical School in St Louis. We would like to thank E. Rivas and S. Eddy for the introduction to the qrna program, and anonymous reviewers for useful suggestions.

REFERENCES
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 15–18.
Brown,M. (2000) Small subunit ribosomal RNA modeling using stochastic context-free grammars. In Altman,R., Bailey,T.L., Bourne,P., Gribskov,M., Lengauer,T., Shindyalov,I.N., Eyck,L.F.T. and Weissig,H. (eds), Proceedings of the Eighth International Conference on Intelligent Systems in Molecular Biology. Menlo Park, CA: AAAI/MIT Press, Boston, MA, pp. 57–66.
De Rijk,P. and De Wachter,R. (1993) DCSE, an interactive tool for sequence alignment and secondary structure research. CABIOS, 9, 735–740. (http://www.rrna.uia.ac.be/dcse/).
Gilbert,D.G. (1999) Seqpup: a biosequence editor. (http://www.hgmp.mrc.ac.uk/Registered/Help/seqpup/).
Gorodkin,J., Heyer,L.J., Brunak,S. and Stormo,G.D. (1997) Displaying the information contents of structural RNA alignments: the structure logos. CABIOS, 13, 583–586. (http://www.cbs.dtu.dk/gorodkin/appl/slogo.html).
Gorodkin,J., Stærfeldt,H.H., Lund,O. and Brunak,S. (1999) Matrixplot: visualizing sequence constraints. Bioinformatics, 15, 769–770. (http://www.cbs.dtu.dk/services/MatrixPlot/).
Gorodkin,J., Knudsen,B., Zwieb,C. and Samuelsson,T. (2001) SRPDB (Signal Recognition Particle Database). Nucleic Acids Res., 29, 169–170. (http://www.psyche.uthct.edu/dbs/SRPDB/SRPDB.html).
Gutell,R.R., Power,A., Hertz,G.Z., Putz,E. and Stormo,G.D. (1992) Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res., 20, 5785–5795.
Hall,T. (2001) BioEdit. (http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html).
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Knudsen,B., Wower,J., Zwieb,C. and Gorodkin,J. (2001) tmRDB (tmRNA database). Nucleic Acids Res., 29, 171–172. (http://www.psyche.uthct.edu/dbs/tmRDB/tmRDB.html).
Lowe,T. and Eddy,S. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964. (http://www.genome.wustl.edu/eddy/tRNAscan-SE/).
Lowe,T. and Eddy,S. (1999) A computational screen for methylation guide snoRNAs in yeast. Science, 283, 1168–1171.
Mathews,D., Sabina,J., Zuker,M. and Turner,D. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.
Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. CABIOS, 4, 11–17.
Omer,A.D., Lowe,T.M., Russell,A.G., Ebhardt,H., Eddy,S.R. and Dennis,P.P. (2000) Homologues of small nucleolar RNAs in archaea. Science, 288, 517–522.
Overbeek,R. and Price,M. (1998) Accessing integrated genomic data using genobase: a tutorial (part 1). Report ANL/MCS-TM173, Argonne National Laboratory. (http://www-unix.mcs.anl.gov/compbio/PatScan/).
Page,R.D.M. (2000) Comparative analysis of secondary structure of insect mitochondrial small subunit ribosomal RNA using maximum weighted matching. Nucleic Acids Res., 28, 3839–3845.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol., 183, 63–98.
Rivas,E. and Eddy,S. (2000) Private communication.
Tabaska,J., Cary,R.B., Gabow,H.N. and Stormo,G.D. (1998) An automated RNA modeling approach capable of identifying pseudoknots and base-triples. Bioinformatics, 14, 691–699.
Zuker,M., Mathews,D.H. and Turner,D.H. (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In Clark,J.B.B. (ed.), RNA Biochemistry and Biotechnology. NATO ASI Series. Kluwer, Dordrecht, pp. 11–43. (http://www.bioinfo.math.rpi.edu/~mfold/rna/form1.cgi).
Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information. Nucleic Acids Res., 9, 133–148.