BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 13 2003, pages 1707–1709 DOI: 10.1093/bioinformatics/btg235
Intergenic sequence inspector: searching and identifying bacterial RNAs Christophe Pichon and Brice Felden∗ Laboratoire de Biochimie Pharmaceutique, Faculté de Pharmacie, Université de Rennes I, UPRES Jeune Equipe 2311, 2 avenue du Pr. Léon Bernard, 35043 Rennes, France Received on December 18, 2002; revised on March 13, 2003; accepted on March 31, 2003
INTRODUCTION In the past years, it has been shown that all living organisms contain a wealth of ribonucleic acids (RNAs), tens to hundreds of nucleotides long, that are not transfer, ribosomal or messenger RNAs. The complete set has not been identified, but estimates range from hundreds (Prokaryotes) to thousands (Eukaryotes) per genome. These RNAs have been unknown until recently, because experimental procedures are laborintensive (Hüttenhofer et al., 2001) and difficult to detect by computational approach (Eddy, 2002). In bacteria, these RNAs are gene regulators (RNAregs) in response to environmental changes. The recent explosion in the available bacterial genome sequences have identified and confirmed the expression of novel RNAregs, using high sequence conservation among closely related bacterial species (Wassarman, 2002 for review). These RNA genes are located within the intergenic regions (IGRs) of the bacterial genomes. Our appreciation of the importance of RNAregs is limited by the lack of data on the extent of the ‘RNome’, the ∗ To
whom correspondence should be addressed.
Bioinformatics 19(13) © Oxford University Press 2003; all rights reserved.
RNA equivalent of the proteome. Novel RNAregs are identified by comparative genomics, but a systematic RNA gene finding algorithm would be of use. Specialized algorithms recognize specific RNA structures, as transfer RNA genes (e.g. Fichant and Burks, 1991), group I introns (Lisacek et al., 1994) or small nucleolar RNAs (Lowe et al., 1997), requiring knowledge about each class of RNA. A bioinformatic approach searching for DNA sequences containing σ 70 promoters within a short distance of ρ-independent terminators was proposed and some were experimentally verified (Chen et al., 2002). Between related genomes, extended sequence conservations are observed in some intergenic sequences (Wassarman et al., 2001) and compensatory mutations are additional signatures for the presence of RNAregs. An algorithm uses RNA structure and coding constraints to examine mutations in a pairwise sequence alignment (Rivas and Eddy, 2001). In bacteria or archeas with AT-rich genomes, local nucleotide composition biases within an RNA gene towards a higher G+C content can be used for searching and identifying RNAregs (Klein et al., 2002). Therefore, current algorithms are either genome- or RNA-specific without integrating all the sequence and structural characteristics of RNA genes. Here we describe Intergenic Sequence Inspector (ISI), a software package that explores any bacterial genome to detect RNA genes expressed in IGRs. ISI extracts IGRs and selects those that possess sequence conservations between phylogenetically related species. The selected genomic sequences are visualized together with sequence and structural features specific of RNA genes.
PROGRAM OVERVIEW ISI is a Perl package combining sequence alignment with various features of the selected genomes (Fig. 1A). Two associated softwares are required, ‘bioperl 0.9.3’ (http://www.bioperl.org) and ‘Standalone NCBI BLAST’ (ftp://ftp.ncbi.nlm.nih.gov/BLAST). Bacterial sequences are extracted from the EMBL database (ftp://ftp.ebi.ac.uk/pub/ genome). ‘IGR extractor’ analyzes the sequence of an annotated genome (the script accepts any genomes: linear, circular,
1707
Downloaded from bioinformatics.oxfordjournals.org by guest on September 20, 2011
ABSTRACT Summary: Intergenic Sequence Inspector (ISI) is a program that helps in identifying bacterial ribonucleic acids (RNAs). It automatically extracts, selects and visualizes candidate intergenic regions (IGRs) that bear conservations between phylogenetically related species, displaying sequence and structural signatures of RNA genes encompassing putative promoters, terminators, RNA secondary structure predictions and the G+C percent of the selected sequences. ISI is intentionally modular to analyze various bacterial genomes according to their intrinsic specificities and to the available data. ISI provides a sum of information to the user who can evaluate whether or not an IGR is susceptible to express RNAs. Availability: The program can be downloaded at http://www.biochpharma.univ-rennes1.fr/ under the subsection ‘Software’. Contact:
[email protected]
C.Pichon and B.Felden
Downloaded from bioinformatics.oxfordjournals.org by guest on September 20, 2011
Fig. 1. (A) UML activity diagram of ISI. ISI is made of independent program units and the analysis can be performed according to the user choices. ‘Genview’ exploits various data inputs, including putative bacterial promoters, ρ-independent terminators, RNA secondary structure, G+C percent, RNA motifs, repeated sequences or other features. (B) Sample output of an IGR sequence in E.coli identifying a RNA expressed in vivo (Wassarman et al., 2001), with its nucleotide numbering, a sequence alignment with closely related bacteria, the G+C analysis, the RNA secondary structure prediction, σ 70 promoter and ρ-independent terminator. (C) Sample output of ISI showing a novel RNA gene candidate in E.coli predicted by ISI. The 5 and 3 ends of the RNA transcript have to be mapped experimentally.
single- or double-stranded and can take into account the presence of introns, if annotated) and retrieves all the IGRs, creating a batch file. IGRs containing known RNAs can be removed, and nucleotide sequences from the 5 and 3 ends of the IGRs can also be deleted to remove putative sequence conservations arising from promoters or terminators from flanking protein- or known RNA-encoded genes. ‘IGR filtering’ selects the IGRs according to given criteria (e.g.
1708
the size of the IGRs, the G+C percent). Then, a Blast is performed between the filtered IGRs against the available sequenced bacterial genomes. A file containing the results is created and goes to ‘Blast analyzer’ that sorts and organizes the IGR sequences according to the ‘expect value’, the sizes of matching sequences, putting aside the non-significant results. The individual alignments of the bacterial IGRs are converted into multiple alignments with a home made
ISI: searching and identifying bacterial RNAs
algorithm that stacks the sequences according to their nucleotide conservations, and the sequences are visualized in ‘Genview’. When available, additional information that helps detecting RNA gene sequences can be displayed, but these data are not strictly required. For example, for the Escherichia coli genome, the location of predicted promoters (Yada et al., 1999), terminators (Lesnik et al., 2001), RNA secondary structure prediction (the Vienna package is used, http://www.tbi.univie.ac.at/∼ivo/RNA/) and the G+C percent can be imported and displayed in ‘Genview’ (Fig. 1). The script parameters are adjustable by the user according to the characteristics of the bacterial genome under study. ISI has detected all the RNAregs in E.coli recently described (an example is provided in Fig. 1B), as well as new ones (Fig. 1C) that are currently being characterized. ISI can be used for any RNA genes that are phylogenetically conserved. RNAregs may overlap with annotated ORFs; to address this issue, an option to extend searches over the known ORF was included.
Calculation time depends on the number of IGRs to be extracted. For the E.coli genome, analysis of the ∼3500 IGRs takes less than 10 h on a 700 MHz Intel PIII. The modular assembly of ISI allows to perform several analyses on one set of IGRs.
ACKNOWLEDGEMENTS We thank the Conseil Régional de Bretagne (Accueil et émergence de nouvelles équipes de recherche, N◦ 006; C.P. has support for his Ph.D.) and the French Ministry of Research (ACI blanche Jeune Chercheur), to B.F.
Chen,S., Lesnik,E.A., Hall,T.A., Sampath,R., Griffey,R.H., Ecker,D.J. and Blyn,L.B. (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems, 65, 157–177. Eddy,S.R. (2002) Computational genomics of noncoding RNA genes. Cell, 109, 137–140. Fichant,G.A. and Burks,C. (1991) Identifying potential tRNA genes in genomic DNA sequences. J. Mol. Biol., 220, 659–671. Hüttenhofer,A., Kiefmann,M., Meier-Ewert,S., O’Brien,J., Lehrang,H., Bachellerie,J.P. and Brosius,J. (2001) RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. The EMBO J., 20, 2943–2953. Klein,R.J., Misulovin,Z. and Eddy,S.R. (2002) Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc. Natl Acad. Sci. USA, 99, 7542–7547. Lesnik,E., Sampath,R., Levene,H., Henderson,T., McNeil,J. and Ecker,D. (2001) Prediction of rho-independent transcriptional terminators in Escherichia coli. Nucl. Acids Res., 29, 3583–3594. Lisacek,F., Diaz,Y. and Michel,F. (1994) Automatic identification of group I intron cores in genomic DNA sequences. J. Mol. Biol., 235, 1206–1217. Lowe,T.M. and Eddy,S.R. (1999) A computational screen for methylation guide snoRNAs in yeast. Science, 283, 1168–1171. Rivas,E. and Eddy,S.R. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2(8), 1–19. Wassarman,K., Repoila,F., Rosenow,C., Storz,G. and Gottesman,S. (2001) Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev., 15, 1637–1651. Wassarman,K. (2002) Small RNAs in bacteria: diverse regulators of gene expression in response to environmental changes. Cell, 109, 141–144. Yada,T., Nakao,M., Totoki,Y and Nakai,K. (1999) Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15, 987–993.
1709
Downloaded from bioinformatics.oxfordjournals.org by guest on September 20, 2011
CALCULATIONS
REFERENCES