Intergenic sequence inspector: searching and

BIOINFORMATICS APPLICATIONS NOTE

Vol. 19 no. 13 2003, pages 1707–1709 DOI: 10.1093/bioinformatics/btg235

Intergenic sequence inspector: searching and identifying bacterial RNAs Christophe Pichon and Brice Felden∗ Laboratoire de Biochimie Pharmaceutique, Faculté de Pharmacie, Université de Rennes I, UPRES Jeune Equipe 2311, 2 avenue du Pr. Léon Bernard, 35043 Rennes, France Received on December 18, 2002; revised on March 13, 2003; accepted on March 31, 2003

INTRODUCTION In the past years, it has been shown that all living organisms contain a wealth of ribonucleic acids (RNAs), tens to hundreds of nucleotides long, that are not transfer, ribosomal or messenger RNAs. The complete set has not been identified, but estimates range from hundreds (Prokaryotes) to thousands (Eukaryotes) per genome. These RNAs have been unknown until recently, because experimental procedures are laborintensive (Hüttenhofer et al., 2001) and difficult to detect by computational approach (Eddy, 2002). In bacteria, these RNAs are gene regulators (RNAregs) in response to environmental changes. The recent explosion in the available bacterial genome sequences have identified and confirmed the expression of novel RNAregs, using high sequence conservation among closely related bacterial species (Wassarman, 2002 for review). These RNA genes are located within the intergenic regions (IGRs) of the bacterial genomes. Our appreciation of the importance of RNAregs is limited by the lack of data on the extent of the ‘RNome’, the ∗ To

whom correspondence should be addressed.

Bioinformatics 19(13) © Oxford University Press 2003; all rights reserved.

RNA equivalent of the proteome. Novel RNAregs are identified by comparative genomics, but a systematic RNA gene finding algorithm would be of use. Specialized algorithms recognize specific RNA structures, as transfer RNA genes (e.g. Fichant and Burks, 1991), group I introns (Lisacek et al., 1994) or small nucleolar RNAs (Lowe et al., 1997), requiring knowledge about each class of RNA. A bioinformatic approach searching for DNA sequences containing σ 70 promoters within a short distance of ρ-independent terminators was proposed and some were experimentally verified (Chen et al., 2002). Between related genomes, extended sequence conservations are observed in some intergenic sequences (Wassarman et al., 2001) and compensatory mutations are additional signatures for the presence of RNAregs. An algorithm uses RNA structure and coding constraints to examine mutations in a pairwise sequence alignment (Rivas and Eddy, 2001). In bacteria or archeas with AT-rich genomes, local nucleotide composition biases within an RNA gene towards a higher G+C content can be used for searching and identifying RNAregs (Klein et al., 2002). Therefore, current algorithms are either genome- or RNA-specific without integrating all the sequence and structural characteristics of RNA genes. Here we describe Intergenic Sequence Inspector (ISI), a software package that explores any bacterial genome to detect RNA genes expressed in IGRs. ISI extracts IGRs and selects those that possess sequence conservations between phylogenetically related species. The selected genomic sequences are visualized together with sequence and structural features specific of RNA genes.

PROGRAM OVERVIEW ISI is a Perl package combining sequence alignment with various features of the selected genomes (Fig. 1A). Two associated softwares are required, ‘bioperl 0.9.3’ (http://www.bioperl.org) and ‘Standalone NCBI BLAST’ (ftp://ftp.ncbi.nlm.nih.gov/BLAST). Bacterial sequences are extracted from the EMBL database (ftp://ftp.ebi.ac.uk/pub/ genome). ‘IGR extractor’ analyzes the sequence of an annotated genome (the script accepts any genomes: linear, circular,

1707

Downloaded from bioinformatics.oxfordjournals.org by guest on September 20, 2011

ABSTRACT Summary: Intergenic Sequence Inspector (ISI) is a program that helps in identifying bacterial ribonucleic acids (RNAs). It automatically extracts, selects and visualizes candidate intergenic regions (IGRs) that bear conservations between phylogenetically related species, displaying sequence and structural signatures of RNA genes encompassing putative promoters, terminators, RNA secondary structure predictions and the G+C percent of the selected sequences. ISI is intentionally modular to analyze various bacterial genomes according to their intrinsic specificities and to the available data. ISI provides a sum of information to the user who can evaluate whether or not an IGR is susceptible to express RNAs. Availability: The program can be downloaded at http://www.biochpharma.univ-rennes1.fr/ under the subsection ‘Software’. Contact: [email protected]

C.Pichon and B.Felden


Fig. 1. (A) UML activity diagram of ISI. ISI is made of independent program units and the analysis can be performed according to the user choices. ‘Genview’ exploits various data inputs, including putative bacterial promoters, ρ-independent terminators, RNA secondary structure, G+C percent, RNA motifs, repeated sequences or other features. (B) Sample output of an IGR sequence in E.coli identifying a RNA expressed in vivo (Wassarman et al., 2001), with its nucleotide numbering, a sequence alignment with closely related bacteria, the G+C analysis, the RNA secondary structure prediction, σ 70 promoter and ρ-independent terminator. (C) Sample output of ISI showing a novel RNA gene candidate in E.coli predicted by ISI. The 5 and 3 ends of the RNA transcript have to be mapped experimentally.

single- or double-stranded and can take into account the presence of introns, if annotated) and retrieves all the IGRs, creating a batch file. IGRs containing known RNAs can be removed, and nucleotide sequences from the 5 and 3 ends of the IGRs can also be deleted to remove putative sequence conservations arising from promoters or terminators from flanking protein- or known RNA-encoded genes. ‘IGR filtering’ selects the IGRs according to given criteria (e.g.

1708

the size of the IGRs, the G+C percent). Then, a Blast is performed between the filtered IGRs against the available sequenced bacterial genomes. A file containing the results is created and goes to ‘Blast analyzer’ that sorts and organizes the IGR sequences according to the ‘expect value’, the sizes of matching sequences, putting aside the non-significant results. The individual alignments of the bacterial IGRs are converted into multiple alignments with a home made

ISI: searching and identifying bacterial RNAs

algorithm that stacks the sequences according to their nucleotide conservations, and the sequences are visualized in ‘Genview’. When available, additional information that helps detecting RNA gene sequences can be displayed, but these data are not strictly required. For example, for the Escherichia coli genome, the location of predicted promoters (Yada et al., 1999), terminators (Lesnik et al., 2001), RNA secondary structure prediction (the Vienna package is used, http://www.tbi.univie.ac.at/∼ivo/RNA/) and the G+C percent can be imported and displayed in ‘Genview’ (Fig. 1). The script parameters are adjustable by the user according to the characteristics of the bacterial genome under study. ISI has detected all the RNAregs in E.coli recently described (an example is provided in Fig. 1B), as well as new ones (Fig. 1C) that are currently being characterized. ISI can be used for any RNA genes that are phylogenetically conserved. RNAregs may overlap with annotated ORFs; to address this issue, an option to extend searches over the known ORF was included.

Calculation time depends on the number of IGRs to be extracted. For the E.coli genome, analysis of the ∼3500 IGRs takes less than 10 h on a 700 MHz Intel PIII. The modular assembly of ISI allows to perform several analyses on one set of IGRs.

ACKNOWLEDGEMENTS We thank the Conseil Régional de Bretagne (Accueil et émergence de nouvelles équipes de recherche, N◦ 006; C.P. has support for his Ph.D.) and the French Ministry of Research (ACI blanche Jeune Chercheur), to B.F.

Chen,S., Lesnik,E.A., Hall,T.A., Sampath,R., Griffey,R.H., Ecker,D.J. and Blyn,L.B. (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems, 65, 157–177. Eddy,S.R. (2002) Computational genomics of noncoding RNA genes. Cell, 109, 137–140. Fichant,G.A. and Burks,C. (1991) Identifying potential tRNA genes in genomic DNA sequences. J. Mol. Biol., 220, 659–671. Hüttenhofer,A., Kiefmann,M., Meier-Ewert,S., O’Brien,J., Lehrang,H., Bachellerie,J.P. and Brosius,J. (2001) RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. The EMBO J., 20, 2943–2953. Klein,R.J., Misulovin,Z. and Eddy,S.R. (2002) Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc. Natl Acad. Sci. USA, 99, 7542–7547. Lesnik,E., Sampath,R., Levene,H., Henderson,T., McNeil,J. and Ecker,D. (2001) Prediction of rho-independent transcriptional terminators in Escherichia coli. Nucl. Acids Res., 29, 3583–3594. Lisacek,F., Diaz,Y. and Michel,F. (1994) Automatic identification of group I intron cores in genomic DNA sequences. J. Mol. Biol., 235, 1206–1217. Lowe,T.M. and Eddy,S.R. (1999) A computational screen for methylation guide snoRNAs in yeast. Science, 283, 1168–1171. Rivas,E. and Eddy,S.R. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2(8), 1–19. Wassarman,K., Repoila,F., Rosenow,C., Storz,G. and Gottesman,S. (2001) Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev., 15, 1637–1651. Wassarman,K. (2002) Small RNAs in bacteria: diverse regulators of gene expression in response to environmental changes. Cell, 109, 141–144. Yada,T., Nakao,M., Totoki,Y and Nakai,K. (1999) Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15, 987–993.

1709


CALCULATIONS

REFERENCES

Intergenic sequence inspector: searching and

Intergenic sequence inspector: searching and

Suggest Documents

Enterobacterial Repetitive Intergenic Consensus Sequence Repeats ...

Comparison of dkgBlinked intergenic sequence ... - BioMedSearch

dna compressed and sequence searching on

Local repeat sequence organization of an intergenic spacer in the

Plant viral intergenic DNA sequence repeats with transcription ...

The Nuclear Ribosomal DNA Intergenic Spacer as a Target Sequence ...

Comparison of dkgB-linked intergenic sequence ribotyping to DNA ...

Nucleotide sequence of the delta-beta-globin intergenic segment in ...

Sequence variation of intergenic mitochondrial DNA ... - naldc - USDA

Sequence context-specific profiles for homology searching

Searching for Sequence Directed Mutagenesis in Eukaryotes

Searching the expressed sequence tag (EST ...

MMseqs2: sensitive protein sequence searching for

Searching sequence space for protein catalysts

Statistics of large scale sequence searching

Sensitive protein sequence searching for the analysis

TargetFinder: searching annotated sequence databases for target ...

Protein and nucleic acid sequence database searching

SeqWare Query Engine: storing and searching sequence data in the ...

Intergenic bidirectional pro

Sequence of the Chicken 62 Crystallin Gene and Its Intergenic Spacer

Characteristics of Intronic and Intergenic Human ...

Characteristics and Signiicance of Intergenic ...

Searching for Bulges at the End of the Hubble Sequence