ABOUT SEQUENCE MOTIFS IN UTRs
K. Missal, P.F. Stadler
Bioinformatics Group, Department of Computer Science, Universität Leipzig, Germany
Tel: ++49 341 149 5115
Fax: ++49 341 149 5119
Email: {kristin, studla}@bioinf.uni-leipzig.de
WorldWideWeb: http://www.bioinf.uni-leipzig.de/
Sequence motifs of DNA or RNA molecules are sequences of any length that have an important function in gene regulation. Non-specific sequence motifs are often revealed by cross-species comparisons because their evolutionary rate is low. We are interested in developing methods to identify non-specific and specific sequence motifs in untranslated regions (UTRs) of eukaryotic mRNAs by exploring overrepresented oligonucleotides in conjunction with their secondary structures. We focus on UTRs because post-transcriptional control in eukaryotes is mainly mediated by functional elements in the untranslated regions at the 5’ and 3’ ends of the mRNA.
[Figure 1: schematic of a mature transcript with the parts 5’ UTR, translated region, 3’ UTR]
Fig. 1: Tripartite structure of mature eukaryotic transcripts. The 5’ and 3’ untranslated regions (UTRs) play an important role in regulating the translation of the message. Known motifs in 5’ UTRs regulate the efficiency with which the mRNA message is translated into an amino-acid sequence, e.g. by leaky scanning, whereas known motifs in 3’ UTRs affect in particular the rate of mRNA degradation, e.g. by deadenylation.
An existing collection of functional sequences and structures located in UTRs is UTRSite [6], but it is compiled from information reported in the literature.

Retrieval of UTRs
UTRs of mature mRNAs are mostly not known exactly because the detection of complete mRNAs is still an unsolved problem in gene prediction. Existing UTR databases, like UTRdb [6] and UTRef [8], scan EMBL/GenBank and RefSeq entries, respectively, for coding and non-coding sequence annotations. This approach depends heavily on the quality of the annotation.
A more sophisticated approach is taken by gene annotation pipelines, like Ensembl [1], where evidence from proteins and cDNAs substantiates the prediction of the tripartite structure of eukaryotic mRNA. In particular we are interested in Hox clusters, which comprise genes involved in regulating development. Comprehensive sources for UTRs of Hox genes are Ensembl and UTRef (Fig. 2). We used Ensembl to retrieve UTRs, as the strong algorithmic background of its gene annotation pipeline results in more reliable transcript annotations. However, a combination of Ensembl and UTRef would be worth examining.

Sequence motifs in UTRs of Hox genes
Phylogenetic comparisons extract conserved sequence regions, which are sequences with a low evolutionary rate and hence candidates for sequence motifs. In [7] the tracker approach was introduced, which identifies highly conserved regions, so-called footprints (FPs):

Species   5’ UTR   3’ UTR
Human     24%      44%
Mouse     21%      48%
Rat       25%      42%
Table 1: Fraction of UTRs of the human, mouse and rat Hox genes containing footprints.
Cross-species comparisons are useful for identifying long conserved stretches of sequence but fail to reveal binding sites and regulatory elements specific to a species. Algorithms that search for overrepresented oligonucleotides in a set of sequences address this task. Oligonucleotide patterns can be represented by [3]:
• Exact k-mers
• Regular expressions
• HMM
• Probability matrices
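To make the listed representations concrete, the following minimal Python sketch expresses one motif as an exact k-mer, as a regular expression and as a probability matrix. The sequences, the variable position and the pseudocount are hypothetical values chosen for illustration only; an HMM is omitted to keep the example short.

```python
import re

# Hypothetical toy sequences, not taken from the Hox data.
sequences = ["acgugacc", "acgagacc", "uuacgugaccuu"]

# 1. Exact k-mer: a plain substring match.
kmer = "acgugacc"
exact_hits = [s for s in sequences if kmer in s]

# 2. Regular expression: the fourth position may be 'u' or 'a'.
pattern = re.compile(r"acg[ua]gacc")
regex_hits = [s for s in sequences if pattern.search(s)]


# 3. Probability matrix: one distribution over {a, c, g, u} per position,
#    estimated from aligned instances with a pseudocount (an assumption).
def probability_matrix(instances, alphabet="acgu", pseudocount=1.0):
    length = len(instances[0])
    matrix = []
    for i in range(length):
        column = [inst[i] for inst in instances]
        total = len(column) + pseudocount * len(alphabet)
        matrix.append(
            {x: (column.count(x) + pseudocount) / total for x in alphabet}
        )
    return matrix


matrix = probability_matrix(["acgugacc", "acgagacc"])
```

The exact k-mer misses the variant instance that the regular expression still matches, while the matrix additionally records how often each base occurs at each position.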
How can an a priori unknown pattern of unknown length be identified in a set of sequences $S$? The naive method generates all $4^l$ patterns of length $l$ over the alphabet $\Sigma = \{a, c, g, u\}$. One such algorithm is used in WordUP [5], which finds all exact matches of a pattern $p$. The method is based on a first-order Markov chain model to calculate the expected probability $P_e(p)$ of observing $p$ in $S$. A $\chi^2$-score, comparing $P_e(p)$ to the observed relative frequency $P_o(p)$ of pattern $p$ in $S$, reveals how significant the overrepresentation of $p$ is:

$$\chi^2 = \frac{\left(P_o(p) - P_e(p)\right)^2}{P_e(p)}.$$
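The scoring scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the WordUP implementation: base frequencies and transition probabilities are estimated from pooled mono- and dinucleotide counts in $S$, $P_o(p)$ is the fraction of all length-$l$ windows that equal $p$, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "acgu"


def markov_model(seqs):
    """Estimate base frequencies and first-order transition probabilities.

    Assumes every sequence contains only characters from ALPHABET.
    """
    base, pair = Counter(), Counter()
    for s in seqs:
        base.update(s)
        pair.update(zip(s, s[1:]))
    n = sum(base.values())
    freq = {x: base[x] / n for x in ALPHABET}
    trans = {}
    for x in ALPHABET:
        row = sum(pair[(x, y)] for y in ALPHABET)
        for y in ALPHABET:
            trans[(x, y)] = pair[(x, y)] / row if row else 0.0
    return freq, trans


def expected_probability(p, freq, trans):
    """P_e(p) = P(p_1) * product over i of P(p_{i+1} | p_i)."""
    prob = freq[p[0]]
    for x, y in zip(p, p[1:]):
        prob *= trans[(x, y)]
    return prob


def count_words(seqs, l):
    """Count every window of length l in all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - l + 1):
            counts[s[i:i + l]] += 1
    return counts


def chi2_scores(seqs, l):
    """Score all 4^l words; words with P_e(p) = 0 are skipped."""
    freq, trans = markov_model(seqs)
    counts = count_words(seqs, l)
    windows = sum(counts.values())
    if windows == 0:
        return {}
    scores = {}
    for word in map("".join, product(ALPHABET, repeat=l)):
        pe = expected_probability(word, freq, trans)
        if pe == 0.0:
            continue
        po = counts[word] / windows
        scores[word] = (po - pe) ** 2 / pe
    return scores
```

Because the score is symmetric in the deviation, a high value alone does not distinguish overrepresented from underrepresented words; a caller would additionally compare $P_o(p)$ with $P_e(p)$. The loop over all $4^l$ words also shows why this enumeration is practical only for small $l$.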
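[Figure 2: three panels, one per species, covering Hox clusters A to D (HsA–HsD, MmA–MmD, RnA–RnD); axis label: Counts; series: UTRdb (5’), UTRef (5’), EnsEMBL (5’), GENES, EnsEMBL (3’), UTRef (3’), UTRdb (3’)]
Fig. 2: Comparison of three different sources for UTRs of the mammalian Hox genes (UTRdb, UTRef and Ensembl). Each database was queried for UTRs of human, mouse and rat Hox genes. Blast alignments identified the positions of the UTRs within the Hox cluster. Alignments with homologous UTRs revealed the positions of a few UTRs that were not known for a given species.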
Region   Oligo      χ²-Score   Counts in UTR   Counts in FPs
5’ UTR   acaaauca   46.34      6               1
         aaccgaca   40.79      3               1
         acgugacc   37.79      2               0
         cacgugac   34.93      2               0
         accgacac   31.82      2               1
3’ UTR   cacacaca   34.08      6               0
         gcacgcgg   28.50      3               0
         gcgcgccc   23.67      5               0
         uauauuuu   17.69      8               3
         auauauau   17.59      27              1
Table 2: Five most significant oligonucleotides of length 8 observed in UTRs of the human Hox cluster, as identified by WordUP [5], and their occurrence in footprints in UTRs of the human Hox cluster identified by tracker [7].
In Table 2, two pairs of significant words with overlapping regions appear (2nd and 5th row, 3rd and 4th row of the 5’ UTR block). This raises the question whether words with overlapping regions are two different patterns, instances of the same pattern, or whether one is just a random word that has a high $\chi^2$-score because it overlaps a significant word. This suggests applying validation statistics to identify significant overrepresentation [3].

Searching for exact matches is not sufficient, as biological sequences are subject to mutations. Noise in the pattern should be modelled, e.g. using regular expressions [2] or HMMs [9]. The detection of overrepresented patterns by generating all $4^l$ possible words is feasible only for small $l$.

Alternatives to the pattern-driven approach described above are sequence-driven and sample-driven approaches. Sequence-driven methods [2] use local pairwise alignments to reveal patterns that match a subset of the sequences in $S$, but for large sets of sequences this method is still too time-consuming. More promising are sample-driven methods [4], which generate seed patterns whose neighbourhoods are further explored via local search. They are less time-consuming because they generate a much smaller set of patterns, but they might miss patterns with highly variable instances.
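The overlap observed in Table 2 can be checked mechanically. The sketch below is one possible way to flag such pairs, not part of WordUP or tracker: it reports pairs of words in which a suffix of one word equals a prefix of the other. The function names and the threshold `min_overlap` are hypothetical; the words are the 5’ UTR entries of Table 2.

```python
from itertools import combinations


def suffix_prefix_overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlapping_pairs(words, min_overlap):
    """Return (word1, word2, overlap) for pairs overlapping in either direction."""
    pairs = []
    for a, b in combinations(words, 2):
        k = max(suffix_prefix_overlap(a, b), suffix_prefix_overlap(b, a))
        if k >= min_overlap:
            pairs.append((a, b, k))
    return pairs


# 5' UTR words from Table 2, in table order.
words_5utr = ["acaaauca", "aaccgaca", "acgugacc", "cacgugac", "accgacac"]
pairs = overlapping_pairs(words_5utr, min_overlap=7)
print(pairs)
# [('aaccgaca', 'accgacac', 7), ('acgugacc', 'cacgugac', 7)]
```

Both reported pairs are shifted by a single position, so they share seven of their eight nucleotides. Whether such a pair reflects one longer pattern or two independent ones still has to be decided by the validation statistics mentioned above.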
We applied WordUP to identify overrepresented oligonucleotides in UTRs of human Hox genes (Table 2). WordUP found additional occurrences of oligonucleotides observed in footprints and others which could not be revealed by cross-species comparisons.

References
[1] E. Birney, et al. Ensembl 2004. Nucl. Acid Research, 32:D468–D470, 2004.
[2] A. Brazma, I. Jonassen, J. Vilo, E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Gen. Research, 8:1201–1215, 1998.
[3] S. Hampson, D. Kibler, P. Baldi. Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics, 18:513–528, 2002.
[4] U. Keich, P.A. Pevzner. Finding motifs in the twilight zone. Bioinformatics, 18:1374–1381, 2002.
[5] G. Pesole, N. Prunella, S. Liuni, M. Attimonelli, C. Saccone. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucl. Acid Research, 20:2871–2875, 1992.
[6] G. Pesole, S. Liuni, G. Grillo, F. Licciulli, F. Mignone, C. Gissi, C. Saccone. UTRdb and UTRSite: specialized databases of sequences and functional elements of 5’ and 3’ untranslated regions of eukaryotic mRNAs. Nucl. Acid Research, 30:335–340, 2002.
[7] S. Prohaska, C. Fried, C. Flamm, G.P. Wagner, P.F. Stadler. Surveying Phylogenetic Footprints in Large Gene Clusters: Applications to Hox Cluster Duplications. Mol. Evol. Phylog., 31:581–604, 2004.
[8] http://www.ba.itb.cnr.it/srs7bin/cgi-bin/wgetz?-page+LibInfo+id+3a2m41NjNtT+-lib+UTR REFSEQ
[9] T. Yada, Y. Totoki, M. Ishikawa, K. Asai, K. Nakai. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences. Bioinformatics, 14:317–325, 1998.