ABOUT SEQUENCE MOTIFS IN UTRs

K. Missal, P.F. Stadler
Bioinformatics Group, Department of Computer Science, Universität Leipzig, Germany

Tel: ++49 341 149 5115

Fax: ++49 341 149 5119

Email: {kristin, studla}@bioinf.uni-leipzig.de

WorldWideWeb: http://www.bioinf.uni-leipzig.de/

Sequence motifs in DNA or RNA molecules are sequences of any length that have an important function in gene regulation. Non-specific sequence motifs are often revealed by cross-species comparisons because they evolve slowly. We are interested in developing methods that identify both non-specific and species-specific sequence motifs in untranslated regions (UTRs) of eukaryotic mRNAs by exploring overrepresented oligonucleotides in conjunction with their secondary structures. We focus on UTRs because post-transcriptional control in eukaryotes is mainly mediated by functional elements in the untranslated regions at the 5’ and 3’ ends of the mRNA.

[Figure 1: schematic of a mature mRNA with three parts: 5’ UTR, translated region, 3’ UTR]

Fig. 1: Tripartite structure of mature eukaryotic transcripts. The 5’ and 3’ untranslated regions (UTRs) play an important role in regulating the translation of the message. Known motifs in 5’ UTRs regulate the efficiency with which the mRNA is translated into an amino-acid sequence, e.g. by leaky scanning, whereas known motifs in 3’ UTRs mainly affect the rate of mRNA degradation, e.g. by deadenylation.

A more sophisticated approach is taken by gene annotation pipelines such as Ensembl [1], where evidence from proteins and cDNAs substantiates the prediction of the tripartite structure of eukaryotic mRNA. In particular, we are interested in Hox clusters, which comprise genes involved in regulating development. Comprehensive sources for UTRs of Hox genes are Ensembl and UTRef (Fig. 2). We used Ensembl to retrieve UTRs because the strong algorithmic background of its gene annotation pipeline results in more reliable transcript annotations. However, a combination of Ensembl and UTRef would be worth examining.

Sequence motifs in UTRs of Hox genes

Phylogenetic comparisons extract conserved sequence regions, i.e. sequences with a low evolutionary rate, which are hence candidates for sequence motifs. The tracker approach introduced in [7] identifies highly conserved regions, so-called footprints (FPs); Table 1 lists the fraction of Hox gene UTRs that contain them.

An existing collection of functional sequences and structures located in UTRs is UTRSite [6], but it is compiled from information reported in the literature.

Species   5’ UTR   3’ UTR
Human     24%      44%
Mouse     21%      48%
Rat       25%      42%

Table 1: Fraction of UTRs of the human, mouse and rat Hox genes containing footprints.

Retrieval of UTRs

The exact UTRs of most mature mRNAs are not known, because the detection of complete mRNAs is still an unsolved problem in gene prediction. Existing UTR databases, like UTRdb [6] and UTRef [8], scan EMBL/Genbank and Refseq entries, respectively, for coding and non-coding sequence annotations. This approach depends heavily on the quality of the annotation.

Cross-species comparisons are useful to identify long conserved stretches of sequence but fail to reveal binding sites and regulatory elements specific to a species. Algorithms searching for overrepresented oligonucleotides in a set of sequences address this task. Oligonucleotide patterns can be represented by [3]:

• Exact k-mers
• Regular expressions
• HMM
• Probability matrices
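As a rough illustration of three of the representations listed above (exact k-mers, regular expressions and probability matrices), the following Python sketch scores a sequence against each of them. The function names, the toy sequence and the matrix values are made up for illustration and are not taken from [3]; HMMs are omitted to keep the sketch short.

```python
import math
import re


def count_exact(seq, kmer):
    """Count (possibly overlapping) exact occurrences of kmer in seq."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)


def count_regex(seq, pattern):
    """Count start positions at which the regular expression matches.

    A zero-width lookahead is used so that overlapping matches are counted.
    """
    return len(re.findall(f"(?=(?:{pattern}))", seq))


def matrix_score(window, matrix):
    """Probability of a window under a position-specific probability matrix.

    matrix is a list with one dict per position, mapping base -> probability.
    """
    return math.prod(column[base] for base, column in zip(window, matrix))


if __name__ == "__main__":
    seq = "cacgugaccuuacgugacc"  # hypothetical sequence, illustration only
    print(count_exact(seq, "acgugacc"))
    print(count_regex(seq, "ac[gc]ugac"))  # tolerates g or c at position 3

    # Hypothetical two-column matrix favouring "ac".
    matrix = [
        {"a": 0.7, "c": 0.1, "g": 0.1, "u": 0.1},
        {"a": 0.1, "c": 0.7, "g": 0.1, "u": 0.1},
    ]
    print(matrix_score("ac", matrix))
```

The exact k-mer is the most restrictive representation; the regular expression and the matrix each allow some variation between instances of the same motif.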

How can an a-priori unknown pattern of unknown length be identified in a set of sequences $S$? The naive method generates all $4^l$ patterns of length $l$ from the alphabet $\Sigma = \{a, c, g, u\}$. One such algorithm is used in WordUP [5], which finds all exact matches of a pattern $p$. The method uses a first-order Markov chain model to calculate the expected probability $P_e(p)$ of observing $p$ in $S$. A $\chi^2$-score, comparing $P_e(p)$ to the observed relative frequency $P_o(p)$ of pattern $p$ in $S$, reveals how significant the overrepresentation of $p$ is:

\[
\chi^2 = \frac{\left(P_o(p) - P_e(p)\right)^2}{P_e(p)}.
\]
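The scoring scheme described above can be sketched in Python as follows. This is a simplified illustration, not the WordUP implementation: it assumes that the first-order Markov model is estimated from the mono- and dinucleotide counts of the same sequence set, and that $P_o(p)$ is the count of $p$ divided by the number of windows of length $l$. All function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count all (overlapping) words of length k in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def expected_probability(word, mono, di):
    """P_e(word) under a first-order Markov chain.

    P_e = P(w1) * prod_i P(w_{i+1} | w_i), with transition probabilities
    estimated from dinucleotide counts (an assumption of this sketch).
    """
    p = mono[word[0]] / sum(mono.values())
    for a, b in zip(word, word[1:]):
        row_total = sum(c for pair, c in di.items() if pair[0] == a)
        if row_total == 0:
            return 0.0
        p *= di[a + b] / row_total
    return p


def chi2_scores(seqs, l):
    """Chi-square score (P_o - P_e)^2 / P_e for every observed word of length l."""
    mono = kmer_counts(seqs, 1)
    di = kmer_counts(seqs, 2)
    observed = kmer_counts(seqs, l)
    n_windows = sum(observed.values())
    scores = {}
    for word, count in observed.items():
        p_e = expected_probability(word, mono, di)
        p_o = count / n_windows
        if p_e > 0:
            scores[word] = (p_o - p_e) ** 2 / p_e
    return scores


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    seqs = ["cacgugaccuuacgugacc", "uuuacgugaccaa"]
    ranked = sorted(chi2_scores(seqs, 4).items(), key=lambda kv: -kv[1])
    for word, score in ranked[:5]:
        print(word, round(score, 4))
```

Note that the score as defined is symmetric: it is large for underrepresented words as well as overrepresented ones, so a practical ranking would additionally check that $P_o(p) > P_e(p)$.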

[Figure 2: three panels, one each for the human (HsA to HsD), mouse (MmA to MmD) and rat (RnA to RnD) Hox clusters; y-axis: Counts; series: UTRdb (5’), UTRef (5’), EnsEMBL (5’), GENES, EnsEMBL (3’), UTRef (3’), UTRdb (3’)]

Fig. 2: Comparison of three different sources for UTRs of the mammalian Hox genes (UTRdb, UTRef and Ensembl). Each database was queried for UTRs of human, mouse and rat Hox genes. Blast alignments identified the positions of the UTRs within the Hox cluster. Alignments with homologous UTRs located a few UTRs that were not annotated for a given species.

Oligo       χ² score   Counts in UTRs   Counts in FPs

5’ UTR
acaaauca    46.34      6                1
aaccgaca    40.79      3                1
acgugacc    37.79      2                0
cacgugac    34.93      2                0
accgacac    31.82      2                1

3’ UTR
cacacaca    34.08      6                0
gcacgcgg    28.50      3                0
gcgcgccc    23.67      5                0
uauauuuu    17.69      8                3
auauauau    17.59      27               1

Table 2: The five most significant oligonucleotides of length 8 identified by WordUP [5] in the 5’ and 3’ UTRs of the human Hox cluster, and their occurrence in footprints in UTRs of the human Hox cluster identified by tracker [7].

Table 2 contains two pairs of significant words with overlapping regions (2nd and 5th row, 3rd and 4th row). This raises the question of whether words with overlapping regions are two different patterns, instances of the same pattern, or whether one is just a random word that has a high $\chi^2$-score because it overlaps a significant word. This suggests applying validation statistics to identify significant overrepresentation [3]. A search for exact matches is not sufficient, as biological sequences are subject to mutations. Noise in the pattern should be modelled, e.g. using regular expressions [2] or HMMs [9].

Detecting overrepresented patterns by generating all $4^l$ possible words is feasible only for small $l$. Alternatives to the pattern-driven approach described above are sequence-driven and sample-driven approaches. Sequence-driven methods [2] use local pairwise alignments to reveal patterns that match a subset of sequences in $S$, but for large sets of sequences this is still too time consuming. More promising are sample-driven methods [4], which generate seed patterns whose neighbourhoods are further explored via local search. They are less time consuming because they generate a much smaller set of patterns, but they might miss patterns with highly variable instances.
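The overlap between significant words can be checked mechanically. The sketch below is an illustration only and is not part of WordUP or tracker: it computes the longest suffix-prefix overlap between each pair of the five 5’ UTR words from Table 2 and reports pairs above a threshold. The function names and the threshold of 6 are assumptions chosen for this example.

```python
def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlapping_pairs(words, min_overlap):
    """Return (word1, word2, overlap) for all pairs overlapping by >= min_overlap."""
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            n = max(suffix_prefix_overlap(a, b), suffix_prefix_overlap(b, a))
            if n >= min_overlap:
                pairs.append((a, b, n))
    return pairs


if __name__ == "__main__":
    # The five 5' UTR oligonucleotides from Table 2, in table order.
    five_prime = ["acaaauca", "aaccgaca", "acgugacc", "cacgugac", "accgacac"]
    for a, b, n in overlapping_pairs(five_prime, min_overlap=6):
        print(a, b, n)
```

With this threshold the sketch reports the two pairs named in the text (rows 2 and 5, rows 3 and 4), each sharing 7 of 8 positions, i.e. each pair is a one-base shift of the same 9-mer.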

We applied WordUP to identify overrepresented oligonucleotides in UTRs of human Hox genes (Table 2). WordUP found additional occurrences of oligonucleotides observed in footprints and others which could not be revealed by cross-species comparisons.

References

[1] E. Birney, et al. Ensembl 2004. Nucl. Acid Research, 32:D468–D470, 2004.

[2] A. Brazma, I. Jonassen, J. Vilo, E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Gen. Research, 8:1201–1215, 1998.

[3] S. Hampson, D. Kibler, P. Baldi. Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics, 18:513–528, 2002.

[4] U. Keich, P.A. Pevzner. Finding motifs in the twilight zone. Bioinformatics, 18:1374–1381, 2002.

[5] G. Pesole, N. Prunella, S. Liuni, M. Attimonelli, C. Saccone. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucl. Acid Research, 20:2871–2875, 1992.

[6] G. Pesole, S. Liuni, G. Grillo, F. Licciulli, F. Mignone, C. Gissi, C. Saccone. UTRdb and UTRSite: specialized databases of sequences and functional elements of 5’ and 3’ untranslated regions of eukaryotic mRNAs. Nucl. Acid Research, 30:335–340, 2002.

[7] S. Prohaska, C. Fried, C. Flamm, G.P. Wagner, P.F. Stadler. Surveying Phylogenetic Footprints in Large Gene Clusters: Applications to Hox Cluster Duplications. Mol. Evol. Phylog., 31:581–604, 2004.

[8] http://www.ba.itb.cnr.it/srs7bin/cgi-bin/wgetz?-page+LibInfo+id+3a2m41NjNtT+-lib+UTR REFSEQ

[9] T. Yada, Y. Totoki, M. Ishikawa, K. Asai, K. Nakai. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences. Bioinformatics, 14:317–325, 1998.
