Complications in Mammalian microRNA Target ... - Semantic Scholar

3 downloads 0 Views 155KB Size Report
MicroRNA, microRNA target prediction, RNA interference, LINE-2, genomic repeats ... precursors, because they appear to regulate mRNA stability and/or ...
Complications in Mammalian microRNA Target Prediction

Neil R. Smalheiser, MD, PhD and Vetle I. Torvik, PhD

Department of Psychiatry and UIC Psychiatric Institute, University of Illinois-Chicago, MC912, 1601 W. Taylor Street Chicago, IL 60612 USA 312-413-4581 fax 312-413-4569 [email protected]

i.

Summary.

In this chapter, we review evidence that at least 3 different types of microRNA-mRNA target interactions exist in mammals: short-seeds, long-seeds, and “perfect” hits (allowing G:U matches). Since new types of microRNAs are still being discovered, this list may not yet be complete.

ii.

Key Words.

MicroRNA, microRNA target prediction, RNA interference, LINE-2, genomic repeats

1

1. Introduction The phenomena of RNA interference were first defined in plants and C. elegans, and have subsequently been studied in all living phyla including mammals (1). Among several different populations of small RNAs that have been identified, the microRNAs have perhaps received the most attention because they arise from well-defined genomic precursors, because they appear to regulate mRNA stability and/or translation, and because they are clearly important for development and cell differentiation (see reviews in refs. 2-4).

Over the past few years, a growing number of biologically validated mRNA targets have been identified for specific microRNAs. In plants, many of the mRNA targets appear to encode transcription factors; the microRNA often shows perfect or near-perfect “long seed” complementarity with its target, and microRNAs bind within the protein coding region as well as in untranscribed regions (5, 6). In C. elegans and Drosophila, the situation appears to be quite different: the 5’-end of the microRNA generally exhibits a short, perfect “seed” of 6-8 nucleotides in length, excluding G:U matches, which appears to be targeted to the 3’-UTR of the mRNA (2-4).

A number of studies have used biologically validated mRNA targets in C. elegans and Drosophila as a training set for machine learning algorithms, to predict additional mRNA targets that follow the same rules in a variety of species, including mammals (715). Most of these studies focused attention on 3’-UTR regions that are well conserved across species, in order to minimize the chance of false-positive predictions. Indeed,

2

based on these rules, each microRNA appears to have ~10-100 plausible mRNA targets. Studies of microRNA-mRNA binding have confirmed the importance of short-seeds having a 5’-end location within certain microRNAs (16).

However, it does not appear that all mammalian microRNAs interact with their targets via short-seeds. Xie et al. (17) identified a large number of 8-mer motifs within 3’-UTRs that were conserved across multiple mammalian species, which correspond to short-seed mRNA target regions. Only about half of currently identified mammalian microRNAs mapped to one of these motifs, suggesting that the remaining microRNAs may obey different types of binding interactions with their targets.

In a recent study, we asked if mammalian microRNAs may exhibit long-seed complementarity interactions with their mRNA targets (18). Although no biologically validated mammalian microRNA-mRNA pairs were known at the time, we took the statistical approach of looking at the entire set of potential binding interactions among all known microRNAs and all mRNAs in the human RefSeq database. To define the level of binding interactions that would be expected by chance, each microRNA sequence was scrambled 10 times [either scrambled by random permutation or by maintaining the dinucleotide composition; both methods gave similar results] and each of the scrambled microRNAs was tested. Our underlying assumption is that scrambled sequences will hit mRNA at random and define the “noise” level in any given situation, whereas microRNA sequences will hit the same number of “noise” interactions plus any true targets.

3

2. Methods 2.1. Retrieving the microRNA and RefSeq mRNA sequences. The set of mature miRNA sequences are available in FASTA format in a plain text file from The miRNA Registry (http://www.sanger.ac.uk/Software/Rfam/mirna/). This file contains all miRNA sequences from all species, where the human ones are identified by lines starting with ">hsa". The most recent set of human mRNA sequences in the NCBI RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/) can be retrieved by submitting the query "srcdb_refseq[prop] AND biomol_mrna[prop] AND homo sapiens[orgn]" to Entrez Nucleotide (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide) database, and saving the results to a local file. The RefSeq database is updated quite often, so one is not likely to retrieve the same set of records for the same query performed twice e.g. a month apart. Some records are deleted, some are temporarily removed while being annotated, and some are replaced by newer versions (even though the actual sequence may not be changed). However, the status of any given record can be found by searching on the GI or accession numbers.

2.2. Preprocessing. Many miRNA sequences are similar (some even differ only in 1 or 2 nucleotides), a fact that could potentially affect the statistics in this study. To remove redundant miRNAs, we compared all miRNA sequences pairwise, removing the shorter miRNA from consideration if they overlapped with 10 or more nucleotides. Since we only count exact hits of length 10 or more, no hits on the same mRNA at the same location will be counted more than once.

4

To create a population of “negative control” sequences to represent what is expected by chance, we created 10 scrambled versions of each miRNA sequence, by randomly permuting its nucleotides. This maintains the base composition but not the order. Alternatively, we also scrambled the sequences while preserving the dinucleotide composition by first splitting each miRNA sequence into chunks of two nucleotides analogous to a reading frame: once starting with the first nucleotide and once starting with the second nucleotide. We then permuted these chunks in creating a second set of scrambled miRNA sequences.

Note that RefSeq mRNA sequences use Ts whereas the miRNA Registry files use Us. The regular expression $sequence =~ s/T/U/g substitutes each occurrence of a T with a U, which is necessary to be able to match the two sequences (fig. 1).

2.3. Aligning miRNAs with mRNA sequences. Because the standard BLAST server at NCBI is unable to find imperfect alignments of the nature studied here (potentially with large gaps and multiple mismatches per matched nucleotide) (19), we wrote a Perl script to perform a custom gapped BLAST alignment. The procedure comprises two main steps:

2.3.1. Identifying exact hits. Since we are looking for sequence complementarity (where miRNAs bind to mRNAs), each miRNA sequence (and their scrambled counterparts) are first reversed and

5

them complemented (i.e., A->U, C->G, U->A, G->C). Complementation can be accomplished with the regular expression $sequence =~ tr/ACUG/UGAC/g. For each pair of miRNA and mRNA sequences, take all subsequences of the miRNA in decreasing length starting with the entire sequence and going down to 10 nucleotides. Search for this subsequence within the mRNA sequence (see Perl code in fig. 1). If an exact match is found, then the hit site on the mRNA is replaced with Xs so that only the longest hit by the miRNA at this site will be counted. The hits are then recorded in a “hit file” identifying the two sequences, the length of the hit, and the location on each of the two sequences.

Often several different RefSeq records represent close variants of the same mRNA. To prevent this redundancy effect from inflating the statistics on exact hits, we only count the longest mRNA and do not count hits upon other mRNAs whose sequences matched exactly on the 25 nucleotides on each side of the hit. This will remove hits that have long stretches of exactly the same sequences, while preserving the hits that only share the hit region and not the flanks. We also remove hits having low complexity sequences by using the regular expression $sequence =~ s/((.+)\2{4,})/’N’ x length $1/eg (Lincoln Stein, Bioperl: Repetitive DNA, http://bioperl.org/pipermail/bioperl-l/1999November/003313.html).

2.3.2. Extending the exact hits. Given an exact hit of 10 or more nucleotides, with no gaps, mismatches or G:U matches, the seed region is extended in both directions to find the optimal alignment allowing for large gaps and multiple mismatches. This is achieved by penalizing gaps at

6

only a fraction of the reward of a match, and making the mismatch penalty a fraction of the reward of a match. Note that the BLAST programs from NCBI tends to not pick up good alignments in this scenario, probably due to the method of calculating significance (i.e., "Expect Values"). We allow a ratio of up to 4:1 between mismatches and matches, and allow a four nucleotide gap between matches, while allowing a longer gap as long as multiple matches are obtained as a result. This is achieved with the parameters: Match reward, r = 10, mismatch penalty, q = - 2.5, Open gap penalty G = 8, Extend gap penalty E = 0.5.

Given a pair of sequences x and y of lengths m and n, respectively, let xi, yj denote their i-th and j-th nucleotide, respectively. Then, the alignment score is calculated as follows:

Start by assigning Score(i,j) = 0 for all (i,j) where i or j = 0. Then, for i = 1,…, m and j = 1, …, n compute the score using the following equation:

 Score(i − 1, j − 1) + (r − q ) × Match( xi , y j ) − q  Score(i , j ) = max  Score( k , j ) − G − E × (i − k ), k = 0,..., i − 1  Score(i , k ) − G − E × ( j − k ), k = 0,..., j − 1 

The greatest score is then given by Score(m,n). If the actual alignment is to be found one needs to keep track of the path from (0,0) to (m,n) in the Score array as well.

For each record identified in the hit file, a sequence alignment is performed between

7

the right flank of the miRNA vs the right flank of the mRNA using the recursive equation above, ensuring that the first nucleotides are either gap or mismatch scored. Then, the same procedure is performed on the left flanks, after which the total score is tallied and recorded in a file. When allowing for G:U matches, nucleotide C in the microRNA reverse-complement sequence is replaced by C OR U, and nucleotide A is replaced by A OR G (fig. 1) indicating that multiple combinations should receive a match reward.

3. An analysis of long-seed interactions between mammalian microRNAs and their targets. As shown in fig. 2, a large number of long-seed complementary interactions existed between microRNA sequences and mRNA sequences having 10 or more exact matches (excluding G:U matches), which differed significantly from the level expected by chance. Strikingly, at longer and longer seed lengths, the difference between the microRNA set and the scrambled set became more and more pronounced: at a seed length of 16 or greater, there were 3.5 times more examples of potential target interactions in the microRNA set than in the scrambled set, and at a seed length of 17 or greater, the ratio was over 7 to 1 (fig. 2).

This finding suggested that some true microRNA-mRNA target interactions may involve long-seeds. To further define a list of candidate mRNA targets with high confidence, we analyzed further the set of interactions having 10 or more matches in a row (the “10+ set”) to see if they would also exhibit other statistically significant differences at the population level. Indeed, we defined a gapped BLAST score for

8

microRNA-mRNA interactions taking into account gaps, mismatches and G:U matches, and found that the gapped BLAST scores in the 10+ set were significantly better overall than expected by chance (fig. 3). Furthermore, for those mRNAs that received multiple “hits” from different microRNAs within the 10+ set, we examined the minimum distance between hits. Again, there was a striking difference between the minimum distance seen in the microRNA population vs. the population of scrambled sequences (fig 4).

By combining these three criteria (seed length, gapped BLAST score, and minimum distance between hits) we created a list of 71 candidate mRNA targets which were unlikely to be chosen by chance – with a discrimination ratio of 5.2 to 1, which means that over 80% of the targets on the list are expected to be biologically valid targets. The 71 mRNAs had a larger number of microRNA hits per kilobase of target sequence than did the scrambled sequences. As well, individual microRNAs hit multiple (up to 17) distinct members of the candidate set, which again happened significantly more often than by chance (fig. 5). These findings indicate that the outlier mRNAs are different as a whole from the mRNAs that were hit by scrambled counterparts, even those that satisfied the same cut-off criteria.

Our candidate target list contained very similar types of targets as predicted by the short-seed computational studies, including members of the same gene families: Transcription factors (including homeobox genes), nucleic acid-binding proteins and many other functional categories including kinases, receptors and other signal transduction proteins, membrane and cytokeletal proteins, and effectors of differentiation.

9

Also, our study was in concordance with other studies showing that some individual microRNAs may hit multiple mRNA targets residing in the same metabolic pathway (18). However, surprisingly, we found that the long-seed candidate mRNA target list had no preference at all for microRNA hits to be located within 3'-untranslated regions, but hit within the protein coding region in ~2/3 of cases. As well, the best microRNA hits upon candidate mRNA targets did not have relatively better target complementarity near their 5'-end. These considerations suggest that short-seed and long-seed target interactions both exist in mammals, and that they may follow different rules.

4. “Perfect” target interactions (allowing G:U matches). A few examples of microRNAs exhibiting perfect complementarity have been described (miR-196 (20), miR-127 and miR-136 (21, 22)). During our analysis of longseed interactions we were struck by the existence of targets that had perfect complementarity to microRNAs along their full length (ref. 18, fig. 2), and we wondered whether they could be representatives of a larger, distinct class of targets.

Whereas long-seeds are defined as having exact Watson-Crick base pairing with no G:U matches, recent suggest that complementarity interactions that contain up to, say, 5-7 G:U matches (but no frank mismatches) could still be deemed “perfect” in gene silencing (23). In particular, we noted that human miR-95 was perfectly complementary (including up to 4 G:U matches) with scores of human mRNAs and ESTs (fig. 6). Similarly, miR-151* was perfectly complementary to 6 transcripts, which was significantly above the level expected by chance.

10

The reason for these perfect hits turned out to be simple and intriguing: The precursors of these two microRNAs, as well as two others (miR-28 and miR-325), turned out to derive entirely from genomic repeats (24). In particular, the hairpin foldbacks were formed by the junction of two adjacent LINE-2 repeat segments apposed in opposite orientation (fig. 7). Insofar as MIR repeats and other LINE-2 derived elements are present in the 3’-UTR of many different mRNAs and EST transcripts, when in the proper orientation they are natural generic targets for the repeat derived microRNAs – in some cases having perfect complementarity, and in other cases (since there is divergence among LINE-2 elements) imperfect complementarity (24).

5. Discussion

Computer and genomic analyses of mammalian microRNA-mRNA target interactions suggest that at least three distinctly different types of interactions exist: short-seeds, long-seeds and “perfect” hits (allowing G:U matches). By their nature, one would expect that short-seed interactions are likely to have weak effects, to inhibit protein translation, and to affect large numbers of genes in parallel, whereas long-seed interactions might be expected to have stronger effects and involve fewer targets. The LINE-2 repeat derived microRNAs appear to recognize transcripts that share repeats in their 3’-UTR regions, and when they bind with perfect or near-perfect complementarity, they may be expected to lead to degradation of the transcripts. This could serve as a mechanism for detecting and neutralizing aberrant transcripts (having readthrough

11

transcription from retained introns or neighboring genomic regions) as well as serving to regulate specific mRNAs (24).

It should be cautioned that complementarity between microRNAs and their targets is not the only factor that may govern which microRNA-mRNA target interactions are effective in vivo. One must consider the potential importance of mRNA target secondary structure (25) as well as the strong possibility that RNA-binding proteins may participate in microRNA recognition (26). As well, both microRNA and mRNA need to be coexpressed in proper amounts within the same cell for effective interaction to occur, and A-to-I editing of RNA might abrogate potential mRNA targets from being effectively silenced by the RISC complex (27).

Finally, it is likely that new types of microRNA-mRNA target interactions still remain to be uncovered. A number of viruses have been shown to encode microRNA precursors (28-31) that are not conserved in their host genomes or across different viruses, and their targets have not been fully defined. As well, a population of small RNAs that shares some, but not all, of the features of microRNAs has been described in C. elegans (32); these so-called tncRNAs are not conserved across species, and their targets (if any) have not been identified yet. Therefore, new types of target interactions, and indeed new types of small RNAs, may be identified in mammalian species within the near future.

12

6. Acknowledgments Supported by NIH grants DA15450 and LM07292. This Human Brain Project/Neuroinformatics research is funded jointly by the National Library of Medicine and the National Institute of Mental Health.

13

7. References 1. Dykxhoorn, D.M., Novina, C.D. and Sharp, P.A. (2003) Killing the messenger: short RNAs that silence gene expression. Nat. Rev. Mol. Cell Biol. 4, 457-467. 2. Bartel, D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297. 3. Ambros, V. (2004) The functions of animal microRNAs. Nature 431, 350-355. 4. Lai, E.C. (2003) microRNAs: runts of the genome assert themselves. Curr. Biol. 13, R925-936. 5. Llave, C., Xie, Z., Kasschau, K.D. and Carrington, J.C. (2002) Cleavage of Scarecrow-like mRNA targets directed by a class of Arabidopsis miRNA. Science 297, 2053-2056. 6. Rhoades, M.W., Reinhart, B.J., Lim, L.P., Burge, C.B., Bartel, B. and Bartel, D.P. (2002) Prediction of plant microRNA targets. Cell 110, 513-520. 7. Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C. and Marks, D.S. (2003) MicroRNA targets in Drosophila. Genome Biol. 5, R1. 8. Stark, A., Brennecke, J., Russell, R.B. and Cohen, S.M. (2003) Identification of Drosophila MicroRNA Targets. PLOS Biol. 1, 397-409. 9. Rajewsky, N. and Socci, N.D. (2004) Computational identification of microRNA targets. Dev. Biol. 267, 529-535. 10. Lewis, B.P., Shih, I.H., Jones-Rhoades, M.W., Bartel, D.P. and Burge, C.B. (2003) Prediction of mammalian microRNA targets. Cell 115, 787-798.

14

11. Kiriakidou, M., Nelson, P.T., Kouranov, A., Fitziev, P., Bouyioukos, C., Mourelatos, Z. and Hatzigeorgiou, A. (2004) A combined computational-experimental approach predicts human microRNA targets. Genes Dev. 18, 1165-1178. 12. John, B., Enright, A.J., Aravin, A., Tuschl, T., Sander, C. and Marks, D.S. (2004) Human MicroRNA targets. PLoS Biol. 2, e363. 13. Rehmsmeier, M., Steffen, P., Hochsmann, M. and Giegerich, R. (2004) Fast and effective prediction of microRNA/target duplexes. RNA 10, 1507-1517. 14. Lewis, B.P., Burge, C.B. and Bartel, D.P. (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15-20. 15. Brennecke, J., Stark, A., Russell, R.B. and Cohen, S.M. (2005) Principles of microRNA-target recognition. PLoS Biol. 3, e85. 16. Doench, J.G. and Sharp, P.A. (2004) Specificity of microRNA target selection in translational repression. Genes Dev. 18, 504-511. 17. Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S. and Kellis, M. (2005) Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434, 338-345. 18. Smalheiser, N.R. and Torvik, V.I. (2004) A population-based statistical approach

identifies parameters characteristic of human microRNA-mRNA interactions. BMC Bioinformatics 5, 139. 19. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

15

20. Yekta, S., Shih, I.H. and Bartel, D.P. (2004) MicroRNA-directed cleavage of HOXB8 mRNA. Science 304, 594-596. 21. Seitz, H., Youngson, N., Lin, S.P., Dalbert, S., Paulsen, M., Bachellerie, J.P., Ferguson-Smith, A.C. and Cavaille, J. (2003) Imprinted microRNA genes transcribed antisense to a reciprocally imprinted retrotransposon-like gene. Nat Genet. 34, 261262. 22. Smalheiser, N.R. (2003) EST analyses predict the existence of a population of chimeric microRNA precursor-mRNA transcripts expressed in normal human and mouse tissues. Genome Biol. 4, 403. 23. Miyagishi, M., Sumimoto, H., Miyoshi, H., Kawakami,Y. and Taira, K. (2004) Optimization of an siRNA-expression system with an improved hairpin and its significant suppressive effects in mammalian cells. J. Gene Med. 6, 715-723. 24. Smalheiser, N.R. and Torvik, V.I. (2005) Mammalian microRNAs derived from genomic repeats. Trends Genet., in press. 25. Bohula, E.A., Salisbury, A.J., Sohail, M., Playford, M.P., Riedemann, J., Southern, E.M. and Macaulay, V.M. (2003) The efficacy of small interfering RNAs targeted to the type 1 insulin-like growth factor receptor (IGF1R) is influenced by secondary structure in the IGF1R transcript. J. Biol. Chem. 278, 15991-15997. 26. Jing, Q., Huang, S., Guth, S., Zarubin, T., Motoyama, A., Chen, J., Di Padova, F., Lin, S.C., Gram, H. and Han, J. (2005) Involvement of microRNA in AU-rich element-mediated mRNA instability. Cell 120, 623-634. 27. Scadden, A.D. and Smith, C.W. (2001) RNAi is antagonized by A-->I hyper-editing. EMBO Rep. 2, 1107-1111.

16

28. Pfeffer, S., Zavolan, M., Grasser, F.A., Chien, M., Russo, J.J., Ju, J., John, B., Enright, A.J., Marks, D., Sander, C. and Tuschl, T. (2004) Identification of virusencoded microRNAs. Science 304, 734-736. 29. Bennasser, Y., Le, S.Y., Yeung, M.L. and Jeang, K.T. (2004) HIV-1 encoded candidate micro-RNAs and their cellular targets. Retrovirology 1, 43. 30. Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F.A., van Dyk, L.F., Ho, C.K., Shuman, S., Chien, M., Russo, J.J., Ju, J., Randall, G., Lindenbach, B.D., Rice, C.M., Simon, V., Ho, D.D., Zavolan, M. and Tuschl, T. (2005) Identification of microRNAs of the herpesvirus family. Nat Methods 2, 269276. 31. Cai, X., Lu, S., Zhang, Z., Gonzalez, C.M., Damania, B. and Cullen, B.R. (2005) Kaposi's sarcoma-associated herpesvirus expresses an array of viral microRNAs in latently infected cells. Proc. Natl. Acad. Sci. U S A 102, 5570-5575. 32. Ambros, V., Lee, R.C., Lavanway, A., Williams, P.T. and Jewell, D. (2003) MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr. Biol. 13, 807-818.

17

Figure Legends

1. Perl script to match a subsequence of a miRNA with a mRNA. See Methods for details.

2. microRNAs and their scrambled counterparts interact differently with the population of human mRNAs. Shown are all exact hits ≥ 10 bases long (not counting G:U matches) produced on human RefSeq mRNAs by the set of nonredundant microRNAs, vs. the average of 10 replications of scrambled control sequences. Shown is the number of hits as a function of exact hit length. Only the longest hit was counted: e.g., for a hit of length 18, the two subsets of length 17 in the same hit position were not counted. Figures 2-5 are reproduced from BMC Bioinformatics 5, 139.

3. Distribution of gapped BLAST scores in hits made by microRNAs and scrambled counterparts. Without permitting G:U matches in the extension phase, the microRNAs had better average gapped BLAST scores than scrambled counterparts across all mRNAs in the "10+ set" (153.00 ± 0.03 vs. 150.98 ± 0.01, mean ± s.e.m., p < 0.0001). With permitting G:U matches in the extension phase, the microRNA set showed significantly fewer G:U matches overall relative to scrambled counterparts, even when holding constant the length of the exact hit (2.891 ± 0.004 vs. 2.939 ± 0.001, p < 0.0001).

4. Number of distinct mRNA sequences which received hits from two or more distinct microRNAs, as a function of the minimum distance between hits. Distance of

18

0 or 1 was excluded because this might be produced by partial overlap of microRNA sequences.

5. Individual microRNAs hit multiple targets on the candidate list, more often than expected by chance.

6. Multiple sequence alignment of mRNAs and ESTs that exhibited perfect complementarity to miR-95. miR-95 hit perfectly (including two-to-four G:U matches) on three mRNAs and 94 ESTs comprising 31 distinct clusters (Unigene clusters or singletons when not belonging to a Unigene cluster). Multiple sequence alignment performed using ClustalW (http://www.ebi.ac.uk/clustalw) shows that these putative targets are not related to each other except in the microRNA hit region and in nearby LINE-2 homologous sequences (L2A consensus from Repbase, http://www.girinst.org). Figures 6 and 7 are reproduced with permission from Trends in Genetics.

7. Genomic structure of human LINE-2-derived miRNA precursors. Information was downloaded and edited from the UCSC Genome Browser. Each of the precursors resides within an intron, and each flanks the junction of two L2 repeats in opposite orientation (darker shading indicates less divergence from the L2 consensus sequence).

19

$mrna = 'ACTTTGGGACGAGCTT'; $mrna =~ s/T/U/g; #replace Ts with Us $mirna = 'UUGG'; $mirna = reverse $mirna; #reverse the sequence $mirna =~ tr/ACUG/UGAC/; #find the complement if ($GUs_allowed) { $mirna =~ s/C/[CU]/g; $mirna =~ s/A/[AG]/g; }

#Gs will now match either Cs or Us #Us will now match either As or Gs

while ($mrna =~ /$mirna/g) { $hit_lngth = length($mirna); $hit_pos = pos($mrna) - $hit_lngth; substr($mrna,$hit_pos,$hit_lngth) = 'X' x $hit_lngth; }

5

microRNA sequences (nonredundant set) Scrambled sequences (average of 10 replications with 95% prediction interval)

10

4

Number of hits of exact hit length x or greater

10

3

10

2

10

35 1

10.1

10

14 4

3

3

3 2

1.9 0

1

10

0.5 0.2 −1

10

10

12

14

16 18 Exact hit length, x

20

22

microRNA sequences (nonredundant set) Scrambled sequences (average of 10 replications) 1600

1400

Number of hits

1200

1000

800

600

400

200

0 110

120

130

140

150

160 Blast score

170

180

190

200

210

microRNA sequences (nonredundant set) Scrambled sequences (average of 10 replications) 3000

Number of mRNAs

2500

2000

1500

1000

500

0

0

50

100 150 200 250 300 350 400 Minimum distance between pairs of hits by different microRNAs

450

500

35 microRNA sequences (nonredundant set) Scrambled sequences (average of 10 replications with 95% prediction interval)

30

Number of microRNA sequences

25

20

15

10

5

0

2

4

6

8

10

12

Number of distinct mRNA sequences hit

14

16

GI|4190660 GI|27481231|LOC285431 GI|27784230 GI|820483 GI|3678514 GI|3838849|SPAG11 GI|12040792 GI|38201645|E2F6 GI|23679869 GI|10730623 GI|13995855 GI|1688960 GI|22513822 GI|6039557 GI|2788661 GI|10340744 GI|23527186 GI|18991197 GI|1990638 GI|21857398 L2A_consensus GI|1224665 GI|772218 GI|802402 GI|24797633 GI|11599167 GI|1548472 GI|1551964 GI|26673550 GI|34321815|Nbla03515 GI|10832781 GI|10850675 GI|11106952 GI|16337390 GI|27479135|KIAA1998 GI|34478492 GI|11650612|IMAGE4289241

T C C A A C T G T A T C G G C T C C G T T A C C A G G T G A T T A C

T T T T T T A A A C C T A A C A T T C T C T C A T A A G T A C A A T

C T T G G A G C G C C A T A C T A A T T T A C C T T T G A G A T A C

C A A G G G T T A G T A C C A C G G A T C G T C T T T T A C T T G T

T T T C C A T G A A G C A A A A A A G G C G T C G A C A C C T T T T

A G G A A C T T T C T T T G A A G A A T T T G A A T T T A C G T A T

A A C C G G T A A G G G G C G G T T T A T C C C A T T C T C A C G T G

T T T T A A G C A T G T A A A T A G G T T C T C A T T G T A A T C G T

A A A A T T T C G A C A A A T C G T T G T C G T C G G A A T C G A A T

10 | GC GC GA GA CT CT A A A T CT A G A C - A A A T GC CA T G GC - GA A A T A T C A T T C GG CA T T CT T T A T T A A T A T T T T G T T

T T C C G G G T C T A T A T G A A G G A A T T C G G T A C A A A T T G

G G T T G G C A C C G T A T T A A C C G T A T T A C T A C T G C T C C

A A G G G G T G G C C C C G G T A T T C T G T C G T T C G A A A T C A

G G T T A A C A T C A C A T T T T C C T T A C A T A A T C A G C A A A

C C A A T T C C G A G A C G A T G C C C T A T C A T A G T A A A T G A

A A A A C C T T A T C T C A G C A A A C T T C A A G A A A T A A G T T

C C G G T T T G G G C T T C T T A T T A C G T A G A T T G A A A C G G

T T C C A A G T G A T A C T C T G G G T A T G A C T T G A T A G C A G

20 | CT CT T C T C A A A A A G GA GG GG A A - GG T C T G A G GA T G - A G A G GA A C A A GC GC CA GC CC GA T T T G GG A C A T GC T A T T C C A A G G T G T C A T A G A A G G A G T C T C A A T C A C G T C

G G T T T T G C A C C C C A C G A A G G A C C C C T T G T T T A C C A

G G T T A A C T G A G A T G A A T C C G C T T G C C A C C T G A A A T

A A G G T T A C G G A T C A G A A A A C C C G G C T C A A A G A G G T

G G A A A A G C G G C G T G C A T G G A A C G C T C A T G T G A A G G

T T G G A A G T A G A A C C C A T G G A T T G A G A T A C T G C A G T

G G G G G G G T C A T G T C A G T G G G A T A G G C G T A T T T C A A

T T G G C C A G A C A T G T G G T T A A G C A G G T T A T A T G G A A A

30 | A G A G CA CA T C T C CT A G T G T T GA - CT GC T T CA T C GC T T CC CT A T CA A G CA T C T T A G T T CT A G T T T C T A A T GG CC G G G G C C G G G G C C T A T A A T T T T C T A T G A T T A T A A A T T

G G A A C C T G T C A T A C G A G T G T T A T G G G T A C A A C C C T G

G G G G G G G A C A G G T C G T A G C G A A T G C G G G T C G T T T T T

C C A A G G T C T T C C T T T G T T T T T G A A T T T A A T A G T G A T

T T C C A A C T G C G C T A G T G C C C C G A C C G G A A G T C T A T C

C C T T A A T T C T T T A T A C T T C T T A G T C G A A C G A T A A A T

A A G G G G A T T A G C C T A T C G T G G A C G C G G A C A A T A A T A

A A T T G G T G T T T T T G C T T T T T T T A A C G C A A G T G T A T G

40 | CC CC T T T T CA CA CT T C T C T T GC - T T T T T C T G T T A C T T CA T T A G GA CT CT GC GC A T A A CA T T GA GC GT CA GT A G T T T T G G T A A T G T G T A C A T C T T A T A G A C C G T A A C T G C

T T A A G G G T T G T G C A G A G A A G G C T A C G A A T A A G A A G A

A A T T G G T A T T G T A C T T T T T T T C T T C A G T G T G A T A A G

T T T T A A T A C T G T T C C C T T T T T C T C C A G A C T T G C T T A

G G T T T T C T A T G C T C A T T C C C T A G C C C A T C C A C G C C G

C C A A G G A C C G G A G T A C A A G A A A T A C G A A T A G A A T A A

C C T T T T C A C T G G C C A T T C T C C A T T T G G A T T A C T T T C

T T T T T T A C A G G T C T C G C T G T C T C A C C G T A C T A T G T T

50 | CT CT CT CT GA GA GT T G CT GT A G - A A T T T T A G T A A C T C GT GC A G A T A C T G GG A A T G GT CC T T A C GT GA GT A T T C T T T T A A T T G T C T T G G T T T A T T A T T C C T A A T C G G G T C

T T G G G G G T T G A G C T T T G G G G G T G G T T C T C T A C T C G T

G G T T T T T A A T C G A G A G T C T T T C T T T A C T C A T C G T T G

G G A A A A A C T A A A A T T T C A A A A T A A C C A T T A G G T T A T

A A T T T T T A T T G T A T A T T T T T T T T T T C C T G C T G G G T A

C C C C C C C T C C A C G C G C C C C C C G C C C C T A A A T G C A T G

A A C C C C C C C T G T A C A C C C C C C T C T T T G T C T C C T A T T

C C T T C C C C T C G C T T C C C C C C C C C C G C C T C C A A G A T C

60 | A C A C CA CA A A A A CA CA GC T A GC - T A A A A A A G CA T C T G CA CA A A CT T A T A A C A G T A T T A G A C T T CT GA CA T G T C A A G G G G G G A A T G T C C C A G G G G T A A T G T A T C A C G G G T

A A T T G G C T G C G G A T A C C T G C T C N C T G G T A C T T A C T A

G G G G G G A G G A G A C C A T A G G G G C A A C A T G T T T C C T A G

C C C C C C C C C C G C A T C C T T C C C A C T A A T T A A G T A A T G

C C C C T T C C C C G C C A C T C C C C C A C C C G C G A T T T G T A G

T T T T T T C T T T G T C C C T T T T T T G T T C C T C T A T C T C T A

A A G G A A A C G G G A C C C A A A A A A T A A T T A T A C A A T C G A

G G G G A A G A C G G A A A T A T G G G G G G G C T G T T T C G A T A A

70 | CC CC T A T A A G A G CA CA CA GA CA - A G GC T A CA CA CA A A GA MA A G A G A A CT CT CA A A CC GA T T T A GG A C A G T A CT C C C C G G C C C T C C A G G C C T C C T C T G C A T T T G G G A A C A

A A A A A A A C A A T C T T A A A A A A A A A G C T A G T A C C T A T A

T T G G A A G A G G A T A G A G A G G G G C G A A C G G G A T C A T G C

G G G G T T T T T T A T T G G T T T A T C C T T C T C T A C T C G A C A

G G A A G G G G G C G T G G T G G T G G T G G C T N T G G T G T G A C G

G G C C T T C C C A T C C T G C T C C C A T C C G C G C C A G T C T C C

C C T T G G C C C C A T C C T C G C T C T G T T C A C C A C T A C G T A

T T T T T T T T T T T T T C C T T T T T T G T T T T T T C C T G T A G G

80 | GG GG GA GA GG GG GG GG GG GG CC - GG A A A G T G GG GG GG GG GG GA GT GT A G CA GG A G GG CT A T T T T C GC CA A T GT C C C C C C C C C C T C C C C C C T C C C G C G G C C T A C T A A T A T

A A A A T T A C A A A A A C A A G A A A A A A A T T A A G C A C C A C G

T T C C T T T C C A T T T C C C T T C C C C C C A C C C C T G A A T A A

A A A A A A A T T A A A A A A A A A A A G A A T A T A A A C T C G A T T

T T G G T T T T T T C T T T T T T T C T T G C T A T T T A A G A A T T T

A A A A A A T A A G C A A T A A G A A A A A A A G A A A A A C T C G A C

G G G G G G G A G A A T A G G C G G G G G G G G C G A A G G A A T C A T

G G T T T T T G C T T T T T T T T T T T G T T T A T T A C T T A A T A T

90 | A G A G A G A G A G A G A G A A A G A G CT A G A G A A GG A G A G A G A G A G A G A G A A A A GG A G A G A G GT A C A A A A T G GG A G A A A A C C T T G G G G G T G G G C G A G G A G G G A A T G G G A A G T A C A A G

T T T T T T T T T T T T T T T T T T T T C T T T T T T T T T T T T T T T T

G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G

C C T T T T C C C C C C C C C C C C T C C C C C C C C T C C C T C T T C C

T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T

C C C C C C T C C C T C C C C C C T C C C C C C C C C C C C C C C C C C T

A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

A A A A A A A A A A A A G A G A A A A A A A A G A A A A A A A A G G A A A

100 | T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A T A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T

G G A A A A A G G A G A A A A A G A A G A A A A A G A A A A A A G A A A A

T T T T C T T T T T T T T T T T T T T T T T T T T T T T T C T C C C T T C

T T T T T T T C T T C T T T T T C T T T T T T T T T T T T T T T C T T C C

T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T

G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G

110 | T T T T T T T T T T T T T T A A T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T G G G G G G G C G G G G G G G G G G G G G G G G G G G G G G G G G G G G G

A A A A A A A A A G A A A A A A A A A A A A A A A A A A A A G A A A A A A

A A A A A A A A A A A G A A A A A A A A A A A A A A A A A A A A A A A G A

C C T T T T T A A C C G T C T T T T T T T T T T T T T T T T T T T T T C T

G G G G G G G T T A C C G A A A A G G G G G G G A G C A G A G G G G G A G

A A A A A A A A A A G A A A A A G A A A A A A A A A T A A A A A C A A T A

A A A A A A A A A A A G A A C A A G A A A A A A A C G A T A A A C A G C C

T T T T T T A G A T A G G T T A A T T T T T T T T T C T G T T T T T T T T

120 | GA GA CA CA T T T T A A T G A A A A A A GC GA T G T T GA T T T A GA GA GA GA A A GA GA A A A A GA T A GA GC GC GA GA A T GT GA G G A A A A A A A T A A A A A A A A A A A A A A A A T A G T A A T G G T T

T T A A G G A A A G A T T T T T T T T T T C C T T T A T T G A G A T A C A

G G A A C C A A A A A G A G G T T G T G A A G G C T G C A A G C G G T G

G G A A A C A A A A A G C G A G A A C G A A T A C C C A A C T T A A T A

A A A A C C A A A A A T A A A A A A T C G C G G A T A A A A G T A A A A

T T A A C C A A A A A G T A A G T T C T G A T A C A T A G C A A C A A A

A A A A C A A A A A G A A A G A G T C T A G G C C G T G A A T G A A G

C C A A A A A A A A C A A A C A T G C G A T A C T A A G C A T A A A T

130 | A C A C A A - A A A G A A A A A A A A A T C T C A A A A A C T T GT T G T A - T T A T A G A A CC CA A T A A A T T C T G GA A T GA GA A G G G A G A A A A A A A A T T C T C T C G T A G A T T T A G G T C T

A A A G A A A A C T A A T A C A C G A C G A T T T T T T C G A T G

A A A T A A A T G T A T A T T C T T A A C T A T G T T A T G A T

T T A A A A T T C A A G G A A C G C T G A T T A A N C A T A A T

A A A A A A A C T T G A T A T C C T T C G G C A T T C A T A C

hsa-mir-28

Base Position

189727450

Chromosome Band RefSeq Gene

mir-28

MicroRNA from the miRNA Registry

hsa-mir-28 precursor LINE: L2

Repeating Elements by RepeatMasker

hsa-mir-95 hsa-mir-151\151*

8199550

LINE: L2

8199600

8199650

8199700

4p16.1 ABLIM2 intron

Chromosome Band RefSeq Gene

mir-95

MicroRNA from the miRNA Registry

hsa-mir-95 precursor

Repeating Elements by RepeatMasker

Base Position

189727550

3q28 LPP intron

Base Position

hsa-mir-325

189727500

LINE: L2

LINE: L2

141713700

141713750

141713800

141713850

8q24.3 PTK2 intron

Chromosome Band RefSeq Gene MicroRNA from the miRNA Registry

LINE: L2

Repeating Elements by RepeatMasker

Base Position

mir-151*

mir-151 hsa-mir-151\151* precursor

LINE: L2

75092250

75092300

75092350 Xq21.1

Chromosome Band

AK057746

mRNA from GenBank

mir-325 MicroRNA from the miRNA Registry Repeating Elements by RepeatMasker

hsa-mir-325 precursor LINE: L2

LINE: L2 SINE: MIR