© 2006 Nature Publishing Group http://www.nature.com/naturemethods
PERSPECTIVE
A guide through present computational approaches for the identification of mammalian microRNA targets

Praveen Sethupathy1,2, Molly Megraw1,2 & Artemis G Hatzigeorgiou1–3

Computational microRNA (miRNA) target prediction is a field in flux. Here we present a guide through five widely used mammalian target prediction programs. We include an analysis of the performance of these individual programs and of various combinations of these programs. For this analysis we compiled several benchmark data sets of experimentally supported miRNA–target gene interactions. Based on the results, we provide a discussion on the status of target prediction and also suggest a stepwise approach toward predicting and selecting miRNA targets for experimental testing.
miRNAs are short RNAs, ~22 nucleotides long, that are involved in regulating the expression of mRNA. They guide the RNA-induced silencing complex to miRNA target sites that are thought to be most prevalent in the 3′ untranslated region (UTR) of an mRNA1,2. The first miRNAs and their target genes were identified via classical genetic techniques in 1993, but it was not until 2001 that many more miRNAs were discovered experimentally and found to be abundant and widespread3–6. For a detailed description of the biogenesis and targeting mechanisms of miRNAs we direct the reader to several of the many comprehensive reviews on the topic2,7,8.

During 2003, several groups independently developed the first computational miRNA target prediction programs focused on the fruit fly9–11 and mammalian12,13 genomes. The lack of high-throughput experimental methods for miRNA target identification provided the impetus for continued development of computational target prediction programs, resulting in at least ten more programs during the last two years (see Table 1 for a list of most of these programs and their web locations). Here we focus on mammalian target prediction programs that are either directly available to us (DIANA-microT12 and miRanda11) or that provide precompiled lists of their predictions (TargetScan13, miRanda11, TargetScanS14 and PicTar15). We provide an analysis of their performance on a set of experimentally supported miRNA targets.

miRNA target sites can be classified into three categories: (i) 5′-dominant canonical, (ii) 5′-dominant seed only and (iii) 3′-compensatory16 (Fig. 1). The seed region is defined as the consecutive stretch of 7 nucleotides starting from either the first or the second nucleotide at the 5′ end of an miRNA. The canonical sites have perfect base pairing to at least the seed portion of the 5′ end of the miRNA and extensive base pairing to the 3′ end of the miRNA. The seed-only sites have perfect base pairing to at least the seed portion of the 5′ end of the miRNA and limited base pairing to the 3′ end of the miRNA. The 3′-compensatory sites have extensive base pairing to the 3′ end of the miRNA to compensate for imperfect or shorter base pairing to the seed portion of the miRNA16.

The basic features of the computational target prediction programs that we analyze here are listed in Table 2. These programs essentially perform two steps. In the first step they identify potential miRNA binding sites according to specific base-pairing rules. TargetScanS requires perfect complementarity with an miRNA seed14, whereas PicTar allows for targets with imperfect seed matches provided that they pass a heuristically defined binding-energy threshold15. Additionally, PicTar implements a maximum likelihood approach to incorporate the combinatorial nature of miRNA targeting15,17. MiRanda uses a modified dynamic programming approach that recognizes the importance of seed binding, but does not require perfect seed complementarity18. DIANA-microT also uses a modified dynamic programming approach to implement an empirically determined set of binding
1Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA. 2Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA. 3Department of Computer and Information Science, School of Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA. Correspondence should be addressed to A.G.H. ([email protected]).
PUBLISHED ONLINE 23 OCTOBER 2006; DOI:10.1038/NMETH954
NATURE METHODS | VOL.3 NO.11 | NOVEMBER 2006 | 881
Table 1 | Current target prediction programs available for public use

| Program | Website | Organism^a | Reference |
|---|---|---|---|
| DIANA-microT | http://www.diana.pcbi.upenn.edu/ | Any | 12 |
| MicroInspector | http://mirna.imbb.forth.gr/microinspector/ | Any | 32 |
| MiRanda | http://www.microrna.org/ | Fruit fly | 11 |
| MiRanda | http://www.microrna.org/ | Vertebrates | 18 |
| PicTar I | http://pictar.bio.nyu.edu | Fruit fly | 33 |
| PicTar I | http://pictar.bio.nyu.edu | Vertebrates | 15 |
| PicTar II | http://pictar.bio.nyu.edu | Any | 17 |
| Ref. 27 | http://tavazoielab.princeton.edu/mirnas/ | Worm and fruit fly | 27 |
| RNA22 | http://cbcsrv.watson.ibm.com/rna22.html | Any | |
| RNAhybrid | http://bibiserv.techfak.uni-bielefeld.de/ | Any | |
| TargetBoost | https://demo1.interagon.com/demo | Worm and fruit fly | 35 |
| TargetScan | http://www.targetscan.org/archives.html | Vertebrates | 13 |
| TargetScanS | http://www.targetscan.org/ | Vertebrates | 14 |
34
Programs are listed in alphabetical order by program name. ^a Organism(s) for which the program is best suited.
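The seed definition used in the text (a stretch of 7 consecutive nucleotides starting at the first or second nucleotide of the miRNA 5′ end) can be illustrated with a minimal Python sketch of a perfect seed-match scan. This is not the algorithm of any program listed in Table 1: the function names, the toy UTR fragment and the restriction to Watson–Crick pairs (no G:U wobble, no energy or conservation filter) are all assumptions made for illustration. The let-7a sequence serves only as example input.

```python
from typing import List

# Illustrative sketch only; names and the toy UTR are assumptions,
# not part of any published target prediction program.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str, offset: int = 1, length: int = 7) -> str:
    """Return the mRNA motif (5'->3') that pairs perfectly with the miRNA seed.

    offset=0 starts the seed at the first miRNA nucleotide, offset=1 at the second.
    """
    seed = mirna.upper().replace("T", "U")[offset:offset + length]
    if len(seed) < length:
        raise ValueError("miRNA is too short for the requested seed")
    # Watson-Crick complement, then reverse so the motif reads 5'->3' on the mRNA.
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna: str, utr: str, offset: int = 1, length: int = 7) -> List[int]:
    """Return 0-based start positions of perfect seed matches in a 3' UTR."""
    utr = utr.upper().replace("T", "U")
    motif = seed_site(mirna, offset, length)
    hits: List[int] = []
    pos = utr.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(motif, pos + 1)  # step by one so overlapping hits are kept
    return hits


let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # let-7a, used here only as example input
toy_utr = "AAACUACCUCAGGG"        # made-up UTR fragment

print(seed_site(let7a))                             # CUACCUC
print(find_seed_matches(let7a, toy_utr))            # [3]
print(find_seed_matches(let7a, toy_utr, offset=0))  # [4]
```

A scan of this kind corresponds only to the strictest "perfect seed match" rule; programs that tolerate imperfect seed matches would need an alignment or energy-based score in place of the exact string search.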
rules that includes strong base pairing to a miRNA seed region, but does not require perfect seed complementarity12.

In the second step, the programs implement cross-species conservation requirements. TargetScanS and PicTar both require conservation among at least five species for the portion of the target site that binds to the miRNA seed, but they define conservation slightly differently. TargetScanS requires that a seed match occur at exactly corresponding positions in a cross-species UTR alignment, whereas PicTar requires only that the seed match occur at overlapping positions in a cross-species UTR alignment. A later version of PicTar provides precompiled target predictions on the mouse genome based on comparative analyses of 17 vertebrate genomes. MiRanda and DIANA-microT require only conservation between human and rodent, and define conservation as an entire target site occurring with at least 90% identity at exactly corresponding positions in a cross-species UTR alignment. MiRanda provides the additional option of using extensive conservation (more than two species). An extended description of miRNA target prediction programs and the statistical significance of their predictions is available in ref. 19.

Figure 1 | Three categories of microRNA target sites. Experimentally supported examples of canonical (left), seed (middle) and 3′-compensatory (right) mammalian miRNA target sites. Panels: 5′-dominant canonical site (LAMC2 3′UTR, hsa-miR-199b); 5′-dominant seed site (BCL2 3′UTR, hsa-miR-15/16); 3′-compensatory site (LIMK1 3′UTR, mmu-miR-134).

The benchmark set

During the past two years the number of miRNAs, as well as the number of experimentally supported miRNA–target gene interactions, has more than doubled. The miRNA database miRBase20 (http://microrna.sanger.ac.uk/sequences/) now contains over 450 mammalian entries, and TarBase21 (http://www.diana.pcbi.upenn.edu/tarbase), a database of experimentally supported miRNA–target gene interactions, reports around 130 mammalian entries. TarBase also reports the experiments that were performed to provide support for each miRNA–target gene interaction, which range from in vitro reporter silencing assays to in vivo miRNA overexpression studies.

For this study, we consider only the subset of mammalian miRNA–target gene interactions in TarBase for which the strongest experimental evidence (a direct miRNA effect) has been shown. At the time of this study, TarBase contained 84 such interactions involving 32 different miRNAs. These 84 interactions, however, are not unbiased with respect to the computational miRNA target prediction programs that we analyze here: some of them were initially discovered during the experimental testing phase of mammalian target prediction program validation, and others were chosen for laboratory testing after the use of a particular program. Further analysis of these 84 interactions also revealed that 23 of them correspond to target sites that are not conserved in any other species (Supplementary Table 1 online). As conservation is an important requirement for the miRNA target prediction programs discussed here, none of these 23 interactions could be predicted by any of these programs. For this reason, we compiled from the complete set of 84 interactions four types of data sets to serve as benchmarks in this comparative analysis. These data sets are defined as follows.

Current. This data set includes all 84 interactions and is referred to as ‘current’ (Supplementary Table 2 online).

Unbiased current. This type of data set is specific to each individual program or combination of programs. It includes only those interactions that were not chosen for verification during the experimental testing phase of the program(s) or were not chosen for verification based on a specific run of the program(s).

Conserved current.
This data set includes 61 interactions (Supplementary Table 2) that meet the conservation requirements of the programs analyzed here.

Conserved unbiased current. This type of data set is specific to each program or combination of programs. It includes all of the interactions in the corresponding unbiased current data set except those that are not conserved in other species (Supplementary Table 2).

Table 2 | Summary of the features used by the mammalian target prediction programs considered in this study
Programs (columns): TargetScan, D-microT, miRanda, TargetScanS, PicTar
Features (rows), by category:
Sequence: Perfect seed match rule; Preference for perfect seed match^a; Empirically determined binding rules; Dynamic programming alignment score cutoff; Seed 5′ and/or 3′ flank requirements
Thermodynamics: ∆G calculations based on traditional RNA folding programs; ∆G calculations based on programs for short nucleic acid hybridizations
Conservation: Only between human and rodent species^b; Among human, chimp, rodent, and dog^b; Residing in an ‘island’ of conservation
^a PicTar10,25 does predict targets with imperfect seed matches, but preferentially predicts targets with perfect seed matches. ^b miRanda provides the option of running the program under both parameters. The comparative study presented in this paper uses the “only human and rodent” version of miRanda.

Comparative study

Definition of a predicted miRNA–target interaction. Computational target prediction programs provide putative binding sites for miRNAs. A single miRNA can interact with a gene at multiple sites; however, experimental support is primarily provided for miRNA–target gene interactions as opposed to miRNA–target site interactions. For comparison and analysis purposes, we transform the predictions of each program (miRNA–target site) into miRNA–target gene interactions. A miRNA–target gene pair is reported as a prediction only if there is at least one target site prediction on the UTR of the target gene. If more than one miRNA is predicted to target the same gene, we count each miRNA–target gene pair as a separate interaction.

Different programs use different UTR annotation sets to define their search space for prediction. For example, PicTar searches the UTRs of all splice forms of a gene, whereas TargetScanS searches only the UTR of the longest splice form of a gene. As a consequence, the different mammalian target prediction programs do not refer to predicted target genes according to the same identification type. TargetScan, DIANA-microT and miRanda provide Ensembl gene identifiers, TargetScanS provides gene symbols, and PicTar provides RefSeq identifiers. We used identifier conversion tables from the Ensembl website to map all identification types to a single standard type. One Ensembl gene identifier can map to several RefSeq identifiers or Ensembl transcript identifiers corresponding to different splice forms of a gene, but one RefSeq identifier maps to only one unique Ensembl gene identifier. Therefore, we chose the Ensembl gene identifier as this standard type.

Sensitivity. The performance of the computational programs is first measured by sensitivity, given by the following equation:

$$\text{Sensitivity} = \frac{\text{True positives}}{\text{True positives} + \text{false negatives}}$$
‘True positives’ is defined as the number of experimentally supported miRNA–target gene interactions that are predicted by a program, and ‘false negatives’ is defined as the number of experimentally supported miRNA–target gene interactions that are not predicted by a program. Using only this equation, however, a program that predicts every gene as a target of each miRNA will have the ‘best’ performance because it will include all experimentally supported miRNA–target gene interactions. It will, of course, also include a huge number of false predictions. For this reason it is also necessary to calculate the specificity or its complement, the ‘false positive rate’22.

False positive rate. False positive rate is defined as the proportion of all negatives (experimentally refuted miRNA–target gene interactions) that is erroneously predicted. Presently, TarBase has a record of only ~20 experimentally refuted miRNA–target interactions for mammals, of which only two are unbiased with respect to all prediction programs21, a sample too small to estimate the false positive rate reliably. Here we attempt to address this problem by providing the total number of miRNA–target gene interactions that are predicted by every program or combination of programs. We will only truly know whether this ‘best available’ approximation to specificity is accurate when high-throughput experimental methods for miRNA–target gene interaction assessment become available.

Comparison of individual programs. The results for each of the individual programs are summarized in Table 3. We observed that the earliest developed programs (TargetScan and DIANA-microT) achieve a relatively low sensitivity on the benchmark data sets (less than 7% on ‘conserved unbiased’) and also predict a relatively small number of total miRNA–target gene interactions. Notably, the next program developed (miRanda) picks up nearly 65% of conserved unbiased experimentally supported interactions, but also exhibits a substantial increase in the number of total predictions. Finally, the most recent programs (TargetScanS and PicTar) demonstrate almost identical sensitivity to miRanda but predict several thousand fewer miRNA–target gene interactions.

Program combinations. Next we investigated several combinations of programs. Table 3 provides a summary of the performance of various unions and intersections of individual programs. Each line in the table corresponds to a point in Figure 2.
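The evaluation steps described above (collapsing site-level predictions to miRNA–target gene pairs, computing sensitivity, and combining programs by union or intersection) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the two programs, all miRNA and gene identifiers and the benchmark set are invented toy data, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]        # (miRNA, gene identifier)
Site = Tuple[str, str, int]   # (miRNA, gene identifier, site position in the UTR)


def to_gene_level(site_predictions: Iterable[Site]) -> Set[Pair]:
    """Collapse site predictions into unique miRNA-target gene pairs.

    A pair is reported if at least one site is predicted on the gene's UTR.
    """
    return {(mirna, gene) for mirna, gene, _pos in site_predictions}


def sensitivity(predicted: Set[Pair], supported: Set[Pair]) -> float:
    """True positives / (true positives + false negatives)."""
    if not supported:
        raise ValueError("benchmark set is empty")
    true_pos = len(predicted & supported)
    false_neg = len(supported - predicted)
    return true_pos / (true_pos + false_neg)


# Toy benchmark of experimentally supported interactions (invented).
supported = {("miR-1", "G1"), ("miR-1", "G2"), ("miR-2", "G3"), ("miR-3", "G4")}

# Toy site-level output of two hypothetical programs, A and B.
sites_a = [("miR-1", "G1", 10), ("miR-1", "G1", 250), ("miR-2", "G3", 40), ("miR-9", "G7", 5)]
sites_b = [("miR-1", "G2", 77), ("miR-2", "G3", 41), ("miR-9", "G8", 3)]

pred_a = to_gene_level(sites_a)   # two sites on G1 count as one interaction
pred_b = to_gene_level(sites_b)

for label, preds in [
    ("A", pred_a),
    ("B", pred_b),
    ("A union B", pred_a | pred_b),
    ("A intersect B", pred_a & pred_b),
]:
    print(f"{label}: sensitivity={sensitivity(preds, supported):.2f}, total={len(preds)}")
```

In this toy case the union raises sensitivity (0.75) at the cost of more total predictions (5), whereas the intersection lowers both (0.25 and 1), which is the trade-off the total-prediction counts are meant to expose. The sketch assumes that all gene identifiers have already been mapped to a single identifier type.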
Table 3 | Sensitivity and number of total predictions for each program

| | Program | Percentage of experimentally supported miRNA–target gene interactions predicted^a | Percentage of conserved and experimentally supported miRNA–target gene interactions predicted^b | Number of total miRNA–target gene interactions predicted | Number of total miRNA–target gene interactions predicted per miRNA |
|---|---|---|---|---|---|
| 1 | TS | 20.8% (4.7%) | 29.6% (7.3%) | 278 | 9 |
| 2 | DT | 9.5% (1.3%) | 13.1% (1.9%) | 95 | 3 |
| 3 | miR | 48.8% (48.8%) | 67.2% (67.2%) | 18,289 | 572 |
| 4 | TSS | 47.6% (45.6%) | 66.1% (64.3%) | 10,351 | 323 |
| 5 | PT | 47.6% (45.0%) | 65.6% (63.2%) | 11,259 | 352 |
| | Program unions | | | | |
| 6 | PT, TSS | 52.4% (51.8%) | 72.3% (71.7%) | 14,583 | 456 |
| 7 | PT, TSS, TS | 57.1% (56.1%) | 78.7% (78.0%) | 14,690 | 459 |
| 8 | PT, TSS, DT | 57.1% (55.6%) | 78.7% (77.6%) | 14,632 | 457 |
| 9 | PT, TSS, miR | 66.7% (66.7%) | 91.8% (91.8%) | 26,800 | 838 |
| 10 | PT, TSS, miR, TS | 70.2% (69.9%) | 96.7% (96.7%) | 26,881 | 840 |
| 11 | PT, TSS, miR, TS, DT | 72.6% (71.6%) | 100% (100%) | 26,915 | 841 |
| | Program intersections | | | | |
| 12 | PT, TSS | 41.7% (41.0%) | 57.3% (56.7%) | 7,036 | 220 |
| 13 | PT, TSS, TS | 11.9% (11.9%) | 16.4% (16.4%) | 119 | 4 |
| 14 | PT, TSS, DT | 3.6% (3.6%) | 4.9% (4.9%) | 25 | |