Finding cancer-associated miRNAs: methods and tools

0 downloads 0 Views 468KB Size Report
May 24, 2011 - ses on existing computational methods for identifying. miRNA genes, provides an overview of the methodology undertaken by these tools, and ...
See discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/51161164

Finding cancer-associated miRNAs: methods and tools ARTICLE in MOLECULAR BIOTECHNOLOGY · MAY 2011 Impact Factor: 2.28 · DOI: 10.1007/s12033-011-9416-4 · Source: PubMed

CITATIONS

DOWNLOADS

VIEWS

4

30

178

4 AUTHORS, INCLUDING: Anastasis Oulas

Nestoras Karathanasis

Hellenic Centre for Marine Research

Democritus University of Thrace

22 PUBLICATIONS 120 CITATIONS

9 PUBLICATIONS 16 CITATIONS

SEE PROFILE

SEE PROFILE

Panayiota Poirazi Foundation for Research and Technology - … 77 PUBLICATIONS 1,273 CITATIONS SEE PROFILE

Available from: Panayiota Poirazi Retrieved on: 19 August 2015

Mol Biotechnol (2011) 49:97–107 DOI 10.1007/s12033-011-9416-4

REVIEW

Finding Cancer-Associated miRNAs: Methods and Tools Anastasis Oulas • Nestoras Karathanasis Annita Louloupi • Panayiota Poirazi



Published online: 24 May 2011  Springer Science+Business Media, LLC 2011

Abstract Changes in the structure and/or the expression of protein coding genes were thought to be the major cause of cancer for many decades. The recent discovery of noncoding RNA (ncRNA) transcripts (i.e., microRNAs) suggests that the molecular biology of cancer is far more complex. MicroRNAs (miRNAs) have been under investigation due to their involvement in carcinogenesis, often taking up roles of tumor suppressors or oncogenes. Due to the slow nature of experimental identification of miRNA genes, computational procedures have been applied as a valuable complement to cloning. Numerous computational tools, implemented to recognize the features of miRNA biogenesis, have resulted in the prediction of novel miRNA genes. Computational approaches provide clues as to which are the dominant features that characterize these regulatory units and furthermore act by narrowing down the search space making experimental verification faster and cheaper. In combination with large scale, high throughput methods, such as deep sequencing, computational methods have aided in the discovery of putative molecular signatures of miRNA deregulation in human tumors. This review focuses on existing computational methods for identifying miRNA genes, provides an overview of the methodology undertaken by these tools, and underlies their contribution towards unraveling the role of miRNAs in cancer.

A. Oulas  N. Karathanasis  P. Poirazi (&) Institute of Molecular Biology and Biotechnology (IMBB), Foundation for Research and Technology-Hellas (FORTH), Heraklion, Crete, Greece e-mail: [email protected] N. Karathanasis  A. Louloupi University of Crete, Heraklion, Crete, Greece

Keywords MicroRNAs  Gene prediction  Software  Tools comparison  Cancer

Introduction Current estimates show that while over 30% of vertebrate genomes is transcribed [1], only 1% represents protein coding genes; the rest is believed to be various types of non-coding RNA (ncRNA) genes. Currently, over 1,000 human miRNA genes exist in the miRNA registry,1 and it is anticipated that there may be many more. The role of these molecules in cancer has lately received a great deal of the scientific community’s attention. Recent findings indicate that alterations in the expression of several MicroRNAs (miRNAs) are often present in human cancers, suggesting potential roles of miRNAs in carcinogenic processes. For example, the expression levels of let-7 [2], let-7a [3, 4], miR-205, miR-21 and miR-196b [5], miR-449 [6], miR-15a/miR-16-1 cluster [7], and neighboring miR-143/miR-145 [8] are found to be reduced in some malignancies, suggesting their potential role as tumor suppressors. In contrast, some other miRNAs, such as the miR-17-92 cluster [9–11], miR-615 [5] and miR-155/ BIC [12], are over-expressed in various cancers, suggesting a possible oncogenic role. Furthermore, some miRNAs with altered expression levels appear to be associated with certain genetic alterations, such as deletion, amplification, and mutation. Regions which are prone to such genetic alterations are commonly referred to as cancer-associated genomic regions (CAGRs) and fragile sites (FRA) [13]. MiRNA genes located within, or in close proximity, to these regions have been suggested to be associated with 1

miRBase, release 17.0.

123

98

Mol Biotechnol (2011) 49:97–107

Fig. 1 MiRNAs as cancer players. Computational prediction initiates the search for putative miRNAs that play a role in tumorigenesis. Some of these proposed mechanisms are experimentally proven, like the deletion of miR-15a/miR-16a cluster in B-CLL [7, 57], the c-myc over-expression by the reposition near a putative miR promoter [13], or miR143/miR145 cluster down regulation in colon cancers [8]. Figure adopted with permission from Callin et al. [13]

chromosomal events leading to carcinogenesis, as graphically illustrated in Fig. 1. The large amount of unexplored non-coding regions in the human genome combined with the increasing importance of miRNAs in cancer highlights the need for fast, flexible, and reliable miRNA identification methods. Toward this goal a number of different computational methods have been used to identify miRNA genes. Early studies focused on scanning for hairpin structures conserved between closely related species such as Caenorhabditis elegans and Caenorhabditis briggsae [14, 15], or using homology between known miRNAs and other regions in aligned genomes like human and mouse [16]. Other approaches relied on conserved regions of synteny —conserved clustering of miRNAs in closely related genomes—to predict novel miRNAs [16]. Subsequent computational studies utilized Profile-based detection [17] as well as secondary structure alignment [18] of miRNAs using sequence conservation across multiple, highly divergent, organisms (i.e., mouse and fugu). The main drawback of the abovementioned tools is that they undertake a pipeline approach by applying stringent cut-offs and eliminating candidate miRNAs as the pipeline proceeds [14, 15]. This results in the loss of numerous true miRNAs along the line. The use of homology by some tools [16–18] to detect novel miRNAs based on their similarity to previously identified miRNAs is another drawback. These methods obviously fall short when scanning distantly related sequences or when novel miRNAs lack detectable homologs. The next generation of computational tools relied on more sophisticated machine learning algorithms such as support vector machines (SVMs) capable of taking into account multiple biological features such as free energy of

123

the hairpin structure, paired bases, loop length, and stem conservation to predict novel miRNAs [19–27]. Two very effective computational studies utilized Hidden Markov Models (HMM) and a Bayesian classifier [28, 29], to simultaneously consider sequence and structure features at the nucleotide level for predicting miRNA genes. These studies, however, did not integrate conservation information in their algorithms, an important feature of the majority of miRNA genes. More recently two computational tools miRRim [30] and SSCprofiler [31] also employing HMMs proved to be very effective, achieving high performance on identifying miRNAs in the human genome. With the advent of large scale, high throughput methods such as tiling arrays or deep sequencing the identification of novel miRNA genes is taking a different turn [32–34]. These methods are exceptionally useful as they produce large datasets that offer a relatively accurate expression map for small RNAs in the genome. However, since largescale expression data are usually limited by the specific tissue and developmental stage of their samples, only the coupling of such data to computational tools (as done in two recent studies [30] [31]) can facilitate rapid and precise detection of novel miRNAs, while at the same time giving greater credence to computational predictions.

Tool Comparison In the following paragraphs, we overview representative examples of miRNA gene prediction tools and highlight their most important characteristics. The tools are organized according to the biological features they implement.

Mol Biotechnol (2011) 49:97–107

99

Sequence-Based Prediction The initial computational tools for miRNA identification were based on sequence conservation with already cloned miRNAs. For instance, Quintana et al. [35] predicted 34 novel miRNAs using tissue specific cloning. Almost all of these miRNAs were conserved in the human genome and frequently in non-mammalian vertebrate genomes, like pufferfish. One interesting observation was that certain miRNAs showed increased expression in heart, liver, or brain when compared to the entire miRNA population known at the time, proposing a role of these miRNAs in tissue specification or cell lineage decision. Sequence, Structure, and Closely Related Species Conservation Early computational methods for miRNA gene prediction relied on rules derived from sequence and structural features of miRNA precursors as well as their degree of conservation across species. The use of conservation was usually limited to pairwise comparison of closely related species. MiRscan [15] and MiRseeker [14] are two representative approaches for this category of tools. The MiRscan [15] tool implements a probabilistic method that uses known miRNAs as a training set to derive new miRNAs based on their degree of similarity. Specifically, the tool was developed as follows: 1.

2.

3.

36,000 conserved sequences between C. elegans and C. briggsae (WU-BLAST cut-off E \ 1.8) were scanned using RNAfold in search for hairpin structures. 50 of them, which have previously been reported as real miRNAs, were used as the training set to derive statistical information regarding a set of characteristic features (shown in Fig. 2). Using the trained algorithm, all 36,000 sequences were evaluated with respect to their potential as miRNA precursors. Identification of new miRNA precursors by MiRscan is achieved via the use of a sliding window of 21 nt that is shifted across each hairpin candidate searching for the presence of specific miRNA features. These features include: • • • • •

complementary base pairing high degree of 50 and 30 conservation certain degree of bulge symmetry specific distance from the loop initial pentamer properties.

MiRseeker [14] is another tool belonging to the same category. It has been used for predicting miRNA genes in two Drosophila species, namely Drosophila melanogaster

Fig. 2 Criteria used by MiRscan to identify miRNA genes among aligned segments of two genomes. The typical hairpin-like structure of miRNAs and the seven components of the MiRscan score for mir232 of C. elegans/C. briggsae. Figure adopted with permission from Lim et al. [58]

and Drosophila pseudoobscura, via the use of the following 3-step pipeline approach: 1. 2.

3.

the two genomes are aligned to find all conserved regions miRseeker, which includes MFOLD, is used to identify miRNA genes. A very important feature of MiRseeker that enhances its prediction power is the consideration of pattern divergence. As the authors report, there is less selective pressure and hence less conservation in the loop sequences. The method is evaluated according to its ability to assign high scores to 24 known Drosophila miRNAs.

123

100

Mol Biotechnol (2011) 49:97–107

Multiple Species Conservation and Conserved Regions of Synteny Another approach that has been used to predict novel human miRNAs is known as phylogenetic shadowing [36, 37]. This approach uses multiple sequence comparisons of closely related species and allows the identification of conserved regions. The method is based on observing conserved islands of a specific length, which are considered likely to be miRNA genes, in an otherwise un-conserved region [36]. An example of phylogenetic shadowing application to the identification of miRNA genes is found in Berezikov et al. [37], using the following steps: 1.

Sequences from more than 100 miRNA regions in 10 different primates are compared to infer a characteristic profile: • • •

2.

3.

variation in the loop sequences, conservation in stem of hairpins, and significant decrease in conservation of sequence flanking the hairpins.

This profile is used to identify new miRNAs in pairwise alignments of more divergent species such as human and mouse or human and rat. Additional filtering is performed according to the folding energy of candidate sequences.

A total of 976 potential human miRNAs have been identified using this method. This set contains over 80% of all known human miRNAs in version 3.1 of the miRNA Registry (http://microrna.sanger.ac.uk/sequences/). Northern blot analyses combined with database searches reach a conservative estimate of 200–300 verified novel human miRNAs, a 2-fold increase over previous studies [38]. Strong conservation over all species is only evident for two well-known miRNA genes. Another approach is the use of full genome sequence alignment [16, 39, 40]. The motivation being that human and mouse miRNAs should reside in conserved regions of synteny. A representative study using full genome alignments is found in Weber [16], whereby: 1.

2.

3.

The BLAT comparison tool [41] is used to compare the entire set of human and mouse precursor and mature miRNAs in the miRNA Registry, version 2.2 (http://microrna.sanger.ac.uk/sequences/). The results are further filtered using secondary structure prediction tools, like MFOLD and other criteria (such as G:U base pairing) Characteristic features of some miRNAs are used to identify miRNA gene clusters and display conservation in the location of the clusters in comparison with other neighboring genes.

123

The findings of this study included the prediction of 80 new putative miRNAs genes. The computational approaches described so far utilize information regarding closely related species and homology searches using sequences of already cloned miRNAs. These methods proved to be successful leading to the prediction of multiple new miRNA genes. However, as they were bound by species similarity they eventually reached a prediction limit. To overcome this obstacle, researchers turned their attention to the prediction of miRNAs that do not show conservation to other known miRNAs and are not highly conserved across closely related species. A More General Model for miRNA Gene Prediction miRAlign [18] is one of the first tools that detect new miRNAs based on both sequence and structure alignment without implementing conservation features. The main characteristics that differentiate this tool from the existing homolog search methods are: 1.

2.

By applying a relatively loose conservation of the mature miRNA sequence it has the ability to find distant homologs. It considers more structural properties by introducing a structure alignment strategy that can use each single miRNA as a query for genomic searches.

The tool has been shown to perform better in comparison with other tools such as BLAST or ERPIN [42], and its main advantage is the prediction of more distant miRNA homologs or orthologs. The Next Generation of Tools HMM Tools A HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. ProMir—Nam et al. [28] 1. 2.

3.

A highly specific probabilistic co-learning method was handcrafted, based on the paired HMM. This method combines both sequence and structural characteristics of miRNA genes in a probabilistic framework and simultaneously decides if a miRNA gene and the mature miRNA are present by detecting the signals for the site cleaved by Drosha. MiRNA gene candidates are finally filtered using conservation across multiple divergent species

Mol Biotechnol (2011) 49:97–107

Most recently two freely available prediction tools (SSCprofiler [31] and miRRim [30]) have been shown to predict miRNA genes with high accuracy.

101

conceptual differences between the two tools are provided in Table 1. Bayes Classifiers

SSCprofiler Oulas et al. [31] 1.

2.

3.

SSCprofiler utilizes Profile HMM trained to recognize key biological features of miRNAs such as sequence, structure, and conservation to identify novel miRNA precursors as shown in Fig. 3. SSCprofiler is trained to learn with high accuracy the characteristic features of human miRNA precursors, and the trained model is applied on CAGRs in search of novel miRNA genes. Predictions are ranked according to expression information from a recently published full genome tiling array [33], and the top four scoring candidates have been verified experimentally using northern blot.

MiRRim—Terai et al. [30] MiRRim is similar to SSCprofiler as it also uses an HMM algorithm that considers structure and conservation features for predicting novel miRNA genes. However, it does not take into consideration sequence information. The main

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Naive Bayes—Yousef et al. [29] 1.

2.

This method generates the model automatically and identifies rules based on the miRNA gene sequence and structure thus allowing prediction of non-conserved miRNAs. In addition, the method uses a comparative analysis over multiple species to reduce the false positive (FP) rate.

Fig. 3 The supervised procedure of training HMM for miRNA precursor identification. Biological features of miRNA biogenesis and conservation across other organisms are used as input for training. Initially, secondary structure prediction is performed using programs such as RNAfold. Every nucleotide position is henceforth represented by an ‘M’ for match and an ‘L’ for loop. This information is aligned integrated with conservation and sequence information for every nucleotide position and used to train the HMM. Once trained, the HMM, can be utilized to analyze sequences of desired length and assign a likelihood score. The higher the score the greater the chances of a candidate sequence being a true miRNA precursor. Figure adopted with permission from Oulas et al. [31]

123

102

Mol Biotechnol (2011) 49:97–107

Table 1 Comparison of SSCprofiler and miRRim tools SSCprofiler

miRRim

Biological features of mirnas

Sequence, structure, and conservation

Structure and conservation

Negative data

Selected from 30 UTRs, and filtered according to conservation and a minimum energy score to ensure that they resemble true miRNAs in both structure and conservation

Randomly selected 200 nt-long genomic regions with different degrees of conservation. No requirements for resemblance with true miRNAs

Sensitivity/specificity (validation set)

HMM score: 3

Sens: *70%

Sens: 88.95%

Spec: *90%

Spec: 84.16% Generalization (blind test set) Scanning procedure

Identification of 219 previously unseen miRNAs with an accuracy of 72.15% 104 nt sliding window, shifted 1 nt at a time for positive as well as negative data.

Total number of hits

For a coverage of 96.0% (HMM threshold 11), *5,800 miRNA hits for 350 MB (CAGRs) of the human genome

For a coverage of 91.0%, *4,000 miRNAs hits for the whole human genome

Expression information using high throughput methods

Tiling array data from HeLa and HepG2 cells

No expression information is provided

Experimental verification

Successful verification of four top scoring miRNA candidates via northern blot

No experimental verification is provided

Tools Extracting miRNA Genes from Next Generation Sequencing data miRDeep—Friedla¨nder et al. [32] 1.

2.

3.

The miRDeep algorithm uses a probabilistic model of miRNA biogenesis to score compatibility of the position and frequency of sequenced RNA (Solexa sequencing as well as 454 deep sequencing) with the secondary structure of the miRNA precursor. The authors demonstrate the accuracy and robustness of the tool using published C. elegans data and data generated by deep sequencing human and dog RNAs. miRDeep reports altogether *230 previously un-annotated miRNAs, of which four novel C. elegans miRNAs have been validated by northern blot analysis.

SVM Tools—[19–27] SVMs are a set of related supervised learning methods used for classification and regression. Viewing input data as two sets of vectors in an n-dimensional space, an SVM will construct a separating hyperplane in that space, one which maximizes the margin between the two data sets.

123

No evaluation of performance on a blind test set Size of sliding window and shift step unclear. Positive data ranged between 160 and 236 nt, negative data were 200 nt, implying a larger window and thus a smaller search space

Multiple studies have utilized SVMs to predict novel miRNA genes, some of which are listed below: 1.

2.

Xue et al. [23] proposed a set of novel features of local contiguous structure-sequence information for distinguishing the hairpins of real pre-miRNAs and 1,000 pseudo pre-miRNAs. Remarkably, the SVM classifier built on human data can also correctly identify up to 90% of the pre-miRNAs from other species, including plants and virus, without utilizing any comparative genomics information. Sewer et al. [22] focused on genomic regions around already known miRNAs to incorporate the observation that miRNAs are occasionally found in clusters. Starting with the known human, mouse and rat miRNAs, the authors scanned 20 kb of flanking genomic regions for the presence of putative precursor miRNAs. Each genome was analyzed separately, allowing the evaluation of the species-specific identity and genome organization of miRNA loci. Only cross-species comparisons were used to make conservative estimates of the number of novel miRNAs. This ab initio method predicted between 50 and a 100 novel pre-miRNAs for each of the considered species. Around 30% of these miRNAs have already been experimentally verified in a large set of cloned mammalian small RNAs [34].

x

Mature location

x

73

x

x

x

x

x

x

x

x

ProMir

Human

90

x

x

x

x

x

x

x

RNAmicro

Human

84.16

88.95

x

x

x

x

x

x

x

x

SSCprofiler

a

Directly—in the sense that the nucleotide distribution in the sequence is taken into consideration (i.e., GC content). Indirectly refers to the use of sequence to derive structure

Reference

x

x

Human, mouse, rat

64

64

x

x

x

x

x

x

x

Sewer

Experimental verification

Multispecies

88.10

93.30

x

x

x

x

x

x

Xue

x

Mouse

91

97

x

x

x

x

x

x

x

Bayes classifier

Use of large scale high throughput data

Drosophila

x

x

x

x

x

x

miRAlign

Human

Nematode ? human

x

x

x

Blatting

Species

75

x

x

x

x

Phylogenetic shadowing

96

74

x

x

x

x

miRseeker

Specificity

Sensitivity (%)

Performance

Complement other tools

Probablistics

SVM

Homology based

Bruteforce

Methodology

Multiple species

Conserved clustering

Consered synteny

Pairwise

Conservation

x

x

Bulges/loops

x

x

Thermodymamic temp

x

x

MiRscan

Hairpin

x

x

Cloning

Base pairing

Structure

Indirectly

Directlya

Sequence

Features

Table 2 Comparison of miRNA identification methods

x

x

C. elegans, human, dog

x

x

x

x

x

miRDeep

Human

Human

90 78

*70

x

x

x

x

x

x

x

x

Microprocessor SVM

*90

x

x

x

x

x

x

x

miRRim

Mol Biotechnol (2011) 49:97–107 103

123

104

Mol Biotechnol (2011) 49:97–107

Methods Designed to Complement Other Tools In addition to miRNA gene prediction tools, a number of methods have been developed to complement such tools and help to maximize their prediction accuracy. RNAmicro [21] RNAmicro is designed specifically to work as a ‘‘subscreen’’ for large-scale ncRNA surveys with RNAz. RNAz is an SVM-based classification tool which combines comparative sequence analysis and structure prediction [43]. RNAmicro tries to provide an annotation of the RNAz survey data to provide a more balanced trade-off between sensitivity and specificity. Thus, the goal of RNAmicro is a bit different from that of specific surveys for miRNAs in genomic sequences: in the latter case, one is interested in very high specificity so that the candidates selected for experimental verification contain as few FPs as possible. RNAmicro is an SVM-based method which, to classify the miRNA precursors, evaluates the information contained in multiple sequence alignment. RNAmicro consists of the following components: 1.

nucleotides of the predicted site, in about 90% of the cases. Even though Microprocessor SVM is not effective as a stand alone tool, it can be useful as: 1. 2.

Examples of Computationally Predicted miRNAs involved in Cancer In order to stress the importance of miRNA gene prediction tools in identifying miRNA genes implicated with various types of cancers, it is necessary to provide representative examples. 1.

A pre-processor that performs the following three actions, Identifies conserved hairpin-structure regions in a multiple sequence alignment. Extracting windows of length L are used in 1-nucleotide steps from the input alignment. (b) RNAalifold algorithm (Vienna RNA Package) is used to compute consensus sequence and structure [44]. (c) The consensus secondary structure which is obtained in ‘‘dot-parenthesis’’ notation is further analyzed.

(a)

2. 3.

2.

A module that computes a vector of numerical descriptors from each hairpin structure. A SVM that classifies the candidate based on its vector of descriptors.

Microprocessor SVM [20] Another study describes a SVM classifier that can separate between true and false Drosha processing sites. The biological relevance of this tool is based on the assumption that mature miRNAs are processed from long hairpin transcripts by Drosha, and that this processing defines the mature product and is characteristic for all miRNA genes. This classifier can predict the exact location of 50 microprocessor processing sites in human 50 miRNAs with 50% accuracy. It is important also to mention that if the predicted site is wrong, the actual site is within two

123

A post-processor for existing tools that only predict whether hairpins are likely miRNAs. A complementary miRNA gene classifier that performs better than currently available methods for predicting un-conserved miRNAs. As a consequence, other prediction tools can be improved upon postprocessing using this method.

3.

These include miRNAs predicted by sequence homology to already cloned miRNAs such as, miR-143 [35] involved in colorectal cancer [8], miR-125b (lin-4) and miR-145 which are implicated in breast cancer [45], miR-106a which is believed to play a regulatory role in colon, pancreas, and prostate cancer [46], and mir-155 which is associated with HL, BCL, pediatric BL, breast and lung cancer as well as poor survival [12, 47–50]. Another study utilized conservation with mouse and Fugu rubripes sequences and the score given by the program MiRscan [38] to predict novel miRNAs associated with cancer. These included mir221/222 which are involved in Papillary thyroid carcinoma [51] and glioblastomas [52], hsa-mir-192 which is shown to have reduced expression in colorectal neoplasia [8], hsa-mir-196a-1 which was cloned from human osteoblast sarcoma cells [53], and hsa-mir-210 which is implicated in Kaposi’s sarcoma-associated herpes virus infections [54]. The majority of these miRNA genes were identified computationally and their implication in cancer confirmed by experimental methods. In a recent large-scale bioinformatics study, Calin et al. [13] made use of the miRNA registry as well as bibliography for cancer CAGRs, to show that over 80 known miRNAs reside within CAGR and FRA sites. This was one of the first large-scale computational studies to indicate a direct connection between genomic location of miRNAs and regions prone to genetic alteration in cancer. This study has provided the raw material for undergoing experiments aiming to verify these connections.

Mol Biotechnol (2011) 49:97–107

4.

Recently, a sophisticated computational tool (SSCprofiler [31]) scanned over 70 CAGRs and resulted in the prediction of 4 novel candidate miRNAs, all of which were verified by tilling array data as well as northern blot analysis.

105

9.

10.

Tool Criteria for Success

11.

This section underlies the capabilities and limitations of the numerous available computational methods for miRNA gene prediction.

12.

1.

2.

3.

4.

5.

6.

7.

8.

The general rule is that the more biological information integrated in the computational tool the more successful it will be. As indicated in Table 2, early tools making use of sequence and closely related species conservation alone did not achieve very high prediction accuracy. Subsequent methodologies taking into account additional biological information, such as structure and multiple species conservation for filtering their predictions, showed significant improvements. Another important criterion for success is the simultaneous consideration of biological information. Considering all features at once, preferably at the nucleotide level is more informative and more efficient than undertaking a pipeline approach, which utilizes the different features sequentially to predict novel miRNAs. Sophisticated machine learning algorithms process all the biological features in parallel to build predictive models. This is a significant improvement to linear approaches adopted by initial brute-force methods (see Table 2). Successful training of learning algorithms requires caution when selecting positive and negative training examples. Online databases may contain FPs, and the definition of a negative miRNA is still uncertain. The use of 30 UTR regions to draw negative miRNA genes has been the norm in most studies. Mostly because there was no documented miRNA gene within these region. However, recently a small percentage of miRNA genes have been shown to exist within 30 UTRs. Evaluating the performance of prediction tools by measuring sensitivity as well as specificity. However, these measures are greatly affected by the quality as well as the number of negative and positive samples [29]. Hence, the prediction accuracy of one tool may change if the dataset from another study is used and vice versa. In general computational miRNA gene, prediction is lacking a benchmark dataset.

13.

There is still ample space for improvement in the field of computational prediction of miRNA genes, besides the great advances in the last years. As more biological information regarding miRNA biogenesis and regulation is made available, computational tools incorporating this information will become much more effective. Perhaps novel biological information will shed some light into one of the bottlenecks of in-silico prediction of miRNAs, namely, the identification of the mature miRNA sequence on the miRNA precursor. It is also important to mention that miRNA gene prediction is currently addressed as a 2D problem. Secondary structure prediction algorithms only follow 2D rules and do not portray a complete tertiary picture. Tools capable of predicting tertiary structure of miRNAs (such as pseudoknots, [55]) will transform a 2D problem into a 3D one, reflecting more accurately the conditions found in the cell. As a final note, it is important to stress that the development of computational tools is tightly linked to biological research. The successful evolution of these tools demands that developers keep up with novel biological findings that may change the way information should be used. A characteristic example of this is the use of Drosha processing sites in the Microprocessor [20] study mentioned above. It was recently shown that intronic microRNA precursors may bypass Drosha processing [56].

Acknowledgments This study was supported by the action 8.3.1 (Reinforcement Pro-gram of Human Research Manpower—’’PENED 2003’’, [03ED842]) of the operational program ‘‘competitiveness’’ of the Greek General Secretariat for Research and Technology, a Marie Curie Fellowship of the European Commission [PIOF-GA-2008219622], the National Science Foundation [NSF 0515357] and the ‘‘High-Performance Computing Infrastructure for South East Europe’s Research Communities—HP-SEE’’ project [FP7 Capacities grant agreement Nr. 261499].

References 1. Fantom, C. (2005). The transcriptional landscape of the mammalian genome. Science, 309, 1559–1563. 2. Takamizawa, J., Konishi, H., Yanagisawa, K., Tomida, S., Osada, H., et al. (2004). Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Research, 64, 3753–3756. 3. Dong, Q., Meng, P., Wang, T., Qin, W., Qin, W., et al. (2010). MicroRNA let-7a inhibits proliferation of human prostate cancer cells in vitro and in vivo by targeting E2F2 and CCND2. PLoS One, 5, e10147. 4. Yang, Q., Jie, Z., Cao, H., Greenlee, A. R., Yang, C., et al. (2011). Low-level expression of let-7a in gastric cancer and its involvement in tumorigenesis by targeting RAB40C. Carcinogenesis, 32, 713–722.

123

106 5. Hulf, T., Sibbritt, T., Wiklund, E. D., Bert, S., Strbenac, D., et al. (2011). Discovery pipeline for epigenetically deregulated miRNAs in cancer: Integration of primary miRNA transcription. BMC Genomics, 12, 54. 6. Kheir, T. B., Futoma-Kazmierczak, E., Jacobsen, A., Krogh, A., Bardram, L., et al. (2011). miR-449 inhibits cell proliferation and is down-regulated in gastric cancer. Molecular Cancer, 10, 29. 7. Calin, G. A., Dumitru, C. D., Shimizu, M., Bichi, R., Zupo, S., et al. (2002). Frequent deletions and down-regulation of microRNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proceedings of the National Academy of Sciences of the United States of America, 99, 15524–15529. 8. Michael, M. Z., O’Connor, S. M., van Holst Pellekaan, N. G., Young, G. P., & James, R. J. (2003). Reduced accumulation of specific microRNAs in colorectal neoplasia. Molecular Cancer Research, 1, 882–891. 9. Hayashita, Y., Osada, H., Tatematsu, Y., Yamada, H., Yanagisawa, K., et al. (2005). A polycistronic microRNA cluster, miR17–92, is overexpressed in human lung cancers and enhances cell proliferation. Cancer Research, 65, 9628–9632. 10. He, L., Thomson, J. M., Hemann, M. T., Hernando-Monge, E., Mu, D., et al. (2005). A microRNA polycistron as a potential human oncogene. Nature, 435, 828–833. 11. Tagawa, H., & Seto, M. (2005). A microRNA cluster as a target of genomic amplification in malignant lymphoma. Leukemia, 19, 2013–2016. 12. Metzler, M., Wilda, M., Busch, K., Viehmann, S., & Borkhardt, A. (2003). High expression of precursor microRNA-155/BIC RNA in children with Burkitt lymphoma. Genes Chromosomes Cancer, 39, 167–169. 13. Calin, G. A., Sevignani, C., Dumitru, C. D., Hyslop, T., Noch, E., et al. (2004). Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers. Proceedings of the National Academy of Sciences of the United States of America, 101, 2999–3004. 14. Lai, E. C., Tomancak, P., Williams, R. W., & Rubin, G. M. (2003). Computational identification of Drosophila microRNA genes. Genome Biology, 4, R42–R61. 15. Lim, L. P., Lau, N. C., Weinstein, E. G., Abdelhakim, A., & Yekta, S. (2003). The microRNAs of Caenorhabditis elegans. Genes and Development, 16, 991–1008. 16. Weber, M. J. (2005). New human and mouse microRNA genes found by homology search. FEBS Journal, 272, 59–73. 17. Legendre, M., Lambert, A., & Gautheret, D. (2004). Profile-based detection of microRNA precursors in animal genomes. Bioinformatics, 21, 841–845. 18. Wang, X., Zhang, J., Li, F., Gu, J., He, T., et al. (2005). MicroRNA identification based on sequence and structure alignment. Bioinformatics, 21, 3610–3614. 19. Buck, A. H., Santoyo-Lopez, J., Robertson, K. A., Kumar, D. S., Reczko, M., et al. (2007). Discrete clusters of virus-encoded micrornas are associated with complementary strands of the genome and the 7.2-kilobase stable intron in murine cytomegalovirus. Journal of Virology, 81, 13761–13770. 20. Helvik, S. A., Snove, O., Jr., & Saetrom, P. (2006). Reliable prediction of Drosha processing sites improves microRNA gene prediction. Bioinformatics, 23, 142–149. 21. Hertel, J., & Stadler, P. F. (2006). Hairpins in a Haystack: Recognizing microRNA precursors in comparative genomics data. Bioinformatics, 22, e197–e202. 22. Sewer, A., Paul, N., Landgraf, P., Aravin, A., Pfeffer, S., et al. (2005). Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics, 6, 267–281. 23. Xue, C., Li, F., He, T., Liu, G. P., Li, Y., et al. (2005). Classification of real and pseudo microRNA precursors using local

123

Mol Biotechnol (2011) 49:97–107

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38. 39.

40.

41. 42.

structure-sequence features and support vector machine. BMC Bioinformatics, 6, 310–316. Batuwita, R., & Palade, V. (2009). microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25, 989–995. Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., et al. (2007). MiPred: Classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Research, 35, W339–W344. Ng, K. L., & Mishra, S. K. (2007). De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics, 23, 1321–1330. Wang, M., Song, X., Han, P., Li, W., & Jiang, B. (2010). New syntax to describe local continuous structure-sequence information for recognizing new pre-miRNAs. Journal of Theoretical Biology, 264, 578–584. Nam, J. W., Shin, K. R., Han, J., Lee, Y., Kim, V. N., et al. (2005). Human microRNA prediction through a probabilistic colearning model of sequence and structure. Nucleic Acid Research, 33, 3570–3581. Yousef, M., Nebozhyn, M., Shatkay, H., Kanterakis, S., Showe, L. C., et al. (2006). Combining multi-species genomic data for microRNA identification using a Naive Bayes classifier. Bioinformatics, 22, 1325–1334. Terai, G., Komori, T., Asai, K., & Kin, T. (2007). miRRim: A novel system to find conserved miRNAs with high sensitivity and specificity. RNA, 13, 2081–2090. Oulas, A., Boutla, A., Gkirtzou, K., Reczko, M., Kalantidis, K., et al. (2009). Prediction of novel microRNA genes in cancerassociated genomic regions: A combined computational and experimental approach. Nucleic Acids Research, 37, 3276–3287. Friedlander, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., et al. (2008). Discovering microRNAs from deep sequencing data using miRDeep. Nature Biotechnology, 26, 407–415. Kapranov, P., Cheng, J., Dike, S., Nix, D. A., Duttagupta, R., et al. (2007). RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science, 316, 1484–1488. Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., et al. (2007). A mammalian microRNA expression atlas based on small RNA library sequencing. Cell, 129, 1401–1414. Lagos-Quintana, M., Rauhut, R., Yalcin, A., Meyer, J., Lendeckel, W., et al. (2002). Identification of tissue-specific microRNAs from mouse. Current Biology, 12, 735–739. Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., et al. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299, 1391–1394. Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R. H., et al. (2005). Phylogenetic shadowing and computational identification of human microRNA genes. Cell, 120, 21–24. Lim, L. P., Glasner, M. E., Yekta, S., Burge, C. B., & Bartel, D. P. (2003). Vertebrate microRNA genes. Science, 299, 1540. Artzi, S., Kiezun, A., & Shomron, N. (2008). miRNAminer: A tool for homologous microRNA gene search. BMC Bioinformatics, 9, 39. Sunkar, R., & Jagadeeswaran, G. (2008). In silico identification of conserved microRNAs in large number of diverse plant species. BMC Plant Biology, 8, 37. Kent, W. J. (2002). BLAT: The BLAST-like alignment tool. Genome Research, 12, 656–664. Lambert, A., Fontaine, J. F., Legendre, M., Leclerc, F., Permal, E., et al. (2004). The ERPIN server: An interface to profile-based RNA motif identification. Nucleic Acids Research, 32, W160– W165.

Mol Biotechnol (2011) 49:97–107 43. Washietl, S., Hofacker, I. L., & Stadler, P. F. (2005). Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102, 2454–2459. 44. Hofacker, I. L. (2003). Vienna RNA secondary structure server. Nucleic Acids Research, 31, 3429–3431. 45. Iorio, M. V., Ferracin, M., Liu, C. G., Veronese, A., Spizzo, R., et al. (2005). MicroRNA gene expression deregulation in human breast cancer. Cancer Research, 65, 7065–7070. 46. Volinia, S., Calin, G. A., Liu, C. G., Ambs, S., Cimmino, A., et al. (2006). A microRNA expression signature of human solid tumors defines cancer gene targets. Proceedings of the National Academy of Sciences of the United States of America, 103, 2257–2261. 47. Eis, P. S., Tam, W., Sun, L., Chadburn, A., Li, Z., et al. (2005). Accumulation of miR-155 and BIC RNA in human B cell lymphomas. Proceedings of the National Academy of Sciences of the United States of America, 102, 3627–3632. 48. Kluiver, J., Haralambieva, E., de Jong, D., Blokzijl, T., Jacobs, S., et al. (2006). Lack of BIC and microRNA miR-155 expression in primary cases of Burkitt lymphoma. Genes Chromosomes Cancer, 45, 147–153. 49. Kluiver, J., Poppema, S., de Jong, D., Blokzijl, T., Harms, G., et al. (2005). BIC and miR-155 are highly expressed in Hodgkin, primary mediastinal and diffuse large B cell lymphomas. Journal of Pathology, 207, 243–249. 50. Yanaihara, N., Caplen, N., Bowman, E., Seike, M., Kumamoto, K., et al. (2006). Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell, 9, 189–198.

107 51. He, H., Jazdzewski, K., Li, W., Liyanarachchi, S., Nagy, R., et al. (2005). The role of microRNA genes in papillary thyroid carcinoma. Proceedings of the National Academy of Sciences of the United States of America, 102, 19075–19080. 52. Ciafre, S. A., Galardi, S., Mangiola, A., Ferracin, M., Liu, C. G., et al. (2005). Extensive modulation of a set of microRNAs in primary glioblastoma. Biochemical and Biophysical Research Communications, 334, 1351–1358. 53. Lagos-Quintana, M., Rauhut, R., Meyer, J., Borkhardt, A., & Tuschl, T. (2003). New microRNAs from mouse and human. RNA, 9, 175–179. 54. Cai, X., Lu, S., Zhang, Z., Gonzalez, C. M., Damania, B., et al. (2005). Kaposi’s sarcoma-associated herpesvirus expresses an array of viral microRNAs in latently infected cells. Proceedings of the National Academy of Sciences of the United States of America, 102, 5570–5575. 55. Reeder, J., & Giegerich, R. (2004). Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104–115. 56. Ruby, J. G., Jan, C. H., & Bartel, D. P. (2007). Intronic microRNA precursors that bypass Drosha processing. Nature, 448, 83–86. 57. Cimmino, A., Calin, G. A., Fabbri, M., Iorio, M. V., Ferracin, M., et al. (2005). miR-15 and miR-16 induce apoptosis by targeting BCL2. Proceedings of the National Academy of Sciences of the United States of America, 102, 13944–13949. 58. Lim, L. P., Lau, N. C., Weinstein, E. G., Abdelhakim, A., & Yekta, S. (2003). The microRNAs of Caenorhabditis elegans. Genes and Development, 17, 991–1008.

123