Statistical evaluation of improvement in RNA secondary structure prediction

Published online 1 December 2011

Nucleic Acids Research, 2012, Vol. 40, No. 4 e26 doi:10.1093/nar/gkr1081

Statistical evaluation of improvement in RNA secondary structure prediction Zhenjiang Xu1, Anthony Almudevar2 and David H. Mathews1,2,* 1

Department of Biochemistry and Biophysics and Center for RNA Biology and 2Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA

Received December 24, 2010; Revised October 26, 2011; Accepted October 28, 2011

ABSTRACT

With the discovery of diverse roles for RNA, its centrality in cellular functions has become increasingly apparent. A number of algorithms have been developed to predict RNA secondary structure. Their performance has been benchmarked by comparing structure predictions to reference secondary structures. Generally, algorithms are compared against each other and one is selected as best without statistical testing to determine whether the improvement is significant. In this work, it is demonstrated that the prediction accuracies of methods correlate with each other over sets of sequences. One possible reason for this correlation is that many algorithms use the same underlying principles. A set of benchmarks published previously for programs that predict a structure common to three or more sequences is statistically analyzed as an example to show that it can be rigorously evaluated using paired two-sample t-tests. Finally, a pipeline of statistical analyses is proposed to guide the choice of data set size and performance assessment for benchmarks of structure prediction. The pipeline is applied using 5S rRNA sequences as an example.

INTRODUCTION

There has been an explosion in our understanding of roles for RNA in cellular processes and gene expression in recent decades. RNA can form complex three-dimensional structures, either alone or with proteins, to catalyze RNA splicing (1), catalyze peptide bond formation (2), guide protein localization (3) and tune gene regulation (4,5). Prediction of RNA secondary structure, the set of base pairing interactions between A-U, G-U and G-C, facilitates the development of hypotheses that connect structure to function. It also underlies a number of applications

such as non-coding RNA detection (6–8), RNA tertiary structure prediction (9), siRNA design (10), miRNA target prediction (11) and structure design (12). Many algorithms have been developed to apply to specific situations, each with its own strengths and limitations. The performance of a given structure prediction method is usually benchmarked on a set of RNA families by comparing predicted structures with the known secondary structures. Two statistical scores, sensitivity and positive predictive value (PPV), are commonly tabulated to determine the accuracy of the prediction methods (13). Sensitivity is the fraction of known pairs correctly predicted and PPV is the fraction of predicted pairs that are in the known structure. The two scores together are sufficient to evaluate the accuracy. Another measure, the Matthews correlation coefficient (MCC) (14,15), is used as a single score that summarizes both sensitivity and PPV. It is defined as: MCC = (TP × TN − FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)], where TP, TN, FP and FN represent the numbers of true positive, true negative, false positive and false negative base pairs. It can be approximated by the geometric mean of sensitivity and PPV (15). Although the MCC is a more succinct measure, by combining sensitivity and PPV it loses information about the capability and quality of a given method.
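To make the definitions concrete, the three scores can be computed from sets of base pairs. Below is a minimal sketch in Python (the paper's own scripts are in R); `score_prediction` and the hairpin example are illustrative, true negatives are counted as all candidate i < j pairings in neither structure, and for simplicity the one-position pair tolerance described later in the Methods is omitted:

```python
import math

def score_prediction(predicted, known, n_bases):
    """Sensitivity, PPV and MCC for a predicted set of base pairs.

    predicted, known: sets of (i, j) base pairs with i < j.
    n_bases: sequence length; true negatives are counted as all
    candidate i < j pairings that appear in neither structure.
    """
    tp = len(predicted & known)   # predicted and in the known structure
    fp = len(predicted - known)   # predicted but not known
    fn = len(known - predicted)   # known but missed
    tn = n_bases * (n_bases - 1) // 2 - tp - fp - fn
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, ppv, mcc

# Hypothetical 12-nt hairpin: three known pairs, one missed, one spurious.
known = {(1, 12), (2, 11), (3, 10)}
predicted = {(1, 12), (2, 11), (4, 9)}
sens, ppv, mcc = score_prediction(predicted, known, 12)
```

For this example the geometric mean of sensitivity and PPV (2/3) is close to the MCC (123/189 ≈ 0.65), illustrating the approximation noted above.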

Traditionally, the averages of the statistical scores, either over families or sequences, are calculated to compare the accuracies of different folding algorithms. In this contribution, a statistical test is introduced to evaluate the performance of RNA folding methods against each other. Specifically, the prediction accuracies of some methods are shown to be correlated over sequences, and the paired two-sample t-test is proposed to compare accuracies of methods. Finally, precision and power analyses are suggested to determine the necessary dataset size for the benchmark to control the probabilities of hypothesis testing errors. An example is shown for programs that predict a consensus structure for multiple sequences, using all 5S rRNA sequences with reference structures available (16).

*To whom correspondence should be addressed. Tel: +1 585 275 1734; Fax: +1 585 275 6313; Email: [email protected]

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



METHODS

Calculation of prediction accuracy

The prediction results of 12 RNA folding algorithms are taken from a previous study for this statistical analysis (16). That study reported a new algorithm for predicting an RNA secondary structure common to three or more sequences. When calculating sensitivity and PPV, a predicted base pair i–j counted as a correctly predicted pair if the i–j, (i+1)–j, (i−1)–j, i–(j+1) or i–(j−1) pair is in the known structure (17). This is important because predictions are compared to structures determined by comparative sequence analysis, where there can be uncertainty in the exact pair match (18). Additionally, thermal noise can cause the actual structure to sample multiple pairs. For example, there is evidence in the thermodynamic studies of single RNA bulges that multiple conformational states exist (19,20).

Statistical analyses

All the following analyses were performed with the R statistical environment (21). Scripts are available for download from the Mathews lab website, http://rna.urmc.rochester.edu. All the figures were plotted with the Lattice package in R (22).

Correlation coefficient. To evaluate the independence between prediction results by two RNA folding algorithms, the Spearman rank correlation coefficient was calculated. It is a Pearson correlation coefficient between the ranks of two variables. It is calculated using the equation: ρ_{x,y} = Σ(x_i − x̄)(y_i − ȳ) / √[Σ(x_i − x̄)² Σ(y_i − ȳ)²], where x_i and y_i are the ranks of average sensitivity or PPV predicted by the two algorithms on the ith group of sequences, and x̄ and ȳ are the means of the ranks. In case of tied observations, the arithmetic average of the rank numbers was used (23).

t-test. t-Tests were performed to calculate P-values in two ways.
Assuming their prediction results were uncorrelated, an incorrect assumption, independent two-sample Welch's t-tests were used to compare the sensitivities or PPVs predicted by two algorithms with the null hypothesis that the means of the two samples are the same. Alternatively, paired two-sample t-tests were performed. Essentially, the differences of individual sensitivities or PPVs from two algorithms were calculated to reduce the problem to a one-sample t-test with the null hypothesis that the mean of the differences is zero (24).

Precision and power analysis. The probability of Type I error, denoted by α, is the probability of rejecting a true null hypothesis. The precision analysis relates the width of the confidence interval to the sample size by controlling α. A (1 − α) percent confidence interval of a two-tailed paired t-test is defined as d̄ ± t_{α/2,(n−1)} S_d/√n, where d̄ is the sample mean of the differences between sensitivities (or PPVs) of two algorithms, S_d is the sample standard deviation, n is the sample size and t_{α/2,(n−1)} is the critical value for the t distribution with (n − 1) degrees of freedom

at the significance level of α. For an observed S_d, determining the sample size n such that the confidence interval does not contain zero, i.e. |d̄| − t_{α/2,(n−1)} S_d/√n ≥ 0, gives the estimated sample size to reject the null hypothesis with a probability of Type I error no greater than α (25). The power of a statistical test is the probability of rejecting a false null hypothesis. It equals (1 − β), where β refers to the probability of the Type II error, namely, the probability of failing to reject a false null hypothesis. The sample size required for a given power can be prospectively estimated as n = V_d (Z_{1−β} + Z_{1−α/2})² / δ², where δ and V_d are the hypothetical mean and variance of the differences between two groups of scores, α and β are the probabilities of Type I and II errors, and Z_{1−β} and Z_{1−α/2} are Z statistics of a standard normal distribution at the (1 − β)th and (1 − α/2)th quantiles (25). Conversely, a posteriori β can be calculated after the statistical test from the equation to show the confidence with which the null hypothesis can be accepted with the given sample size and α. While precision analysis is often used to define the width of the confidence interval, a priori power analysis is used more to estimate a sample size defining how small a difference can be detected and with what degree of certainty.

Sequential test. Sequential testing is an alternative to fixed sample size testing (26,27). It is designed to reduce sample sizes without impacting power. Sample size is adaptively determined based on the accumulated data in a sequential test. This avoids the need for a single sample size estimate that would be larger than required for many cases. The methodology of the sequential procedure proposed in this article is in the Supplementary Material. Briefly, to prevent inflation of the Type I error rate introduced by sequential testing as opposed to testing with a fixed sample size, an adaptive rejection rule needs to be defined.
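The prospective sample-size estimate n = V_d (Z_{1−β} + Z_{1−α/2})² / δ² can be evaluated directly from normal quantiles. A sketch in Python rather than the paper's R, with hypothetical values for the effect size δ and variance V_d:

```python
import math
from statistics import NormalDist

def required_sample_size(delta, var_d, alpha=0.05, beta=0.20):
    """Prospective n = V_d * (Z_{1-beta} + Z_{1-alpha/2})**2 / delta**2."""
    z = NormalDist()
    z_beta = z.inv_cdf(1 - beta)        # ~0.84 for the customary beta = 0.2
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return math.ceil(var_d * (z_beta + z_alpha) ** 2 / delta ** 2)

# Hypothetical benchmark: detect a mean score difference of 0.02
# (2 percentage points on a 0-1 scale) with difference variance 0.01.
n = required_sample_size(delta=0.02, var_d=0.01)
```

Because n scales with 1/δ², halving the detectable difference quadruples the required sample size, which is why benchmarks aiming to resolve small accuracy gains need many sequences.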
RESULTS

Many structure prediction methods employ similar energy models, evolve from the same underlying algorithm, or are extended from a previous version. It is possible that the prediction results of those related methods are correlated. In other words, two programs may make similar errors in structure prediction that would result in similar prediction accuracies on a number of sequences. The correlation was tested on a set of benchmark results published previously (16). This benchmark consists of 400 tRNA sequences (28) predicted by 12 methods, namely, Fold (29), Dynalign (30), Multilign (16), FoldalignM (31), mLocARNA (32), MASTR (33), Murlet (34), RAF (35), RNASampler (36), RNAshapes (37), StemLoc (38) and RNAalifold (39). Fold finds the lowest free energy structure for a single sequence. Dynalign finds the lowest free energy structure that is conserved for two unaligned sequences. RNAalifold folds multiple RNA sequences, prealigned with ClustalW2 in this benchmark. The remaining methods predict secondary structures of multiple sequences without requiring that the sequences be aligned ahead of time.


In the benchmark, 400 tRNA molecules were randomly selected and divided into groups of 5, 10 or 20 sequences. A single calculation was run over multiple sequences (except for Fold and Dynalign). For Dynalign, one sequence is chosen to be used with each of the other sequences in the group for structure prediction, and for this one sequence, the structure from the last Dynalign calculation is used for scoring (16). The structures predicted in a single group are dependent on each other because the algorithms predict consensus structures. Thus, the sensitivities and PPVs for each sequence from the same calculation were averaged, and this average was used as a single data point in the following statistical analyses. The sample sizes for the 5-, 10- and 20-sequence group predictions of 400 tRNA sequences are 80 (= 400/5), 40 (= 400/10) and 20 (= 400/20), respectively. By treating a single calculation as a data point instead of a single sequence, the statistical analysis is conservative in determining statistical significance because some power is lost in the calculation. This is specific to statistical tests of methods that use multiple sequences to find a consensus structure.

Correlation analysis

The most widely used correlation analysis is the Pearson product–moment correlation. It, however, is a parametric statistic designed to test a linear relationship between two variables normally distributed along an interval. For RNA structure prediction, the sensitivities or PPVs appear to be distributed far from the normal distribution (Supplementary Figure S1). Because of this, the Spearman rank correlation coefficients, which are more robust to outliers with extreme values, were computed (23). It is a non-parametric measure calculated with the ranks rather than the values of the observations. Figure 1 shows the matrices of correlation coefficients between sensitivities or PPVs predicted by any two algorithms over tRNA.
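The Spearman coefficient used here amounts to a Pearson correlation computed on tie-averaged ranks, as described in the Methods. A small self-contained sketch (Python instead of the R used in the paper; `spearman` and `_ranks` are illustrative helpers):

```python
import math
from statistics import fmean

def _ranks(v):
    """Ranks 1..n, with tied values given the average of their rank numbers."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1                       # extend over a run of ties
        avg = (i + j) / 2 + 1            # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = fmean(rx), fmean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

rho = spearman([1, 2, 3, 4], [10, 20, 30, 40])  # perfectly monotone sample
```

Because only ranks enter the calculation, any monotone relationship, linear or not, yields a coefficient of ±1, which is what makes the measure robust to the skewed score distributions noted above.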
Multilign and Dynalign are correlated because Multilign is based on multiple Dynalign calculations. The correlation coefficient decreases from over 0.84 to 0.65 when the number of sequences in one Multilign calculation increases from 5 to 20. This decreased correlation is expected because increasing the number of sequences in the Multilign prediction weakens the influence of individual Dynalign calculations. This is more obvious in the scatter plots in Figure 2. The horizontal distances of points from the diagonal line in the scatter plots of the 10- or 20-sequence predictions are longer than those of the 5-sequence predictions, showing that the decreased correlation coefficients are at least partially due to further prediction improvement on a few sequences as the sequence number increases from 5 to 20.

Statistical significance

To test whether an algorithm has better performance than another, the means of their sensitivities and PPVs are traditionally compared. Comparing mean scores calculated from a sample of sequences, however, is usually not sufficient to infer the relationships between


two statistical populations, i.e. the prediction accuracies of two folding algorithms on all possible RNA sequences. In addition to reporting means, many studies additionally report standard deviations, but this is also not enough information to determine whether two methods have significantly different performance. To statistically evaluate the prediction accuracy of one method compared to another, a two-tailed t-test can be used to infer whether their accuracy means are significantly different. The test should be two-tailed because the better method is unknown ahead of time. Incorrectly assuming independence, a Welch's two-sample t-test could be performed. Welch's t-test is similar to Student's t-test but is intended for use with two samples having possibly unequal variances. This test is performed on sensitivities and PPVs predicted by any two methods. The P-values are shown in Figure 3A. The P-value represents the risk of Type I error of rejecting the null hypothesis, that is, the probability of claiming the two methods predict with different accuracies when actually their accuracies are the same. Welch's t-test does not require equal variance, but like the typical (Student's) t-test, it assumes the population from which the sample is drawn is normally distributed and the two populations are independent. Although the distribution of calculated sensitivities or PPVs is unlikely to be normal, a t-test is still justified by the Central Limit Theorem (CLT) with a large sample size (i.e. n ≥ 30). The CLT says that (x̄ − μ)/(σ/√n) → N(0, 1) when the sample size n is large, where x̄ is the sample mean, and μ and σ are the population mean and population standard deviation. Replacing σ with the sample standard deviation s gives the t-distribution with (n − 1) degrees of freedom, (x̄ − μ)/(s/√n) → t(n − 1) (40). The assumption of independence, however, is clearly violated, as explained above, by the fact that the methods have correlated prediction accuracy (Figure 1).
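The consequence of wrongly assuming independence can be seen in a small simulation with synthetic accuracy scores (a sketch, not the benchmark data): two hypothetical methods whose scores track each other sequence by sequence but differ by a small, consistent amount.

```python
import math
import random
from statistics import fmean, variance

def welch_t(a, b):
    """Welch's independent two-sample t statistic (unequal variances)."""
    return (fmean(a) - fmean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))

def paired_t(a, b):
    """Paired t statistic: a one-sample t-test on the differences."""
    d = [x - y for x, y in zip(a, b)]
    return fmean(d) / math.sqrt(variance(d) / len(d))

random.seed(1)
# Synthetic accuracies for 40 groups: method B tracks method A closely
# (shared sequence-to-sequence variability) plus a small consistent gain.
a = [random.gauss(0.60, 0.15) for _ in range(40)]
b = [x + 0.03 + random.gauss(0.0, 0.01) for x in a]

t_welch = welch_t(b, a)    # small: shared variability swamps the 0.03 gap
t_paired = paired_t(b, a)  # large: the shared variability cancels in pairs
```

Both tests see the same average difference, but only the paired statistic clears a conventional critical value, mirroring the systematic overestimation of P-values by Welch's test in Figure 3A compared with Figure 3B.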
This correlation results in a systematic overestimation of the P-values by Welch's t-test in comparison with the paired test. Given the observed correlation in prediction accuracy, a paired two-sample t-test is the appropriate statistical test. It calculates the differences of each observation between two methods and tests whether the mean of these differences is significantly different from zero. In the presence of significant correlation, paired t-tests have greater statistical power than unpaired t-tests because the correlation cancels when the two groups being compared occur in natural pairs. As shown in Figure 3, the P-values calculated from the independent two-sample t-test (A) tend to be systematically larger than those from the paired t-test (B). Specifically, the differences of PPVs and sensitivities between Multilign and Dynalign are significant when judged using paired t-tests and an α chosen to be 5%. These significances, however, are obscured by the incorrect application of independent t-tests (Figure 3A).

Powering the benchmarks

For those comparisons with P-values larger than 0.05, the benchmarks fail to demonstrate that there is a significant difference in prediction accuracies between the two


Figure 1. Spearman rank correlation coefficients between scores of RNA folding algorithms. The upper left triangle shows the correlation coefficients of PPV and the lower right triangle shows the correlation coefficients of sensitivity. The labeled numbers inside the ellipses indicate rounded correlation coefficients in percentage. The prediction results of tRNA in 5 (A), 10 (B) and 20 (C) sequence combinations are shown here. The colored ovals depict the extent of correlation (red) or anticorrelation (blue) between methods.

methods with a low Type I error. There is a subtle but important distinction between failing to demonstrate that there is a difference and demonstrating that there is no difference. The caveat is that the benchmark might lack the power to detect the difference, if the difference does exist, due to the small sample size and large data variations. How big should the sample size be to detect a small difference with controlled Type I error? This is non-trivial because benchmarks are costly in computer time, so choosing an optimally sized data set can save computation time when comparing two folding

algorithms. On the other hand, if the null hypothesis cannot be rejected, the Type II error, which is the probability of failing to reject a false null hypothesis, needs to be controlled. Conventionally, the probability of Type II error is denoted by β and is chosen to be no larger than 0.2. Based on that, a pipeline is proposed to statistically evaluate the performance of algorithms using pairwise comparisons (Figure 4). During the benchmark, data are accumulated sequentially over time until the available computation time or sequences run out or the null


Figure 2. Scatter plot of Multilign scores versus Dynalign scores. A dot indicates the sensitivity (open triangle) or PPV (open circle) of a sequence predicted by Multilign (y axis) and Dynalign (x axis). Dots on the diagonal line mean the sensitivities or PPVs predicted by the two algorithms are the same. The sensitivities or PPVs of sequences predicted by Multilign are better than those predicted by Dynalign if the dots are located above the diagonal line, or worse if they are located below it. The horizontal (upper triangle) or vertical (lower triangle) distances from the dots to the diagonal line are the differences of the scores between Multilign and Dynalign. Only the prediction results of tRNA using either 5 (left panel), 10 (middle panel) or 20 (right panel) sequences per Multilign calculation are shown here.

Figure 3. P-values from t-tests for tRNA structure prediction. P-values in (A) are calculated with the independent two-sample Welch's t-test with the null hypothesis that the average scores predicted by the two algorithms are the same. P-values in (B) are calculated from the paired two-sample t-test. The upper left triangle shows the P-values of PPV and the lower right triangle shows the P-values of sensitivity. Values are expressed as rounded percentages. For the multiple (>2) sequence methods, the results of using five sequences in a single calculation are shown here.

hypothesis is proven or disproven. Thus a sequential paired two-sample t-test can be employed to reach an early conclusion regarding the comparison of prediction accuracy. This sequential style of interim estimate, however, can inflate the Type I error probability (41,42). Here the critical value is adjusted by simulations to accommodate this inflation. If the null hypothesis cannot be

rejected at any of the interim tests, the power is calculated to see whether the null hypothesis can be accepted with high confidence. If no conclusion can be made, the process proceeds to the next stage of testing with benchmarks on an additional group of sequences. This step can be iterated until a conclusion is reached, computer resources are depleted, or until no more sequences are available.
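The staged procedure can be sketched as a loop over accumulating paired differences. This is only an outline: the adjusted critical value `CRIT`, the target effect size `EFFECT`, and the normal-approximation power check are placeholders standing in for the simulation-based adjustment described in the Methods and Supplementary Material.

```python
import math
from statistics import NormalDist, fmean, variance

CRIT = 2.5           # placeholder for the simulation-adjusted critical value
POWER_TARGET = 0.80  # 1 - beta with the customary beta = 0.2
EFFECT = 0.02        # smallest accuracy difference considered worth detecting

def run_sequential(stages, alpha=0.05):
    """Accumulate per-group score differences stage by stage.

    stages: list of lists of paired differences (method A - method B),
    one inner list per benchmarking stage.
    Returns (outcome, stage_number).
    """
    z = NormalDist()
    diffs = []
    for i, stage in enumerate(stages, start=1):
        diffs.extend(stage)
        n = len(diffs)
        se = math.sqrt(variance(diffs) / n)
        t = fmean(diffs) / se
        if abs(t) > CRIT:                  # adjusted rejection rule
            return "reject H0", i
        # A posteriori power to detect EFFECT, normal approximation.
        power = z.cdf(EFFECT / se - z.inv_cdf(1 - alpha / 2))
        if power > POWER_TARGET:
            return "accept H0", i
    return "inconclusive", len(stages)

# Synthetic example: a consistent ~0.05 advantage is detected at stage 1.
stage1 = [0.05 + 0.01 * math.sin(k) for k in range(20)]
outcome, stage = run_sequential([stage1])
```

In practice the per-stage critical values must come from the simulation described above; a fixed threshold such as the one used here would inflate the Type I error rate over many interim looks.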



Figure 4. Statistical analysis pipeline. After the implementation of a new algorithm, its performance can be compared to another method with a sequential paired two-sample t-test. The total available sequences (N) are benchmarked in a number of stages (maximum of K, with n_i sequences in the ith stage and Σ n_i = N), using proper α and β values (customarily 0.05 and 0.2, respectively). The next stage is needed only if no conclusion can be made with controlled Type I and II error probabilities. Note that α is adjusted as explained in the Methods section to prevent inflation of Type I error.

To illustrate the pipeline, the performances of the same 12 methods as above were evaluated on groups of five 5S rRNA sequences (43). In the Multilign paper, the sample size for 5S rRNA is only 20 (= 100 sequences/5 sequences per calculation). It is expanded with an additional 1095 sequences to reach a sample size of 239. The statistical analysis then proceeded sequentially for up to 12 stages, with 20 more predictions done on an extra 100 sequences at each stage (except that the last stage only has 95 sequences because all available sequences were used). At the beginning, a t-statistic for the first 20 predictions is calculated and compared to the adjusted critical values in the same manner as the paired two-sample t-test. If the null hypothesis is neither rejected nor accepted with defined Type I and II error probabilities, the t-test is repeated with 20 additional calculations. Many pairwise comparisons already show a significant difference in the first stage (Figure 5). Take Multilign versus RNAalifold for example: Multilign is significantly better than RNAalifold for both sensitivity and PPV with the original 100 sequences. For some cases, the iterative sampling method does refine the test enough to draw conclusions. For example, using the first 20 groups of sequences, the a priori power analysis estimates that 142 groups of sequences are needed to resolve the hypothesis comparing the sensitivity of Multilign and Dynalign. This is much more than the 80 groups eventually used in the sequential test, which demonstrates that Multilign does have a higher average PPV than Dynalign. There are, however, also many inconclusive comparisons. For example, the null hypothesis that MASTR is as accurate as RNAalifold cannot be rejected, and it cannot be accepted either, because the inability to reject the null hypothesis may be due to the small power (0.18 and 0.20 for sensitivity and PPV), not because they truly have the same performance. Unfortunately, the available 5S rRNA sequences run out before sufficient power can be achieved to make a conclusion.

DISCUSSION

Because of the natural variability between measurements and the limitations on sampling, one cannot conclude that a method is better than another merely on the basis of means, especially when the difference in means is small. In this article, a statistical analysis is introduced to test the significance of improved RNA secondary structure prediction accuracy. It shows that the performance of algorithms can be correlated and that a paired two-sample t-test can be used to compare the performance of two algorithms in terms of sensitivity or PPV. Even if there is no relationship between two RNA folding methods, a paired t-test is still justified because both methods are tested on the same set of sequences and thus their prediction scores are matching pairs. Another caveat is that a correlation coefficient of zero does not necessarily mean the two sets are independent. An example is the parabola function y = x², for which both the Spearman rank correlation coefficient and the Pearson product–moment correlation coefficient are zero for a sample centered on x = 0, although there is a perfect quadratic relationship between x and y. Finally, a pipeline of evaluation is provided, as illustrated in Figure 4. After the development and implementation of a new algorithm, its performance can be compared to another algorithm with the sequential procedure illustrated in the flow chart. If no conclusion can be made in the ith stage, the procedure moves on to the next stage with more sequences added to the benchmark. The sequential test used here has the advantage of reducing the total number of sequences required for the statistical test. In reality, with the broad range of the scores, the small differences between the average performances, and the limited number of sequences available, it is often the case that which of two methods is better is statistically unknown.
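The parabola caveat is easy to verify numerically. The sketch below computes the Pearson product–moment correlation for a small sample centered on x = 0 (the Spearman coefficient, computed on ranks, is likewise zero for this symmetric sample):

```python
import math
from statistics import fmean

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

x = [-2, -1, 0, 1, 2]      # sample centered on x = 0
y = [v * v for v in x]     # perfect quadratic dependence: y = x^2
r = pearson(x, y)          # the linear correlation is exactly zero
```

Each positive contribution to the numerator from the right half of the parabola is exactly canceled by the matching point on the left half, so the dependence is invisible to the correlation coefficient.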
Because, however, the performance difference is usually small, the two methods may perform similarly in a practical sense on real data whether they are statistically different or not. Although all the analyses above were performed on sensitivity and PPV, they apply equally to other scores, such as the MCC or the alignment score (sum of pairs score), without any modifications. The method presented here, in Figure 4, provides the means for assessing whether the structure prediction performance of one program is significantly better than another on a benchmark set of sequences. It does not, however, determine whether one program is actually better than another. The first reason is that the benchmark set of sequences needs to be well chosen to represent the actual way the programs will be applied. A second



Figure 5. Example of 5S rRNA illustrating the flowchart of Figure 4. The 12 methods were evaluated against each other on groups of five 5S rRNA sequences. Starting with 20 calculations (100 sequences), an additional group is benchmarked and statistically tested until the P-value is smaller than α, the power is greater than 1 − β, or the total 239 groups of sequences run out. (A) The final conclusions. Red: null hypothesis rejected; green: not rejected. (B) The final powers between any two methods, expressed as rounded percentages. If the null hypothesis is not rejected but the power is larger than 0.8, then the null hypothesis can be accepted; otherwise, it is inconclusive. (C) The number of groups needed to test the hypothesis, or the total available groups for those inconclusive comparisons (239 maximum groups of five sequences). In each panel, the upper triangle applies to PPV and the lower triangle to sensitivity. For reference, the average sensitivity and PPV for each program are provided in Supplementary Table S2.
Sensitivity Figure 5. Example of 5S rRNA illustrating the flowchart of Figure 4. The 12 methods evaluated against each other on groups of five 5S rRNA sequences. Starting with 20 calculations (100 sequences), an additional group is benchmarked and statistically tested until the P value is smaller than , power is greater than 1  , or the total 239 groups of sequences ran out. (A) The final conclusions. Red: null hypothesis rejected; green: not rejected. (B) The final powers between any two methods. These are expressed as rounded percentages. If the null hypothesis is not rejected but the power is larger than 0.8, then the null hypothesis can be accepted; otherwise, it is inconclusive. (C) The number of groups needed to test the hypothesis or the total available groups for those inconclusive comparisons (239 maximum groups of five sequences). In each panel, the upper triangle applies to PPV and the lower triangle to sensitivity. For reference, the average sensitivity and PPV for each program is provided in Supplementary Table S2.
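The sensitivity and PPV compared in Figure 5, and the MCC, can all be computed from sets of base pairs. The sketch below uses an invented toy structure, and the candidate-pair universe used to count true negatives for the MCC is an illustrative assumption, not a definition from the article:

```python
import math

def pair_scores(reference, predicted, n_possible):
    """Sensitivity, PPV and MCC from reference and predicted base pairs.

    reference, predicted: sets of (i, j) tuples with i < j.
    n_possible: number of candidate base pairs considered; used only to
    count true negatives for the MCC (an illustrative modeling choice).
    """
    tp = len(reference & predicted)   # correctly predicted pairs
    fn = len(reference - predicted)   # known pairs that were missed
    fp = len(predicted - reference)   # predicted pairs not in the reference
    tn = n_possible - tp - fn - fp
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return sensitivity, ppv, mcc

# Toy example: 3 reference pairs, 3 predictions, 2 of them correct.
ref = {(1, 10), (2, 9), (3, 8)}
pred = {(1, 10), (2, 9), (4, 7)}
sens, ppv, mcc = pair_scores(ref, pred, n_possible=45)
```

Each benchmark sequence yields one such score, and those per-sequence scores are the matched pairs entering the paired t-test.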

reason is that programs may perform better on some families of sequences than on others, so statistically better performance on one family may not translate into better performance on a different family.
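The staged procedure of Figures 4 and 5 can be sketched as follows. This is a simplified illustration that tests at a fixed nominal α at every stage, ignoring the group-sequential adjustment of critical values discussed in the article; the inputs are assumed to be matched per-group sensitivities or PPVs for the two methods:

```python
import numpy as np
from scipy import stats

ALPHA, BETA = 0.05, 0.20   # nominal Type I and Type II error rates

def paired_power(diffs, alpha=ALPHA):
    """Post hoc power of a two-sided paired t-test, computed from the
    noncentral t distribution using the observed effect size."""
    n = len(diffs)
    effect = np.mean(diffs) / np.std(diffs, ddof=1)   # Cohen's d, paired
    nc = effect * np.sqrt(n)                          # noncentrality
    df = n - 1
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return (1.0 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

def sequential_test(scores_a, scores_b, group_size=20, alpha=ALPHA, beta=BETA):
    """Stagewise comparison: add one group of paired scores per stage
    until the null is rejected (P < alpha), accepted (power > 1 - beta),
    or the available sequences run out."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    n = group_size
    while n <= len(a):
        _, p = stats.ttest_rel(a[:n], b[:n])
        if p < alpha:
            return "rejected", n
        if paired_power(a[:n] - b[:n]) > 1.0 - beta:
            return "accepted", n
        n += group_size
    return "inconclusive", len(a)
```

In Figure 5, comparisons that exhaust all 239 groups without meeting either stopping rule remain inconclusive, mirroring the final branch of this loop.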

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1 and 2, Supplementary Figure 1 and Supplementary Methods.


ACKNOWLEDGEMENTS

The authors thank Holger Hoos, University of British Columbia, and Charles Lawrence, Brown University, for insightful comments.

FUNDING

National Institutes of Health (grant R01HG004002 to D.H.M.). Funding for open access charge: NIH.

Conflict of interest statement. None declared.

REFERENCES

1. Doudna,J.A. and Cech,T.R. (2002) The chemical repertoire of natural ribozymes. Nature, 418, 222–228.
2. Nissen,P., Hansen,J., Ban,N., Moore,P.B. and Steitz,T.A. (2000) The structural basis of ribosome activity in peptide bond synthesis. Science, 289, 920–930.
3. Jambhekar,A. and DeRisi,J.L. (2007) Cis-acting determinants of asymmetric, cytoplasmic RNA transport. RNA, 13, 625–642.
4. Yanofsky,C. (1981) Attenuation in the control of expression of bacterial operons. Nature, 289, 751–758.
5. Winkler,W.C., Nahvi,A., Roth,A., Collins,J.A. and Breaker,R.R. (2004) Control of gene expression by a natural metabolite-responsive ribozyme. Nature, 428, 281–286.
6. Washietl,S., Hofacker,I.L. and Stadler,P.F. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 2454–2459.
7. Uzilov,A.V., Keegan,J.M. and Mathews,D.H. (2006) Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinform., 7, 173.
8. Torarinsson,E., Sawera,M., Havgaard,J.H., Fredholm,M. and Gorodkin,J. (2006) Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res., 16, 885–889.
9. Shapiro,B.A., Yingling,Y.G., Kasprzak,W. and Bindewald,E. (2007) Bridging the gap in RNA structure prediction. Curr. Opin. Struct. Biol., 17, 157–165.
10. Tafer,H., Ameres,S.L., Obernosterer,G., Gebeshuber,C.A., Schroeder,R., Martinez,J. and Hofacker,I.L. (2008) The impact of target site accessibility on the design of effective siRNAs. Nat. Biotechnol., 26, 578–583.
11. Long,D., Lee,R., Williams,P., Chan,C.Y., Ambros,V. and Ding,Y. (2007) Potent effect of target structure on microRNA function. Nat. Struct. Mol. Biol., 14, 287–294.
12. Zadeh,J.N., Wolfe,B.R. and Pierce,N.A. (2011) Nucleic acid sequence design via efficient ensemble defect optimization. J. Comput. Chem., 32, 439–452.
13. Mathews,D.H. (2004) Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10, 1178–1190.
14. Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
15. Gorodkin,J., Stricklin,S.L. and Stormo,G.D. (2001) Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res., 29, 2135–2144.
16. Xu,Z. and Mathews,D.H. (2011) Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics, 27, 626–632.
17. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.
18. Pace,N.R., Thomas,B.C. and Woese,C.R. (1999) Probing RNA structure, function, and history by comparative analysis. In: Gesteland,R.F., Cech,T.R. and Atkins,J.F. (eds), The RNA World, 2nd edn. Cold Spring Harbor Laboratory Press, New York, pp. 113–141.


19. Znosko,B.M., Silvestri,S.B., Volkman,H., Boswell,B. and Serra,M.J. (2002) Thermodynamic parameters for an expanded nearest-neighbor model for the formation of RNA duplexes with single nucleotide bulges. Biochemistry, 41, 10406–10417.
20. Mathews,D.H., Disney,M.D., Childs,J.L., Schroeder,S.J., Zuker,M. and Turner,D.H. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA, 101, 7287–7292.
21. R Development Core Team. (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
22. Sarkar,D. (2008) Lattice: Multivariate Data Visualization with R. Springer, New York.
23. Glantz,S. (2005) Primer of Biostatistics, 6th edn. McGraw-Hill Medical, New York.
24. Dalgaard,P. (2008) Introductory Statistics with R, 2nd edn. Springer, New York.
25. Gerstman,B.B. (2009) Basic Biostatistics: Statistics for Public Health Practice, 1st edn. Jones and Bartlett Publishers, Sudbury, MA.
26. Siegmund,D. (1985) Sequential Analysis: Tests and Confidence Intervals. Springer, New York.
27. Wald,A. (1947) Sequential Analysis. Wiley, New York.
28. Sprinzl,M. and Vassilenko,K.S. (2005) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res., 33, D139–D140.
29. Reuter,J.S. and Mathews,D.H. (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinform., 11, 129.
30. Mathews,D.H. and Turner,D.H. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317, 191–203.
31. Torarinsson,E., Havgaard,J.H. and Gorodkin,J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics, 23, 926.
32. Will,S., Reiche,K., Hofacker,I.L., Stadler,P.F. and Backofen,R. (2007) Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol., 3, e65.
33. Lindgreen,S., Gardner,P.P. and Krogh,A. (2007) MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics, 23, 3304–3311.
34. Kiryu,H., Tabei,Y., Kin,T. and Asai,K. (2007) Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23, 1588–1598.
35. Do,C.B., Foo,C.-S. and Batzoglou,S. (2008) A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics, 24, i68–i76.
36. Xu,X., Ji,Y. and Stormo,G.D. (2007) RNA sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics, 23, 1883–1891.
37. Steffen,P., Voss,B., Rehmsmeier,M., Reeder,J. and Giegerich,R. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22, 500–503.
38. Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinform., 6, 73.
39. Hofacker,I.L. (2007) RNA consensus structure prediction with RNAalifold. Methods Mol. Biol., 395, 527–544.
40. Gosling,J. (1995) Introductory Statistics. Pascal Press, Sydney, Australia.
41. Shao,J. and Feng,H. (2007) Group sequential t-test for clinical trials with small sample sizes across stages. Contemp. Clin. Trials, 28, 563–571.
42. Cui,L., Hung,H.M.J. and Wang,S.-J. (1999) Modification of sample size in group sequential clinical trials. Biometrics, 55, 853–857.
43. Szymanski,M., Barciszewska,M.Z., Erdmann,V.A. and Barciszewski,J. (2002) 5S ribosomal RNA database. Nucleic Acids Res., 30, 176–178.