Multiperm: shuffling multiple sequence alignments ...

8 downloads 0 Views 49KB Size Report
Jan 9, 2009 - and show that our method better approximates this characteristic. ACKNOWLEDGEMENTS. We thank Robert Gentleman for useful discussions, ...
Bioinformatics Advance Access published January 9, 2009

Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies Parvez Anandam 1,∗, Elfar Torarinsson 3 and Walter L. Ruzzo 1,2 1

Departments of Computer Science and Engineering and 2 Genome Sciences, University of Washington, Seattle WA 98195-2350, USA; 3 Department of Natural Sciences, Faculty of Life Sciences, University of Copenhagen, 1870 Frederiksberg C, Denmark Associate Editor: Prof. Dmitrij Frishman

With multiple sequence alignments, competing preservation requirements make shuffling more challenging. The mononucleotide frequencies, gap structure and local conservation patterns are all important (Washietl and Hofacker, 2004); additionally preserving dinucleotide frequencies reduces the number of possible shuffles unacceptably — often no non-native alignment is possible. Nevertheless, several authors have lamented this situation, as failure to preserve dinucleotide frequencies in pairwise alignments can result in a drastic underestimate of false-positive rates in RNA prediction Babak et al. (2007). We present here a shuffling algorithm for arbitrary multiple sequence alignments that preserves mononucleotide frequencies, gap structure and local conservation patterns exactly, while preserving dinucleotide frequencies approximately. The number of distinct shuffles that can be produced by the algorithm is very large for the vast majority of multiple sequence alignments. Together these characteristics provide a much more realistic null model for RNA prediction studies than previously available.

1 INTRODUCTION An important question in any computational genomic search is whether its predictions are sufficiently surprising to suggest biological significance. A standard approach is to re-run the prediction algorithms on random data that is similar to real genomic data, and compare the score distribution of the resulting (presumably false) predictions to those from native sequences. For prediction of structured RNAs, the folding free energy is a critical attribute. Since G-C base pairs are more energetically favorable, random sequences generated as controls should ideally exhibit the same mononucleotide frequencies as native sequences. More subtly, dinucleotide frequencies also matter, since stacked pairs in helices also affect folding energy. Several studies, e.g., (Workman and Krogh, 1999; Clote et al., 2005) convincingly demonstrate that this is more than just a theoretical concern. A very elegant algorithm for permuting a single sequence while preserving dinucleotide frequencies performs an Eulerian walk on a directed multigraph constructed from the original sequence (Altschul and Erickson, 1985). They also show that shuffling is preferable to sampling from a Markov chain, which only preserves expected frequencies. ∗ To

whom correspondence should be addressed.

2 METHODS The essential idea behind the algorithm is to permute columns of the multiple alignment so as to preserve dinucleotide frequencies in the alignment’s consensus sequence, hence approximately in the individual sequences, while respecting constraints on gap and local conservation patterns. We partition columns in the multiple sequence alignment into conservation classes based on the positions of gaps and a coarse representation of the nucleotide conservation pattern in each column as in Washietl and Hofacker (2004). Specifically, a class in an n-sequence alignment is the concatenation of a binary n-vector indicating presence of gaps, and an integer c summarizing conservation in the column. The extent of coarse-graining of the latter attribute is controlled by the level parameter. At level 0, c equals the Mean Pairwise Identity (MPI) of the column rounded to zero digits after the decimal point (thus, only 0 or 1). At level 1, c is the MPI of the column rounded to one digit after the decimal point (eleven possible bins). Since there are fewer conservation classes at level 0, the number of possible unique shuffles is greater at level 0 than at level 1. Our shuffling heuristic is inspired by Altschul and Erickson’s Euler tour technique, but differs in important ways. In outline, we augment the Altschul-Erickson multigraph with conservation class information, then randomly “walk” the graph guided by that information. Our multigraph has up to 4 vertices, one for each type of nucleotide that appears in the consensus sequence for the multiple alignment. For each dinucleotide in the consensus sequence, there is an edge from the start nucleotide to the end

© The Author (2009). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

1

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on December 23, 2016

ABSTRACT Summary: Assessing the statistical significance of structured RNA predicted from multiple sequence alignments relies on the existence of a good null model. We present here a random shuffling algorithm, Multiperm, that preserves not only the gap and local conservation structure in alignments of arbitrarily many sequences, but also the approximate dinucleotide frequencies. No shuffling algorithm that simultaneously preserves these three characteristics of a multiple (beyond pairwise) alignment has been available to date. As one benchmark, we show that it produces shuffled exonic sequences having folding free energy closer to native sequences than shuffled alignments that do not preserve dinucleotide frequencies. Availability: The Multiperm GNU C++ source code is available at http://www.anandam.name/multiperm Contact: [email protected]; [email protected] Supplementary information: One supplemental figure illustrating the algorithm is available online.

Anandam et al.

3 RESULTS In addition to dinucleotide statistics, a key concern is whether a large number of unique shuffles remain possible for a given multiple sequence alignment. As in Babak et al. (2007), we use the fraction of nucleotides in the alignment that change identity to quantify the degree of shuffling. We define the shufflability of a multiple sequence alignment to be the mean of this quantity for 1000 random alignments. At level 0, we found that about 90% of a sample of the UCSC 17-way vertebrate alignments had a shufflability above 0.5, comparable to the 0.53 reported by Babak et al. (2007) in pairwise alignments. Early work highlighting the importance of dinucleotide frequencies focused on coding sequence (Workman and Krogh, 1999; Clote et al., 2005). As a benchmark, we compared shuffling algorithms on the 570 multiple alignments within exons in human chromosome Y. We shuffled each alignment 100 times as follows: 1) each sequence independently preserving mononucleotide frequencies, 2) each sequence independently preserving dinucleotide frequencies (Altschul-Erickson algorithm from Clote et al. (2005)), 3) the whole multiple alignment using rnazRandomize.pl from the RNAz package at levels 0 and 1, and 4) the whole multiple alignment using Multiperm at levels 0 and 1. For each native and shuffled sequence, we used RNAfold (Hofacker et al., 1994) to obtain its minimum free energy, and calculated a z-scores for the native with respect to the shuffled background. Results are shown in Table 1. The independent mono- and dinucleotide shuffles extend the results of (Workman and Krogh, 1999), showing in a larger dataset that significant dinucleotide effects are visible in individual exons (not just full length mRNAs) and pervasive in vertebrates. (Intriguingly, the shift of approximately 0.44 standard deviations is quite stable among the homeotherms tested, while the frog and fish species all show shifts near 0.30. However, the smaller data sets and weaker alignments to these 4 species may contribute to the apparent discrepancy.) In nearly all cases, preserving dinucleotide frequencies produces sequences whose minimum free energies on average are nearly the same as native sequences, unlike the mononucleotide shuffle. Of course, independently shuffling aligned sequences destroys

2

the alignment, making it a poor null model. The RNAz shuffle does preserve this similarity but the resulting z-scores are not substantially different Table 1. Minimum free energy z-scores using different shuffles

Num

hg18 panTro1 canFam2 bosTau2 rheMac2 mm8 loxAfr1 rn4 dasNov1 monDom4 oryCun1 echTel1 galGal2 xenTro1 tetNig1 fr1 danRer3

570 532 528 519 500 441 428 422 371 367 366 360 250 213 195 164 157

Len

144 145 145 148 147 142 147 143 148 151 140 143 143 139 130 139 141

G+C (%) 44 44 48 49 45 43 46 45 46 42 48 47 42 41 56 53 48

Independent mono di -0.46 -0.43 -0.47 -0.49 -0.50 -0.54 -0.43 -0.50 -0.44 -0.54 -0.52 -0.46 -0.50 -0.57 -0.23 -0.18 -0.27

-0.03 0.00 -0.05 -0.06 -0.04 -0.08 0.02 -0.07 0.06 -0.10 -0.08 0.03 -0.05 -0.25 0.04 0.14 0.04

RNAz 0

1

-0.47 -0.44 -0.47 -0.49 -0.51 -0.53 -0.43 -0.51 -0.46 -0.55 -0.52 -0.42 -0.48 -0.58 -0.24 -0.17 -0.26

-0.45 -0.42 -0.47 -0.48 -0.48 -0.52 -0.42 -0.49 -0.44 -0.51 -0.52 -0.40 -0.46 -0.56 -0.27 -0.16 -0.26

Multiperm 0 1 -0.24 -0.21 -0.23 -0.27 -0.25 -0.35 -0.19 -0.32 -0.21 -0.37 -0.31 -0.16 -0.31 -0.45 -0.18 -0.09 -0.19

-0.27 -0.24 -0.29 -0.31 -0.29 -0.37 -0.23 -0.36 -0.26 -0.37 -0.36 -0.21 -0.34 -0.46 -0.21 -0.10 -0.20

The number of multiple alignments (Num) containing the species, the length of the consensus sequence (Len), the G+C content and the computed minimum free energy z-scores are shown for multiple alignments within exons in human chromosome Y.

from mononucleotide shuffles of the individual sequences. The Multiperm z-scores are significantly closer to native, while still preserving other key characteristics of the alignments. The data thus underscore the importance of preserving dinucleotide frequencies when generating random alignments, and show that our method better approximates this characteristic.

ACKNOWLEDGEMENTS We thank Robert Gentleman for useful discussions, and the Danish Research Council for financial support.

REFERENCES Altschul, S. F., B. W. Erickson (1985) Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage, Mol. Biol. Evol., 2(6), 526-538. Babak, T., B.J. Blencowe, T.R. Hughes (2007) Considerations in the identification of functional RNA structural elements in genomic alignments, BMC Bioinformatics, 8, 33. Clote, P., F. Ferr´e, E. Kranakis, D. Krizanc (2005) Structural RNA has a lower folding energy than random RNA of the same dinucleotide frequency, RNA, 11, 578-591. Hofacker, I.L., W. Fontana, P.F. Stadler, S. Bonhoeffer, M. Tacker, P. Schuster (1994), Fast Folding and Comparison of RNA Secondary Structures, Monatshefte f. Chemie, 125, 167-188. Washietl, S., I. L. Hofacker (2004) Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics, J. Mol. Biol., 342, 19-30. Workman, C., A. Krogh (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution, Nucleic Acids Research, 27, 4816-4822.

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on December 23, 2016

nucleotide, labeled by the end column, and hence its consensus character and conservation class. The crux of (Altschul and Erickson, 1985) is that the consensus character labels on a random Euler tour of this graph (i.e., a tour crossing each edge once) exactly preserves the dinucleotide frequencies of the underlying sequence (and uniformly samples all such sequences). Our algorithm uses conservation classes of successive columns in the input alignment as a script to restrict selection of the next edge (alignment column) from the edge list at the current vertex, selecting randomly among those unused outgoing edges whose labels match the desired conservation class. If no such edge is available, pick uniformly at random among unused edges with the correct end class leaving other vertices; this produces a violation in the preservation of consensus dinucleotide frequencies (but preserves conservation class sequence). If such violations are infrequent, and if the consensus nucleotides are indeed representative of the nucleotides in each column, we will have good approximate preservation of dinucleotide frequencies. Our implementation includes two small refinements of this basic scheme. First, to avoid biasing dinucleotide violations towards the end of the alignment, we start the walk at a randomly chosen column (and wrap from end to beginning) instead of always beginning at the first column. Second, regions of the alignment where the majority of characters in a column are gaps are specially handled. Specifically, for such regions we attempt to preserve the dinucleotide formed by the pair of consensus nucleotides bracketing the region, in preference to the consensus pairs entering and leaving the region, as the former is likely to make a bigger contribution to the overall dinucleotide statistics. Supplemental Figures S1 and S2 illustrate the workings of the algorithm.

Suggest Documents