Exact Calculation of Distributions on Integers, with ... - Semantic Scholar

3 downloads 0 Views 393KB Size Report
distribution for a random variable defined on integers and for which dynamic ... or an estimate, e.g., an estimating alignment, as a point in the solution ensemble ...
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 16, Number 1, 2009 © Mary Ann Liebert, Inc. Pp. 1–18 DOI: 10.1089/cmb.2008.0137

Exact Calculation of Distributions on Integers, with Application to Sequence Alignment LEE A. NEWBERG1;2 and CHARLES E. LAWRENCE3

ABSTRACT Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits. Key words: algorithms, computational molecular biology, statistics, alignment, dynamic programming, genomics. 1. INTRODUCTION

C

ONFIDENCE INTERVALS AND OTHER INTERVAL ESTIMATION PROBLEMS, and hypothesis testing are two of the most important topics in statistical inference. To perform either of these task requires the obtaining of the probability distribution function or, equivalently, the cumulative distribution function (CDF) of a scalar quantity, or at least the tails of such a distribution. The cumulative distribution function for the random variable S is defined as

CDFSj‚ .x/ D PrŒS < xj‚; where ‚ is the set of parameters of the distribution.

1 Center

for Bioinformatics, Wadsworth Center, New York State Department of Health, Albany, New York. of Computer Science, Rensselaer Polytechnic Institute, Troy, New York. 3 Center for Computational Molecular Biology, Brown University, Providence, Rhode Island. 2 Department

1

(1)

2

NEWBERG AND LAWRENCE

In genomics and computational molecular biology, attention has been focused to date primarily on tests of statistical significance. We say that a score is statistically significant if obtaining a score as good as or better than the one observed is unlikely under a specified null distribution with parameters ‚0 , i.e., if PrŒS  sj‚0  D 1 CDFSj‚0 .s/ is not more than some threshold. A recursive algorithm has been presented to calculate the cumulative distribution function for tests of significance in motif finding when the scores are integers (Staden, 1989; Moses et al., 2004; Carmack et al., 2007). While in this case integer scores are employed to facilitate computation, integer scores arise naturally in obtaining confidence intervals for many of the most important problems in computational molecular biology, because many of our field’s inference problems concern discrete high-dimensional binary unknowns. For example, in RNA secondary structure prediction, a binary variable Aij is set to 1 if base i is predicted to pair with base j , and 0 otherwise. Similarly, in sequence alignment, binary variables are used to predict the alignment of a specific residue in one sequence to a specific residue in another sequence, and in network inference, binary variables are used to predict the edges of a graph. For these problems, a count of the number of differences between a reference solution and an alternate solution is necessarily an integer. Here we present an approach, applicable to a wide range of problems, for computing the exact probability distribution for a random variable defined on integers and for which dynamic programming algorithms are available. In the context of local sequence alignments, we apply the approach both to the calculation of credibility limits (also known as Bayesian confidence limits), described in the following, and to the calculation of the distribution of scores obtainable by alignment of any given pair of sequences.

1.1. Credibility limits Hypothesis testing has received much attention in computational biology, but there has been considerably less focus on how to assess the global uncertainty/reliability of the prediction of the many discrete highdimensional problems. For example, when a BLAST search is completed, not only is an E-value returned, but also an alignment; however, no overall assessment of the uncertainty/reliability of the reported alignment is given. In any discrete setting, a procedure delivering a single answer delivers one feasible solution, a point estimate, selected from the solution ensemble, i.e., from the set of all possible solutions. For highdimensional discrete space, these ensembles are immense, and thus there is likely to be considerable uncertainty. For example, in the alignment of two sequences of length m and n there are A.m; n/ D

.m C n/Š mŠ nŠ

(2)

possible effective alignments (see, e.g., Waterman [1995], p. 188). The number of alignments, i.e., the size of the alignment ensemble, grows rapidly with the length of the sequences being aligned; for example, two small sequences of a length of only 50 residues permit 1029 possible alignments. Thus, it is very unlikely that any procedure can confidently identify the true alignment. In fact, for the local alignments of the human/rodent pairs of sequences described below, the probability of each of the maximum similarity alignments ranged from 1:3  10 1 to 3:4  10 52 , with a median value of 3:1  10 18 . Even for algorithms that return near-optimal solutions, say the 1000 alignments with the highest alignment scores, the total probability represented can be vanishingly small. Accordingly, it is more appropriate to think of a prediction or an estimate, e.g., an estimating alignment, as a point in the solution ensemble that is selected as a representative solution, a point estimate, around which there can be considerable variability and thus uncertainty. To asses the uncertainty/reliability of such point estimates, Webb-Robertson et al. (2008) have recently introduced credibility limits. For a hyper-sphere centered on a point estimate, a credibility limit is the radius of the minimum-sized hyper-sphere containing at least a specified fraction of the posterior distribution. When the distance between solutions is the Hamming distance that counts the number of differing discrete unknowns, the credibility limits will always be nonnegative integers. Webb-Robertson et al. (2008) also employ a Bayesian statistical approach to introduce a sampling procedure for estimating the cumulative distribution function of the Hamming distances from a point estimate for discrete high-dimensional inference problems; they use these cumulative distribution functions to obtain the desired credibility limits. As a concrete illustration of their approach, they obtained estimated credibility limits for the local sequence

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

3

alignment problem. Here, our interest is in finding exact distribution functions for integer scalar-valued functions, such as those defined on alignments. In the Methods section, we describe the general approach for calculating a distribution on integers. We specialize this to the problem of credibility limits for local sequence alignments, where the distance between two sequence alignments is a count of the number of pairings that the sequences do not have in common. We also specialize the general approach to the problem of giving the distribution of alignment scores for any given pair of sequences. In the Results section, we apply the credibility limits algorithm to obtain exact credibility limits on the alignment of 20 human-rodent sequence pairs, and on a set of alignments of pairs of Drosophila sequences.

2. METHODS We begin this section by describing how we use a dynamic programming algorithm to compute probability distributions on integers. We then specialize the algorithm, for the computation of credibility limits for local sequence alignments, and for the computation of alignment-score distributions.

2.1. General dynamic programming algorithms Dynamic programming algorithms can be used to find solutions to a problem when there exists a hierarchy of a reasonable number of subproblems, with the feature that any problem or subproblem can be easily computed from solutions to its subproblems. A generic example with a probabilistic interpretation computes a value Z.n/ by starting with Z.0/ D 1 and then computing Z.1/; : : : ; Z.n/ in order, by combination of the previous values (Table 1). Generally, the final computed value, Z.n/, is proportional to the Bayesian marginal posterior probability of a data set R. Via Bayes’ rule, Z.n/ serves as a normalizing divisor in a manner that shall become apparent presently. The run time for such an algorithm is often linear, for instance, when computation of the transition probability t.k; Rjj / (Table 1) can be achieved in constant time, and when the transition probability is nonzero only when k j is less than some constant. In this case, the memory requirement can be rendered constant as well, since early terms in the sequence Z.0/; Z.1/; : : : can be discarded as k grows larger. Also common is a quadratic run time, for instance, when the number of terms in the sum grows with the problem size. Usually in this case, the entire Z array must be retained, and the memory requirement is linear in the problem size. Note that we allow a transition probability to depend on the data R. In the language of finite-state machines, such a transition often depends upon a subset of the data Rk corresponding to the destination

TABLE 1.

G ENERAL P ROBABILISTIC DYNAMIC P ROGRAMMING A LGORITHM

Z.0/ D 1 For k in 1 : : : n P 1 Z.k/ D kj D0 Z.j /t.k; Rjj / EndFor R is the data set being evaluated and, for all j and k, we require that t.k; Rjj /  0. The value Z.j / is proportional to the Bayesian marginal posterior probability of stage j , and the value t.k; Rjj / is proportional to the probability of the a transition from stage j to stage k, conditioned upon originating from stage j . The product Z.j /t.k; Rjj / is proportional to the unconditioned posterior probability of the transition from stage j to stage k. The recursion must be formulated so that these transitions to stage k are mutually exclusive events; we are thereby allowed to compute the probability of the union of these events as the sum of their individual probabilities. The final computed value, Z.n/, is proportional to the Bayesian marginal posterior probability of the data R. For many dynamic programming problems, multiple states are defined for each of the indices of the data. Here we take each of these states as a different stage in f0; : : : ; ng.

4

NEWBERG AND LAWRENCE TABLE 2. G ENERAL D IRECT A PPROACH TO I NTEGER S CORE D ISTRIBUTIONS For s in smin : : : smax Z.0; s/ D 0 EndFor Z.0; 0/ D 1 for k in 1 : : : n for s in smin : : : smax P 1 Z.k; s/ D kj D0 Z.j; s EndFor EndFor

s.j; k; R/

s.k; R//t.k; Rjj /

By adding a score-specific dimension to Z, we can allocate a fraction of each Z value to each of the integer values within the allowable range of scores. Since the score is computed as the sum of values s.k; R/ for a visit to stage k, and s.j; k; R/ for a transition from stage j to stage k, we adjust for the score difference between stage j and stage k. Note that, when s s.j; k; R/ s.k; R/ is outside the range of possible scores fsmin ; : : : ; smax g, the algorithm must use 0 instead of Z.j; s s.j; k; R/ s.k; R//.

stage k, and a distinction is made between a transition probability and an emission probability. Specifically, we can write 0 1 ! X t.k; R jj / k ; (3) t.k; Rk jj / D @ t.k; Rk jj /A P Rk t.k; Rk jj / Rk

where the first factor is called the transition probability from stage j to stage k and is independent of R, and the second factor is called the emission probability of Rk for that transition. We will not make this distinction in the following; instead we will refer to t.k; Rjj / as a transition probability. 2.1.1. General case: the direct approach. Suppose that we want the probability distribution for a score that counts an integer cost s.k; R/ for a visit to stage k, and an integer cost s.j; k; R/ for a transition from stage j to stage k. A direct way to achieve this is to add a second dimension to the array Z (Table 2). Here the cumulative score is tracked via the second index of the array, and the final unnormalized probability distribution over scores is stored in the last matrix row, Z.n; smin : : : smax/, where smin and smax are, respectively, the minimum and maximum possible scores. These Z.n; : : : / values are normalized to probabilities by dividing by their sum, which is the Z.n/ value of Table 1. The original algorithm (Table 1) requires less time and memory than does the newer algorithm (Table 2), because the latter has the loop over values of s. When the requirements for the original algorithm are O.n˛ / memory and O.n1C˛ / time, where ˛ is often 0 or 1, the newer algorithm requires O.n˛ s# / memory and O.n1C˛ s# / time, where s# D smax smax C 1. Fortunately, with the Fourier transform approach, we will be able to restore the original memory requirement and, in the case of a parallel processing environment, the original time requirement. 2.1.2. General case: the polynomial approach. Instead of adding a second dimension to a Z array that stores real (floating-point) numbers, we can have each element of the original array store a polynomial. That is, for each k, the values .Z.k; smin /; : : : ; Z.k; smax// will instead be stored as the polynomial Z.k/ D

smax X

Z.k; s/p s ;

(4)

sDsmin

where p is the polynomial indeterminant (also known as a variable). (Note that we have slightly generalized the strict mathematical definition of polynomial, in that we are not restricting the integers s to be nonnegative.) The sums in Table 2 can be expressed as operations on these polynomials (Table 3).

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS TABLE 3. G ENERAL P OLYNOMIAL A PPROACH I NTEGER S CORE D ISTRIBUTIONS

5 TO

Z.0/ D 1 For k in 1 : : : n P 1 Z.k/ D ps.k;R/ kj D0 Z.j /t.k; Rjj /ps.j;k;R/ EndFor We can track the accumulating score s by altering our array Z so that it contains polynomials over the indeterminant (also known as a variable) p, rather than real numbers. Here we are computing the distribution for a score that is the sum of the values s.k; R/ for a visit to stage k, and s.j; k; R/ for a transition from stage j to stage k.

This approach was applied for motif-significance calculations (Staden, 1989; Moses et al., 2004; Carmack et al., 2007) and falls into the general category of generating-function techniques (Wilf, 2006). Unfortunately, the memory requirement for storage of an array of polynomials is likely to be as costly as is that for storage of a two-dimensional array of real numbers, and the time requirement is not improved, either. Note that with a slight change of notation, substitution of e i t for p, Equation (4) becomes Z.k/ D

smax X

Z.k; s/e i ts ;

(5)

sDsmin

which, as a function of the indeterminant t, is known as the characteristic function of the probability distribution on s. This association is closely related to the Fourier transform approach, described next. 2.1.3. General case: the Fourier transform approach. Fortunately, the expanded memory requirement can be avoided by use of Fourier transforms. If smin is the known, smallest possible score and smax is the known, largest possible score, and if s# D smax smin C 1 then the polynomial can be reconstructed through knowledge of its value at s# distinct values of p. In particular, if the choices for p are the s# complex roots of unity f! s D cos.2 s=s#/ C i sin.2 s=s#/ W s D 0; : : : ; s# 1g, then the polynomial can be reconstructed by Fourier transform in O.s# log.s# // time and O.s# / memory, a cost which is typically insignificant. Through its use of a specific complex number value for p (e.g., ! s , for some s) rather than an indeterminant p, the Z array of Table 3 stores complex numbers instead of polynomials. This usage reduces memory and time requirements to the original O.n˛ / and O.n1C˛ /, respectively. However, such a computation must be run O.s# / times, once for each s D 0; : : : ; s# 1, so the overall run-time remains O.n1C˛ s# /. Fortunately, the intermediate values of these calculations need not be kept in memory simultaneously, and thus the overall memory requirement remains at the reduced level of O.n˛ / (Table 4). Note that the the calculations within the outer loop of Table 4 do not depend on the calculations from previous iterations for that loop; thus, the run time can revert to the reduced O.n1C˛ / in a parallelprocessing environment, with a typically insignificant post-processing time of O.s# log.s# // for the Fourier transform. Note also that some software implementations of Fourier transforms run considerably faster if s# has only small prime factors. To accommodate these implementations, it is reasonable to slightly increase smax so that s# D smax smin C 1 has only small prime factors.

2.2. Credibility limits for sequence alignment In this section, we specialize the general algorithm described above, for the purpose of computing credibility limits for pairwise local sequence alignments. Since the size of the ensemble of alignments of two sequences is typically immense, the probability of any individual alignment, including the optimal alignment, is typically very small. Accordingly, there will usually be substantial uncertainty about the alignment of the two sequences, and credibility limits, also known as Bayesian confidence limits, provide a means by which to represent the uncertainty/reliability of a reported estimating alignment (Webb-Robertson

6

NEWBERG AND LAWRENCE TABLE 4.

G ENERAL F OURIER T RANSFORM A PPROACH S CORE D ISTRIBUTIONS

TO I NTEGER

For s in 0 : : : s# 1 p D cos.2s=s# / C i sin.2s=s# / Z.0/ D 1 For k in 1 : : : n P 1 Z.k/ D ps.k;R/ kj D0 Z.j /t.k; Rjj /ps.j;k;R/ EndFor F .s/ D Z.n/ EndFor F .0; : : : ; s# 1/ D F .0; : : : ; s# 1/=.F .0/s# / F .smid ; : : : ; smax ; smin ; : : : ; smid 1/ D FourierTransform.F .0; : : : ; s#

1//

We can track the accumulating score s by running the polynomial algorithm with roots of unity used instead of the polynomial indeterminant p. We are computing the distribution for a score that is the sum of the values s.k; R/ for a visit to stage k, and s.j; k; R/ for a transition from stage j to stage k. Note that smid is the unique integer multiple of s# that falls in the range Œsmin ; smax ; it will be 0 if smin  0  smax . The second-to-last line indicates that every element of the F array should be divided by the product F .0/s# ; this serves to normalize the probabilities in the solution, F . The last line indicates that a Fourier transform is applied to the F array, leaving the probability for the score smid in the first element of the array, the probability for the score smid C 1 in the second element of the array, etc.

et al., 2008). Specifically, the x% credibility limit of an alignment is the radius of the smallest hypersphere that contains x% of the posterior ensemble of alignments. That is, when a pairing distance from the estimating alignment s satisfies the relationship PrŒdistance < s < x%  PrŒdistance  s

(6)

for some x% (e.g., x% D 90%, 95%, 99%), then s is said to be the x% credibility limit for the estimating alignment A. A small value of s indicates that the probability mass of the ensemble of alignments is tightly compact around the estimating alignment. Accordingly, since there will be at most the short pairing distance s to any alignment within this radius, the estimating alignment quite reliably represents the recommendations of the data. On the other hand, a large value of s indicates that A is not a reliable representative of the distribution. In the Results section, we will present the probability distribution for s in analyses of some human-rodent sequence data and in analyses of some fly sequence data. In this section, we will define a probability model on alignments. We will then provide an algorithm that computes the probability distribution on pairing distances from an estimating alignment A. That is, for all s, we simultaneously compute the probability that an alignment drawn randomly from the probability distribution will have exactly s pairing differences from the estimating alignment A. 2.2.1. Probability distribution for local sequence alignments. A probability distribution is an essential component of nearly all statistical inference procedures, including hypothesis testing and the identification of confidence limits. Several methods are available for obtaining these distributions. The use of a probabilistic model, such as a pair-HMM model, is the most direct means to obtain these distributions (Durbin et al., 2006). Also, these distributions can be obtained using the implicit relationship of a null distribution to parameters of a scoring matrix (Karlin and Altschul, 1990). Here we employ an approach that allows near-direct interpretation of alignment scores as logarithmic-odds ratios. To compute local pairwise alignment scores, we employ the default nucleotide sequence alignment scoring scheme of SSEARCH (Pearson, 1991). SSEARCH uses the scoring scheme that assigns spair.match/ D C5 for each nucleotide pairing of identical nucleotides, spair .mismatch/ D 4 for each nucleotide pairing that is a mismatch, sopen D 16 to start an insertion, and sextend D 4 to extend an insertion. Because we are looking at local sequence alignments, an insertion penalty is applicable neither for insertions before the first nucleotide pairing, nor for insertions after the last nucleotide pairing. We convert alignment scores to posterior probabilities via a partition function approach; for a pair of input sequences R D .x; y/ and a

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

7

proposed alignment A, the alignment score sssearch .R; A/ is converted to a probability via the Boltzmann distribution:   sssearch .R; A/ 1 exp ; (7) PrŒAjR D Zend T where T , the temperature, (sometimes written as 1= or 1=ˇ) is a scaling factor, and Zend is the partition function that normalizes the probabilities so that the sum over all possible alignments is 1.0:   X sssearch .R; A0 / Zend D exp : (8) T 0 A

The distribution of Equation (7) is also known as a Boltzmann distribution. To show that consideration of the ensemble of alignments is valuable even when a maximum-score alignment is weighted heavily, we have chosen for our experiments a temperature of T D 5= ln.10/  2:17, which corresponds to  D ˇ  0:46. With this choice, an alignment with a score better than that of another alignment by C5 (e.g., one more nucleotide matching) will be 10 times as probable. However, other choices of T are viable. For instance, we could have selected T D 5:22113, the value we derive for the SSEARCH pairing alignment score matrix, via the approach of Karlin and Altschul (1990) for gapless alignments. Such a choice would have significantly de-emphasized maximum-score alignments. 2.2.2. Basic forward algorithm for Zend . The basic forward algorithm for pairwise sequence alignments is a two-dimensional dynamic programming algorithm, in that values for an alignment of the first i positions of sequence x to the first j positions of sequence y are computed from the values for i 0  i and j 0  j . The form of these computations depends on the state of the pair .xi ; yj /, e.g., whether xi and yj are paired, or not. For local sequence alignment, we find it convenient to define a BEGIN state, an END state, and seven intermediate states for a finite-state machine that recognizes the regular expression in Table 5. We need to provide the (unnormalized) transition probabilities from one state to another. Because the contributions to an alignment score sssearch .R; A/ are added together, the corresponding contributions to the unnormalized probability exp.sssearch .R; A/=T / are multiplied together. Thus, to model the default SSEARCH parameters, we set a transition probability of topen D exp.sopen =T /, with sopen D 16, for a legal transition into the XINS state from a state other than XINS, regardless of the type of nucleotide being inserted. A transition from the XINS state to itself has a transition probability of textend D exp.sextend =T /, with sextend D 4. We use the same transition probabilities for legal transitions into the YINS state; these transitions similarly depend on whether the transition is from the YINS state and, in this application, we do not include the common convention that excludes a visit to the YINS state directly from a visit to the XINS state. For legal transitions into the PAIR state, we use a transition probability of tpair .xC ; yC / D exp.spair .xC ; yC /=T /, where spair .xC ; yC / D C5 if xC and yC are the same nucleotide, and spair.xC ; yC / D 4 when xC and yC are different nucleotides. Regardless of the identity of the nucleotide being inserted, each legal transition into each of the XBEGIN, YBEGIN, XEND, and YEND states is TABLE 5.

R EGULAR E XPRESSION

FOR

L OCAL S EQUENCE A LIGNMENT

8 9 2 3‹ ˆ > > ˆ > ! 6 !ˆ ! ! ! ! !     7 ˆ > xC xC = xC 6 xC < xC 7 6 7 6 7 ˆ > yC y y > y 4 yC ˆ 5 C C C ˆ > „ ƒ‚ … „ ƒ‚ … „ ƒ‚ … ˆ :„ ƒ‚ … „ ƒ‚ … „ ƒ‚ …> ; „ ƒ‚ … „ ƒ‚ … !

XBEGIN

YBEGIN

PAIR

XINS

YINS

PAIR

XEND

YEND

The  indicates zero or more repetitions of its argument, ‹ means zero or one repetition of its argument, xC signifies the next nucleotide position in sequence x, and yC signifies the next nucleotide position in sequence y. That is, the alignment begins with zero or more insertions into sequence x and zero or more insertions into sequence y that are without cost, because we are analyzing local sequence alignment. Then, if the sequences have not yet been exhausted, there is one or more pairings of a nucleotide in x with a nucleotide in y, with zero or more insertions into x, and with zero or more insertions into y between pairings. Finally, there are zero or more insertions into sequence x, and zero or more insertions into sequence y, that are without cost for local sequence alignments.

8

NEWBERG AND LAWRENCE TABLE 6.

S EQUENCE A LIGNMENT: T HE BASIC F ORWARD A LGORITHM

States: The The The The The The The The The

BEGIN state is applicable when i D j D 0. XBEGIN state is applicable when xi is within the unpenalized, initial insertion. YBEGIN state is applicable when yj is within the unpenalized, initial insertion. PAIR state is applicable when xi is paired with yj . XINS state is applicable when xi is within a penalized insertion. YINS state is applicable when yj is within a penalized insertion. XEND state is applicable when xi is within the unpenalized, final insertion. YEND state is applicable when yj is within the unpenalized, final insertion. END state is applicable when i D m and j D n.

Basic Forward Algorithm: Zbegin.0; 0/ D 1 // Compute the first cell Zxbegin.0; 0/ D Zybegin.0; 0/ D Zpair .0; 0/ D 0 Zxins.0; 0/ D Zyins.0; 0/ D Zxend.0; 0/ D Zyend.0; 0/ D 0 For i in 1 : : : m // Compute the first column Zxbegin.i; 0/ D tfreeopen.xi /Zbegin.i 1; 0/ C tfreeextend.xi /Zxbegin.i 1; 0/ Zbegin.i; 0/ D Zybegin.i; 0/ D Zpair .i; 0/ D 0 Zxins.i; 0/ D Zyins.i; 0/ D Zxend.i; 0/ D Zyend.i; 0/ D 0 EndFor For j in 1 : : : n // Compute the first row Zybegin.0; j / D tfreeopen.yj /Zbegin.0; j 1/ C tfreeextend.yj /Zybegin.0; j 1/ Zbegin.0; j / D Zxbegin.0; j / D Zpair.0; j / D 0 Zxins.0; j / D Zyins.0; j / D Zxend.0; j / D Zyend.0; j / D 0 EndFor For i in 1 : : : m // Compute the remaining values For j in 1 : : : n Zbegin.i; j / D Zxbegin.i; j / D 0 Zybegin.i; j / D tfreeopen.yj /Zxbegin.i; j 1/ C tfreeextend.yj /Zybegin.i; j 1/ Zpair .i; j / D tpair .xi ; yj /ŒZbegin.i 1; j 1/ C Zxbegin.i 1; j 1/ C Zybegin.i 1; j 1/ CZpair .i 1; j 1/ C Zxins.i 1; j 1/ C Zyins.i 1; j 1/ Zxins.i; j / D topen.xi /Zpair.i 1; j / C textend.xi /Zxins.i 1; j / Zyins.i; j / D topen.yj /ŒZpair .i; j 1/ C Zxins.i; j 1/ C textend.yj /Zyins.i; j 1/ Zxend.i; j / D tfreeopen.xi /Zpair .i 1; j / C tfreeextend.xi /Zxend.i 1; j / Zyend.i; j / D tfreeopen.yj /ŒZpair.i; j 1/ C Zxend.i; j 1/ C tfreeextend.yj /Zyend.i; j 1/ EndFor EndFor Zend D Zxbegin.m; n/ C Zybegin.m; n/ C Zpair .m; n/ C Zxend.m; n/ C Zyend .m; n/ In our implementation, we have defined tpair .xi ; yj / D exp.C5=T / when xi D yj , but tpair .xi ; yj / D exp. 4=T / when xi ¤ yj . We have also defined topen ./ D exp. 16=T /, textend ./ D exp. 4=T /, and tfreeopen ./ D tfreeextend ./ D 1. We have set T D 5= ln.10/  2:17, so that an alignment with a score better than that of another alignment by C5 will be 10 times as probable.

assigned a transition probability of tfreeopen D exp.sfreeopen=T /, with sfreeopen D 0, unless it is a transition from the same state, in which case it is assigned a transition probability of tfreeextend D exp.sfreeextend=T /, with sfreeextend D 0. See Table 6 for the algorithm. The final computed value, Zend , is the sum in Equation (8). Thus, via its score sssearch .R; A/, we can compute PrŒAjR for any alignment A from Equation (7). The run time of the algorithm is O.mn/, and the algorithm can be made to run with O.min.m; n// memory, where m and n are the lengths of the input sequences, because computed values can be discarded once they are no longer needed. 2.2.3. Credibility limits: the Fourier transform approach. The preceding basic forward algorithm can be adapted to compute exact credibility limits. For this problem, we wish to precisely determine the probability that an alignment drawn from the Boltzmann distribution of Equation (7) will differ from an

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

9

estimating alignment A by exactly some number s of pairings not common to the two alignments. The key to the adaptation is the replacement of the real numbers topen .n/, textend.n/, tpair .n; n0 /, tfreeopen.n/, and tfreeextend.n/ (for nucleotides n and n0 ) with complex numbers. There is some latitude in when to count pairing differences between alignments, and we choose to count an .xi ; yj / pair when xi is considered. That is, in the dynamic programming algorithm, when xi is an insertion via the XBEGIN, XINS, or XEND state, the transition will contribute s D 1 to the pairing distance if xi is paired in the estimating alignment A (regardless of the yj to which it is paired in A); otherwise, the transition will contribute s D 0 to the pairing distance. On the other hand, when xi is paired with yj via the PAIR state, it will contribute s D 0 to the pairing distance if A has the same pairing, it will contribute s D 1 if A has xi unpaired, otherwise it will contribute s D 2 if A has xi paired to something other than yj . The algorithm is given in Table 7. The sequential run time of this algorithm is O.mn min.m; n//, because the number of possible alignment scores s# is O.min.m; n//. However, the run time of the Fourier approach can be reduced to that of the basic forward algorithm, O.mn/, if the s# calculations for Zend .! k / are done in parallel on separate processors. The algorithm utilizes the fact that the maximum number of pairing differences from an estimating alignment A is bounded by the number of pairings in the estimating alignment plus the shorter of the two sequence lengths. Implementations that adapt the generic direct approach of Table 2 and the generic polynomial approach of Table 3 are provided in Section A.1 of the Supplementary Materials (see www.liebertonline.com). Also, note that the values of smin and smid that appear in Table 4 are zero for this algorithm, because zero is the minimum achievable pairing distance; it is the pairing distance achieved for the estimating alignment itself. 2.2.4. Software specifics. Because of its improved memory requirement, we implemented the Fourier transform approach described in Section 2.2.3. We checked its accuracy on small problem instances via an implementation of the direct approach described in Section A.1 of the Supplementary Materials, and we double-checked the smallest instances by hand. For each sequence pair, we computed the exact credibility distribution for two distinct choices of an estimating alignment: a maximum-score alignment, and the centroid alignment. (For a centroid alignment, termed a “probability alignment” by Miyazawa [1995], two nucleotides are paired if and only if the marginal posterior probability of their being paired exceeds 50%. Also see Ding et al. [2005] and Carvalho and Lawrence [2008] for descriptions of the theoretical and practical advantages exhibited by centroid estimators over maximum-score estimators.) As a first step, we computed maximum-score and centroid alignments themselves. These computations require use of a “backward algorithm,” which is a little more complex than the forward algorithm of Table 6, and which is run from the opposite end of the sequences (see, e.g., Section 4.4 in Durbin et al. [2006]). However, a straightforward implementation using a backward algorithm requires O.mn/ memory, because the values that previously were discardable in the forward algorithm implementation now must be retained for use in the backward algorithm. To avoid use of too much computer memory, we employed the optimal checkpointing approach of Newberg (2008a). See Section B of the Supplementary Materials for additional software considerations. 2.2.5. Problem instances. We analyzed 24 pairs of human-rodent DNA sequences. The first member in each of the 24 pairs was created from 3,000 nucleotides of sequence immediately upstream of a human gene, truncated so as to avoid inclusion of the upstream gene as necessary, and the second member of each pair was created from 3,000 nucleotides of sequence immediately upstream of the orthologous rodent gene, truncated as necessary. We applied RepeatMasker (Smit et al., 1996–2004) to these sequences and deleted each identified repeat, joining the flanking sequences together under the plausible hypothesis that the sequences were adjoining prior to the time of the insertion of the repeat. The resulting sequences varied in length from 973 to 3000 with a mean length of 2100. See Section C.1 of the Supplementary Materials for more information about this data set. We also analyzed 204 sets of orthologous fly data. For each of 20 genes, we compared 1000 nucleotides of sequence immediately upstream of the gene in Drosophila melanogaster, truncated to avoid including the upstream gene as necessary, with 1000 nucleotides of sequence immediately upstream of the orthologous gene in D. pseudoobscura, truncated as necessary. Similarly, we compared the 20 D. melanogaster

10

NEWBERG AND LAWRENCE TABLE 7.

C REDIBILITY L IMITS : F OURIER T RANSFORM A PPROACH

s# D GoodFourierIntegerGreaterThan.NumberOfParingsIn.A/ C min.m; n// For s in 0 : : : s# 1 p D cos.2s=s# / C i sin.2s=s# / Zbegin.0; 0/ D 1 // Compute the first cell Zxbegin.0; 0/ D Zybegin.0; 0/ D Zpair .0; 0/ D Zxins.0; 0/ D Zyins.0; 0/ D Zxend.0; 0/ D Zyend .0; 0/ D 0 For i in 1 : : : m // Compute the first column Zxbegin.i; 0/ D tfreeopen.xi /Zbegin.i 1; 0/ C tfreeextend.xi /Zxbegin.i 1; 0/ Zbegin.i; 0/ D Zybegin.i; 0/ D Zpair .i; 0/ D Zxins.i; 0/ D Zyins.i; 0/ D Zxend.i; 0/ D Zyend.i; 0/ D 0 If xi is paired in A Then Zxbegin.i; 0/ D pZxbegin.i; 0/ EndIf EndFor For j in 1 : : : n // Compute the first row Zybegin.0;j / D tfreeopen.yj /Zbegin.0;j 1/ C tfreeextend.yj /Zybegin.0;j 1/ Zbegin.0;j / D Zxbegin.0;j / D Zpair .0;j / D Zxins.0;j / D Zyins.0;j / D Zxend.0;j / D Zyend.0;j / D 0 EndFor For i in 1 : : : m // Compute the remaining values For j in 1 : : : n Zbegin.i;j / D Zxbegin.i;j / D 0 Zybegin.i;j / D tfreeopen.yj /Zxbegin.i;j 1/ C tfreeextend.yj /Zybegin.i;j 1/ Zpair .i;j / D tpair .xi ; yj /ŒZbegin.i 1;j 1/ C Zxbegin.i 1;j 1/ CZybegin.i 1;j 1/ C Zpair .i 1;j 1/ C Zxins.i 1;j 1/ C Zyins.i 1;j 1/ Zxins.i;j / D topen .xi /Zpair .i 1;j / C textend.xi /Zxins.i 1;j / Zyins.i;j / D topen .yj /ŒZpair.i;j 1/ C Zxins.i;j 1/ C textend.yj /Zyins.i;j 1/ Zxend.i;j / D tfreeopen.xi /Zpair.i 1;j / C tfreeextend.xi /Zxend.i 1;j / Zyend.i;j / D tfreeopen.yj /ŒZpair.i;j 1/ C Zxend.i;j 1/ C tfreeextend.yj /Zyend.i;j 1/ If xi is paired in A Then Zxins.i;j / D pZxins.i;j / Zxend.i;j / D pZxend.i;j / If xi is paired in A, but not to yj , Then Zpair .i;j / D p2 Zpair .i;j / EndIf Else Zpair .i;j / D pZpair .i;j / EndIf EndFor EndFor Zend.s/ D Zxbegin.m; n/ C Zybegin.m; n/ C Zpair .m; n/ C Zxend.m; n/ C Zyend.m; n/ EndFor Zend.0; : : : ; s# 1/ D FourierTransform.Zend .0; : : : ; s# 1/=.Zend.0/s# // All array elements are complex numbers. Also see the legend for Table 6.

sequences with orthologous upstream sequences from D. erecta, D. yakuba, and D. simulans. As with the human-rodent data, we deleted each repeat identified by RepeatMasker, joining flanking sequences. The resulting sequence lengths varied from 139 to 1000 with a mean of 634. See Section C.2 of the Supplementary Materials for more information about this data set. For each of the 24 C .20  4/ D 104 sequence pairs, we computed both a maximum-score local sequence alignment and the centroid local sequence alignment, and we then computed the exact credibility distribution for these two estimating alignments.

2.3. Alignment score distribution We can compute the distribution of integer-valued alignment scores (also known as the density of states) for a given pair of sequences. In Table 8, we have inserted the complex values p spair , p sopen , and p sextend into the basic forward algorithm, so as to track the contributions to an alignment score.

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

11

We can compute the alignment score distribution under the assumption that alignments are distributed according to the posterior distribution of Equation (7), or under the null assumption that alignments are equally probable. For the former, we set s  pair (9) tpair D exp T s  open topen D exp T s  extend : textend D exp T For the latter, we set tpair D topen D textend D 1:

(10)

The value of smid that appears in Table 4 is zero for this algorithm because the alignment score of zero is achievable, for the “empty” alignment, and thus smin  0  smax . See Section A.2 of the Supplementary Materials for direct-approach and polynomial-approach implementations of this algorithm.

TABLE 8.

A LIGNMENT S CORE D ISTRIBUTION : F OURIER T RANSFORM A PPROACH

s# D GoodFourierIntegerGreaterThan.smax smin / For s in 0 : : : s# 1 p D cos.2s=s# / C i sin.2s=s# / Zxbegin.0; 0/ D 1 // Compute the first element Zybegin.0; 0/ D Zpair .0; 0/ D Zxins.0; 0/ D Zyins.0; 0/ D Zxend.0; 0/ D Zyend.0; 0/ D 0 For i in 1 : : : m // Compute the first column Zxbegin.i; 0/ D Zxbegin.i 1; 0/ Zybegin.i; 0/ D Zpair .i; 0/ D Zxins.i; 0/ D Zyins.i; 0/ D Zxend.i; 0/ D Zyend .i; 0/ D 0 EndFor For j in 1 : : : n // Compute the first row Zybegin.0; j / D Zxbegin.0; j 1/ C Zybegin.0; j 1/ Zxbegin.0; j / D Zpair .0; j / D Zxins.0; j / D Zyins.0; j / D Zxend.0; j / D Zyend.0; j / D 0 EndFor For i in 1 : : : m // Compute the remaining values For j in 1 : : : n Zxbegin.i; j / D Zxbegin.i 1; j / Zybegin.i; j / D Zxbegin.i; j 1/ C Zybegin.i; j 1/ Zpair .i; j / D tpair .i; j / pspair .i;j / ŒZxbegin.i 1; j 1/ C Zybegin.i 1; j 1/ CZpair .i 1; j 1/ C Zxins.i 1; j 1/ C Zyins.i 1; j 1/ Zxins.i; j / D topen psopen Zpair .i 1; j / C textend psextend Zxins.i 1; j / Zyins.i; j / D topen psopen ŒZpair.i; j 1/ C Zxins.i; j 1/ Ctextend psextend Zyins.i; j 1/ Zxend.i; j / D Zpair .i 1; j / C Zxend.i 1; j / Zyend.i; j / D Zpair .i; j 1/ C Zxend.i; j 1/ C Zyend.i; j 1/ EndFor EndFor Zend .s/ D Zxbegin.m; n/ C Zybegin.m; n/ C Zpair .m; n/ C Zxend.m; n/ C Zyend.m; n/ EndFor Zend.0; : : : ; s# 1/ D Zend .0; : : : ; s# 1/=.Zend.0/s# / Zend.0; : : : ; smax ; smin ; : : : ; 1/ D FourierTransform.Zend .0; : : : ; s# 1// All array elements are complex numbers. The division by the product Zend .0/s# serves to normalize the probabilities in the solution, Zend . We have set spair .i; j / D C5 when xi D yj , but spair .i; j / D 4 when xi ¤ yj ; we have also set sopen D 16 and sextend D 4. We can set tpair .i; j / D exp.spair .i; j /=T /, topen D exp.sopen =T /, and textend D exp.sextend =T /, or we can set tpair D topen D textend D 1.

12

NEWBERG AND LAWRENCE

3. RESULTS Even though they score well, there are not many alignments that are nearly the same as a given estimating alignment. Thus, generally, if an alignment is chosen at random according to its Boltzmann probability, there is a low probability that the number of differences from the optimum is small. At the other extreme, alignments that have almost nothing in common with a given estimating alignment tend to score quite poorly, and accordingly the probability that an alignment chosen from the Boltzmann distribution will be a large pairing distance from an estimating alignment is also low. In between these extremes, there is often a single peak in the probability distribution. For instance, Figure 1 shows the exact credibility distributions for the maximum-score alignment and the centroid alignment, for the eighth pair of humanrodent sequences. The example demonstrates a superiority of the centroid alignment over the maximumscore alignment as well as an odd-even parity. See the figure legend for more information. Multi-modality is not uncommon. See Figure 2 for the example of the second pair of sequences from the human-rodent data. The multi-modality arises from multiple clusters of quality alignments. In this case, the centroid of the smaller cluster has the 205 pairings from the whole-ensemble centroid plus a fairly contiguous block that contains an additional 59 pairings. Ding et al. (2006), in the context of RNA secondary structures, observe and more deeply analyze the phenomenon of multiple solution clusters. Figure 3 depicts the credibility distributions for the fourth pair of sequences from the human-rodent data; this is a case where the centroid and maximum-score alignments fall into different clusters. The whole-ensemble centroid, which is also the centroid of the cluster containing it, has 450 pairings; the other

FIG. 1. The exact credibility distributions, the distribution of the pairing distances of the posterior weighted ensemble of alignments from the centroid alignment and from the maximum-score alignment, for the eighth pair of sequences in the human-rodent data set. The higher peak, on the left, is for the centroid alignment; the lower peak is for the maximum-score alignment. This example clearly demonstrates the rule that the average pairing distance from the centroid alignment is less than the average pairing distance from the maximum-score alignment. Also evident is the common phenomenon of odd-even alternation; in some circumstances, the breaking of a pairing present in an estimating alignment frees up a nucleotide for participation in another advantageous pairing; thus, differences from an estimating alignment are encouraged to arise in pairs. The human nucleotide sequence with repeats removed is of length 1769, and likewise the rodent sequence is of length 1575. The centroid alignment has 1099 pairings, and the maximum-score alignment has 1123 pairings.

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

13

FIG. 2. The exact credibility distributions for the second pair of sequences in the human-rodent data set. The distribution for the centroid alignment is slightly left of that for the maximum-score alignment. This example demonstrates the occurrence of multiple peaks; often there are multiple distinct clusters of good-quality alignments, and an estimating alignment can fall within only one of them. The human nucleotide sequence with repeats removed is of length 1691, and likewise the rodent sequence is of length 2219. The centroid alignment has 205 pairings, and the maximum-score alignment has 214 pairings.

cluster’s centroid has 459 pairings; and 363 of the pairings belong to both cluster centroids. The pairings not shared arise from a slight offset in the two clusters’ centroid alignments, due to a weakly repeating pattern of approximately ten nucleotides in length. Ding et al. (2005), in the context of RNA secondary structures, observe and more deeply analyze the phenomenon of a maximum-score solution falling into an improbable cluster. Similar examples exist among the 4  20 pairs of fly sequences. Use of fly data set also permits us to compare differing evolutionary distances, because D. pseudoobscura, D. erecta, D. yakuba, and D. simulans have different evolutionary separations from Drosophila melanogaster (Table 9). In Figure 4, for each of the 20 gene ortholog groups, are plotted four points, one for the pairwise alignment of D. melanogaster sequence to sequence from each of the four species D. pseudoobscura, D. erecta, D. yakuba, and D. simulans. Plotted are the 95% credibility limits for the centroid and maximumscore alignments; i.e., a point has an x coordinate indicating the radius from the centroid alignment that is needed if there is to be a 95% probability that an alignment chosen from the Boltzmann distribution will occur within a hyper-sphere of that radius; and the point has a y coordinate for the 95% credibility radius for the maximum-score alignment. Perhaps the most striking feature in this plot is the wide ranges of both the centroid and the maximum-score alignment credibility limits. Even the 95% credibility limits for the sequence alignments of D. melanogaster to D. simulans, the two most closely related species, range from nearly zero base pair differences to over 300 base pair differences. Also, notice that the ranges of the credibility limits of the alignment of D. melanogaster sequence with the other four species’ sequences broadly overlap, indicating that, on average, the credibility limits only modestly increase as the paired species become more divergent. Further, observe that most of the points are above the y D x diagonal, indicating that the radius (i.e., the error bar) is tighter for the centroid alignment than it is for the maximumscore alignment. Theory guarantees that the mean pairing distance to the centroid alignment will be lower

14

NEWBERG AND LAWRENCE

FIG. 3. The exact credibility distributions for the fourth pair of sequences in the human-rodent data set. The distribution for the centroid alignment is strongly to the left of that for the maximum-score alignment. The order of the peaks for the two estimating alignments is inverted, indicating that the centroid and maximum-score alignments belong to distinct clusters of high-quality alignments; the centroid alignment falls within the larger cluster. The human nucleotide sequence with repeats removed is of length 1677, and likewise the rodent sequence is of length 1666. The centroid alignment has 438 pairings, and the maximum-score alignment has 450 pairings.

TABLE 9.

A P HYLOGENETIC T REE

OF

70 66

 D. melanogaster

 54

802

D ROSOPHILA S PECIES

 D. simulans

 111

 D. yakuba

 39 596

102

 D. erecta

 D. pseudoobscura

Tree is demonstrating the evolutionary separation of the four other species from D. melanogaster. The branches are labeled with the expected number of synonymous substitutions per 1000 codons (Pollard, 2005; Drosophila 12 Genomes Consortium, 2007).

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

15

FIG. 4. For pairwise alignments of D. melanogaster sequences with each of the orthologous sequences in D. pseudoobscura, D. erecta, D. yakuba, and D. simulans, for each of 20 intergenic regions—80 plotted points in total—we have plotted the 95% credibility limit for the centroid alignment (x coordinate) and the 95% credibility limit for the maximum-score alignment (y coordinate). The sequence lengths varied from 139 to 1000 with a mean of 634. The fact that most points lie above the y D x diagonal indicates that 95% credibility limits (i.e., error bars) are tighter for the centroid alignments. We see no apparent pattern by which to differentiate the species-specific sets of points.

than the mean pairing distance to the maximum-score alignment, but it makes no such guarantee for fixed percentile credibility limits (Webb-Robertson et al., 2008). Because statistical significance, as measured by p-value, is often used as an indicator of the reliability of a pairwise sequence alignment, we compare significance with credibility in Figure 5. For the 80 pairwise alignments of D. melanogaster sequences with the four other fly species, we depict the relationship of significance to the relative 95% credibility limit, where the relative 95% credibility limit is the ratio of the 95% credibility limit to the number of pairings in the maximum score alignment, and thus is an indication of the fraction of the maximum score alignment that is not reliable. In the first panel of Figure 5, which depicts all sequence pairs for which the p-value is greater than 10 7 (and includes all sequence pairs for which the relative credibility limit exceeds 100%), notice that credibility limits are quite variable and are frequently quite high, and that there is little or no relationship between relative credibility and statistical significance. In the second panel of Figure 5, which depicts all sequence pairs for which the relative credibility limit is less than 100% (and includes all sequence pairs for which the p-value is less than 10 7 ), notice that relative credibility limits are variable and can be non-negligible even in cases of p-values with extreme significance. Although relative credibility continues to exhibit sizable variability, the mean relative credibility does diminish gradually with diminishing p-value. Statistical significance calculations were made using the technique of Newberg (2008b).

4. DISCUSSION Results from our exact credibility limit computations for pairwise local sequence alignments indicate that, in many cases, these distributions are not unimodal, and we see evidence that this deviation from “pretty” results is quite general. In such situations, the choice of a single solution does not capture the

16

NEWBERG AND LAWRENCE

FIG. 5. Relative credibility versus significance. For each of the 80 pairwise alignments of fly sequence data, we have plotted the relative 95% credibility limit (the ratio of the 95% credibility limit for the maximum score alignment to the number of pairings in the maximum score alignment) versus statistical significance, as measured by p-value. In the first panel, with p-value of 10 7 , the relative credibility values are widely variable and exhibit little or no relationship to p-values. In the second panel, all remaining points are plotted; there is a downward trend of mean relative credibility with diminishing p-value. However, near any given p-value there is high variability in the relative credibility among the alignments; and this is true even for the extremely significant p-values of  10 100. We conclude that it is unreasonable to assume that an alignment is credible based simply upon the fact that its p-value is strong.

EXACT CALCULATION OF DISTRIBUTIONS ON INTEGERS

17

wide diversity of the solution space. Instead, a sample of solutions, from the probability distribution over the ensemble of all solutions, can be used to represent the solution space as a whole. For instance, the probability that a solution from the entire ensemble will have a particular feature can be estimated as the fraction of the solutions in the representative sample that have that feature. Ding et al. (2005) find that a sample size of 1000 is sufficient for many computations for RNA secondary structure. Even if only a single estimating solution is required, exact credibility limit calculations allow scientists to readily appreciate the quality of such a solution. Experiments or subsequent computational analyses connected to data sets that yield solutions with tight credibility limits can be treated as reliable since the data indicate that only small deviations from the estimating alignment are likely. Thus, subsequent analyses dependent upon such alignments can be prioritized ahead of experiments and computational analyses connected to data sets that demonstrate inferior credibility limits. For instance, the 95% credibility limit for the Figure 1 centroid solution is 182 nucleotides, which is 182=1099  17% of the number of pairings in the centroid solution. The input sequence lengths are 1769 and 1575 nucleotides. The corresponding limits for Figures 2 and 3 are 114 (114=205  56% of centroid solution pairings, with input sequence lengths 1691 and 2219) and 160 (160=438  37% of centroid solution pairings, with input sequence lengths 1677 and 1666). These numbers can be used in prioritizing experiments on these sequence pairs. The use of Fourier transforms significantly reduces memory requirements for the computation of an integer distribution, making feasible calculations that previously required a prohibitive amount of memory. Furthermore, the necessary computations for the integer distribution calculation can, in many circumstances, be done in parallel, even when an underlying algorithm cannot easily be parallelized; thus, there may be little or no run-time penalty for the computation of an integer distribution. In the example described here we intentionally chose a conservative scaling factor (i.e., a low temperature) that puts strong emphasis on the highest scoring alignments. While this choice makes the point that there is considerable uncertainly in alignments even under very conservative assumptions, this choice yields unrealistic features. For example, using the recipe for conversion of alignment scores to pair-HMM parameters from Durbin et al. (2006, pp. 98–99) this choice of temperature gives the expected lengths of indels as the unrealistically short 1.18 nucleotides. (However, note that Durbin et al. [2006] cautions the reader that scores that work well with maximum-score algorithms may not yield reasonably valued pair-HMM parameters.) For poor significance values (first panel of Figure 5) there is little or no relationship between relative credibility limits and significance; in particular, some pairwise alignments have a reasonably low credibility limit despite having inferior p-values. For better significance values (second panel of Figure 5) we see a downward trend of mean relative credibility with diminishing p-value, but several alignments exhibit poor relative credibility despite having extremely good p-values. Thus, we caution that the quality of the p-value of an alignment is not a strong indicator of the reliability of that alignment. This should not be too surprising; interval testing and hypothesis testing are solutions to different problems. For most of these calculations, numerical underflow is an issue. See Section B of the Supplementary Materials for some information on how we addressed this issue.

5. CONCLUSION We have presented a memory-efficient approach for calculation of the exact distribution for integer scores and, in the context of pairwise local sequence alignments, we have applied it to find the alignment score distribution and Bayesian confidence limits. We have discovered that these distributions of credibility limits are often not unimodal, and we hypothesize that an analysis of their features will lead to a deeper understanding of the underlying science.

ACKNOWLEDGMENTS We thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center for the computing resources to make these calculations. This research was supported by the National Institutes of Health/National Human Genome Research Institute (grants K25 HG003291 to L.A.N. and R01 HG001257 to C.E.L.), and by the U.S. Department of Energy (grant DE-FG02-04ER63924 to C.E.L.).

18

NEWBERG AND LAWRENCE

DISCLOSURE STATEMENT No competing financial interests exist.

REFERENCES Carmack, C.S., McCue, L.A., Newberg, L.A., et al. 2007. PhyloScan: identification of transcription factor binding sites using cross-species evidence. Algorithms Mol. Biol. 2, 1. Available at: http://dx.doi.org/10.1186/1748-7188-2-1. Carvalho, L.E., and Lawrence, C.E. 2008. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. USA 105, 3209–3214. Available at: http://dx.doi.org/10.1073/pnas.0712329105. Ding, Y., Chan, C.Y., and Lawrence, C.E. 2005. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11, 1157–1166. Available at: http://dx.doi.org/10.1261/rna.2500605. Ding, Y., Chan, C.Y., and Lawrence, C.E. 2006. Clustering of RNA secondary structures with application to messenger RNAs. J. Mol. Biol. 359, 554–571. Available at: http://dx.doi.org/10.1016/j.jmb.2006.01.056. Drosophila 12 Genomes Consortium. 2007. Evolution of genes and genomes on the drosophila phylogeny. Nature 450, 203–218. Available at: http://dx.doi.org/10.1038/nature06341. Durbin, R., Eddy, S., Krogh, A., et al. 2006. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. Karlin, S., and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268. Available at: http://dx.doi.org/10.1073/ pnas.87.6.2264. Miyazawa, S. 1995. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 8, 999–1009. Available at: http://dx.doi.org/10.1093/protein/8.10.999. Moses, A.M., Chiang, D.Y., Pollard, D.A., et al. 2004. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 5, R98. Available at: http://dx.doi.org/10.1186/gb-2004-5-12-r98. Newberg, L.A. 2008a. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics 24, 1772–1778. Available at: http://dx.doi.org/10.1093/bioinformatics/btn308. Newberg, L.A. 2008b. Significance of gapped sequence alignments. J. Comput. Biol. 15, 1187–1194. Available at: http://dx.doi.org/10.1089/cmb.2008.0125. Pearson, W.R. 1991. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the SmithWaterman and FASTA algorithms. Genomics 11, 635–650. Available at: http://dx.doi.org/10.1016/0888-7543(91) 90071-L. Pollard, D.A. 2005. Drosophila phylogenies. Available at: http://rana.lbl.gov/dan/trees.html. Smit, A.F.A., Hubley, R., and Green, P. 1996–2004. RepeatMasker Open—3.0. Available at: www.repeatmasker.org. Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl. Biosci. 5, 89–96. Available at: http://dx.doi.org/10.1093/bioinformatics/5.2.89. Waterman, M.S. 1995. Introduction to Computational Biology. Maps, Sequences and Genomes. Chapman & Hall/CRC, Boca Raton, FL. Webb-Robertson, B.-J.M., McCue, L.A., and Lawrence, C.E. 2008. Measuring global credibility with application to local sequence alignment. PLoS Comput. Biol. 4, e1000077. Available at: http://dx.doi.org/10.1371/journal. pcbi.1000077. Wilf, H.S. 2006. generatingfunctionology, 3rd ed. A.K. Peters Ltd., Wellesley, MA.

Address reprint requests to: Dr. Lee A. Newberg Department of Computer Science Rensselaer Polytechnic Institute 110 8th Street Troy NY 12180 E-mail: [email protected]

Suggest Documents