The correlation error and finite-size correction in an ungapped ...

0 downloads 0 Views 167KB Size Report
Jan 18, 2002 - Yonil Park and John L. Spouge. ∗. National Center for ... Poisson approximation (Dembo et al., 1994; Iglehart,. 1972; Karlin and Dembo, 1992; ... gap of length g, the constants K and λ are now known to extraordinary accuracy ...
Vol. 18 no. 9 2002 Pages 1236–1242

BIOINFORMATICS

The correlation error and finite-size correction in an ungapped sequence alignment Yonil Park and John L. Spouge ∗ National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA Received on October 15, 2001; revised on January 18, 2002; accepted on March 12, 2002

ABSTRACT Motivation: The BLAST program for comparing two sequences assumes independent sequences in its random model. The resulting random alignment matrices have correlations across their diagonals. Analytic formulas for the BLAST p-value essentially neglect these correlations and are equivalent to a random model with independent diagonals. Progress on the independent diagonals model has been surprisingly rapid, but the practical magnitude of the correlations it neglects remains unknown. In addition, BLAST uses a finite-size correction that is particularly important when either of the sequences being compared is short. Several formulas for the finite-size correction have now been given, but the corresponding errors in the BLAST p-values have not been quantified. As the lengths of compared sequences tend to infinity, it is also theoretically unknown whether the neglected correlations vanish faster than the finite-size correction. Results: Because we required certain analytic formulas, our study restricted its computer experiments to ungapped sequence alignment. We expect some of our conclusions to extend qualitatively to gapped sequence alignment, however. With this caveat, the finite-size correction appeared to vanish faster than the neglected correlations. Although the finite-size correction underestimated the BLAST p-value, it improved the approximation substantially for all but very short sequences. In practice, the Altschul–Gish finite-size correction was superior to Spouge’s. The independent diagonals model was always within a factor of 2 of the true BLAST p-value, although fitting p-value parameters from it probably is unwise. Contact: [email protected]

against the sequences in a database, and its output ranks potential database matches by their E-values. For each database match, the corresponding E-value gives (under the random model) the expected number E of false positives with a smaller rank in the output. Because the random number N of false positives is theoretically Poisson distributed (Dembo et al., 1994), the E-value can be converted to a standard statistical p-value with the equation p = Pr [N  1] = 1 − Pr [N = 0] = 1 − e−E . If the E-value is small (and thus p ≈ E), the relationship between the query and the database sequence is unexpectedly close under the random model. On the other hand, if the E-value is large, the relationship is unremarkable. The current BLAST program adjusts its E-values with a heuristic finite-size correction. The finite-size correction is particularly important when the query or matching database sequence is short (Altschul and Gish, 1996), but its mathematical basis was mysterious until quite recently (Spouge, 2001). This paper uses computer simulations to probe the practical and theoretical foundations of the finite-size correction. To quantify the foregoing, consider a pair of random sequences A = A1 A2 . . . Am and B = B1 B2 . . . Bn , where m  n. The corresponding BLAST score provides a heuristic approximation to the Smith–Waterman score M (Smith and Waterman, 1981). The p-value P(M > y) that M exceeds a threshold y has a simple Poisson approximation (Dembo et al., 1994; Iglehart, 1972; Karlin and Dembo, 1992; Waterman and Vingron, 1994)

INTRODUCTION When investigating the unknown function of a protein or DNA sequence, the BLAST program (Basic Local Alignment Search Tool) is arguably the most important computational tool available (Altschul et al., 1990, 1997; Schaffer et al., 2001). BLAST matches a query sequence

so P(M > y) ≈ Q P O I (M > y). Equation (1) is an example of an extreme value distribution (Aldous, 1989; Galombos, 1978). The constants K and λ defining the distribution presently require computer simulation for their determination (Collins et al., 1988; Mott, 1992; Mott and Tribe, 1999; Smith et al., 1985). For the default parameters in protein–protein BLAST, the BLOSUM62 matrix and a cost C(g) = 11 + g for each

∗ To whom correspondence should be addressed.

1236

Q P O I (M > y) = 1 − exp[−K mn exp(−λy)],

(1)

Published by Oxford University Press

Finite-size effects in an ungapped alignment

Q F S E (M > y) = 1−exp[−K (m − l y )(n − l y ) exp(−λy)]. (2) In practice, the finite-size effect is particularly important for short query sequences (e.g. m ≈ l y or n ≈ l y ). Present implementations of Equations (1) and (2) rely heavily on their analogies with rigorous results for ungapped alignment statistics, coupled with careful and extensive computer simulations. In the ungapped alignment of very long random sequences, e.g. there is a theorem (Dembo et al., 1994) that the Poisson error εP O I =

Q P O I (M > y) − P(M > y) P(M > y)

(3)

from Equation (1) is negligible (i.e. under the theorem’s conditions, as m, n, y → ∞ together, ε P O I → 0). It is easy to believe that ε P O I for gapped alignment behaves similarly, and computer experiments confirm this analogy (Altschul et al., 2001). With such analogies in mind, our investigation of the finite-size effect is restricted to ungapped alignments, because our mathematical methods are not available for gapped alignments. Consider, e.g. the finite-size error εF S E =

Q F S E (M > y) − P(M > y) , P(M > y)

(4)

from Equation (2) and the corresponding finite-size correction to the simple Poisson approximation,

Q F S E (M > y)− Q P O I (M > y) . P(M > y) (5) In ungapped alignment, various mathematical theorems provide analytic expressions for Q F S E (M > y) and Q P O I (M > y) (Altschul and Gish, 1996; Karlin and Dembo, 1992; Spouge, 2001). The analytic expressions show that the finite-size correction κ F S E is negligible for long sequences (i.e. under the theorems’ conditions, as m, n, y → ∞ together, κ F S E → 0). Our investigation depends on the analytic expressions, so unless indicated otherwise, the following discourse implicitly restricts itself to ungapped sequence alignment. By the usual analogies, however, we expect that our results and conclusions extend qualitatively to gapped alignment statistics. κ F S E = ε F S E−ε P O I =

Sequence B A E E S L N R E A A F F V E R D Q

Sequence A

gap of length g, the constants K and λ are now known to extraordinary accuracy (Altschul et al., 2001). The finite-size correction to Equation (1) starts by considering a random gapped alignment with a score exceeding y. Such an alignment cannot start anywhere along the full length m of sequence A. On the contrary, if it has mean length l y , it must start on average within the first m − l y letters of the sequence (Altschul and Gish, 1996). This ‘finite-size effect’ suggests replacing mn in Equation (1) with (m − l y )(n − l y ), to give the finite-size Poisson approximation

T V S P S G Y K G I C L A P

1 0 1 1

0 0 0 0

0 0 0 0

1 0 2 1

0 3 0 0

0 0 4 0

0 0 0 4

0 0 0 0

1 0 1 1

1 1 1 2

0 0 0 0

0 0 0 0

0 4 0 0

0 0 4 0

0 0 0 4

0 0 0 0

0 0 0 0

16 15

1

1

0

2

0

1

0

4

1

2

0

0

0

0

0

4

0

14

1

1

1

1

0

0

0

0

5

2

0

0

0

0

0

1

3

13

0

0

0

0

0

0

0

0

0

2

9

7

0

0

0

0

0

12

0

0

0

0

0

1

3

0

0

0

0

4

5

0

3

0

1

11

1

0

0

1

0

0

0

3

1

1

0

0

3

5

0

4

0

10

0

0

0

0

3

0

0

0

2

0

2

1

4

1

3

0

2

9

0 0 2 1

0 0 0 1

0 0 0 0

0 0 1 1

0 6 0 0

0 0 6 0

0 0 0 6

0 0 0 0

0 0 2 1

0 0 2 3

0 2 0 0

0 2 0 0

0 2 2 0

0 0 2 1

0 0 0 2

0 0 0 0

0 0 0 0

8 7 6 5

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1

0

1

2

4 3

Fig. 1. Ungapped alignment for the two sequences. The figure shows two sequences of length m=14 and n=17. The local maximum scores shown were calculated for the PAM250 matrix from the SmithWaterman algorithm. The numbers outside the matrix on the bottom and the right are the enumeration D of the diagonals, as given in the Introduction. The overall local maximum score is M=9, which is on diagonal D=4 and is the sum of the three consecutive scores X (A5 , B9 ) + X (A6 , B10 ) + X (A7 , B11 ). The overall maximum score cell shares a ‘Y’ with every cell in its row and an ‘F’ with every cell in its column.

Having described the finite-size correction, we now turn to correlation errors. Figure 1 shows the local alignment matrix for the two sequences A = A1 A2 . . . Am and B = B1 B2 . . . Bn (Smith and Waterman, 1981). An ungapped local match between A and B is a set of consecutive letterpairs {(Ak , B D+k ) : k = I + 1, . . . , J }, where I , J , and D are integers subject only to the necessary restrictions 1  I + 1  J  m and 1  D + I + 1  D + J  n. Geometrically, an ungapped local match is confined to a single diagonal within the alignment matrix. The diagonal corresponds to the constant difference D = (D + k) − k, implicitly enumerating the diagonals from −(m − 1) to (n − 1) inclusive, from bottom to top. Assume that we have already chosen a scoring matrix X (e.g. the default BLOSUM62 matrix in protein–protein BLAST), so that we can assign a score X (a, b) to each letter pair (a, b). For each fixed diagonal D, we define   J  X (Ak , B D+k ) (6) M D = max 0, max I y) be the probability of M exceeding y in the independent diagonals model. The independent diagonals approximation Q(M > y) provides an estimate for the p-value P(M > y) from the independent sequences model. Because of neglected correlations, there is a correlation error εC O R R =

Q(M > y) − P(M > y) . P(M > y)

(8)

Note that for convenience of comparison, we have taken all errors relative to P(M > y). For ungapped alignments, there is a theorem (Dembo et al., 1994) that the correlation error εC O R R is negligible for long sequences (i.e. under the theorem’s conditions, as m, n, y → ∞ together, εC O R R → 0). It is theoretically unknown, however, whether the correlation error εC O R R vanishes faster than the finite-size correction κ F S E for long sequences (i.e. whether εC O R R /κ F S E → 0 as m, n, y → ∞). Indeed, the referee of a previous paper (Spouge, 2001) challenged one of us (JLS) to prove this. After all, if the finite-size correction κ F S E to BLAST is thought to be important, we should verify that the correlation error is smaller (i.e. that εC O R R  κ F S E ). In addition, recent progress on the independent diagonals model has been surprisingly rapid (Bundschuh, 2000). If the independent diagonals model continues to prove substantially more tractable than the independent sequences model, it becomes important to know the actual magnitude of the correlation error εC O R R . Our computer study addresses all of these issues. The outline of this paper is as follows. The Methods describe the calculation of the tail probabilities P(M > y), Q(M > y), Q P O I (M > y), and Q F S E (M > y). The Results examine ε P O I , ε F S E , and εC O R R ; compare ε F S E from variant formulas for the finite-size correction; and 1238

then compare εC O R R and κ F S E . Finally, the Discussion gives the implications of our results.

METHODS Computer code was written in C++ and compiled with the Microsoft Visual C++ 6.0 compiler. The computer had a single Intel Pentium III 600 MHz microprocessor and 260 MB RAM and employed the Microsoft windows 2000 operating system. We carried out our procedures for the Robinson amino acid frequencies (Robinson and Robinson, 1991) and for each of the three scoring matrices BLOSUM62, PAM120, and PAM250. The Monte Carlo estimation of the true p-value P(M > y) The true p-value P(M > y) was estimated from computer simulation. For each sequence length m from 30 to 300, we generated 30 000 pairs of random sequences of equal length (m = n) from the Robinson amino acid frequencies. We then applied the Smith–Waterman algorithm with infinite gap costs (C(g) = ∞ for all g) to force all optimal alignments to be ungapped. For each sequence pair, we recorded the optimal alignment score M. For every integer threshold y up to 40, we then counted the number of optimal alignment scores M exceeding y, divided it by iteration number N = 30 000, thereby obtaining a Monte Carlo estimate of P(M > y). An (asymmetric) 68% confidence interval was then calculated (Dwass, 1970, Equation (2), p. 510), corresponding approximately to c = 1 standard deviations in a standard normal distribution. The other, faster calculations used the same ranges for m = n and y. The analytic formula for the independent diagonals approximation Q(M > y) The independent diagonals approximation Q(M > y) can be determined analytically in two stages. First, we compute from exact formulas (given below) the probability Q(M D > y) that the diagonal maximum local score M D in Equation (6) exceeds y. This probability depends only on the number of letter pairs (Ak , B D+k ) on the diagonal D. If the diagonal contains i letter pairs, the probability is conveniently denoted by Q(M(i) > y). Assume that we have Q(M(i) > y) in hand. Now, note that the sequence alignment matrix contains exactly n − m+1 diagonals with length m and two diagonals with each length from 1 to m − 1. Under the independent diagonals model, Q(M  y) = [Q(M(m)  y)]n−m+1

m−1 

[Q(M(i )  y)]2 ,

i=1

(9) because the overall maximum is less than or equal to y, only if the same is true on each of the diagonals.

Finite-size effects in an ungapped alignment

Substitutions like Q(M  y) = 1 − Q(M > y) quickly yield

Q(M > y) = 1 − [1 − Q(M(m) > y)]n−m+1 m−1  × [1 − Q(M(i) > y)]2 . (10) i=1

We used the following exact calculation for Q(M(i) > y) (Daudin and Mercier, 1999). First, from the relevant random amino acid frequencies and scoring matrix X (a, b), compute the frequency distributions P(X = x) for the random scores in the alignment matrix, which are independent and identically distributed in the independent diagonals model. Then, let X n denote the nth random score encountered along the diagonal D of interest. Consider the following absorbing Markov chain {Sn : n = 0, 1, . . .} based on X n . The chain’s states are in a set {0, 1, . . . , y, }, it starts at S0 = 0, and it has a one-step transition matrix P that can be derived from the distribution P(X = x) and the following rules. From the state Sn = , it goes to Sn+1 = (the absorbing state corresponds to a diagonal local maximum M(i) > y). From other states Sn  = , it goes to Sn+1 = max{0, Sn + X n+1 } unless Sn + X n+1 > y, in which case M(i) > y, so it goes to Sn+1 = . Because the diagonal local maximum M(i) > y if and only if Si = , the probability Q(M(i) > y) is given by the element of the matrix P i corresponding to i-step transition from S0 = 0 to absorbing state Si = . The powers P i of P are readily obtained by repeated matrix multiplication.

The analytic formula for the simple Poisson approximation Q P O I (M > y) The simple Poisson approximation generalizes to Q P O I (M > y) = 1 − exp{−

P( Eˆ y ) mn}, Eα

(11)

where α is a renewal length and P( Eˆ y ) is the probability that the Markov chain Sn above reaches before returning to 0 (or equivalently, that the corresponding random sum goes above y before going back to or below 0). We used exact formulas for both Eα (Iglehart, 1972; Karlin and Dembo, 1992) and P( Eˆ y ) Feller, 1971, p. 363) in Equation (11), the formula for P( Eˆ y ) being based on a recursion using a Markov chain with two absorbing states, corresponding to 0 and y + 1. Because of the asymptotic equality P( Eˆ y ) ∼ Ce−λy for large y, Equation (11) is equivalent to Equation (1) with K = C/Eα, where exact formulas for C and λ are known (Iglehart, 1972; Karlin and Dembo, 1992). We therefore used Equation (11) instead of Equation (1), although for thresholds y  20, errors due to the asymptotic equality P( Eˆ y ) ∼ Ce−λy are negligible anyway.

The analytic formula for the finite-size Poisson approximation Q F S E (M > y) Altschul and Gish (1996) proposed the finite-size approximation Q F S E (M > y) in Equation (2). Spouge broadly generalized their correction and gave it some mathematical foundation (Spouge, 2001). In particular, his theorems show that in contrast to the Altschul–Gish correction in Equation (2), the variant formula   P( Eˆ y ) Q S P (M > y) = 1 − exp − mn − (m + n − 1) Eα  Eα 2 1 × l y − 12 −2 (12) Eα is asymptotically correct as m, n, y → ∞ under his theorems’ conditions. In Equation (12), l y was calculated using the Dembo and Karlin’s asymptotic formula, which is only asymptotically correct (Dembo and Karlin, 1993). Like Equation (11), Equation (12) avoids errors from the asymptotic approximation P( Eˆ y ) ∼ Ce−λy .

RESULTS Figures 2 and 3 display results for the BLOSUM62 scoring matrix. Other scoring matrices gave similar results. Figure 2 plots its relative errors against sequence length m = n for different values of the threshold y. As a convention, a sequence is ‘very short’ if it is shorter than the length range displayed in Figure 2. Otherwise, it is ‘short’, ‘long’, or ‘very long’, if its length is 60 or less, 60 or more, or 200 or more. Figure 2 shows that the magnitude of each relative error decreases to 0 as the sequence lengths increase (for threshold 39, this tendency is slightly obscured by Monte-Carlo fluctuations due to excessively small tail probabilities). For all but very short sequences, the Poisson error satisfied 0 < ε P O I < 2, the finite-size error satisfied −1 < ε F S E < 0, and |ε F S E | < 0.5ε P O I everywhere in Figure 2. In addition, the Altschul–Gish formula Equation (2) for ε F S E gave a smaller relative error than the Spouge formula Equation (12). For very long sequences, all relative errors were less than 0.1, with |ε F S E | < 0.1ε P O I . For all but very short sequences, the correlation error satisfied −0.1 < εC O R R < 0.6, with εC O R R > 0 everywhere in Figure 2 except for long sequences at the threshold y = 39. For the lengths displayed in Figure 2, Equations (3), (4), and (8) therefore yield the following observations. The simple Poisson approximation Q P O I (M > y) was larger than the true p-value P(M > y). The finite-size Poisson approximation Q F S E (M > y) overcorrected in the opposite direction, being smaller than P(M > y). For very long sequences, the data show that relative error in the Poisson approximation drops by almost a factor of 1239

Y.Park and J.L.Spouge

10

y=23

1.5

8

1

6

Ratio

Relative Errors

2

0.5 0

4 2

–0.5

0

–1 0

50

100

150

200

250

300

350

Sequence Length

–2 0.00001

0.0001

0.001

0.01

0.1

1

P(M>y)

Relative Errors

2

y=31

1.5 1 0.5 0 –0.5

Fig. 3. Plot of the ratio εC O R R /κ F S E of the correlation error and finite-size correction against the true p-value P(M > y), using the BLOSUM62 scoring matrix, for equal sequence lengths (m = n) 100 (), 200 () and 300 (). This figure uses some integer thresholds y as large as 60. The plots also give (in both X- and Ydirections) the corresponding 68% confidence intervals.

–1 0

50

100

150

200

250

300

350

greater than 1. If anything, the ratio tends to increase with increasing sequence lengths m = n.

Sequence Length

Relative Errors

2

y=39

1.5 1 0.5 0 –0.5 –1 0

50

100

150

200

250

300

350

Sequence Length

Fig. 2. Plot of relative errors against sequence length, using the BLOSUM62 scoring matrix, for the thresholds 23, 31 and 39. The plots show the Poisson error ε P O I (), the correlation error (), and the finite-size error ε F S E , using both the Altschul–Gish formula Q F S E (M > y) () and the variant Spouge formula QS P (M > y) (). The plots also give the corresponding 68% confidence intervals.

10 after applying the finite-size correction (i.e. |ε F S E | < 0.1ε P O I ). Finally, like the simple Poisson approximation Q P O I (M > y), the independent diagonals approximation Q(M > y) never differed from P(M > y) by a factor of more than about 2. Figure 3 shows plots of εC O R R /κ F S E against P(M > y) for selected increasing sequence lengths m = n. For fixed P(M > y) (which is the usual hypothesis in mathematically rigorous limit theorems), the ratio is usually not 1240

DISCUSSION In response to the theoretical question in the Introduction, Figure 3 shows that for long sequences, the correlation error εC O R R was comparable to the finite-size correction κ F S E , usually somewhat larger in magnitude. Thus, it is likely that for sequences of equal length, the correlation error does not vanish faster than the finite-size correction (i.e. the conjecture εC O R R /κ F S E → 0 as m, n, y → ∞ appears false in general). Asymptotic analysis is therefore unlikely to provide a complete theoretical justification for applying the finite-size correction. Despite this negative theoretical finding, Figure 2 shows the practical utility of applying finite-size corrections. For all but very short sequences (see the Results for definitions of ‘very short’, etc.), the finite-size approximation was substantially closer to the true p-value p = P(M > y) than the simple Poisson approximation. The Altschul– Gish finite-size approximation Q F S E (M > y) in Equation (2) was also superior to the Spouge’s approximation in Equation (12) for all but very long sequences. Interestingly, rigorous theory shows the Altschul–Gish approximation to be inferior to Spouge’s approximation for very long sequences (Spouge, 2001). Although Figure 2 does indeed display the theoretical inferiority, all approximations agreed with p so closely for long sequences that there was no practical difference among them. Unfortunately, for short sequences, no approximation (including the Altschul–Gish approximation) was close to p. In practice, BLAST therefore inflates the finite-size approxi-

Finite-size effects in an ungapped alignment

mation for short sequences with an ad hoc adjustment (S. Altschul, personal communication). The simple Poisson approximation Q P O I (M > y) always overestimated p, whereas the finite-size Poisson approximation Q F S E (M > y) always underestimated p, albeit only slightly for long sequences. Note, however, that the random model of independent sequences does not faithfully reproduce all aspects of a real database, e.g. sequence redundancy, varying sequence composition, etc (Altschul et al., 1994; Mott, 2000; Pearson, 1995, 1996). Overall, the ‘real aspects’ generally cause the statistical pvalues to be too small. By comparison, the modest underestimation due to the finite-size correction is completely harmless. For biological significance in BLAST protein searches, e.g. p  10−3 is sometimes mentioned as a practical statistical threshold (and similarly, p  10−6 for DNA searches). Our study therefore supports the present use of the finite-size correction in BLAST. Our most encouraging result was that the independent diagonals approximation Q(M > y) was always within a factor of 2 of the true p-value p = P(M > y) from the independent sequences model. It is known from theory that in ungapped alignment, the correlation error εC O R R is negligible for long sequences (i.e. that εC O R R → 0 as m, n, y → ∞). Theory therefore suggests that in ungapped alignment, the independent diagonals model should provide good approximations to p, and our results show quite quantitatively that in practice, the approximation is excellent. Out of necessity, our study restricted itself to ungapped sequence alignment, but it suggests some speculations about gapped sequence alignment. Random optimal gapped alignments are longer than their ungapped counterparts, so finite-size effects should be larger. Finite-size corrections are therefore probably more important in gapped alignments than in ungapped alignments, as has been noted before (Mott, 2000). On the other hand, the overall effect on correlation errors is extremely difficult to predict in gapped alignment. Our study suggests that the independent diagonals model should continue to provide practical approximations to p-values in gapped alignment, although studies addressing this issue more directly are desirable. Finally, one might be tempted to fit parameters for the independent sequences model (i.e. the constants K and λ in Equation (1)) from simulations using the independent diagonals model. This would be dangerously expedient, however, because the independent sequences model omits correlations, and the correlation error (whose analytic form is at present unknown) could be comparable to the finite-size correction.

REFERENCES Aldous,D. (1989) Probability Approximations Via the Poisson Clumping Heuristic. Springer, New York.

Altschul,S.F. and Gish,W. (1996) Local alignment statistics. In Doolittle,R.F. (ed.), Methods in Enzymology. Academic Press, London, 266, pp. 460–480. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403– 410. Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994) Issues in searching molecular sequence databases. Nature Genet., 6, 119–129. Altschul,S.F., Bundschuh,R., Olsen,R. and Hwa,T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29, 351–361. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Bundschuh,R. (2000) Preprint: asymmetric exclusion process and extremal statistics of random sequences. Collins,J.F., Coulson,A.F. and Lyall,A. (1988) The significance of protein sequence similarities. Comput. Appl. Biosci., 4, 67–71. Daudin,J.J. and Mercier,S. (1999) Exact distribution for the local score of independent and identically distributed random sequence. Comptes Rendus de L’Academie des Sciences Serie I— Mathematique, 329, 815–820. Dembo,A. and Karlin,S. (1993) Central limit-theorems of partialsums for large segmental values. Stochastic Processes and Their Applications, 45, 259–271. Dembo,A., Karlin,S. and Zeitouni,O. (1994) Limit distributions of maximal non-aligned two-sequence segmental score. Annals of Probability, 22, 2022–2039. Dwass,M. (1970) Probability and Statistics. Benjamin, New York. Feller,W. (1971) An Introduction to Probability Theory and its Applications, Vol. 1, Wiley, New York. Galombos,J. (1978) The Asymptotic Theory of Extreme Order Statistics. Wiley, New York. Iglehart,D.L. (1972) Extreme values in the GI/G/1 queue. Annals of Mathematical Statistics, 43, 627–635. Karlin,S. and Dembo,A. (1992) Limit distributions of maximal segmental score among Markov-dependent partial-sums. Advances in Applied Probability, 24, 113–140. Mott,R. (1992) Maximum-likelihood-estimation of the statistical distribution of Smith–Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59–75. Mott,R. (2000) Accurate formula for p-values of gapped local sequence and profile alignments. J. Mol. Biol., 300, 649–659. Mott,R. and Tribe,R. (1999) Approximate statistics of gapped alignments. J. Comput. Biol., 6, 91–112. Olsen,R., Bundschuh,R. and Hwa,T. (1999) Seventh International Conference on Intelligent Systems for Molecular Biology. In Lengauer,T. (ed.), Rapid assessment of extreme statisties for local alignment with gaps. pp. 211–222. Pearson,W.R. (1995) Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145–1160. Pearson,W.R. (1996) Effective protein sequence comparison. In Doolittle,R.F. (ed.), Methods in Enzymology. Academic Press, London, 266, pp. 227–258. Robinson,A.B. and Robinson,L.R. (1991) Distribution of glutamine

1241

Y.Park and J.L.Spouge

and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl Acad. Sci. USA, 88, 8880–8884. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-Blast protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005. Smith,T.F. and Waterman,M.S. (1981) Identification of common

1242

molecular subsequences. J. Mol. Biol., 147, 195–197. Smith,T.F., Waterman,M.S. and Burks,C. (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Res., 13, 645–656. Spouge,J.L. (2001) Finite-size correction to Poisson approximations of rare events in renewal processes. J. App. Prob., 38, 554–569. Waterman,M.S. and Vingron,M. (1994) Sequence comparison significance and Poisson approximation. Stat. Sci., 9, 367–381.