Statistics of large scale sequence searching


Rainer Spang and Martin Vingron

Deutsches Krebsforschungszentrum (DKFZ) Theoretische Bioinformatik Im Neuenheimer Feld 280 D-69120 Heidelberg, Germany

Abstract

Motivation: Database search programs such as FASTA (Lipman & Pearson, 1985; Pearson & Lipman, 1988), BLAST (Altschul et al., 1990), or a rigorous Smith-Waterman algorithm (Smith & Waterman, 1981) produce lists of database entries, which are assumed to be related to the query. The computation of statistical significance of similarity scores is well established for single pairs of sequences and using purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a certain score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases such as the distribution of sequence length and effects induced by frequently repeated sequence patterns need to be taken into account.

Results: We investigated the SWISS-PROT protein database release 31.0 (Bairoch & Apweiler, 1996) running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this we evaluate the statistical significance of scores in the context of a database search by a contrasting semi-random model. This model enhances purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.

Contact: [email protected] [email protected]

Introduction

Searching sequence databases is common practice in almost all molecular biology laboratories. The search algorithms identify those entries that share at least a segment of significant similarity to the query. Currently, the most accurate way to search a set of sequences for local similarities is to calculate the optimal local alignment between the query and each of the sequences in the set using the Smith-Waterman algorithm. Alignment scores are computed as a measure of similarity (Smith & Waterman, 1981). Since this procedure is extremely time consuming, much faster heuristic programs have been developed that approximate the Smith-Waterman result. The most common of these search tools are the FASTA (Lipman & Pearson, 1985; Pearson & Lipman, 1988) and BLAST (Altschul et al., 1990) programs.

Striking resemblance of sequences is assumed to have an evolutionary background. But due to the large number of comparisons performed, random similarities frequently score higher than those of distantly related sequences. Within a twilight zone we observe a superposition of weak biological signals and large-deviations behavior of random variables. Thus it is important to evaluate the statistical significance of a certain score. We can calculate the distribution of scores that occur when aligning two sequences of independent identically distributed (i.i.d.) random variables (Arratia et al., 1986; Arratia et al., 1988; Karlin et al., 1990; Dembo & Karlin, 1991a; Dembo & Karlin, 1991b; Waterman & Vingron, 1994b). In the context of single sequence pair comparisons it has already proven very successful to evaluate actual results as if they were sampled from this distribution of random scores (Karlin & Altschul, 1990; Waterman & Vingron, 1994a; Altschul & Gish, 1996). Crucial to this approach is the question up to which score level random noise rises, rather than down to which degree of sequence similarity evolution is visible.

Within a database search tens of thousands of comparisons are calculated. A score that was considered significant for a single alignment can be meaningless in the context of a database search. Scores resulting from genuine evolutionary relationship obviously remain constant, while their credibility decreases as the database grows. The first question we study in this paper is how to adjust p-values of sequence similarity in view of the multiple comparisons performed. The analytical framework describes the distribution of scores when searching random data. It is clear that these random models cannot accurately predict the level of random noise induced by searching real sequence data. Hence the next questions we address are: How do individual features of real sequence databases affect the distribution of random scores and how can they be incorporated into the random models determining the significance of scores? We present two analytical approaches to the first question. The second question seems to be too complex to be understood analytically. Therefore we tackle it by running extensive simulations of multiple sequence comparison experiments on real data.

The outline of this paper is as follows. The section Methods discusses first the single- and then the multi-trial context of alignment statistics. The section Results gives a detailed description of our simulation experiments. We put forward the concept of the effective size of a database in order to improve the accuracy of p-values and develop an efficient method for its estimation.

Methods

The single-trial context. We briefly recall the basics of pairwise alignment statistics for the non-gap case. For two sequences $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ the optimal alignment score not allowing for gaps is defined as

$$H = \max_{i,j,N} \sum_{k=0}^{N-1} S(x_{i+k}, y_{j+k}) \tag{1}$$

where the maximum is taken over all possible starting points $(i, j)$ of an alignment and all possible alignment lengths $N$. $S(x, y)$ denotes the score for aligning residue $x$ with residue $y$. In the classical setup for local alignment statistics we assume that both sequences are i.i.d. A necessary assumption for the results below is that the expected score per letter is negative (Arratia et al., 1988; Karlin & Altschul, 1990). This condition holds for the PAM (Dayhoff et al., 1983) or the BLOSUM (Henikoff & Henikoff, 1992) matrix families. Dembo & Karlin (1991a) prove that

$$\lim_{n \to \infty} \frac{H}{\log(n^2)} = \frac{1}{\lambda} \tag{2}$$

holds with probability 1. The limit can be calculated analytically. This strong law implies that the random alignment score grows on a logarithmic scale with the length of the sequences.

An approximate distribution for the score $H$ that arises from aligning two random sequences of length $n$ and $m$ respectively can be derived. Dembo & Karlin (1991b) show analytically that, for $m$ and $n$ large enough, the distribution of $H$ can be well approximated by

$$P[H \ge t] \approx 1 - \exp\left(-\gamma\, m\, n\, e^{-\lambda t}\right). \tag{3}$$

The scale parameter $\lambda$ and the location parameter $\gamma$ can be calculated by explicit formulae (Dembo & Karlin, 1991a). Using the notation $X \approx Y$ to signify that the distribution of $X$ can be well approximated by that of $Y$, we get

$$H \approx \frac{G}{\lambda} + \mu \tag{4}$$

where

$$\mu = \frac{\log(\gamma\, n\, m)}{\lambda}$$

and $G$ is a standard extreme value distributed random variable (Gumbel, 1958), i.e.

$$P[G < t] = \exp\left(-e^{-t}\right).$$

Of course, by (1) alignment scores are discrete and hence cannot perfectly fit an extreme value distribution. Even for continuous scoring schemes on infinite alphabets the extreme value distribution assumption holds only in the limit for sequence lengths going to infinity. (For the precise mathematical results the reader is referred to Dembo et al., 1994.) Hence alignment scores cannot be modeled by extreme value statistics with absolute precision. Nevertheless, it is an extremely helpful tool.

The distribution of scores clearly depends on the length of the sequences that are compared. In database searches, long sequences in the database tend to have higher scores just by chance. To compensate for that one defines the length adjusted scores

$$A_i = H_i - \frac{\log(n_i)}{\lambda} \tag{5}$$

where $H_i$ is a score obtained by a sequence of length $n_i$. The distribution of these length adjusted scores is independent of the length of the involved database entries. Sorting by length adjusted score yields a new ranking of the high scoring sequences, not biased by sequence length. It is equal to the one obtained if sequences are sorted by p-value. While both the location parameter $\gamma$ and the growth parameter $\lambda$ are necessary to calculate p-values, $\lambda$ alone is sufficient to obtain the length corrected ranking.

It is a matter of choice whether to evaluate sequence similarity by plain scores or by p-values. Since scores from related pairs of sequences tend to be exceeded by those from long but unrelated entries, length adjustment helps to improve the selectivity of database searches (Waterman & Vingron, 1994a). On the other hand, if we search for a segment that may be included in both short and long sequences, the effect of length adjustment is troubling and plain scores are more appropriate. However, if we search on a large scale, both high scores and small p-values will be likely to arise by chance only. In the rest of the paper we focus on plain scores.

Results

Simulations with real data. We now address the question whether the analytical results of the previous section need to be adjusted when searching real databases. We run random queries against real databases and sample all optimal alignment scores obtained by 1000 random queries. In that way non-random hits are totally excluded from the sample, whereas individual features distinguishing real databases from typical realizations of i.i.d. random sequences can still be studied in their effects on the distribution of scores. We generate 1000 independent random queries of length $m = 350$. The residues are chosen independently according to an average amino acid distribution as described in McCaldon & Argos (1988). Each query is aligned to all entries of the SWISS-PROT database release 31.0 using the Smith-Waterman algorithm. We use the PAM 250 scoring matrix and very high gap penalties that do not allow gaps at all. Using the much faster programs FASTA or BLAST is inappropriate for studying random scores, since they fail to detect all similarities on weak significance levels reliably.
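The semi-random simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for the PAM 250 matrix (chosen so that the expected score per letter is negative), residues are drawn uniformly unless weights are supplied, and the function names are not taken from the paper. The inner routine computes the gap-free local score of equation (1), floored at zero for the empty alignment.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    # Hypothetical stand-in for PAM 250 (assumption): with 20 equally likely
    # letters the expected score per letter is 5/20 - 19/20 < 0.
    return 5 if a == b else -1


def random_sequence(length, rng, weights=None):
    # i.i.d. residues; weights=None means a uniform distribution (assumption,
    # the paper uses average amino acid frequencies instead).
    return "".join(rng.choices(ALPHABET, weights=weights, k=length))


def ungapped_local_score(x, y, score=toy_score):
    # Best-scoring gap-free segment pair, as in equation (1).
    # cur[j] holds the best score of a segment pair ending at (x[i], y[j-1]);
    # resetting to 0 corresponds to starting a new alignment.
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            cur[j] = max(0, prev[j - 1] + score(a, b))
            if cur[j] > best:
                best = cur[j]
        prev = cur
    return best


def semi_random_search(database, n_queries, query_length, seed=0):
    # One maximal score per random query, taken over all database entries.
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_queries):
        query = random_sequence(query_length, rng)
        maxima.append(max(ungapped_local_score(query, entry) for entry in database))
    return maxima
```

With a list of real protein sequences as `database`, calling `semi_random_search(database, 1000, 350)` would mirror the setup of the text, although the pure-Python loop is far too slow for a database of SWISS-PROT size and serves only to make the procedure explicit.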

The multi-trial context. We now address the problem of multiple comparison experiments. We want to determine the significance of a certain similarity in the context of a database search. Let $H^{max}$ be the maximum of $N$ random variables $H_i$, where $H_i$ is the score obtained from the $i$-th entry of the database. Let $n_i$ be the length of the $i$-th entry. Using the theory of pairwise sequence alignment statistics and assuming that the $H_i$ are independent variables we get

$$P[H^{max} < t] = P\left[\bigcap_{1 \le i \le N} \{H_i < t\}\right] = \prod_{i=1}^{N} P[H_i < t].$$

Hence

$$\log P[H^{max} < t] \approx \sum_{i=1}^{N} -\gamma\, n_i\, m\, e^{-\lambda t} = -\gamma\, m\, L\, e^{-\lambda t} \tag{6}$$

where $L$ is the length of the concatenation of all sequences in the database and $m$ denotes the length of the query. If we interpret the database as the concatenation of all its entries, we will clearly obtain the same distribution for the score when aligning the query to this long artificial sequence. From (6) we obtain

$$H^{max} \approx \frac{G}{\lambda} + \mu_0 + \frac{\log(L)}{\lambda} \tag{7}$$

where $\mu_0 = \log(\gamma m)/\lambda$. Hence, if we double the size of the database, random scores will increase by $\log(2)/\lambda$ score points on average. Note that in equation (7) the parameter $\lambda$ is both the scale parameter of the extreme value distribution and the parameter describing the growth of random noise with the total size of the database. Crucial to this approach is that it considers the database search as one experiment. The probability of finding a segment in the database providing a certain score is determined regardless of the length of the sequence in which it is found.
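As a small numerical illustration of the multi-trial argument, the following Python sketch evaluates the pairwise tail probability, the database-wide tail probability of equation (6), and the shift in location implied by equation (7). The function names are hypothetical, and the values of `lam` and `gamma` must be supplied by the caller; nothing here reproduces parameter values from the paper.

```python
import math


def pairwise_p_value(t, m, n, lam, gamma):
    # P[H >= t] for one query of length m against one entry of length n:
    # 1 - exp(-gamma * m * n * exp(-lam * t)).
    return -math.expm1(-gamma * m * n * math.exp(-lam * t))


def database_p_value(t, m, entry_lengths, lam, gamma):
    # P[H^max >= t] from equation (6); only the total length L matters.
    total_length = sum(entry_lengths)
    return -math.expm1(-gamma * m * total_length * math.exp(-lam * t))


def expected_shift(old_length, new_length, lam):
    # Equation (7): the location of H^max moves by log(L_new / L_old) / lam.
    return math.log(new_length / old_length) / lam
```

Multiplying the per-entry probabilities `1 - pairwise_p_value(...)` over all entries gives the same result as `1 - database_p_value(...)`, which is the independence argument behind (6); `expected_shift(L, 2 * L, lam)` returns `log(2) / lam`, the increase quoted for a doubled database.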

Fig. 1. Comparison of linearized theoretical and empirical distribution functions. The left curve is a log-log-transformation of the empirical distribution function induced by screening random queries against the SWISS-PROT database. The right curve is a log-log-transformation of the theoretical distribution function when modeling the database as one long sequence. The steps at the right end of the empirical curve are due to sparse data. (Plot "Redundancy & Edge Effects": log (log (P[Score>t])) against t, with curves labeled Prediction and Simulation.)

The maximal scores $H_s^{max}$ of each of the 1000 searches are sampled and their empirical distribution function is derived. This data is further referred to as the optimal sample. Note that each score arises from the comparison of a typical i.i.d. random sequence and the database, which is the same in all comparisons. Thus we have statistically equal prerequisites for all comparisons, which is essential for our results. We call such a simulation a semi-random simulation. Using real query sequences leads to a much more complex behavior of non-genuine scores (see Collins et al., 1988; Mott, 1992; Waterman & Vingron, 1994a).

The solid curve in Fig. 1 is a log-log-transformation of the empirical distribution function for the optimal sample. The dashed line to the right is the log-log-transformation of the distribution function of an extreme value distribution using the parameters $\lambda$ and $\gamma$ from pairwise alignment and inserting the length of the query and the total number of residues in the database as length parameters. Equation (7) suggests that both distributions are equal. However, Fig. 1 demonstrates that this assumption does not hold. In Fig. 1 three important properties of semi-random scores can be observed:

• Applying a log-log-transformation to the empirical distribution function of maximal scores in the semi-random simulation yields a nearly straight line. Hence the data can be well described by an extreme value distribution and the associated extrapolation of probabilities for high scores is valid.

• Both distribution functions are parallel. This means that the scale parameter $\lambda$ remains unchanged in our semi-random simulations.

• The locations of the two curves are different. The theoretical distribution overestimates the noise caused by the data.

Hence, when passing from purely random models to the semi-random model, extreme value statistics is applicable, we need to adjust the location parameters, and the effect on the scale parameters is marginal. Although it is hard to name all possible reasons for this discrepancy, we mention two possible explanations that are likely to contribute to it. Both of the issues discussed below can be expected to reduce the noise level in the semi-random simulation and could sum up to the observed discrepancy between the empirical and the theoretical distribution.

In (Altschul & Gish, 1996; Waterman & Vingron, 1994a) the problem of edge effects is pointed out. The basic idea is that a local alignment does not exist at a point, but has a certain length. Thus the end regions of both sequences are not suitable as starting points for a high scoring alignment. For short sequences these effects are clearly observable. If we interpret the database as a single huge sequence, edge effects will sum up because alignments do not start in one entry and end in the next.

Fig. 2. Displaying the database as the concatenation of all its entries (..., seq i, seq i+1, seq i+2, ...). The end region of a sequence is not suitable as a start point of a high scoring alignment.

Random similarity results from the large amount of sequence data being searched. More precisely, it results from the large amount of different sequences screened. However, sequence databases include large families of homologous sequences sometimes showing very high sequence similarity. Since the possibilities for high scoring alignments within such a group do not simply grow with the sequence length, summing up their lengths overestimates the total number of the starting points for different alignments. Thus, in real databases the total database length overestimates the amount of distinct patterns.

As a consequence of the inability to quantify the observed discrepancy analytically we focus on the semi-random simulation. Actual scores can be judged in contrast to the distribution of $H_s^{max}$ instead of judging them in a purely random model.

The effective size of a database. In order to achieve a practical refinement of the statistics reflecting our empirical observations, we compute an effective size of a database. The effective size of a database is defined as the length of a single random sequence producing the same distribution of random scores as the simulated optimal sample does. Solving equation (4) for one of the length parameters yields

$$L_e = \frac{1}{\gamma\, m} \exp(\lambda\, \mu_s)$$

where $\lambda$ and $\gamma$ are parameters taken over from pairwise alignment and $\mu_s$ has to be estimated from the sample of optimal scores. We estimate $\mu_s$ by linear regression on a log-log-transformation of the empirical distribution function. By the definition of the effective size the maximal simulated score $H_s^{max}$ satisfies

$$H_s^{max} \approx \frac{G}{\lambda} + \mu_0 + \frac{\log(L_e)}{\lambda}.$$
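The estimation of the effective size can be sketched as follows. This is one possible reading of the regression step, not necessarily the authors' exact procedure: the slope of the log-log-transformed empirical distribution function is fixed to $-\lambda$ (the text reports that the scale parameter is unchanged), so the least-squares fit reduces to averaging the intercept. The plotting positions $k/(n+1)$ and the function names are assumptions made for this illustration.

```python
import math


def estimate_mu_s(maxima, lam):
    # Under P[H < t] = exp(-exp(-lam * (t - mu_s))) the transform
    # y = log(-log F(t)) is the line y = -lam * t + lam * mu_s.
    # With the slope fixed to -lam (assumption), the least-squares
    # intercept is the mean of y + lam * t.
    scores = sorted(maxima)
    n = len(scores)
    total = 0.0
    for k, t in enumerate(scores, start=1):
        f = k / (n + 1)                # plotting position, avoids log(0)
        y = math.log(-math.log(f))     # log-log-transformation
        total += y + lam * t
    return total / (n * lam)


def effective_size(mu_s, m, lam, gamma):
    # L_e = exp(lam * mu_s) / (gamma * m), i.e. equation (4) solved
    # for the database length.
    return math.exp(lam * mu_s) / (gamma * m)
```

Given the maxima of a semi-random simulation, `effective_size(estimate_mu_s(maxima, lam), m, lam, gamma)` yields an estimate of $L_e$ that can be compared with the total database length $L$.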

The effective size grows if the database grows, whereas the ratio between $L$ and $L_e$ proved to be almost constant unless the database changes dramatically. Using the estimated ratio we propose to compute p-values by

(Plot "Optimal and Overall Sample"; axis label: P[ Score ...)