In the section b_{i+1}, ..., b_j there are at most k unmatched and/or mismatched nucleotides. Each of these b_h nucleotides is denoted by the triple (h,0,0) (in our example.
Volume 14 Number 1 1986
Nucleic Acids Research
An efficient string matching algorithm with k differences for nucleotide and amino acid sequences
Gad M. Landau, Uzi Vishkin^1 and Ruth Nussinov^2*
Department of Computer Science, School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel, ^1Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, 251 Mercer St., New York, NY 10012, USA and ^2Sackler Institute of Molecular Medicine, Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel
Received 8 July 1985
ABSTRACT
There are a few algorithms designed to solve the problem of the optimal alignment of one sequence, the pattern, of length m, with another, longer sequence, the text, of length n. These algorithms allow mismatches, deletions and insertions. Algorithms to date run in O(mn) time. Let us define an integer, k, which is the maximal number of differences allowed. We present a simple algorithm showing that sequences can be optimally aligned in O(k^2 n) time. For long sequences the gain factor over the currently used algorithms is very large.

1. Introduction
The problem of searching for homologies between sequences has been repeatedly addressed during the last few years. With the explosive growth of nucleotide (and amino acid) sequence data, there has been an urgent need for efficient algorithms which would compare the sequences. Although here we shall focus on nucleotide sequence comparisons, it is evident that the same algorithms can be used to compare protein sequences as well. Our algorithm (1,2) is most efficient in cases where we want to compare one sequence, the "pattern", of length m to another sequence, the "text", of length n, and n is much larger than m. Such situations arise often when a sequence is much longer than the pattern sequence, or if we align the pattern with several (or many) other sequences, whose combined lengths would then be n. This happens when we want to find the site of the optimal alignment between a section from one sequence and another, or when we compare one gene with several related genes, or with possible candidates for pseudogenes. We may also wish to compare the degree of potential relatedness between introns, or repetitive sequences.
* We have a preliminary version of the program and are currently working on an advanced one. When complete, it will be available for distribution upon mailing a request and tape to the first author.
© IRL Press Limited, Oxford, England.
A simple approach to finding good alignments was devised by Tinoco et al. (3) and by Maizel and Lenk (4). It constructs an m × n matrix and scores matches of minimal lengths. The resulting diagonals are then visually inspected. For homologous sequences an elongated, nearly consecutive set of diagonals is observed. Often, however, the viewer is uncertain which set of diagonals to choose. Needleman and Wunsch (5) devised an elegant algorithm for finding an optimal alignment for which the number of matches as penalized by insertion/deletion gaps is maximal. Sankoff (6), Sellers (7) and Dumas and Ninio (8) have further extended this algorithm by minimizing the measure of sequence discrepancy or by maximizing the measure of similarity. Korn et al. (9) have written a program for finding good local alignments. Sellers (10, 11) has also treated the local alignment problem and devised an algorithm for finding within any pair of sequences all locally-close subsequences. Goad and Kanehisa (12) have introduced into his procedure a better measure of the quality of the detected alignments. A weight has been assigned to each deletion or mismatch type and a search for all subsequences in which the density of the differences is less than a previously set value is carried out. The Goad-Kanehisa program is probably the most sensitive method to locate all local homology alignments between two sequences. A much faster, approximate procedure has been devised by Wilbur and Lipman (13). Nussinov has presented an efficient code searching for global DNA homology (14) which is based on a maximal matching algorithm (15, 16). Fickett (17) has greatly improved Sellers' algorithm by disregarding computationally irrelevant steps.
To date, the algorithms used run in O(mn) time. We define an additional parameter, k. k is the largest number of differences (i.e. mismatches, insertions, or deletions) allowed in the alignments of the pattern and the text. The algorithm presented here uses O(k^2 n) time. In addition to the large economy in time, the memory requirements of this algorithm are greatly reduced as well. Whereas in the classical method (3-17) the memory requirement is O(mn), here the required memory is O(m^2 + k^2). We shall next describe two algorithms for finding the minimal number of differences between two strings, proceed to outline the new algorithm and end with a review of the biological implications of such algorithms.

2. An Algorithm for Finding the Minimal Number of Differences Between Two Strings
Suppose we have two strings (i.e. nucleotide sequences): R = r_1 r_2 ... r_m and B = b_1 b_2 ... b_n. In our case each string is composed of an alphabet of four letters, A, C, G, T. On each of the strings there are three types of possible editing operations (i.e. mutations):
(i) deleting a character from position i to yield R = r_1 r_2 ... r_{i-1} r_{i+1} ... r_m;
(ii) inserting a character z to yield R = r_1 r_2 ... r_i z r_{i+1} ... r_m;
(iii) changing one character to yield R = r_1 r_2 ... r_{i-1} z r_{i+1} ... r_m.
These editing operations have been penalized. The problem is, then, what is the minimal cost of editing operations that will transform sequence R into sequence B. This cost is denoted D(R,B). To simplify the presentation each difference between the two sequences is assigned a penalty of one unit, i.e. C_CH = C_DE = C_IN = 1, where C_CH, C_DE and C_IN are the costs of changing, deleting or inserting nucleotides. Now the cost of transforming sequence R into sequence B is simply the count of the minimal number of differences involved. In matrix D sequence R is stretched along the y-axis (i = 1, ..., m) and sequence B (j = 1, ..., n) is along the x-axis. The D_{i,j} entry contains the attempted matching of r_1 ... r_i and b_1 ... b_j. Say we want to further inspect entries in the matrix. We can continue either by walking along a row (j → j + 1), by walking along a column (i → i + 1), or by increasing both parameters at the same time (i → i + 1, j → j + 1) and sliding along a diagonal d.
Each of the above steps has biological implications. Progressing along a row implies that the text (sequence B) has some nucleotides which the pattern (sequence R) lacks. That is, sequence B contains an insertion. Walking along a column in the D matrix implies that B lacks some nucleotides which are present in R, i.e., there is a deletion in the text. Continuation along a diagonal involves considering longer sequences with either a match or a mismatch. It is clear that advancing along a diagonal consisting of elements D_{i,j} with j - i = d is preferable to advancing along a row or a column. Consider nucleotides r_i and b_j, with r_i ≠ b_j. Continuing along only one sequence, say R, we may at best match the previously unmatched last nucleotide of B. That is, r_{i+1} = b_j and r_i is looped out of the alignment. Suppose we continue further along the column and consider now the pair r_{i+2}, b_j. For the purpose of obtaining an optimal alignment of R and B it is immaterial whether r_{i+2} and b_j match or not. If r_{i+2} ≠ b_j, then the previous alignment is kept and r_{i+2} dangles at the end. If r_{i+2} = b_j, then we may keep this match, but now in addition to r_i, r_{i+1} is looped out as well. This state of affairs is bound to continue. Either the loop grows and/or the unmatched dangling end grows. The same situation holds for going along a row, i.e. keeping r_i constant and increasing j. The entry D_{i,j}
in the matrix is then the minimal cost of the editing operations required to transform r_1 ... r_i to b_1 ... b_j (see Fig. 1).

[Figure 1: matrix of D_{i,j} values, with R along the vertical axis and B along the horizontal axis.]
Fig. 1. The D_{i,j} matrix computed for the strings R (of length m = 12) and B (of length n = 15). The match ending at (12,12) = 3 on the m'th row thus yields the best alignment. Here C_CH = C_DE = C_IN = 1. Computation of this matrix uses O(mn) time.
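As an illustration of the matrix D described above, the following minimal Python sketch fills it with unit costs (C_CH = C_DE = C_IN = 1) in O(mn) time. The function name `edit_matrix` is hypothetical, and the boundary values D_{0,j} = j and D_{i,0} = i are an assumption (the usual choice when two whole strings are compared). It is a sketch of the classical dynamic program, not the authors' Fortran code.

```python
def edit_matrix(r: str, b: str) -> list[list[int]]:
    """Return the (m+1) x (n+1) matrix D, where D[i][j] is the minimal
    number of unit-cost differences between r[:i] and b[:j]."""
    m, n = len(r), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions: an empty prefix of one string is
    # reached only by inserting or deleting every symbol of the other.
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        d[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = d[i - 1][j - 1] + (r[i - 1] != b[j - 1])  # match or change
            column = d[i - 1][j] + 1  # r_i has no partner in B
            row = d[i][j - 1] + 1     # b_j has no partner in R
            d[i][j] = min(diagonal, column, row)
    return d
```

For example, `edit_matrix("ACGT", "AGT")[4][3]` evaluates to 1, since deleting C from the first string yields the second.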
The following algorithm computes the matrix.
Initialization: D_{0,0} := 0; for all j, 1 ≤ j ≤ n, [...]; for all i, 1 ≤ [...]

[...] l_1, we next add the triple (i + l_1 + d, l_1, L_{d,e} - l_1) to the sequence of L_{d,e}. This last triple describes the match between substrings of the pattern and the text. It was found during the computation of L_{d,e} given L_{d-1,e-1}, L_{d,e-1} and L_{d+1,e-1}. This latter addition is done regardless of whether the source of l_1 was L_{d-1,e-1}, L_{d,e-1} or L_{d+1,e-1}.

At the end of iteration i we check which of the sequences of the 2k+1 L_{d,k}'s reach the rightmost symbol of the text. If the index of this symbol is greater than j (L_{d,k} + d + i > j), we take its sequence to be the new S_{i,j}.

The old instruction 4 (where row is increased by one unit at a time without using S_{i,j} and MAXLENGTH) is employed each time we move to a new symbol of the text. We maintain O(k) diagonals at any time during the text analysis and may need to compare the new symbol for each of them. Hence, the old instruction 4 requires a total of O(kn) time throughout the text analysis. In order to evaluate the number of steps which are required by the new instruction 4 at iteration i, we use again the fact that O(k) diagonals are computed. The sequence S_{i,j} has at most 2k + 1 triples. We can charge each operation performed on any one of the diagonals to either a difference being discovered (there are ≤ k such differences), or to a triple of S_{i,j} being examined (there are ≤ 2k + 1 triples). This amounts to O(k) operations per diagonal at each iteration i. Therefore, the total running time of the text analysis is O(k^2 n). This implies that the program will run the fastest when only good alignments are accepted (i.e. k is stipulated to be small).

The actual alignment of the pattern and text can easily be produced if we retain all those S_{i,j} where the end of the pattern (r_m) has been reached. Along with each S_{i,j}, its corresponding number of differences (≤ k) can be stored. A preliminary version of the program has been written in Fortran and implemented on the CDC 6600. This program has been used to calculate the figures given here.

5. Discussion
This paper constitutes an additional example of a successful interdisciplinary effort where an algorithm developed by computer science designers is applied to solve problems that arise in molecular biology. The principle of the algorithm we have described, though seemingly complex, is essentially simple. As we slide the pattern along the text, we shall then get several alignments, each with at most k differences, rather than only the position in the text of the optimal alignment. Clearly this is advantageous both in comparing amino acid and nucleotide sequences. One may then list all alignments either by site or by their minimal differences score, i.e. by how good they are.
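The listing of alignments mentioned above can be illustrated with a short Python sketch. It is not the O(k^2 n) algorithm of this paper: for brevity it uses the simple O(mn) column-by-column dynamic program with a free starting point in the text (D_{0,j} = 0, an assumption), and reports every text position at which an alignment with at most k differences ends. The names `k_difference_ends`, `by_site` and `by_score` are hypothetical.

```python
def k_difference_ends(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, differences) for every 1-based text position
    at which the pattern aligns with at most k differences."""
    m = len(pattern)
    col = list(range(m + 1))  # column j = 0: D[i][0] = i
    hits = []
    for j, symbol in enumerate(text, start=1):
        new = [0] * (m + 1)  # new[0] = 0: an alignment may start anywhere
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != symbol),  # match or mismatch
                col[i] + 1,      # text symbol left unmatched
                new[i - 1] + 1,  # pattern symbol left unmatched
            )
        if new[m] <= k:
            hits.append((j, new[m]))
        col = new
    return hits


hits = k_difference_ends("ACGT", "GGACGTGG", 1)
by_site = sorted(hits)                               # ordered by position
by_score = sorted(hits, key=lambda h: (h[1], h[0]))  # best alignments first
```

Here `by_site` is `[(5, 1), (6, 0), (7, 1)]` and `by_score` is `[(6, 0), (5, 1), (7, 1)]`: the exact occurrence ends at position 6, and its neighbours each cost one difference.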
For the problem of efficient string matching in the presence of insertions and deletions, the algorithms presented here are the most efficient to date. Especially noteworthy is the algorithm described in section 4. Whenever the text analysis (rather than the pattern) dominates the computation time, the running time of this algorithm is close to optimal. Very often long sequences are involved in homology searches. Also very frequent is the need to compare one sequence (i.e. the pattern) to many others.
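The diagonal quantities L_{d,e} used by the text analysis of section 4 can be illustrated with the following Python sketch for a single iteration, i.e. one starting position in the text. It assumes that L_{d,e} denotes the largest row reached on diagonal d with e differences. Matches are extended one symbol at a time, so the sketch corresponds to the old instruction 4 and uses neither S_{i,j} nor MAXLENGTH. The function name `differences_from` and the handling of the matrix borders are assumptions made for illustration.

```python
def differences_from(pattern: str, text: str, start: int, k: int):
    """Return the smallest e <= k such that the pattern aligns with some
    prefix of text[start:] with e differences, or None if there is none."""
    m = len(pattern)
    avail = len(text) - start  # text symbols to the right of `start`
    UNREACHED = -10**9         # sentinel for diagonals not yet reached
    prev = {}                  # prev[d] = L_{d,e-1}
    for e in range(k + 1):
        cur = {}
        for d in range(-e, e + 1):
            if e == 0:
                row = 0
            else:
                row = max(
                    prev.get(d - 1, UNREACHED),      # text symbol unmatched
                    prev.get(d, UNREACHED) + 1,      # mismatch
                    prev.get(d + 1, UNREACHED) + 1,  # pattern symbol unmatched
                )
            row = min(row, m, avail - d)  # stay inside the matrix
            if row < 0 or row + d < 0:
                continue                  # diagonal not reachable
            # Slide along the diagonal while the symbols agree.
            while (row < m and row + d < avail
                   and pattern[row] == text[start + row + d]):
                row += 1
            if row == m:
                return e                  # end of the pattern reached
            cur[d] = row
        prev = cur
    return None
```

With pattern ACGT and text GGACGTGG, `differences_from("ACGT", "GGACGTGG", 2, 0)` returns 0 (an exact occurrence starts there), while `differences_from("ACGT", "GGACGTGG", 0, 2)` returns 2. Only the 2e + 1 diagonals between -e and e are touched at level e, which is why the work per starting position depends on k rather than on m once matches can be extended quickly.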
Thus, the fast growing rate in the sequencing of DNA (and protein) molecules, coupled with the efficiency of the algorithm, as well as the listing it provides of all alignments better than a previously set value, make it an attractive tool in molecular biology.
ACKNOWLEDGEMENTS
U. V. has been supported by NSF grant NSF-DCR-8318874 and ONR grant N0014-85-K-0046.
*To whom correspondence should be addressed
REFERENCES
1. Landau, G. M. and Vishkin, U. (1985). Proc. 26th IEEE Symp. on Foundation of Comp. Sci. In press.
2. Landau, G. M. and Vishkin, U. (1985). Submitted.
3. Tinoco, I. Jr., Uhlenbeck, O. C. and Levine, M. D. (1971). Nature 230, 362-367.
4. Maizel, J. V. Jr. and Lenk, R. P. (1981). Proc. Natl. Acad. Sci. USA 78, 7665-7669.
5. Needleman, S. B. and Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.
6. Sankoff, D. (1972). Proc. Natl. Acad. Sci. USA 69, 4-6.
7. Sellers, P. H. (1974). SIAM J. Appl. Math. 26, 787-796.
8. Dumas, J.-P. and Ninio, J. (1982). Nucl. Acids Res. 10, 197-206.
9. Korn, L. J., Queen, C. L. and Wegman, M. N. (1977). Proc. Natl. Acad. Sci. USA 74, 4401-4405.
10. Sellers, P. H. (1979). Proc. Natl. Acad. Sci. USA 76, 3041.
11. Sellers, P. H. (1980). J. Algorithms 1, 359-373.
12. Goad, W. B. and Kanehisa, M. I. (1982). Nucl. Acids Res. 10, 247-263.
13. Wilbur, W. J. and Lipman, D. (1983). Proc. Natl. Acad. Sci. USA 80, 726-730.
14. Nussinov, R. (1983). J. Theor. Biol. 100, 319-328.
15. Nussinov, R., Pieczenik, G., Griggs, J. R. and Kleitman, D. J. (1978). SIAM J. Appl. Math. 35 (1), 68-82.
16. Nussinov, R. and Jacobson, A. B. (1980). Proc. Natl. Acad. Sci. USA 77, 6309-6313.
17. Fickett, J. W. (1984). Nucl. Acids Res. 12, 175-180.
18. Ukkonen, E. (1983). Proc. Inter. Conf. Found. Comp. Theor. Lecture Notes Comp. Sci. 158, 487-495.
19. Wagner, R. and Fisher, M. (1974). J. ACM 21, 168-178.
20. Knuth, D. E., Morris, J. H. and Pratt, V. R. (1977). SIAM J. Comp. 6, 322-350.
21. Main, M. G. and Lorentz, R. J. (1984). J. of Algorithms 5, 422-432.