Appl. Math. Lett. Vol. 10, No. 2, pp. 67–73, 1997. Copyright © 1997 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0893-9659/97 $7.00 + 0.00

Pergamon

Improving the Divide-and-Conquer Approach to Sum-of-Pairs Multiple Sequence Alignment

J. Stoye
Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany
[email protected]

S. W. Perrey
Department of Mathematics, Massey University, Palmerston North, New Zealand
[email protected]

A. W. M. Dress
Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany
[email protected]

(Received May 1996; accepted August 1996)

Abstract—We consider the problem of multiple sequence alignment: given k sequences of length at most n and a certain scoring function, find an alignment that minimizes the corresponding “sum of pairs” distance score. We generalize the divide-and-conquer technique described in [1,2], and present new ideas on how to use efficient search strategies for saving computer memory and accelerating the procedure for three or more sequences. Resulting running times and memory usage are shown for several test cases. Keywords—Multiple sequence alignment, Dynamic programming, Divide-and-conquer.

1. INTRODUCTION

Multiple sequence alignment is an important problem in computational molecular biology, and many algorithms have been presented in this area of research (for a recent comparison see [3]). Since the problem of computing optimal alignments with respect to the “sum-of-pairs” criterion and most of its variants are NP-hard [4], many approximative algorithms have been proposed (e.g., [5–8]). Unfortunately, almost all of these methods either exhibit a prohibitive computational complexity or yield biologically implausible results. With our algorithm, we try to contribute towards improving this situation.

The authors wish to thank R. Giegerich and U. Tönges for helpful comments on an earlier version of this paper. Part of this work is supported by the German Ministry for Education, Science, Research, and Technology (BMBF) under Grant Number 01 IB 301 B4.


J. Stoye et al.

2. THE PROBLEM

Let us consider a finite alphabet $A$, $k$ sequences $s_1, s_2, \ldots, s_k$ over $A$ of lengths $n_1, n_2, \ldots, n_k$, respectively, and an additional letter, say ‘−’, not contained in $A$, which symbolizes gaps. An alignment of $s_1, s_2, \ldots, s_k$ is given by a $k \times N$ matrix $M = (m_{ij})_{1 \le i \le k,\, 1 \le j \le N}$ for some $N \le \sum_{i=1}^{k} n_i$, with entries $m_{ij} \in A \cup \{-\}$, subject to the following constraints: it does not contain any column consisting of gaps only, and for each $i = 1, 2, \ldots, k$, the row $(m_{i1}, m_{i2}, \ldots, m_{iN})$ reproduces the sequence $s_i$ upon eliminating all of its gap letters.

The weighted sum of pairs multiple sequence alignment problem can now be described as follows (cf. [9]): given $s_1, s_2, \ldots, s_k$, and given a scoring function $D : (A \cup \{-\})^2 \to \mathbb{R}$, defined on all possible pairs of letters, find an optimal alignment $M$, i.e., an alignment that minimizes

$$w(M) := \sum_{1 \le p < q \le k} \cdots$$

[...] $s_1(> c_1), s_2(> c_2), \ldots, s_k(> c_k)$ forms an optimal alignment of the original sequences. Here, $s_p(\le c_p)$ denotes the prefix subsequence of $s_p$ with indices running from $1$ to $c_p$, and $s_p(> c_p)$ denotes the suffix subsequence of $s_p$ with indices running from $c_p + 1$ to $n_p$, $1 \le p \le k$.

Obviously, for any fixed site $\hat{c}_1$ ($1 \le \hat{c}_1 \le n_1$), there exists a $(k-1)$-tuple $(c_2(\hat{c}_1), \ldots, c_k(\hat{c}_1))$ such that $(\hat{c}_1, c_2(\hat{c}_1), \ldots, c_k(\hat{c}_1))$ forms a $k$-tuple of ideal slicing sites. Unfortunately, finding these points requires searching the whole $k$-dimensional hypercube, which takes as much time as the standard dynamic programming procedure, so this is not the method of choice.

Instead, our algorithm tries to find so-called C-optimal slicing sites that are based on pairwise sequence comparisons only. More precisely, we apply the dynamic programming procedure to all pairs of sequences $(s_p, s_q)$. The resulting score matrices for pairwise alignment give rise to additional cost matrices

$$C_{s_p,s_q}[c_p, c_q] := w_{op}(s_p(\le c_p), s_q(\le c_q)) + w_{op}(s_p(> c_p), s_q(> c_q)) - w_{op}(s_p, s_q),$$

Divide-and-Conquer Approach


which contain the additional charge imposed by forcing the alignment path to run through a particular vertex $(c_p, c_q)$ ($1 \le p < q \le k$). The calculation of $C_{s_p,s_q}$ can be performed by computing forward and reverse matrices in a way similar to those described in [15,16], respectively.

Note that, for every fixed $\hat{c}_p$, there exists at least one slicing site $c_q(\hat{c}_p)$ with $C_{s_p,s_q}[\hat{c}_p, c_q(\hat{c}_p)] = 0$. This follows from two facts: the vertices on an optimal pairwise alignment path are precisely those with no additional cost, and every alignment path meets every position of the two sequences at least once.

To search for a good $k$-tuple of slicing sites, we try to estimate the multiple additional cost imposed by forcing the multiple alignment path of the sequences through the particular vertex $(c_1, c_2, \ldots, c_k)$ in the whole ($k$-dimensional) hypercube associated with the corresponding alignment problem. To this end, we use a weighted sum of additional costs over all projections $(c_p, c_q)$ as such an estimate: we put

$$C(c_1, c_2, \ldots, c_k) := \sum_{1 \le p < q \le k} \alpha_{p,q} \cdot C_{s_p,s_q}[c_p, c_q],$$

[...] $s_1(> c_1), s_2(> c_2), \ldots, s_k(> c_k), L)$, where $(c_1, c_2, \ldots, c_k) := \text{calc-cut}(s_1, s_2, \ldots, s_k)$. In the following section, we describe how to realize calc-cut, which computes a $k$-tuple of C-optimal slicing sites.
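The forward and reverse computation of a pairwise additional cost matrix, and the weighted sum over all pairs, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `unit_cost` is a hypothetical scoring function $D$ (0 for identical letters, 1 otherwise, gaps included), $w_{op}$ is taken to be the minimal pairwise alignment cost, cut positions run from 0 to $n_p$, and all weights $\alpha_{p,q}$ default to 1. All function names are invented for the example.

```python
GAP = "-"


def unit_cost(a, b):
    """Hypothetical scoring function D: 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1


def prefix_costs(s, t, D=unit_cost):
    """F[i][j] = minimal cost of aligning the prefixes s[:i] and t[:j]."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + D(s[i - 1], GAP)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + D(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = min(
                F[i - 1][j - 1] + D(s[i - 1], t[j - 1]),  # letter against letter
                F[i - 1][j] + D(s[i - 1], GAP),           # letter against gap
                F[i][j - 1] + D(GAP, t[j - 1]),           # gap against letter
            )
    return F


def additional_cost_matrix(s, t, D=unit_cost):
    """C[i][j]: extra cost of forcing the alignment path through vertex (i, j)."""
    n, m = len(s), len(t)
    forward = prefix_costs(s, t, D)
    # Suffix costs come from running the same recurrence on the reversed input.
    reverse = prefix_costs(s[::-1], t[::-1], D)
    optimum = forward[n][m]
    return [
        [forward[i][j] + reverse[n - i][m - j] - optimum for j in range(m + 1)]
        for i in range(n + 1)
    ]


def multiple_additional_cost(cuts, pair_matrices, alpha=None):
    """Weighted sum of pairwise additional costs for one k-tuple of cut positions."""
    k = len(cuts)
    total = 0
    for p in range(k):
        for q in range(p + 1, k):
            weight = 1 if alpha is None else alpha[p][q]
            total += weight * pair_matrices[(p, q)][cuts[p]][cuts[q]]
    return total


# Usage: three short sequences, one matrix per pair (p, q) with p < q.
seqs = ["ACGT", "AGT", "ACT"]
pair_matrices = {
    (p, q): additional_cost_matrix(seqs[p], seqs[q])
    for p in range(len(seqs))
    for q in range(p + 1, len(seqs))
}
print(multiple_additional_cost((2, 1, 2), pair_matrices))
```

Every entry of such a matrix is nonnegative, and every row contains at least one zero, which matches the observation above that an optimal pairwise path meets every position of both sequences.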

4. EFFICIENTLY CALCULATING THE SLICING SITES

In a naive implementation, the search calc-cut for C-optimal slicing sites $(c_1, c_2, \ldots, c_k)$ needs time $O(k^2 n^2 + n^{k-1})$, where $n := \max\{n_1, n_2, \ldots, n_k\}$: the computation of all pairwise additional cost matrices takes $O(k^2 n^2)$ time and, for given $\hat{c}_1$, all possible combinations of $c_2, \ldots, c_k$ have to be checked to find the tuple that minimizes $C$, which takes $O(n^{k-1})$ time.

We reduce this running time and the required memory ($O(k^2 n^2)$ for the naive version) by the following approach: we precalculate an estimate $\hat{C}$ for $C(c_1, c_2, \ldots, c_k)$, which allows us to prune the search space enormously. Because the multiple additional cost $C(c_1, c_2, \ldots, c_k)$ is a sum of nonnegative numbers $\alpha_{p,q} \cdot C_{s_p,s_q}[c_p, c_q]$, it is possible to exclude a tuple of slicing sites $(c_1, c_2, \ldots, c_k)$ whenever one of the summands $\alpha_{p,q} \cdot C_{s_p,s_q}[c_p, c_q]$ is larger than the minimum $\hat{C}$ found so far. In particular, for fixed $\hat{c}_1$, any $c_p$ with $\alpha_{1,p} \cdot C_{s_1,s_p}[\hat{c}_1, c_p] \ge \hat{C}$ can never lead to a smaller sum $C$.

With this in mind, a tuple of C-optimal slicing sites can be calculated as follows.

Function calc-cut $(s_1, s_2, \ldots, s_k)$
1. Fix $\hat{c}_1 := \lceil n_1/2 \rceil$.
2. Calculate and save columns $\mathrm{col}^{\hat{c}_1}_{1,q}[j] := C_{s_1,s_q}[\hat{c}_1, j]$ ($2 \le q \le k$, $1 \le j \le n_q$).
3. Locate slicing sites $\hat{c}_2, \ldots, \hat{c}_k$ such that $\mathrm{col}^{\hat{c}_1}_{1,q}[\hat{c}_q] = 0$ ($2 \le q \le k$).


4. Calculate the estimate $\hat{C} := \sum_{1 \le p < q \le k} \alpha_{p,q} \cdot C_{s_p,s_q}[\hat{c}_p, \hat{c}_q] =$ [...]
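Steps 1 to 4, together with the pruning rule stated before the function, can be sketched in Python as follows. This is a hedged illustration only: the later steps of calc-cut are not available in this text and are not reproduced. The callbacks `pair_cost` and `alpha`, the function name `calc_cut_start`, the 0-based numbering of sequences, and the inclusion of cut position 0 are all assumptions made for the example.

```python
import math


def calc_cut_start(lengths, pair_cost, alpha):
    """Steps 1-4 of calc-cut plus the pruning rule, for sequences numbered 0..k-1.

    lengths   -- [n_1, ..., n_k]
    pair_cost -- hypothetical callback: pair_cost(p, q, cp, cq) = C_{s_p,s_q}[cp, cq]
    alpha     -- hypothetical callback: alpha(p, q) = weight of the pair (p, q)
    """
    k = len(lengths)
    # Step 1: cut the first sequence in the middle.
    c1 = math.ceil(lengths[0] / 2)
    # Step 2: keep only the column of each cost matrix that belongs to c1.
    # Assumption: position 0 (empty prefix) is allowed in addition to 1..n_q.
    cols = {
        q: [pair_cost(0, q, c1, j) for j in range(lengths[q] + 1)]
        for q in range(1, k)
    }
    # Step 3: pick, for every other sequence, a site with zero additional cost.
    cuts = [c1] + [cols[q].index(0) for q in range(1, k)]
    # Step 4: the weighted sum at this tuple serves as the first estimate.
    estimate = sum(
        alpha(p, q) * pair_cost(p, q, cuts[p], cuts[q])
        for p in range(k)
        for q in range(p + 1, k)
    )
    # Pruning: a site whose weighted cost against c1 already reaches the
    # estimate cannot belong to a tuple with a smaller sum, so it is dropped.
    candidates = {
        q: [j for j, cost in enumerate(cols[q]) if alpha(0, q) * cost < estimate]
        for q in range(1, k)
    }
    return cuts, estimate, candidates
```

If the estimate is 0, every candidate list is empty, which simply means that the tuple located in step 3 cannot be improved.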
