Parametric Sequence Alignment with Constraints

Roland H.C. Yap

Dept. of Information Systems & Computer Science National University of Singapore Lower Kent Ridge Road Singapore 119260 Republic of Singapore Email:

[email protected]

Abstract

Techniques for detecting similarities between biological sequences such as proteins and DNA are used as a fundamental tool by biologists to investigate the relationships between such sequences. Most of the similarity techniques are based on notions of string edit sequence which involve sequence alignment using substitutions and deletions based on some cost measure. In this paper, we investigate algorithms which generalize the basic dynamic programming string edit distance algorithm to allow queries where the weight matrix of string edit costs is not fixed. Instead, the costs can be non-ground and the cost weights can also be constrained. A naive algorithm which uses inequalities to represent the alignment score is described, and then an improved algorithm which greatly reduces the number of inequalities needed is developed.

1 Introduction

Computational methods for detecting similarities between biological sequences such as proteins and DNA sequences are a fundamental tool used by biologists as a means of investigating the deeper relationships between such sequences (footnote 1). The most commonly used measures of similarity are based on string edit distance. The minimum edit distance is the minimum number of edit operations, involving substitutions and insertions/deletions of gaps, needed to transform one sequence (string) into another. The edit operations are weighted and the total cost is the sum of the operation weights. This problem is also called sequence alignment since it can be treated as aligning two strings together under some number of appropriately inserted gaps. A classic reference on sequence alignment is Sankoff and Kruskal [12]. See also the following references on sequence similarity matching [1, 9, 10, 13, 15, 17, 19] and a general text on computational biology [14].

As an example of string alignment, consider the two strings attacg and atatcg. Two possible alignments are (where the dash denotes an inserted gap):

    a t t a - c g
    a - t a t c g

and

    a t - t a c g
    a t a t - c g

The number of possible alignments in general is huge, and typically we are interested in optimal alignments under some cost measure. Different weights for the operations will lead to different optimal alignments. Such optimal alignment techniques are in routine use by biologists. However, while these alignments have been extremely useful, the precise choice of weights is still controversial and not settled. There are a number of different weight matrices in common use. One reason for the differences is that different biological requirements may call for different matrices. For example, the weights may be used to represent mutations under different theories of evolution, or may represent the correlation between structural/chemical properties of the underlying sequences. Thus the choice of weights can strongly influence the outcome of sequence analysis. Some common weight matrices are: identity (unit similarity cost), BLAST [1], genetic code matrix, physical/structural properties, PAM [3] (evolutionary distance based on point mutations), BLOSUM [6], etc. Some proposals for various weight matrices are [2, 11, 6, 8].

One approach to dealing with the sensitivity of optimal alignments to the choice of weight matrices is to modify the problem to one of finding the K-best alignments from optimum [18]. However, the number of alignments can grow quite rapidly. Using the example from [19], there are 14 alignments within 0% of optimum, 14 within 1%, 35 within 2%, 157 within 3%, 579 within 4%, 1317 within 5% and 20,137,655 within 20%. This example illustrates the potential sensitivity of the optimal alignment to the choice of weights.

Another approach to dealing with this problem is not to enumerate the alignments but to analyse the solution space of alignments. The parametric alignment approach is to explore the solution space of the optimal alignments where the underlying cost function and weight matrices are treated as parameters. Gusfield and Naor [5] and Waterman et al. [20] obtain some theoretical results on the relationships between optimal alignment regions in the two-dimensional parametric space consisting of a single mismatch weight and a single deletion weight. Less precise results are known in the more general case with more weight parameters.

Footnote 1: We assume that the sequences are not identical.
The aim of this paper is to explore whether constraint-based approaches are useful for solving alignment problems where there is a need to deal with parametric alignment. Parametric queries may be useful in investigating the sensitivity of alignments, or where the precise values for the weight matrices are not clear. We choose to use the CLP language CLP(R) [7], which provides reasoning with arithmetic constraints. The rationale here is that the natural kinds of constraints which arise in alignment are mostly arithmetic. The constraint solving and modelling capabilities of CLP(R) proved to be useful in both formulating and solving some restriction mapping problems [4, 21, 22], which also utilised arithmetic constraints. In this preliminary work, we hope to demonstrate that this may be an interesting problem to investigate from a CLP perspective as well as a computational biology perspective.

2 Preliminaries

Let A and B be two sequences over a finite alphabet, where $a_i$ is the ith character in A and $b_j$ is the jth character in B. In the case of DNA, the alphabet size is 4, consisting of the nucleotides {a, c, g, t}, and for proteins the size is 20, one for each amino acid. We define edit operations which transform one string into another as follows:

• Substitution of $a_i$ with $b_j$, a subs.
• Insertion of a gap, or deletion of a character. This is usually called an indel, since inserting a one-character gap in one string is equivalent to deleting a character in the other. A gap is usually written as -.

Each edit operation is assigned a cost or weight. Define the cost of a substitution as $d(a, b)$; this can be extended to indels by writing $d(-, b)$ or $d(a, -)$.

Definition 1 An alignment between a sequence A of length n and B of length m is obtained by introducing gaps (-) into the sequences so that the lengths of both become identical. The length of an alignment is between $\max(n, m)$ and $n + m$.

Definition 2 The score of an alignment between A and B is the sum of the costs of the edit operations to transform A into B (or vice-versa).

Definition 3 The edit distance between A and B is the minimum score over alignments of A and B. We will denote this by $D(A, B)$.
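To make Definition 2 concrete, the following minimal Python sketch scores an alignment given as two gapped rows of equal length. The cost scheme is an assumption made for illustration only: matching characters cost 0, every mismatch costs `sub_cost`, and every indel costs `indel_cost`. The function and parameter names are hypothetical and do not come from the paper.

```python
GAP = "-"

def alignment_score(top, bottom, sub_cost, indel_cost):
    """Sum the edit-operation costs of an alignment (two gapped rows)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == GAP and y == GAP:
            raise ValueError("a column cannot pair two gaps")
        if x == GAP or y == GAP:
            total += indel_cost      # d(-, b) or d(a, -), assumed uniform
        elif x != y:
            total += sub_cost        # d(a, b) for a != b, assumed uniform
        # matching characters are assumed to cost 0
    return total

# The two example alignments of attacg and atatcg, with unit costs.
print(alignment_score("atta-cg", "a-tatcg", sub_cost=1, indel_cost=1))
print(alignment_score("att-acg", "ata-tcg".replace("-", ""), 1, 1)
      if False else alignment_score("at-tacg", "atat-cg", 1, 1))
```

Both example alignments use two indels and no mismatches, so under this assumed scheme they receive the same score for any choice of the two weights.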

Often an extra condition on the edit distance (or simply distance) is that it be a metric, satisfying the following conditions:
1. (Positive definiteness) $D(A, A) = 0$ and $D(A, B) > 0$ for $A \neq B$
2. (Symmetry) $D(A, B) = D(B, A)$
3. (Triangle inequality) $D(A, C) \leq D(A, B) + D(B, C)$

3 Alignment Algorithms

In this paper, we will only be considering the standard minimum edit distance algorithms for alignment. For two sequences A and B, of length n and m respectively, we can define the distance on subsequences (prefixes) as $D_{i,j} = D(a_1 a_2 \cdots a_i, b_1 b_2 \cdots b_j)$, which gives the score of the best alignment between $A_i$ and $B_j$, where $A_i$ denotes $a_1 a_2 \cdots a_i$ and $B_j$ denotes $b_1 b_2 \cdots b_j$. The score $D_{i,j}$ can be computed using the following well-known dynamic programming formulation (see [12, 14]):

$$D_{i,j} = \min\{\, D_{i-1,j} + d(a_i, -),\; D_{i-1,j-1} + d(a_i, b_j),\; D_{i,j-1} + d(-, b_j) \,\}$$

Briefly, the recurrence can be explained as follows: the first term considers the shortest alignment between $A_{i-1}$ and $B_j$ followed by the deletion of $a_i$; the second term considers the shortest alignment between $A_{i-1}$ and $B_{j-1}$ followed by the substitution of $a_i$ with $b_j$; and the last term considers the shortest alignment between $A_i$ and $B_{j-1}$ followed by the deletion of $b_j$. The shortest alignment score $D_{i,j}$ is simply the smallest score of the three alternatives. The minimum edit distance, $D(A, B)$, is simply $D_{n,m}$. The following initial conditions are needed:

$$D_{0,0} = 0, \qquad D_{0,j} = D_{0,j-1} + d(-, b_j), \qquad D_{i,0} = D_{i-1,0} + d(a_i, -)$$

In addition to the alignment score, we may also want to know what the optimal alignments are. These can be obtained by saving, at each minimization step, which of the terms are minimal (deletion from A or B, or a substitution); this record is also called the path. Alternatively, the path can be reconstructed by tracing backwards along the optimal scores from $D_{i,j}$. The time complexity of the standard algorithm is O(nm).
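The recurrence, its initial conditions and the backward reconstruction of the path can be sketched in Python as follows. This is an illustrative sketch with fixed (ground) weights, not the paper's CLP(R) program. It assumes a simplified cost scheme in which matches cost 0, every mismatch costs `sub_cost` and every indel costs `indel_cost`. All names are hypothetical.

```python
def edit_distance_table(a, b, sub_cost=1, indel_cost=1):
    """Fill the (n+1) x (m+1) table D of prefix edit distances."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # D[i][0] = D[i-1][0] + d(a_i, -)
        D[i][0] = D[i - 1][0] + indel_cost
    for j in range(1, m + 1):                 # D[0][j] = D[0][j-1] + d(-, b_j)
        D[0][j] = D[0][j - 1] + indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_sub = 0 if a[i - 1] == b[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + indel_cost,      # delete a_i
                          D[i - 1][j - 1] + d_sub,       # substitute a_i by b_j
                          D[i][j - 1] + indel_cost)      # delete b_j
    return D

def traceback(a, b, D, sub_cost=1, indel_cost=1):
    """Rebuild one optimal alignment by walking back from D[n][m]."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            d_sub = 0 if a[i - 1] == b[j - 1] else sub_cost
            if D[i][j] == D[i - 1][j - 1] + d_sub:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + indel_cost:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

A, B = "attacg", "atatcg"
D = edit_distance_table(A, B)
print(D[len(A)][len(B)])       # edit distance with unit costs
print(*traceback(A, B, D), sep="\n")
```

The traceback returns only one optimal alignment. When several terms of the minimum tie, other optimal alignments exist, and this sketch simply prefers the substitution branch.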

Considerations of Parametric Alignments

The standard algorithm works using fixed weights. An alternative way of expressing this is that the main operation, min, operates in ground mode. An obvious extension is simply to evaluate the alignment under varying choices of weights. The problem of parametric alignment generalizes this observation: here we want the weights themselves to be parameters which can be investigated. The approach in Gusfield et al. [5] and Waterman et al. [20] is to investigate regions of the parametric space with optimal score, mainly using two parameters: one substitution cost and one deletion cost. The approach investigated in this paper is to directly use the normal dynamic programming algorithm, suitably modified to allow non-ground weights. The main step is the evaluation of the min and how to modify this given arbitrary constraints on the weights, which are now simply real-arithmetic variables (possibly non-ground). CLP(R) does provide a min/2 function, but this mainly delays until the arguments become ground. As such, the best that could be done using the builtin min/2 is to construct a min network which can propagate the costs implicitly once values are provided for the weights. So this only gives us a lazy version of the standard algorithm.
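The "obvious extension" of re-running the ground algorithm under varying weights can be sketched as a simple grid sweep. This is a minimal illustration, not the paper's method. It assumes one mismatch weight `sub_cost` and one indel weight `indel_cost`, with matches costing 0. The function names and the grid values are hypothetical.

```python
def distance(a, b, sub_cost, indel_cost):
    """Ground edit distance, keeping only two rows of the table."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            d_sub = 0 if x == y else sub_cost
            curr.append(min(prev[j] + indel_cost,
                            prev[j - 1] + d_sub,
                            curr[j - 1] + indel_cost))
        prev = curr
    return prev[-1]

def sweep(a, b, sub_costs, indel_costs):
    """Evaluate the optimal score at every point of a weight grid."""
    return {(s, g): distance(a, b, s, g)
            for s in sub_costs for g in indel_costs}

table = sweep("attacg", "atatcg", sub_costs=[1, 2, 3], indel_costs=[1, 2, 3])
for (s, g), score in sorted(table.items()):
    print(f"sub_cost={s} indel_cost={g} score={score}")
```

Each grid point is an independent ground run, so the sweep only samples the parameter space. Describing whole regions of weights at once is what motivates treating the weights as non-ground variables.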

A Naive Parametric Algorithm

A naive algorithm is to represent the operation of the min using inequalities. We can think of $D_{i,j}$ as a variable which represents the upper bound on the optimal alignment score for $A_i$ and $B_j$. The min step can be re-expressed as the following three inequalities:

$$D_{i,j} \leq D_{i-1,j} + d(a_i, -), \qquad D_{i,j} \leq D_{i-1,j-1} + d(a_i, b_j), \qquad D_{i,j} \leq D_{i,j-1} + d(-, b_j)$$

The standard dynamic programming algorithm can now be implemented in CLP(R), substituting the min operation with the set of inequalities. The initial conditions also become equation constraints over $D_{i,0}$ and $D_{0,j}$. Since $D_{i,j}$ is expressed in terms of an upper bound on a variable, we do not have the alignment score directly. Note that the bound itself is just a constraint expression and need not be fixed (unlike, say, finite domain constraints). We can think of the edit distance $D(A, B)$ as being implicitly expressed by the optimization function $\max D_{n,m}$. CLP(R) does not directly provide an optimization function; however, it does provide the dump/1 facility, which is a way of examining the constraint store by projecting it onto a set of variables of interest. Using the meta-level version, dump/3, on one variable will give the upper and lower bounds on that variable, represented as constraints, if any.

We will now make a number of observations about the naive algorithm. Firstly, it does not care whether the inputs are ground or not. This means that the input sequences themselves do not have to be fixed either, although it is not clear whether this is a useful feature (footnote 2). Consider running the naive parametric algorithm with fixed weights. The matrix $D_{i,j}$ of score variables remains non-ground; if we then set $D_{n,m} = D(A, B)$, the variables $D_{i,j}$ along the path will become ground, indicating the score at that point.

An alternative, equally naive algorithm is to formulate the min instead as three possible alternatives. This has the problem that it will be necessary to evaluate the various alternatives, which would lead to a huge search space (worst case $O(3^{nm})$).

Footnote 2: If non-ground character comparisons are made, then it is likely to be important not to create any choice points while doing so, otherwise the search space becomes very large.

The alignment score is essentially a function of the form (#subs × substitution cost) + (#indels × indel cost), where we assume here that the substitution cost and the indel cost are each a single parameter. The inequalities produced by the naive algorithm essentially record the various alignment scores obtained for particular choices of #subs and #indels during the algorithm. This can be illustrated with the following sequence alignment. Given the two sequences A = a c t and B = c t a t, and the constraints that both parameters are greater than 0, doing a dump/1 on the optimal cost and the two parameters gives the following constraints, which are a representation of the store projected on Cost and the two parameters:

0 < ;

Cost