
June 1, 1999

Consistent Equivalence Relations: a Set-Theoretical Framework for Multiple Sequence Alignment

Burkhard Morgenstern¹, Jens Stoye²,³ and Andreas Dress²

¹ GSF - National Research Center for Environment and Health, Institute of Biomathematics and Biometry, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany. Present address: Rhône-Poulenc Rorer, JA3-3, Rainham Road South, Dagenham, Essex RM10 7XS, U.K.

² Research Center for Interdisciplinary Studies on Structure Formation (FSPM), Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany

³ Present address: Deutsches Krebsforschungszentrum, Theoretische Bioinformatik (Abt. H0300), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany

Abstract

Recently, Morgenstern et al. have proposed a new mathematical definition of sequence alignment (Morgenstern et al., 1996). In this paper, we discuss this definition in more detail. We demonstrate that it provides an appropriate conceptual framework in which problems arising in the context of sequence alignment can be treated systematically.

1 Introduction

In the standard theory of sequence alignment, an alignment of $N$ sequences $s_1, \dots, s_N$ is defined as a matrix whose rows are so-called 'padded sequences', which are obtained from the original sequences by insertion of additional characters designated as 'blanks', 'gap characters' or 'neutral elements' (cf. Waterman, 1995; p. 186). This formal definition corresponds to the way alignments are constructed by standard alignment algorithms: following a method developed by Needleman and Wunsch (Needleman & Wunsch, 1970), alignments are constructed by inserting gaps into sequences (and by penalizing them by so-called gap penalties). This point of view is clearly appropriate if the sequences in question are closely related and only a few gaps have to be inserted to align them properly.

However, in general, related sequences may share only limited regions of similarity, separated e.g. by introns in DNA sequences or by loop regions in proteins. In such cases, the classical alignment definition has a certain drawback, at least from the theoretical point of view: the way gaps may have to be inserted into the sequences to highlight the related parts of the sequences properly is by no means unique. Consider e.g. the sequences s1 = abcdddddfgh and s2 = abceefgh. A meaningful alignment of these sequences will certainly align their first three as well as their last three letters, but it seems pointless to align the d's and e's in the middle of the sequences, since these parts of the sequences share no similarity at all. If we want to describe such an alignment in standard terms, there are several matrices, all of them describing essentially the same alignment (see Figure 1, alignments A1 to A3).

To avoid such ambiguities, we propose another definition of what should be called an alignment. While the standard definition focuses on the gaps introduced into the sequences, i.e. on the non-aligned residues, we define an alignment by specifying exclusively the aligned residues, and we simply ignore the non-aligned residues in our definition.

In our exposition, we will freely use standard terminology from basic set theory (including standard terminology referring to binary relations, in particular equivalence relations and partial (quasi-)orders, defined on a set $X$) as recalled in the Appendix of this note. In addition, the Appendix provides the proofs of a number of specific facts which we require in our exposition and which are not yet standard knowledge or folklore. This way, our exposition can be kept concise, while those facts can be established within a simultaneously more natural and considerably more general environment.

Remark: While designing the concepts and establishing the proofs discussed in the Appendix, it became apparent that an even more general notion of what really is an alignment should be studied, one which encompasses the two concepts discussed here as its two extreme cases and allows for even more transparent and straightforward proofs of all the basic facts that need to be established in this context. This notion will be presented and discussed in a forthcoming paper entitled 'What really are alignments?'.
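The ambiguity described in the Introduction can be made concrete with a small sketch. Figure 1 is not reproduced in this text, so the three matrices below are hypothetical stand-ins for alignments A1 to A3; the function name `aligned_pairs` and the representation of a site as a tuple `(i, j)` are likewise assumptions made only for illustration.

```python
from itertools import combinations

def aligned_pairs(rows, gap="-"):
    """Return the set of aligned site pairs described by a matrix alignment.

    A site is a tuple (i, j): the j-th residue (1-based) of the i-th row.
    """
    counters = [0] * len(rows)          # residues seen so far in each row
    pairs = set()
    for column in zip(*rows):
        sites = []
        for i, ch in enumerate(column):
            if ch != gap:
                counters[i] += 1
                sites.append((i + 1, counters[i]))
        # all residues sharing a column are pairwise aligned
        pairs.update(frozenset(p) for p in combinations(sites, 2))
    return pairs

# Three hypothetical matrix alignments of abcdddddfgh and abceefgh that
# differ only in the placement of gaps around the unrelated d's and e's.
A1 = ["abcddddd--fgh", "abc-----eefgh"]
A2 = ["abc--dddddfgh", "abcee-----fgh"]
A3 = ["abcd-d-dddfgh", "abc-e-e---fgh"]

same = aligned_pairs(A1) == aligned_pairs(A2) == aligned_pairs(A3)
```

All three matrices yield the same six aligned pairs, which is the information the definition proposed in this paper retains.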

2 Alignments as Equivalence Relations

In this section, we give a formal mathematical definition of what we want to call an alignment, and we draw some immediate conclusions from this definition. We assume familiarity with the basic concepts of the mathematical theory of sets and relations (see for instance Bourbaki, 1968).

Let us consider a finite set $\mathcal{A}$, called the alphabet. The disjoint union

$$\bigcup_{n=0}^{\infty} \mathcal{A}^n =: \mathcal{A}^*$$

is called the sequence space over $\mathcal{A}$. Given an alphabet $\mathcal{A}$, any map $s = s_I : I \to \mathcal{A}^*$ from a (finite) index set $I$ into $\mathcal{A}^*$ is called a sequence family (over $\mathcal{A}$). For every $i \in I$, we denote by $L_i$ the length of the sequence $s(i)$, i.e. the unique number $n \in \mathbb{N}_0$ ($:= \{0, 1, \dots\}$) with $s(i) \in \mathcal{A}^n$. For simplicity, we write $s_i$ rather than $s(i)$ to designate the $i$-th sequence. The site space of a sequence family $s_I$ is now defined to be the set

$$S = S(s_I) := \{[i|j] : i \in I,\ 1 \le j \le L_i\}.$$

If $I'$ is a subset of $I$, we denote the restriction $s_I|_{I'}$ by $s_{I'}$ and we denote $S(s_{I'})$ by $S(I')$. In addition, for $i \in I$, we write $S(i)$ instead of $S(\{i\})$, so $S(i)$ is the set of all sites belonging to the $i$-th sequence:

$$S(i) := \{[i|j] : 1 \le j \le L_i\}.$$

To describe an alignment of a sequence family $s_I$ by pointing out which residues are aligned with each other, rather than by focusing on where to introduce gaps, we now define an alignment of $s_I$ as a binary relation $A$ defined on the site space $S$, and we write $([i|j], [i'|j']) \in A$ if we want to express that the $j$-th site of the $i$-th sequence is aligned with the $j'$-th site of the $i'$-th sequence. Obviously, the relation $A$ is symmetric as well as transitive, provided we consider each site $[i|j] \in S$ as being aligned with itself, that is, we assume $A$ also to be reflexive. In other words, we assume an alignment $A$ to be an equivalence relation defined on the site space $S$, and we will also write $x\,A\,y$ instead of $(x, y) \in A$ for any pair $x, y \in S$ of aligned sites.

On the other hand, we do not want to regard every equivalence relation defined on $S$ as an alignment of the sequence family $s_I$. We require an alignment to be 'consistent' in a certain sense with the linear order of the sites in the individual sequences. It is therefore convenient to define a certain partial order '$\preceq$' on the site space $S$, the so-called 'direct sum' of the linear orders given on the single sequences, which is defined by

$$[i|j] \preceq [i'|j'] \iff i = i' \text{ and } j \le j'.$$

(The relation '$\preceq$' corresponds to the 'implicit constraints' introduced by Myers et al., 1996.)
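As a minimal sketch of the two objects just introduced, the site space and the partial order can be written down directly. The tuple encoding `(i, j)` for a site $[i|j]$ and the names `site_space` and `precedes` are assumptions chosen for illustration, not notation from the paper.

```python
def site_space(lengths):
    """All sites [i|j], encoded as tuples (i, j), for sequence lengths L_i."""
    return [(i, j) for i, L in lengths.items() for j in range(1, L + 1)]

def precedes(x, y):
    """The 'direct sum' partial order: [i|j] <= [i'|j'] iff i == i' and j <= j'."""
    return x[0] == y[0] and x[1] <= y[1]

# The two sequences from the Introduction have lengths 11 and 8.
lengths = {1: 11, 2: 8}
S = site_space(lengths)
```

Sites of different sequences are never comparable under `precedes`; comparability across sequences only arises once an alignment is added.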

An alignment $A$ extends the partial order '$\preceq$' to a quasi partial order '$\preceq_A$' on the set $S$, which is given as the transitive closure of the union $A \cup \preceq$ (note that $\preceq$ as well as $A$ are binary relations defined on the site space $S$, that is, they are subsets of the cartesian product $S^2$, so we may apply the usual set-theoretical operations to them). In other words, for any two sites $x, y \in S$, we write '$x \preceq_A y$' if and only if there exists a chain $x = x_0, x_1, \dots, x_k = y$ of sites in $S$ with either $x_{i-1} \preceq x_i$ or $x_{i-1}\,A\,x_i$ for each $i \in \{1, \dots, k\}$ (see Figure 2).

Now we can define precisely what 'consistent with the linear order of the sites in the individual sequences' means: for every $i \in I$, the restriction of the extended quasi order $\preceq_A$ to the subset $S(i)$ has to coincide with the 'original' or 'natural' linear order relation defined on $S(i)$. In other words, for every $i \in I$ and $j, j' \in \{1, \dots, L_i\}$, we must have $[i|j] \preceq_A [i|j']$ if and only if $j \le j'$ holds. In formal terms, we propose

Definition 1

(a) For a binary relation $R$ defined on a set $X$, we denote by $R^t$ its transitive closure and by $R^e$ its equivalence closure, i.e. $R^t$ and $R^e$ are the smallest transitive relation and the smallest equivalence relation, respectively, which are defined on $X$ and contain $R$.

(b) Let $s_I$ be a sequence family. On the site space $S = S(s_I)$, we define a partial order '$\preceq$' by

$$[i|j] \preceq [i'|j'] \iff_{\mathrm{def}} i = i' \text{ and } j \le j'.$$

(c) For every binary relation $R$ defined on the site space $S$ and for every $x \in S$ and $i \in I$, we define

$$\preceq_R := (R^e \cup \preceq)^t = (R \cup R^{-1} \cup \preceq)^t, \qquad \prec_R := \preceq_R \setminus (\preceq_R \cap \preceq_R^{-1}),$$

$$b_\downarrow(i \mid x; R) := \min\{\, j \in \{1, \dots, L_i + 1\} : j = L_i + 1 \text{ or } x \preceq_R [i|j] \,\}$$

and

$$b_\uparrow(i \mid x; R) := \max\{\, j \in \{0, \dots, L_i\} : j = 0 \text{ or } [i|j] \preceq_R x \,\}.$$

(d) An equivalence relation $A$ defined on the site space $S$ is called an alignment of the sequence family $s_I$ if, for every $i \in I$, the restriction of the relation $\preceq_A$ to the subset $S(i)$ coincides with the linear order given on $S(i)$, in which case $A$ will be called ($\preceq$-)consistent. More generally, we will say that a binary relation $R \subseteq S^2$ induces an alignment of $s_I$ if its equivalence closure $R^e$ is an alignment of $s_I$, in which case we will also say that $R$ is ($\preceq$-)consistent.

Clearly, $R$ is consistent if and only if $b_\uparrow(i \mid x; R) \le b_\downarrow(i \mid x; R)$ holds for all $x \in S$ and $i \in I$, and in this case we have $x\,R^e\,[i|j]$ for some $j \in \{1, \dots, L_i\}$ if and only if $j = b_\uparrow(i \mid x; R) = b_\downarrow(i \mid x; R)$ holds. Consequently, we will refer to these numbers as the consistency bounds (of $i$ relative to $x$ and $R$).

(e) Finally, for any subset $I' \subseteq I$, an alignment $A$ is defined to be an $I'$-maximal alignment if $S(I')^2$ is contained in $\preceq_A \cup \preceq_A^{-1}$.
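Definition 1 can be illustrated by a brute-force sketch: compute the relation $(R \cup R^{-1} \cup \preceq)^t$ with Warshall's algorithm, test consistency, and read off the consistency bounds. Sites are encoded as tuples `(i, j)`, and all function names are hypothetical; this is a naive cubic-time illustration under those assumptions, not the procedure discussed later in the paper.

```python
def site_space(lengths):
    return [(i, j) for i, L in lengths.items() for j in range(1, L + 1)]

def quasi_order(lengths, R):
    """Return (R u R^-1 u <=)^t as a set of ordered pairs of sites."""
    S = site_space(lengths)
    le = {(x, y) for x in S for y in S if x[0] == y[0] and x[1] <= y[1]}
    le |= set(R) | {(y, x) for (x, y) in R}
    for z in S:                      # Warshall: allow z as an intermediate site
        for x in S:
            if (x, z) in le:
                for y in S:
                    if (z, y) in le:
                        le.add((x, y))
    return le

def is_consistent(lengths, R):
    """R is consistent iff [i|j] <=_R [i|k] forces j <= k within each sequence."""
    return all(j <= k for ((i, j), (h, k)) in quasi_order(lengths, R) if i == h)

def b_down(lengths, le, i, x):
    """min j with x <=_R [i|j], or L_i + 1 if there is no such site."""
    return min([j for j in range(1, lengths[i] + 1) if (x, (i, j)) in le]
               + [lengths[i] + 1])

def b_up(lengths, le, i, x):
    """max j with [i|j] <=_R x, or 0 if there is no such site."""
    return max([j for j in range(1, lengths[i] + 1) if ((i, j), x) in le] + [0])

lengths = {1: 3, 2: 3}
R_ok = {((1, 1), (2, 2))}                       # a single aligned pair
R_bad = {((1, 1), (2, 2)), ((1, 2), (2, 1))}    # two crossing pairs
le_ok = quasi_order(lengths, R_ok)
```

For `R_ok`, the site `(1, 1)` has equal bounds in sequence 2 (it is aligned with `(2, 2)`), whereas `(1, 2)` is unaligned and its bounds leave the open range between positions 2 and 4.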

Note that $x, y \in S$, $i \in I$ and $x \preceq_R y$ implies

$$b_\downarrow(i \mid x; R) \le b_\downarrow(i \mid y; R)$$

and

$$b_\uparrow(i \mid x; R) \le b_\uparrow(i \mid y; R).$$

The following fact is established in the Appendix:

Lemma 1 Let $A$ be an equivalence relation defined on the site space $S = S(s_I)$ of some sequence family $s_I$. Then each of the following four properties implies the other three:

(a) $A$ is an alignment of $s_I$.

(b) Both of the following two conditions hold:

$$\preceq_A \cap \preceq_A^{-1} \subseteq A, \qquad (1)$$

$$A \cap \preceq \; \subseteq D_S. \qquad (2)$$

(Recall that for any set $X$, $D_X$ denotes the diagonal relation $\{(x, x) \mid x \in X\}$.)

(c) There exist a partially ordered set $(\tilde S, \le_{\tilde S})$ and a strictly monotonically increasing mapping

$$\varphi : (S, \preceq) \to (\tilde S, \le_{\tilde S})$$

such that, for all $x, y \in S$, $\varphi(x) = \varphi(y)$ holds if and only if $x\,A\,y$ holds, i.e. $x$ is aligned with $y$.

(c') There exist a linearly ordered set $(\tilde S, \le_{\tilde S})$ and a strictly monotonically increasing mapping $\varphi : (S, \preceq) \to (\tilde S, \le_{\tilde S})$ such that, for all $x, y \in S$, $\varphi(x) = \varphi(y)$ holds if and only if $x\,A\,y$ holds.

Since the 'padded sequences' used in the standard alignment definition may be regarded as strictly monotonically increasing maps $S(1) \to \mathbb{N}, \dots, S(N) \to \mathbb{N}$, our definition is closely related to the standard definition (cf. Chan et al., 1992); however, we avoid the above mentioned ambiguity. More precisely: we may divide the set of all standard, i.e. matrix-based, alignments of a given sequence family $s_I$ into equivalence classes such that two 'matrix alignments' $A$ and $B$ are equivalent if and only if the two corresponding maps $\varphi_A : S(s_I) \to \mathbb{N}$ and $\varphi_B : S(s_I) \to \mathbb{N}$ satisfy the condition

$$\varphi_A(x) = \varphi_A(y) \iff \varphi_B(x) = \varphi_B(y) \quad \text{for all } x, y \in S(s_I)$$

or, equivalently, $B$ may be obtained from $A$ by successively permuting pairs of two consecutive columns for which each sequence has at least one gap among its entries in those two columns. For instance, the first three 'matrix alignments' shown in Figure 1 belong to one single equivalence class. Clearly (see also the Appendix), there is a one-to-one correspondence between these equivalence classes and the alignments of $s_I$ as defined in Definition 1, the $I$-maximal alignments corresponding exactly to those equivalence classes of standard alignments which consist of just one such alignment.
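The correspondence between alignments in the sense of Definition 1 and classes of matrix alignments can be sketched as follows: the classes of $R^e$ become columns, ordered by some linear extension as in Lemma 1 (c'). The sketch assumes that `R` is consistent, uses the standard-library `graphlib` module (Python 3.9 or later), and the name `to_matrix` is hypothetical; which representative of the class of matrices is returned depends on the topological order chosen.

```python
from graphlib import TopologicalSorter

def to_matrix(seqs, R, gap="-"):
    """Build one matrix alignment from a consistent set R of aligned site pairs.

    seqs maps a sequence index i to its string; sites are tuples (i, j).
    """
    sites = [(i, j) for i, s in seqs.items() for j in range(1, len(s) + 1)]
    parent = {x: x for x in sites}

    def find(x):                      # union-find representative of the class of x
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in R:
        parent[find(x)] = find(y)

    # each class must come after the class of the preceding site in each sequence
    preds = {find(x): set() for x in sites}
    for (i, j) in sites:
        if j > 1 and find((i, j - 1)) != find((i, j)):
            preds[find((i, j))].add(find((i, j - 1)))

    order = list(TopologicalSorter(preds).static_order())
    column = {c: n for n, c in enumerate(order)}
    rows = {i: [gap] * len(order) for i in seqs}
    for (i, j) in sites:
        rows[i][column[find((i, j))]] = seqs[i][j - 1]
    return {i: "".join(r) for i, r in rows.items()}

seqs = {1: "abcdddddfgh", 2: "abceefgh"}
R = [((1, j), (2, j)) for j in (1, 2, 3)] + [((1, 8 + t), (2, 5 + t)) for t in (1, 2, 3)]
rows = to_matrix(seqs, R)
```

Stripping the gap characters from each returned row gives back the original sequence, and every pair in `R` ends up in a common column.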

3 Alignments and Consistency

How can we decide algorithmically whether or not a given binary relation $A$ defined on the site space $S = S(s_I)$ of some sequence family $s_I$ induces an alignment? Clearly, we might start with $A_0 := \emptyset$ and then proceed by successively enlarging the subset in question by pairs $(x, y)$ from $A$, always checking for consistency. More generally, we may assume that we are given some pairs $(x_1, y_1), (x_2, y_2), \dots, (x_k, y_k) \in S^2$ of elements from $S$, and that we may want to construct an alignment $A$ of $s_I$ recursively in a greedy fashion by putting $A_0 := \emptyset$ as above and then putting

$$A_\kappa := \begin{cases} (A_{\kappa-1} \cup \{(x_\kappa, y_\kappa)\})^e, & \text{if } A_{\kappa-1} \cup \{(x_\kappa, y_\kappa)\} \text{ is consistent,} \\ A_{\kappa-1}, & \text{otherwise} \end{cases}$$

(cf. Abdeddaïm, 1997 a/b; Morgenstern et al., 1996). In both cases, we need to know whether or not the union of an alignment $A \subseteq S^2$ and a single pair $(\hat x, \hat y) \in S^2$ is consistent. This question is answered by Proposition 1, which is also established in the Appendix:

Proposition 1 Let $s_I$ be a sequence family, let $A$ be an alignment of $s_I$, assume $\hat x, \hat y \in S = S(s_I)$, and put $A' := A \cup \{(\hat x, \hat y)\}$. Then

$$x \preceq_{A'} y \iff \begin{cases} x \preceq_A y, \text{ or} \\ (x \preceq_A \hat x \text{ and } \hat y \preceq_A y), \text{ or} \\ (x \preceq_A \hat y \text{ and } \hat x \preceq_A y). \end{cases}$$

In particular, $A'$ is consistent if and only if neither $\hat x \prec_A \hat y$ nor $\hat y \prec_A \hat x$ holds.

Corollary 1 $A$ is not properly contained in any other alignment if and only if $A$ is $I$-maximal, i.e. if we have

$$S^2 = \preceq_A \cup \preceq_A^{-1},$$

in which case $\preceq_A$ is a total quasi-order.

(Recall that a relation $R$ on a set $X$ is called total if any two elements of $X$ are "comparable" via $R$, i.e. if one has $R \cup R^{-1} = X^2$.)

Corollary 2 Given an alignment $A$, an element $\hat x \in S$ and an element $\hat y = [i_1|j_1] \in S$, the relation $A' := A \cup \{(\hat x, \hat y)\}$ is a proper consistent extension of $A$ if and only if one has

$$b_\uparrow(i_1 \mid \hat x; A) < j_1 < b_\downarrow(i_1 \mid \hat x; A);$$

moreover, whether or not $A'$ is consistent, its consistency bounds can be computed, for any $x \in S$ and $i \in I$, by the simple formulae

$$b_\downarrow(i \mid x; A') = \begin{cases} \min(b_\downarrow(i \mid x; A),\ b_\downarrow(i \mid \hat y; A)) & \text{if } x \preceq_A \hat x, \\ \min(b_\downarrow(i \mid x; A),\ b_\downarrow(i \mid \hat x; A)) & \text{if } x \preceq_A \hat y, \\ b_\downarrow(i \mid x; A) & \text{else} \end{cases} \qquad (3)$$

and

$$b_\uparrow(i \mid x; A') = \begin{cases} \max(b_\uparrow(i \mid x; A),\ b_\uparrow(i \mid \hat y; A)) & \text{if } \hat x \preceq_A x, \\ \max(b_\uparrow(i \mid x; A),\ b_\uparrow(i \mid \hat x; A)) & \text{if } \hat y \preceq_A x, \\ b_\uparrow(i \mid x; A) & \text{else.} \end{cases} \qquad (4)$$

Note that this is well-defined, as $x \preceq_A \hat x$ and $x \preceq_A \hat y$ (or $\hat x \preceq_A x$ and $\hat y \preceq_A x$) implies

$$\min(b_\downarrow(i \mid x; A),\ b_\downarrow(i \mid \hat y; A)) = \min(b_\downarrow(i \mid x; A),\ b_\downarrow(i \mid \hat x; A)) = b_\downarrow(i \mid x; A)$$

(or

$$\max(b_\uparrow(i \mid x; A),\ b_\uparrow(i \mid \hat y; A)) = \max(b_\uparrow(i \mid x; A),\ b_\uparrow(i \mid \hat x; A)) = b_\uparrow(i \mid x; A),$$

respectively).
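Corollary 2 and formulae (3) and (4) can be sketched as an incremental update on two tables of consistency bounds. The table layout (`down[x][i]` for $b_\downarrow(i \mid x; A)$ and `up[x][i]` for $b_\uparrow(i \mid x; A)$), the tuple encoding of sites and the names `init_bounds`, `add_pair` and `greedy` are assumptions for illustration. The sketch relies on the observation that $x \preceq_A [i|j]$ holds exactly when $b_\downarrow(i \mid x; A) \le j$, and it rejects pairs that are already aligned, since those are not proper extensions.

```python
def init_bounds(lengths):
    """Consistency bounds of the trivial alignment (no two distinct sites aligned)."""
    down, up = {}, {}
    for h, L in lengths.items():
        for j in range(1, L + 1):
            down[(h, j)] = {i: (j if i == h else lengths[i] + 1) for i in lengths}
            up[(h, j)] = {i: (j if i == h else 0) for i in lengths}
    return down, up

def add_pair(lengths, down, up, xh, yh):
    """Try to align sites xh and yh; update the bounds in place if accepted."""
    i1, j1 = yh
    if not (up[xh][i1] < j1 < down[xh][i1]):    # criterion of Corollary 2
        return False

    def le(x, y):                               # x <=_A y, read from the old bounds
        return down[x][y[0]] <= y[1]

    # evaluate all conditions against the old alignment before changing anything
    below = [x for x in down if le(x, xh) or le(x, yh)]
    above = [x for x in down if le(xh, x) or le(yh, x)]
    new_down = {i: min(down[xh][i], down[yh][i]) for i in lengths}
    new_up = {i: max(up[xh][i], up[yh][i]) for i in lengths}
    for x in below:                             # formula (3)
        for i in lengths:
            down[x][i] = min(down[x][i], new_down[i])
    for x in above:                             # formula (4)
        for i in lengths:
            up[x][i] = max(up[x][i], new_up[i])
    return True

def greedy(lengths, pairs):
    """Accept the given pairs one by one whenever they extend the alignment consistently."""
    down, up = init_bounds(lengths)
    accepted = []
    for xh, yh in pairs:
        if add_pair(lengths, down, up, xh, yh):
            accepted.append((xh, yh))
    return accepted, down, up

lengths = {1: 3, 2: 3}
pairs = [((1, 1), (2, 2)), ((1, 2), (2, 1)), ((1, 2), (2, 3)), ((1, 3), (2, 1))]
accepted, down, up = greedy(lengths, pairs)
```

In this run the second and fourth pairs cross pairs accepted earlier and are rejected; the remaining two are kept.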

Clearly, these formulae allow us to formally set up the above mentioned recursive procedure for constructing alignments greedily for any given sequence $(x_1, y_1), \dots, (x_k, y_k)$ of pairs of elements from $S$ that we wish to align. If the sequence family under consideration consists of $N$ sequences $s_1, \dots, s_N$ of total length $L := L_1 + \dots + L_N = \#S$, equations (3) and (4) allow us to compute the new values of $b_\downarrow$ and $b_\uparrow$ in $O(NL)$ time and space whenever a new pair $(x_i, y_i)$ is included into the growing alignment.

Remark: Note that, with $x = [i_0|j_0]$, (3) is equivalent to

$$b_\downarrow(i \mid x; A') = \begin{cases} b_\downarrow(i \mid \hat y; A) & \text{if } b_\uparrow(i_0 \mid [i \mid b_\downarrow(i \mid \hat y; A)]; A) < j_0 \le b_\uparrow(i_0 \mid \hat x; A), \\ b_\downarrow(i \mid \hat x; A) & \text{if } b_\uparrow(i_0 \mid [i \mid b_\downarrow(i \mid \hat x; A)]; A) < j_0 \le b_\uparrow(i_0 \mid \hat y; A), \\ b_\downarrow(i \mid x; A) & \text{else} \end{cases}$$

and that (4) can be rewritten in a similar way, which allows for a rather fast updating procedure, essentially of the order $N^2$.

Based on a graph-theoretical approach, Abdeddaïm has proposed an algorithm which, consistency provided, includes the pairs $(x_1, y_1), \dots, (x_k, y_k)$ successively into a growing alignment by computing the values of $b_\downarrow$ and $b_\uparrow$ every time a new pair $(x_i, y_i)$ is included into the alignment. This algorithm takes $O(kL + L^2)$ time and $O(NL)$ space for processing all pairs $(x_1, y_1), \dots, (x_k, y_k)$ (Abdeddaïm, 1997 b).

Next, we generalize formulae (3) and (4) to the case in which an alignment $A$ is extended by incorporating an arbitrary pairwise alignment $P$. To this end, we define:

Definition 2 Let $s_I$ be a sequence family. An alignment $P$ of $s_I$ is called a pairwise alignment if there exist $i, j \in I$ so that $P$ is the equivalence closure of a consistent relation $R = \{(x_1, y_1), \dots, (x_k, y_k)\} \subseteq S(i) \times S(j)$ (or, equivalently, if, using the notation introduced in the Appendix, we have $\mathrm{girth}_\preceq(\mathrm{supp}(P \setminus D_S)) \le 2$).

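So let $A$ be an arbitrary alignment, assume $(x_1, y_1), \dots, (x_k, y_k) \in S(i) \times S(j)$ such that $P := \{(x_1, y_1), \dots, (x_k, y_k)\}^e$ is a pairwise alignment, and assume that $A' := A \cup P$ is a consistent extension of $A$. Then, applying Proposition 1 and Corollary 2 repeatedly, we get

Corollary 3 For all $x, y \in S$ and $i \in I$, one has

$$x \preceq_{A'} y \iff \begin{cases} x \preceq_A y, \text{ or} \\ (x \preceq_A x_\kappa \text{ and } y_\kappa \preceq_A y \text{ for some } \kappa \in \{1, \dots, k\}), \text{ or} \\ (x \preceq_A y_\kappa \text{ and } x_\kappa \preceq_A y \text{ for some } \kappa \in \{1, \dots, k\}) \end{cases}$$

as well as

$$b_\downarrow(i \mid x; A') = \min\bigl(\{b_\downarrow(i \mid x; A)\} \cup \{\min(b_\downarrow(i \mid x_\kappa; A),\ b_\downarrow(i \mid y_\kappa; A)) \mid \kappa \in \{1, \dots, k\} \text{ and } x \preceq_A x_\kappa \text{ or } x \preceq_A y_\kappa\}\bigr)$$

and

$$b_\uparrow(i \mid x; A') = \max\bigl(\{b_\uparrow(i \mid x; A)\} \cup \{\max(b_\uparrow(i \mid x_\kappa; A),\ b_\uparrow(i \mid y_\kappa; A)) \mid \kappa \in \{1, \dots, k\} \text{ and } x_\kappa \preceq_A x \text{ or } y_\kappa \preceq_A x\}\bigr).$$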

4 Consistency of Pairwise Alignments

Recently, Stoye has shown the following fact concerning consistency of pairwise standard alignments: consider a set of $N$ sequences and pairwise standard alignments $A_{i,j}$ for all $1 \le i < j \le N$. If any three of these pairwise alignments are consistent, then all of them are simultaneously consistent (Stoye, 1997). It is remarkable that this result depends crucially on the underlying alignment definition: generally, triplewise consistency of a family of pairwise alignments in the sense of Definition 2 does not imply simultaneous consistency of the family. Consider, for instance, the site space

$$S := \{[i|j] : i \in \{1, \dots, 4\},\ j \in \{1, 2\}\}$$

and the pairwise alignments

$$P_{i,i'} := \begin{cases} (D_S \cup \{([i|1], [i'|2])\})^e & \text{if } i' = i + 1 \pmod 4, \\ D_S & \text{otherwise.} \end{cases}$$

Here, any three of the pairwise alignments are consistent, but the family $(P_{i,i'})_{i,i' \in \{1, \dots, 4\}}$ is not consistent. The point is that standard alignments contain more information than consistent equivalence relations in the sense of Definition 1. We will discuss this difference in the last section of this paper. However, the following theorem states that triplewise consistency of pairwise alignments does imply simultaneous consistency if the pairwise alignments are maximal in a certain sense.

Theorem 1 Let $s_I$ be a sequence family and suppose that for every pair $(i, j) \in I^2$ there is a pairwise alignment $P_{i,j}$ of the sequences $s_i$ and $s_j$ such that the following two conditions hold:

(a) For every pair $(i, j)$, the alignment $P_{i,j}$ is $\{i, j\}$-maximal.

(b) For all $i, j, k \in I$, the family $(P_{i,j}, P_{i,k}, P_{j,k})$ is consistent.

Then the union $\bigcup_{i,j \in I} P_{i,j}$ is an alignment of $s_I$.
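The four-sequence counterexample given before Theorem 1 can be checked mechanically. The sketch below is a brute-force illustration under the assumption that sites are encoded as tuples `(i, j)`; only the four non-trivial pairwise alignments are enumerated, since the remaining ones equal $D_S$ and contribute no aligned pairs. The function name `is_consistent` is hypothetical.

```python
from itertools import combinations

def is_consistent(lengths, R):
    """Brute-force test: close R u R^-1 u <= transitively and look for order violations."""
    S = [(i, j) for i, L in lengths.items() for j in range(1, L + 1)]
    le = {(x, y) for x in S for y in S if x[0] == y[0] and x[1] <= y[1]}
    le |= set(R) | {(y, x) for (x, y) in R}
    for z in S:                      # Warshall's algorithm
        for x in S:
            if (x, z) in le:
                for y in S:
                    if (z, y) in le:
                        le.add((x, y))
    return all(j <= k for ((i, j), (h, k)) in le if i == h)

lengths = {i: 2 for i in range(1, 5)}            # four sequences of length 2
# P[i] aligns [i|1] with [i'|2] where i' = i + 1 (mod 4)
P = {i: {((i, 1), (i % 4 + 1, 2))} for i in range(1, 5)}

triples_ok = all(is_consistent(lengths, set().union(*(P[i] for i in t)))
                 for t in combinations(P, 3))
all_ok = is_consistent(lengths, set().union(*P.values()))
```

Every choice of three pairwise alignments passes the test, while the union of all four closes a cycle through all eight sites and fails it.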

Proof: For any $x, y \in S$ with, say, $x \in s_i$ and $y \in s_j$, we have by definition either $x \preceq_{P_{i,j}} y$ or $y \preceq_{P_{i,j}} x$. In other words, we have

$$S^2 = \bigcup_{i,j \in I} \bigl(\preceq_{P_{i,j}} \cup \preceq_{P_{i,j}}^{-1}\bigr).$$

Now our assertion follows from Corollary 7. □

Definition 3

(a) Let $s_I$ be a sequence family. For every alignment $A$ of $s_I$ and every $i, j \in I$, we define

$$T_{i,j}(A) := (A \cap (S(i) \times S(j)))^e. \qquad (5)$$

(b) Consider a real-valued, monotonically increasing function $w$ defined on the set of all pairwise alignments of $s_i$ and $s_j$; $i, j \in I$. A pairwise alignment $P$ of $s_i$ and $s_j$ is called a ($w$-)optimal alignment of the sequences $s_i$ and $s_j$ if $w(Q) \le w(P)$ holds for all pairwise alignments $Q$ of $s_i$ and $s_j$.

(c) If the set $I$ is finite, we call an alignment $A$ of $s_I$ a (sum-of-pairs) ($w$-)optimal alignment of $s_I$ if

$$\sum_{i,j \in I} w(T_{i,j}(B)) \le \sum_{i,j \in I} w(T_{i,j}(A)) \qquad (6)$$

holds for all alignments $B$ of $s_I$.

Remark 1 $T_{i,j}(A)$ is the largest pairwise alignment of the sequences $s_i$ and $s_j$ contained in $A$.
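The projection $T_{i,j}(A)$ and the sum in (6) can be sketched as follows. The alignment is represented by its non-trivial equivalence classes (sets of sites encoded as tuples), $T_{i,j}(A)$ is represented by its non-trivial aligned pairs, and the weight `w` defaults to the number of aligned pairs. That weight is only one hypothetical example of a monotonically increasing function; the paper does not fix a particular $w$, and the names `project` and `sum_of_pairs` are assumptions.

```python
from itertools import product

def project(classes, i, j):
    """Non-trivial aligned pairs of T_{i,j}(A): one site in sequence i, one in sequence j."""
    return {frozenset((x, y)) for c in classes for x in c for y in c
            if x[0] == i and y[0] == j and x != y}

def sum_of_pairs(classes, I, w=len):
    """Sum of w(T_{i,j}(A)) over all (i, j) in I x I, as in inequality (6)."""
    return sum(w(project(classes, i, j)) for i, j in product(I, repeat=2))

# A toy alignment of three sequences given by its non-trivial classes.
classes = [{(1, 1), (2, 1), (3, 1)}, {(1, 2), (3, 2)}]
I = (1, 2, 3)
score = sum_of_pairs(classes, I)
```

Because the sum runs over ordered pairs $(i, j)$, every unordered pair of sequences is counted twice, and the diagonal terms contribute nothing for a consistent alignment.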

Theorem 2 Consider a finite set $I$, a sequence family $s_I$, a real-valued, monotonically increasing function $w$ defined on the set of all pairwise alignments of $s_i$ and $s_j$; $i, j \in I$, and let $P_{i,j}$ be a pairwise alignment of the sequences $s_i$ and $s_j$ for every $i, j \in I$. If, for every $i, j \in I$, $P_{i,j}$ is a ($w$-)optimal alignment of the sequences $s_i$ and $s_j$ and if the family $(P_{i,j})$ is consistent, then

$$A := \Bigl(\bigcup_{i,j \in I} P_{i,j}\Bigr)^e$$

is a ($w$-)optimal alignment of $s_I$.

Proof: Since $w$ is monotonically increasing and $P_{i,j}$ is optimal, the inclusion $P_{i,j} \subseteq T_{i,j}(A)$ implies $w(P_{i,j}) = w(T_{i,j}(A))$ for all $i, j \in I$. Therefore, we have for every alignment $B$ of the sequence family $s_I$

$$\sum_{i,j \in I} w(T_{i,j}(B)) \le \sum_{i,j \in I} w(P_{i,j}) = \sum_{i,j \in I} w(T_{i,j}(A)). \qquad (7)$$

□

Note that this last observation also holds for optimal 'standard' alignments (Stoye, 1997; Lemma 2.15).

5 Discussion

Listing pairs of (presumably) homologous residues or sites is a natural way of describing sequence alignments (see e.g. Smith and Waterman, 1981; Kruskal, 1983; Vingron and Argos, 1991; Miller, 1993; Miller et al., 1994). If two sequences are to be aligned, some authors call such a list of pairs of residues a 'trace' rather than an 'alignment' and use the term 'alignment' only for standard 'matrix alignments' (cf. Kruskal, 1983; Waterman, 1995). Kececioglu has used the language of graph theory to generalize the definition of a 'trace' to multiple sequence comparison (Kececioglu, 1993). Kruskal remarks that 'alignments are richer than traces' since they contain some information about the order of adjacent gaps which is not contained in a mere listing of aligned pairs of residues (see Figure 1). However, it seems doubtful whether this information can be inferred from sequence comparison alone. So if sequences are compared solely on the basis of their primary structure, it seems adequate to avoid, or at least to make no assertions about, the order of adjacent gaps and to confine ourselves to pointing out which residues of the sequences can be expected to be homologous.

Mathematically speaking, a set of aligned pairs of residues or sites is a binary relation. In the case of pairwise alignments, this relation can be regarded as a relation between the set of sites of the first sequence and that of the second one, i.e. as a subset of the cartesian product $S(1) \times S(2)$ (Miller et al., 1994). However, since a direct generalization of this definition to multiple alignments is somewhat complicated, we prefer to define an alignment as a binary relation defined on the set $S(s_I)$ of sites of all sequences. Since the linear order relations on the single sequences can be described by one single partial order $\preceq$ on $S(s_I)$ as well, we have a natural and uniform setting to treat questions arising in the context of sequence alignment, especially questions concerning the consistency of alignments. The concepts of the theory of sets and relations provide a convenient mathematical framework which allows a simple and transparent treatment of these questions.

Taylor (1996) deplores the absence of a method to assess the quality of remotely related sequences that does not depend on the unreliable exact location of gaps. By successively adding consistent gap-free pairs of segments of the sequences to an initially empty set, algorithms based on our alternative alignment definition can be developed which help to improve this situation.

A Some more Definitions, Results, and Proofs

In the following, we consider binary relations $T, R, A, B, \dots$ defined on a set $X$. For any such relations $R, R_1, R_2 \subseteq X^2$, we put (as usual)

$$R^{-1} := \{(y, x) \in X^2 \mid (x, y) \in R\}$$

and

$$R_1 \circ R_2 := \{(x, y) \in X^2 \mid \text{there exists some } z \in X \text{ with } (x, z) \in R_1 \text{ and } (z, y) \in R_2\}.$$

Recall that $R$ is called reflexive (or symmetric or transitive) if $D_X \subseteq R$ (or $R = R^{-1}$, or $R \circ R \subseteq R$, respectively) holds. Now, assume $T \subseteq X^2$ to be a transitive binary relation, and $A, B \subseteq X^2$ to be equivalence relations defined on $X$. We define $A$ to be $T$-consistent if the assertions

(A1) $T \cap T^{-1} \subseteq A$ and

(A2) $T_A \cap T^{-1} \subseteq T$

hold true, where $T_A$ denotes the smallest transitive relation defined on $X$ which simultaneously contains $T$ and $A$. More generally, we define a binary relation $R \subseteq X^2$ to be $T$-consistent if the equivalence relation $R(T) := (R \cup (T \cap T^{-1}))^e$ is $T$-consistent, and we define $T_R := T_{R(T)}$. Clearly, if $T \cap T^{-1} \subseteq A \subseteq B$ holds, then $A$ is necessarily $T$-consistent whenever $B$ is $T$-consistent, in view of

$$T_A \cap T^{-1} \subseteq T_B \cap T^{-1} \subseteq T.$$

More precisely, we have

Theorem 3 For $X$, $T$, $A$ and $B$ as above with $A \subseteq B$, the following three assertions are equivalent:

(i) $B$ is $T$-consistent and $T \cap T^{-1}$ is contained in $A$,

(ii) $T_B \cap T_A^{-1}$ is contained in $A$ and $A \cap T^{-1}$ is contained in $T$,

(iii) $B$ is $T_A$-consistent and $A$ is $T$-consistent.

Proof:

(i) ⇒ (ii): Clearly, $T_B \cap T^{-1} \subseteq T$ implies $A \cap T^{-1} \subseteq T$. Now, assume $(x, y) \in T_B$ and $(y, x) \in T_A$ for some $x, y \in X$. Then there exist elements $x_0 := x, x_1, \dots, x_n := y, x_{n+1}, \dots, x_m := x$ with $(x_{i-1}, x_i) \in B \cup T$ for $i = 1, \dots, n$ and $(x_{i-1}, x_i) \in A \cup T$ for $i = n + 1, \dots, m$. Clearly, this implies $(x_{i-1}, x_i) \in B \cup T$ for all $i = 1, \dots, m$ and, hence, also $(x_i, x_{i-1}) \in T_B$. So, for all $i = n + 1, \dots, m$, we have

$$(x_i, x_{i-1}) \in T_B \cap (A^{-1} \cup T^{-1}) = (T_B \cap A) \cup (T_B \cap T^{-1}) \subseteq A \cup (T \cap T^{-1}) \subseteq A,$$

which implies $(y, x) \in A$ as claimed.

(ii) ⇒ (iii): Clearly, the first assumption implies

$$T \cap T^{-1} \subseteq T_A \cap T_A^{-1} \subseteq A \subseteq B$$

as well as

$$T_B \cap T_A^{-1} \subseteq A \subseteq T_A,$$

so it implies in particular that $B$ is $T_A$-consistent. Similarly, the first and the second assumption together imply

$$T_A \cap T^{-1} \subseteq T_B \cap T_A^{-1} \cap T^{-1} \subseteq A \cap T^{-1} \subseteq T,$$

so $A$ is also $T$-consistent.

(iii) ⇒ (i): Clearly, $T \cap T^{-1} \subseteq A$ and $A \subseteq B$ implies $T \cap T^{-1} \subseteq B$, while $T_B \cap T_A^{-1} \subseteq T_A$ and $T_A \cap T^{-1} \subseteq T$ together imply

$$T_B \cap T^{-1} = T_B \cap T_A^{-1} \cap T^{-1} \subseteq T_A \cap T^{-1} \subseteq T.$$

□

Corollary 4 Given $X$, $T$, and $A$ as above, then $A$ is $T$-consistent if and only if $T_A \cap T_A^{-1}$ is contained in $A$ and $A \cap T^{-1}$ is contained in $T$.
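The criterion of Corollary 4 translates directly into a check on finite relations represented as Python sets of ordered pairs. The following is a minimal sketch; the helper names (`transitive_closure`, `inverse`, `equivalence`, `is_T_consistent`) and the toy example of two chains are assumptions made for illustration, and `equivalence` expects the given blocks to be disjoint.

```python
def transitive_closure(R, X):
    """Smallest transitive relation on X containing R (Warshall's algorithm)."""
    C = set(R)
    for z in X:
        for x in X:
            if (x, z) in C:
                for y in X:
                    if (z, y) in C:
                        C.add((x, y))
    return C

def inverse(R):
    return {(y, x) for (x, y) in R}

def equivalence(X, blocks):
    """Equivalence relation on X whose non-trivial classes are the given disjoint blocks."""
    A = {(x, x) for x in X}
    for b in blocks:
        A |= {(x, y) for x in b for y in b}
    return A

def is_T_consistent(X, T, A):
    """Corollary 4: T_A n T_A^-1 is contained in A, and A n T^-1 is contained in T."""
    TA = transitive_closure(T | A, X)
    return (TA & inverse(TA)) <= A and (A & inverse(T)) <= T

# Two chains a1 <= a2 and b1 <= b2, i.e. two sequences of length 2.
X = ["a1", "a2", "b1", "b2"]
T = {(x, x) for x in X} | {("a1", "a2"), ("b1", "b2")}
good = equivalence(X, [{"a1", "b2"}])                  # one aligned pair
bad = equivalence(X, [{"a1", "b2"}, {"a2", "b1"}])     # two crossing pairs
```

With `T` taken to be the order $\preceq$ on a site space, this is the test for being an alignment; the crossing relation `bad` fails because `T_A` then identifies `a1` and `a2`.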

Proof: Put $B := A$ in Theorem 3 and use that, in this case, Assertion (i) is equivalent to the assertion that $A = B$ is $T$-consistent, so this assertion is equivalent to Assertion (ii), which is exactly what is claimed in Corollary 4. □

Clearly, putting $X := S := S(s_I)$ for some family of sequences $s_I$ and $T := \preceq$, the above corollary implies the equivalence of the assertions (a) and (b) in Lemma 1, while the remaining implications "(b) ⇔ (c) ⇔ (c')" are simple consequences of the well-known facts that any reflexive and transitive relation $T$ defined on a set $X$ induces a partial order on the set of equivalence classes of $X$ relative to the equivalence relation $T \cap T^{-1}$, and that any partial order can be extended to a linear order.

Next, observe that, for any transitive and reflexive relation $T \subseteq X^2$ and any pair $(y_1, y_2) \in X^2$, we have

$$(T \cup \{(y_1, y_2), (y_2, y_1)\})^t = T \cup T_{12} \cup T_{21}$$

with

$$T_{ij} := \{(u, v) \in X^2 \mid (u, y_i), (y_j, v) \in T\} \quad \text{for } (i, j) \in \{(1, 2), (2, 1)\}.$$

Indeed, as $T \cup T_{12} \cup T_{21}$ contains $T \cup \{(y_1, y_2), (y_2, y_1)\}$ and is itself necessarily contained in $(T \cup \{(y_1, y_2), (y_2, y_1)\})^t$, it is enough to observe that $T \cup T_{12} \cup T_{21}$ is transitive, which follows immediately from the relations

$$T \circ T \subseteq T, \quad T \circ T_{ij} \subseteq T_{ij}, \quad T_{ij} \circ T \subseteq T_{ij}, \quad T_{ij} \circ T_{ij} \subseteq T_{ij}, \quad \text{and} \quad T_{ij} \circ T_{ji} \subseteq T,$$

which are easily established (with $(i, j) \in \{(1, 2), (2, 1)\}$ of course, just as above). It follows in particular that, in case $(y_1, y_2), (y_2, y_1) \notin T$, we have

$$(T \cup \{(y_1, y_2), (y_2, y_1)\})^t \cap \bigl((T \cup \{(y_1, y_2), (y_2, y_1)\})^t\bigr)^{-1} = (T \cup T_{12} \cup T_{21}) \cap (T \cup T_{12} \cup T_{21})^{-1}$$
$$= (T \cap T^{-1}) \cup \{(u, v), (v, u) \mid (u, y_1), (y_1, u), (v, y_2), (y_2, v) \in T\}$$

in view of

$$T \cap T_{12}^{-1} = T \cap T_{21}^{-1} = T_{12} \cap T^{-1} = T_{21} \cap T^{-1} = T_{12} \cap T_{12}^{-1} = T_{21} \cap T_{21}^{-1} = \emptyset$$

and

$$T_{12} \cap T_{21}^{-1} = \{(u, v) \mid (u, y_1), (y_1, u) \in T \text{ and } (v, y_2), (y_2, v) \in T\} = (T_{21} \cap T_{12}^{-1})^{-1}.$$

We can now conclude

Corollary 5 Given $X$, $T$ and $A$ as above as well as two distinct equivalence classes $Y_1, Y_2 \subseteq X$ with respect to $A$, then, assuming $T \cap T^{-1} \subseteq A$, the enlarged equivalence relation

$$B := A \cup Y_1 \times Y_2 \cup Y_2 \times Y_1$$

is $T$-consistent if and only if $A$ is $T$-consistent and we have neither $(y_1, y_2) \in T_A$ nor $(y_2, y_1) \in T_A$ for some/all $y_1 \in Y_1$ and $y_2 \in Y_2$.

Proof: Clearly, $(u, v) \in T_A$ implies $(u', v') \in T_A$ for all $u', v' \in X$ with $u'\,A\,u$ and $v'\,A\,v$. So, of course, assuming $(y_1, y_2), (y_2, y_1) \notin T_A$ for some $y_1 \in Y_1$ and $y_2 \in Y_2$ is equivalent to assuming that for all $y_1 \in Y_1$ and $y_2 \in Y_2$. Moreover, the above theorem implies that $B$ is $T$-consistent if and only if $A$ is $T$-consistent and $B$ is $T_A$-consistent which, in view of Corollary 4, clearly implies

$$B \cap T_A^{-1} \subseteq T_A$$

and, hence,

$$B \cap T_A = B^{-1} \cap T_A = (B \cap T_A^{-1})^{-1} \cap T_A \subseteq T_A^{-1} \cap T_A \subseteq A,$$

so we cannot have $(y_1, y_2) \in T_A$ or $(y_2, y_1) \in T_A$ for any $y_1 \in Y_1$ and $y_2 \in Y_2$, in view of $(y_1, y_2), (y_2, y_1) \in B \setminus A$.

Vice versa, if $A$ is $T$-consistent and we have $(y_1, y_2), (y_2, y_1) \notin T_A$ for some (and, hence, for all) $y_1 \in Y_1$ and $y_2 \in Y_2$, then $B$ is $T_A$-consistent in view of

$$B \cap T_A^{-1} = (B^{-1} \cap T_A)^{-1} = (B \cap T_A)^{-1} = A \subseteq T_A$$

and

$$T_B = (T_A \cup \{(y_1, y_2), (y_2, y_1)\})^t$$

and, hence,

$$T_B \cap T_B^{-1} = (T_A \cap T_A^{-1}) \cup \{(y_1', y_2'), (y_2', y_1') \mid (y_1', y_1), (y_1, y_1'), (y_2, y_2'), (y_2', y_2) \in T_A\}$$
$$= A \cup \{(y_1', y_2'), (y_2', y_1') \mid (y_1', y_1), (y_2, y_2') \in A\} = A \cup Y_1 \times Y_2 \cup Y_2 \times Y_1 = B.$$

□

Other useful consequences of the above results are

Corollary 6 Given $X$, $T$ and $A$ as above, $A$ is a maximal $T$-consistent equivalence relation if and only if $A$ is $T$-consistent and we have $T_A \cup T_A^{-1} = X^2$.

Corollary 7 If $A_k$ ($k \in K$) is a family of $T$-consistent equivalence relations so that

$$T_{A_{k_1}} \circ T_{A_{k_2}} \subseteq \bigcup_{k \in K} \bigl(T_{A_k} \cup T_{A_k}^{-1}\bigr),$$

then the following assertions are equivalent:

(i) $A_{k_1} \cup A_{k_2} \cup A_{k_3}$ is $T$-consistent for all $k_1, k_2, k_3 \in K$,

(ii) $A := \bigcup_{k \in K} A_k$ is a $T$-consistent equivalence relation,

(iii) $Q := \bigcup_{k \in K} T_{A_k}$ is transitive and one has $Q \cap Q^{-1} = A$.

In particular, these three assertions are equivalent in case $X^2 = \bigcup_{k \in K} (T_{A_k} \cup T_{A_k}^{-1})$, in which case $A$ is a maximal $T$-consistent equivalence relation.

Proof:

(i) ⇒ (iii): First, assume $(u, w), (w, v) \in Q$ and choose $k_1, k_2, k_3 \in K$ so that $(u, w) \in T_{A_{k_1}}$, $(w, v) \in T_{A_{k_2}}$ and $(u, v) \in T_{A_{k_3}} \cup T_{A_{k_3}}^{-1}$ according to our general assumption. Then Theorem 3, applied with respect to $B := (A_{k_1} \cup A_{k_2} \cup A_{k_3})^e$ and $A := A_{k_3}$, implies

$$(u, v) \in T_{A_{k_3}} \cup \bigl(T_{A_{k_1} \cup A_{k_2} \cup A_{k_3}} \cap T_{A_{k_3}}^{-1}\bigr) \subseteq T_{A_{k_3}} \cup A_{k_3} = T_{A_{k_3}},$$

so $\bigcup_{k \in K} T_{A_k}$ is indeed transitive.

Next, assume $(u, v), (v, u) \in Q$ and choose $k_1, k_2 \in K$ with $(u, v) \in T_{A_{k_1}}$ and $(v, u) \in T_{A_{k_2}}$. Then, as above, we have

$$(u, v) \in T_{A_{k_1} \cup A_{k_2}} \cap T_{A_{k_2}}^{-1} \subseteq T_{A_{k_2}}$$

and, hence, $(u, v) \in A_{k_2} \subseteq A$. So, we have indeed $Q \cap Q^{-1} \subseteq A$, while the opposite inclusion is trivial.

(iii) ⇒ (ii): Clearly, the assumption $Q \cap Q^{-1} = A$ implies in particular that $A$ is an equivalence relation, while the assumption that $Q$ is transitive implies that the transitive relation $T_A$ generated by $T \cup A$ coincides with $Q$. So, finally, we have

$$T_A \cap T_A^{-1} = A \quad \text{and} \quad A \cap T^{-1} = \bigcup_{k \in K} (A_k \cap T^{-1}) \subseteq T,$$

so $A$ is indeed a $T$-consistent equivalence relation.

(ii) ⇒ (i): This is trivial. □

Next, consider as above a transitive and reflexive relation $T \subseteq X^2$ and an arbitrary binary relation $R \subseteq X^2$. Define the support $\mathrm{supp}(R)$ of $R$ by

$$\mathrm{supp}(R) := \{x \in X \mid \text{there exists } y \in X \text{ with } (x, y) \in R \cup R^{-1}\}$$

and, for any subset $Y \subseteq X$, define its girth relative to $T$ by

$$\mathrm{girth}_T(Y) := \max\bigl(\#Y' \mid Y' \subseteq Y \text{ and } (Y')^2 \cap T \subseteq D_X\bigr).$$

We claim

Theorem 4 A relation $R$ is $T$-consistent if and only if any relation $R' \subseteq R$ with $\#R' \le \mathrm{girth}_T(\mathrm{supp}(R'))$ is $T$-consistent; in particular, $R$ is $T$-consistent if and only if $R'$ is $T$-consistent for all $R' \subseteq R$ with $\#R' \le \mathrm{girth}_T(\mathrm{supp}(R))$.

Proof: Clearly, if $R$ is $T$-consistent, then so is $R'$ for every subset $R' \subseteq R$. Next, observe that $R$ is $T$-consistent if and only if there is no pair $(u, v)$ in $T' := (R \cup R^{-1} \cup T)^t$ with $(v, u) \in T \setminus T^{-1}$, that is, if and only if for every (cyclic) sequence $x_0, x_1, \dots, x_k := x_0 \in X$ of elements from $X$ with $(x_{i-1}, x_i) \in R \cup R^{-1} \cup T$ for all $i = 1, \dots, k$, we have $(x_{i-1}, x_i) \in T^{-1}$ for all $i \in \{1, \dots, k\}$ with $(x_{i-1}, x_i) \in T$. So, assuming that $R$ is not $T$-consistent, we can find a shortest such sequence $x_0, x_1, \dots, x_k \in X$ with $(x_{i-1}, x_i) \in R \cup R^{-1} \cup T$ for all $i = 1, \dots, k$ which is bad, that is, for which, say, $(x_0, x_1) \in T \setminus T^{-1}$ holds. We claim that the union $R' \subseteq R$ of

$$\{(x_{i-1}, x_i) \mid i \in \{1, \dots, k\} \text{ and } (x_{i-1}, x_i) \in R \setminus T\}$$

and

$$\{(x_i, x_{i-1}) \mid i \in \{1, \dots, k\} \text{ and } (x_{i-1}, x_i) \in R^{-1} \setminus (R \cup T)\}$$

is a subrelation of $R$ with

$$\#R' \le \mathrm{girth}_T(\mathrm{supp}(R')),$$

which will surely establish Theorem 4. This is evident in case $k = 2$, as this implies $\#R' = 1$. So, the minimality of $k$ together with $(x_0, x_1) \in T \setminus T^{-1}$ implies in particular $(x_1, x_2) \notin T$ and $(x_{k-1}, x_k) = (x_{k-1}, x_0) \notin T$. Next, put

$$J := \{i \in \{1, \dots, k\} \mid (x_{i-1}, x_i) \notin T\} \subseteq \{2, \dots, k\}$$

and

$$Y := \{x_i \mid i \in J\}.$$

Clearly, we have $\#R' \le \#J$ and $Y \subseteq \mathrm{supp}(R')$. So, it remains to show that we have $\mathrm{girth}_T(Y) = \#J$, that is, $(x_i, x_j) \notin T$ for all $i, j \in J$ with $i \ne j$. So, we assume the opposite and derive a contradiction: indeed, if $(x_i, x_j) \in T$ and $i < j$, we necessarily would have $i < j - 1$ (in view of $(x_{j-1}, x_j) \notin T$ by definition of $J$), so the sequence $x_0, x_1, \dots, x_i, x_j, \dots, x_k = x_0$ would be a shorter bad sequence. Next, if $(x_i, x_j) \in T$ and $j < i$ and, in addition, $(x_i, x_j) \notin T^{-1}$, then $y_0 := x_j, y_1 := x_{j+1}, \dots, y_{i-j} := x_i, y_{i-j+1} := x_j$ would be a shorter bad sequence in view of $2 \le j$ and, hence, $i - j + 1 \le k - 2 + 1 = k - 1$. And finally, if $(x_i, x_j) \in T$, $j < i$ and $(x_i, x_j) \in T^{-1}$, then we have $(x_{i'}, x_{j'}) \in T$ for $i' := j$ and $j' := i$ from $J$ with $i' < j'$, so we can argue as in the first case. Clearly, this establishes Theorem 4. □

A simple consequence is

Corollary 8 Let a sequence family $s_I$ and an equivalence relation $A$ defined on $S = S(I)$ be given. Then $A$ is an alignment if and only if every subset $A'$ of $A$ of cardinality at most $\#I$ induces an alignment.
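The consistency criterion used in the proof of Theorem 4 can be tried out on a small finite example. The following Python sketch is only an illustration under stated assumptions: the function names, the brute-force girth computation and the toy data (two sequences of length 2, with sites written as pairs (sequence, position)) are hypothetical and not taken from the paper. It tests $T$-consistency directly via the transitive closure of $R \cup R^{-1} \cup T$ and compares the result with the restricted test over sub-relations $R'$ with $\#R' \le \mathrm{girth}_T(\mathrm{supp}(R'))$.

```python
from itertools import combinations

def transitive_closure(pairs, elements):
    # Warshall-style transitive closure of a relation on a finite set.
    closure = set(pairs)
    for k in elements:
        for i in elements:
            if (i, k) in closure:
                for j in elements:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure

def is_consistent(R, T, X):
    # The closure of R | R^-1 | T must contain no pair (u, v)
    # such that (v, u) is in T but (u, v) is not in T.
    inverse = {(v, u) for (u, v) in R}
    closure = transitive_closure(set(R) | inverse | T, X)
    return not any((v, u) in T and (u, v) not in T for (u, v) in closure)

def support(R):
    # All elements that occur in some pair of R.
    return {x for pair in R for x in pair}

def girth(Y, T):
    # Size of a largest subset of Y without two distinct T-related elements
    # (brute force, only sensible for very small Y).
    Y = list(Y)
    for size in range(len(Y), 0, -1):
        for sub in combinations(Y, size):
            if all((a, b) not in T and (b, a) not in T
                   for a, b in combinations(sub, 2)):
                return size
    return 0

def consistent_by_theorem4(R, T, X):
    # Inspect only sub-relations R' of R with #R' <= girth_T(supp(R')).
    R = list(R)
    for size in range(1, len(R) + 1):
        for sub in combinations(R, size):
            if size <= girth(support(sub), T) and not is_consistent(sub, T, X):
                return False
    return True

# Hypothetical toy data: two sequences of length 2.
X = [(i, p) for i in range(2) for p in range(2)]
T = {(u, v) for u in X for v in X if u[0] == v[0] and u[1] <= v[1]}

R_good = {((0, 0), (1, 0)), ((0, 1), (1, 1))}  # parallel pairs
R_bad = {((0, 0), (1, 1)), ((1, 0), (0, 1))}   # crossing pairs
```

In this toy setting the girth of any set of sites is at most the number of sequences, since two sites of the same sequence are always comparable; this is the observation that leads from Theorem 4 to the bound $\#I$ in Corollary 8.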

Finally, assume as above that $X$ is a set, that $T \subseteq X^2$ is a reflexive and transitive relation defined on $X$, that $A \subseteq X^2$ is a $T$-consistent equivalence relation defined on $X$ and that $B \subseteq X^2$ is an arbitrary binary relation defined on $X$ so that $B$ is $T$-consistent. Then we have:

Theorem 5 The equivalence relation $A \cup B$ is $T$-consistent if and only if $A \cup B'$ is $T$-consistent for all subsets $B'$ of $B$ with $\mathrm{girth}_T(B') \ge \#B' + 1$.


Proof: As above, it is sufficient to show that in case $A \cup B$ is not $T$-consistent there exists a subset $B' \subseteq B$ with $\mathrm{girth}_T(B') \ge \#B' + 1$ so that $A \cup B'$ is not $T$-consistent. And also as above, we proceed, assuming that $A \cup B$ is not $T$-consistent, by choosing some shortest bad sequence, that is, a sequence $x_0, x_1, \ldots, x_k = x_0 \in X$ with $(x_{i-1}, x_i) \in A \cup B \cup B^{-1} \cup T$ for all $i = 1, \ldots, k$ and $(x_{i-1}, x_i) \in T \setminus T^{-1}$ for at least one $i \in \{1, \ldots, k\}$. As we have assumed $B$ to be $T$-consistent, it is clear that there must exist some $i \in \{1, \ldots, k\}$ with
$$(x_{i-1}, x_i) \in A \setminus (T \cup B \cup B^{-1})$$

and, without loss of generality, we may now assume that this holds for i := 1. As above, we consider the set



$$J := \{i \in \{1, \ldots, k\} \mid (x_{i-1}, x_i) \notin A \cup T\} \subseteq \{2, \ldots, k\};$$
we put $j_0 := \min(J)$ and
$$B' := \{(x_{i-1}, x_i) \mid i \in J \text{ and } (x_{i-1}, x_i) \in B\} \cup \{(x_i, x_{i-1}) \mid i \in J \text{ and } (x_{i-1}, x_i) \notin B\};$$
we note that $j_0 \ge 2$ and $B' \subseteq B$ must hold, we put
$$Y := \{x_i \mid i \in J\} \cup \{x_{j_0 - 1}\}$$
and we claim that, for $i, j \in J \cup \{j_0 - 1\}$, we have $(x_i, x_j) \in T$ if and only if $i = j$ holds, which surely implies our Theorem as it implies
$$\#B' \le \#J = \mathrm{girth}_T(Y) - 1 \le \mathrm{girth}_T(\mathrm{supp}(B')) - 1.$$
So, assume $i, j \in J \cup \{j_0 - 1\}$, $i \ne j$ and $(x_i, x_j) \in T$. We have to derive a contradiction. If $i < j$, we must have $i < j - 1$, so either the sequence $x_0, x_1, \ldots, x_i, x_j, \ldots, x_k = x_0$ would be a shorter bad sequence, or we have $(x_i, x_j) \in T^{-1}$ and $x_i, x_{i+1}, \ldots, x_j, x_i$ would be a shorter bad sequence, as both are shorter and at least one of them must contain a pair of consecutive elements in $T \setminus T^{-1}$. If $j < i$, then we can't have $j = 1$ and $i = k$ in view of $(x_k, x_1) = (x_0, x_1) \in A \setminus T$, so either the sequence $x_j, x_{j+1}, \ldots, x_i, x_j$ would be a shorter bad sequence or we would have $(x_i, x_j) \in T \cap T^{-1}$, hence $j < i - 1$ and, therefore, $x_0, x_1, \ldots, x_j, x_i, x_{i+1}, \ldots, x_k$ would be a shorter bad sequence. $\Box$
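Theorem 5 can likewise be checked by brute force on a toy instance. The sketch below is a hedged illustration, not the authors' procedure: all function names and the example data (three sequences of length 2) are assumptions, and $\mathrm{girth}_T(B')$ is read as $\mathrm{girth}_T(\mathrm{supp}(B'))$, in line with the last inequality of the proof. It compares a direct consistency test of $A \cup B$ with the restricted test over subsets $B'$ satisfying the girth condition.

```python
from itertools import combinations

def transitive_closure(pairs, elements):
    # Warshall-style transitive closure of a relation on a finite set.
    closure = set(pairs)
    for k in elements:
        for i in elements:
            if (i, k) in closure:
                for j in elements:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure

def is_consistent(R, T, X):
    # No pair (u, v) of the closure of R | R^-1 | T may satisfy
    # (v, u) in T and (u, v) not in T.
    inverse = {(v, u) for (u, v) in R}
    closure = transitive_closure(set(R) | inverse | T, X)
    return not any((v, u) in T and (u, v) not in T for (u, v) in closure)

def girth(Y, T):
    # Size of a largest subset of Y without two distinct T-related elements.
    Y = list(Y)
    for size in range(len(Y), 0, -1):
        for sub in combinations(Y, size):
            if all((a, b) not in T and (b, a) not in T
                   for a, b in combinations(sub, 2)):
                return size
    return 0

def union_consistent_by_theorem5(A, B, T, X):
    # Test A | B' only for subsets B' of B with girth_T(supp(B')) >= #B' + 1.
    B = list(B)
    for size in range(1, len(B) + 1):
        for sub in combinations(B, size):
            supp = {x for pair in sub for x in pair}
            if girth(supp, T) >= size + 1 and \
                    not is_consistent(set(A) | set(sub), T, X):
                return False
    return True

# Hypothetical toy data: three sequences of length 2, sites are (sequence, position).
X = [(i, p) for i in range(3) for p in range(2)]
T = {(u, v) for u in X for v in X if u[0] == v[0] and u[1] <= v[1]}

# A: equivalence relation whose only non-trivial class is {(0, 0), (1, 1)}.
A = {(s, s) for s in X} | {((0, 0), (1, 1)), ((1, 1), (0, 0))}

# B is T-consistent on its own, but closes a cycle together with A.
B = {((1, 0), (2, 1)), ((2, 0), (0, 1))}
B_single = {((1, 0), (2, 1))}
```

Here neither pair of `B` conflicts with `A` on its own; only the full set `B`, whose support has girth 3, produces the bad cycle, so the restricted test has to reach that subset before it reports the inconsistency.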


Acknowledgement

The work of J. S. was supported by the German Research Council (DFG) graduate program (GK Strukturbildungsprozesse) and by the German Academic Exchange Service (DAAD).


References

Abdeddaïm, S. 1997a. Incremental computation of transitive closure and greedy alignment. Proc. of 8th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science 1264, 167-179.

Abdeddaïm, S. 1997b. Improvement in incremental computation of transitive closure and greedy alignment. In preparation.

Bourbaki, N. 1968. Elements of mathematics, theory of sets. Addison-Wesley, Reading, MA, USA.

Chan, S.C., Wong, A.K.C., and Chiu, D.K.Y. 1992. A survey of multiple sequence comparison methods. Bull. Math. Biol. 54, 563-598.

Kececioglu, J. 1993. The maximum weight trace problem in multiple sequence alignment. In Apostolico, A., Crochemore, M., Galil, Z., and Manber, U. (editors), Proc. of 4th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science 684, 106-119.

Kruskal, J.B. 1983. An overview of sequence comparison. In Sankoff, D. and Kruskal, J.B. (editors), Time warps, string edits, and macromolecules: The theory and practice of sequence comparison, pages 1-44. Addison-Wesley, Reading, MA, USA.

Miller, W. 1993. Building multiple alignments from pairwise alignments. CABIOS 9, 169-176.

Miller, W., Boguski, M., Raghavachari, B., Zhang, Z., and Hardison, R. 1994. Constructing aligned sequence blocks. J. Comp. Biol. 1, 51-64.

Morgenstern, B., Dress, A., and Werner, T. 1996. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93, 12098-12103.

Myers, G., Selznick, S., Zhang, Z., and Miller, W. 1996. Progressive multiple alignment with constraints. J. Comp. Biol. 3, 563-572.

Needleman, S., and Wunsch, C. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.

Smith, T., and Waterman, M. 1981. Comparison of biosequences. Adv. Appl. Math. 2, 482-489.

Stoye, J. 1997. Divide-and-conquer multiple sequence alignment. Dissertation Thesis, Technische Fakultät, Universität Bielefeld, Germany. Report 9702.

Taylor, W.R. 1996. Multiple protein sequence alignment: Algorithms and gap insertion. In Doolittle, R.F. (editor), Computer methods for macromolecular sequence analysis. Methods in Enzymology 266, 343-367.

Vingron, M., and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218, 33-43.

Vingron, M., and Pevzner, P. 1993. Multiple sequence comparison and n-dimensional image reconstruction. In Apostolico, A., Crochemore, M., Galil, Z., and Manber, U. (editors), Proc. of 4th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science 684, 243-253.

Waterman, M. 1995. Introduction to computational biology. Chapman & Hall, London.


A1 =  a b c d d d d d - - f g h
      a b c - - - - - e e f g h

A2 =  a b c - - d d d d d f g h
      a b c e e - - - - - f g h

A3 =  a b c d d - - d d d f g h
      a b c - - e e - - - f g h

A4 =  a b c d d d d d f g h
      a b c - - e e - f g h
      ...

Figure 1: Ambiguity of standard alignments. The standard alignment definition is ambiguous: Several matrices can express essentially one single way to align two sequences (A1 to A3; cf. Kruskal, 1983). In contrast, A4 shows one of many further possible alignments. Clearly, all these alignments will have to be taken into account when searching for the best standard alignment.
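The ambiguity described in the caption can be made concrete with a short Python sketch. It is only an illustration: the helper name `aligned_pairs` is hypothetical, and the four matrices are typed in as read from Figure 1 for the sequences s1 = abcdddddfgh and s2 = abceefgh, with `-` as the gap character. The sketch maps each two-row matrix to the set of position pairs that share a column; matrices that describe the same alignment yield the same set.

```python
def aligned_pairs(row1, row2, gap="-"):
    # Return the set of (i, j) such that position i of the first sequence and
    # position j of the second sequence (0-based, gaps not counted) share a column.
    if len(row1) != len(row2):
        raise ValueError("rows must have equal length")
    pairs = set()
    i = j = 0
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

# Matrices as read from Figure 1 (rows written without spaces).
A1 = ("abcddddd--fgh", "abc-----eefgh")
A2 = ("abc--dddddfgh", "abcee-----fgh")
A3 = ("abcdd--dddfgh", "abc--ee---fgh")
A4 = ("abcdddddfgh", "abc--ee-fgh")

P1, P2, P3, P4 = (aligned_pairs(*m) for m in (A1, A2, A3, A4))

print(P1 == P2 == P3)  # the first three matrices align the same sites
print(sorted(P4 - P1))  # extra site pairs aligned only by A4
```

Under this reading, A1 to A3 differ only in where the unaligned d's and e's are written, whereas A4 additionally aligns two d's with the two e's.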


[Figure 2 graphic: sequences s1 to s4 with labelled sites x, x1, x2, x3, x4 and y]

Figure 2: Quasi partial ordering induced by an alignment. An alignment $A$ (indicated by vertical lines) extends the partial order `$\preceq$' given on the set $S$ of all sites of a given sequence family to a quasi partial order `$\preceq_A$', which is the `transitive closure' of the union $A \cup \preceq$. We have $x \preceq x_1$, $x_1 \, A \, x_2$, $x_2 \preceq x_3$, $x_3 \, A \, x_4$, $x_4 \preceq y$ and therefore $x \preceq_A y$.
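The chain of relations in the caption can be reproduced on a small example. The following Python sketch is a minimal illustration under assumptions: the site coordinates are made up (three sequences of length 5, a site being a pair (sequence, position)), the order on sites is taken to be the position order within each sequence, and the names `order` and `order_A` are hypothetical. It builds the transitive closure of the union of the alignment and the order and looks up the pair (x, y).

```python
def transitive_closure(pairs, elements):
    # Warshall-style transitive closure of a relation on a finite set.
    closure = set(pairs)
    for k in elements:
        for i in elements:
            if (i, k) in closure:
                for j in elements:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure

# Hypothetical coordinates: three sequences of length 5.
S = [(i, p) for i in range(1, 4) for p in range(5)]

# Order on sites: same sequence, position not larger.
order = {(u, v) for u in S for v in S if u[0] == v[0] and u[1] <= v[1]}

x, x1 = (1, 0), (1, 2)
x2, x3 = (2, 1), (2, 3)
x4, y = (3, 2), (3, 4)

# Alignment A: the equivalence relation with classes {x1, x2} and {x3, x4};
# all other sites are only equivalent to themselves.
A = {(s, s) for s in S} | {(x1, x2), (x2, x1), (x3, x4), (x4, x3)}

# Quasi partial order induced by A: transitive closure of the union.
order_A = transitive_closure(order | A, S)

print((x, y) in order)    # x and y lie in different sequences
print((x, y) in order_A)  # x precedes y once A is taken into account
print((y, x) in order_A)  # but not the other way round
```

In this toy instance the pairs related in both directions by `order_A` are exactly the pairs of `A`, which matches the identity $T_A \cap T_A^{-1} = A$ used for consistent equivalence relations.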
