New formulations of the multiple sequence alignment problem ...

Optimization Letters, February 2011, Volume 5, Issue 1, pp 27–40
Optim Lett (2011) 5:27–40 DOI 10.1007/s11590-010-0188-8 ORIGINAL PAPER

New formulations of the multiple sequence alignment problem Thiru S. Arthanari · Hoai An Le Thi

Received: 7 September 2009 / Accepted: 12 March 2010 / Published online: 10 April 2010 © Springer-Verlag 2010

Abstract A well known formulation of the multiple sequence alignment (MSA) problem is the maximum weight trace (MWT), a 0–1 linear programming problem. In this paper, we propose a new integer quadratic programming formulation of the MSA. The number of constraints and variables in the problem is only of the order of $kL^2$, where $k$ is the number of sequences and $L$ is the total length of the sequences, that is, $L = \sum_{i=1}^{k} l_i$, where $l_i$ is the length of sequence $i$. Based on this formulation we introduce an equivalent linear constrained 0–1 quadratic programming problem. We also propose a 0–1 linear programming formulation of the MWT problem, with polynomially many constraints. Our formulation provides the first direct compact formulation that ensures that the critical circuit inequalities (which are exponentially many) are all met.

Keywords Multiple sequence alignment (MSA) · Integer quadratic programming · Linear constrained 0–1 quadratic programming · Maximum weight trace (MWT) · 0–1 Linear programming · DC programming, DCA

Part of this work was completed during the visit of T. S. Arthanari at the Laboratory of Theoretical and Applied Computer Science, UFR MIM, Paul Verlaine University-Metz, France.

T. S. Arthanari
Department of ISOM, University of Auckland, Auckland, New Zealand
e-mail: [email protected]

H. A. Le Thi (B)
Laboratory of Theoretical and Applied Computer Science, UFR MIM, Paul Verlaine University-Metz, Ile du Saulcy, 57045 Metz, France
e-mail: [email protected]


1 Introduction

Computational biology provides many challenging combinatorial optimization problems, and the multiple sequence alignment (MSA) problem is one of them. Dynamic programming has been applied to solve this problem, but the curse of dimensionality prevents its use when the number of sequences is large or the sequences are long [18]. Recently, Greenberg [7] gave quadratic integer programming formulations of some problems in computational biology, including the MSA problem. Other efficient approaches to similar problems are found in [24] and [25].

Kececioglu [10] introduced and studied the maximum weight trace (MWT) formulation of the MSA problem. Reinert et al. [26] consider the trace polytope and give an integer programming formulation of the MWT. They show that the separation problems corresponding to some classes of facet defining inequalities can be solved in polynomial time, and they use polyhedral combinatorics to develop a branch-and-cut algorithm for solving the problem. Their computational experiments show that the problem size, in terms of both the number of sequences and the sequence length, could be increased beyond what was feasible with dynamic programming approaches. Polyhedral approaches to sequence alignment problems such as the generalized maximum trace problem and the RNA sequence alignment problem are studied by Kececioglu et al. [11] and Lenhof [17].

Le Thi et al. [16] propose a continuous optimization approach based on DC (difference of convex functions) programming and DCA (DC algorithm) to the MSA using the 0–1 linear formulation of the MWT. The approach is inexpensive because it amounts to solving a few linear programs, and it converges after a finite number of iterations to an integer solution while working in a continuous domain. Numerical simulation experiments show the superiority of this approach using DCA with respect to other methods. Prestwich et al.
[23] give a pseudo-boolean local search algorithm and compare its performance with state-of-the-art algorithms for solving the MSA problem. They report longer execution times, although the solution quality is better in some benchmark cases. They also have a polynomial size 0–1 formulation of the MWT problem.

In this paper, a new integer quadratic programming formulation, which directly addresses the MSA problem, is proposed. The number of constraints and variables in the problem is only of the order of $kL^2$, where $k$ is the number of sequences and $L$ is the total length of the sequences, that is, $L = \sum_{i=1}^{k} l_i$, where $l_i$ is the length of sequence $i$. Based on the new quadratic formulation of the MSA problem, an equivalent linear constrained 0–1 quadratic programming problem is introduced. A linear 0–1 formulation of the MWT problem is also given. This formulation has in addition $(2L + 1)|E|$ constraints, where $|E|$ is the number of edges in the alignment graph. Thus we have a compact formulation of the MWT problem.

Section 2 gives a brief description of the MSA problem, the required preliminaries and the 0–1 linear programming formulation of the MWT problem. Section 3 develops the new formulations of the MSA problem. Section 3.1 contains the integer quadratic formulation and proves its connection to the MWT problem. Section 3.2 presents the equivalent linear constrained 0–1 quadratic formulation. Section 3.3 gives a compact 0–1 linear formulation of the MWT problem. Section 4 ends with an outline of the computational experiments planned for comparing the quality of solutions and the computational efficiency of the different formulations and suggested approaches.

2 The multiple sequence alignment problem

The sequence alignment problem appears in different analyses related to biological reconstruction. For instance, the construction of plausible evolutionary trees uses multiple sequence alignment as a tool [12]. The sequence of a DNA molecule or a protein can be modeled as a string over an appropriate alphabet. The multiple sequence alignment problem (MSA) can be defined as follows:

Definition 1 Alignment
Let $\Sigma$ be a finite alphabet and let $\hat{\Sigma} = \Sigma \cup \{-\}$, where “−” is a symbol to represent gaps in strings. Let $S = \{S_1, S_2, \ldots, S_k\}$ be a set of finite strings over the alphabet $\Sigma$, where each string $S_i$, $i = 1, \ldots, k$, consists of characters $s_{i,1}, \ldots, s_{i,l_i} \in \Sigma$. A set $\hat{S} = \{\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_k\}$ of strings over the alphabet $\hat{\Sigma}$ is called an alignment of $S$ if the following properties hold: (1) all strings in $\hat{S}$ have the same length $l$ and (2) ignoring gaps, $\hat{S}_i$ is identical with $S_i$, $\forall i = 1, \ldots, k$.

An alignment in which each sequence $\hat{S}_i$ has length $l$ can be associated with an array of $k$ rows and $l$ columns, where row $i$ corresponds to $\hat{S}_i$. Two characters of distinct sequences in $S$ are said to be aligned under $\hat{S}$ if they are placed into the same column of the alignment array (corresponding to $\hat{S}$). Depending on the measure of the quality of alignment, the multiple sequence alignment problem aims to find the ‘best’ alignment in some sense. For example, when two sequences are aligned, each column receives a score depending on the characters aligned in it, and the scores are added over columns to measure the value of the alignment. An alignment that maximizes the value is picked. This idea is extended to multiple sequence alignment as well. Notice that adding a column of ‘−’ to an alignment increases its length by one.
But it should not add value, as the newly added column does not contain any character from the original alphabet. If it did, we could add more columns of this kind and increase the value of the alignment without limit. Therefore, without loss of generality, we will assume in addition that an alignment satisfies (3) no column of an alignment has “−” in all its rows. That is, for each $r$, $1 \le r \le l$, there exists an $i_r$ such that $\hat{s}_{i_r,r} \ne$ “−”.

With this assumption, no alignment will have length larger than $\sum_{i=1}^{k} l_i$. Notice that the length of any alignment cannot be less than $\max_{i=1,\ldots,k}(l_i)$. Thus the length of an alignment satisfies:

$$\max_{i=1,\ldots,k}(l_i) \le l \le \sum_{i=1}^{k} l_i.$$

Unless otherwise stated we assume an alignment meets the above three requirements.
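The three requirements and the length bounds can be checked mechanically. The following is a minimal sketch, assuming sequences are Python strings, the gap symbol is the ASCII character "-" (not part of the alphabet), and the function names `is_alignment` and `length_bounds` are illustrative only.

```python
GAP = "-"  # assumed gap symbol; must not occur in the input alphabet


def is_alignment(seqs, rows):
    """Check requirements (1)-(3) for a candidate alignment `rows` of `seqs`."""
    if not rows or len(rows) != len(seqs):
        return False
    l = len(rows[0])
    # (1) all aligned strings have the same length l
    if any(len(row) != l for row in rows):
        return False
    # (2) ignoring gaps, row i is identical with sequence i
    if any(row.replace(GAP, "") != seq for row, seq in zip(rows, seqs)):
        return False
    # (3) no column has a gap in all its rows
    if any(all(row[r] == GAP for row in rows) for r in range(l)):
        return False
    return True


def length_bounds(seqs):
    """Return (max_i l_i, sum_i l_i), the bounds on the alignment length l."""
    lengths = [len(s) for s in seqs]
    return max(lengths), sum(lengths)


seqs = ["GATTACA", "GATCA", "ATTCA"]
rows = ["GATTACA", "GAT--CA", "-ATT-CA"]
low, high = length_bounds(seqs)
print(is_alignment(seqs, rows), low <= len(rows[0]) <= high)
```

Appending an all-gap column to `rows` makes the check fail on requirement (3), which is exactly the case the assumption is meant to exclude.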


As noted earlier, dynamic programming approaches are used in searching for such optimal alignments. The notion of trace of two sequences was generalized to multiple sequences by Kececioglu [9]. This leads us to the discussion of the maximum weight trace (MWT) problem.

Definition 2 Alignment graph
Let $\Sigma$ be a finite alphabet. Let $S = \{S_1, S_2, \ldots, S_k\}$ be a set of finite strings over the alphabet $\Sigma$, where each string $S_i$, $i = 1, \ldots, k$, consists of characters $s_{i,1}, \ldots, s_{i,l_i} \in \Sigma$. Consider the graph $G = (V, E)$, where the vertex set $V$ has a vertex for each $s_{i,j}$ and the edge set represents pairs of characters (base symbols) that one would like to have aligned in an alignment of the input sequences. In addition, a nonnegative weight $w(e)$ may be attached to each edge. We say that an edge of the alignment graph is realized by an alignment if the endpoints of the edge are placed in the same column of the corresponding alignment array.

Definition 3 Trace of an alignment
Given an alignment $\hat{S}$ corresponding to a set of sequences $S$, we call the subset of edges $T \subset E$ realized by $\hat{S}$ the trace of the alignment. Given a nonnegative weight $w(e)$ attached to each edge (without loss of generality we can assume the weights are positive), $w(T) = \sum_{e \in T} w(e)$ is the weight of the trace.

The multiple sequence alignment problem can now be stated as a maximum weight trace problem (MWT): given a set $S = \{S_1, S_2, \ldots, S_k\}$ of finite strings over the finite alphabet $\Sigma$, and an alignment graph $G = (V, E)$ with nonnegative edge weights, find an alignment such that its trace has maximum weight. While MWT is known to be NP-complete [10], polyhedral combinatorics has been successfully used in MWT by Reinert et al. [26]. We shall give some graph preliminaries and then present the 0–1 integer formulation of the MWT problem as developed by them.

2.1 Graph preliminaries

We follow standard graph theoretic notation. A mixed graph is a tuple $G = (V, E, A)$ where $V$ is a set of vertices, $E$ is a set of edges and $A$ is a set of arcs. We write $e = \{v_i, v_j\}$ if $e$ is an edge and $e = (v_i, v_j)$ if $e$ is an arc directed from $v_i$ to $v_j$. A path in a mixed graph is an alternating sequence $v_1, e_1, v_2, e_2, \ldots, v_k$ of vertices and arcs or edges such that in $v_i, e_i, v_{i+1}$ the consecutive vertices form the end points of the edge or arc $e_i$, for all $i$, $1 \le i \le k - 1$. A path is called a cycle if the first and the last vertex on the path are the same. We use the adjective ‘mixed’ to denote paths or cycles which contain at least one arc from $A$ and at least one edge from $E$. The number of edges in a mixed path is called its size. In addition, we use the term length of a path (cycle) to denote the number of edges and arcs it has.

Definition 4 Extended alignment graph [26]


Given an alignment graph $G = (V, E)$, we consider the mixed graph $(V, E, H)$ obtained by adding a set of directed arcs $H = \{(s_{i,j}, s_{i,j+1}) \mid 1 \le i \le k,\ 1 \le j < l_i\}$, where $s_{i,j}$ is the vertex in $V$ corresponding to the $j$th base symbol in sequence $S_i$. We call this mixed graph the extended alignment graph (EAG).

In addition, [26] gives the definition of a critical mixed cycle.

Definition 5 Critical mixed cycle
Given the extended alignment graph $G = (V, E, H)$, we call a mixed cycle $R$ critical if for all $i$, $1 \le i \le k$, all vertices in $R \cap S_i$ occur consecutively in $R$.

The important characterization of traces using critical mixed cycles is given as Theorem 1 in [26], which is stated here as Theorem 1.

Theorem 1 Let $G = (V, E, H)$ be an EAG, let $T \subset E$ and let $G' = (V, T, H)$ be the EAG induced by $T$. Then $T$ is a trace $\iff$ there is no critical mixed cycle in $G'$.

Using this theorem we are in a position to state the integer programming formulation in [26].

2.2 0–1 Linear programming formulation of the MWT problem

Let $x_e$ be the indicator variable of an edge $e \in E$ being in the trace $T$ of an alignment $\hat{S}$. That is, $x_e = 1$ if $e$ is in $T$ and 0 otherwise. Then the MWT problem can be stated as:

$$\text{(MWT)} \quad \max \sum_{e \in E} w_e x_e$$

$$\text{s.t.} \quad \sum_{e \in R \cap E} x_e \le |R \cap E| - 1, \quad \forall \text{ critical mixed cycles } R \text{ in } G, \tag{1}$$

$$x_e \in \{0, 1\}, \quad \forall e \in E.$$

In addition to this formulation, a rich polyhedral combinatorics is developed in [26] and used subsequently in [11] with respect to more general problems. We summarize some of the facet defining inequalities of the trace polytope, which is the convex hull of the indicator vectors of traces. For the two sequence case, the trivial inequalities ($x_e \ge 0$, $x_e \le 1$ for $e \in E$) and the clique inequalities ($\sum_{e \in E} x_e \le 1$) provide the complete description of the trace polytope. Reinert et al. [26] also show that, in the general case, the separation problem for the mixed cycle inequalities (the inequalities (1) in the MWT integer formulation) can be solved in polynomial time. Other classes of facet defining inequalities are identified in [26] and are used in developing a branch-and-cut algorithm. According to their experimental comparisons with existing approaches, this algorithm solves problems beyond the limits of dynamic programming approaches.


3 New formulations of the MSA problem

We are now ready to give the main results of the paper, namely the new formulations of the MSA problem, after introducing the notation required for the quadratic integer formulation. Let any alignment be given by an array having one row for each of the $k$ sequences and $l$ columns. We call $(i, r)$ a cell. Each cell contains either a “−” or an $s_{i,j}$ for some $j$. In any row $i$, the basic symbols $s_{i,j}$ of the sequence $S_i$ appear in cells $(i, r_j)$ such that $1 \le r_1 < r_2 < \cdots < r_{l_i}$. An edge $(s_{i,j}, s_{p,q})$ of the alignment graph is realized if the end points of the edge appear in the same column $r$, that is, $s_{i,j}$ is in cell $(i, r)$ and $s_{p,q}$ is in cell $(p, r)$ for some $r$. We are interested in maximizing the sum of the weights of the edges realized by an alignment.

3.1 Integer quadratic formulation of the MSA problem

Let $a_{ijr}$ be defined such that $a_{ijr} = 1$ if $s_{i,j}$ occupies the cell $(i, r)$ of the alignment, otherwise $a_{ijr} = 0$, for all $i, j, r$: $1 \le i \le k$, $1 \le j \le l_i$ and $j \le r \le l$. We next give the quadratic integer formulation of the MSA. Since the formulation gives a 1–1 correspondence between alignments and 0–1 solutions to the formulation, we call this formulation the true sequence alignment (TSA) formulation.

(TSA Formulation)

$$\underset{l,\ a_{ijr}}{\text{maximize}} \quad \sum_{(s_{i,j}, s_{p,q}) \in E} \ \sum_{r=1}^{l} w((s_{i,j}, s_{p,q}))\, a_{ijr}\, a_{pqr}$$

subject to

$$\max_{i=1,\ldots,k}(l_i) \le l \le \sum_{i=1}^{k} l_i. \tag{2}$$

$$l \ \text{integer}. \tag{3}$$

$$\sum_{j=1}^{l_i} a_{ijr} \le 1 \quad \text{for all } i, r. \tag{4}$$

$$\sum_{r=1}^{l} a_{ijr} = 1 \quad \text{for all } i, j. \tag{5}$$

$$a_{ijr} - \sum_{s=j-1}^{r-1} a_{i(j-1)s} \le 0 \quad \text{for all } i, r, \text{ and } 2 \le j. \tag{6}$$

$$\sum_{i=1}^{k} \sum_{j=1}^{l_i} a_{ijr} \ge 1 \quad \text{for all } r. \tag{7}$$

$$a_{ijr} \in \{0, 1\} \quad \text{for all } i, j, r. \tag{8}$$
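As an illustration of how restrictions (2)–(8) act on a candidate solution, the following sketch checks them by brute force. It assumes a 0–1 solution is stored as a set `ones` of 1-based triples `(i, j, r)` with $a_{ijr} = 1$ (every other variable is 0); the function name `tsa_feasible` and this storage choice are assumptions made for the example, not part of the formulation.

```python
def tsa_feasible(lengths, l, ones):
    """Check restrictions (2)-(8) of the TSA formulation.

    lengths: list of sequence lengths l_1..l_k
    l:       number of columns
    ones:    set of 1-based triples (i, j, r) with a_ijr = 1
    """
    k = len(lengths)
    # (2)-(3): l is an integer between max l_i and sum l_i
    if not (isinstance(l, int) and max(lengths) <= l <= sum(lengths)):
        return False
    # (8) and index ranges: every triple names a real character and column
    for (i, j, r) in ones:
        if not (1 <= i <= k and 1 <= j <= lengths[i - 1] and 1 <= r <= l):
            return False

    def a(i, j, r):
        return 1 if (i, j, r) in ones else 0

    for i in range(1, k + 1):
        li = lengths[i - 1]
        for r in range(1, l + 1):
            # (4): cell (i, r) holds at most one character of sequence i
            if sum(a(i, j, r) for j in range(1, li + 1)) > 1:
                return False
        for j in range(1, li + 1):
            # (5): character s_{i,j} occupies exactly one column
            if sum(a(i, j, r) for r in range(1, l + 1)) != 1:
                return False
            if j >= 2:
                for r in range(1, l + 1):
                    # (6): s_{i,j} may sit in column r only if s_{i,j-1}
                    # sits in some column s with j-1 <= s <= r-1
                    if a(i, j, r) > sum(a(i, j - 1, s) for s in range(j - 1, r)):
                        return False
    # (7): every column contains at least one character
    return {r for (_, _, r) in ones} == set(range(1, l + 1))


# Sequences "AB" and "B" aligned as AB / -B
good = {(1, 1, 1), (1, 2, 2), (2, 1, 2)}
# Same characters, but the order inside sequence 1 is reversed
bad = {(1, 1, 2), (1, 2, 1), (2, 1, 1)}
print(tsa_feasible([2, 1], 2, good), tsa_feasible([2, 1], 2, bad))
```

The second candidate satisfies (4), (5) and (7) and is rejected only by the ordering restriction (6).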


We formally state the 1–1 correspondence between alignments and the 0–1 solutions to the TSA formulation as Theorem 2.

Theorem 2 Let $\Sigma$ be a finite alphabet and let $\hat{\Sigma} = \Sigma \cup \{-\}$, where “−” is a symbol to represent gaps in strings. Given a set $S = \{S_1, S_2, \ldots, S_k\}$ of finite strings over the alphabet $\Sigma$, every alignment $\hat{S} = \{\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_k\}$ of $S$, consisting of strings over the alphabet $\hat{\Sigma}$, corresponds to an integer solution to the TSA formulation and vice versa.

Proof Since $\hat{S}$ is an alignment, by definition the length of $\hat{S}_i$ is $l$ for all $i$, $1 \le i \le k$. So $l$ lies between $\max_{1 \le i \le k} l_i$ and $L = \sum_{i=1}^{k} l_i$. Thus $l$ satisfies restrictions (2–3). Assign $a_{ijr} = 0$ for all $j$ if “−” appears in cell $(i, r)$; otherwise assign $a_{ijr} = 1$ for the $j$ such that $s_{i,j}$ appears in that cell, and $a_{ijr} = 0$ for every other $j$. Repeat this for all cells. Now we have a 0–1 solution $a \in \{0, 1\}^{l \times L}$, which meets the requirement that the $a_{ijr}$ are 0–1 integers [restrictions (8)]. From item 3 of Definition 1, that is, no column of $\hat{S}$ is a column of “−”, restrictions (7) are met. Since in a cell $(i, r)$ either a “−” or an $s_{i,j}$ appears for some $j$, restrictions (4) are met. Since each $s_{i,j}$ of a given sequence $S_i$ appears in exactly one cell $(i, r)$, restrictions (5) are met. As $\hat{S}$ is an alignment, deleting the “−” from $\hat{S}_i$ yields $S_i$, for each $i$. Equivalently, as noticed earlier, in any row the basic symbols $s_{i,j}$ of the sequence $S_i$ appear in cells $(i, r_j)$ such that $1 \le r_1 < r_2 < \cdots < r_{l_i}$. Hence $a_{ijr_j} = 1$, for $j = 1, \ldots, l_i$, satisfies the restrictions (6), as the basic symbol $s_{i,j}$ appears in column $r$ only if the preceding symbol $s_{i,j-1}$ has already appeared in a column less than $r$. Thus all the restrictions of the TSA formulation are met.

On the other hand, given a 0–1 solution $a$ to the TSA formulation, we can get an $\hat{S}$ corresponding to a $k \times l$ array by filling the cell $(i, r)$ with “−” if $a_{ijr} = 0$ for all $j$, $1 \le j \le l_i$, or otherwise with the unique $s_{i,j}$ corresponding to $a_{ijr} = 1$. This is an alignment as it satisfies the conditions stipulated in Definition 1. Hence the theorem.
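The two directions of the proof can be sketched as a pair of conversion routines. This is only an illustration under assumed conventions: rows are Python strings, the gap symbol is "-", and the solution is stored as the set of 1-based triples $(i, j, r)$ with $a_{ijr} = 1$; the function names are hypothetical.

```python
GAP = "-"  # assumed gap symbol


def alignment_to_solution(rows):
    """Alignment array -> set of triples (i, j, r) with a_ijr = 1."""
    ones = set()
    for i, row in enumerate(rows, start=1):
        j = 0
        for r, ch in enumerate(row, start=1):
            if ch != GAP:
                j += 1  # ch is the j-th character s_{i,j} of sequence i
                ones.add((i, j, r))
    return ones


def solution_to_alignment(seqs, l, ones):
    """0-1 solution -> k x l array: cell (i, r) gets s_{i,j} if a_ijr = 1."""
    grid = [[GAP] * l for _ in seqs]
    for (i, j, r) in ones:
        grid[i - 1][r - 1] = seqs[i - 1][j - 1]
    return ["".join(cells) for cells in grid]


seqs = ["GATTACA", "GATCA", "ATTCA"]
rows = ["GATTACA", "GAT--CA", "-ATT-CA"]
ones = alignment_to_solution(rows)
print(solution_to_alignment(seqs, 7, ones) == rows)
```

Converting an alignment to a solution and back returns the same array, which is the 1–1 correspondence the theorem states.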

Remark 1 Recently, Greenberg [7] has suggested some integer quadratic programming models in computational biology, one of which is for the MSA problem. This model uses a scoring function that measures the propensity for a character in a sequence to align with a character in another sequence. In addition, a gap penalty function and two sets of variables to indicate the opening and extending of gaps in sequences are used in this model. The model also has additional constraints involving these variables. Thus the TSA formulation is different from the Greenberg model. The TSA formulation also has fewer constraints and variables. As remarked in [7], that article proposes some IQP models and does not discuss computational algorithms to solve them.

Theorem 3 connects traces with the 0–1 solutions of the TSA formulation.

Theorem 3 Let $G = (V, E)$ be the graph where $V$ is the set of vertices corresponding to the base symbols $s_{i,j}$ in the given sequences $S_i$, $i = 1, \ldots, k$, and an edge $e = (s_{i,j}, s_{p,q})$ is in $E$ if and only if $w(e) > 0$. $T \subset E$ is a trace corresponding to an alignment if and only if there is a corresponding 0–1 solution to the TSA formulation with the same weight as the trace, $w(T)$.

Proof Given an alignment with $l$ columns and the corresponding trace $T$ with weight $w(T)$, assign a 0–1 solution to the TSA formulation as in the proof of Theorem 2. Partition the columns of the alignment depending on whether the column contributes to the corresponding trace $T$ or not. Let $C$ denote the contributing columns. Further partition the edges in $T$ into sets $T_c$ so that the end points of the edges in $T_c$ appear in the same column $c \in C$ of the alignment, that is, $\bigcup_{c \in C} T_c = T$, and $w(T) = \sum_{c \in C} w(T_c)$. Now notice that, for $r = c$, $T_c \subset \{(s_{i,j}, s_{p,q}) \mid a_{ijr} = a_{pqr} = 1\}$. Therefore $w(T_c) \le w(\{(s_{i,j}, s_{p,q}) \mid a_{ijr} = a_{pqr} = 1\})$. Setting $w(e) = 0$ for $e \notin E$, we achieve equality in the above inequality. Thus we have an integer solution with the same objective value corresponding to the trace of the alignment.

Conversely, given a feasible solution to the TSA formulation, we have a corresponding alignment as in the proof of Theorem 2. The trace corresponding to this alignment is obtained by

$$T = \bigcup_{r=1}^{l} \left( E \cap \{(s_{i,j}, s_{p,q}) \mid a_{ijr} = a_{pqr} = 1\} \right).$$

And $T$ has weight $w(T)$ as desired. Hence the theorem.
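The construction of $T$ from a feasible solution can be sketched directly. The example below assumes edges are keyed by pairs of 1-based character indices `((i, j), (p, q))` in a dictionary of positive weights, and that the solution is the set of triples $(i, j, r)$ with $a_{ijr} = 1$; both conventions and the name `trace_and_weight` are illustrative assumptions.

```python
def trace_and_weight(weights, ones):
    """Return the trace T realized by a feasible 0-1 solution and w(T).

    weights: dict mapping an edge ((i, j), (p, q)) to its weight w(e) > 0
    ones:    set of triples (i, j, r) with a_ijr = 1
    """
    # By restriction (5) each character s_{i,j} has exactly one column r.
    column = {(i, j): r for (i, j, r) in ones}
    # An edge is realized when both end points share a column,
    # i.e. a_ijr = a_pqr = 1 for some r.
    trace = {e for e in weights if column[e[0]] == column[e[1]]}
    return trace, sum(weights[e] for e in trace)


# Sequences "AB" and "B" aligned as AB / -B
ones = {(1, 1, 1), (1, 2, 2), (2, 1, 2)}
weights = {((1, 2), (2, 1)): 3.0, ((1, 1), (2, 1)): 1.0}
trace, value = trace_and_weight(weights, ones)
print(trace, value)
```

The returned weight equals the quadratic objective of the TSA formulation at this solution, since each realized edge contributes $w(e)\,a_{ijr}\,a_{pqr} = w(e)$ for exactly one column $r$.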



3.2 Linear constrained 0–1 quadratic programming formulation of the MSA problem

Based on the TSA formulation, we now introduce an equivalent linear constrained 0–1 quadratic programming formulation of the MSA problem. In order to do this we introduce the following definition of a problem derived from the TSA formulation.

Definition 6 Problem [l]
Let $L = \sum_{i=1}^{k} l_i$. For a fixed integer $l \in \mathcal{L} = [\max_{i=1,\ldots,k}(l_i), L]$, denote the TSA formulation without the constraints (2, 3, 7) as Problem [l]. Let $Z_l$ denote the optimal objective function value of Problem [l] and $\hat{S}(l)$ a corresponding optimal alignment.

(Problem [L])

$$\text{maximize} \quad \sum_{(s_{i,j}, s_{p,q}) \in E} \ \sum_{r=1}^{L} w((s_{i,j}, s_{p,q}))\, a_{ijr}\, a_{pqr}$$

subject to

$$\sum_{j=1}^{l_i} a_{ijr} \le 1 \quad \text{for all } i, r. \tag{9}$$

$$\sum_{r=1}^{L} a_{ijr} = 1 \quad \text{for all } i, j. \tag{10}$$

$$a_{ijr} - \sum_{s=j-1}^{r-1} a_{i(j-1)s} \le 0 \quad \text{for all } i, r, \text{ and } 2 \le j. \tag{11}$$

$$a_{ijr} \in \{0, 1\} \quad \text{for all } i, j, r. \tag{12}$$

Now we state and prove Theorem 4, which shows the equivalence of the TSA formulation and Problem [L].

Theorem 4 It is sufficient to solve Problem [L] to find an optimal solution to the TSA formulation.

Proof Consider Problem [l], for $l$ in $\mathcal{L}$. Then $Z_l$ is a nondecreasing function of $l$, as any optimal solution to Problem [l − 1] is also available for Problem [l] by simply appending to that solution a last column with “−” in all rows. Let $l^*$ be the smallest $l$ for which

$$\max_{l \in \mathcal{L}} Z_l = Z_{l^*}.$$

Claim: Any optimal solution to Problem [$l^*$] satisfies the constraints (7).

Proof of the claim: If not, we have a column with “−” in all rows in that optimal solution. That column can be dropped and the resulting solution is optimal for Problem [$l^* - 1$]. This contradicts the minimality of $l^*$.

Therefore it is sufficient to solve Problem [L] and then delete any column with all “−” to obtain the solution to the TSA formulation. Thus we have an equivalent 0–1 quadratic programming formulation for the MSA problem given by Problem [L]: from an optimal solution $\hat{S}(L)$ to Problem [L], every column with “−” in all rows can be dropped to obtain an optimal solution to the TSA formulation satisfying constraints (2, 3, 7). Hence the theorem.
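The post-processing step used in the proof, dropping every column that has “−” in all rows, can be sketched as follows. The sketch assumes the alignment is a list of equal-length Python strings and the gap symbol is "-"; the name `drop_gap_columns` is illustrative.

```python
GAP = "-"  # assumed gap symbol


def drop_gap_columns(rows):
    """Remove every column that contains a gap in all rows."""
    width = len(rows[0])
    keep = [r for r in range(width) if any(row[r] != GAP for row in rows)]
    return ["".join(row[r] for r in keep) for row in rows]


# A solution of Problem [L] for "AB" and "BC" with L = 4 may contain
# empty columns; removing them leaves the realized edges unchanged.
padded = ["A-B-", "--BC"]
print(drop_gap_columns(padded))
```

Because an all-gap column realizes no edge, the objective value is the same before and after the columns are removed.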

Remark 2 On the new formulation

(1) The objective function could be modified to reflect any other alignment measure that depends on pairwise scoring, such as: the same basic symbol in $s_{i,j}$ and $s_{p,q}$ (a match) has score 2, different basic symbols (a mismatch) have score −1, and any other case has score −2.

(2) There are $L \times k$ variables in Problem [L] and the number of constraints is of the order of $L^2 \times k$. However, we have a linear constrained 0–1 quadratic program, which is an NP-hard problem.

As noticed by Carr and Lancia [6], equivalent formulations which replace a constraint class with exponentially many constraints by one with polynomially many constraints are better from both the theoretical and the computational point of view. So having polynomially many constraints is an advantage. Next we give a 0–1 linear programming formulation of the new linear constrained 0–1 quadratic problem, so that we can compare it with the MWT formulation discussed earlier. In fact, it is a compact formulation of the MWT problem.
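The pairwise scoring mentioned in item (1) of the remark can be illustrated with a small sum-of-pairs evaluation. This sketch assumes that "otherwise" covers every pair in which at least one entry is a gap, and it sums the pair scores over all pairs of rows and all columns; that reading, the gap symbol "-" and the function names are assumptions for the example.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def pair_score(x, y):
    """Score one aligned pair: match 2, mismatch -1, otherwise -2."""
    if x != GAP and y != GAP:
        return 2 if x == y else -1
    return -2  # assumed reading: any pair involving a gap


def alignment_score(rows):
    """Sum the pair scores over all pairs of rows and all columns."""
    return sum(
        pair_score(a[r], b[r])
        for a, b in combinations(rows, 2)
        for r in range(len(a))
    )


print(alignment_score(["AB", "-B"]))
```

For the two-row example the first column scores −2 and the second column scores 2.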


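3.3 Compact 0–1 linear programming formulation of the MWT problem

We introduce additional variables $x(e)$ for each $e \in E$, and $y_s(e)$ for each $s$, $1 \le s \le L$, and $e \in E$.

(Equivalent Linear 0–1 Formulation)

$$\text{maximize} \quad \sum_{e \in E} w(e)\, x(e)$$

subject to

$$\sum_{s=1}^{L} y_s(e) = x(e) \quad \text{for all } e \in E$$

$$y_s(e) \le a_{ijs}, \quad y_s(e) \le a_{pqs}, \quad \text{for all } s, \text{ and } e = (s_{i,j}, s_{p,q}) \in E$$

$$x(e),\ y_s(e) \ \text{real for all } s, \text{ and } e \in E$$

$$\sum_{j=1}^{l_i} a_{ijr} \le 1 \quad \text{for all } i, r.$$

$$\sum_{r=1}^{L} a_{ijr} = 1 \quad \text{for all } i, j.$$

$$a_{ijr} - \sum_{s=j-1}^{r-1} a_{i(j-1)s} \le 0 \quad \text{for all } i, r, \text{ and } 2 \le j.$$

$$x(e),\ y_s(e),\ a_{ijr} \in \{0, 1\} \quad \text{for all } e \in E, s, i, j, r.$$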
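To see how the linking constraints reproduce the quadratic terms, the sketch below fixes a 0–1 assignment of the $a_{ijr}$ and sets each $y_s(e)$ to the largest value allowed by $y_s(e) \le a_{ijs}$ and $y_s(e) \le a_{pqs}$, which is what maximization with positive weights would choose. The data layout (edges as pairs of 1-based character indices, the solution as a set of triples) and the name `linearize` are assumptions for illustration.

```python
def linearize(edges, ones, L):
    """Compute y_s(e) and x(e) for a fixed 0-1 assignment of the a_ijr.

    edges: iterable of edges ((i, j), (p, q))
    ones:  set of triples (i, j, r) with a_ijr = 1
    L:     number of columns in Problem [L]
    """
    def a(i, j, s):
        return 1 if (i, j, s) in ones else 0

    x, y = {}, {}
    for e in edges:
        (i, j), (p, q) = e
        for s in range(1, L + 1):
            # Upper bound from y_s(e) <= a_ijs and y_s(e) <= a_pqs;
            # for 0-1 values this equals the product a_ijs * a_pqs.
            y[(e, s)] = min(a(i, j, s), a(p, q, s))
        # Linking constraint: x(e) is the sum of y_s(e) over all columns s
        x[e] = sum(y[(e, s)] for s in range(1, L + 1))
    return x, y


edges = [((1, 2), (2, 1)), ((1, 1), (2, 1))]
ones = {(1, 1, 1), (1, 2, 2), (2, 1, 2)}  # AB / -B padded to L = 3 columns
x, y = linearize(edges, ones, 3)
print(x)
```

Since each character occupies exactly one column, at most one $y_s(e)$ is 1 for any edge, so every $x(e)$ computed this way is 0 or 1.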

Notice that for any $e = (s_{i,j}, s_{p,q})$, $a_{ijs} = a_{pqs} = 1$ holds for at most one $s$. Therefore, for any $e$, at most one $y_s(e)$ can take the value 1, so $x(e)$ can be at most 1. Since we are maximizing the objective function $\sum_{e \in E} w(e)x(e)$ and $w(e) > 0$, whenever $a_{ijs} = a_{pqs} = 1$ for some $s$ we have $x(e) = 1$. Thus $\{e \in E \mid x(e) = 1\}$ corresponds to the trace of the alignment given by the 0–1 solution to the equivalent linear formulation. Thus we have a compact linear integer formulation of the MWT problem. As in the MWT problem, any 0–1 solution $x(e)$, $e \in E$, corresponds to a trace. However, we only have polynomially many constraints and variables in the equivalent linear formulation, unlike the formulation (1), which has exponentially many constraints. This gives an alternative proof of the fact that separation over the mixed cycle inequalities can be done in polynomial time. Our formulation is straightforward, and has fewer variables and constraints than the pseudo-boolean approach [23].

4 Suggested approach for solving the linear constrained 0–1 quadratic programming formulation

Linear constrained 0–1 quadratic programming problems form an important class of combinatorial optimization problems with applications in many areas of science, technology,


and business. Different approaches have been developed to solve such problems (see e.g. [1,4,15,19]). Pardalos and Rodgers [19] discuss computational aspects of a branch and bound algorithm. Adams and Sherali [1] discuss a tight linearization based algorithm for solving such problems. Le Thi and Pham Dinh [15] consider a continuous approach for large scale problems of this kind. A recent survey of the approaches for solving linearly constrained 0–1 quadratic programming problems appears in [5], which also contains useful techniques for the linearization of 0–1 quadratic programming problems.

Since Problem [L] is an equivalent 0–1 quadratic programming problem for the TSA formulation, we proceed to discuss the available approaches to solve such linearly constrained 0–1 quadratic programming (0–1 QP) problems, which can be stated as:

0–1 QP Problem

$$\text{minimize} \quad \tfrac{1}{2} x^T C x + c^T x \tag{13}$$

$$\text{subject to} \quad Ax \le b, \tag{14}$$

$$x \in \{0, 1\}^n \tag{15}$$

where $C$ is an $(n \times n)$ symmetric matrix, $c, x \in \mathbb{R}^n$, $A$ is an $(m \times n)$ matrix and $b \in \mathbb{R}^m$. 0–1 QP problems are in the NP-hard class of combinatorial optimization problems [30]. Le Thi and Pham Dinh [15] study the following equivalent strictly concave quadratic minimization problem:

SCQP Problem

$$\text{minimize} \quad \tfrac{1}{2} x^T (C - \lambda I) x + \left(c + \tfrac{\lambda}{2} e\right)^T x \tag{16}$$

$$\text{subject to} \quad Ax \le b, \tag{17}$$

$$x \in [0, 1]^n \tag{18}$$

where $e$ is the vector of ones and $\lambda > \lambda_o := \bar{\lambda} + t_o > 0$, with these constants suitably chosen (see [15] for a proof). Notice that we do not have the 0–1 integer constraints any more in this formulation.

In the literature, several approaches for solving nonconvex quadratic programs are suggested and computational comparisons are made. Among them, branch-and-bound algorithms whose branching operation takes place only in the “negative eigenvalues space” have been shown to be efficient (see e.g. Benson [3], Kalantari-Rosen [8], Philips-Rosen [22], Phong-An-Tao [27], Vavasis [29]).

DC (Difference of Convex functions) programming and DCA (DC Algorithms) were introduced by Tao Pham Dinh in 1985 and have been extensively developed by Hoai An Le Thi and Tao Pham Dinh since 1994 (see [13,14,20] and references therein). DCA is an efficient approach for nonconvex continuous optimization. Le Thi and Pham Dinh [15] give a DC programming approach to solve linearly constrained quadratic 0–1 programming problems in the equivalent form (16). A hybrid of the DCA, which has finite convergence, and the branch-and-bound scheme is


proposed. Their algorithm converges quite often to an integer approximate solution. Their computational results on several test problems with up to 1,800 variables substantiate the efficiency of the DCA-B&B method, in particular for linear zero–one programming. Very recently, an efficient combination of DCA and B&B using DC/SDP relaxation for globally solving binary quadratic programs has been developed in Pham et al. [21]. Numerical results on several series of test problems provided in the OR-Library [2] show the robustness and efficiency of the proposed algorithm with respect to standard methods. In particular, DCA provides ε-optimal solutions in almost all cases after only one restart, and the combined DCA-B&B-SDP always provides (ε-)optimal solutions.

Since DCA has been successful in solving the MWT problem, as stated in [16], it is tempting to see how DCA will perform with the new quadratic 0–1 formulation of the MSA problem. The success of DCA-B&B for linearly constrained 0–1 QP suggests it as a possible solution approach to the TSA formulation. In [5] we find some new ideas for the linearization of 0–1 QP problems. It would be interesting to apply them to obtain linearizations equivalent to the TSA formulation, and then to apply the DCA-B&B approach to these new linearizations.
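The reason the objectives (13) and (16) agree on 0–1 points can be checked numerically: the difference between them is $\tfrac{\lambda}{2}(e^T x - x^T x)$, which vanishes when every $x_i \in \{0, 1\}$ and is positive at fractional points of $[0, 1]^n$. The sketch below uses a small made-up matrix `C`, vector `c` and value `lam`; these numbers are assumptions for illustration and are not taken from [15], and the sketch does not verify the threshold $\lambda_o$.

```python
from itertools import product


def quad(C, c, x):
    """Objective (13): 1/2 x^T C x + c^T x."""
    n = len(x)
    bilinear = sum(C[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return 0.5 * bilinear + sum(c[i] * x[i] for i in range(n))


def penalized(C, c, lam, x):
    """Objective (16): 1/2 x^T (C - lam I) x + (c + lam/2 e)^T x."""
    n = len(x)
    C_shift = [[C[i][j] - (lam if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    c_shift = [c[i] + lam / 2.0 for i in range(n)]
    return quad(C_shift, c_shift, x)


# Hypothetical data, chosen only to illustrate the identity
C = [[2.0, -1.0], [-1.0, 4.0]]
c = [1.0, -3.0]
lam = 10.0

for x in product([0.0, 1.0], repeat=2):
    print(x, quad(C, c, x), penalized(C, c, lam, x))
```

At the fractional point $(0.5, 0.5)$ the penalized objective is larger than the original one by $\tfrac{\lambda}{2}(e^T x - x^T x)$, which is the concave penalty that pushes minimizers of (16) towards 0–1 points.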

5 Computational experiment

The formulations considered in this paper address the same problem but differ with respect to the nature of their objective function (linear or quadratic), the number of variables, the number of constraints (exponential or polynomial) and the type of variables (0–1 or integer). In order to compare these formulations we have planned experiments using the BAliBASE benchmarks, version 1.0 [28] (http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/). As BAliBASE is a database of more than 140 manually-refined multiple sequence alignments specifically designed for the evaluation and the unbiased comparison of multiple sequence alignment programs, our experiments would be comparable with published results.

Our aim is to compare the performance of the DCA approach when applied to these formulations, and also to compare it with the results of other methods used in the literature to solve some of the formulations, such as a commercial package like CPLEX, a branch-and-cut approach [26] or a pseudo-boolean approach [23]. The general idea is to consider each formulation with a DCA based approach and determine which of them is most efficient, gives the best quality, and so on. We also need to compare the other approaches on each formulation to determine which approach works best for a given formulation. If there are variations of an approach due to some choice of parameter, we need to consider those as well, without increasing the number of trials (say we have two variations). Thus we will consider a fractional factorial design with factors Formulation, Approach and Variation. Each trial will consist of applying the chosen treatment combination to all test instances. Summary measures of quality and efficiency, and counts of the number of instances solved with a specified quality, will then be used for comparison among these trials. We will also report which treatment combination worked best on each class of BAliBASE benchmark problems.


6 Conclusion and future research

In this paper, we consider the multiple sequence alignment problem and directly formulate it as an integer quadratic programming problem, called the TSA formulation. Based on this formulation we propose an equivalent linear constrained 0–1 quadratic programming problem, called Problem [L], where $L$ is the total length of the sequences. We also give a compact 0–1 linear programming formulation of the MWT problem, obtained by linearizing Problem [L].

Based on the success of the DCA in solving the MWT problem [16], we propose to apply the DCA approach to the TSA formulation. Hybrid approaches using DCA and B&B were shown to be efficient in finding global solutions to generic 0–1 quadratic programming problems by Le Thi and Pham Dinh [15] and Pham et al. [21]. Thus it makes sense to attempt these hybrid approaches to solve the equivalent problem of the TSA formulation, namely Problem [L]. Different 0–1 linearizations of the quadratic problem, Problem [L], also provide competing alternatives. In a sequel to this paper, we will present the results of the computational experiments planned to compare different existing approaches, including that from [16], for solving the MSA problem with the new formulations and the DCA-based global solution approaches.

Acknowledgements The authors thank the referees for useful suggestions and additional references. The Department of ISOM, School of Business, University of Auckland, Auckland, New Zealand and the LITA, University of Paul Verlaine of Metz, Metz, France are thanked for their support for this research work.

References

1. Adams, W.P., Sherali, H.D.: A tight linearization and an algorithm for 0–1 quadratic programming problems. Manage. Sci. 32(10), 1274–1290 (1986)
2. Beasley, J.E.: Obtaining test problems via internet. J. Global Optim. 8, 429–433, http://people.brunel.ac.uk/~mastjjb/jeb/info.html (1996)
3. Benson, H.P.: Separable concave minimization via partial outer approximation and branch and bound. Oper. Res. Lett. 9, 389–394 (1990)
4. Billionnet, A., Elloumi, S.: Using a mixed integer quadratic programming solver for unconstrained quadratic 0–1 problem. Math. Program. 109(1, Ser. A), 55–68 (2007)
5. Caprara, A.: Constrained 0–1 quadratic programming: basic approaches and extensions. Eur. J. Oper. Res. 187, 1494–1503 (2008)
6. Carr, R.D., Lancia, G.: Compact vs exponential-size LP relaxations. Oper. Res. Lett. 30, 57–65 (2002)
7. Greenberg, H.J.: Integer quadratic programming models in computational biology. Operations Research Proceedings, vol. 2006, pp. 83–95. Springer, Berlin (2007)
8. Kalantari, B., Rosen, J.B.: Algorithm for global minimization of linearly constrained concave quadratic functions. Math. Oper. Res. 12, 544–561 (1987)
9. Kececioglu, J.D.: Exact and approximation algorithms for DNA sequence reconstruction. PhD thesis, University of Arizona (1991)
10. Kececioglu, J.: The maximum weight trace problem in multiple sequence alignment. In: Proceedings of the 4th symposium on combinatorial pattern matching, pp. 106–119 (1993)
11. Kececioglu, J.D., Lenhof, H.-P., Mehlhorn, K., Mutzel, P., Reinert, K., Vingron, M.: A polyhedral approach to sequence alignment problems. Discret. Appl. Math. 104, 143–186 (2000)
12. Korostensky, C., Gonnet, G.H.: Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics 16(7), 619–627 (2000)
13. Le Thi, H.A., Pham Dinh, T.: The DC (Difference of Convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals Oper. Res. 133, 23–46 (2005)


14. Le Thi, H.A., Pham Dinh, T.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Global Optim. 11(3), 253–285 (1997)
15. Le Thi, H.A., Pham Dinh, T.: A continuous approach for large-scale constrained quadratic zero-one programming (In honor of Professor ELSTER, Founder of the Journal Optimization). Optimization 50(1–2), 93–120 (2001)
16. Le Thi, H.A., Belghiti, T., T.M., Pham Dinh T.: Multiple alignment of sequences by a continuous optimisation approach on DC Programming and DCA. In: Proceedings of the international conference on bioinformatics & computational biology, BIOCOMP’09 (2009)
17. Lenhof, H.-P., Reinert, K., Vingron, M.: A polyhedral approach to RNA sequence structure alignment. J. Comput. Biol. 5(3), 517–530 (1998)
18. Notredame, C.: Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics 3(1), 131–144 (2002)
19. Pardalos, P.M., Rodgers, G.P.: Computational aspects of a branch and bound algorithm for quadratic zero–one programming. Computing 45, 131–144 (1990)
20. Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to d.c. programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997) (dedicated to Professor Hoang Tuy on the occasion of his 70th birthday)
21. Pham Dinh, T., Nguyen Canh, N., Le Thi, H.A.: An efficient combined DCA and B&B using DC/SDP relaxation for globally solving binary quadratic programs. J. Global Optim. (2010) doi:10.1007/s10898-009-9507-y
22. Phillips, A.T., Rosen, J.B.: A parallel algorithm for partially separable non-convex global minimization: linear constraints. Annals Oper. Res. 25, 101–118 (1990)
23. Prestwich, S., Higgins, D., O’Sullivan, O.: Pseudo-Boolean multiple sequence alignment. Technical report, TR-03-2003, http://www.4c.ucc.ie/web/techreps.jsp, Cork Constraint Computation Centre, University College, Cork, Ireland (2003)
24. Rajasekaran, S., Nick, H., Pardalos, P.M., Sahni, S., Shaw, G.: Efficient algorithms for local alignment search. J. Comb. Optim. 5(1), 117–124 (2001)
25. Rajasekaran, S., Hu, Y., Luo, J., Nick, H., Pardalos, P.M., Sahni, S., Shaw, G.: Efficient algorithms for similarity alignment search. J. Comb. Optim. 5(1), 117–124 (2001)
26. Reinert, K., Lenhof, H., Mutzel, P., Mehlhorn, K., Kececioglu, J.D.: A branch-and-cut algorithm for multiple sequence alignment. RECOMB, pp. 241–250 (1997)
27. Thai Quynh, P., Le Thi, H.A., Pham Dinh, T.: On the global solution of linearly constrained indefinite quadratic minimization problems by decomposition branch and bound method. RAIRO Rech. Opér. 30(1), 31–49 (1996)
28. Thompson, J., Plewniak, F., Poch, O.: BAliBASE: a benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics 15, 87–88 (1999)
29. Vavasis, S.A.: Approximation algorithms for indefinite quadratic programming. Math. Program. 57, 279–311 (1992)
30. Vavasis, S.A.: Nonlinear optimization, complexity issues. Oxford University Press, New York (1991)
