A fast algorithm for the partial digest problem

Japan J. Indust. Appl. Math. (2011) 28:315–325 DOI 10.1007/s13160-011-0041-1 ORIGINAL PAPER

Area 3

A fast algorithm for the partial digest problem

Reza Nadimi · Hassan Salehi Fathabadi · Mohammad Ganjtabesh

Received: 13 November 2010 / Revised: 17 February 2011 / Published online: 3 August 2011 © The JJIAM Publishing Committee and Springer 2011

Abstract A fundamental problem in computational biology is restriction site mapping. When a particular restriction enzyme is added to a DNA molecule, the strand is cut at particular restriction sites. The goal of restriction site mapping is to determine the location of every site for a given enzyme. Using gel electrophoresis, one can find the distance between each pair of restriction sites. In the partial digest problem (PDP), we are given these distances, arising from digestion experiments that use only one enzyme, and we are asked to compute the locations of all restriction sites. Several approaches, including a pseudo-polynomial time algorithm and a backtracking algorithm, have been proposed to tackle this problem. In this paper we propose a new model for this problem. Based on this model, we present a new branch and bound algorithm for the partial digest problem. In comparison with the backtracking algorithm presented by Skiena, the new algorithm has a very small search tree and is therefore very efficient. The efficiency and advantages of this algorithm are also demonstrated by executing it on different types of instances. There exist some instances of PDP for which Skiena's algorithm requires exponential running time. The efficiency of our algorithm is also demonstrated on this kind of problem instance.

Keywords Restriction site mapping · Partial digest problem

Mathematics Subject Classification (2000) 92D20 · 68Q99

R. Nadimi (B) Department of Applied Mathematics, Faculty of Mathematics, University of Mazandaran, Babolsar, Mazandaran, Iran e-mail: [email protected]
H. S. Fathabadi Department of Applied Mathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
M. Ganjtabesh Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
M. Ganjtabesh Laboratoire d'Informatiques (LIX), Ecole Polytechnique, Palaiseau Cedex 91128, France

1 Introduction

A human chromosome, which is a DNA molecule, is too long to be studied in its entirety and must be broken into fragments, or clones. Information is gathered from the individual clones, and the DNA is then reconstructed by mathematically determining the positions of the clones. In digest experiments, enzymes are used to cleave DNA molecules at specific sequence patterns, the restriction sites. The process of reconstructing the DNA sequence from the fragments produced by digest experiments is called the digest problem. The resulting fragments are used in many different ways to study the structure of DNA molecules.

One of the most interesting problems in computational biology is restriction site mapping. The goal of this problem is to determine the location of every site for a given enzyme. Using gel electrophoresis, one can find the distance between each pair of restriction sites. In the partial digest problem (PDP), we are given these distances, arising from digestion experiments that use only one enzyme, and we are asked to determine the locations of all restriction sites.

Let $X = \{x_0, x_1, \ldots, x_n\}$ be the set of all restriction site locations on a DNA strand, where $x_0 = 0$ and $x_n$ is the length of the strand. We denote the multiset of all pairwise distances between elements of $X$ by $\Delta X$; it has $M = \binom{n+1}{2}$ elements, i.e.

$$\Delta X = \{x_j - x_i \mid x_j > x_i,\ i, j = 0, 1, \ldots, n\}.$$

In the partial digest problem, given a multiset $D = \{d_1, d_2, \ldots, d_M\}$ of distances, the goal is to find a set $X = \{x_0, x_1, \ldots, x_n\}$ of points on a line such that $D = \Delta X$.

This problem was first defined in the 1930s in the area of X-ray crystallography [5]. In 1988, Lemke and Werman [4] solved it in pseudo-polynomial time (i.e. the running time of their algorithm depends on $d_M$). Skiena et al. [7] created a backtracking algorithm to solve this problem whose running time, as they argued, depends only on $n$.
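The definition of $\Delta X$ can be made concrete with a minimal Python sketch. The function name `pairwise_distances` and the point set used below are illustrative assumptions and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def pairwise_distances(points):
    """Return the multiset of pairwise distances of `points` as a Counter."""
    ordered = sorted(points)
    # combinations() on a sorted list yields pairs (a, b) with a <= b,
    # so b - a is the nonnegative distance x_j - x_i.
    return Counter(b - a for a, b in combinations(ordered, 2))


X = [0, 2, 4, 7, 10]            # illustrative site locations, n = 4
D = pairwise_distances(X)
print(sorted(D.elements()))     # the multiset, listed in increasing order
print(sum(D.values()))          # number of distances, M = (n + 1) n / 2
```

For this five-point set the multiset has $\binom{5}{2} = 10$ elements; PDP asks for the reverse direction, recovering such a set from the distances alone.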
In 1994, Zhang [9] showed that the running time of Skiena's backtracking algorithm is exponential in the worst case. Dakic [3], in his Ph.D. thesis, presented a 0–1 quadratic programming model for PDP and solved it with a heuristic successive semidefinite programming algorithm. In 2005, Cieliebak et al. [2] proved that the partial digest problem is NP-hard for erroneous input data. The exact computational complexity of PDP is a long-standing open problem: neither a polynomial time algorithm nor a proof of NP-completeness is known so far [1,6].

In this paper we propose a new model for PDP. Based on this model, we present a new backtracking algorithm that solves the problem efficiently. In comparison with Skiena's backtracking algorithm, our algorithm has a much smaller search tree and hence is more efficient in practice. In Sect. 5 we confirm its efficiency by solving the computationally hard PDP instances presented by Zhang [9].

The rest of this paper is organized as follows. In Sect. 2, the new model for PDP is proposed. The new backtracking algorithm is presented in Sect. 3. In Sect. 4, we analyze the algorithm and prove some theorems. Section 5 contains the computational results of the algorithm and its comparison with other approaches. Finally, conclusions are given in Sect. 6.

2 New model for partial digest problem

In this section we present a new model for describing PDP. As stated in the previous section, in PDP we are given a multiset $D = \{d_1, d_2, \ldots, d_M\}$ of distances, and the goal is to find a set $X = \{x_0, x_1, \ldots, x_n\}$ of points on a line such that $D = \Delta X$. Since $x_0 = 0$, it is obvious that $X - \{0\} \subseteq D$. Therefore PDP is equivalent to selecting $n$ elements of $D$ such that the pairwise distance multiset of these points together with zero is equal to $D$. To select these $n$ elements of $D$, we model PDP as in Fig. 1.

Fig. 1 Triangle model for partial digest problem

Suppose that by eliminating repeated members of the multiset $D$ we obtain $B = \{b_1, b_2, \ldots, b_N\}$, sorted in increasing order ($B$ is the set corresponding to $D$, containing every element of $D$ without repetition). Fig. 1 shows an isosceles triangle whose base lies on the x axis and which has one vertex at the origin. The height and the base of this triangle both have length $b_N$. The triangle contains horizontal and diagonal lines. For each $b_i \in B$ there is a horizontal line at height $b_i$, denoted by hline(i). We also denote the number of occurrences of $b_i$ in $D$ by nr(i). Corresponding to each $b_i$ there are two diagonal lines with gradients 2 and −2, which meet on the x axis at the point $b_i$. This pair of diagonal lines is referred to as a single broken line, denoted by bline(i). Note that for 0 and for $b_N$ there is only one diagonal line each, namely one of the legs of the triangle; these are called bline(0) and bline(N), respectively.

It is clear that each pair bline(i) and bline(j) has exactly one cross point inside the triangle, at height $|b_i - b_j|$. If there is a $k$ such that $|b_i - b_j| = b_k$, then the cross point of bline(i) and bline(j) lies on hline(k). If $(n + 1)$ blines are selected in such a way that all of their pairwise cross points lie on horizontal lines and exactly nr(i) cross points lie on each hline(i), then these $(n + 1)$ selected blines form a solution of the given instance of PDP. Obviously, the diagonal lines corresponding to 0 and $b_N$ must be part of any solution.

There are two possible cases for each bline:

1. Primal case: in this case the bline is dashed.
2. Selected case: in this case the bline is solid.

The cross point of bline(i) and bline(j) is shown as a star and is denoted by star(i, j). The numbers of stars on hline(i) and on bline(i) are denoted by nh(i) and nb(i), respectively. Each star has one of three colors: white, gray or black. The numbers of white, gray and black stars on hline(i) are denoted by nwh(i), ngh(i) and nbh(i), respectively. Likewise, the numbers of stars with these three colors on bline(i) are denoted by nwb(i), ngb(i) and nbb(i), respectively. The color of star(i, j) depends on the cases of bline(i) and bline(j) as follows:

– If both bline(i) and bline(j) are dashed, then star(i, j) is white.
– If one of bline(i) and bline(j) is dashed and the other is solid, then star(i, j) is gray.
– If both bline(i) and bline(j) are solid, then star(i, j) is black.

During the algorithm some blines may be removed from the model. When a bline is removed, all stars lying on it are removed as well. It is possible that the cross point of two blines does not lie on any hline ($|b_i - b_j| \notin B$). In this case we say that these blines are incompatible, and they cannot be selected simultaneously. If bline(i) and bline(j) are incompatible, then bline(i) or bline(j) (or both) will be removed during the algorithm.

The proposed algorithm starts with $(N - 1)$ dashed blines, bline(i) for $i = 1, 2, \ldots, N - 1$, and two solid blines, bline(0) and bline(N). The algorithm tries to select $(n - 1)$ dashed blines and change their case to solid, such that the solid blines form a solution of the given instance of PDP. In our model, if the given instance of PDP has a solution, then there are exactly $n + 1$ solid blines whose pairwise intersections lie on hlines, and for each hline(i) we have nbh(i) = nr(i). At this stage, there are exactly $M$ black stars and no gray or white stars.
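The solution condition of the model (all pairwise cross points of the solid blines lie on hlines, and every hline(i) carries exactly nr(i) black stars) can be checked directly. The following Python sketch is only an illustration under stated assumptions: the names `build_model` and `is_solution` are hypothetical, and blines are identified by their position on the x axis rather than by an index.

```python
from collections import Counter
from itertools import combinations


def build_model(D):
    """Return (B, nr): the sorted distinct distances and their multiplicities."""
    nr = Counter(D)          # nr[b] = number of occurrences of b in D
    B = sorted(nr)           # heights of the hlines, in increasing order
    return B, nr


def is_solution(selected, D):
    """Check whether the solid blines at positions `selected` solve D.

    `selected` must contain 0 and max(D), the two legs of the triangle.
    """
    B, nr = build_model(D)
    nbh = Counter()                      # black stars per hline height
    for p, q in combinations(sorted(selected), 2):
        height = q - p                   # the star of blines p and q
        if height not in nr:
            return False                 # incompatible pair: no hline here
        nbh[height] += 1
    return all(nbh[b] == nr[b] for b in B)


D = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(is_solution([0, 2, 4, 7, 10], D))   # a point set whose distances are D
print(is_solution([0, 2, 5, 7, 10], D))   # puts three stars at height 5
```

The check takes a candidate set as given; finding one is the task of the algorithm in Sect. 3.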


3 Predictive backtracking (PB) algorithm

In this section, we present the new branch and bound algorithm in detail. Over the course of the algorithm, $(n - 1)$ dashed blines are selected in total. In each iteration we select some of the dashed blines and change their case to solid. In addition, dashed blines that are incompatible with the solid blines selected in previous iterations are removed. If in any iteration we determine that the chosen solid blines cannot be part of any solution, we return to the previous iteration and try to select other dashed blines.

The algorithm acts as a predictive backtracking algorithm with three steps in each iteration. In the first step we select appropriate dashed blines and change their case to solid (a group of dashed blines is selected to appear in the solution of the given instance of PDP). In the second step, all blines that must be part of any solution containing the current solid blines are selected, and all dashed blines that are incompatible with the current solid blines are removed (since they cannot be selected in subsequent iterations). In the last step of each iteration, the algorithm prunes the search tree by predicting that the current solid blines cannot be part of any solution. The next three subsections explain these steps in detail.

3.1 Step 1: Selecting the blines

In the selection step, we choose some of the dashed blines in such a way that the resulting search tree is as small as possible. In this step a group of dashed blines is selected as part of the solution containing the current solid blines. To minimize the number of iterations needed to select $(n - 1)$ blines, it is ideal to select as many dashed blines as possible simultaneously. To describe how these blines are found, we define the quantity max as follows:

$$\mathit{max} = 2\,nwh(k) + ngh(k) = \max\{\,2\,nwh(i) + ngh(i) \mid nbh(i) < nr(i),\ nh(i) = nr(i) + 1\,\}.$$

In this formula, 2 nwh(i) + ngh(i) is the number of dashed blines crossing hline(i) at stars. In Sect. 4 we will show that in every iteration there is some hline(i) with nbh(i) < nr(i) and nh(i) = nr(i) + 1 (except in an iteration in which we find a solution, or for an instance of PDP that has no solution). There are exactly C = max − nwh(k) white and gray stars on hline(k), and we must choose C − 1 of them. By selecting these stars, all dashed blines crossing them are selected as well. There are C ways to choose C − 1 stars from the C white or gray stars. The algorithm examines all of these choices by backtracking. Note that on each hline(i) with nr(i) > nbh(i) and nh(i) = nr(i) + 1 there are at least two gray or white stars (since nh(i) ≥ nbh(i) + 2). If the blines selected through the chosen C − 1 stars are incompatible, the algorithm abandons this choice and examines the remaining choices. It is also possible that two stars on hline(k) lie on the same dashed bline. In this case the two stars must be chosen together, and so there are fewer choices.
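The selection rule can be sketched in Python as follows. This is an illustrative reading of the rule and not the authors' implementation: the `status` dictionary (mapping the position of each bline still in the model to `"solid"` or `"dashed"`) and the names `count_stars` and `select_hline` are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def count_stars(status, nr):
    """Count white, gray and black stars on each hline.

    `status` maps the position of every bline still in the model to
    "solid" or "dashed"; `nr` maps each hline height to its multiplicity.
    """
    nwh, ngh, nbh = Counter(), Counter(), Counter()
    for p, q in combinations(sorted(status), 2):
        h = q - p
        if h not in nr:
            continue  # incompatible pair: the cross point is on no hline
        solid = (status[p] == "solid") + (status[q] == "solid")
        # 0 solid blines -> white, 1 -> gray, 2 -> black
        (nwh, ngh, nbh)[solid][h] += 1
    return nwh, ngh, nbh


def select_hline(nwh, ngh, nbh, nr):
    """Return (k, C) for the hline maximizing 2*nwh + ngh, or None."""
    candidates = [b for b in nr
                  if nbh[b] < nr[b]
                  and nwh[b] + ngh[b] + nbh[b] == nr[b] + 1]
    if not candidates:
        return None
    k = max(candidates, key=lambda b: 2 * nwh[b] + ngh[b])
    return k, nwh[k] + ngh[k]      # C = max - nwh(k) = nwh(k) + ngh(k)


# Tiny instance D = {1, 2, 3}: blines 0 and 3 start solid, 1 and 2 dashed.
nr = Counter([1, 2, 3])
status = {0: "solid", 1: "dashed", 2: "dashed", 3: "solid"}
print(select_hline(*count_stars(status, nr), nr))
```

In this small instance only the hline at height 2 satisfies both conditions. It carries two gray stars, so C = 2 and the algorithm would try the two ways of choosing one of them, which lead to the point sets {0, 2, 3} and {0, 1, 3}.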


When the algorithm selects a dashed bline, it changes its case to solid and updates the color of every star on it: each white star becomes gray and each gray star becomes black.

3.2 Step 2: Concurrent processes

This step has two phases. In each iteration we repeat these phases until no further selection or removal occurs.

– Phase 1: At the beginning of the concurrent processes, if the current solid blines are part of a solution, then all dashed blines that must belong to that solution are selected. To find these dashed blines, the following two processes are performed:
  1. On each solid bline(i) with nb(i) = n, we choose all gray stars and select the dashed blines crossing them (since there are n black stars on each bline in any solution of PDP).
  2. On each hline(i) with nh(i) = nr(i), we choose all white and gray stars and select the dashed blines crossing them (since in any solution of PDP there are nr(i) black stars on hline(i)).
– Phase 2: After selecting some blines in Phase 1, we remove all blines that cannot be selected in future iterations. Dashed blines are removed from the triangle in the following cases:
  Case 1: Dashed blines that are incompatible with the recently selected (solid) blines.
  Case 2: If there is an index i such that nbh(i) = nr(i), then we find the gray stars on hline(i) and remove the dashed blines through these stars. In future iterations no more stars can be selected on hline(i).
  Case 3: If there is a dashed bline(i) with nb(i) < n, then we remove bline(i) from the model (since in every solution of the given PDP instance, nb(i) = n must hold for each selected bline(i)).

3.3 Step 3: Prediction

In this step we check for cases in which the solid blines cannot be part of any solution. In these cases the algorithm returns to the previous iteration and continues with the backtracking method. The cases are:

Case 1: There is an hline(i) with nh(i) < nr(i). This cannot be part of any solution because for each hline(i) in any solution we have nr(i) = nbh(i) ≤ nh(i).
Case 2: There is a solid bline(i) with nb(i) < n. This cannot be part of any solution because for each bline(i) in any solution we have nbb(i) = nb(i) = n.
Case 3: There is an hline(i) with nbh(i) > nr(i). This occurs when a recently selected solid bline has two stars on hline(i).

Note that at the beginning of the algorithm we start from Step 2 (concurrent processes), since bline(0) and bline(N) already belong to the set of solid blines. An example is presented in Fig. 2 to show the different steps of the algorithm.
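The three prediction tests can be sketched as a single predicate. As before, this is a hedged illustration rather than the authors' code: the `status` dictionary (bline position mapped to `"solid"` or `"dashed"`, with removed blines absent) and the name `should_backtrack` are assumptions, and star counts are recomputed from scratch for clarity instead of being maintained incrementally.

```python
from collections import Counter
from itertools import combinations


def should_backtrack(status, nr, n):
    """Return True if the current solid blines cannot be part of a solution.

    `nr` maps each hline height to its multiplicity in D, and `n` is the
    number of points to place besides 0.
    """
    nh, nbh, nb = Counter(), Counter(), Counter()
    for p, q in combinations(sorted(status), 2):
        h = q - p
        if h not in nr:
            continue                      # cross point lies on no hline
        nh[h] += 1                        # one more star on this hline
        nb[p] += 1                        # ... and on both blines
        nb[q] += 1
        if status[p] == status[q] == "solid":
            nbh[h] += 1                   # black star
    if any(nh[b] < nr[b] for b in nr):
        return True                       # Case 1: too few stars on an hline
    if any(s == "solid" and nb[p] < n for p, s in status.items()):
        return True                       # Case 2: a solid bline lacks stars
    if any(nbh[b] > nr[b] for b in nr):
        return True                       # Case 3: too many black stars
    return False


nr = Counter([1, 2, 3])                   # tiny instance D = {1, 2, 3}, n = 2
print(should_backtrack({0: "solid", 1: "dashed", 2: "dashed", 3: "solid"}, nr, 2))
print(should_backtrack({0: "solid", 3: "solid"}, nr, 2))
```

The first state is the initial model of the tiny instance and passes all three tests. In the second state both dashed blines have been removed, so the hlines at heights 1 and 2 carry no stars and Case 1 applies.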


Fig. 2 Illustration of the algorithm with an example (panels a–e)

Fig. 3 Impossible cases at the beginning of an iteration (panels a, b)

4 Algorithm analysis

In this section the analysis of our algorithm is presented. To prove convergence of the PB algorithm, it suffices to show that the number of solid blines increases in each iteration. We show that in each iteration in which the number of solid blines is less than $n + 1$, there is an hline(k) such that nbh(k) < nr(k) and nh(k) = nr(k) + 1. The following theorems prove this claim.

Theorem 1 If there is a solution for a given PDP instance and, at the beginning of an iteration, $b_k = \max\{b_i \mid b_i \in B,\ nbh(i) < nr(i)\}$, then nh(k) − nbh(k) = 2.

Proof We prove that there is no gray or white star on hline(k) except the stars on the legs of the triangle. Suppose that star(i, j) is an unselected star on hline(k) (its color is white or gray) between the two legs of the triangle (see Fig. 3). Suppose that bline(i) is dashed, and consider the cross points of bline(i) with the legs of the triangle. It is clear that one of these cross points lies above height $b_k$. In Fig. 3, star(i, N) lies above $b_k$. There are two possible cases for star(i, N):

– Case 1: There exists $b_r \in B$ such that star(i, N) lies on hline(r) (see Fig. 3a).
– Case 2: There is no $b_r \in B$ such that star(i, N) lies on hline(r) (see Fig. 3b).

We prove that neither of these cases can occur in the algorithm. In Case 1, there is a gray star on hline(r), but hline(r) is above hline(k) and hence nr(r) = nbh(r). Therefore all white and gray stars on hline(r) must have been deleted from the model in previous iterations. In Case 2, bline(i) is incompatible with bline(N), and therefore bline(i) must have been deleted from the model by the concurrent processes at the beginning of the algorithm.

Since there is no unselected star between the legs of the triangle, we have nh(k) − nbh(k) ≤ 2. But nbh(k) < nr(k) yields nbh(k) + 1 ≤ nr(k). On the other hand nh(k) > nr(k), so nh(k) ≥ nr(k) + 1 and hence nh(k) − nbh(k) ≥ 2. Consequently nh(k) − nbh(k) = 2.


Theorem 2 For any given instance of PDP that has a solution, at the beginning of each iteration of the PB algorithm there is some hline(i) such that nbh(i) < nr(i) and nh(i) = nr(i) + 1 (except when a solution has been found).

Proof By Case 1 of the prediction step, at the beginning of each iteration there is no hline(i) with nh(i) − nr(i) < 0. Also, by the concurrent processes of previous iterations (see Phase 1, process 2), for each hline(i) with nh(i) = nr(i) all gray and white stars have been selected, so nbh(i) = nr(i). Therefore, for each hline(i) with nbh(i) < nr(i), we have nh(i) > nr(i) and thus nh(i) − nr(i) ≥ 1.

Now we show that if $b_k = \max\{b_i \mid b_i \in B,\ nbh(i) < nr(i)\}$, then nh(k) = nr(k) + 1. From the definition of $b_k$ we know that nbh(k) < nr(k) and therefore nbh(k) ≤ nr(k) − 1. From Theorem 1 we have nh(k) − nbh(k) = 2. So nh(k) − 2 ≤ nr(k) − 1, that is, nh(k) − nr(k) ≤ 1. Combined with nh(k) − nr(k) ≥ 1, this gives nh(k) − nr(k) = 1.

Note that if there are C choices for the selection of stars in an iteration, then at least C − 1 dashed blines are selected, and so the number of solid blines increases in each iteration.

5 Computational results

In this section, the details of the implementation of the PB algorithm are discussed. The algorithm is implemented in C# (.NET Framework 2.0) and was developed on a personal computer with a 2.16 GHz dual-core Intel processor running Windows XP. To show the efficiency of the algorithm, a comparison with Skiena's backtracking algorithm [7,8] on large instances of PDP is presented.

For randomly generated instances, we ran Skiena's algorithm and the PB algorithm on 50 different instances of each size. The average running times of both algorithms are shown in Table 1 (in seconds). The table shows that the PB algorithm is much faster than Skiena's algorithm. Table 1 also presents the numbers of backtracks for both algorithms. As shown, the number of backtracks of Skiena's algorithm increases with the size of the instances, whereas for our algorithm it is equal to 1 for all instances. Since the inverse of any solution of PDP is also a solution, inverse solutions were ignored.

As discussed before, Zhang constructed a class of instances for which Skiena's algorithm takes exponential time to find a solution [9]. We ran the PB algorithm on these instances with the sizes indicated in Table 2, again using 50 different instances of each size. The average running times are included in Table 2. As shown there, the PB algorithm solves the Zhang instances in a reasonable amount of time, and its running time grows more slowly than the exponential running time required by Skiena's algorithm. Because of the exponential running time of Skiena's algorithm on Zhang's instances, its running times for instances of size 50 or more are shown as ∞ in Table 2 to indicate that they are too large. The numbers of backtracks of both algorithms are also presented in this table. As shown, the number of backtracks of the PB algorithm is much smaller than that of Skiena's algorithm.
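For readers unfamiliar with the baseline, the following Python sketch shows the classical backtracking scheme usually attributed to Skiena et al. [7]: repeatedly take the largest remaining distance and try to place a point at that distance from the left end or from the right end. It is a textbook-style illustration written for this purpose, not the implementation timed in Tables 1 and 2; the name `skiena_pdp` is an assumption.

```python
from collections import Counter


def skiena_pdp(D):
    """Return a sorted point set whose pairwise distances equal D, or None."""
    remaining = Counter(D)
    width = max(remaining)
    remaining[width] -= 1              # the distance between the two ends
    X = {0, width}

    def place():
        live = [d for d, c in remaining.items() if c > 0]
        if not live:
            return True                # every distance has been explained
        y = max(live)
        # The largest remaining distance must involve one of the two ends.
        for cand in (y, width - y):
            need = Counter(abs(cand - x) for x in X)
            if all(remaining[d] >= c for d, c in need.items()):
                remaining.subtract(need)
                X.add(cand)
                if place():
                    return True
                X.remove(cand)         # backtrack: undo this placement
                remaining.update(need)
        return False

    return sorted(X) if place() else None


print(skiena_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
print(skiena_pdp([1, 2, 4]))           # no point set has these distances
```

Each failed placement that has to be undone corresponds to a backtrack; on Zhang's instances the number of such events grows exponentially, which is the behaviour the PB algorithm is designed to avoid.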


Table 1 Comparison of Skiena's algorithm and our algorithm for random instances

n     Time of Skiena's Alg. (s)   Time of PB Alg. (s)   Backtracks of Skiena's Alg.   Backtracks of PB Alg.
50    0.0109                      0.0031                49                            1
100   0.1562                      0.0093                111                           1
150   0.7796                      0.0164                174                           1
200   2.3281                      0.0343                228                           1
250   6.2643                      0.0648                325                           1
300   11.4664                     0.1171                344                           1
350   25.3515                     0.1828                479                           1
400   38.8195                     0.3093                517                           1
450   61.5906                     0.5125                574                           1
500   91.2671                     0.8501                641                           1

Table 2 Comparison of Skiena's algorithm and our algorithm for Zhang instances

n     Time of Skiena's Alg.   Time of PB Alg.   Backtracks of Skiena's Alg.   Backtracks of PB Alg.
10    0.0009                  0.0004            98                            6
15    0.011                   0.0007            758                           10
20    0.1356                  0.0015            5521                          21
25    1.3753                  0.0018            35183                         28
30    11.0531                 0.0043            187474                        49
35    90.4019                 0.0101            983166                        62
40    651.7195                0.0187            5013758                       131
45    4435.2169               0.032             25035262                      176
50    ∞                       0.0773            ∞                             365
55    ∞                       0.1375            ∞                             562
60    ∞                       0.2976            ∞                             953
65    ∞                       0.5445            ∞                             1533
70    ∞                       1.0296            ∞                             2579
75    ∞                       1.5625            ∞                             3512
80    ∞                       4.4164            ∞                             8732
85    ∞                       12.4523           ∞                             20589
90    ∞                       17.1632           ∞                             25534
95    ∞                       40.2007           ∞                             54559
100   ∞                       65.3765           ∞                             80644
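The difference in growth rates on the Zhang instances can be read off Table 2 by comparing consecutive running times. The short Python sketch below does this for the sizes at which both algorithms have finite entries; the numbers are copied from Table 2, and the helper name `step_ratios` is an assumption.

```python
# Average running times from Table 2 for n = 10, 15, ..., 45.
sizes = [10, 15, 20, 25, 30, 35, 40, 45]
skiena = [0.0009, 0.011, 0.1356, 1.3753, 11.0531, 90.4019, 651.7195, 4435.2169]
pb = [0.0004, 0.0007, 0.0015, 0.0018, 0.0043, 0.0101, 0.0187, 0.032]


def step_ratios(times):
    """Factor by which the time grows from each size to the next."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


for n, rs, rp in zip(sizes[1:], step_ratios(skiena), step_ratios(pb)):
    print(f"n={n:3d}  Skiena x{rs:6.2f}  PB x{rp:5.2f}")
```

Over this range, each increase of n by 5 multiplies the time of Skiena's algorithm by roughly 7 to 12, while the time of the PB algorithm grows by a factor below 2.5.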

6 Conclusions

In this paper a new model and an algorithm for PDP were presented and discussed. The model gives a new point of view on PDP and provides a good framework in which to search for solutions. Our algorithm is a predictive backtracking algorithm with a very small search tree. We implemented this algorithm and compared it with Skiena's algorithm on random and hard instances. The results show that our algorithm is much faster than Skiena's backtracking algorithm on both kinds of instances.


References

1. Błażewicz, J., Formanowicz, P., Kasprzak, M., Jaroszewski, M., Markiewicz, W.T.: Construction of DNA restriction maps based on a simplified experiment. Bioinformatics 17(5), 398–404 (2001)
2. Cieliebak, M., Eidenbenz, S., Penna, P.: Partial digest is hard to solve for erroneous input data. Theor. Comput. Sci. 349(3), 361–381 (2005)
3. Dakic, T.: On the Turnpike problem. PhD thesis, Simon Fraser University (2000)
4. Lemke, P., Werman, M.: On the complexity of inverting the autocorrelation function of a finite integer sequence, and the problem of locating n points on a line, given the $\binom{n}{2}$ unlabelled distances between them. Preprint 453, Institute for Mathematics and its Applications (IMA) (1988)
5. Patterson, A.L.: A direct method for the determination of the components of interatomic distances in crystals. Zeitschr. Krist. 90, 517–542 (1935)
6. Pevzner, P.A., Waterman, M.S.: Open combinatorial problems in computational molecular biology. In: Proceedings of the 3rd Israel Symposium on Theory of Computing and Systems (ISTCS 1995), pp. 158–173 (1995)
7. Skiena, S.S., Smith, W., Lemke, P.: Reconstructing sets from interpoint distances. In: Proceedings of the 6th ACM Symposium on Computational Geometry (SoCG 1990), pp. 332–339 (1990)
8. Skiena, S.S., Sundaram, G.: A partial digest approach to restriction site mapping. Bull. Math. Biol. 56, 275–294 (1994)
9. Zhang, Z.: An exponential example for a partial digest mapping algorithm. J. Comput. Biol. 1(3), 235–239 (1994)
