AcoSeeD: An Ant Colony Optimization for Finding Optimal Spaced Seeds in Biological Sequence Search

Dong Do Duc1, Huy Q. Dinh2, Thanh Hai Dang3, Kris Laukens3,4, and Xuan Huan Hoang5

1 Institute of Information Technology, Vietnam National University, Hanoi, Vietnam
2 Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University, Vienna, Austria
3 Biomina - Biomedical Informatics Research Center Antwerp, Antwerp University Hospital / University of Antwerp, Edegem, Belgium
4 Advanced Database Research and Modelling (ADReM), University of Antwerp, Belgium
5 University of Technology (UET), Vietnam National University, Hanoi, Vietnam
[email protected], [email protected]

Abstract. Similarity search in biological sequence databases is one of the most popular and important bioinformatics tasks. Spaced seeds have been increasingly used to improve the quality and sensitivity of searching, for example, in seeded alignment methods. Finding optimal spaced seeds is an NP-hard problem. In this study we introduce an application of an Ant Colony Optimization (ACO) algorithm to address this problem in a metaheuristics framework. This method, called AcoSeeD, builds optimal spaced seeds on an elegant construction graph and uses the standard ACO framework with a modified pheromone update. Experimental results demonstrate that AcoSeeD brings a significant improvement in sensitivity while demanding the same computational time as other state-of-the-art methods. We also introduce an alternative way of using local search that exploits a fast approximation of the objective function in ACO.

1 Introduction

The revolution in sequencing technologies is yielding a tremendous and growing number of biological sequences, which are stored in numerous databases (e.g., NCBI GenBank). As a consequence, searching for similarity or local alignments between biological sequences in large databases is among the most popular bioinformatics tasks. It is therefore crucial to develop search algorithms that are highly sensitive and time-efficient. The pioneering work on sequence similarity search by Smith and Waterman [10] uses dynamic programming to generate the exact solution but demands quadratic running time. With the current growth of data sets, this class of methods is no longer efficient enough in terms of computational time. Heuristic alternatives, such as BLAST [1], have been used instead. Those methods are based on an approximate match between two biological sequences (namely genes, proteins or even whole genomes) that is called a seed. A seed is a string denoting the similarity between biological sequences. Seeded alignment is widely used in biological sequence searching applications, recently for example in short-read mapping and genome assembly algorithms for next-generation sequencing data.

To obtain search results with high sensitivity, the spaced seed approach proposed in [9], [12] relaxes the matching requirement and thus allows much more flexibility in alignments. Related methods such as Mandala [12] and Iedera [8] were developed and successfully applied in a number of alignment methods (e.g., BFAST [5], PatternHunter II [9], SHRiMP [2]). To evaluate the quality of a set of spaced seeds, Li et al. [9] introduced a dynamic programming algorithm for computing its sensitivity (i.e., the probability that the seed set hits an alignment). Furthermore, Ilie et al. [6] proposed a heuristic measure called Overlap Complexity (OC) that approximates this sensitivity in polynomial time. More recently, the state-of-the-art method for finding optimal spaced seeds, called SpEED [7], has been introduced. SpEED uses a popular meta-heuristic method, namely hill climbing, together with the OC heuristic. It was demonstrated to improve both sensitivity and running time in comparison to previous methods.

In this regard, we propose an application of ACO for finding multiple spaced seeds. Ant Colony Optimization [4] is a meta-heuristic technique based on simulating the behavior of real ant colonies. Our method, called AcoSeeD, uses an adaptation of the MAX-MIN Ant System in which an ant colony travels on a suitable construction graph to build spaced seeds. The experimental results demonstrate that the method outperforms the existing state-of-the-art method for finding spaced seeds, namely SpEED, in all configuration settings of the test cases given the same number of intermediate solutions.

M. Dorigo et al. (Eds.): ANTS 2012, LNCS 7461, pp. 204–211, 2012. © Springer-Verlag Berlin Heidelberg 2012

2 Spaced Seed Optimization Problem

Under the assumption of the Bernoulli model, a random sequence R of length N consisting of 0 (mismatch) and 1 (match) symbols is used to represent a sequence alignment with a given matching probability [1]. A spaced seed s, a string of 1 (match required) and ∗ (match or mismatch allowed) symbols, is said to hit R if s can be aligned with R so that every 1 in s is aligned with a 1 in R. A set S of k spaced seeds is said to hit R if at least one of its seeds hits R. For example, the seed set {11∗1, 1∗11} hits each of the sequences 100110100001, 1000010110001, 1000011110001 and 1101001011001. A spaced seed s is associated with a weight w, the number of 1s in the string. The problem of finding multiple spaced seeds is described as follows: given a random alignment sequence R of length N and the matching probability p between two biological sequences, find a set of k spaced seeds of weight w that maximizes the probability of hitting R, i.e., the sensitivity. This problem is NP-hard [9], whether the seed lengths are given or unknown. In this paper, we also present an ACO application to find the seed lengths with respect to the sensitivity.
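The hit definition can be made concrete with a short sketch. The function names below (`seed_hits`, `seed_set_hits`, `weight`) are illustrative and do not come from the paper; alignments and seeds are assumed to be plain strings over {0, 1} and {1, *}, respectively.

```python
def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if `seed` hits `alignment` at some offset.

    A hit at offset t requires alignment[t + j] == '1' for every position j
    where seed[j] == '1'; '*' positions may be a match or a mismatch.
    """
    for t in range(len(alignment) - len(seed) + 1):
        if all(alignment[t + j] == "1" for j, c in enumerate(seed) if c == "1"):
            return True
    return False


def seed_set_hits(seeds, alignment: str) -> bool:
    """A seed set hits the alignment if at least one of its seeds does."""
    return any(seed_hits(s, alignment) for s in seeds)


def weight(seed: str) -> int:
    """Weight of a seed: the number of '1' symbols it contains."""
    return seed.count("1")


if __name__ == "__main__":
    seeds = ["11*1", "1*11"]
    examples = ["100110100001", "1000010110001", "1000011110001", "1101001011001"]
    for r in examples:
        print(r, seed_set_hits(seeds, r))
```

Under this definition the sensitivity of a seed set is the probability that `seed_set_hits` is true for a random alignment of length N whose positions are 1 independently with probability p.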

3

3.1 Construction Graph

A construction graph (Fig. 1A) is defined to have k rectangles of size w × (l_max − w). Each ant builds k seeds by traveling on each rectangle either up or right (Fig. 1B), starting from the node at coordinate (i, 0, 0) for rectangle i, i = 1 . . . k. Each move adds '1' (right) or '*' (up) to the current seed. The walk on rectangle i stops when the ant reaches the node (i, w, length_i − w), where length_i ≤ l_max is the length of seed i. We note that length_i can be given or found by another ACO procedure that is presented in Section 3.2. Because only up and right moves are allowed, every such path is guaranteed to yield a seed of weight w (see the example in Fig. 1C).

Fig. 1. Construction graph and seed building procedure. (A) Construction graph for building a set of k spaced seeds of weight w. (B) The direction of an ant's travel path. (C) An example of building a spaced seed of weight 4 and length 7. The path (RURUURR) of an ant as depicted represents the seed 1∗1∗∗11.

The pheromone concentration τ determines how likely an ant building seed i at coordinate (x, y) is to move up (pheromone τ_∗, to coordinate (x, y + 1)) or right (pheromone τ_1, to coordinate (x + 1, y)) (Fig. 1B). The move v is chosen with probability

$$P^{i}_{(x,y)}(v) = \frac{\tau^{i}_{x,y,v}}{\tau^{i}_{x,y,*} + \tau^{i}_{x,y,1}}, \qquad v \in \{*, 1\} \tag{1}$$
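The walk governed by rule (1) can be sketched as follows. The dictionary layout `tau[i, x, y, v]` and the function names are assumptions made for illustration; forcing the only remaining move once a border of the rectangle is reached is also an assumption, chosen so that every walk ends at (i, w, length_i − w).

```python
import random


def move_probability(tau, i, x, y, v):
    """Rule (1): probability of choosing move v in {'*', '1'} at node (i, x, y)."""
    return tau[i, x, y, v] / (tau[i, x, y, "*"] + tau[i, x, y, "1"])


def build_seed(tau, i, w, length, rng):
    """Walk from (i, 0, 0) to (i, w, length - w); right adds '1', up adds '*'."""
    x = y = 0
    seed = []
    while x < w or y < length - w:
        if x == w:                 # right border reached: only 'up' remains
            v = "*"
        elif y == length - w:      # top border reached: only 'right' remains
            v = "1"
        else:
            v = "1" if rng.random() < move_probability(tau, i, x, y, "1") else "*"
        seed.append(v)
        x, y = (x + 1, y) if v == "1" else (x, y + 1)
    return "".join(seed)


def path_to_seed(path):
    """Translate a path of 'R'/'U' moves into a seed string."""
    return "".join("1" if step == "R" else "*" for step in path)


if __name__ == "__main__":
    w, length = 4, 7
    # Uniform initial pheromone on every edge of rectangle i = 1 (illustrative).
    tau = {(1, x, y, v): 1.0
           for x in range(w + 1) for y in range(length - w + 1) for v in "*1"}
    print(build_seed(tau, 1, w, length, random.Random(0)))
    print(path_to_seed("RURUURR"))  # the path of Fig. 1C
```

Whatever the random choices, the returned seed has exactly w ones and length − w wildcards, which is the weight guarantee stated above.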

The pheromone is updated following the adapted MAX-MIN Ant System rule [11], [3]. In more detail, the path s_{i-best} of the iteration-best ant, i.e., the ant that obtains the highest sensitivity in the current iteration, is used to update the pheromone in the construction graph as follows

$$\tau^{i}_{x,y,v} \leftarrow (1 - \rho)\,\tau^{i}_{x,y,v} + \Delta\tau, \quad v \in \{*, 1\}, \qquad \Delta\tau = \begin{cases} \rho\,\tau_{max} & \text{if } (x, y, v) \in s_{i\text{-}best} \\ \rho\,\tau_{min} & \text{otherwise} \end{cases} \tag{2}$$

3.2 ACO-Based Seed Length Identification

We also apply ACO to identify the optimal length for each seed separately, based on the construction graph described in Fig. 2. Ants choose the next node according to

$$P_{i}(l) = \frac{\tau_{i,l}\,\eta_{i,l}}{\sum_{h=v}^{l_{max}} \tau_{i,h}\,\eta_{i,h}}, \qquad v = \begin{cases} l_{min} & i = 1 \\ length_{i-1} & i > 1 \end{cases} \tag{3}$$

where τ_{i,l} indicates how likely the ants are to choose length l for seed i. The heuristic information η_{i,l} is computed as follows

$$\eta_{i,l} = \begin{cases} 0.5 & (i > 1) \land (l = length_{i-1}) \\ 1.0 & (i = 1) \lor \big((l > length_{i-1}) \land (l \le l_{max} - (k - i))\big) \\ 0.1 & \text{otherwise} \end{cases} \tag{4}$$

Here a length for seed i equal to the length of seed (i − 1) is chosen with priority 0.5. A length larger than that of the preceding seed (i.e., seed i − 1) and at most l_max − (k − i), so that the k − i succeeding seeds still have a chance to be assigned larger lengths, is chosen with priority 1.0. Otherwise the priority is 0.1. The pheromone is updated with the same rule as described above (see formula 2).

Fig. 2. ACO construction graph for the identification of seed lengths. From the starting node in the construction graph, each ant chooses one of the nodes (1, l_min), (1, l_min + 1), . . . , (1, l_max), so that seed 1 has the respective length l_min, l_min + 1, . . . , l_max. Because seed lengths are non-decreasing, the ant then chooses one of the nodes (i, length_{i−1}), (i, length_{i−1} + 1), . . . , (i, l_max) for i = 2, . . . , k.

3.3 AcoSeeD Algorithm

Overall, AcoSeeD works as outlined in the following scheme:

Algorithm 1. Pseudo code of the AcoSeeD algorithm
Data: w, k, p, N
Output: The optimal spaced seed set s_{g-best} w.r.t. the sensitivity
begin
    s_{g-best} ← null;
    Estimate l_min, l_max;
    while stop conditions not satisfied do
        foreach i = 1..N_ant do
            Determine the seed lengths; {section 3.2}
            s_i ← SolutionConstruction(); {seed set built by the i-th ant}
            s_i ← LocalSearch(s_i); {using OC heuristics}
            Compute the sensitivity F(s_i) of s_i;
        s_{i-best} ← argmax(F(s_1), F(s_2), . . . , F(s_{N_ant}));
        ApplyPheromoneUpdate(s_{i-best});
        Update the global best seed set s_{g-best};
    Output s_{g-best};
end
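A compact sketch of Algorithm 1 follows, combining rules (1) to (4). It is an illustration under stated assumptions rather than the authors' implementation: the data layout and all function names are hypothetical, l_min and l_max are passed in instead of being estimated, the local search step is omitted, and the exact dynamic-programming sensitivity is replaced by a simple Monte Carlo estimate (`mc_sensitivity`).

```python
import random


def eta(i, l, prev, k, l_max):
    """Heuristic of rule (4); i is 1-based and prev is the length of seed i-1."""
    if i > 1 and l == prev:
        return 0.5
    if i == 1 or (l > prev and l <= l_max - (k - i)):
        return 1.0
    return 0.1


def choose_lengths(tau_len, k, l_min, l_max, rng):
    """Rule (3): sample non-decreasing seed lengths, one per seed."""
    lengths, prev = [], l_min
    for i in range(1, k + 1):
        cands = list(range(prev, l_max + 1))
        weights = [tau_len[i, l] * eta(i, l, prev, k, l_max) for l in cands]
        prev = rng.choices(cands, weights=weights)[0]
        lengths.append(prev)
    return lengths


def build_seed(tau, i, w, length, rng):
    """Rule (1): up/right walk on rectangle i; moves at a border are forced."""
    x, y, seed = 0, 0, ""
    while x < w or y < length - w:
        p1 = tau[i, x, y, "1"] / (tau[i, x, y, "1"] + tau[i, x, y, "*"])
        right = y == length - w or (x < w and rng.random() < p1)
        seed += "1" if right else "*"
        x, y = (x + 1, y) if right else (x, y + 1)
    return seed


def edges(i, seed):
    """Edges (i, x, y, v) on the path that spells `seed`."""
    x, y, out = 0, 0, set()
    for v in seed:
        out.add((i, x, y, v))
        x, y = (x + 1, y) if v == "1" else (x, y + 1)
    return out


def update(tau, best, rho, t_min, t_max):
    """Rule (2): evaporate, then add rho*t_max on best edges, rho*t_min elsewhere."""
    for key in tau:
        tau[key] = (1 - rho) * tau[key] + rho * (t_max if key in best else t_min)


def mc_sensitivity(seeds, n=50, p=0.8, trials=200, rng_seed=1):
    """Monte Carlo stand-in for the exact sensitivity (an assumption of this sketch)."""
    rng, hits = random.Random(rng_seed), 0
    for _ in range(trials):
        r = [rng.random() < p for _ in range(n)]
        hits += any(all(r[t + j] for j, c in enumerate(s) if c == "1")
                    for s in seeds for t in range(n - len(s) + 1))
    return hits / trials


def acoseed(w, k, l_min, l_max, fitness, n_ants=50, n_loops=100,
            rho=0.3, t_max=1.0, rng_seed=0):
    rng = random.Random(rng_seed)
    tau_len = {(i, l): t_max for i in range(1, k + 1) for l in range(l_min, l_max + 1)}
    tau = {(i, x, y, v): t_max for i in range(1, k + 1) for x in range(w + 1)
           for y in range(l_max - w + 1) for v in "1*"}
    g_best, g_fit = None, float("-inf")
    for _ in range(n_loops):
        ants = []
        for _ in range(n_ants):
            lengths = choose_lengths(tau_len, k, l_min, l_max, rng)
            seeds = [build_seed(tau, i, w, n, rng) for i, n in enumerate(lengths, 1)]
            ants.append((fitness(seeds), seeds))  # LocalSearch(s_i) would go here
        fit, seeds = max(ants)                    # iteration-best ant
        path = set().union(*(edges(i, s) for i, s in enumerate(seeds, 1)))
        update(tau, path, rho, t_max / (2 * w * w * k), t_max)
        update(tau_len, {(i, len(s)) for i, s in enumerate(seeds, 1)},
               rho, t_max / (2 * w * k), t_max)
        if fit > g_fit:
            g_best, g_fit = seeds, fit
    return g_best, g_fit


if __name__ == "__main__":
    print(acoseed(w=4, k=2, l_min=5, l_max=8, fitness=mc_sensitivity,
                  n_ants=5, n_loops=5))
```

Because rule (2) mixes the old value with either τ_max or τ_min, a pheromone value that starts inside [τ_min, τ_max] stays inside that interval without any explicit clamping.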

3.4 Local Search Using Overlap Complexity

After each ant completes building a set of spaced seeds, a local search is applied. Because computing the original sensitivity is expensive, the local search used in [7] is performed with an objective function based on the OC heuristic. The OC is an approximation of the sensitivity that is much faster to evaluate than the dynamic programming algorithm [9], which requires exponential time. Starting from the spaced seeds built by the current ant, the local search tries to swap a 1 and a * within each seed, which leaves the seed weight unchanged, to obtain a new set of spaced seeds with a better approximated sensitivity.
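A minimal sketch of such a swap-based local search is given below. It assumes the pairwise-shift definition of overlap complexity from [6], which this paper does not restate: for two seeds, the sum over all relative shifts of 2 raised to the number of aligned pairs of 1s, summed over all seed pairs including each seed with itself, with lower values taken as better. The function names, the first-improvement strategy, and the choice to keep the first and last positions of each seed fixed are assumptions of this sketch.

```python
from itertools import combinations_with_replacement


def overlap(s1: str, s2: str) -> int:
    """Sum over all relative shifts of 2 ** (number of aligned pairs of 1s)."""
    total = 0
    for shift in range(1 - len(s2), len(s1)):
        sigma = sum(1 for j, c in enumerate(s2)
                    if c == "1" and 0 <= j + shift < len(s1) and s1[j + shift] == "1")
        total += 2 ** sigma
    return total


def oc(seeds) -> int:
    """Overlap complexity of a seed set: all pairs, each seed also with itself."""
    return sum(overlap(a, b) for a, b in combinations_with_replacement(seeds, 2))


def local_search(seeds, score=oc):
    """First-improvement hill climbing over single 1/* swaps inside one seed.

    A swap exchanges a '1' and a '*', so weight and length never change.
    The end positions of each seed are left untouched (sketch assumption).
    """
    seeds = list(seeds)
    best = score(seeds)
    improved = True
    while improved:
        improved = False
        for i in range(len(seeds)):
            s = seeds[i]
            for a in range(1, len(s) - 1):
                for b in range(a + 1, len(s) - 1):
                    if s[a] == s[b]:
                        continue
                    t = list(s)
                    t[a], t[b] = t[b], t[a]
                    cand = seeds[:i] + ["".join(t)] + seeds[i + 1:]
                    val = score(cand)
                    if val < best:  # keep the swap only if OC decreases
                        seeds, best, improved = cand, val, True
                        s = seeds[i]
    return seeds, best


if __name__ == "__main__":
    start = ["1111***1", "111**1*1"]
    print(oc(start), local_search(start))
```

The search stops because every accepted swap strictly lowers a positive integer score, so only finitely many improvements are possible.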


4 Experimental Results

4.1 Datasets

To compare with existing methods including the state-of-the-art method SpEED, we evaluated spaced seed identification using the parameter settings that are used in practice by a number of popular biological sequence alignment/search programs, namely SHRiMP [2], PatternHunter II [9] and BFAST [5]. The SHRiMP setting consists of 15 datasets with a small number of seeds (k = 4), whereas each of the two others uses 3 datasets with a large set of seeds (k = 10, 16). PatternHunter II has the largest alignment length, N = 64, whereas N = 50 for the two others. These datasets have a seed weight w ranging from 10 to 20 and a matching probability p from 0.70 to 0.95.

4.2 Comparison Results

To obtain a fair comparison with the state-of-the-art work [7], we ran AcoSeeD for N_solutions = 5000 solutions as done in [7]; each solution was generated once and then passed to the OC-based local search to improve its sensitivity. Specifically, N_ants = 50 ants were used in each of N_loops = 100 loops to determine N_solutions = N_ants ∗ N_loops sets of spaced seeds. Hence, the computational complexity of both methods is O(N_solutions ∗ (ls + o)), where ls and o are the complexities of the local search and of the sensitivity computation used as the objective function, respectively. We further set the other ACO parameters of AcoSeeD as follows: (1) the pheromone evaporation factor ρ = 0.3, (2) the upper bound of the pheromone trail τ_max = 1.0, (3) the lower bound of the pheromone trail τ_min = τ_max / W, where W = 2 ∗ w ∗ k for the seed length finding and W = 2 ∗ w ∗ w ∗ k for the seed building process. The difference between the pheromone bounds is thus set proportionally to the number of graph nodes.

Table 1. Experimental comparison using the SHRiMP dataset: results of AcoSeeD and the existing methods (Mandala [12], Iedera [8], SpEED [7]) for a small number of seeds. The columns "ACO-best" and "ACO-worst" give the AcoSeeD sensitivity for the best and the worst seed set, respectively. In addition, the average AcoSeeD sensitivity over 10 runs is given in the last column. The sensitivities of the other methods are taken from the SpEED paper [7].

SHRiMP: 4 seeds (N = 50)
w    p     Mandala  Iedera   SpEED-best  ACO-best  ACO-worst  ACO-average
10   0.75  90.6608  90.6802  90.9098     90.9757   90.9104    90.9513
10   0.8   97.7316  97.7586  97.8337     97.8584   97.8467    97.8521
10   0.85  99.7283  99.7437  99.7569     99.7624   99.7599    99.7614
11   0.75  83.0512  83.2413  83.3793     83.5349   83.4207    83.4728
11   0.8   94.7845  94.935   94.9861     95.0636   95.0144    95.037
11   0.85  99.1929  99.2189  99.2431     99.2498   99.2451    99.2478
12   0.8   90.258   90.3934  90.575      90.6576   90.6147    90.6328
12   0.85  98.0786  98.0781  98.1589     98.1786   98.1682    98.1766
12   0.9   99.8633  99.8773  99.8821     99.8866   99.8845    99.8853
16   0.85  84.3838  84.5795  84.8212     85.0328   84.915     84.9829
16   0.9   97.3023  97.2806  97.4321     97.483    97.464     97.4712
16   0.95  99.9287  99.9331  99.9388     99.9429   99.9414    99.9419
18   0.85  72.1954  72.1695  73.1664     73.3357   73.2432    73.27
18   0.9   93.0855  93.0442  93.712      93.7912   93.7597    93.7778
18   0.95  99.6603  99.669   99.75       99.7617   99.7575    99.7599

Table 1 shows that the sensitivity increases by between 0.007% and 0.134% across all test cases of the SHRiMP dataset. As noted by Ilie et al. (2011) in the SpEED paper [7], a 1% sensitivity improvement is significant: with 100× coverage of the human genome, a better seed of this kind can bring in an additional 3 billion mapped nucleotides. The difference between AcoSeeD and SpEED therefore means that a substantial number of additional nucleotides can be recovered from the sequencing data. Furthermore, in every test case even the worst solution (i.e., spaced seed set) among the 10 AcoSeeD runs has a higher sensitivity than the best result obtained with SpEED.

Fig. 3 shows a performance comparison in terms of sensitivity between AcoSeeD and SpEED (both the first and the last run) on the PatternHunter II and BFAST datasets. Even though SpEED performs well from its first to its last result based on the OC heuristic, AcoSeeD still yields improved performance. Our method yielded an improvement of up to 0.89% for the PatternHunter II dataset and 2.33% for the BFAST dataset. This allows us to conclude that the AcoSeeD approach can significantly boost sequence alignment mapping for high-coverage sequencing of large genomes.
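As a worked instance of the parameter settings, consider the SHRiMP configuration with w = 10 and k = 4 from Table 1. The genome size of roughly 3 × 10^9 nucleotides used in the last line is an assumption added here to show where the figure of 3 billion nucleotides plausibly comes from; it is not stated in the text.

$$N_{solutions} = N_{ants} \cdot N_{loops} = 50 \cdot 100 = 5000$$

$$\tau_{min}^{length} = \frac{\tau_{max}}{2wk} = \frac{1.0}{2 \cdot 10 \cdot 4} = 0.0125, \qquad \tau_{min}^{seed} = \frac{\tau_{max}}{2w^{2}k} = \frac{1.0}{2 \cdot 10 \cdot 10 \cdot 4} = 0.00125$$

$$\underbrace{3 \times 10^{9}}_{\text{genome size}} \times \underbrace{100}_{\text{coverage}} \times \underbrace{0.01}_{1\%\ \text{gain}} = 3 \times 10^{9}\ \text{nucleotides}$$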

Fig. 3. Experimental performance comparison for large size datasets


5 Conclusions

In this paper, we proposed an ACO-based approach for tackling the problem of finding spaced seeds for biological sequence searching. Our method, AcoSeeD, uses a construction graph in which each spaced seed is built as a forward-only path in a separate rectangle graph, and integrates the refined MAX-MIN pheromone update procedure [3]. A flexible and quick local search procedure based on the Overlap Complexity heuristic is also applied to boost the quality of the seeds. The experimental results and comparisons based on several benchmark datasets demonstrate that AcoSeeD outperforms existing methods in terms of sensitivity without consuming extra computing time.

Acknowledgments. We would like to thank Prof. von Haeseler for the introduction to the spaced seed problem. This work is partially supported by the Vietnam National Foundation for Science & Technology Development (NAFOSTED) and the TRIG project at the University of Engineering and Technology, VNU Hanoi.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
2. David, M., Dzamba, M., Lister, D., Ilie, L., Brudno, M.: SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7), 1011–1012 (2011)
3. Do Duc, D., Dinh, H.Q., Hoang Xuan, H.: On the Pheromone Update Rules of Ant Colony Optimization Approaches for the Job Shop Scheduling Problem. In: Bui, T.D., Ho, T.V., Ha, Q.T. (eds.) PRIMA 2008. LNCS (LNAI), vol. 5357, pp. 153–160. Springer, Heidelberg (2008)
4. Dorigo, M., Stützle, T.: Ant Colony Optimization. The MIT Press, Cambridge (2004)
5. Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale genome resequencing. PLoS ONE 4(11), e7767 (2009)
6. Ilie, L., Ilie, S.: Multiple spaced seeds for homology search. Bioinformatics 23(22), 2969–2977 (2007)
7. Ilie, L., Ilie, S., Bigvand, A.M.: SpEED: fast computation of sensitive spaced seeds. Bioinformatics 27(17), 2433–2434 (2011)
8. Kucherov, G., Noe, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinform. Comput. Biol. 4(2), 553–569 (2006)
9. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: highly sensitive and fast homology search. Genome Inform. 14, 164–175 (2003)
10. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
11. Stützle, T., Hoos, H.: Max-min ant system. Future Gener. Comp. Sy. 16, 889–914 (2000)
12. Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. J. Comput. Biol. 12(6), 847–861 (2005)