2008 International Symposium on Information Science and Engineering
An Improved Ant Colony Algorithm for DNA Sequence Alignment
Yidan Zhao1, Ping Ma2, Jie Lan1, Chun Liang3 & Guoli Ji1*
1 Department of Automation, Xiamen University, Xiamen, China (361005)
[email protected]
2 Zhengzhou University of Light Industry, Zhengzhou, China (450002)
[email protected]
3 Department of Botany, Miami University, Oxford, OH 45056, USA
[email protected]
978-0-7695-3494-7/08 $25.00 © 2008 IEEE DOI 10.1109/ISISE.2008.82

Abstract—DNA sequence alignment forms an important basis for bioinformatics. Developing accurate sequence alignment algorithms remains a very challenging computational problem. When applied to sequence alignment, the traditional ant colony algorithm is limited to aligning sequences of similar length and may become trapped in a local optimum. This paper puts forward an improved sequence alignment method based on the ant colony algorithm. By regulating the initial and final positions of the ants and by modifying the pheromones at different times, the new method can avoid a local optimum and, in particular, remove paths whose scores differ greatly. Consequently, our method is able to align sequences of very different lengths while avoiding the local optimum caused by the traditional algorithm. The sequence alignment results suggest that our improved ant colony algorithm is efficient and feasible for DNA sequence alignment.

Keywords-DNA sequence; alignment; ant colony algorithm; pheromone; score

I. INTRODUCTION

With the coming of the genomics era and the development of new high-throughput sequencing technologies, biological sequence data have grown explosively. How to use the enormous amount of DNA sequences, including Expressed Sequence Tags (ESTs), to explore interesting biological questions is an important bioinformatics challenge. In particular, accurate extraction of clean sequences from raw ESTs containing cloning vectors, adapters and other contaminants remains a difficult bioinformatics problem. Using the dynamic programming method (i.e., the Smith-Waterman algorithm) to identify cDNA termini holds great promise for extracting clean sequences unambiguously [1]. Unfortunately, when processing enormous numbers of sequences such as ESTs, the slow performance of exhaustive algorithms like the dynamic programming method is an obvious disadvantage.

The ant colony algorithm is a heuristic algorithm proposed a few years ago. Ants produce a chemical hormone, pheromone, which attracts other ants; more pheromone attracts more ants, which directly affects the probability that later-arriving ants choose a given path. That is to say, the entire ant colony starts out crawling along multiple routes, gradually shifts to the shortest route and eventually finds the optimal path between the food and the nest. Currently, the ant colony algorithm has been successfully used to solve combinatorial optimization problems such as the traveling salesman problem (TSP) [2], the job-shop scheduling problem and the quadratic assignment problem, and it has proved effective on these problems. In addition, the ant colony algorithm is a parallel algorithm that has good adaptability, strong robustness and positive feedback.
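For reference, the dynamic programming baseline mentioned above can be sketched in a few lines. This is a minimal illustration and not the implementation used in [1]: the function name `smith_waterman_score` is hypothetical, the match and mismatch values follow the simple score function used later in the paper (2 and -1), and the linear gap penalty of -1 is an assumption. The two sequences are the ones given in the Figure 1 caption.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s against t (linear gap penalty).

    The scoring values are assumptions chosen to mirror the paper's
    simple score function; only two DP rows are kept in memory.
    """
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # match or mismatch (diagonal)
                prev[j] + gap,       # gap in t (vertical)
                curr[j - 1] + gap,   # gap in s (horizontal)
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    S = "ATCGAG"          # short sequence from the Figure 1 caption
    T = "TGATCGGAGGA"     # long sequence from the Figure 1 caption
    print(smith_waterman_score(S, T))
```

Under these assumed penalties the call on the two Figure 1 sequences returns 11, the same score the paper reports for the alignment in Fig. 2. The table has |S| × |T| cells, so the cost grows with the product of the sequence lengths, which is the performance concern raised above for large EST collections.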
Homology search often adopts pairwise sequence alignment to find the similarity between sequences. It plays a very important role in elucidating conserved biological patterns among related sequences and discovering their functional, structural and evolutionary relationships. In this research, we first present the application of the ant colony algorithm to pairwise sequence alignment and then put forward an improved ant colony algorithm that resolves the problems of low efficiency and local optima common to the traditional ant colony algorithm and improves the accuracy of DNA sequence alignment. In our novel algorithm, we provide a new means of regulating ant start positions, updating the pheromones and locating ant end positions, and we remove paths that result in alignments of great difference. All these measures proved to be very effective in avoiding local optima. Through verification using well-defined cDNA (complementary DNA) sequence features, our new algorithm demonstrated a strong capability in finding better alignment solutions for sequences with great differences in length.

II. BASIC SEQUENCE ALIGNMENT METHOD BASED ON ANT COLONY ALGORITHM [3]

A DNA sequence is composed of four bases: A, T, C and G. Biological sequence alignment can therefore be considered as character or string alignment. The differences among homologous sequences appear as substitutions, insertions and/or deletions in the sequences. Given two sequences S and T, r(S, T) is the score function, which represents the score of an alignment between S and T. For $S_i, T_j \in \Omega$, the simplest score function is:

\[ r = \begin{cases} 2, & S_i = T_j \in \Omega \\ -1, & S_i \neq T_j \in \Omega \end{cases} \tag{1} \]

$S_i'$ stands for the sequence derived from $S_i$ in which gaps have been inserted [4]. To evaluate the quality of an algorithm, the most direct way is to generate all possible alignments, calculate their alignment scores, and pick one with the highest score as the final result. Let SCORE be the scoring function, with $S_i', T_j' \in \Omega \cup \{-\}$ and $|S'|$ the length of $S'$:

\[ \mathrm{SCORE} = \sum_{i=1}^{|S'|} r(S_i', T_j') \tag{2} \]

Fig. 1 shows the result of a sequence alignment between two sequences of different length using the traditional ant colony algorithm. If the ants' moving path is (1,1) (1,2) (1,3) (2,4) (3,5) (4,6) (5,7) (5,8) (6,9) (6,10) (6,11), as shown in Fig. 1, then according to formula (1) the alignment score is 7, which represents an optimal alignment.

Every artificial ant starts from the top left corner of the matrix, chooses a path to reach the bottom right corner of the matrix and thereby forms an alignment. During each cycle of the algorithm, moving one grid cell vertically expresses nucleotide deletion(s) in the sequence T, whereas moving one grid cell horizontally means nucleotide insertion(s) in the sequence T. A diagonal move stands for either a character match or a mismatch. Whenever the ants arrive at the lower right corner of the matrix, they return to the top left corner and continue to choose a path to the bottom right corner of the matrix. The specific algorithm steps [5] are as follows:

a) Given two sequences S = (A, T, C, G, A, G) and T = (T, G, A, T, C, G, G, A, G, G, A), build a matrix between S and T, as shown in Fig. 1.
b) At time t = 0, a group of k ants starts at (1,1). The moving directions available at each location are fixed and comprise three moves: down, right, and lower right (diagonal).

c) At each time step, every ant moves one cell. The transfer function is:

(i, j) → (i + 1, j) or (i + 1, j + 1) or (i, j + 1),  i ∈ 1, 2, …, |S|,  j ∈ 1, 2, …, |T|

The probability of an ant transferring from its current location to the next location at moment t is correlated with the pheromones and the path length.

The roulette method is adopted for path selection: when an ant selects a path, a random number P in (0,1) is generated. If $P > \rho_0$, the ant selects the direction k with the greatest probability among the three directions as its next moving direction. If $P \le \rho_0$, the ant takes $P_{ij}^{k}$ as the probability of moving in each of the three directions. The transition probability $P_{ij}^{k}$ from position (i, j) in direction k is:

\[ P_{ij}^{k} = \frac{A(i,j)^{\alpha} \times D(i,j)^{\beta}}{\sum_{k=1}^{3} \left( A(i,j)^{\alpha} \times D(i,j)^{\beta} \right)} \tag{3} \]

where k = 1, 2, 3 denote down, right, and lower right, respectively. D(i, j) is the matching degree of the characters, obtained by formula (1). α and β are, respectively, the weight of the pheromones accumulated during the ants' moves and the weight of the character matching degree in selecting a path; the influence of differences in character matching degree is determined by the value of β. A(i, j) is the pheromone at site (i, j). Assuming that an ant releases pheromone only on its own path, the pheromone is updated by:

\[ A(i,j) = (1-\rho) \times A(i,j) + \Delta\tau \tag{4} \]

\[ \Delta\tau = Q \times \mathrm{ant}(i,j) \tag{5} \]

where ρ is the volatility coefficient of the pheromone, ant(i, j) is the number of ants at site (i, j), and Q > 0 is a constant whose value may be changed over time.

d) On reaching (|S|, |T|), continue to the next cycle; return to c).

e) When time t = T, stop moving.

f) Calculate the alignment scores; the total score is the final result.

The sequence alignment created using the traditional ant colony algorithm is vulnerable because it can easily be trapped in a local optimum. Consequently, this approach cannot be used to accurately identify short DNA sequences such as cDNA termini within long sequences such as raw ESTs, which limits its applications in sequence alignment. By changing the ant start positions and updating the pheromones, we avoid local convergence and improve the overall capability of searching for optimal solutions.
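Steps b) to f) of the basic method can be illustrated with the following minimal sketch. It is not the authors' implementation, and several details are assumptions made only to obtain runnable code: all function names and default parameter values are hypothetical; the matching degree D is shifted to be positive (formula (1) plus 2) so that it can be raised to the power β; pheromone is updated once per cycle instead of once per time step; and a path is scored by counting diagonal steps with formula (1) and every other step as a gap of -1, which is not guaranteed to reproduce the scores quoted for Fig. 1. The sequences are those of the Figure 1 caption.

```python
import random

S, T = "ATCGAG", "TGATCGGAGGA"     # sequences from the Figure 1 caption
MOVES = [(1, 0), (0, 1), (1, 1)]   # k = 1, 2, 3: down, right, lower right


def r(a, b):
    """Formula (1): 2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def matching_degree(i, j):
    """D(i, j), shifted to be positive (assumption) so it can be weighted."""
    return r(S[i], T[j]) + 2


def next_cell(i, j, A, alpha, beta, rho0, rng):
    """Choose the next cell with formula (3) and the P versus rho0 rule."""
    cells = [(i + di, j + dj) for di, dj in MOVES
             if i + di < len(S) and j + dj < len(T)]
    weights = [A[x][y] ** alpha * matching_degree(x, y) ** beta
               for x, y in cells]
    if rng.random() > rho0:                        # P > rho0: greedy choice
        return cells[weights.index(max(weights))]
    return rng.choices(cells, weights=weights)[0]  # P <= rho0: roulette


def walk(A, alpha, beta, rho0, rng):
    """One ant walks from the top left to the bottom right corner."""
    path = [(0, 0)]
    while path[-1] != (len(S) - 1, len(T) - 1):
        path.append(next_cell(*path[-1], A, alpha, beta, rho0, rng))
    return path


def path_score(path):
    """Assumed scoring: diagonal steps use r, other steps are gaps (-1)."""
    score = r(S[0], T[0])
    for (pi, pj), (i, j) in zip(path, path[1:]):
        diagonal = (i - pi, j - pj) == (1, 1)
        score += r(S[i], T[j]) if diagonal else -1
    return score


def run(n_ants=10, cycles=50, alpha=1.0, beta=2.0, rho=0.1, rho0=0.5,
        Q=1.0, seed=0):
    rng = random.Random(seed)
    A = [[1.0] * len(T) for _ in S]                # initial pheromone
    best_path, best = None, float("-inf")
    for _ in range(cycles):
        paths = [walk(A, alpha, beta, rho0, rng) for _ in range(n_ants)]
        visits = [[0] * len(T) for _ in S]         # ant(i, j) in this cycle
        for path in paths:
            for i, j in path:
                visits[i][j] += 1
            if path_score(path) > best:
                best_path, best = path, path_score(path)
        for i in range(len(S)):                    # formulas (4) and (5)
            for j in range(len(T)):
                A[i][j] = (1 - rho) * A[i][j] + Q * visits[i][j]
    return best_path, best


if __name__ == "__main__":
    path, score = run()
    print(score, [(i + 1, j + 1) for i, j in path])  # 1-based, as in the text
```

Because every ant must start at the top left corner and finish at the bottom right corner, each path spans the whole of T, so the leading and trailing bases of the long sequence always contribute gaps. This is the restriction that the improved method relaxes.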
According to the paths, we then remove scores with large differences and finally obtain the optimal sequence alignments for sequences with great differences in length (e.g., finding short cDNA termini within long EST sequences).

Figure 1. Optimal path of aligning sequence S (ATCGAG) and T (TGATCGGAGGA) based on the traditional ant colony algorithm

III. IMPROVED SEQUENCE ALIGNMENT METHOD BASED ON ANT COLONY ALGORITHM

In order to prevent the ants from leaning toward a certain point of the sequence, we allocate the ants evenly along the sequence as their starting points; that is, every starting point in the matrix receives the same number of ants. Each ant chooses a path in the matrix to reach the last row, which forms a comparison. On arriving at the last row, the ant is placed at a random point in the first row of the matrix and continues to choose a path to the last row. This method scatters the ant starting points evenly along the sequence so that the ants are not confined to the same starting point or limited to a single end point, which reduces the possibility of a local optimum and reaches the global optimum faster.

The sequences used in Fig. 1 and Fig. 2 are the same. The route shows the alignment result using our improved method. If the ant route is (1,3) (2,4) (3,5) (4,6) (5,7) (5,8) (6,9), as shown in Fig. 2, then according to formula (1) the alignment score is 11, which is the optimal alignment.

Q varies as time goes on, which reduces the possibility of low efficiency and of falling into a local optimum:

\[ Q = \begin{cases} 1, & t \le T_1 \\ 2, & T_1 < t \le T_2 \\ 3, & T_2 < t \le T \end{cases} \qquad 0 < T_1 < T_2 < T \tag{6} \]

d) End at (|S|, j), j ∈ 1, 2, …, len, and then continue to the next cycle, in which the ants start at random positions (1, i), i ∈ 1, 2, …, len; return to c).

e) When time t = T, stop moving.

f) Calculate the maximum alignment score for each starting point. If there is a big difference between the vertical path length and the horizontal path length, this suggests that the result contains many gaps and that the subsequence obtained from the alignment is not credible or acceptable; the score is then decreased so that the path will not be the final selection. The formula for the calculation is as follows:

SCORE (i) =
SCORE (i) M
(7)
if end x −i + 1 >| S | + m
At the initial time, n ants are placed on every
or end x − i + 1