An Improved Ant Colony Algorithm for DNA Sequence Alignment

2008 International Symposium on Information Science and Engineering

Yidan Zhao1, Ping Ma2, Jie Lan1, Chun Liang3 & Guoli Ji1*

1 Department of Automation, Xiamen University, Xiamen, China (361005)
2 Zhengzhou University of Light Industry, Zhengzhou, China (450002)
3 Department of Botany, Miami University, Oxford, OH 45056, USA

Abstract—DNA sequence alignment forms an important basis for bioinformatics, and developing accurate sequence alignment algorithms remains a very challenging computational problem. When applied to sequence alignment, the traditional ant colony algorithm is limited to aligning sequences of similar length and may become trapped in a local optimum. This paper presents an improved sequence alignment method based on the ant colony algorithm. By regulating the initial and final positions of the ants and by modifying the pheromones at different times, the new method avoids local optima and removes paths whose scores differ greatly. Consequently, our method can align sequences of very different lengths and avoids the local optimum caused by the traditional algorithm. The sequence alignment results suggest that our improved ant colony algorithm is efficient and feasible for DNA sequence alignment.

Keywords-DNA sequence; alignment; ant colony algorithm; pheromone; score

978-0-7695-3494-7/08 $25.00 © 2008 IEEE DOI 10.1109/ISISE.2008.82

I. INTRODUCTION

With the coming of the genomics era and the development of new high-throughput sequencing technologies, the volume of biological sequence data has grown explosively. How to use the enormous amount of DNA sequences, including Expressed Sequence Tags (ESTs), to explore interesting biological questions is an important bioinformatics challenge. In particular, accurate extraction of clean sequences from raw ESTs containing cloning vectors, adapters and other contaminants remains a difficult bioinformatics problem. Using the dynamic programming method (i.e., the Smith-Waterman algorithm) to identify cDNA termini shows great promise for extracting clean sequences unambiguously [1]. Unfortunately, when processing an enormous number of sequences such as ESTs, the slow performance of exhaustive algorithms like dynamic programming is an obvious disadvantage.

The ant colony algorithm is a heuristic algorithm proposed a few years ago. Ants produce chemical hormones called pheromones, which attract other ants; more pheromone attracts more ants, which directly affects the probability that later-arriving ants choose a given path. That is to say, the ant colony starts out crawling along multiple routes, gradually shifts to the shortest route and eventually finds the optimal path between the food and the nest. The ant colony algorithm has been successfully used to solve combinatorial optimization problems such as the traveling salesman problem (TSP) [2], the job-shop scheduling problem and the quadratic assignment problem, and it has proved effective on them. In addition, the ant colony algorithm is a parallel algorithm with good adaptability, strong robustness and positive feedback.

Homology search often adopts pairwise sequence alignment to find the similarity between sequences. It plays a very important role in elucidating conserved biological patterns among related sequences and discovering their functional, structural and evolutionary relationships. In this research, we first present the application of the ant colony algorithm to pairwise sequence alignment and then put forward an improved ant colony algorithm that resolves the problems of low efficiency and local optima common to the traditional ant colony algorithm and improves the accuracy of DNA sequence alignment. In our novel algorithm, we provide a new means of regulating ant start positions, updating the pheromones and locating ant end positions, and we remove paths that result in alignments of great difference. All these measures proved to be very effective in avoiding local optima. Through verification using well-defined cDNA (complementary DNA) sequence features, our new algorithm demonstrated a strong capability for finding better alignment solutions for sequences with great differences in length.

II. BASIC SEQUENCE ALIGNMENT METHOD BASED ON ANT COLONY ALGORITHM [3]

A DNA sequence is composed of four bases: A, T, C and G. Biological sequence alignment can therefore also be considered as character or string alignment. The differences among homologous sequences appear as substitutions, insertions and/or deletions in the sequences. Given two sequences $S$ and $T$, $r(S, T)$ is the score function, which represents the score of an alignment between $S$ and $T$. For $S_i, T_j \in \Omega$, the simplest score function is:

$$r(S_i, T_j) = \begin{cases} 2, & S_i = T_j \in \Omega \\ -1, & S_i \neq T_j \in \Omega \end{cases} \tag{1}$$

$S_i'$ stands for the sequence derived from $S_i$ by inserting gaps [4]. To evaluate the quality of an algorithm, the most direct way is to generate all possible alignments, calculate their alignment scores, and pick the one with the highest score as the final result. Let $SCORE$ be the scoring function, with $S_i', T_j' \in \Omega \cup \{-\}$ and $|S'|$ the length of $S'$:

$$SCORE = \sum_{i=1}^{|S'|} r(S_i', T_j') \tag{2}$$

Fig. 1 shows the result of a sequence alignment between two sequences of different lengths using the traditional ant colony algorithm. If the ants' moving path is (1,1) (1,2) (1,3) (2,4) (3,5) (4,6) (5,7) (5,8) (6,9) (6,10) (6,11), as shown in Fig. 1, then according to formula (1) the alignment score is 7, which represents an optimal alignment.

Every artificial ant starts from the top left corner of the matrix, chooses a path to reach the bottom right corner of the matrix and thereby forms an alignment. During each cycle of the algorithm, moving one cell vertically represents nucleotide deletion(s) in the sequence $T$, whereas moving one cell horizontally represents nucleotide insertion(s) in the sequence $T$. A diagonal move stands for either a character match or a mismatch. Whenever the ants arrive at the lower right corner of the matrix, they return to the top left corner and choose another path to the bottom right corner. The specific algorithm steps [5] are as follows:

a) Given two sequences $S = (A, T, C, G, A, G)$ and $T = (T, G, A, T, C, G, G, A, G, G, A)$, build a matrix between $S$ and $T$, as shown in Fig. 1.

b) At time $t = 0$, a group of $k$ ants starts at $(1,1)$; the ants' moving direction at each location is fixed, which


includes three directions of movement: down, right, and diagonally down-right.

c) At each time step, every ant moves one cell. The transfer function is:

$$(i, j) \rightarrow (i+1, j) \ \text{or} \ (i+1, j+1) \ \text{or} \ (i, j+1), \quad i \in \{1, 2, \ldots, |S|\}, \ j \in \{1, 2, \ldots, |T|\}$$

The probability that an ant moves from its current location to the next location at time $t$ depends on the pheromones and the path length.

Path selection uses the roulette method: when an ant selects a path, a random number $P$ in $(0,1)$ is generated. If $P > \rho_0$, the ant selects the direction $k$ with the greatest probability of the three directions as its next moving direction. If $P \le \rho_0$, the ant chooses among the three directions at random, taking $P_{ij}^{k}$ as the probability of each direction. The transition probability $P_{ij}^{k}$ of moving from position $(i,j)$ in direction $k$ is:

$$P_{ij}^{k} = \frac{A(i,j)^{\alpha} \times D(i,j)^{\beta}}{\sum_{k=1}^{3} \left( A(i,j)^{\alpha} \times D(i,j)^{\beta} \right)} \tag{3}$$

where $k = 1, 2, 3$ denote down, right, and diagonally down-right, respectively.

$D(i,j)$ is the matching degree of the characters, obtained by formula (1). $\alpha$ and $\beta$ are, respectively, the weight of the pheromones accumulated during the ants' movement and the weight of the character matching degree in path selection; the influence of differences in character matching degree is determined by the value of $\beta$. $A(i,j)$ is the pheromone at site $(i,j)$. Assuming an ant releases pheromone only on its own path, the pheromone is updated by:

$$A(i,j) = (1 - \rho) \times A(i,j) + \Delta\tau \tag{4}$$

$$\Delta\tau = Q \times ant(i,j) \tag{5}$$

where $\rho$ is the volatility coefficient of the pheromone, $ant(i,j)$ is the number of ants at site $(i,j)$, and $Q > 0$ is a constant whose value changes with time.

d) On reaching $(|S|, |T|)$, continue to the next cycle and return to c).

e) When time $t = T$, stop moving.

f) Calculate the alignment scores; the total score is the final result.

The sequence alignment created using the traditional ant colony algorithm is vulnerable because it can easily be trapped in a local optimum. Consequently, this approach cannot be used to accurately identify short DNA sequences such as cDNA termini within long sequences such as raw ESTs, which limits its applications in sequence alignment. By changing the ant start positions and the way pheromones are updated, we avoid local convergence and improve the overall capability of searching for optimal solutions.
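Formulas (1) to (5) and the selection rule of step c) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, indices are 0-based, and two points the text leaves open are filled with labeled assumptions. First, $A$ and $D$ in formula (3) are read at the cell that each move leads to. Second, `D` is expected to hold strictly positive weights, because the mismatch value of -1 from formula (1) cannot be used directly as a weight; how to rescale it is not described in the source. The gapped strings in the usage lines are one reading of the Fig. 1 path, with every gap column scored as a mismatch.

```python
import random


def r(a, b):
    """Formula (1): 2 if the two characters are equal, -1 otherwise."""
    return 2 if a == b else -1


def alignment_score(s_gapped, t_gapped):
    """Formula (2): sum r over the columns of two equal-length gapped strings."""
    if len(s_gapped) != len(t_gapped):
        raise ValueError("gapped sequences must have equal length")
    return sum(r(a, b) for a, b in zip(s_gapped, t_gapped))


# Moves for k = 1, 2, 3 (down, right, diagonal), written as (di, dj).
MOVES = [(1, 0), (0, 1), (1, 1)]


def transition_probs(A, D, i, j, alpha=1.0, beta=1.0):
    """Formula (3), restricted to moves that stay inside the matrix.

    Assumption: A and D are read at the target cell of each move, and D
    contains strictly positive weights. Returns {k: probability}; the
    dict is empty at the bottom right corner, where no move is left.
    """
    rows, cols = len(A), len(A[0])
    weights = {}
    for k, (di, dj) in enumerate(MOVES, start=1):
        ni, nj = i + di, j + dj
        if ni < rows and nj < cols:
            weights[k] = (A[ni][nj] ** alpha) * (D[ni][nj] ** beta)
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()} if weights else {}


def choose_direction(probs, rho0, rng=random):
    """Step c): take the most probable direction if P > rho0, else roulette."""
    if rng.random() > rho0:
        return max(probs, key=probs.get)
    ks = list(probs)
    return rng.choices(ks, weights=[probs[k] for k in ks])[0]


def update_pheromone(A, ant_counts, rho, Q):
    """Formulas (4) and (5), applied in place to every cell of A."""
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            row[j] = (1 - rho) * a + Q * ant_counts[i][j]


# The Fig. 1 path written as gapped strings (an assumed reading of the path).
print(alignment_score("--ATCG-AG--", "TGATCGGAGGA"))  # 7
# The same matches without the leading and trailing gap columns.
print(alignment_score("ATCG-AG", "ATCGGAG"))          # 11
```

Cells that no ant visited have a count of zero, so `update_pheromone` only evaporates them, which matches the statement that an ant releases pheromone only on its own path.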

Figure 1. Optimal path for aligning sequence S (ATCGAG) and T (TGATCGGAGGA) based on the traditional ant colony algorithm

According to the paths, we then remove scores with large differences and finally obtain the optimal sequence alignments for sequences with great


difference in lengths (e.g., finding short cDNA termini within long EST sequences).

III. IMPROVED SEQUENCE ALIGNMENT METHOD BASED ON ANT COLONY ALGORITHM

To prevent the ants from leaning toward a certain point of the sequence, we allocate the ants evenly along the sequence as their starting points; that is, every point in the matrix receives the same number of ants. Each ant chooses a path through the matrix to reach the last row, which forms an alignment. When an ant arrives at the last row, it is placed at a random point in the first row of the matrix and again chooses a path to reach the last row. This method scatters the ant starting points evenly along the sequence, so that the ants are neither confined to a single starting point nor limited to a single end point, which reduces the possibility of a local optimum and reaches the global optimum faster.

The sequences in Fig. 1 and Fig. 2 are the same. The route in Fig. 2 shows the alignment result using our improved method. If the ant route is (1,3) (2,4) (3,5) (4,6) (5,7) (5,8) (6,9), as shown in Fig. 2, then according to formula (1) the alignment score is 11, which is the optimal alignment.

$Q$ varies as time goes on, which can reduce the possibility of low efficiency and of falling into a local optimum:

$$Q = \begin{cases} 1, & t \le T_1 \\ 2, & T_1 < t \le T_2 \\ 3, & T_2 < t \le T \end{cases} \qquad 0 < T_1 < T_2 < T \tag{6}$$

d) End at $(|S|, j)$, $j \in \{1, 2, \ldots, len\}$, and then continue to the next cycle, in which the ants randomly start at $(1, i)$, $i \in \{1, 2, \ldots, len\}$; return to c).

e) When time $t = T$, stop moving.

f) Calculate the maximum alignment score for each starting point. If there is a big difference between the vertical path length and the horizontal path length, the result contains many gaps and the subsequence obtained from the alignment is not credible or acceptable, so the score is decreased so that the path will not be the final selection. The formula for the calculation is as follows:

$$SCORE(i) = \frac{SCORE(i)}{M} \tag{7}$$

if end x −i + 1 >| S | + m

At the initial time, n ants are placed on every

or end x − i + 1