Metaheuristics applied to Bioinformatics problems


Jean-Michel Richer
http://www.info.univ-angers.fr/pub/richer

IBSS 2011 - Tanger, Maroc

Aim

see how Metaheuristics are used to solve different kinds of problems in bioinformatics :
Multiple Sequence Alignment
Phylogenetic Reconstruction
other problems

Outline

1 Multiple Alignment
2 Phylogenetic Reconstruction
3 Other problems
4 Conclusion
5 Bibliography

Multiple Alignment

What is an alignment ?

Definition (alignment) Given a set $S = \{S_1, \ldots, S_k\}$ of sequences, find a matrix $M(k, n)$ such that
$\max_i(|S_i|) \le n \le \sum_{i=1}^{k} |S_i|$
each character $M[i, j]$ is a residue or a gap (-)
there is no column in which all characters are gaps
$M[i] = S_i$ if we remove all the inserted gaps
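The four conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming an alignment is stored as a list of equal-length strings with "-" as the gap symbol; the function name is hypothetical.

```python
def is_valid_alignment(rows, sequences, gap="-"):
    """Check the conditions of the alignment definition for rows M[1..k]."""
    if len(rows) != len(sequences):
        return False
    n = len(rows[0])
    # every row has the same length n, with max|S_i| <= n <= sum|S_i|
    if any(len(r) != n for r in rows):
        return False
    if not max(map(len, sequences)) <= n <= sum(map(len, sequences)):
        return False
    # no column may contain gaps only
    if any(all(r[j] == gap for r in rows) for j in range(n)):
        return False
    # removing the gaps from row i must give back S_i
    return all(r.replace(gap, "") == s for r, s in zip(rows, sequences))
```

For instance, the rows ATTTC and ATT-C form a valid alignment of ATTTC and ATTC, while appending a gap to both rows creates an all-gap column and breaks the definition.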

What is a good alignment ?

Main question Given a set of sequences S, what is a good alignment for S ?

Answer nobody can tell !

Example 1 - ATTTC and ATTC

ATTTC    ATTTC    ATTTC
ATT-C    AT-TC    A-TTC

all are equivalent

Example 2 - ATTTTC, ATC and CTC

ATTTTC    ATTTTC
AT---C    ATC---
CT---C    CTC---

ATT = Ile, ATC = Ile, CTC = Leu, TTC = Phe

Example 3 - but what if ATTTTG, ATC and CTC

ATTTTG    ATTTTG    I L
AT---C    ATC---    I -
CT---C    CTC---    L -

ATTTTG    I L
ATC---    I -
---CTC    - L

ATT = Ile, ATC = Ile, CTC = Leu, TTG = Leu

Quality of an alignment

Sum of pairs
a function to assess the quality of an alignment M
substitution matrix (PAM, BLOSUM, GONNET, ...)
gap cost model (linear, affine, concave, ...)

$$SP(M) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \left( \sum_{r=1}^{n} w(M[i, r], M[j, r]) \right)$$

Quality of an alignment

Sum of pairs

ATTCTCTTATATA...
ATTGTGTTATTTT...
CTTCTCTTATTCT...
CTACTCTTATTCT...

evaluation does not depend on the sequence order
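The Sum of Pairs score can be sketched in a few lines. This is only an illustration: the weights in `w` (match = 1, mismatch = 0, gap = -1) are assumed toy values, not a real substitution matrix or gap model.

```python
from itertools import combinations

def w(a, b, match=1, mismatch=0, gap=-1):
    """Toy pair score (assumed values, not PAM/BLOSUM)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """SP(M): sum of w over every pair of rows and every column."""
    return sum(w(x, y)
               for r1, r2 in combinations(rows, 2)
               for x, y in zip(r1, r2))
```

Because `w` is symmetric and every unordered pair of rows is visited once, permuting the rows leaves the score unchanged, which is the order independence stated above.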

Pairwise Sequence Alignment

Definition (Pairwise Sequence Alignment - PSA) Given two sequences S and T, find an alignment $M(2, n)$ such that
$\max(|S|, |T|) \le n \le |S| + |T|$
$SP(M)$ is optimal for the Sum of Pairs function

Pairwise Sequence Alignment

How to solve this optimization problem ?
Metaheuristics : search for the best alignment among all alignments of 2 sequences of length n, for a given objective function f (a maximization problem)
Dynamic Programming : decomposition into subproblems

Pairwise Sequence Alignment

How to solve this optimization problem with Metaheuristics ?
number of alignments for 2 sequences of length n :
$$\sum_{k=0}^{n} C_{n+k}^{k} \times C_{n}^{k} = 1 + C_{2n}^{n} = 1 + \frac{(2n)!}{(n!)^2}$$
approximately : $\sim \frac{2^{2n}}{\sqrt{\pi n}}$
for two sequences of length n = 100, there are $9.05 \times 10^{58}$ alignments
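The order of magnitude quoted for n = 100 can be checked directly. The sketch below only evaluates the central binomial coefficient $C_{2n}^{n}$ and its approximation $2^{2n} / \sqrt{\pi n}$; it does not enumerate alignments.

```python
import math

n = 100
exact = math.comb(2 * n, n)                  # C(2n, n) = (2n)! / (n!)^2
approx = 4 ** n / math.sqrt(math.pi * n)     # 2^(2n) / sqrt(pi * n)

print(f"exact  = {exact:.3e}")
print(f"approx = {approx:.3e}")
```

Both values are of the order of $9 \times 10^{58}$, which rules out any exhaustive search.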

Dynamic Programming

Definition (Dynamic Programming)
introduced by Richard Bellman, 1954 [3]
a method for solving complex problems by breaking them down into simpler subproblems, where one needs to find the best decisions one after another
the word programming refers to the use of the method to find an optimal program

PSA with Dynamic Programming

PSA with Dynamic Programming - Recursive formula
the best alignment between S[1..i] and T[1..j] is :
$$\max \begin{cases} align(S[1..i-1], T[1..j-1]) + w(S[i], T[j]) \\ align(S[1..i], T[1..j-1]) + w(-, T[j]) \\ align(S[1..i-1], T[1..j]) + w(S[i], -) \end{cases}$$

PSA with Dynamic Programming

[Figure: the three cases (1, 2, 3) for extending an alignment to S_i and T_j; labels: substitution, match, mismatch, gap insertion]

PSA with Dynamic Programming

complexity : $O(n^2)$
however, what to do in case of equivalences ?
A A - - A T - T T -

PSA - Example

Example (Parameters)
S = ACAGTC
T = CATTGC
match : $w(a, a) = 1$
substitution : $w(a, b) = 0$
linear gap penalty : $g_o = 0$

PSA - Example

Initialization of matrix M
$M[0, 0] = 0$
$M[i, 0] = M[i-1, 0] + g_o \quad \forall i \in [1, N]$
$M[0, j] = M[0, j-1] + g_o \quad \forall j \in [1, P]$

PSA - Example

Recurrence relation
M[i, j] is computed from its three neighbors M[i − 1, j − 1] (↖), M[i − 1, j] (↑) and M[i, j − 1] (←)

$$M[i, j] = \max \begin{cases} M[i-1, j-1] + w(x_i, y_j) \\ M[i-1, j] + g_o \\ M[i, j-1] + g_o \end{cases}$$
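The initialization and the recurrence translate directly into code. The following is a minimal sketch that uses the parameters of the example (match = 1, substitution = 0, linear gap penalty = 0) as default values; the function name `fill_matrix` is hypothetical and only the score matrix is computed, not the traceback.

```python
def fill_matrix(s, t, match=1, substitution=0, gap=0):
    """Fill the dynamic programming matrix M for sequences s (rows) and t (columns)."""
    rows, cols = len(s) + 1, len(t) + 1
    M = [[0] * cols for _ in range(rows)]
    # initialization of the first column and of the first row
    for i in range(1, rows):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, cols):
        M[0][j] = M[0][j - 1] + gap
    # recurrence: diagonal (match or substitution), up and left (gap)
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if s[i - 1] == t[j - 1] else substitution
            M[i][j] = max(M[i - 1][j - 1] + w,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
    return M

M = fill_matrix("ACAGTC", "CATTGC")
score = M[-1][-1]   # optimal score, bottom-right cell
```

With these parameters the score simply counts matched characters, so the bottom-right cell holds the number of matches of an optimal alignment.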

PSA - Example

initialization

S/T       C  A  T  T  G  C    i
       0  0  0  0  0  0  0    0
A      0                      1
C      0                      2
A      0                      3
G      0                      4
T      0                      5
C      0                      6
j      0  1  2  3  4  5  6

PSA - Example

recurrence on M

S/T       C  A  T  T  G  C    i
       0  0  0  0  0  0  0    0
A      0  0  1  1  1  1  1    1
C      0  1  1  1  1  1  2    2
A      0  1  2  2  2  2  2    3
G      0  1  2  2  2  3  3    4
T      0  1  2  3  3  3  3    5
C      0  1  2  3  3  3  4    6
j      0  1  2  3  4  5  6

PSA - Example

Traceback from M

Alignments
There are 5 optimal alignments :

-CATTG-C    -CAT-TGC    -CATTGC    -CA-TTGC    -CA-TTGC
ACA--GTC    ACA-GT-C    ACAGT-C    ACAGT--C    ACAG-T-C

which one is the best ?

PSA and Dynamic Programming

Different kinds of alignments
global with linear gap penalty : Needleman and Wunsch, 1970 [21]
global with affine gap penalty : Gotoh, 1982 [14]
local : Smith and Waterman, 1981 [30]

Multiple Sequence Alignment

Definition (Multiple Sequence Alignment - MSA) Given a set S of k sequences, find an alignment $M(k, n)$ such that
$\max_i(|S_i|) \le n \le \sum_{i=1}^{k} |S_i|$
$SP(M)$ is optimal for the Sum of Pairs function

Complexity of MSA

intractability proved by Wang and Jiang, 1994 [33]
complexity of the Dynamic Programming extension : $O(k^2 \times 2^k \times l^k)$ or $O(2^k \times l^k)$ for a set of k sequences of length l

Trick

MSA and Carrillo Lipman

Carrillo and Lipman trick to reduce complexity
Carrillo and Lipman, 1988 [4]
decrease complexity by computing only part of the matrix

[Figure: part of the dynamic programming matrix that is actually computed, in 2 dimensions]

a first heuristic : Clustal

Clustal : a heuristic method

Clustal
1 generate a guide-tree (UPGMA, NJ, ...)
2 align profiles (∼ consensus) along the tree branches

Characteristics of Clustal
influence of the guide tree
loss of precision with profiles
complexity : $O(k^2 \times n^2)$

Muscle

MUltiple Sequence Comparison by Log Expectation, Edgar, 2004 [7]
use of a much faster, but somewhat more approximate, method to compute distances
use of UPGMA for a better progressive alignment, because it forces the alignment of the most similar sequences first
improvement of the alignment

Muscle - first step

k-mer clustering to build a tree
do not construct an alignment
count the number of short sub-sequences (known as k-mers) that two sequences have in common
around 3,000 times faster than Clustal's method, but the trees will generally be less accurate
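Counting shared k-mers needs no alignment at all, which explains the speed of this first step. The sketch below illustrates the idea; the normalisation used in `kmer_distance` is an assumption made for the example and is not claimed to be the exact formula used by MUSCLE.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all k-mers (substrings of length k) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a, b, k=3):
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())

def kmer_distance(a, b, k=3):
    """1 - fraction of shared k-mers (assumed normalisation)."""
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared_kmers(a, b, k) / denom
```

The resulting distances can then be given to UPGMA to build the guide tree.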

Muscle - second step

use the tree to construct a progressive alignment
proceed as Clustal

Muscle - third step

build a tree from the alignment and compare it to the initial tree
compute the pairwise identity of each pair of sequences
obtain a distance matrix and build a new tree
if the trees are identical, there is nothing to do; otherwise rebuild a new alignment
repeat this tree refinement until the tree stabilizes or until a specified maximum number of iterations has been reached

Modification of the fitness function : COFFEE

Consistency based Objective Function For alignmEnt Evaluation
introduced by Notredame, Holm and Higgins, 1998 [24]
describes the quality of a multiple protein sequence alignment
first build a library, then compute COFFEE for a given alignment

COFFEE : build the library
the library is made of all pairwise alignments of the k sequences
for example : use Clustal to obtain the pairwise alignments

COFFEE : evaluate
compare each pair of aligned residues of the MSA to the library
simple version : number of pairs of aligned residues found both in the MSA and in the library, divided by the total number of pairs in the MSA
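The simple version of the score can be sketched as a set intersection. In this illustration a pair of aligned residues is identified by (sequence index, residue position) on both sides; the function names and this representation are assumptions made for the example, and the weighting used by the full COFFEE function is left out.

```python
from itertools import combinations

def aligned_pairs(row_a, row_b, ia, ib):
    """Set of ((ia, pos_a), (ib, pos_b)) for residues facing each other."""
    pairs, pa, pb = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add(((ia, pa), (ib, pb)))
        pa += x != "-"   # advance the residue counter of each sequence
        pb += y != "-"
    return pairs

def coffee_score(msa, library):
    """Fraction of the aligned residue pairs of the MSA that are in the library."""
    msa_pairs = set()
    for (i, ri), (j, rj) in combinations(enumerate(msa), 2):
        msa_pairs |= aligned_pairs(ri, rj, i, j)
    return len(msa_pairs & library) / len(msa_pairs)

# library built from one pairwise alignment of ATTTC (index 0) and ATTC (index 1)
library = aligned_pairs("ATTTC", "ATT-C", 0, 1)
score = coffee_score(["ATTTC", "AT-TC"], library)
```

Here three of the four aligned pairs of the evaluated alignment also occur in the library.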

Metaheuristics for MSA

SAGA : a metaheuristic approach to MSA

SAGA - Sequence Alignment by Genetic Algorithm
designed by Notredame and Higgins, 1996 [23]
population of alignments submitted to a GA

SAGA - variation operators
19 mutation operators (add, delete, move gaps)
6 crossover operators (modify or combine gap regions)
use of an OS (Operator Scheduling) strategy to select which operator to apply

SAGA - Operator Scheduling strategy
each operator is assigned a probability (initially equal)
operators are rewarded when they create better individuals in the population
motivation : it is difficult to know in which order to apply the operators
a kind of natural selection for operators
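The scheduling idea can be illustrated with a small credit table. This is a sketch under assumptions: the additive reward and the floor value are hypothetical choices and are not claimed to be the update rule used by SAGA; the operator names are placeholders.

```python
import random

class OperatorScheduler:
    """Select operators with a probability proportional to their past success."""

    def __init__(self, operators, floor=0.01):
        self.ops = list(operators)
        self.credit = {op: 1.0 for op in self.ops}   # equal probabilities at start
        self.floor = floor                           # no operator ever disappears

    def probabilities(self):
        total = sum(self.credit.values())
        return {op: c / total for op, c in self.credit.items()}

    def pick(self, rng=random):
        probs = self.probabilities()
        return rng.choices(self.ops, weights=[probs[o] for o in self.ops])[0]

    def reward(self, op, improvement):
        # credit grows only when the child is better than its parent
        self.credit[op] = max(self.floor, self.credit[op] + max(0.0, improvement))
```

After `reward("gap_insert", 2.0)` on a scheduler with two operators, the rewarded operator is picked with probability 0.75 instead of 0.5.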

MSA-EA - an improvement of Clustal

MSA-EA
introduced by Thomsen, Fogel and Krink, 2003 [32]
a Genetic Algorithm
use the solution of Clustal as a seed (1.2 length + gaps at end)
apply variation operators (BlockShuffle, LocalShuffle, ...)

Variation operators : example of LocalShuffle
picks a random AA from a randomly chosen row (sequence)
checks whether one of its neighbors is a gap
if so, swaps them (if both neighbors are gaps, one of them is picked randomly)

Variation operators : discussion for LocalShuffle
no biological meaning
no optimized choice
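The LocalShuffle description above can be sketched as follows, assuming the alignment is a list of equal-length strings with "-" as gap symbol. The function name and signature are hypothetical; this is not the MSA-EA source code.

```python
import random

def local_shuffle(alignment, rng=random):
    """Swap one randomly chosen residue with a neighbouring gap, if there is one."""
    rows = [list(r) for r in alignment]
    row = rng.choice(rows)                         # random row (sequence)
    residues = [j for j, c in enumerate(row) if c != "-"]
    j = rng.choice(residues)                       # random residue of that row
    gaps = [p for p in (j - 1, j + 1) if 0 <= p < len(row) and row[p] == "-"]
    if gaps:                                       # otherwise the row is unchanged
        p = rng.choice(gaps)                       # random side if both are gaps
        row[j], row[p] = row[p], row[j]
    return ["".join(r) for r in rows]
```

The move keeps every sequence intact and only shifts a residue by one column, which is why it carries no biological meaning by itself.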

Conclusion for MSA

Resolution of MSA
Metaheuristics take too much time
iterative Dynamic Programming seems sufficient and efficient enough : ClustalW
who really knows what a correct or good alignment is ?

Some results on BAliBASE : SPS score computed with bali_score

Software    Set 1   Set 2   Set 3   Set 4   Set 5   Time (s)
CLUSTAL     0.809   0.932   0.723   0.834   0.858   120
MAFFT       0.829   0.931   0.812   0.947   0.978   98
MUSCLE      0.821   0.935   0.784   0.841   0.972   75
PROBCONS    0.849   0.943   0.817   0.939   0.974   711
TCOFFEE     0.814   0.928   0.739   0.852   0.943   1653
MALINBA     0.811   0.911   0.752   0.899   0.942   343

Phylogenetic Reconstruction

Methods for Phylogenetic Reconstruction
distance based, ∼ $O(n^3)$ : UPGMA, WPGMA, NJ, BioNJ (Gascuel, 1997 [8])
character based : Maximum Parsimony, Maximum Likelihood

Maximum Parsimony

character-based approach that relies on the work of the German entomologist Willi Hennig (1913-1976)
based on Occam's razor (William of Ockham, 1285–1349), or principle of economy

input : a multiple sequence alignment of n sequences of m residues
algorithm : find the tree with the minimum number of changes
output : a tree of minimum score, i.e. with the minimum number of changes
a minimization problem

Small parsimony problem

Definition (Small parsimony problem) given a multiple alignment of length m of a set L of n sequences and a tree T whose leaves are labelled with the sequences of L, find the parsimony score of T.
complexity : $O(n \times m)$

Large parsimony problem

Definition (Large parsimony problem) given a multiple alignment of length m of a set L of n sequences, find a most parsimonious tree T, i.e. a tree with minimum parsimony score.
complexity : factorial

Maximum Parsimony

Number of trees

number of taxa   # of unrooted trees           # of rooted trees
10               2.0e+06                       3.4e+07
20               2.2e+20                       8.2e+21
30               8.6e+36                       4.9e+38
40               1.3e+55                       1.0e+57
50               2.8e+74                       2.7e+76
80               2.1e+137                      3.4e+139
n                $\prod_{i=3}^{n} (2i - 5)$    $\prod_{i=2}^{n} (2i - 3)$
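The two product formulas of the last row can be evaluated exactly with integer arithmetic. The function names below are hypothetical; the sketch only reproduces the counts of binary trees given in the table.

```python
from math import prod

def unrooted_trees(n):
    """Number of unrooted binary trees with n labelled leaves (n >= 3)."""
    return prod(2 * i - 5 for i in range(3, n + 1))

def rooted_trees(n):
    """Number of rooted binary trees with n labelled leaves (n >= 2)."""
    return prod(2 * i - 3 for i in range(2, n + 1))

for n in (10, 20, 30):
    print(n, f"{unrooted_trees(n):.1e}", f"{rooted_trees(n):.1e}")
```

A rooted tree on n taxa corresponds to an unrooted tree on n + 1 taxa, so `rooted_trees(n) == unrooted_trees(n + 1)`.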

Phylogenetic reconstruction with Maximum Parsimony

Maximum Parsimony is a combinatorial optimization problem (a minimization problem) for which we must find efficient methods

Methods
branch and bound
local search methods
genetic algorithms
memetic algorithms
other optimization techniques

Branch and bound
introduced by Hendy and Penny, 1982 [15]
generate a first tree, whose score is used as an upper bound
then create trees by iteratively adding a new taxon, keeping only the trees whose score stays under the upper bound

[Figure: branch and bound search. Image from Mikael Thollesson (c)]

Drawback of Branch and bound
too many trees are generated
maximum number of taxa : 20

LS descent

LS descent algorithm
Algorithm 1 descent(S, f, N)
  s ← choose or generate an initial configuration ∈ S
  for a given number of iterations do
    find s′ ∈ N(s) such that f(s′) < f(s), or return s
    s ← s′
  end for
  return s
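Algorithm 1 can be written generically, independently of trees. This is a minimal sketch: `neighbors` plays the role of N and the toy objective used in the usage lines is only there to make the example runnable.

```python
def descent(s, f, neighbors, max_iter=1000):
    """First-improvement descent: stop at the first local optimum of f."""
    for _ in range(max_iter):
        better = next((t for t in neighbors(s) if f(t) < f(s)), None)
        if better is None:
            return s          # no improving neighbor: local optimum
        s = better
    return s

# toy usage: minimise (x - 7)^2 over the integers, moving by +1 or -1
best = descent(0, lambda x: (x - 7) ** 2, lambda x: (x - 1, x + 1))
```

For Maximum Parsimony, s would be a tree, f its parsimony score and `neighbors` one of the tree neighborhoods (NNI, SPR, TBR).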

LS generation of the initial configuration

Generation of initial configuration
random (sometimes too far from optimum)
branch and bound
stepwise addition

Stepwise addition
each new taxon is tried on all branches and the most parsimonious tree is kept

Acceptance of new configuration

strict descent : only improving configurations
side-walk descent : improving or equivalent neighbors
random-walk : possibility to accept deteriorating neighbors
simulated annealing : specific random walk with a non-constant probability
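The four acceptance rules differ only in how they treat a non-improving neighbor. The sketch below makes this explicit; the constant probability `p` for the random walk and the Metropolis rule exp(-delta / temperature) for simulated annealing are assumptions chosen for the illustration.

```python
import math
import random

def accept(delta, rule, temperature=1.0, p=0.1, rng=random):
    """delta = f(neighbor) - f(current), for a minimization problem."""
    if rule == "strict":
        return delta < 0
    if rule == "side-walk":
        return delta <= 0
    if rule == "random-walk":
        # deteriorating neighbors accepted with a constant probability p
        return delta <= 0 or rng.random() < p
    if rule == "annealing":
        # probability decreases with delta and with the temperature
        return delta <= 0 or rng.random() < math.exp(-delta / temperature)
    raise ValueError(rule)
```

Lowering `temperature` during the search makes the annealing rule behave more and more like a side-walk descent.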

Escape from local optimum

Noising techniques
Iterated Local Search : perturbation of the current configuration
Parsimony Ratchet : modification of the fitness function

Parsimony Ratchet
introduced by Nixon, 1999 [22], and Horovitz, 1999 [17]
when a local optimum s⋄ is reached, the Ratchet adds noise to the evaluation function : the weights of a proportion of the characters (10-15%) can be increased, or some characters can be eliminated
a LS is performed from s⋄ using the noised evaluation function f′
the LS then continues with the original objective function f

Neighborhoods for MP

Neighborhoods
A configuration is a tree; three main neighborhoods exist :
NNI (Nearest Neighbor Interchange) [34] : small size
SPR (Subtree Pruning Regrafting) [31] : average size
TBR (Tree Bisection Reconnection) [31] : large size

NNI
consists in swapping two subtrees which are separated by a branch
size : $2n - 6$
extension : p-ECR, which shuffles p adjacent branches. In particular, 1-ECR is equivalent to NNI.

SPR
cuts a branch and creates two separate trees : the clipped tree and the residual tree
the clipped tree can then be regrafted on each branch of the residual tree
size : $2 \times (n - 3) \times (2n - 7)$

TBR
breaks the tree into two subtrees and reconnects the re-rooted clipped tree to any branch of the residual tree
size : at most $(2n - 3) \times (n - 3)^2$

Comparison of NNI, SPR and TBR
NNI : $O(n)$
SPR : $O(n^2)$
TBR : $O(n^3)$
NNI ⊆ SPR ⊆ TBR

Variable Neighborhoods
NNI, STEP, SPR, where STEP is a SPR for which only leaves are pruned : proposed by Ribeiro
SPR + 2-SPR : Ribeiro and Vianna, 2005 [25] (where k-SPR is the composition of k SPR transformations)
Parametric Progressive Neighborhood (PPN) : Goëffon, Richer and Hao, 2007 [11] : SPR → NNI

Parametric Progressive Neighborhood
others : increase neighborhood size
PPN : decrease neighborhood size

Parametric Progressive Neighborhood

[Figure: score of the current tree (y axis) against the number of iterations (x axis, 0 to 50000) for NNI, SPR and PPN]

Evolution of the score of trees for SPR, NNI and PPN with a 300-100 random instance starting from a random tree

Other LS algorithms

Tabu Search : Yu-Min, Shu-Cherng, Jeffrey, 2007 [36]
Simulated Annealing : LVB, Barker, 2004 [2]
GRASP + VNS : Ribeiro et al. [1, 25]

Genetic Algorithm

Algorithm 2 GeneticAlgorithm(S, f, x)
  P ← { choose or generate n individuals ∈ S }
  for a given number of crossovers x do
    p, q ← select-parents(P)
    r ← crossover(p, q)
    mutation(r)
    if selection(r) then
      replace(P, r)
    end if
  end for

Genetic Algorithms for MP [18, 5, 6, 26]
tree crossover operators follow the subtree cutting and regrafting strategy
but crossover and mutation should be tailored to the target problem in order to integrate problem-specific constraints and thus improve the search

A specific crossover operator
Distance-Based Information Preservation Crossover (DiBIP)
introduced by Goëffon, Richer and Hao, 2006 [10]
based on the notion of topological distance between two leaves
aims to preserve the common properties of the parents in terms of topological distance between taxa

Topological distance

Definition (Topological distance) let i and j be two taxa of a tree T. The topological distance $\delta_T(i, j)$ between i and j is defined as the number of edges of the path between the parents of i and j, minus 1 if the path contains the root of the tree

Topological distance

[Figure: tree T with leaves A, B, C, D, E, F and internal nodes f, g, h, j, k]

Distance $\delta_T$

     B  C  D  E  F
A    0  1  3  3  2
B       1  3  3  2
C          2  2  1
D             0  1
E                1

DiBIP crossover algorithm

Algorithm 3 DiBIP(T1, T2, δT, ∆, ⊕, Λ)
  D_i ← ∆(T_i) for i = 1, 2
  D* ← D_1 ⊕ D_2
  T* ← Λ(D*)
  return T*

DiBIP example

[Figure: Tree 1, with leaves D I A K J B L N G C M F E H]

Tree 1 - topological matrix D1

     A  B  C  D  E  F  G  H  I  J  K  L  M  N
A    -  6  5  1  5  5  5  5  0  5  2  7  5  7
B       -  3  5  5  5  3  5  6  1  4  1  5  1
C          -  4  4  4  0  4  5  2  3  4  4  4
D             -  4  4  4  4  1  4  1  6  4  6
E                -  2  4  0  5  4  3  6  2  6
F                   -  4  2  5  4  3  6  0  6
G                      -  4  5  2  3  4  4  4
H                         -  5  4  3  6  2  6
I                            -  5  2  7  5  7
J                               -  3  2  4  2
K                                  -  5  3  5
L                                     -  6  0
M                                        -  6
N                                           -

DiBIP example

[Figure: Tree 2, with leaves M B F L J K A E D H C G I N]

Tree 2 - topological matrix D2

     A  B  C  D  E  F  G  H  I  J  K  L  M  N
A    -  8  4  1  0  9  4  2  6  7  4  9  6  6
B       -  6  7  8  1  6  6  4  1  4  1  2  4
C          -  3  4  7  0  2  4  5  2  7  4  4
D             -  1  8  3  1  5  6  3  8  5  5
E                -  9  4  2  6  7  4  9  6  6
F                   -  7  7  5  2  5  0  3  5
G                      -  2  4  5  2  7  4  4
H                         -  4  5  2  7  4  4
I                            -  3  2  5  2  0
J                               -  3  2  1  3
K                                  -  5  2  2
L                                     -  3  5
M                                        -  2
N                                           -

DiBIP example

D* = D1 + D2

      A   B   C   D   E   F   G   H   I   J   K   L   M   N
A     -  14   9   2   5  14   9   7   6  12   6  16  11  13
B         -   9  12  13   6   9  11  10   2   8   2   7   5
C             -   7   8  11   0   6   9   7   5  11   8   8
D                 -   5  12   7   5   6  10   4  14   9  11
E                     -  11   8   2  11  11   7  15   8  12
F                         -  11   9  10   6   8   6   3  11
G                             -   6   9   7   5  11   8   8
H                                 -   9   9   5  13   6  10
I                                     -   8   4  12   7   7
J                                         -   6   4   5   5
K                                             -  10   5   7
L                                                 -   9   5
M                                                     -   8
N                                                         -
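In this example the combination operator ⊕ is an entry-wise sum. The sketch below checks it on row A of the matrices, assuming a distance matrix is stored as a dictionary indexed by pairs of taxa; the helper name `combine` is hypothetical.

```python
def combine(d1, d2):
    """Entry-wise sum of two distance matrices stored as {(x, y): distance}."""
    return {pair: d1[pair] + d2[pair] for pair in d1}

taxa = "BCDEFGHIJKLMN"
# row A of D1 and of D2, as given in the two topological matrices
row_a_d1 = dict(zip((("A", t) for t in taxa), [6, 5, 1, 5, 5, 5, 5, 0, 5, 2, 7, 5, 7]))
row_a_d2 = dict(zip((("A", t) for t in taxa), [8, 4, 1, 0, 9, 4, 2, 6, 7, 4, 9, 6, 6]))
row_a_star = combine(row_a_d1, row_a_d2)
```

Small entries of D* identify pairs of taxa that are close in both parents, which is the information the reconstruction step Λ is expected to preserve.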

DiBIP example

T* = Λ(D*), with Λ = UPGMA

[Figure: resulting tree T*, with leaves M F B L J N H E D A K I C G]

Memetic algorithm

Memetic algorithm for MP
combination of a GA helped by a LS improver [19]
implementations : [16, 26]
Hydra (Goëffon, Richer and Hao, 2007 [11]) is an implementation of a MA

Other optimization methods

Sectorial Search : Goloboff, 1999 [13]
Disc-Covering Methods (DCM) [20, 29]
Fast character optimization techniques [12, 9, 28, 35]
Multi-character optimization techniques [28, 27]

Sectorial Search
focus on a part of the tree
decomposition into subproblems : divide and conquer

Fast character optimization
a tree modification does not require recomputing the score of the whole tree
complex method but very effective
TNT (Tree analysis using New Technology), Goloboff, 1999 [13] : billions of trees in a few seconds

Multi-character optimization
use of the vector registers of modern CPUs to compute in parallel
first version : Ronquist, 2000 [28], for PowerPC
Richer, 2008 [27] : release of the code for Intel and AMD processors with SSE instructions

Multi-character optimization

Algorithm 4 fitch(x, y, z : array[1..m] of bytes) : int
  changes ← 0
  for i ← 1 to m do
    z[i] ← x[i] ∩ y[i]
    if (z[i] == 0) then
      z[i] ← x[i] ∪ y[i]
      changes ← changes + 1
    end if
  end for
  return changes
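Algorithm 4 maps directly to bitwise operations when each state is coded on one bit, so that intersection is AND and union is OR. The following is a scalar sketch in Python, assuming the coding A = 1, C = 2, G = 4, T = 8, gap = 16; it returns z together with the number of changes and does not use vector instructions.

```python
CODE = {"A": 1, "C": 2, "G": 4, "T": 8, "-": 16}

def encode(seq):
    """One byte per site, one bit per possible state."""
    return [CODE[c] for c in seq]

def fitch(x, y):
    """Return (z, changes) for the state sets x and y of two child nodes."""
    z, changes = [], 0
    for a, b in zip(x, y):
        s = a & b              # intersection of the two state sets
        if s == 0:
            s = a | b          # empty intersection: take the union, count a change
            changes += 1
        z.append(s)
    return z, changes

z, changes = fitch(encode("ACGT"), encode("AGGT"))
```

With SSE registers, the same AND, OR and comparison are applied to 16 sites at once.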

Multi-character optimization

algorithm
Code each nucleic acid by a power of 2 :

char   value   power   binary
A      1       $2^0$   00001
C      2       $2^1$   00010
G      4       $2^2$   00100
T      8       $2^3$   01000
-      16      $2^4$   10000

A ∪ C = 3 = 00011

Multi-character optimization

Intel and AMD processors

[Figure: SSE registers, 16 bytes long. xmm1 = (1, 4, 6, ...) and xmm2 = (3, 2, 5, ...) give xmm4 = xmm1 AND xmm2 = (1, 0, 4, ...) and xmm3 = xmm1 OR xmm2 = (3, 6, 7, ...), which are then combined]

Multi-character optimization

Multi-character optimization for Intel and AMD
time in seconds, compiled with nasm for Linux

CPU                    -O2     SSE     %
Pentium-M 2.0 GHz      48.28   7.38    -84
Pentium 4 2.8 GHz      60.15   15.61   -74
Athlon 64 2.2 GHz      43.87   11.86   -72
Intel Q6600 2.4 GHz    40.91   2.47    -93
Core i7 860 2.8 GHz    30.58   1.89    -93⋆

⋆ : -95% if use of POPCNT

Other problems

DNA fragment assembly : build a DNA sequence from thousands of overlapping fragments
Gene expression profiling : find the smallest subset of genes that regulate other genes or are involved in diseases
Structure prediction : determine the 2D or 3D structure of a protein
Docking : find the best candidate for a substrate
...

Conclusion

Reusability and genericity of Metaheuristics
Metaheuristics can be applied to a wide range of problems, and especially to problems in bioinformatics
but the components of Metaheuristics must be tailored to the problem

Decrease complexity by Parallelism
calculations are carried out simultaneously
on the same processor (multicore)
on different processors : cluster
use of n processors : breaks complexity by a factor of n, divides time by a factor < n

Decrease complexity by Cloud / Grid computing
a different way to obtain the power of a cluster
provision of computational resources on demand via a network

Questions and answers

Metaheuristics applied to Bioinformatics problems Bibliography

Bibliography I [1]

A.A. Andreatta and C.C. Ribeiro. Heuristics for the phylogeny problem. Journal of Heuristics, 8 :429–447, 2002.

[2]

D. Barker. Parsimony and simulated annealing in the search for phylogenetic trees. Bioinformatics, 20 :274–275, 2004.

[3]

Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60 :503–516, 1954.

[4]

H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal of Applied Mathematics, 59(48) :1073–1082, 1988.

[5]

C.B. Congdon. Gaphyl : An evolutionary algorithms approach for the study of natural evolution. Proceedings of the 6th Joint Conference on Information Science, 2002.

[6]

C.B. Congdon and K.J. Septor. Phylogenetic trees using evolutionary search : Initial progress in extending gaphyl to work with genetic data. Proceedings of the 2003 Congress on Evolutionary Computation, pages 320–326, 2003.

[7]

R. C. Edgar. Muscle : multiple sequence alignment with high accuracy and high throughput. Nucl. Acids. Res., 32(5) :1792–1797, 2004.

109 / 114

Metaheuristics applied to Bioinformatics problems Bibliography

[8] O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7):685–695, 1997.

[9] D.S. Gladstein. Efficient character optimization. Cladistics, 13:21–26, 1997.

[10] A. Goëffon, J-M. Richer, and J.K. Hao. A distance-based information preservation tree crossover for the maximum parsimony problem. Lecture Notes in Computer Science, 4193:761–770, 2006.

[11] A. Goëffon, J-M. Richer, and J.K. Hao. Progressive tree neighborhood applied to the maximum parsimony problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(1):136–145, 2008.

[12] P.A. Goloboff. Character optimization and calculation of tree lengths. Cladistics, 9:433–436, 1993.

[13] P.A. Goloboff. Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics, 15:415–428, 1999.

[14] O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.


[15] M.D. Hendy and D. Penny. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences, 59:277–290, 1982.

[16] T. Hill, A. Lundgren, R. Fredriksson, and H.B. Schiöth. Genetic algorithm for large-scale maximum parsimony phylogenetic analysis of proteins. Biochimica et Biophysica Acta (BBA) - General Subjects, 1725(1):19–29, 2005.

[17] I. Horovitz. A report on one day symposium on numerical cladistics. Cladistics, 15:177–182, 1999.

[18] A. Moilanen. Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics, 15:39–50, 1999.

[19] P. Moscato. Memetic algorithms: a short introduction. In New Ideas in Optimization. McGraw-Hill, 1999.

[20] L. Nakhleh, U. Roshan, K. St John, J. Sun, and T. Warnow. Designing fast converging phylogenetic methods. Bioinformatics Supplement, 17:190–198, 2001.

[21] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.


[22] K.C. Nixon. The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics, 15:407–414, 1999.

[23] C. Notredame and D. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24(8):1515–1524, 1996.

[24] C. Notredame, L. Holm, and D.G. Higgins. COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14(5):407–422, 1998.

[25] C.C. Ribeiro and D.S. Vianna. A GRASP/VND heuristic for the phylogeny problem using a new neighborhood structure. International Transactions in Operational Research, 12:1–14, 2005.

[26] C.C. Ribeiro and D.S. Vianna. A genetic algorithm for the phylogeny problem using an optimized crossover strategy based on path-relinking. Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pages 97–102, 2003.

[27] J-M. Richer. Three new techniques to improve phylogenetic reconstruction with maximum parsimony. Technical report, LERIA, 2008.

[28] F. Ronquist. Fast Fitch-parsimony algorithms for large data sets. Cladistics, 14:387–400, 1998.


[29] U. Roshan, B.M.E. Moret, T.L. Williams, and T. Warnow. Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees. Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB 04), pages 98–109, 2004.

[30] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[31] D.L. Swofford and G.J. Olsen. Phylogeny reconstruction. In D.M. Hillis and C. Moritz, editors, Molecular Systematics, 1990.

[32] R. Thomsen, G.B. Fogel, and T. Krink. Improvement of Clustal-derived sequence alignments with evolutionary algorithms. Proceedings of the 2003 IEEE Congress on Evolutionary Computation, volume 1, pages 312–319, 2003.

[33] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.

[34] M.S. Waterman and T.F. Smith. On the similarity of dendrograms. Journal of Theoretical Biology, 73:789–800, 1978.

[35] M. Yan and D.A. Bader. Fast character optimization in parsimony phylogeny reconstruction. Technical Report TR-CS-2003-53, University of New Mexico, Albuquerque, NM, USA, 2003.


[36] Y.-M. Lin, S.-C. Fang, and J.L. Thorne. A tabu search algorithm for maximum parsimony phylogeny inference. European Journal of Operational Research, 176(3):1908–1917, February 2007.
