Metaheuristics applied to Bioinformatics problems
Jean-Michel Richer
[email protected] http://www.info.univ-angers.fr/pub/richer
IBSS 2011 - Tanger, Maroc
Aim
See how metaheuristics are used to solve different kinds of problems in bioinformatics:
- Multiple Sequence Alignment
- Phylogenetic Reconstruction
- other problems
Outline
1. Multiple Alignment
2. Phylogenetic Reconstruction
3. Other problems
4. Conclusion
5. Bibliography
Multiple Alignment
What is an alignment?
Definition (alignment)
Given a set $S = \{S_1, \ldots, S_k\}$ of sequences, find a matrix $M(k, n)$ such that:
- $\max_i(|S_i|) \le n \le \sum_{i=1}^{k} |S_i|$
- each character $M[i, j]$ is a residue or a gap (-)
- there is no column in which all characters are gaps
- $M[i] = S_i$ if we remove all the inserted gaps
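The four conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming an alignment is stored as a list of equal-length strings with "-" as the gap character; the function name `is_valid_alignment` is hypothetical.

```python
GAP = "-"  # assumed gap character


def is_valid_alignment(sequences, rows):
    """Check a candidate alignment (list of strings) against the definition."""
    if not rows or len(rows) != len(sequences):
        return False
    n = len(rows[0])
    # M is a k x n matrix: every row must have the same length n
    if any(len(row) != n for row in rows):
        return False
    # max |S_i| <= n <= sum |S_i|
    lengths = [len(s) for s in sequences]
    if not (max(lengths) <= n <= sum(lengths)):
        return False
    # no column may consist only of gaps
    for col in range(n):
        if all(row[col] == GAP for row in rows):
            return False
    # removing the inserted gaps from row i must give back S_i
    return all(row.replace(GAP, "") == seq for row, seq in zip(rows, sequences))
```

For instance, `is_valid_alignment(["ATTTC", "ATTC"], ["ATTTC", "ATT-C"])` holds, while adding a column of gaps to both rows makes the check fail.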
What is a good alignment?
Main question: given a set of sequences S, what is a good alignment for S?
Answer: nobody can tell!
Example 1 - ATTTC and ATTC

ATTTC    ATTTC    ATTTC
ATT-C    AT-TC    A-TTC

All three alignments are equivalent.
Example 2 - ATTTTC, ATC and CTC

ATTTTC    ATTTTC
AT---C    ATC---
CT---C    CTC---

ATT = Ile, ATC = Ile, CTC = Leu, TTC = Phe
Example 3 - but what if the sequences are ATTTTG, ATC and CTC?

ATTTTG    ATTTTG    ATTTTG
AT---C    ATC---    ATC---
CT---C    CTC---    ---CTC

Read as codons, the last two alignments become:

IL    IL
I-    I-
L-    -L

ATT = Ile, ATC = Ile, CTC = Leu, TTG = Leu
Quality of an alignment
Sum of pairs: a function to assess the quality of an alignment M
- substitution matrix (PAM, BLOSUM, GONNET, ...)
- gap cost model (linear, affine, concave, ...)

$$SP(M) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \left( \sum_{r=1}^{n} w(M[i, r], M[j, r]) \right)$$
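The Sum of Pairs formula translates directly into two nested loops. This is a minimal sketch: the weight function `toy_weight` is a hypothetical stand-in for a real substitution matrix and gap cost model, chosen only to keep the example short.

```python
from itertools import combinations


def sum_of_pairs(rows, w):
    """SP(M): sum of w over every pair of rows and every column."""
    total = 0
    for row_i, row_j in combinations(rows, 2):  # all pairs i < j
        for a, b in zip(row_i, row_j):          # all columns r
            total += w(a, b)
    return total


def toy_weight(a, b):
    # Hypothetical scoring, not PAM or BLOSUM:
    # gap/gap 0, residue/gap -1, match +1, mismatch 0
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0


alignment = ["ATTTTC", "AT---C", "CT---C"]
score = sum_of_pairs(alignment, toy_weight)
```

Because every unordered pair of rows is visited once and `toy_weight` is symmetric, the score does not change when the rows are permuted.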
Sum of pairs

ATTCTCTTATATA...
ATTGTGTTATTTT...
CTTCTCTTATTCT...
CTACTCTTATTCT...

The evaluation does not depend on the order of the sequences.
Pairwise Sequence Alignment
Definition (Pairwise Sequence Alignment - PSA)
Given two sequences S and T, find an alignment $M(2, n)$ such that:
- $\max(|S|, |T|) \le n \le |S| + |T|$
- $SP(M)$ is optimal for the Sum of Pairs function
Pairwise Sequence Alignment
How to solve this optimization problem?
- Metaheuristics: search for the best alignment among all alignments of 2 sequences of length n for a given objective function f (a maximization problem)
- Dynamic Programming: decomposition into subproblems
Pairwise Sequence Alignment
How to solve this optimization problem with Metaheuristics?
Number of alignments for 2 sequences of length n:

$$\sum_{k=0}^{n} C_{n+k}^{k} \times C_{n}^{k} = 1 + C_{2n}^{n} = 1 + \frac{(2n)!}{(n!)^2}$$

Approximately: $\sim \frac{2^{2n}}{\sqrt{\pi n}}$
For two sequences of length n = 100, there are $9.05 \times 10^{58}$ alignments.
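The order of magnitude quoted for n = 100 can be checked numerically. This sketch only evaluates the closed form $(2n)!/(n!)^2$ and its approximation $2^{2n}/\sqrt{\pi n}$; the function names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def central_binomial(n):
    # (2n)! / (n!)^2, computed exactly with integers
    return math.comb(2 * n, n)


def stirling_estimate(n):
    # 2^(2n) / sqrt(pi * n), the approximation given on the slide
    return 4.0 ** n / math.sqrt(math.pi * n)


exact = central_binomial(100)
estimate = stirling_estimate(100)
print(f"{exact:.3e}  {estimate:.3e}")
```

The two values agree to within about 1%, which is why exhaustive enumeration is out of reach even for short sequences.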
Dynamic Programming
Definition (Dynamic Programming)
- introduced by Richard Bellman, 1954 [3]
- a method for solving complex problems by breaking them down into simpler subproblems, where one needs to find the best decisions one after another
- the word "programming" referred to the use of the method to find an optimal program
PSA with Dynamic Programming - Recursive formula
The best alignment between S[1..i] and T[1..j] is:

$$\max \begin{cases} \mathrm{align}(S[1..i-1], T[1..j-1]) + w(S[i], T[j]) \\ \mathrm{align}(S[1..i], T[1..j-1]) + w(-, T[j]) \\ \mathrm{align}(S[1..i-1], T[1..j]) + w(S[i], -) \end{cases}$$
PSA with Dynamic Programming
[Figure: the cases of the recurrence for S[i] and T[j], labelled gap insertion, mismatch, substitution and match]
PSA with Dynamic Programming
- complexity: $O(n^2)$
- however, what to do in case of equivalences?
A A - - A T - T T -
PSA - Example
Example (Parameters)
S = ACAGTC
T = CATTGC
match: w(a, a) = 1
substitution: w(a, b) = 0
linear gap penalty: $g_o = 0$
PSA - Example
Initialization of matrix M
$M[0, 0] = 0$
$M[i, 0] = M[i-1, 0] + g_o \quad \forall i \in [1, N]$
$M[0, j] = M[0, j-1] + g_o \quad \forall j \in [1, P]$
PSA - Example
Recurrence relation
$M[i, j]$ depends on $M[i-1, j-1]$ (diagonal), $M[i, j-1]$ (left) and $M[i-1, j]$ (above):

$$M[i, j] = \max \begin{cases} M[i-1, j-1] + w(x_i, y_j) \\ M[i-1, j] + g_o \\ M[i, j-1] + g_o \end{cases}$$
PSA - Example
Initialization (rows i = 0..6 for S, columns j = 0..6 for T)

S/T     C  A  T  T  G  C
     0  0  0  0  0  0  0
A    0
C    0
A    0
G    0
T    0
C    0
PSA - Example
Recurrence on M (rows i = 0..6 for S, columns j = 0..6 for T)

S/T     C  A  T  T  G  C
     0  0  0  0  0  0  0
A    0  0  1  1  1  1  1
C    0  1  1  1  1  1  2
A    0  1  2  2  2  2  2
G    0  1  2  2  2  3  3
T    0  1  2  3  3  3  3
C    0  1  2  3  3  3  4
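The initialization and the recurrence can be sketched in a few lines. This is a minimal illustration with the parameters of the example (match 1, substitution 0, linear gap penalty 0); the function name `fill_matrix` and its keyword arguments are hypothetical, and the traceback is not included.

```python
def fill_matrix(s, t, match=1, substitution=0, gap=0):
    """Fill the dynamic programming matrix M for sequences s (rows) and t (columns)."""
    n, p = len(s), len(t)
    m = [[0] * (p + 1) for _ in range(n + 1)]
    # initialization of the first column and first row
    for i in range(1, n + 1):
        m[i][0] = m[i - 1][0] + gap
    for j in range(1, p + 1):
        m[0][j] = m[0][j - 1] + gap
    # recurrence: diagonal (match or substitution), above (gap), left (gap)
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            w = match if s[i - 1] == t[j - 1] else substitution
            m[i][j] = max(m[i - 1][j - 1] + w,
                          m[i - 1][j] + gap,
                          m[i][j - 1] + gap)
    return m


matrix = fill_matrix("ACAGTC", "CATTGC")
for row in matrix:
    print(" ".join(str(value) for value in row))
```

With these parameters the bottom-right cell is the length of a longest common subsequence of S and T, which is why several alignments reach the same optimal score.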
Traceback from M
Alignments
There are 5 optimal alignments:

-CATTG-C    -CAT-TGC    -CATTGC    -CA-TTGC    -CA-TTGC
ACA--GTC    ACA-GT-C    ACAGT-C    ACAGT--C    ACAG-T-C

Which one is the best?
PSA and Dynamic Programming
Different kinds of alignments
- global with linear gap penalty: Needleman and Wunsch, 1970 [21]
- global with affine gap penalty: Gotoh, 1982 [14]
- local: Smith and Waterman, 1981 [30]
Multiple Sequence Alignment
Definition (Multiple Sequence Alignment - MSA)
Given a set of k sequences S, find an alignment $M(k, n)$ such that:
- $\max_i(|S_i|) \le n \le \sum_{i=1}^{k} |S_i|$
- $SP(M)$ is optimal for the Sum of Pairs function
Complexity of MSA
- intractability proved by Wang and Jiang, 1994 [33]
- complexity of the Dynamic Programming extension: $O(k^2 \times 2^k \times l^k)$ or $O(2^k \times l^k)$ for a set of k sequences of length l
Trick
MSA and Carrillo-Lipman
Carrillo and Lipman trick to reduce complexity
- Carrillo and Lipman, 1988 [4]
- decrease complexity by considering only part of the matrix computations
[Figure: Carrillo and Lipman trick to reduce complexity, region of the matrix that is actually computed, shown in 2 dimensions]
A first heuristic: Clustal
Clustal: a heuristic method
1. generate a guide tree (UPGMA, NJ, ...)
2. align profiles (∼ consensus) along the tree branches
Characteristics of Clustal
- influence of the guide tree
- loss of precision with profiles
- complexity: $O(k^2 \times n^2)$
Muscle
MUltiple Sequence Comparison by Log-Expectation, Edgar, 2004 [7]
- use of a much faster, but somewhat more approximate, method to compute distances
- use of UPGMA for better progressive alignment, because it forces the alignment of the most similar sequences first
- improvement of the alignment
Muscle - first step
k-mer clustering to build a tree
- do not construct an alignment
- count the number of short subsequences (known as k-mers) that two sequences have in common
- around 3,000 times faster than Clustal's method, but the trees will generally be less accurate
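Counting shared k-mers needs no alignment at all, which is where the speed comes from. The sketch below is a simplified illustration, not the distance actually used by MUSCLE (which, among other things, works on a compressed alphabet); the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=3):
    """Number of k-mers that two sequences have in common, with multiplicity."""
    common = kmer_counts(a, k) & kmer_counts(b, k)  # multiset intersection
    return sum(common.values())


print(shared_kmers("ATTCTCTT", "ATTGTGTT", k=3))
```

A similarity such as this one can be turned into a distance matrix and fed to UPGMA to obtain the first guide tree.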
Muscle - second step
- use the tree to construct a progressive alignment
- proceed as Clustal

Muscle - third step
Build a tree from the alignment and compare it to the initial tree:
- compute the pairwise identities of each pair of sequences
- obtain a distance matrix and build a new tree
- if the trees are identical, there is nothing to do; otherwise rebuild a new alignment
- repeat this tree refinement until the tree stabilizes or until a specified maximum number of iterations has been reached
Modification of the fitness function: COFFEE
Consistency based Objective Function For alignmEnt Evaluation
- introduced by Notredame, Holm and Higgins, 1998 [24]
- describes the quality of a multiple protein sequence alignment
- first evaluate a library and then compute COFFEE for a given alignment
COFFEE: build the library
- the library is made of all pairwise alignments of the k sequences
- for example: use Clustal to obtain the pairwise alignments

COFFEE: evaluate
- compare each pair of aligned residues of the MSA to the library
- simple version: number of pairs of aligned residues found both in the MSA and in the library, divided by the total number of pairs of aligned residues in the MSA
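The simple version of the score can be sketched as a set intersection. This is an illustrative reading of the slide, not the published COFFEE implementation: the library is assumed to be a set of tuples (sequence index, residue position, sequence index, residue position), and both function names are hypothetical.

```python
from itertools import combinations


def residue_pairs(row_a, row_b, ia, ib):
    """Pairs of residues of sequences ia and ib that share a column."""
    pairs, pos_a, pos_b = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((ia, pos_a, ib, pos_b))
        if a != "-":
            pos_a += 1  # position in the ungapped sequence ia
        if b != "-":
            pos_b += 1  # position in the ungapped sequence ib
    return pairs


def simple_coffee(msa_rows, library):
    """Fraction of aligned residue pairs of the MSA that also occur in the library."""
    msa_pairs = set()
    for (ia, row_a), (ib, row_b) in combinations(enumerate(msa_rows), 2):
        msa_pairs |= residue_pairs(row_a, row_b, ia, ib)
    return len(msa_pairs & library) / len(msa_pairs) if msa_pairs else 0.0


# library built from one pairwise alignment of ATTTC and ATTC
library = residue_pairs("ATTTC", "ATT-C", 0, 1)
score = simple_coffee(["ATTTC", "AT-TC"], library)
```

Here three of the four aligned residue pairs of the evaluated alignment are in the library, so the score is 0.75.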
Metaheuristics for MSA
SAGA: a metaheuristic approach to MSA
SAGA - Sequence Alignment by Genetic Algorithm
- designed by Notredame and Higgins, 1996 [23]
- population of alignments submitted to a GA
SAGA - variation operators
- 19 mutation operators (add, delete, move gaps)
- 6 crossover operators (modify or combine gap regions)
- use of an OS (Operator Scheduling) strategy to select which operator to apply
SAGA - Operator Scheduling strategy
- each operator is assigned a probability (initially equal)
- operators are rewarded when they create better individuals in the population
- motivation: it is difficult to know in which order to apply operators
- a kind of natural selection for operators
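The scheduling idea can be sketched with a weight per operator. This is only an illustration of the principle: the reward rule, the class name `OperatorScheduler` and the operator names are assumptions, and SAGA's actual credit assignment is not described on the slide.

```python
import random


class OperatorScheduler:
    """Pick operators with probabilities that grow when they produce improvements."""

    def __init__(self, operators, reward=0.5):
        self.operators = list(operators)
        self.weights = [1.0] * len(self.operators)  # initially equal
        self.reward = reward                        # assumed fixed bonus

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def pick(self, rng=random):
        index = rng.choices(range(len(self.operators)), weights=self.weights)[0]
        return index, self.operators[index]

    def feedback(self, index, improved):
        # reward an operator that created a better individual
        if improved:
            self.weights[index] += self.reward


scheduler = OperatorScheduler(["add_gap", "delete_gap", "move_gap"])
index, operator = scheduler.pick()
scheduler.feedback(index, improved=True)
```

Over many generations, operators that keep producing better alignments are selected more often, without fixing the order of application in advance.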
MSA-EA - an improvement of Clustal
- introduced by Thomsen, Fogel and Krink, 2003 [32]
- a Genetic Algorithm
- uses the solution of Clustal as a seed (1.2 length + gaps at end)
- applies variation operators (BlockShuffle, LocalShuffle, ...)
MSA-EA - an improvement of Clustal
Variation operators: example of LocalShuffle
- picks a random AA from a randomly chosen row (sequence)
- checks whether one of its neighbors is a gap
- if so, swaps them (if both neighbors are gaps, one of them is picked randomly)

Variation operators: discussion of LocalShuffle
- no biological meaning
- no optimized choice
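The three steps of LocalShuffle can be sketched as follows. This follows the description on the slide only; details of the published MSA-EA operator may differ, and the function name and the `rng` parameter are hypothetical.

```python
import random


def local_shuffle(rows, rng=random):
    """Swap one randomly chosen residue with a neighboring gap, if there is one."""
    rows = [list(row) for row in rows]
    row = rows[rng.randrange(len(rows))]          # randomly chosen sequence
    residues = [c for c, ch in enumerate(row) if ch != "-"]
    if residues:
        c = rng.choice(residues)                  # random residue in that row
        gaps = [g for g in (c - 1, c + 1)
                if 0 <= g < len(row) and row[g] == "-"]
        if gaps:
            g = rng.choice(gaps)                  # both gaps: pick one at random
            row[c], row[g] = row[g], row[c]
    return ["".join(row) for row in rows]


mutated = local_shuffle(["ATTTC", "AT-TC"], random.Random(0))
```

The move never changes the underlying sequences, only the position of one gap, which illustrates the remark that the choice is neither biologically motivated nor optimized.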
Conclusion for MSA
Resolution of MSA
- Metaheuristics take too much time
- iterative Dynamic Programming seems sufficient and efficient enough: ClustalW
- who really knows what a correct or good alignment is?
Conclusion for MSA
Some results on BAliBASE: SPS score computed with bali_score

Software    CLUSTAL  MAFFT  MUSCLE  PROBCONS  TCOFFEE  MALINBA
Set 1       0.809    0.829  0.821   0.849     0.814    0.811
Set 2       0.932    0.931  0.935   0.943     0.928    0.911
Set 3       0.723    0.812  0.784   0.817     0.739    0.752
Set 4       0.834    0.947  0.841   0.939     0.852    0.899
Set 5       0.858    0.978  0.972   0.974     0.943    0.942
Time (s)    120      98     75      711       1653     343
Phylogenetic Reconstruction
Methods for Phylogenetic Reconstruction
- distance based, ∼ $O(n^3)$: UPGMA, WPGMA, NJ, BioNJ (Gascuel, 1997 [8])
- character based: Maximum Parsimony, Maximum Likelihood
Maximum Parsimony
- character-based approach that relies on the work of the German entomologist Willi Hennig (1913-1976)
- based on Occam's razor (1285-1349) or principle of economy

Maximum Parsimony
- input: a multiple sequence alignment of n sequences of m residues
- algorithm: find the tree with the minimum number of changes
- output: a tree of minimum score, i.e. with the minimum number of changes
- a minimization problem
Small parsimony problem
Definition (Small parsimony problem)
Given a multiple alignment of length m of a set L of n sequences and a tree T whose leaves are labelled with the sequences of L, find the parsimony score of T.
Complexity: $O(n \times m)$

Large parsimony problem
Definition (Large parsimony problem)
Given a multiple alignment of length m of a set L of n sequences, find a most parsimonious tree T, i.e. a tree with minimum parsimony score.
Complexity: factorial
Maximum Parsimony
Number of trees

number of taxa   # of unrooted trees          # of rooted trees
10               2.0e+06                      3.4e+07
20               2.2e+20                      8.2e+21
30               8.6e+36                      4.9e+38
40               1.3e+55                      1.0e+57
50               2.8e+74                      2.7e+76
80               2.1e+137                     3.4e+139
n                $\prod_{i=3}^{n} (2i - 5)$   $\prod_{i=2}^{n} (2i - 3)$
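The two products in the last row of the table can be evaluated exactly with Python integers. The function names below are hypothetical; the formulas are the ones given in the table.

```python
def unrooted_trees(n):
    """Product of (2i - 5) for i = 3..n."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count


def rooted_trees(n):
    """Product of (2i - 3) for i = 2..n."""
    count = 1
    for i in range(2, n + 1):
        count *= 2 * i - 3
    return count


for taxa in (10, 20, 30, 40, 50, 80):
    print(taxa, f"{unrooted_trees(taxa):.1e}", f"{rooted_trees(taxa):.1e}")
```

Note that the number of rooted trees on n taxa equals the number of unrooted trees on n + 1 taxa, since rooting a tree amounts to adding one extra leaf.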
Phylogenetic reconstruction with Maximum Parsimony
Maximum Parsimony is a combinatorial optimization problem (a minimization problem) for which we must find efficient methods.

Methods
- branch and bound
- local search methods
- genetic algorithms
- memetic algorithms
- other optimization techniques
Branch and bound
- introduced by Hendy and Penny, 1982 [15]
- generate a first tree whose score serves as an upper bound
- then create trees by iteratively adding a new taxon, keeping only the partial trees whose score stays under the upper bound
Branch and bound
[Figure: branch and bound search. Image from Mikael Thollesson (c)]

Drawback of Branch and bound
- too many trees are generated
- maximum number of taxa: 20
LS descent algorithm
Algorithm 1 descent(S, f, N)
  s ← choose or generate an initial configuration ∈ S
  for a given number of iterations do
    find s′ ∈ N(s) such that f(s′) < f(s), or return s
    s ← s′
  end for
  return s
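Algorithm 1 can be sketched generically. In this illustration the configuration space is replaced by integers and the neighborhood by "plus or minus one", which are assumptions made only to keep the example runnable; for Maximum Parsimony, `f` would be the parsimony score and `neighbors` a tree neighborhood.

```python
def descent(initial, f, neighbors, max_iterations=1000):
    """Strict descent: move to an improving neighbor until none exists."""
    s = initial
    for _ in range(max_iterations):
        # first neighbor s' with f(s') < f(s), or None if s is a local optimum
        better = next((t for t in neighbors(s) if f(t) < f(s)), None)
        if better is None:
            return s
        s = better
    return s


# toy problem: minimize (x - 3)^2 over the integers
best = descent(10, lambda x: (x - 3) ** 2, lambda x: [x - 1, x + 1])
```

The loop stops either at a local optimum or when the iteration budget is exhausted, exactly as in the pseudocode.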
LS generation of the initial configuration
Generation of initial configuration
- random (sometimes too far from the optimum)
- branch and bound
- stepwise addition

Stepwise addition: each new taxon is tried on all branches and the most parsimonious tree is kept
Acceptance of new configuration
- strict descent: only improving configurations
- side-walk descent: improving or equivalent neighbors
- random walk: possibility to accept deteriorating neighbors
- simulated annealing: specific random walk with a non-constant probability
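The four acceptance rules differ only in how they treat the variation delta = f(s′) − f(s) of a minimization problem. The sketch below is illustrative: the fixed probability `p` and the Metropolis rule exp(−delta / temperature) are common choices assumed here, not details given on the slide.

```python
import math
import random


def accept_strict(delta):
    return delta < 0                      # only improving neighbors


def accept_side_walk(delta):
    return delta <= 0                     # improving or equivalent neighbors


def accept_random_walk(delta, p=0.05, rng=random):
    # deteriorating neighbors accepted with a constant probability p
    return delta <= 0 or rng.random() < p


def accept_annealing(delta, temperature, rng=random):
    # deteriorating neighbors accepted with a probability that shrinks
    # as the temperature decreases (Metropolis rule, assumed here)
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)
```

Plugging one of these predicates into a local search loop in place of the strict test f(s′) < f(s) gives the corresponding variant.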
Escape from local optimum
Noising techniques
- Iterated Local Search: perturbation of the current configuration
- Parsimony Ratchet: modification of the fitness function

Parsimony Ratchet
Introduced by Nixon, 1999, and Horovitz, 1999 [22, 17]
When a local optimum s⋄ is reached, the Ratchet noises the evaluation function:
- the weights of a proportion of the characters (10-15%) can be increased, or some characters can be eliminated
- a LS is performed from s⋄ using the noised evaluation function f′
- the LS then continues with the original objective function f
Neighborhoods for MP
A configuration is a tree; three main neighborhoods exist:
- NNI (Nearest Neighbor Interchange) [34], small size
- SPR (Subtree Pruning Regrafting) [31], average size
- TBR (Tree Bisection Reconnection) [31], large size
Neighborhoods for MP
NNI
- consists in swapping two subtrees which are separated by a branch
- size: $2n - 6$
- extension: p-ECR, which shuffles p adjacent branches; in particular, 1-ECR is equivalent to NNI

SPR
- cuts a branch and creates two separate trees: the clipped tree and the residual tree
- the clipped tree can then be regrafted on each branch of the residual tree
- size: $2 \times (n - 3) \times (2n - 7)$

TBR
- breaks the tree into two subtrees and reconnects the re-rooted clipped tree to any branch of the residual tree
- size: at most $(2n - 3) \times (n - 3)^2$

Comparison of NNI, SPR and TBR
- NNI: $O(n)$
- SPR: $O(n^2)$
- TBR: $O(n^3)$
- NNI ⊆ SPR ⊆ TBR
Neighborhoods for MP
Variable Neighborhoods
- NNI, STEP, SPR, where STEP is an SPR for which only leaves are pruned, was proposed by Ribeiro
- SPR + 2-SPR: Ribeiro, Vianna, 2005 [25] (where k-SPR is the composition of k SPR transformations)
- Parametric Progressive Neighborhood (PPN): Goëffon, Richer, Hao, 2007 [11]: SPR → NNI

Parametric Progressive Neighborhood
- others: increase the neighborhood size
- PPN: decrease the neighborhood size
Neighborhoods for MP
[Figure: Parametric Progressive Neighborhood. Evolution of the score of the current tree (y axis) over the iterations (x axis, 0 to 50000) for SPR, NNI and PPN with a 300-100 random instance, starting from a random tree]
Other LS algorithms
- Tabu Search: Yu-Min, Shu-Cherng, Jeffrey, 2007 [36]
- Simulated Annealing: LVB, Barker, 2004 [2]
- GRASP + VNS: Ribeiro et al. [1, 25]
Genetic Algorithm
Algorithm 2 GeneticAlgorithm(S, f, x)
  P ← { choose or generate n individuals ∈ S }
  for a given number of crossovers x do
    p, q ← select-parents(P)
    r ← crossover(p, q)
    mutation(r)
    if selection(r) then
      replace(P, r)
    end if
  end for
Genetic Algorithm
Genetic Algorithms for MP [18, 5, 6, 26]
- tree crossover operators follow the subtree cutting and regrafting strategy
- but crossover and mutation should be tailored to the target problem in order to integrate problem-specific constraints and thus improve the search

A specific crossover operator
Distance-Based Information Preservation Crossover (DiBIP)
- introduced by Goëffon, Richer, Hao, 2006 [10]
- based on the notion of topological distance between two leaves
- aims to preserve common properties of the parents in terms of topological distance between taxa
Topological distance
Definition (Topological distance)
Let i and j be two taxa of a tree T. The topological distance $\delta_T(i, j)$ between i and j is defined as the number of edges of the path between the parents of i and j, minus 1 if the path contains the root of the tree.
Topological distance
[Figure: tree T with leaves A, B, C, D, E, F and internal nodes f, g, h, j, k]

Distance $\delta_T$

   B  C  D  E  F
A  0  1  3  3  2
B     1  3  3  2
C        2  2  1
D           0  1
E              1
Metaheuristics applied to Bioinformatics problems Phylogenetic Reconstruction
DiBIP crossover algorithm
DiBIP crossover algorithm Algorithm 3 DiBIP(T1 , T2 , δT , ∆, ⊕, Λ) Di ← ∆(Ti ) for (i = 1, 2) D ∗ ← D1 ⊕ D2 T ∗ ← Λ(D ∗ ) return T ∗
DiBIP example
[Figure: Tree 1, leaves in drawing order: D I A K J B L N G C M F E H]
DiBIP example
Tree 1 - topological matrix $D_1$

   B  C  D  E  F  G  H  I  J  K  L  M  N
A  6  5  1  5  5  5  5  0  5  2  7  5  7
B     3  5  5  5  3  5  6  1  4  1  5  1
C        4  4  4  0  4  5  2  3  4  4  4
D           4  4  4  4  1  4  1  6  4  6
E              2  4  0  5  4  3  6  2  6
F                 4  2  5  4  3  6  0  6
G                    4  5  2  3  4  4  4
H                       5  4  3  6  2  6
I                          5  2  7  5  7
J                             3  2  4  2
K                                5  3  5
L                                   6  0
M                                      6
DiBIP example
[Figure: Tree 2, leaves in drawing order: M B F L J K A E D H C G I N]
DiBIP example
Tree 2 - topological matrix $D_2$

   B  C  D  E  F  G  H  I  J  K  L  M  N
A  8  4  1  0  9  4  2  6  7  4  9  6  6
B     6  7  8  1  6  6  4  1  4  1  2  4
C        3  4  7  0  2  4  5  2  7  4  4
D           1  8  3  1  5  6  3  8  5  5
E              9  4  2  6  7  4  9  6  6
F                 7  7  5  2  5  0  3  5
G                    2  4  5  2  7  4  4
H                       4  5  2  7  4  4
I                          3  2  5  2  0
J                             3  2  1  3
K                                5  2  2
L                                   3  5
M                                      2
DiBIP example
$D^* = D_1 + D_2$

    B   C   D   E   F   G   H   I   J   K   L   M   N
A  14   9   2   5  14   9   7   6  12   6  16  11  13
B       9  12  13   6   9  11  10   2   8   2   7   5
C           7   8  11   0   6   9   7   5  11   8   8
D               5  12   7   5   6  10   4  14   9  11
E                  11   8   2  11  11   7  15   8  12
F                      11   9  10   6   8   6   3  11
G                           6   9   7   5  11   8   8
H                               9   9   5  13   6  10
I                                   8   4  12   7   7
J                                       6   4   5   5
K                                          10   5   7
L                                               9   5
M                                                   8
DiBIP example
[Figure: resulting tree $T^* = \Lambda(D^*)$ built with UPGMA, leaves in drawing order: M F B L J N H E D A K I C G]
Memetic algorithm
Memetic algorithm for MP
- combination of a GA helped by a LS improver [19]
- implementations: [16, 26]
- Hydra (Goëffon, Richer, Hao, 2007 [11]) is an implementation of a MA
Other optimization methods
- Sectorial Search: Goloboff, 1999 [13]
- Disc-Covering Methods (DCM) [20, 29]
- Fast character optimization techniques [12, 9, 28, 35]
- Multi-character optimization techniques [28, 27]
Sectorial Search
- focus on a part of the tree
- decomposition into subproblems: divide and conquer

Fast character optimization
- a tree modification does not require recomputing the score of the whole tree
- complex method but very effective
- TNT (Tree analysis using New Technology), Goloboff, 1999 [13]: billions of trees in a few seconds
Multi-character optimization
- use of the vector registers of modern CPUs to compute in parallel
- first version: Ronquist, 2000 [28], for PowerPC
- Richer, 2008 [27]: release of the code for Intel and AMD processors with SSE instructions
Multi-character optimization algorithm
Algorithm 4 fitch(x, y, z : array[1..m] of bytes) : int
  changes ← 0
  for i ← 1 to m do
    z[i] ← x[i] ∩ y[i]
    if (z[i] == 0) then
      z[i] ← x[i] ∪ y[i]
      changes ← changes + 1
    end if
  end for
  return changes
Multi-character optimization algorithm
Code each nucleic acid by a power of 2:

char   value   power   binary
A      1       $2^0$   00001
C      2       $2^1$   00010
G      4       $2^2$   00100
T      8       $2^3$   01000
-      16      $2^4$   10000

A ∪ C = 3 = 00011
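With this encoding, set intersection and union become bitwise AND and OR, so Algorithm 4 can be sketched as below. This is a scalar illustration in Python, not the SSE implementation; the helper `encode` is hypothetical and, unlike the pseudocode, the function returns the array z together with the number of changes.

```python
CODE = {"A": 1, "C": 2, "G": 4, "T": 8, "-": 16}  # one bit per character state


def encode(sequence):
    return bytes(CODE[ch] for ch in sequence)


def fitch(x, y):
    """Combine two encoded sequences; return (z, number of changes)."""
    z = bytearray(len(x))
    changes = 0
    for i in range(len(x)):
        z[i] = x[i] & y[i]          # intersection of the two state sets
        if z[i] == 0:
            z[i] = x[i] | y[i]      # empty intersection: take the union
            changes += 1            # and count one change
    return bytes(z), changes


z, changes = fitch(encode("ACGT"), encode("AGGT"))
```

Calling `fitch` bottom-up on every internal node of a tree and summing the returned changes gives the parsimony score of that tree; vector instructions apply the same AND/OR step to 16 bytes at once.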
Multi-character optimization
Intel and AMD processors
[Figure: SSE registers, 16 bytes long. xmm1 = (1, 4, 6, ...) and xmm2 = (3, 2, 5, ...) are combined into xmm4 = xmm1 AND xmm2 = (1, 0, 4, ...) and xmm3 = xmm1 OR xmm2 = (3, 6, 7, ...)]
Multi-character optimization for Intel and AMD
Time in seconds, compiled with nasm for Linux

CPU / method           -O2     SSE     %
Pentium-M 2.0 GHz      48.28   7.38    -84
Pentium 4 2.8 GHz      60.15   15.61   -74
Athlon 64 2.2 GHz      43.87   11.86   -72
Intel Q6600 2.4 GHz    40.91   2.47    -93
Core i7 860 2.8 GHz    30.58   1.89    -93⋆

⋆ : -95% with the use of POPCNT
Other problems
- DNA fragment assembly: build a DNA sequence from thousands of overlapping fragments
- Gene expression profiling: find the smallest subset of genes that regulate other genes or are involved in diseases
- Structure prediction: determine the 2D or 3D structure of a protein
- Docking: find the best candidate for a substrate
- ...
Conclusion
Reusability and genericity of Metaheuristics
- Metaheuristics can be applied to a wide range of problems, and especially to problems in bioinformatics
- but the components of Metaheuristics must be tailored to the problem
Decrease complexity by Parallelism
- calculations are carried out simultaneously
  - on the same processor (multicore)
  - on different processors: cluster
- use of n processors: at best splits the work by a factor of n; in practice the time is divided by a factor < n
Decrease complexity by Cloud / Grid computing
- a different way to obtain the power of a cluster
- provision of computational resources on demand via a network
Questions and answers

Bibliography
[1] A.A. Andreatta and C.C. Ribeiro. Heuristics for the phylogeny problem. Journal of Heuristics, 8:429–447, 2002.
[2] D. Barker. Parsimony and simulated annealing in the search for phylogenetic trees. Bioinformatics, 20:274–275, 2004.
[3] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60:503–516, 1954.
[4] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5):1073–1082, 1988.
[5] C.B. Congdon. Gaphyl: An evolutionary algorithms approach for the study of natural evolution. Proceedings of the 6th Joint Conference on Information Science, 2002.
[6] C.B. Congdon and K.J. Septor. Phylogenetic trees using evolutionary search: Initial progress in extending Gaphyl to work with genetic data. Proceedings of the 2003 Congress on Evolutionary Computation, pages 320–326, 2003.
[7] R.C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[8] O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7):685–695, 1997.
[9] D.S. Gladstein. Efficient character optimization. Cladistics, 13:21–26, 1997.
[10] A. Goëffon, J-M. Richer, and J.K. Hao. A distance-based information preservation tree crossover for the maximum parsimony problem. Lecture Notes in Computer Science, 4193:761–770, 2006.
[11] A. Goëffon, J-M. Richer, and J.K. Hao. Progressive tree neighborhood applied to the maximum parsimony problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(1):136–145, 2008.
[12] P.A. Goloboff. Character optimization and calculation of tree lengths. Cladistics, 9:433–436, 1993.
[13] P.A. Goloboff. Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics, 15:415–428, 1999.
[14] O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.
[15] M.D. Hendy and D. Penny. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences, 59:277–290, 1982.
[16] T. Hill, A. Lundgren, R. Fredriksson, and H.B. Schiöth. Genetic algorithm for large-scale maximum parsimony phylogenetic analysis of proteins. Biochimica et Biophysica Acta (BBA) - General Subjects, 1725(1):19–29, 2005.
[17] I. Horovitz. A report on one day symposium on numerical cladistics. Cladistics, 15:177–182, 1999.
[18] A. Moilanen. Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics, 15(3):39–50, 1998.
[19] P. Moscato. Memetic algorithms: a short introduction. In New Ideas in Optimization. McGraw-Hill, 1999.
[20] L. Nakhleh, U. Roshan, K. St John, J. Sun, and T. Warnow. Designing fast converging phylogenetic methods. Bioinformatics Supplement, 17:190–198, 2001.
[21] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[22] K.C. Nixon. The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics, 15:407–414, 1999.
[23] C. Notredame and D. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24(8):1515–1524, 1996.
[24] C. Notredame, L. Holm, and D.G. Higgins. COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14(5):407–422, 1998.
[25] C.C. Ribeiro and D.S. Vianna. A GRASP/VND heuristic for the phylogeny problem using a new neighborhood structure. International Transactions in Operational Research, 12:1–14, 2005.
[26] C.C. Ribeiro and D.S. Vianna. A genetic algorithm for the phylogeny problem using an optimized crossover strategy based on path-relinking. Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pages 97–102, 2003.
[27] J-M. Richer. Three new techniques to improve phylogenetic reconstruction with maximum parsimony. Technical report, LERIA, 2008.
[28] F. Ronquist. Fast Fitch-parsimony algorithms for large data sets. Cladistics, 14:387–400, 2000.
[29] U. Roshan, B.M.E. Moret, T.L. Williams, and T. Warnow. Rec-I-DCM3: A fast algorithmic technique for reconstructing large phylogenetic trees. Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB 04), pages 98–109, 2004.
[30] T.F. Smith and M.S. Waterman. Identification of common molecular sequences. Journal of Molecular Biology, 147:195–197, 1981.
[31] D.L. Swofford and G.J. Olsen. Molecular Systematics. D.M. Hillis and C. Moritz (eds.), 1990.
[32] R. Thomsen, G.B. Fogel, and T. Krink. Improvement of Clustal-derived sequence alignments with evolutionary algorithms. In IEEE Congress on Evolutionary Computation (1)'03, pages 312–319, 2003.
[33] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.
[34] M.S. Waterman and T.F. Smith. On the similarity of dendrograms. Journal of Theoretical Biology, 73:789–800, 1978.
[35] M. Yan and D.A. Bader. Fast character optimization in parsimony phylogeny reconstruction. Technical Report TR-CS-2003-53, University of New Mexico, Albuquerque, NM, USA, 2003.
[36] L. Yu-Min, F. Shu-Cherng, and T.L. Jeffrey. A tabu search algorithm for maximum parsimony phylogeny inference. European Journal of Operational Research, 176(3):1908–1917, February 2007.