Computing Large-Scale Alignments on a Multi-Cluster Chunxi Chen and Bertil Schmidt
Contents • Motivation • Smith-Waterman Algorithm • Multi-Cluster Platform • Parallelization • Performance Evaluation • Conclusions
Motivation • Growth of GenBank
[Figure: base pairs of DNA in GenBank (in billions, axis from 0 to 30) per year, 1993 to 2002]
• Genetic sequence databases are growing exponentially • Growth rate will continue, since multiple concurrent genome projects have begun with more to come
Motivation • Discovered sequences are analyzed by comparison with sequence databases • Complexity of sequence comparison is proportional to the product of query size and database size ⇒ analysis is too slow on sequential computers • There are two approaches: – Heuristics, e.g. BLAST, but the more efficient the heuristic, the worse the quality of the results – Parallel processing, which gives high-quality results in reasonable time
Motivation • Number of completely sequenced genomes: – 105 microbial, 10 Eukaryotic – ~ 10 nearly completed eukaryotic genomes – > 300 genome projects planned or underway (http://www.nslijgenetics.org/seq/)
• Closely related genomes – Eukaryotic example: C. elegans and C. briggsae – Prokaryotic example: E. coli K12 and O157:H7
• What’s Common and What’s Unique – Identify unique & important proteins in pathogens as targets – Identify Genes and Predict Function based on Synteny – Susceptibility to disease (SNPs)
Sequence Alignment • Several algorithms: BLAST, FastA, Smith-Waterman
[Figure: search speed (slower to faster) against data quality (lower to higher); Smith-Waterman sits at the slower end, FastA and BLAST at the faster end]
Smith-Waterman Algorithm • Optimal local alignment of two sequences • Performs an exhaustive search for the optimal local alignment – Complexity O(n×m) for sequence lengths n and m
• Based on the 'dynamic programming' (DP) algorithm – Fill the DP matrix using a substitution (mutation) matrix – Find the maximal value (score) in the matrix – Trace back from the score until a 0 value is reached
Smith-Waterman Algorithm • Aligning S1 and S2 of lengths l1 and l2 using the recurrences:
\[ H(i,j) = \max \begin{cases} 0 \\ E(i,j) \\ F(i,j) \\ H(i-1,j-1) + Sbt(S1_i, S2_j) \end{cases}, \quad 1 \le i \le l1,\ 1 \le j \le l2 \]
\[ E(i,j) = \max \begin{cases} H(i,j-1) - \alpha \\ E(i,j-1) - \beta \end{cases}, \qquad F(i,j) = \max \begin{cases} H(i-1,j) - \alpha \\ F(i-1,j) - \beta \end{cases} \]
\[ H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0 \]
• Calculate three possible ways to extend the alignment – by one AminoAcid (AA) in each sequence – by one AA in the first sequence and align it with a gap in the second – by one AA in the second sequence and align it with a gap in the first
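The H, E and F recurrences can be sketched directly as nested loops. This is a minimal illustration that returns only the maximal score, not the authors' implementation; the function names `sw_affine_score` and `simple_sbt` are hypothetical, and the match/mismatch values 2 and -1 are chosen for illustration.

```python
def simple_sbt(x, y):
    """Illustrative substitution function: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def sw_affine_score(s1, s2, sbt, alpha, beta):
    """Maximal local alignment score from the H, E, F recurrences.

    alpha is the penalty for opening a gap, beta for extending one.
    """
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 hold the boundary values H = E = F = 0.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            # Extend with a gap along j (E) or along i (F).
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            # Extend by one residue in each sequence, or restart at 0.
            diag = H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1])
            H[i][j] = max(0, E[i][j], F[i][j], diag)
            best = max(best, H[i][j])
    return best


print(sw_affine_score("AAA", "AAA", simple_sbt, alpha=1, beta=1))
```

Storing three full matrices keeps the sketch close to the recurrences; a production version would keep only the rows it still needs.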
Smith-Waterman Algorithm • Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with α = 1, β = 1 and
\[ Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases} \]
Recurrence used to fill the matrix:
\[ H(i,j) = \max \begin{cases} 0 \\ H(i-1,j) - 1 \\ H(i,j-1) - 1 \\ H(i-1,j-1) + Sbt(S1_i, S2_j) \end{cases} \]
Similarity matrix (rows: S1, columns: S2):
      ∅  G  T  C  T  A  T  C  A  C
  ∅   0  0  0  0  0  0  0  0  0  0
  A   0  0  0  0  0  2  1  0  2  1
  T   0  0  2  1  2  1  4  3  2  1
  C   0  0  1  4  3  2  3  6  5  4
  T   0  0  2  3  6  5  4  5  5  4
  C   0  0  1  4  5  5  4  6  5  7
  G   0  2  1  3  4  4  4  5  5  6
  T   0  1  4  3  5  4  6  5  4  5
  A   0  0  3  3  4  7  6  5  7  6
  T   0  0  2  2  5  6  9  8  7  6
  G   0  2  1  1  4  5  8  8  7  6
  A   0  1  1  0  3  6  7  7 10  9
  T   0  0  3  2  2  5  8  7  9  9
  G   0  2  2  2  1  4  7  7  8  8
Resulting alignment:
ATCTCGTATGATG
  GTC-TATCAC
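The worked example can be checked with a short script. The sketch below fills the matrix with the simplified recurrence (match 2, mismatch -1, gap 1), locates the maximal score and traces back until a 0 is reached. The function name `smith_waterman` and the tie-breaking order in the traceback (diagonal first) are illustrative choices, not taken from the slides.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (score, aligned part of s1, aligned part of s2)."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap,
                          H[i - 1][j - 1] + sub)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the maximal score until a 0 value is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-")   # gap in s2
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])   # gap in s1
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score)    # 10
print(top)      # TCGTATGA
print(bottom)   # TC-TATCA
```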
Smith-Waterman Algorithm • Aligning two sequences of length N and M – Time Complexity: O(N×M) – Memory Complexity: O(N×M)
• Example – aligning M. pneumoniae (816,394 nucleotides) and M. genitalium (580,074 nucleotides) – around 500 G cells (about 4.7 × 10^11) in the similarity matrix
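The size of this example can be worked out in a few lines. The sequence lengths come from the slide; the 4 bytes per cell is an assumption made only to turn cell counts into memory figures.

```python
n = 816_394   # M. pneumoniae length (from the slide)
m = 580_074   # M. genitalium length (from the slide)

cells = n * m                       # entries in the full similarity matrix
BYTES_PER_CELL = 4                  # assumption: one 32-bit score per cell

full_matrix_bytes = cells * BYTES_PER_CELL
one_row_bytes = (m + 1) * BYTES_PER_CELL   # a single row, including column 0

print(f"cells:       {cells:,}")                          # 473,568,933,156
print(f"full matrix: {full_matrix_bytes / 1e12:.2f} TB")  # about 1.89 TB
print(f"one row:     {one_row_bytes / 1e6:.2f} MB")       # about 2.32 MB
```

Under that assumption the full matrix needs roughly 1.9 TB, while a single row fits in a few megabytes.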
Computing Maximal Score in Linear Space • Idea: store only one row of the similarity matrix
• Compute maximal score in similarity matrix • Computing the path (traceback) requires additional computation
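A minimal sketch of the one-row idea, assuming the simplified scoring of the worked example (match 2, mismatch -1, gap 1). The function name `sw_score_one_row` is hypothetical. It returns the maximal score and the cell where it occurs, but no alignment, since the earlier rows are overwritten.

```python
def sw_score_one_row(s1, s2, match=2, mismatch=-1, gap=1):
    """Maximal local score and its end cell (i, j), using a single row."""
    row = [0] * (len(s2) + 1)          # holds H(i-1, .) and is updated to H(i, .)
    best, best_end = 0, (0, 0)
    for i in range(1, len(s1) + 1):
        diag = row[0]                  # H(i-1, 0)
        for j in range(1, len(s2) + 1):
            up = row[j]                # H(i-1, j), saved before overwriting
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            row[j] = max(0, up - gap, row[j - 1] - gap, diag + sub)
            diag = up                  # becomes H(i-1, j) for the next column
            if row[j] > best:
                best, best_end = row[j], (i, j)
    return best, best_end


print(sw_score_one_row("ATCTCGTATGATG", "GTCTATCAC"))  # (10, (11, 8))
```

Memory grows with the length of one sequence only, which is why the traceback has to be recovered by additional computation.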
Traceback in Linear Space
[Figure: similarity matrix showing the start point, the max point and the optimal midpoint on row i* = N/2; rows N/8, N/4, 3N/8, 5N/8, 3N/4 and 7N/8 mark the further subdivisions]
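One well-known way to recover the path in linear space is Hirschberg's divide-and-conquer scheme: split the first sequence at its middle row, find the column where an optimal path crosses that row, and recurse on both halves. The sketch below applies it to the segment between a start point and a max point, treated as a global alignment. It assumes linear gap costs and is an illustration of the general idea, not necessarily the exact procedure on the slide; `nw_last_row` and `hirschberg` are hypothetical names.

```python
def nw_last_row(a, b, match=2, mismatch=-1, gap=1):
    """Last row of the global alignment score matrix of a against b."""
    row = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        diag, row[0] = row[0], -gap * i
        for j in range(1, len(b) + 1):
            up = row[j]
            sub = match if a[i - 1] == b[j - 1] else mismatch
            row[j] = max(diag + sub, up - gap, row[j - 1] - gap)
            diag = up
    return row


def hirschberg(a, b, match=2, mismatch=-1, gap=1):
    """Optimal global alignment of a and b using linear space per level."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Either pair the single residue with one residue of b, or gap it.
        k = b.find(a)
        pair = (match if k >= 0 else mismatch) - gap * (len(b) - 1)
        if -gap * (len(b) + 1) > pair:
            return a + "-" * len(b), "-" + b
        k = max(k, 0)
        return "-" * k + a + "-" * (len(b) - k - 1), b
    mid = len(a) // 2
    left = nw_last_row(a[:mid], b, match, mismatch, gap)
    right = nw_last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    # Column where an optimal path crosses the middle row.
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    l1, l2 = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    r1, r2 = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return l1 + r1, l2 + r2


top, bottom = hirschberg("TCGTATGA", "TCTATCA")
print(top)
print(bottom)
```

Each level of recursion recomputes score rows, so the total work stays within a constant factor of one full matrix fill while memory stays linear.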
Finding Near-Optimal Non-Overlapping Alignments • Determine the k highest scores with different alignment start points • Use the divide-and-conquer algorithm k times to determine the actual alignments
Similarity matrix for aligning ATCTCGTATGATG (columns C) with GTCTATCAC (rows R):
         C=0  C=1  C=2  C=3  C=4  C=5  C=6  C=7  C=8  C=9 C=10 C=11 C=12 C=13
           ∅    A    T    C    T    C    G    T    A    T    G    A    T    G
R=0  ∅     0    0    0    0    0    0    0    0    0    0    0    0    0    0
R=1  G     0    0    0    0    0    0    2    1    0    0    2    1    0    2
R=2  T     0    0    2    1    2    1    1    4    3    2    1    1    3    2
R=3  C     0    0    1    4    3    4    3    3    3    2    1    0    2    2
R=4  T     0    0    2    3    6    5    4    5    4    5    4    3    2    1
R=5  A     0    2    1    2    5    5    4    4    7    6    5    6    5    4
R=6  T     0    1    4    3    4    4    4    6    6    9    8    7    8    7
R=7  C     0    0    3    6    5    6    5    5    5    8    8    7    7    7
R=8  A     0    2    2    5    5    5    5    4    7    7    7   10    9    8
R=9  C     0    1    1    4    4    7    6    5    6    6    6    9    9    8
Multi-Cluster Platform • High performance at low cost • Geographically distributed • Different bandwidths
[Figure: two clusters at NTU, the Parallel and Distributed Processing Lab (PDP) and the BioInformatics Research Center (BIRC); each has a head node and uses 1 Gbps Myrinet internally, with 30 Mbps bandwidth between the clusters]
Multi-Cluster Programming Environment: MPICH-GLOBUS2
[Figure: in Cluster1 and Cluster2, MPICH processes on the nodes (node n-1, node n) share memory and form the MPICH layer; an MPICH-G2 process on the control node of each cluster forms the MPICH-G2 layer, and the two control nodes communicate over the internet]
Parallelization on Multi-Cluster • Wavefront Computation of DP matrix
[Figure: DP matrix spanned by Sequence X and Sequence Y, partitioned among processors P0, P1 and P2]
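The wavefront order works because every cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed independently. The sketch below shows this at two levels: a cell-level fill ordered by anti-diagonals, and a block-level schedule in which each processor owns one block of columns. Both functions are hypothetical illustrations that run sequentially; the column-block ownership and the scoring values are assumptions.

```python
def sw_wavefront_score(s1, s2, match=2, mismatch=-1, gap=1):
    """Maximal local score, filling the matrix one anti-diagonal at a time."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for d in range(2, l1 + l2 + 1):              # anti-diagonal: i + j == d
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i                            # cells on this diagonal are independent
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + sub)
            best = max(best, H[i][j])
    return best


def wavefront_schedule(num_procs, num_row_blocks):
    """For each time step, list the (processor, row block) pairs that are active.

    Processor p owns column block p and can start row block r once
    processor p-1 has finished the same row block.
    """
    steps = []
    for t in range(num_procs + num_row_blocks - 1):
        steps.append([(p, t - p) for p in range(num_procs)
                      if 0 <= t - p < num_row_blocks])
    return steps


print(sw_wavefront_score("ATCTCGTATGATG", "GTCTATCAC"))  # 10
for step in wavefront_schedule(3, 3):
    print(step)
```

In the schedule, only the first and last steps leave processors idle, which is why larger row-block counts improve utilisation.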
Parallelization on Multi-Cluster • Divide-and-Conquer traceback computation • Define special columns (SCs) as the last columns of the parts of the similarity matrix allocated to each processor, except the last processor • Identify the intersection of an optimal path with the special columns
[Figure: similarity matrix split among processors P0 to P3 by the special columns SC1, SC2 and SC3, with the optimal path crossing the special columns at points 1, 2 and 3]
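To make the notion of an intersection concrete, the sketch below records the rows at which one optimal local path crosses a set of special columns. It is a sequential illustration on a full matrix, not the parallel procedure itself. The function name, the orientation (columns indexed by positions of the first sequence, rows by positions of the second, both 1-based) and the choice of special columns 4, 8 and 12 are all assumptions.

```python
def special_column_hits(s1, s2, special_cols, match=2, mismatch=-1, gap=1):
    """Map each special column to the rows where the optimal path crosses it."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, l1 + 1):                   # i: column (position in s1)
        for j in range(1, l2 + 1):               # j: row (position in s2)
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + sub)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    hits = {c: [] for c in special_cols}
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:       # walk the path backwards
        if i in hits:
            hits[i].append(j)
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return hits


print(special_column_hits("ATCTCGTATGATG", "GTCTATCAC", [4, 8, 12]))
# {4: [2], 8: [5], 12: []}
```

Once the crossing points are known, the segments between consecutive special columns can be aligned independently, which is what makes the traceback divisible among processors.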
Performance Evaluation • Using MPICH in PDP and BIRC to check the performance of every cluster
[Figure: speedup (0 to 8) versus number of processors (0 to 8) for PDP and BIRC, compared with the ideal speedup]
Performance Evaluation • Speedups on the two clusters in PDP for aligning two sequences of length 100,000
[Figure: speedup (0 to 10) versus number of processors (0 to 10) for the two clusters in PDP, compared with the ideal speedup]
Performance Evaluation • Speedups on the PDP and BIRC clusters for aligning two sequences of length 100,000
[Figure: speedup (0 to 20) versus number of processors (0 to 20) for PDP+BIRC, compared with the ideal speedup]
Performance Evaluation • Quality of Alignment Results • Comparison to results of MUMmer – Assumes the two sequences are closely related (highly similar) – Can align two bacterial genomes in under 1 minute – Uses a suffix tree to find Maximal Unique Matches – Maximal Unique Match (MUM) definition: • A subsequence that occurs exactly once in each input sequence, as two exactly matching copies, and that cannot be extended in either direction – Key idea: a significantly long MUM is almost certain to be part of the global alignment
Can you find the MUM?
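The MUM definition can be tried out on short strings with a brute-force search. This sketch is only meant to make the definition concrete: it runs in far worse time than the suffix-tree approach MUMmer uses, and the function names and the `min_len` parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (start in a, start in b, length) for every MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep it only if it is unique in both sequences.
            if (k >= min_len and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, k))
    return mums


print(find_mums("TTGACCA", "CGACCT", min_len=3))  # [(2, 1, 4)]
```

Here the only MUM of length at least 3 is GACC, found at position 2 of the first string and position 1 of the second (0-based).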
Performance Evaluation • Quality of Alignment Results • Align mitochondrial genomes of Human and Drosophila (not closely related)
[Figure: two plots, one for MUMmer and one for Our System, each with Human mitochondrial position (0 to 16000) on the horizontal axis and Drosophila mitochondrial position (0 to 20000) on the vertical axis]
Conclusion and Future Work • Presented how alignment of long DNA sequences based on Dynamic Programming and Divide-and-Conquer can be efficiently parallelized on a Multi-Cluster Architecture • Investigating which other applications can benefit from this type of computing power • Investigating different bandwidth levels