CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled Graphics Hardware Weiguo Liu Fraunhofer IDM Centre@NTU
Contents
• Motivation
• Background
• Mapping BLASTP onto GPU Clusters
• Performance Evaluation
GPGPU
• Commodity components
– Low cost or even zero cost (if users already have access to a modern GPU)
– Easy upgrading to next-generation GPUs
• High performance/price ratio
• Enhanced programmability
– Can be used for general-purpose computing, e.g. scientific computing, image processing, bioinformatics (see www.gpgpu.org)
GPU-accelerated Hybrid Architectures
Smith-Waterman Algorithm
• Performs an exhaustive search for the optimal local alignment of two sequences
• Aligns two sequences S1 and S2 of lengths l1 and l2 using the recurrences:
H(i,j) = \max\{\, 0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + Sbt(S1_i, S2_j) \,\}, \quad 1 \le i \le l_1,\; 1 \le j \le l_2
E(i,j) = \max\{\, H(i,j-1) - \alpha,\; E(i,j-1) - \beta \,\}, \qquad F(i,j) = \max\{\, H(i-1,j) - \alpha,\; F(i-1,j) - \beta \,\}
H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
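As a concrete illustration, here is a minimal, unoptimized host-side sketch of these recurrences (the function name sw_score and the toy substitution score are assumptions for illustration, not part of the CUDA-BLASTP code); it returns the optimal local alignment score, i.e. the maximum H(i, j):

// Sketch only: O(l1*l2) Smith-Waterman scoring with affine gap penalties,
// following the recurrences above (alpha = gap open, beta = gap extension).
#include <algorithm>
#include <cstring>
#include <vector>

static int Sbt(char x, char y) { return x == y ? 2 : -1; }   // toy substitution score

int sw_score(const char* S1, const char* S2, int alpha, int beta) {
    const int l1 = (int)std::strlen(S1), l2 = (int)std::strlen(S2);
    const int W = l2 + 1;                        // row width; row/column 0 hold the zero boundary
    std::vector<int> H((l1 + 1) * W, 0), E((l1 + 1) * W, 0), F((l1 + 1) * W, 0);
    int best = 0;
    for (int i = 1; i <= l1; ++i) {
        for (int j = 1; j <= l2; ++j) {
            E[i*W + j] = std::max(H[i*W + j-1] - alpha, E[i*W + j-1] - beta);
            F[i*W + j] = std::max(H[(i-1)*W + j] - alpha, F[(i-1)*W + j] - beta);
            int h = std::max({0, E[i*W + j], F[i*W + j],
                              H[(i-1)*W + j-1] + Sbt(S1[i-1], S2[j-1])});
            H[i*W + j] = h;
            best = std::max(best, h);            // local alignment: track the matrix maximum
        }
    }
    return best;
}

With the toy Sbt and α = β = 1 used on the next slide, sw_score("ATCTCGTATGATG", "GTCTATCAC", 1, 1) should reproduce the example matrix that follows.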
Smith-Waterman Algorithm
• Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with
Sbt(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{otherwise} \end{cases}, \qquad \alpha = 1,\; \beta = 1
• With \alpha = \beta = 1 the recurrence simplifies to
H(i,j) = \max\{\, 0,\; H(i-1,j) - 1,\; H(i,j-1) - 1,\; H(i-1,j-1) + Sbt(S1_i, S2_j) \,\}
[Figure: the filled dynamic-programming matrix H with S1 as rows and S2 as columns; tracing back from its maximum entry yields the optimal local alignment, shown on the slide as ATCTCGTATGATG aligned against GTC−TATCAC]
BLASTP Algorithm
• The Smith-Waterman algorithm is too compute-intensive for scanning large databases
• Heuristic approach:
– Assumes good alignments contain short exact matches
– Finds such matches quickly using data structures such as lookup tables
– The identified short matches are used as seeds for further detailed analysis
BLASTP Algorithm
• BLASTP: the Basic Local Alignment Search Tool for Proteins
• Four-stage pipeline from the database to the displayed alignments:
– Stage 1: Word Matching (produces hits)
– Stage 2: Ungapped Extension (produces HSPs)
– Stage 3: Gapped Extension (produces HSAs)
– Stage 4: Traceback & Display (produces the final alignments)
• Stage 1: Word Matching
– Each hit is defined as an offset pair (i, j) for which \sum_{k=0}^{w-1} sbt(Q[i+k], D[j+k]) \ge T
– w, T, and sbt are input parameters; Q is the query and D a database sequence (a brute-force sketch of this test follows below)
• Stage 2: Ungapped Extension
– Outputs high-scoring segment pairs (HSPs)
– Performs an ungapped extension on a diagonal that contains a non-overlapping hit pair
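For illustration only, the Stage-1 hit condition above can be written as the following brute-force scan (hypothetical helper name find_hits; a real BLASTP implementation replaces the inner loops with a lookup table or the compressed DFA described later):

// Sketch: quadratic-time enumeration of Stage-1 word hits (illustration only).
#include <cstring>
#include <vector>

struct Hit { int i, j; };                  // offset pair into query Q and database sequence D

std::vector<Hit> find_hits(const char* Q, const char* D, int w, int T,
                           int (*sbt)(char, char)) {
    std::vector<Hit> hits;
    const int lq = (int)std::strlen(Q), ld = (int)std::strlen(D);
    for (int i = 0; i + w <= lq; ++i)
        for (int j = 0; j + w <= ld; ++j) {
            int score = 0;
            for (int k = 0; k < w; ++k) score += sbt(Q[i + k], D[j + k]);
            if (score >= T) hits.push_back({i, j});   // seed for the ungapped extension stage
        }
    return hits;
}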
BLASTP Algorithm
• Stages 3 and 4: Gapped Extension
– Computes high-scoring alignments (HSAs) by performing a gapped alignment algorithm using the HSPs as seeds
– The traceback path is calculated and displayed
• Execution profiling for scanning GenBank:
– Stage 1: 37%
– Stage 2: 31%
– Stages 3 and 4: 32%
Mapping BLASTP onto GPU Clusters
• Worker Level Data Parallelization
– The master node partitions the subject database into multiple batches and distributes them to the workers
• GPU Level Data Parallelization
– Each worker processes its allocated data batches on the GPU
Mapping BLASTP onto GPU Clusters
• Work Generator (Master): performs data pre-processing and partitioning
• Data Crunchers (Workers): perform the data scanning tasks on their local data
• Result Assimilator (Master): merges the results from the workers and produces the final output
Worker Level Data Parallelization
• The database should be sorted
• All workers should be assigned roughly the same amount of computation
• The size of each database batch should be chosen so that the computation per batch is sufficient to justify the runtime overhead; this is controlled by a batch factor
Worker Level Data Parallelization • Database sorting and partitioning
Worker Level Data Parallelization • Data distribution: Normal way
Worker Level Data Parallelization • Data distribution: Interleaved way
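A small sketch of the two distribution strategies (hypothetical helper, assuming the batches have already been built from the sorted database): the normal way assigns contiguous runs of batches to each worker, while the interleaved way deals the sorted batches out round-robin so that every worker receives a similar mix of long and short sequences.

// Sketch: assigning sorted database batches to workers (illustration only).
#include <vector>

// Batches are assumed to be ordered as produced from the sorted database; returns,
// for each worker, the indices of the batches it will scan.
std::vector<std::vector<int>> distribute(int num_batches, int num_workers, bool interleaved) {
    std::vector<std::vector<int>> plan(num_workers);
    const int block = (num_batches + num_workers - 1) / num_workers;  // batches per worker, normal way
    for (int b = 0; b < num_batches; ++b) {
        const int w = interleaved ? b % num_workers   // round-robin keeps workloads similar
                                  : b / block;        // contiguous blocks ("normal" way)
        plan[w].push_back(b);
    }
    return plan;
}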
GPU Level Data Parallelization
• CPU host: initialization and data preprocessing; the local database batch is split into subsets 1 … m1·n1
• GPU kernel 1: coarse-grained processing for Stages 1 and 2, one thread per database subset (threads 1 … m1·n1)
• HSP readback to CPU; the resulting HSPs are grouped into subsets 1 … m2
• GPU kernel 2: fine-grained processing for Stage 3, one thread block per HSP subset (thread blocks 1 … m2, each with threads 1 … n2)
• HSA readback to CPU
• CPU host: Stage 4 calculation and final output
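The pipeline above can be summarized by the following CUDA skeleton (all names, signatures, and launch configurations are illustrative assumptions; the kernel bodies are elided): kernel 1 assigns one thread per database subset for Stages 1 and 2, and kernel 2 assigns one thread block per HSP subset so that its threads can cooperate on the Stage-3 gapped extension.

// Illustrative skeleton of the two-kernel structure; real kernels also take
// the query profile, scoring parameters, DFA tables, etc.
struct Hsp { int qoff, doff, score; };    // high-scoring segment pair (Stage 2 output)
struct Hsa { int score; };                // high-scoring alignment (Stage 3 output)

__global__ void coarse_kernel(const char* db, const int* subset_offsets, int num_subsets,
                              Hsp* hsps, int* hsp_count) {
    const int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per database subset
    if (s >= num_subsets) return;
    // Stages 1-2: word matching + ungapped extension on subset s; surviving
    // HSPs are appended via atomicAdd(hsp_count, 1).
}

__global__ void fine_kernel(const Hsp* hsps, int num_hsp_subsets, Hsa* hsas) {
    const int h = blockIdx.x;                               // one thread block per HSP subset
    if (h >= num_hsp_subsets) return;
    // Stage 3: the threads of this block cooperate on the gapped extension,
    // e.g. computing one anti-diagonal of the DP matrix per step in shared memory.
}

// Host-side driver run by each worker for its local database batch:
//   1. copy the batch to the GPU and launch coarse_kernel
//   2. read the HSPs back and group them into subsets on the CPU
//   3. launch fine_kernel and read the HSAs back
//   4. Stage 4 (traceback and output) runs on the CPU host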
Coarse-Grained Parallel Algorithm
• Utilizes a compressed deterministic finite-state automaton (DFA) for the Stage-1 word matching
• Illustration of the compressed DFA for w = 3: the states i = 0 … 399 correspond to the two-residue prefixes AA … YY; for each state, DFA[i].next = DFA[(20*i) % (20^(w-1))] and DFA[i].nextWords = CurrentBlock, where char *CurrentBlock[0…19] holds one entry per final residue A … Y (nil if the corresponding word has no hit in the query, otherwise a pointer to the matching query positions)
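A possible in-memory layout for this structure, inferred from the figure (the field names follow the slide; everything else, including how next is dereferenced while scanning, is an assumption):

// Sketch of the compressed DFA for w = 3 over a 20-letter amino-acid alphabet.
#define ALPHA       20                    // amino-acid codes 0..19 (A..Y on the slide)
#define NUM_STATES  (ALPHA * ALPHA)       // 20^(w-1) = 400 prefix states (AA..YY)

struct DfaState {
    struct DfaState* next;      // = &DFA[(20*i) % NUM_STATES], as on the slide; the state
                                //   reached after reading residue c is then next + c
    char**           nextWords; // CurrentBlock: 20 pointers, one per final residue; nil if
                                //   the 3-letter word has no neighbourhood hit in the query,
                                //   otherwise a list of matching query positions
};

static struct DfaState DFA[NUM_STATES];

static void build_transitions(void) {
    for (int i = 0; i < NUM_STATES; ++i)
        DFA[i].next = &DFA[(20 * i) % NUM_STATES];   // shift the 2-residue prefix left by one
}

// Assumed usage while scanning a database sequence d[0..len) of residue codes:
//   struct DfaState* s = &DFA[d[0] * ALPHA + d[1]];
//   for (int p = 2; p < len; ++p) {
//       if (s->nextWords && s->nextWords[d[p]])      // word ending at position p hits the query
//           report_hits(s->nextWords[d[p]], p);
//       s = s->next + d[p];                          // advance the automaton
//   }

With such a table, Stage 1 touches each database residue once instead of re-scoring every w-length window.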
Fine-Grained Parallel Algorithm
[Figure: panels (a) and (b) illustrating the fine-grained parallelization of the gapped extension over subject sequences S1, S2, S3]
Performance Evaluation
• mpiCUDA-BLASTP: implemented using CUDA 3.2 and MPICH2
• Evaluated on a GPU cluster with four computing nodes, each with access to one C1060 GPU; each node has an AMD Opteron 2378 quad-core 2.4 GHz processor and 8 GB RAM
• Compared to GPU-BLAST 1.0-2.2.24 running on a single node of the above cluster
• GenBank NR database: 12,852,469 protein sequences
Performance Evaluation
• ROC scores of the search results on the ASTRAL SCOP database version 1.75
Performance Evaluation
Performance Evaluation • Runtimes of mpiCUDA-BLASTP using different batch distribution strategies
Performance Evaluation • Runtimes of mpiCUDA-BLASTP using different batch factors
Performance Evaluation • Performance comparison between mpiCUDA-BLASTP and multi-threaded GPU-BLAST for scanning the GenBank NR
Thank You!