CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled Graphics Hardware
Weiguo Liu, Fraunhofer IDM Centre@NTU

Contents
• Motivation
• Background
• Mapping BLASTP onto GPU Clusters
• Performance Evaluation

GPGPU
• Commodity components
  – Low cost or even zero cost (if users already have access to a modern GPU)
  – Easy upgrading to next-generation GPUs
• High performance/price ratio
• Enhanced programmability
  – Can be used for general-purpose computing, e.g. scientific computing, image processing, bioinformatics (see www.gpgpu.org)

GPU-accelerated Hybrid Architectures

Smith-Waterman Algorithm
• Performs an exhaustive search for the optimal local alignment of two sequences
• Aligning S1 and S2 of lengths l1 and l2 using the recurrences:

$$H(i,j) = \max \begin{cases} 0 \\ E(i,j) \\ F(i,j) \\ H(i-1,j-1) + Sbt(S1_i, S2_j) \end{cases}, \quad 1 \le i \le l_1,\ 1 \le j \le l_2$$

$$E(i,j) = \max \begin{cases} H(i,j-1) - \alpha \\ E(i,j-1) - \beta \end{cases}, \qquad F(i,j) = \max \begin{cases} H(i-1,j) - \alpha \\ F(i-1,j) - \beta \end{cases}$$

$$H(0,j) = F(0,j) = 0, \qquad H(i,0) = E(i,0) = 0$$

Smith-Waterman Algorithm
• Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

$$Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{otherwise} \end{cases}, \qquad \alpha = 1,\ \beta = 1$$

[Slide figure: the dynamic-programming matrix H(i,j) filled for S1 = ATCTCGTATGATG (rows) and S2 = GTCTATCAC (columns); the highest-scoring cell marks the end point of the optimal local alignment, which is recovered by traceback and displayed on the slide.]

With α = β = 1 the recurrence simplifies to

$$H(i,j) = \max \begin{cases} 0 \\ H(i-1,j) - 1 \\ H(i,j-1) - 1 \\ H(i-1,j-1) + Sbt(S1_i, S2_j) \end{cases}$$
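A minimal host-side sketch of this matrix fill with the linear gap penalty of 1 used in the example (α = β = 1); the function names and the score-only formulation (no traceback) are illustrative, not part of CUDA-BLASTP.

```cuda
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Substitution score from the example slide: +2 for a match, -1 for a mismatch.
static int sbt(char x, char y) { return x == y ? 2 : -1; }

// Fills the Smith-Waterman matrix H with a linear gap penalty of 1
// (alpha = beta = 1 collapses the affine recurrence to this form)
// and returns the best local alignment score.
static int smith_waterman_score(const std::string& s1, const std::string& s2) {
    const size_t l1 = s1.size(), l2 = s2.size();
    std::vector<std::vector<int>> H(l1 + 1, std::vector<int>(l2 + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= l1; ++i) {
        for (size_t j = 1; j <= l2; ++j) {
            int diag = H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]);
            int up   = H[i - 1][j] - 1;   // gap in s2
            int left = H[i][j - 1] - 1;   // gap in s1
            H[i][j] = std::max({0, diag, up, left});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}

int main() {
    // The example pair from the slide.
    printf("best local score = %d\n",
           smith_waterman_score("ATCTCGTATGATG", "GTCTATCAC"));
    return 0;
}
```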

BLASTP Algorithm
• The Smith-Waterman algorithm is too compute-intensive
• Heuristic approach:
  – Assumes good alignments contain short exact matches
  – Finds such matches quickly using data structures such as lookup tables
  – Identified short matches are used as seeds for further detailed analysis

BLASTP Algorithm
BLASTP: the Basic Local Alignment Search Tool for Proteins

[Slide figure: the four-stage BLASTP pipeline]
database → Stage 1: Word Matching → hits → Stage 2: Ungapped Extension → HSPs → Stage 3: Gapped Extension → HSAs → Stage 4: Traceback & Display alignments

• Stage 1: Word Matching
  – Each hit is defined as an offset pair (i, j) for which $\sum_{k=0}^{w-1} sbt(Q[i+k], D[j+k]) \ge T$ (the hit test is sketched in the code after this list)
  – w, T, and sbt are input parameters; Q is the query and D is the database sequence
• Stage 2: Ungapped Extension
  – Outputs high-scoring segment pairs (HSPs)
  – Performs an ungapped extension on a diagonal that contains a non-overlapping hit pair
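A minimal sketch of this Stage 1 hit criterion; the word length W, threshold T, and the toy substitution function below are illustrative placeholders rather than BLASTP's actual parameters or BLOSUM62 scores.

```cuda
#include <cstdio>

// Illustrative parameters (not CUDA-BLASTP's actual values).
constexpr int W = 3;   // word length w
constexpr int T = 11;  // neighborhood threshold

// Hypothetical substitution score; a real implementation would index a
// scoring matrix such as BLOSUM62 over the 20 amino acids.
int sbt(char a, char b) { return a == b ? 5 : -1; }

// Returns true if the offset pair (i, j) is a hit, i.e. the w-mer of the
// query starting at i scores at least T against the w-mer of the database
// sequence starting at j.
bool is_hit(const char* Q, int i, const char* D, int j) {
    int score = 0;
    for (int k = 0; k < W; ++k)
        score += sbt(Q[i + k], D[j + k]);
    return score >= T;
}

int main() {
    const char* Q = "MKTAYIAKQR";
    const char* D = "GKTAYLAKQD";
    // BLASTP enumerates candidate (i, j) pairs via a lookup structure instead
    // of testing them one by one; this only illustrates the hit criterion.
    printf("hit at (1,1): %s\n", is_hit(Q, 1, D, 1) ? "yes" : "no");
    return 0;
}
```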

BLASTP Algorithm
• Stages 3 and 4: Gapped Extension
  – Compute high-scoring alignments (HSAs) by performing a gapped alignment algorithm using HSPs as seeds
  – The traceback path is calculated and displayed
• Execution profiling for scanning GenBank:
  – Stage 1: 37%
  – Stage 2: 31%
  – Stages 3 and 4: 32%

Mapping BLASTP onto GPU Clusters
• Worker Level Data Parallelization
  – The master node partitions the subject database into multiple batches and distributes them to the workers
• GPU Level Data Parallelization
  – Workers process the allocated data batches using their GPUs

Mapping BLASTP onto GPU Clusters
• Work Generator (Master): performs data pre-processing and partitioning
• Data Crunching (Workers): performs the data scanning tasks on local data
• Result Assimilator (Master): merges results from the workers and produces the final output

Worker Level Data Parallelization
• The database should be sorted
• All workers should be assigned roughly the same amount of computation
• The size of each database batch should be chosen so that the amount of computation is sufficient to justify the runtime overhead: the batch factor

Worker Level Data Parallelization
• Database sorting and partitioning

Worker Level Data Parallelization
• Data distribution: Normal way

Worker Level Data Parallelization
• Data distribution: Interleaved way
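A minimal sketch of the interleaved distribution idea, assuming the database has already been sorted by sequence length; the round-robin assignment below is only an illustration of the load-balancing concept, not CUDA-BLASTP's actual partitioning code.

```cuda
#include <cstdio>
#include <vector>

// After sorting the database by sequence length, dealing sequences to the
// workers in an interleaved (round-robin) fashion gives every worker a
// similar mix of long and short sequences, and therefore a similar amount
// of computation.
std::vector<std::vector<int>> interleave(const std::vector<int>& sorted_lengths,
                                         int num_workers) {
    std::vector<std::vector<int>> batches(num_workers);
    for (size_t s = 0; s < sorted_lengths.size(); ++s)
        batches[s % num_workers].push_back(sorted_lengths[s]);
    return batches;
}

int main() {
    // Hypothetical sequence lengths, sorted in descending order.
    std::vector<int> lengths = {900, 800, 700, 600, 500, 400, 300, 200};
    auto batches = interleave(lengths, 4);
    for (size_t w = 0; w < batches.size(); ++w) {
        long total = 0;
        for (int len : batches[w]) total += len;
        printf("worker %zu: total residues = %ld\n", w, total);
    }
    return 0;
}
```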

GPU Level Data Parallelization

[Slide figure: processing flow between the CPU host and the two GPU kernels]
• CPU host: initialization and data preprocessing; the local database batch is split into DB subsets 1 … m1·n1
• GPU kernel 1: coarse-grained processing for Stages 1 and 2, one thread (1 … m1·n1) per DB subset; HSP readback to the CPU
• CPU host: the HSPs are grouped into HSP subsets 1 … m2
• GPU kernel 2: fine-grained processing for Stage 3, one thread block (1 … m2, each with threads 1 … n2) per HSP subset; HSA readback to the CPU
• CPU host: Stage 4 calculation and final output
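A minimal host/device sketch of this two-kernel structure; the kernel bodies are empty placeholders and all names, types, and launch dimensions are illustrative rather than the actual CUDA-BLASTP implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder result records; the real structures carry positions and scores.
struct HSP { int score; };
struct HSA { int score; };

// Kernel 1 (coarse-grained): one thread scans one database subset,
// performing Stage 1 word matching and Stage 2 ungapped extension.
__global__ void coarse_kernel(int num_subsets, HSP* hsps) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_subsets) return;
    // ... Stages 1 and 2 over DB subset t would run here ...
    hsps[t].score = 0;  // placeholder result
}

// Kernel 2 (fine-grained): one thread block performs the Stage 3 gapped
// extension for one HSP seed, its threads cooperating on the alignment.
__global__ void fine_kernel(const HSP* hsps, int num_hsps, HSA* hsas) {
    int b = blockIdx.x;
    if (b >= num_hsps) return;
    // ... cooperative gapped extension of HSP b would run here ...
    if (threadIdx.x == 0) hsas[b].score = hsps[b].score;  // placeholder
}

int main() {
    const int num_subsets = 1024, threads_per_block = 128;
    HSP* d_hsps;
    HSA* d_hsas;
    cudaMalloc(&d_hsps, num_subsets * sizeof(HSP));
    cudaMalloc(&d_hsas, num_subsets * sizeof(HSA));

    // Kernel 1, HSP readback/regrouping on the host (omitted), then kernel 2.
    int blocks = (num_subsets + threads_per_block - 1) / threads_per_block;
    coarse_kernel<<<blocks, threads_per_block>>>(num_subsets, d_hsps);
    fine_kernel<<<num_subsets, threads_per_block>>>(d_hsps, num_subsets, d_hsas);
    cudaDeviceSynchronize();

    cudaFree(d_hsps);
    cudaFree(d_hsas);
    printf("done\n");
    return 0;
}
```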

Coarse-Grained Parallel Algorithm
• Utilizes a compressed deterministic finite-state automaton (DFA) for word matching
• Illustration of the compressed DFA for w = 3:

[Slide figure: the DFA states i = 0 … 399 correspond to the two-letter prefixes AA, …, AY, CA, …, CY, …, YA, …, YY. Each state stores
  DFA[i].next = DFA[(20*i) % (20^(w-1))]
  DFA[i].nextWords = CurrentBlock;
where CurrentBlock is an array of 20 pointers (char* CurrentBlock[0…19]), one per amino acid A, C, D, …, Y; each entry points to the matching information for the corresponding word, or nil if there is none.]
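A minimal sketch of such a compressed DFA, assuming w = 3 and a 20-letter amino-acid alphabet; the layout of the state structure and the idea that each word entry stores query positions are reconstructions for illustration, not CUDA-BLASTP's exact data structures.

```cuda
#include <cstdio>
#include <vector>

constexpr int ALPHA = 20;                   // amino-acid alphabet size
constexpr int W = 3;                        // word length w
constexpr int NUM_STATES = ALPHA * ALPHA;   // 20^(w-1) = 400 states for w = 3

// Each state represents a (w-1)-letter prefix. Reading one more letter c
// completes a w-mer: words[c] holds the query positions assumed to be
// associated with that word (an empty list plays the role of "nil").
struct DFAState {
    std::vector<int> words[ALPHA];
};

int main() {
    std::vector<DFAState> dfa(NUM_STATES);

    // Hypothetical query entry: the w-mer encoded as (0, 3, 7) is associated
    // with query position 0.
    dfa[ALPHA * 0 + 3].words[7].push_back(0);

    // Scanning a database sequence: one transition per residue; the word list
    // of the reached entry yields the hits ending at that position.
    std::vector<int> seq = {0, 3, 7, 0, 3, 7};  // encoded residues (illustrative)
    int state = 0;
    for (size_t j = 0; j < seq.size(); ++j) {
        int c = seq[j];
        if (j >= W - 1) {
            for (int qpos : dfa[state].words[c])
                printf("hit: query pos %d, db pos %zu\n", qpos, j - (W - 1));
        }
        // Transition: drop the oldest letter and append c, matching the
        // slide's DFA[i].next = DFA[(20*i) % 20^(w-1)] block indexed by c.
        state = (ALPHA * state + c) % NUM_STATES;
    }
    return 0;
}
```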

Fine-Grained Parallel Algorithm

[Slide figures (a) and (b): the dynamic-programming matrix of the gapped extension between sequences X and Y; the cells are divided into segments labelled S1, S2, S3 that repeat across the matrix, indicating how the work is spread over the threads of a thread block.]

Performance Evaluation
• mpiCUDA-BLASTP: implemented using CUDA 3.2 and MPICH2
• Performance evaluation on a GPU cluster with four compute nodes, each with access to one C1060 GPU; every node has an AMD Opteron 2378 quad-core 2.4 GHz processor and 8 GB RAM
• Performance comparison to GPU-BLAST 1.0-2.2.24 running on one node of the above GPU cluster
• GenBank NR database: 12,852,469 protein sequences

Performance Evaluation
• ROC scores of search results on the ASTRAL SCOP database version 1.75

Performance Evaluation

Performance Evaluation
• Runtimes of mpiCUDA-BLASTP using different batch distribution strategies

Performance Evaluation
• Runtimes of mpiCUDA-BLASTP using different batch factors

Performance Evaluation
• Performance comparison between mpiCUDA-BLASTP and multi-threaded GPU-BLAST for scanning the GenBank NR database

Thank You!
