CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled Graphics Hardware Weiguo Liu Fraunhofer IDM Centre@NTU
Contents
• Motivation
• Background
• Mapping BLASTP onto GPU Clusters
• Performance Evaluation
GPGPU
• Commodity components
– Low cost or even zero cost (if users already have access to a modern GPU)
– Easy upgrading to next-generation GPUs
• High performance/price ratio
• Enhanced programmability
– Can be used for general-purpose computing, e.g. scientific computing, image processing, bioinformatics (see www.gpgpu.org)
GPU-accelerated Hybrid Architectures
Smith-Waterman Algorithm
• Performs an exhaustive search for the optimal local alignment of two sequences
• Aligns two sequences S1 and S2 of lengths l1 and l2 using the recurrences:
H(i,j) = \max\{\, 0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + Sbt(S1_i, S2_j) \,\}, \quad 1 \le i \le l_1,\; 1 \le j \le l_2
E(i,j) = \max\{\, H(i,j-1) - \alpha,\; E(i,j-1) - \beta \,\}, \qquad F(i,j) = \max\{\, H(i-1,j) - \alpha,\; F(i-1,j) - \beta \,\}
H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
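As a concrete illustration, here is a minimal, unoptimized host-side sketch of these recurrences (the function name sw_score and the toy substitution score are assumptions for illustration, not part of the CUDA-BLASTP code); it returns the optimal local alignment score, i.e. the maximum H(i, j):

// Sketch only: O(l1*l2) Smith-Waterman scoring with affine gap penalties,
// following the recurrences above (alpha = gap open, beta = gap extension).
#include <algorithm>
#include <cstring>
#include <vector>

static int Sbt(char x, char y) { return x == y ? 2 : -1; }   // toy substitution score

int sw_score(const char* S1, const char* S2, int alpha, int beta) {
    const int l1 = (int)std::strlen(S1), l2 = (int)std::strlen(S2);
    const int W = l2 + 1;                        // row width; row/column 0 hold the zero boundary
    std::vector<int> H((l1 + 1) * W, 0), E((l1 + 1) * W, 0), F((l1 + 1) * W, 0);
    int best = 0;
    for (int i = 1; i <= l1; ++i) {
        for (int j = 1; j <= l2; ++j) {
            E[i*W + j] = std::max(H[i*W + j-1] - alpha, E[i*W + j-1] - beta);
            F[i*W + j] = std::max(H[(i-1)*W + j] - alpha, F[(i-1)*W + j] - beta);
            int h = std::max({0, E[i*W + j], F[i*W + j],
                              H[(i-1)*W + j-1] + Sbt(S1[i-1], S2[j-1])});
            H[i*W + j] = h;
            best = std::max(best, h);            // local alignment: track the matrix maximum
        }
    }
    return best;
}

With the toy Sbt and α = β = 1 used on the next slide, sw_score("ATCTCGTATGATG", "GTCTATCAC", 1, 1) should reproduce the example matrix that follows.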
Smith-Waterman Algorithm
• Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with
Sbt(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{otherwise} \end{cases}, \qquad \alpha = 1,\; \beta = 1
• With \alpha = \beta = 1 the recurrence simplifies to
H(i,j) = \max\{\, 0,\; H(i-1,j) - 1,\; H(i,j-1) - 1,\; H(i-1,j-1) + Sbt(S1_i, S2_j) \,\}
[Figure: the filled dynamic-programming matrix H with S1 as rows and S2 as columns; tracing back from its maximum entry yields the optimal local alignment, shown on the slide as ATCTCGTATGATG aligned against GTC−TATCAC]
BLASTP Algorithm
• The Smith-Waterman algorithm is too compute-intensive for scanning large databases
• Heuristic approach:
– Assumes good alignments contain short exact matches
– Finds such matches quickly using data structures such as lookup tables
– The identified short matches are used as seeds for further detailed analysis
BLASTP Algorithm
• BLASTP: the Basic Local Alignment Search Tool for Proteins
• Four-stage pipeline from the database to the displayed alignments:
– Stage 1: Word Matching (produces hits)
– Stage 2: Ungapped Extension (produces HSPs)
– Stage 3: Gapped Extension (produces HSAs)
– Stage 4: Traceback & Display (produces the final alignments)
• Stage 1: Word Matching
– Each hit is defined as an offset pair (i, j) for which \sum_{k=0}^{w-1} sbt(Q[i+k], D[j+k]) \ge T
– w, T, and sbt are input parameters; Q is the query and D a database sequence (a brute-force sketch of this test follows below)
• Stage 2: Ungapped Extension
– Outputs high-scoring segment pairs (HSPs)
– Performs an ungapped extension on a diagonal that contains a non-overlapping hit pair
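For illustration only, the Stage-1 hit condition above can be written as the following brute-force scan (hypothetical helper name find_hits; a real BLASTP implementation replaces the inner loops with a lookup table or the compressed DFA described later):

// Sketch: quadratic-time enumeration of Stage-1 word hits (illustration only).
#include <cstring>
#include <vector>

struct Hit { int i, j; };                  // offset pair into query Q and database sequence D

std::vector<Hit> find_hits(const char* Q, const char* D, int w, int T,
                           int (*sbt)(char, char)) {
    std::vector<Hit> hits;
    const int lq = (int)std::strlen(Q), ld = (int)std::strlen(D);
    for (int i = 0; i + w <= lq; ++i)
        for (int j = 0; j + w <= ld; ++j) {
            int score = 0;
            for (int k = 0; k < w; ++k) score += sbt(Q[i + k], D[j + k]);
            if (score >= T) hits.push_back({i, j});   // seed for the ungapped extension stage
        }
    return hits;
}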
BLASTP Algorithm
• Stages 3 and 4: Gapped Extension
– Computes high-scoring alignments (HSAs) by performing a gapped alignment algorithm using the HSPs as seeds
– The traceback path is calculated and displayed
• Execution profiling for scanning GenBank:
– Stage 1: 37%
– Stage 2: 31%
– Stages 3 and 4: 32%
Mapping BLASTP onto GPU Clusters
• Worker Level Data Parallelization
– The master node partitions the subject database into multiple batches and distributes them to the workers
• GPU Level Data Parallelization
– Each worker processes its allocated data batches on the GPU
Mapping BLASTP onto GPU Clusters
• Work Generator (Master): performs data pre-processing and partitioning
• Data Crunchers (Workers): perform the data scanning tasks on their local data
• Result Assimilator (Master): merges the results from the workers and produces the final output
Worker Level Data Parallelization
• The database should be sorted
• All workers should be assigned roughly the same amount of computation
• The size of each database batch should be chosen so that the computation per batch is sufficient to justify the runtime overhead; this is controlled by a batch factor
Worker Level Data Parallelization • Database sorting and partitioning
Worker Level Data Parallelization • Data distribution: Normal way
Worker Level Data Parallelization • Data distribution: Interleaved way
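A small sketch of the two distribution strategies (hypothetical helper, assuming the batches have already been built from the sorted database): the normal way assigns contiguous runs of batches to each worker, while the interleaved way deals the sorted batches out round-robin so that every worker receives a similar mix of long and short sequences.

// Sketch: assigning sorted database batches to workers (illustration only).
#include <vector>

// Batches are assumed to be ordered as produced from the sorted database; returns,
// for each worker, the indices of the batches it will scan.
std::vector<std::vector<int>> distribute(int num_batches, int num_workers, bool interleaved) {
    std::vector<std::vector<int>> plan(num_workers);
    const int block = (num_batches + num_workers - 1) / num_workers;  // batches per worker, normal way
    for (int b = 0; b < num_batches; ++b) {
        const int w = interleaved ? b % num_workers   // round-robin keeps workloads similar
                                  : b / block;        // contiguous blocks ("normal" way)
        plan[w].push_back(b);
    }
    return plan;
}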
GPU Level Data Parallelization
• CPU host: initialization and data preprocessing; the local database batch is split into subsets 1 … m1·n1
• GPU kernel 1: coarse-grained processing for Stages 1 and 2, one thread per database subset (threads 1 … m1·n1)
• HSP readback to CPU; the resulting HSPs are grouped into subsets 1 … m2
• GPU kernel 2: fine-grained processing for Stage 3, one thread block per HSP subset (thread blocks 1 … m2, each with threads 1 … n2)
• HSA readback to CPU
• CPU host: Stage 4 calculation and final output
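The pipeline above can be summarized by the following CUDA skeleton (all names, signatures, and launch configurations are illustrative assumptions; the kernel bodies are elided): kernel 1 assigns one thread per database subset for Stages 1 and 2, and kernel 2 assigns one thread block per HSP subset so that its threads can cooperate on the Stage-3 gapped extension.

// Illustrative skeleton of the two-kernel structure; real kernels also take
// the query profile, scoring parameters, DFA tables, etc.
struct Hsp { int qoff, doff, score; };    // high-scoring segment pair (Stage 2 output)
struct Hsa { int score; };                // high-scoring alignment (Stage 3 output)

__global__ void coarse_kernel(const char* db, const int* subset_offsets, int num_subsets,
                              Hsp* hsps, int* hsp_count) {
    const int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per database subset
    if (s >= num_subsets) return;
    // Stages 1-2: word matching + ungapped extension on subset s; surviving
    // HSPs are appended via atomicAdd(hsp_count, 1).
}

__global__ void fine_kernel(const Hsp* hsps, int num_hsp_subsets, Hsa* hsas) {
    const int h = blockIdx.x;                               // one thread block per HSP subset
    if (h >= num_hsp_subsets) return;
    // Stage 3: the threads of this block cooperate on the gapped extension,
    // e.g. computing one anti-diagonal of the DP matrix per step in shared memory.
}

// Host-side driver run by each worker for its local database batch:
//   1. copy the batch to the GPU and launch coarse_kernel
//   2. read the HSPs back and group them into subsets on the CPU
//   3. launch fine_kernel and read the HSAs back
//   4. Stage 4 (traceback and output) runs on the CPU host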
Coarse-Grained Parallel Algorithm
• Utilizes a compressed deterministic finite-state automaton (DFA) for the Stage-1 word matching
• Illustration of the compressed DFA for w = 3: the states i = 0 … 399 correspond to the two-residue prefixes AA … YY; for each state, DFA[i].next = DFA[(20*i) % (20^(w-1))] and DFA[i].nextWords = CurrentBlock, where char *CurrentBlock[0…19] holds one entry per final residue A … Y (nil if the corresponding word has no hit in the query, otherwise a pointer to the matching query positions)
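A possible in-memory layout for this structure, inferred from the figure (the field names follow the slide; everything else, including how next is dereferenced while scanning, is an assumption):

// Sketch of the compressed DFA for w = 3 over a 20-letter amino-acid alphabet.
#define ALPHA       20                    // amino-acid codes 0..19 (A..Y on the slide)
#define NUM_STATES  (ALPHA * ALPHA)       // 20^(w-1) = 400 prefix states (AA..YY)

struct DfaState {
    struct DfaState* next;      // = &DFA[(20*i) % NUM_STATES], as on the slide; the state
                                //   reached after reading residue c is then next + c
    char**           nextWords; // CurrentBlock: 20 pointers, one per final residue; nil if
                                //   the 3-letter word has no neighbourhood hit in the query,
                                //   otherwise a list of matching query positions
};

static struct DfaState DFA[NUM_STATES];

static void build_transitions(void) {
    for (int i = 0; i < NUM_STATES; ++i)
        DFA[i].next = &DFA[(20 * i) % NUM_STATES];   // shift the 2-residue prefix left by one
}

// Assumed usage while scanning a database sequence d[0..len) of residue codes:
//   struct DfaState* s = &DFA[d[0] * ALPHA + d[1]];
//   for (int p = 2; p < len; ++p) {
//       if (s->nextWords && s->nextWords[d[p]])      // word ending at position p hits the query
//           report_hits(s->nextWords[d[p]], p);
//       s = s->next + d[p];                          // advance the automaton
//   }

With such a table, Stage 1 touches each database residue once instead of re-scoring every w-length window.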
Fine-Grained Parallel Algorithm
[Figure: panels (a) and (b) illustrating the fine-grained parallelization of the gapped extension over subject sequences S1, S2, S3]
Performance Evaluation
• mpiCUDA-BLASTP: implemented using CUDA 3.2 and MPICH2
• Evaluated on a GPU cluster with four computing nodes, each with access to one C1060 GPU; each node has an AMD Opteron 2378 quad-core 2.4 GHz processor and 8 GB RAM
• Compared to GPU-BLAST 1.0-2.2.24 running on a single node of the above cluster
• GenBank NR database: 12,852,469 protein sequences
Performance Evaluation
• ROC scores of the search results on the ASTRAL SCOP database version 1.75
Performance Evaluation
Performance Evaluation • Runtimes of mpiCUDA-BLASTP using different batch distribution strategies
Performance Evaluation • Runtimes of mpiCUDA-BLASTP using different batch factors
Performance Evaluation • Performance comparison between mpiCUDA-BLASTP and multi-threaded GPU-BLAST for scanning the GenBank NR
Thank You!