High Performance Direct Pairwise Comparison of Genomic Sequences
Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine
HiCOMB, April 4, 2005, Denver, Colorado

Introduction

Goals
- Generate data for large format visualization
- Exploit parallel features present in commodity hardware
  - SIMD/vector processors
  - SMP/multiple processors per machine
  - Clusters

Genome Comparison
- The dot plot is the only complete method for comparing genomes
- It is often ruled out due to its quadratic running time
- The size of the data has an upper bound, and modern hardware is approaching the point where this bound is (almost) within reach

Target Data
- DNA sequences, one direction (5’ to 3’)

Target Platform
- Apple dual processor G5, Altivec vector processor

Related Work

BLAST
- Apple and Genentech (AGBLAST), 5x speedup using Altivec

Smith-Waterman
- Rognes and Seeberg, 6x speedup using MMX

HMMER
- Erik Lindahl, 30% improvement using Altivec

Hardware Solutions
- Various commercial FPGA solutions exist for different algorithms (e.g., TimeLogic’s DeCypher platform offers BLAST, HMM, SW)

SIMD Overview

Single Instruction, Multiple Data
- Perform the same operation on many data items at once
- Vector registers can be divided according to the data type
- The Altivec registers in the G5 are 128 bits wide

Vector programming using gcc on Apple G5s is one step removed from assembly programming
- Functions are thin wrappers around assembly calls
- The optimizer does not cover vector operations
- Memory loads and stores are handled by the programmer and must be properly byte aligned

Figure: Normal vs. SIMD addition. Normal: 3 + 2 = 5. SIMD: (3, 2, 1, 4) + (2, 4, 5, 9) = (5, 6, 6, 13).
Image from http://developer.apple.com/hardware/ve
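The lane-wise addition in the figure can be modeled in portable C++. This is only a sketch of the semantics, not AltiVec code: `Vec4` and `lane_add` are hypothetical names, and the loop stands in for what the hardware would do in a single instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a vector register holding four integer lanes.
using Vec4 = std::array<int, 4>;

// Models one SIMD add: the same operation is applied to every lane.
// Real vector hardware does this in one instruction; the loop only
// reproduces the result.
Vec4 lane_add(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Reproduces the example from the slide figure.
void check_slide_example() {
    const Vec4 sum = lane_add(Vec4{3, 2, 1, 4}, Vec4{2, 4, 5, 9});
    assert((sum == Vec4{5, 6, 6, 13}));
}
```

With 128-bit registers and 8-bit elements, the same idea covers 16 lanes per instruction, which is the configuration the later slides rely on.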

The Dot Plot

NAÏVE_DOTPLOT(qseq, sseq, win, strig):
  // qseq  - column sequence
  // sseq  - row sequence
  // win   - number of elements to compare for each point
  // strig - number of matches required for a point
  for each q in qseq:
    for each s in sseq:
      score = 0
      for each (q’, s’) in (qseq[q:q+win], sseq[s:s+win]):
        if q’ == s’:
          score += 1
        end if q’
      end for each (q’, s’)
      if score >= strig:
        AddDot(q, s)
      end if score
    end for each s
  end for each q

Figure: Example matrix with qseq along the columns and sseq along the rows, win = 3, strig = 2.

Figure: Dot plot comparing the human and fly mitochondrial genomes (generated by DOTTER)
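The naive algorithm can be sketched in C++ as follows. The function name `naive_dotplot` and the `Dot` alias are hypothetical, and two assumptions are made: only full windows are scored (the pseudocode's slices would also produce shorter windows at the sequence ends), and a dot needs at least `strig` matches.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (q, s): start of the window in qseq and in sseq, as in AddDot(q, s).
using Dot = std::pair<std::size_t, std::size_t>;

// Naive dot plot: O(|qseq| * |sseq| * win) character comparisons.
// Assumption: a dot is emitted when at least `strig` of the `win`
// aligned characters match, and only full windows are considered.
std::vector<Dot> naive_dotplot(const std::string& qseq, const std::string& sseq,
                               std::size_t win, std::size_t strig) {
    assert(win > 0);
    std::vector<Dot> dots;
    for (std::size_t q = 0; q + win <= qseq.size(); ++q) {
        for (std::size_t s = 0; s + win <= sseq.size(); ++s) {
            std::size_t score = 0;
            for (std::size_t k = 0; k < win; ++k) {
                if (qseq[q + k] == sseq[s + k]) {
                    ++score;
                }
            }
            if (score >= strig) {
                dots.emplace_back(q, s);
            }
        }
    }
    return dots;
}
```

For example, comparing "ACGT" with "ACGA" using win = 3 and strig = 2 yields dots at (0, 0) with three matches and (1, 1) with two matches.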

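The Standard Algorithm

STD_DOTPLOT(qScores, s, win, strig):
  // qScores[c] - score vector for character c: one entry per
  //              position in the query sequence q
  // s          - row sequence
  // win        - number of elements to compare for each point
  // strig      - number of matches required for a point
  dotvec = zeros(len(q))
  for each char c in s:
    dotvec = shift(dotvec, 1)
    dotvec += qScores[c]
    if index(c) > win:
      delchar = s[index(c) - win]
      dotvec -= shift(qScores[delchar], win)
    for each dot in dotvec >= strig:
      display(dot)
    end for each dot
  end for each c
end STD_DOTPLOT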
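The key idea of the standard algorithm is the incremental update: each step adds the score of the character entering the window and subtracts the score of the character leaving it. A minimal scalar sketch for a single diagonal is shown here, assuming 0/1 match scores; `diagonal_scores` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Scores every full window along one diagonal of the comparison matrix.
// The diagonal starts at q[q0] versus s[s0]. Each step costs O(1)
// instead of O(win) because the running score is updated incrementally.
std::vector<int> diagonal_scores(const std::string& q, const std::string& s,
                                 std::size_t q0, std::size_t s0, std::size_t win) {
    assert(win > 0 && q0 <= q.size() && s0 <= s.size());
    std::vector<int> scores;
    const std::size_t len = std::min(q.size() - q0, s.size() - s0);
    int running = 0;
    for (std::size_t k = 0; k < len; ++k) {
        // Character pair entering the window.
        running += (q[q0 + k] == s[s0 + k]) ? 1 : 0;
        // Character pair leaving the window.
        if (k >= win) {
            running -= (q[q0 + k - win] == s[s0 + k - win]) ? 1 : 0;
        }
        // Record a score once the window is full.
        if (k + 1 >= win) {
            scores.push_back(running);
        }
    }
    return scores;
}
```

The pseudocode applies the same update to all diagonals at once by shifting a score vector; the sketch isolates one diagonal to make the add and delete steps visible.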

Data Parallel Dot Plot

VECTOR_DOTPLOT(qScores, s, win, strig):
  // Group diagonals by the upper and lower
  // triangular sections of the matrix
  for each vector diagonal D:
    runningScore = vector(0)
    for each char c in s:
      score = VecLoad(qScores[c])
      runningScore = VecAdd(score, runningScore)
      if index(c) > win:
        delChar = s[index(c) - win]
        delscore = VecLoad(qScores[delChar])
        runningScore = VecSub(runningScore, delscore)
      if VecAnyElementGte(runningScore, strig):
        scores = VectorUnpack(runningScore)
        for each score in scores >= strig:
          Output(row(c), col(score), score)
        end for each score
      end if VecAnyElementGte
    end for each c
  end for each D
end VECTOR_DOTPLOT
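The data-parallel version can be emulated in portable C++ by treating a 16-element array as one vector register. This is a sketch under several assumptions, not the authors' implementation: it handles only the upper triangular diagonals (non-negative column offset `diag`), it builds the per-character score vectors internally, and the names `Lanes`, `Hit`, `vec_load`, and `vector_dotplot_block` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kLanes = 16;  // 128-bit register / 8-bit scores
using Lanes = std::array<std::uint8_t, kLanes>;

struct Hit { std::size_t row; std::size_t col; int score; };

// Emulated unaligned 16-byte load; lanes past the end of `v` read as 0.
Lanes vec_load(const std::vector<std::uint8_t>& v, std::size_t start) {
    Lanes out{};
    for (std::size_t l = 0; l < kLanes && start + l < v.size(); ++l) {
        out[l] = v[start + l];
    }
    return out;
}

// Scores 16 adjacent diagonals at once, starting at column offset `diag`.
std::vector<Hit> vector_dotplot_block(const std::string& q, const std::string& s,
                                      std::size_t diag, std::size_t win, int strig) {
    assert(win > 0 && win <= 255);  // scores must fit in 8-bit lanes
    // qScores[c][j] == 1 when q[j] == c (memory-hungry, but simple).
    std::vector<std::vector<std::uint8_t>> qScores(
        256, std::vector<std::uint8_t>(q.size(), 0));
    for (std::size_t j = 0; j < q.size(); ++j) {
        qScores[static_cast<unsigned char>(q[j])][j] = 1;
    }
    Lanes running{};
    std::vector<Hit> hits;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Add the scores for the character entering the window.
        const Lanes add = vec_load(qScores[static_cast<unsigned char>(s[i])], diag + i);
        for (std::size_t l = 0; l < kLanes; ++l) running[l] += add[l];
        // Subtract the scores for the character leaving the window.
        if (i >= win) {
            const Lanes del = vec_load(
                qScores[static_cast<unsigned char>(s[i - win])], diag + i - win);
            for (std::size_t l = 0; l < kLanes; ++l) running[l] -= del[l];
        }
        if (i + 1 < win) continue;  // window not yet full
        const std::size_t row = i + 1 - win;
        for (std::size_t l = 0; l < kLanes; ++l) {
            const std::size_t col = diag + l + row;
            if (running[l] >= strig && col + win <= q.size()) {
                hits.push_back(Hit{row, col, running[l]});
            }
        }
    }
    return hits;
}
```

On real vector hardware the three inner lane loops would each become a single vector instruction, and the per-lane threshold check would only run when a whole-register "any element" test succeeds.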

Coarse Grained Parallelism

Block Level Parallelism
- Block the matrix into columns
- Overlap by the number of characters in the window

Single Machine
- Run one thread per processor
- Create one memory mapped file per processor

Cluster
- Run one instance per machine and one thread per processor
- Store results locally (e.g., /tmp)
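The column blocking described above could be computed along these lines. This is a sketch only: `ColumnBlock` and `split_columns` are hypothetical names, and the overlap is taken to be `win` columns, following the slide, clamped at the end of the matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open column range [begin, end) assigned to one worker.
struct ColumnBlock { std::size_t begin; std::size_t end; };

// Splits `cols` columns into `workers` contiguous blocks. Each block is
// extended by `win` columns so that windows crossing a block boundary
// can still be scored by the block that owns their first column.
std::vector<ColumnBlock> split_columns(std::size_t cols, std::size_t workers,
                                       std::size_t win) {
    assert(workers > 0);
    std::vector<ColumnBlock> blocks;
    const std::size_t base = cols / workers;
    const std::size_t extra = cols % workers;  // first `extra` blocks get one more
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t width = base + (w < extra ? 1 : 0);
        const std::size_t end = std::min(cols, begin + width + win);
        blocks.push_back(ColumnBlock{begin, end});
        begin += width;
    }
    return blocks;
}
```

Each block could then be handed to its own thread (one per processor) or to a separate cluster node, with each worker writing to its own output file.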

Model-driven Implementation

Goal: Break the algorithm into basic operations that can be modeled independently to understand the performance issues at each step.

- Data Streams (data read speed)
- Vector Operations (instruction throughput)
- Sparse Matrix Format, data output (data write speed)

Data Stream Models

// Base case
// S-sequence is one stream pointer
s++;

// Q-sequence is four streams
uchar *qScore[4];

// Option 1: Four Pointers
// Keep pointers to the current
// position in the score vectors
qScore[0]++; qScore[1]++;
qScore[2]++; qScore[3]++;
score = *qScore[*s];

// Option 2: Index
// Index the score vectors with
// a counter
i++;
score = qScore[*s][i];

Figure: Data Stream Performance (Mops)

- The single stream pointer is similar to indexing, but a little slower
- Each of the four score streams is read only 1/4 of the time, so maintaining four pointers costs more than an indexed lookup
- Pointer vs. index numbers varied based on the compiler version

Vector Performance Models

// Model Variables
uchar *data = randseq(), out[16];
long i = 0, l = len(data);
vector uchar sum = 0, value;

// VecAdd
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
}

// StoreAll
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
  out = VecStore(sum);
  Save(out);
}

// StoreFreq
int freq = l * alpha;
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
  if(i % freq) { // Pipeline stall!
    out = VecStore(sum);
    Save(out);
  }
}

Figure: Vector Model Performance (Mops)

- Attempts to model infrequent write operations were unsuccessful
- Storing all dots yields high performance, but this is not practical for large comparisons
- StoreFreq provides a lower bound on performance
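A scalar stand-in for the conditional-store model can make the structure of the experiment concrete. The sketch below is not the AltiVec benchmark: `store_period_model`, `ModelResult`, and the `period` parameter are hypothetical, and it assumes one store every `period` iterations. It returns the number of stores so that different store rates can be compared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ModelResult {
    std::uint64_t sum;    // final accumulated value
    std::size_t stores;   // number of times the running sum was saved
};

// Accumulates every element and saves the running sum once every
// `period` iterations. The data-dependent branch is the part that can
// interrupt an otherwise steady stream of add operations.
ModelResult store_period_model(const std::vector<std::uint8_t>& data,
                               std::size_t period) {
    assert(period > 0);
    ModelResult result{0, 0};
    std::vector<std::uint64_t> out;  // stands in for the output buffer
    for (std::size_t i = 0; i < data.size(); ++i) {
        result.sum += data[i];
        if (i % period == 0) {
            out.push_back(result.sum);
            ++result.stores;
        }
    }
    return result;
}
```

Timing this loop for several values of `period` (for example with `std::chrono::steady_clock`) would give a rough, scalar analogue of the Mops figures reported on the slide.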

Pipeline Management

// Sequence of Vector Operations

// score = VecLoad(qScores[c])
score1 = vec_ld(0, ptemp);   // unaligned
score2 = vec_ld(16, ptemp);  // loads
vperm  = vec_lvsl(0, ptemp);
score  = vec_perm(score1, score2, vperm);

runningScore = vec_add(score, runningScore);

// delscore = VecLoad(qScores[delChar])
score1 = vec_ld(0, ptemp);
score2 = vec_ld(16, ptemp);
vperm  = vec_lvsl(0, ptemp);
delscore = vec_perm(score1, score2, vperm);

runningScore = vec_sub(runningScore, delscore);

if(vec_any_ge(runningScore, strig)) {
  scores = vec_st(runningScore)
  // Main processor
  for(i = 0; i < 16; i++) {
    if(hit[i] > ustrig ) {
      dm.AddDot(y, x + i, hit[i]);
    }
  }
}

Figure: Cycle-accurate Plots of the Instructions in Flight. Each line shows each cycle for one instruction. Instructions are offset (x-axis) based on starting time. Time flows from top to bottom (y-axis). The left plot shows a series of add/delete steps with no dots generated. The bottom plot shows the pipeline being interrupted when a dot is generated.

Sparse Matrix Format

// Option 1
// std::vector CSR-esque Sparse Matrix
struct Dot {
  int col;
  int value;
};

struct Row {
  int num;
  vector<Dot> cols;
};

typedef vector<Row> DotMatrixVec;

// Option 2
// Memory Mapped Coordinate-wise
// Sparse Matrix
struct RowDot {
  int row;
  int col;
  int value;
};

RowDot *out = (RowDot*)mmap(…);

Figure: Sparse Matrix Format Performance (Mops); bar labels 1.0x, 3.85x, 6.78x

- Both approaches required some maintenance to avoid exhausting main memory
- mmap avoids a second pass through the data during the save step
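The difference between the two layouts shows up in how a dot is appended. The sketch below uses the struct names from the slide, but the `add_dot` helpers are hypothetical and a `std::vector<RowDot>` stands in for the memory-mapped output region. It also assumes dots arrive grouped by row, as they would in a row-by-row sweep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dot { int col; int value; };
struct Row { int num; std::vector<Dot> cols; };
using DotMatrixVec = std::vector<Row>;

struct RowDot { int row; int col; int value; };

// Option 1 style: start a new Row when the row number changes, then
// append the dot to that row's column list.
void add_dot(DotMatrixVec& m, int row, int col, int value) {
    assert(row >= 0 && col >= 0);
    if (m.empty() || m.back().num != row) {
        m.push_back(Row{row, {}});
    }
    m.back().cols.push_back(Dot{col, value});
}

// Option 2 style: one flat, fixed-size record per dot. With a
// memory-mapped file, this record would be written straight into the
// mapped region instead of a vector.
void add_dot(std::vector<RowDot>& m, int row, int col, int value) {
    assert(row >= 0 && col >= 0);
    m.push_back(RowDot{row, col, value});
}
```

The flat layout repeats the row number in every record, but because each record has a fixed size it can be written once to its final location, which fits the point above that mmap avoids a second pass during the save step.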

Data Location

- Large, shared data is often located on network drives
- This adds a network hop for all disk I/O
- Even for infrequent I/O, this can significantly affect performance

Figure: Data Location Performance (Mops); bar labels 1.0x, 1.35x, 1.0x, 1.98x

- The std::vector sparse matrix had a slight benefit.
- The mmap sparse matrix improved significantly with local data storage.

Traditional Manual Optimizations

Prefetch
- G5 hardware prefetch is very good
- Attempts to optimize had negative impact

Blocking
- Slight negative impact due to burps in the stream

Unrolling
- Complicated code very quickly
- No measurable improvement

System Details

- Apple Dual 2.0 GHz G5, 3.5 GB RAM
- 100 Mbit network to file server
- OS X 10.3.5 (Darwin Kernel Version 7.5.0)
- g++ 3.3 (build 1620)
  - -O3
  - -fast (different compiler, aggressive optimizations)
  - -altivec (limited optimizations)
  - Upgrade from 1614 to 1620 improved DOTTER’s performance by 30%

Libraries
- Boost::thread

Data (from GenBank)
- Mitochondrial genomes
- E. coli, Listeria bacterial genomes

Results

Single Machine
- Mitochondrial (~20 kbp): DOTTER vs. Data-parallel
- Bacterial (4.5 Mbp): Data-parallel only

Figure: Final Results (Mops); bar labels 1.0x, 7.0x, 13.0x

Cluster (16 dual processor 2.3 GHz G5s)
- Bacterial Comparison
  - 92 min, 8 sec (1 node)
  - 5 min, 42 sec (16 nodes)

Figure: Scalability (time/nodes)
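The cluster timings translate into a speedup and a parallel efficiency with simple arithmetic. The helper names below are hypothetical; the only inputs are the two times reported for the bacterial comparison.

```cpp
#include <cassert>

// Converts a "minutes, seconds" timing to seconds.
double to_seconds(int minutes, int seconds) {
    return minutes * 60.0 + seconds;
}

// Speedup = T(1 node) / T(n nodes).
double speedup(double t1, double tn) {
    assert(tn > 0.0);
    return t1 / tn;
}

// Parallel efficiency = speedup / number of nodes.
double efficiency(double t1, double tn, int nodes) {
    assert(nodes > 0);
    return speedup(t1, tn) / nodes;
}
```

With 92 min 8 sec (5528 s) on 1 node and 5 min 42 sec (342 s) on 16 nodes, `speedup` returns roughly 16.2 and `efficiency` roughly 1.01, so the reported run scaled essentially linearly across the 16 nodes.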

Visualization

- Results rendered to PDF
- Target Displays
  - 2x4, 6400x2400 tiled display wall
  - IBM T221, 3840x2400, 204 dpi display
    - Magnifying glass required
- High resolution formats
  - 600 dpi laser printer
  - 1200 dpi ink jet printer
  - High resolution, no interactivity

Conclusions

- Modern commodity hardware is close to providing the performance necessary for large direct genomic comparisons.
  - 5,000,000 base pair sequences are realistic (bacteria)
  - 50,000,000 base pair sequences are possible (small human chromosomes)
- It is important to take a careful, experimental approach to implementation and to test all assumptions.

Acknowledgements

- Jeremiah Willcock helped develop the initial prototype
- Eric Wernert, Craig Jacobs, and Charlie Moad from the UITS Advanced Visualization Lab at Indiana University provided visualization support
- This work was supported by a grant from the Lilly Endowment

References

- Apple Developer’s Connection, Velocity Engine and Xcode, Apple Developer Connection, Cupertino, CA, 2004. http://developer.apple.com/hardware/ve http://developer.apple.com/tools/xcode
- A. J. Gibbs and G. A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur J Biochem, 16 (1970), pp. 1-11.
- E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix Program with Dynamic Threshold Control Suited for Genomic DNA and Protein-Sequence Analysis, Gene-Combis, 167 (1995), pp. 1-10.
