High Performance Direct Pairwise Comparison of Genomic Sequences Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine HiCOMB April 4, 2005 Denver, Colorado
Introduction
Goals
  Generate data for large format visualization
  Exploit parallel features present in commodity hardware
    SIMD/vector processors
    SMP/multiple processors per machine
    Clusters
Genome Comparison
  Dot plot is the only complete method for comparing genomes
  Often ruled out due to quadratic running time
  Size of data has an upper bound, and modern hardware is approaching the point where this bound is (almost) within reach
Target Data
  DNA sequences, one direction (5’ to 3’)
Target Platform
  Apple dual processor G5, Altivec vector processor
High-Performance Direct Pairwise Comparison of Large Genomic Sequences
Related Work
BLAST
  Apple and Genentech (AGBLAST), 5x speedup using Altivec
Smith-Waterman
  Rognes and Seeberg, 6x speedup using MMX
HMMER
  Erik Lindahl, 30% improvement using Altivec
Hardware Solutions
  Various commercial FPGA solutions exist for different algorithms (e.g., TimeLogic’s DeCypher platform offers BLAST, HMM, SW)
SIMD Overview
Single Instruction, Multiple Data
  Perform the same operation on many data items at once
  Vector registers can be divided according to the data type
  The Altivec registers in the G5 are 128 bits wide
Vector programming using gcc on Apple G5s is one step removed from assembly programming
  Functions are thin wrappers around assembly calls
  The optimizer does not cover vector operations
  Memory loads and stores are handled by the programmer and must be properly byte aligned
Figure: Normal vs. SIMD addition
  Normal: 3 + 2 = 5
  SIMD: (3, 2, 1, 4) + (2, 4, 5, 9) = (5, 6, 6, 13)
  Image from http://developer.apple.com/hardware/ve
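The Normal vs. SIMD figure can be mimicked in portable C++ as a rough sketch. Plain arrays stand in for vector registers here; `Lanes`, `lane_add`, and `lanes_in_128_bits` are hypothetical names chosen for illustration, and real Altivec code would use the vector intrinsics instead of a loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a vector register split into N lanes.
template <std::size_t N>
using Lanes = std::array<std::uint32_t, N>;

// Lane-wise add: one logical operation applied to every lane.
template <std::size_t N>
Lanes<N> lane_add(const Lanes<N>& a, const Lanes<N>& b) {
    Lanes<N> out{};
    for (std::size_t i = 0; i < N; ++i) out[i] = a[i] + b[i];
    return out;
}

// Number of lanes a 128-bit register holds for a given element width.
inline std::size_t lanes_in_128_bits(std::size_t element_bits) {
    assert(element_bits > 0 && 128 % element_bits == 0);
    return 128 / element_bits;
}
```

With 8-bit elements a 128-bit register holds 16 lanes, which is the lane count used for score vectors in the later slides.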
The Dot Plot
NAÏVE_DOTPLOT(qseq, sseq, win, strig):
  // qseq - column sequence
  // sseq - row sequence
  // win - number of elements to compare for each point
  // strig - number of matches required for a point
  for each q in qseq:
    for each s in sseq:
      score = 0
      for each (q’, s’) in (qseq[q:q+win], sseq[s:s+win]):
        if q’ == s’:
          score += 1
        end if q’
      end for each (q’, s’)
      if score > strig:
        AddDot(q, s)
      end if score
    end for each s
  end for each q
Figure labels: qseq (columns), sseq (rows), win = 3, strig = 2
Figure: Dot plot comparing the human and fly mitochondrial genomes (generated by DOTTER)
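A minimal runnable sketch of the NAÏVE_DOTPLOT pseudocode, under two assumptions not stated on the slide: only complete windows are scored, and dots are returned as (q, s) pairs instead of being passed to AddDot. The name `naive_dotplot` and the `Dot` alias are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A dot is (q, s): window start in qseq (column) and in sseq (row).
using Dot = std::pair<std::size_t, std::size_t>;

std::vector<Dot> naive_dotplot(const std::string& qseq, const std::string& sseq,
                               std::size_t win, std::size_t strig) {
    assert(win > 0);
    std::vector<Dot> dots;
    if (qseq.size() < win || sseq.size() < win) return dots;
    for (std::size_t q = 0; q + win <= qseq.size(); ++q) {
        for (std::size_t s = 0; s + win <= sseq.size(); ++s) {
            std::size_t score = 0;
            // Count matching positions inside the window.
            for (std::size_t k = 0; k < win; ++k) {
                if (qseq[q + k] == sseq[s + k]) ++score;
            }
            // Same strict comparison as the pseudocode.
            if (score > strig) dots.emplace_back(q, s);
        }
    }
    return dots;
}
```

The three nested loops make the quadratic cost explicit: the work is proportional to len(qseq) * len(sseq) * win.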
The Standard Algorithm
STD_DOTPLOT(qScores, s, win, strig):
  dotvec = zeros(len(q))
  for each char c in s:
    dotvec = shift(dotvec, 1)
    dotvec += qScores[c]
    if index(c) > win:
      delchar = s[index(c) - win]
      dotvec -= shift(qScores[delchar], win)
    for each dot in dotvec > strig:
      display(dot)
    end for each dot
  end for each c
end STD_DOTPLOT
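The idea behind STD_DOTPLOT, one running score per diagonal that gains the newest pair and loses the pair that left the window, can be sketched as follows. This is an illustration under assumptions: it compares characters directly instead of using precomputed qScores vectors, indexes diagonals explicitly instead of shifting dotvec, and reports window start positions. `running_dotplot` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Dot = std::pair<std::size_t, std::size_t>;  // (q start, s start)

std::vector<Dot> running_dotplot(const std::string& q, const std::string& s,
                                 std::size_t win, std::size_t strig) {
    assert(win > 0);
    std::vector<Dot> dots;
    const std::size_t n = q.size(), m = s.size();
    if (n < win || m < win) return dots;
    // One running score per diagonal; diagonal id = j - i + (m - 1).
    std::vector<std::size_t> running(n + m - 1, 0);
    for (std::size_t i = 0; i < m; ++i) {        // row: position in s
        for (std::size_t j = 0; j < n; ++j) {    // column: position in q
            std::size_t& r = running[j + (m - 1) - i];
            if (q[j] == s[i]) ++r;               // add the newest pair
            // Drop the pair that just left the window.
            if (i >= win && j >= win && q[j - win] == s[i - win]) --r;
            // Report only once a full window lies on this diagonal.
            if (i + 1 >= win && j + 1 >= win && r > strig)
                dots.emplace_back(j + 1 - win, i + 1 - win);
        }
    }
    return dots;
}
```

Each cell now costs a constant number of operations, so the window length no longer multiplies the running time.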
Data Parallel Dot Plot
VECTOR_DOTPLOT(qScores, s, win, strig):
  // Group diagonals by the upper and lower
  // triangular sections of the matrix
  for each vector diagonal D:
    runningScore = vector(0)
    for each char c in s:
      score = VecLoad(qScores[c])
      runningScore = VecAdd(score, runningScore)
      if index(c) > win:
        delChar = s[index(c) - win]
        delscore = VecLoad(qScores[delChar])
        runningScore = VecSub(runningScore, delscore)
      if VecAnyElementGte(runningScore, strig):
        scores = VectorUnpack(runningScore)
        for each score in scores > strig:
          Output(row(c), col(score), score)
        end for each score
      end if VecAnyElementGte()
    end for each c
  end for each D
end VECTOR_DOTPLOT
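A scalar emulation of one "vector diagonal" from VECTOR_DOTPLOT, offered as a sketch only. Assumptions: a 16-element byte array stands in for one 128-bit register, match vectors are computed on the fly instead of loaded from qScores, only upper-triangle diagonals (starting at column offset `d0`) are handled, and hits are reported at the window's last row and column. `Hit`, `match_lanes`, and `vector_diagonal` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Hit { std::size_t row; std::size_t col; unsigned score; };
using Lanes = std::array<std::uint8_t, 16>;  // stand-in for one 128-bit register

// Lane k is 1 where q[first_col + k] equals c, else 0 (also 0 past the end).
Lanes match_lanes(const std::string& q, std::size_t first_col, char c) {
    Lanes v{};
    for (std::size_t k = 0; k < 16; ++k)
        if (first_col + k < q.size() && q[first_col + k] == c) v[k] = 1;
    return v;
}

// Scores 16 adjacent upper-triangle diagonals starting at column offset d0.
std::vector<Hit> vector_diagonal(const std::string& q, const std::string& s,
                                 std::size_t d0, std::size_t win, unsigned strig) {
    assert(win > 0 && win <= 255);  // scores must fit in 8-bit lanes
    std::vector<Hit> hits;
    Lanes running{};
    for (std::size_t i = 0; i < s.size(); ++i) {
        const Lanes add = match_lanes(q, d0 + i, s[i]);
        for (std::size_t k = 0; k < 16; ++k) running[k] += add[k];
        if (i >= win) {
            const Lanes del = match_lanes(q, d0 + i - win, s[i - win]);
            for (std::size_t k = 0; k < 16; ++k) running[k] -= del[k];
        }
        // Cheap whole-vector test first, mirroring VecAnyElementGte.
        bool any = false;
        for (std::size_t k = 0; k < 16; ++k)
            any = any || static_cast<unsigned>(running[k]) >= strig;
        if (!any) continue;
        // Unpack lane by lane only when some lane reached the threshold.
        for (std::size_t k = 0; k < 16; ++k)
            if (running[k] > strig) hits.push_back({i, d0 + i + k, running[k]});
    }
    return hits;
}
```

The point of the two-stage test is that the lane-by-lane unpack runs only on the rare rows where a dot is possible, which keeps the common path short.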
Coarse Grained Parallelism
Block Level Parallelism
Single Machine
  Block the matrix into columns
  Overlap by the number of characters in the window
  Run one thread per processor
  Create one memory mapped file per processor
Cluster
  Run one instance per machine and one thread per processor
  Store results locally (e.g. /tmp)
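The column blocking with overlap could be computed along these lines. This is a sketch under assumptions: blocks are equal-width half-open column ranges, each extended to the right by `win` columns as the slide states, and `Block` and `make_column_blocks` are hypothetical names. Handing each block to a thread is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Block { std::size_t begin; std::size_t end; };  // columns [begin, end)

std::vector<Block> make_column_blocks(std::size_t n_cols, std::size_t n_blocks,
                                      std::size_t win) {
    assert(n_blocks > 0);
    std::vector<Block> blocks;
    const std::size_t width = (n_cols + n_blocks - 1) / n_blocks;  // ceiling division
    for (std::size_t b = 0; b < n_blocks; ++b) {
        const std::size_t begin = std::min(b * width, n_cols);
        // Extend by `win` columns so windows crossing the boundary are scored.
        const std::size_t end = std::min(begin + width + win, n_cols);
        if (begin < end) blocks.push_back({begin, end});
    }
    return blocks;
}
```

With a full `win` of overlap, a window that starts inside the overlap can be seen by two blocks, so the caller would need to decide which block reports it.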
Model-driven Implementation
Goal: Break the algorithm into basic operations that can be modeled independently to understand the performance issues at each step.
  Data Streams (data read speed)
  Vector Operations (instruction throughput)
  Sparse Matrix Format
  Data Output (data write speed)
Data Stream Models
// Base case
// S-sequence is one stream pointer
s++;

// Q-sequence is four streams
uchar *qScore[4];

// Option 1: Four Pointers
// Keep pointers to the current
// position in the score vectors
qScore[0]++; qScore[1]++; qScore[2]++; qScore[3]++;
score = *qScore[*s];

// Option 2: Index
// Index the score vectors with
// a counter
i++;
score = qScore[*s][i];
Figure: Data Stream Performance (Mops)
Single stream pointer is similar to indexing, but a little slower
For the four score streams, each of which is read only about 1/4 of the time, maintaining the pointers costs more than an indexed lookup
Pointer vs. Index numbers varied based on the compiler version
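The two access options can be written as small functions that must agree on their result, which makes them easy to time against each other. A sketch under assumptions: the S-sequence is already encoded as values 0..3, every score stream is at least as long as `s`, and the names `sum_four_pointers` and `sum_indexed` are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Stream = std::vector<std::uint8_t>;

// Option 1: advance four pointers every step, read only the one selected by s.
unsigned long sum_four_pointers(const std::vector<std::uint8_t>& s,
                                const std::array<Stream, 4>& qScore) {
    const std::uint8_t* p[4] = {qScore[0].data(), qScore[1].data(),
                                qScore[2].data(), qScore[3].data()};
    unsigned long total = 0;
    for (std::uint8_t c : s) {
        assert(c < 4);
        total += *p[c];
        ++p[0]; ++p[1]; ++p[2]; ++p[3];  // three of these updates go unused
    }
    return total;
}

// Option 2: one counter indexes whichever stream is selected.
unsigned long sum_indexed(const std::vector<std::uint8_t>& s,
                          const std::array<Stream, 4>& qScore) {
    unsigned long total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        assert(s[i] < 4);
        total += qScore[s[i]][i];
    }
    return total;
}
```

Option 1 performs four pointer updates per element but uses only one of them, which is consistent with the observation that maintaining the pointers costs more than the lookup.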
Vector Performance Models
// Model Variables
uchar *data = randseq(), out[16];
long i = 0, l = len(data);
vector uchar sum = 0, value;

// VecAdd
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
}

// StoreAll
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
  out = VecStore(sum);
  Save(out);
}

// StoreFreq
int freq = l * alpha;
for(i = 0; i < l - 16; i++) {
  value = VecLoad(data[i]);
  sum = VecAdd(value, sum);
  if(i % freq) { // Pipeline stall!
    out = VecStore(sum);
    Save(out);
  }
}
Figure: Vector Model Performance (Mops)
Attempts to model infrequent write operations were unsuccessful
Storing all dots yields high performance, but this is not practical for large comparisons
StoreFreq provides a lower bound on performance
Pipeline Management
// Sequence of Vector Operations
// score = VecLoad(qScores[c])
score1 = vec_ld(0, ptemp);   // unaligned
score2 = vec_ld(16, ptemp);  // loads
vperm = vec_lvsl(0, ptemp);
score = vec_perm(score1, score2, vperm);

runningScore = vec_add(score, runningScore);

// delscore = VecLoad(qScores[delChar])
score1 = vec_ld(0, ptemp);
score2 = vec_ld(16, ptemp);
vperm = vec_lvsl(0, ptemp);
delscore = vec_perm(score1, score2, vperm);
runningScore = vec_sub(runningScore, delscore);

if(vec_any_ge(runningScore, strig)) {
  scores = vec_st(runningScore)
  // Main processor
  for(i = 0; i < 16; i++) {
    if(hit[i] > ustrig) {
      dm.AddDot(y, x + i, hit[i]);
    }
  }
}
Figure: Cycle-accurate Plots of the Instructions in Flight
Each line shows each cycle for one instruction. Instructions are offset (x-axis) based on starting time. Time flows from top to bottom (y-axis). The left plot shows a series of add/delete steps with no dots generated. The bottom plot shows the pipeline being interrupted when a dot is generated.
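The unaligned-load sequence (two aligned loads, a shift mask, and a permute) can be emulated in portable C++ to show why it yields the 16 bytes starting at an arbitrary position. This is a sketch, not Altivec code: positions in a `std::vector` stand in for addresses, and `aligned_load`, `shift_left_mask`, `permute`, and `unaligned_load` are hypothetical names modeled on the behavior of vec_ld, vec_lvsl, and vec_perm.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec16 = std::array<std::uint8_t, 16>;

// Like vec_ld: the low four bits of the position are ignored.
Vec16 aligned_load(const std::vector<std::uint8_t>& mem, std::size_t pos) {
    const std::size_t base = pos & ~static_cast<std::size_t>(15);
    assert(base + 16 <= mem.size());
    Vec16 v{};
    for (std::size_t i = 0; i < 16; ++i) v[i] = mem[base + i];
    return v;
}

// Like vec_lvsl: mask {sh, sh+1, ..., sh+15}, where sh is the misalignment.
Vec16 shift_left_mask(std::size_t pos) {
    Vec16 m{};
    for (std::size_t i = 0; i < 16; ++i)
        m[i] = static_cast<std::uint8_t>((pos & 15) + i);
    return m;
}

// Like vec_perm: selectors 0..15 pick from a, 16..31 pick from b.
Vec16 permute(const Vec16& a, const Vec16& b, const Vec16& sel) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        const std::size_t idx = sel[i] & 31;
        out[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return out;
}

// The four-step sequence from the slide, combined.
Vec16 unaligned_load(const std::vector<std::uint8_t>& mem, std::size_t pos) {
    return permute(aligned_load(mem, pos), aligned_load(mem, pos + 16),
                   shift_left_mask(pos));
}
```

Every unaligned load therefore costs two loads plus two extra instructions, which is part of why the instruction sequence per character is as long as shown.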
Sparse Matrix Format
// Option 1
// std::vector CSR-esque Sparse Matrix
struct Dot {
  int col;
  int value;
};

struct Row {
  int num;
  vector<Dot> cols;
};

typedef vector<Row> DotMatrixVec;

// Option 2
// Memory Mapped Coordinate-wise
// Sparse Matrix
struct RowDot {
  int row;
  int col;
  int value;
};

RowDot *out = (RowDot*)mmap(…);
Figure: Sparse Matrix Format Performance (Mops); bar labels 1.0x, 3.85x, 6.78x
Both approaches required some maintenance to avoid exhausting main memory
mmap avoids a second pass through the data during the save step
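A possible shape for the memory mapped coordinate-wise output, written against POSIX `mmap`. This is a sketch under assumptions: the number of dots is known up front so the file can be sized once, the file path and the helper names `write_dots_mmap` and `read_dots` are hypothetical, and the arguments the slide elides in its own `mmap(…)` call are not known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

struct RowDot { int row; int col; int value; };

// Writes dots through a shared file mapping. Returns false on any failure.
bool write_dots_mmap(const char* path, const std::vector<RowDot>& dots) {
    assert(path != nullptr);
    if (dots.empty()) return false;  // mmap rejects a zero length
    const std::size_t bytes = dots.size() * sizeof(RowDot);
    const int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return false;
    // The file must be as large as the mapping before it is written.
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { close(fd); return false; }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }
    RowDot* out = static_cast<RowDot*>(p);
    for (std::size_t i = 0; i < dots.size(); ++i) out[i] = dots[i];
    bool ok = msync(p, bytes, MS_SYNC) == 0;  // flush mapped pages to the file
    ok = (munmap(p, bytes) == 0) && ok;
    ok = (close(fd) == 0) && ok;
    return ok;
}

// Reads back up to `count` records with ordinary file I/O, for checking.
std::vector<RowDot> read_dots(const char* path, std::size_t count) {
    std::vector<RowDot> dots(count);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    const std::size_t got = std::fread(dots.data(), sizeof(RowDot), count, f);
    std::fclose(f);
    dots.resize(got);
    return dots;
}
```

Because stores into the mapping land in the file's pages directly, there is no separate save loop, which matches the note that mmap avoids a second pass through the data.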
Data Location
Large, shared data is often located on network drives
This adds a network hop for all disk I/O
Even for infrequent I/O, this can significantly affect performance
Figure: Data Location Performance (Mops); bar labels 1.98x, 1.35x, 1.0x, 1.0x
The std::vector sparse matrix had a slight benefit.
The mmap sparse matrix improved significantly with local data storage.
Traditional Manual Optimizations
Prefetch
  G5 hardware prefetch is very good
  Attempts to optimize had negative impact
Blocking
  Slight negative impact due to burps in the stream
Unrolling
  Complicated the code very quickly
  No measurable improvement
System Details
Apple Dual 2.0 GHz G5, 3.5 GB RAM
100 Mbit network to file server
OS X 10.3.5 (Darwin Kernel Version 7.5.0)
g++ 3.3 (build 1620)
  -O3
  -fast (different compiler, aggressive optimizations)
  -altivec (limited optimizations)
  Upgrade from 1614 to 1620 improved DOTTER’s performance by 30%
Libraries
  Boost::thread
Data (from GenBank)
  Mitochondrial genomes
  E. coli, Listeria bacterial genomes
Results
Single Machine
  Mitochondrial (~20 kbp)
    DOTTER vs. Data-parallel
  Bacterial (4.5 Mbp)
    Data-parallel only
  Figure: Final Results (Mops); bar labels 1.0x, 7.0x, 13.0x
Cluster (16 dual processor 2.3 GHz G5s)
  Bacterial Comparison
    92 min, 8 sec (1 node)
    5 min, 42 sec (16 nodes)
  Figure: Scalability (time/nodes)
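As a worked check of the two cluster timings, assuming the 1-node run used one of the same cluster machines, the speedup S and parallel efficiency E on 16 nodes come out as:

```latex
\begin{align*}
T_1 &= 92\,\text{min}\ 8\,\text{s} = 5528\,\text{s}, \qquad
T_{16} = 5\,\text{min}\ 42\,\text{s} = 342\,\text{s} \\
S_{16} &= \frac{T_1}{T_{16}} = \frac{5528}{342} \approx 16.2 \\
E_{16} &= \frac{S_{16}}{16} \approx 1.01
\end{align*}
```

The scaling is therefore essentially linear over this range; the slides do not say why the measured figure lands slightly above 16.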
Visualization
Results rendered to PDF
Target Displays
  2x4, 6400x2400 tiled display wall
  IBM T221, 3840x2400, 204 dpi display
    Magnifying glass required
High resolution formats
  600 dpi laser printer
  1200 dpi ink jet printer
  High resolution, no interactivity
Conclusions
Modern commodity hardware is close to providing the performance necessary for large direct genomic comparisons.
  5,000,000 base pair sequences are realistic (bacteria)
  50,000,000 base pair sequences are possible (small human chromosomes)
It is important to take a careful, experimental approach to implementation and to test all assumptions.
Acknowledgements
Jeremiah Willcock helped develop the initial prototype
Eric Wernert, Craig Jacobs, and Charlie Moad from the UITS Advanced Visualization Lab at Indiana University provided visualization support
This work was supported by a grant from the Lilly Endowment
References
Apple Developer Connection, Velocity Engine and Xcode, Apple Developer Connection, Cupertino, CA, 2004. http://developer.apple.com/hardware/ve http://developer.apple.com/tools/xcode
A. J. Gibbs and G. A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur J Biochem, 16 (1970), pp. 1-11.
E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix Program with Dynamic Threshold Control Suited for Genomic DNA and Protein-Sequence Analysis, Gene-Combis, 167 (1995), pp. 1-10.