
Implementing Data Parallel Algorithms for Bioinformatics

Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine

SIAM Conference on Computational Science and Engineering
February 14, 2005

Introduction

Goal
• Implement a well-known bioinformatics algorithm for a data parallel system (Altivec)

Motivation
• Current implementations do not scale well to support full genomes
• Vector processors are common, even on commodity hardware
• New supercomputing architectures are including vector (DSP) units (again)

SIMD Overview

Single Instruction, Multiple Data: perform the same operation on many data items at once.

Vector registers can be divided according to the data type. The Altivec registers in the G5 are 128 bits wide.

[Figure: scalar versus SIMD addition. Normal: 3 + 2 = 5. SIMD: (3, 2, 1, 4) + (2, 4, 5, 9) = (5, 6, 6, 13). Image from http://developer.apple.com/hardware/ve]
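The addition in the figure can be mimicked in plain Python to show what "one instruction, many data items" means. This is only an illustrative sketch: `simd_add` is a hypothetical helper name, and a Python list stands in for a vector register, so nothing here runs on actual vector hardware.

```python
def simd_add(a, b):
    """Add two equal-length 'registers' lane by lane, as one SIMD add would."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


# Scalar case from the figure: one pair of operands per instruction.
print(3 + 2)                                   # 5

# SIMD case from the figure: four pairs of operands in a single operation.
print(simd_add([3, 2, 1, 4], [2, 4, 5, 9]))    # [5, 6, 6, 13]
```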

General Issues

• Altivec code is one step removed from assembly
    • Programmer manages load/store operations
    • Debugging and maintenance is a challenge
• Compiler optimizations are not available
    • But, the compiler handles register assignments and can insert load/store operations
• For maximum performance, the processor must be fed continuously

Application: Dot Plot

qseq, sseq = sequences
win = number of elements to compare for each point
strig = number of matches required for a point

Dotplot comparing the human and fly mitochondrial genomes (generated by DOTTER)

for each q in qseq:
    for each s in sseq:
        if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
            AddDot(q, s)
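The nested loop above translates almost directly into runnable Python. The sketch below makes a few assumptions that the slide does not state: scoring is a plain match count (no scoring matrix), a dot is recorded when the count is at least `strig`, and only full windows are compared. The function names are illustrative.

```python
def compare_window(a, b, strig):
    """Return True when the two windows share at least `strig` matching positions."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches >= strig


def naive_dotplot(qseq, sseq, win, strig):
    """Brute-force dot plot: compare every window of qseq with every window of sseq."""
    dots = []
    for q in range(len(qseq) - win + 1):
        for s in range(len(sseq) - win + 1):
            if compare_window(qseq[q:q + win], sseq[s:s + win], strig):
                dots.append((q, s))
    return dots


print(naive_dotplot("ACGTAC", "CGTA", win=3, strig=3))   # [(1, 0), (2, 1)]
```

Every window is rescored from scratch here, so the cost grows with len(qseq) × len(sseq) × win. That is the overhead the running-score formulations avoid.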

The Standard Algorithm

DOTPLOT(qScores, s, win, strig):
    # qScores[c] holds the score of character c against each position of the query sequence q
    dotvec = zeros(len(q))
    for each char c in s:
        dotvec = shift(dotvec, 1)
        dotvec += qScores[c]
        if index(c) > win:
            delchar = s[index(c) - win]
            dotvec -= shift(qScores[delchar], win)
        for each dot in dotvec >= strig:
            display(dot)
        end for each dot
    end for each c
end DOTPLOT
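A runnable sketch of the running-score idea follows. It is an interpretation of the pseudocode rather than the DOTTER implementation, under these assumptions: match scores are 1 or 0, indices are 0-based (so the delete step starts at `i >= win`), a dot needs a score of at least `strig`, and dots are reported as the start positions of full windows. `build_q_scores` and `shift` are hypothetical helpers.

```python
def build_q_scores(q, alphabet):
    """For each character c, a vector with 1 where q matches c and 0 elsewhere."""
    return {c: [1 if ch == c else 0 for ch in q] for c in alphabet}


def shift(vec, n):
    """Shift a vector right by n positions, filling with zeros and keeping its length."""
    if n >= len(vec):
        return [0] * len(vec)
    return [0] * n + vec[:len(vec) - n]


def dotplot(q, s, win, strig):
    q_scores = build_q_scores(q, set(q) | set(s))
    dotvec = [0] * len(q)          # dotvec[j]: score of the window ending at q[j]
    dots = []
    for i, c in enumerate(s):
        # Move every running score one step along its diagonal, then add the new character.
        dotvec = shift(dotvec, 1)
        dotvec = [a + b for a, b in zip(dotvec, q_scores[c])]
        if i >= win:
            # Remove the contribution of the character that just left the window.
            old = shift(q_scores[s[i - win]], win)
            dotvec = [a - b for a, b in zip(dotvec, old)]
        if i >= win - 1:
            for j, score in enumerate(dotvec):
                if j >= win - 1 and score >= strig:
                    dots.append((j - win + 1, i - win + 1))
    return dots


print(dotplot("ACGTAC", "CGTA", win=3, strig=3))   # [(1, 0), (2, 1)]
```

Each character of `s` now costs one add and one subtract per position of `q`, independent of the window size.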

Vector Dot Plot

VECTORDOTPLOT(qScores, s, win, strig):
    for each vector diagonal D:
        runningScore = vector(0)
        for each char c in s:
            score = VecLoad(qScores[c])
            runningScore = VecAdd(runningScore, score)
            if index(c) > win:
                delChar = s[index(c) - win]
                delscore = VecLoad(qScores[delChar])
                runningScore = VecSub(runningScore, delscore)
            if VecAnyElementGte(runningScore, strig):
                scores = VectorUnpack(runningScore)
                for each score in scores >= strig:
                    Output(row(c), col(score), score)
                end for each score
            end if VecAnyElementGte
        end for each c
    end for each D
end VECTORDOTPLOT
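The diagonal-at-a-time structure can be emulated in Python by treating a short list as one vector register. This is a sketch under several assumptions, not the authors' Altivec code: 4 lanes instead of 16, match scores of 1 or 0, zero padding around the score vectors so that every "vector load" is full width, and a final range check that discards lanes whose window falls outside `q`.

```python
LANES = 4  # illustrative; a 128-bit register would hold 16 unsigned chars


def vector_dotplot(q, s, win, strig, lanes=LANES):
    front = len(s) - 1            # zero padding so negative diagonals index safely
    back = len(s) + lanes         # zero padding past the end of q
    q_scores = {
        c: [0] * front + [1 if ch == c else 0 for ch in q] + [0] * back
        for c in set(q) | set(s)
    }
    dots = []
    # Diagonal d pairs q[d + i] with s[i]; each block covers `lanes` adjacent diagonals.
    for d0 in range(-(len(s) - 1), len(q), lanes):
        running = [0] * lanes
        for i, c in enumerate(s):
            base = front + d0 + i
            score = q_scores[c][base:base + lanes]                    # "VecLoad"
            running = [r + x for r, x in zip(running, score)]         # "VecAdd"
            if i >= win:
                dbase = base - win
                delscore = q_scores[s[i - win]][dbase:dbase + lanes]  # "VecLoad"
                running = [r - x for r, x in zip(running, delscore)]  # "VecSub"
            # Unpack lanes only when at least one of them reaches the threshold.
            if i >= win - 1 and any(r >= strig for r in running):
                s_start = i - win + 1
                for k, r in enumerate(running):
                    q_start = d0 + k + s_start
                    if r >= strig and 0 <= q_start <= len(q) - win:
                        dots.append((q_start, s_start))
    return sorted(dots)


print(vector_dotplot("ACGTAC", "CGTA", win=3, strig=3))   # [(1, 0), (2, 1)]
```

Because the load offset advances by one element per character of `s`, most loads do not fall on a 16-byte boundary on real hardware, which is why the unaligned-load sequence appears in the pipeline slides.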

Expectations

Data Types

• DNA
    • unsigned char
    • Window size is generally 16-40, max score 40 with no scoring matrix
• Protein
    • short
    • Window size is smaller
    • Scoring matrices can lead to negative scores and scores > 127
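The choice of element type determines how many scores fit in one 128-bit register. The sketch below works through that arithmetic, assuming `unsigned char` is 8 bits and `short` is 16 bits (as on the G5). The protein score range of -50 to 200 is a made-up illustration, not a figure from the slides.

```python
REGISTER_BITS = 128  # Altivec register width


def int_range(bits, signed):
    """Smallest and largest value representable in an integer of the given width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def fits(lo, hi, bits, signed):
    """True when every score in [lo, hi] is representable in the element type."""
    tlo, thi = int_range(bits, signed)
    return tlo <= lo and hi <= thi


def lanes(bits):
    """Number of elements of the given width in one vector register."""
    return REGISTER_BITS // bits


# DNA: scores from 0 to 40 fit in an unsigned char, giving 16 lanes per register.
print(fits(0, 40, 8, signed=False), lanes(8))      # True 16

# Protein (hypothetical range): too wide for 8 bits, so a short with 8 lanes is needed.
print(fits(-50, 200, 8, signed=True))              # False
print(fits(-50, 200, 16, signed=True), lanes(16))  # True 8
```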

Stream Management

    // S-sequence is one stream pointer
    s++;

    // Q-sequence is four streams

    // Option 1: Four Pointers
    // Keep pointers to the current position in the score vectors
    qScore[0]++; qScore[1]++; qScore[2]++; qScore[3]++;
    score = *qScore[*s];

    // Option 2: Index
    // Index the score vectors with a counter
    i++;
    score = qScore[*s][i];





Stream           Speed (Mops)
Pointer          4448
Four Pointers    3028
Index            4600

• A single stream pointer performs similarly to indexing, but is a little slower
• Each of the four score streams is indexed only about 1/4 of the time, so maintaining all four pointers costs more than the indexed lookup

Pipeline Management

Sequence of Vector Operations

    // score = VecLoad(qScores[c])
    score1 = vec_ld(0, ptemp);    // unaligned
    score2 = vec_ld(16, ptemp);   // loads
    vperm  = vec_lvsl(0, ptemp);
    score  = vec_perm(score1, score2, vperm);

    runningScore = vec_add(runningScore, score);

    // delscore = VecLoad(qScores[delChar])
    score1   = vec_ld(0, ptemp);
    score2   = vec_ld(16, ptemp);
    vperm    = vec_lvsl(0, ptemp);
    delscore = vec_perm(score1, score2, vperm);

    runningScore = vec_sub(runningScore, delscore);

    if (vec_any_ge(runningScore, strig))
        vec_st(runningScore, 0, scores);

[Figure: cycle-accurate plots of the instructions in flight. The left plot shows a series of add/delete steps with no dots generated. The bottom plot shows the pipeline being interrupted when a dot is generated.]
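The unaligned load sequence (two aligned loads, a permute mask, and a permute) can be emulated in Python to show why it yields the 16 bytes starting at an arbitrary address. The functions below are simplified stand-ins for `vec_ld`, `vec_lvsl` and `vec_perm`: `mem` is an ordinary bytes object, addresses are offsets into it, and the buffer is assumed to extend at least one full block past the address being loaded.

```python
BLOCK = 16  # bytes per Altivec register


def aligned_load(mem, addr):
    """Like vec_ld: the low 4 address bits are ignored, so an aligned block is returned."""
    start = addr & ~(BLOCK - 1)
    return list(mem[start:start + BLOCK])


def lvsl(addr):
    """Like vec_lvsl: permute indices (sh, sh+1, ..., sh+15), where sh = addr mod 16."""
    sh = addr & (BLOCK - 1)
    return [sh + k for k in range(BLOCK)]


def perm(a, b, control):
    """Like vec_perm: each control value picks one byte from the 32-byte pair a + b."""
    both = a + b
    return [both[i & 31] for i in control]


def unaligned_load(mem, addr):
    lo = aligned_load(mem, addr)            # block containing the first byte
    hi = aligned_load(mem, addr + BLOCK)    # the following block
    return perm(lo, hi, lvsl(addr))


mem = bytes(range(64))
print(unaligned_load(mem, 5))   # the 16 bytes starting at offset 5: 5, 6, ..., 20
```

Four instructions per load is part of the cost of walking the score vectors along a diagonal, since the starting offset is aligned for only one in every 16 steps.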

Dot Matrix Structure

std::vector sparse matrix

    struct Dot { int col; int value; };
    struct Row { int num; vector<Dot> cols; };
    typedef vector<Row> DotMatrixVec;

Memory mapped array

    struct RowDot { int row; int col; int value; };
    RowDot *out = (RowDot*)mmap(…);
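The two layouts can be contrasted with a small Python sketch. It is only an analogy to the C++ structures: a dict of lists stands in for the vector of rows, and an anonymous `mmap` region stands in for the memory mapped output file, whose actual `mmap` arguments the slide elides. The record format of three native C ints mirrors `struct RowDot`, and the sample dots are invented.

```python
import mmap
import struct

ROW_DOT = struct.Struct("iii")  # row, col, value as three C ints, like struct RowDot


def write_dots(buf, dots):
    """Append-style output: each dot is one fixed-size record at a known offset."""
    for n, (row, col, value) in enumerate(dots):
        ROW_DOT.pack_into(buf, n * ROW_DOT.size, row, col, value)
    return len(dots)


def read_dots(buf, count):
    """Read the first `count` records back as (row, col, value) tuples."""
    return [ROW_DOT.unpack_from(buf, n * ROW_DOT.size) for n in range(count)]


def to_sparse_rows(dots):
    """Row-oriented layout: one growable list of (col, value) pairs per row."""
    rows = {}
    for row, col, value in dots:
        rows.setdefault(row, []).append((col, value))
    return rows


dots = [(0, 3, 17), (0, 9, 16), (2, 1, 20)]         # made-up sample data
buf = mmap.mmap(-1, ROW_DOT.size * len(dots))       # anonymous mapping, not a real file
count = write_dots(buf, dots)
print(read_dots(buf, count))     # [(0, 3, 17), (0, 9, 16), (2, 1, 20)]
print(to_sparse_rows(dots))      # {0: [(3, 17), (9, 16)], 2: [(1, 20)]}
```

The flat record array needs no per-row allocation when a dot is written, which is one plausible reason for the difference between the two formats in the measurements.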

Performance in Mops of sparse matrix formats based on data location

         Base    std::vector    mmap
Ideal    140     1163           1163
NFS      88      370            400
Local    -       500            881

• An ‘op’ is one complete dot comparison
• Base is a direct port of the DOTTER algorithm

Traditional Optimizations

• Prefetch
    • G5 hardware prefetch is very good
    • Attempts to optimize had negative impact
• Blocking
    • Slight negative impact due to burps in the stream
• Unrolling
    • Complicates the code very quickly
    • No measurable improvement

Acknowledgements

• Jeremiah Willcock helped develop the initial prototype

References

• Apple Developer Connection, Velocity Engine and Xcode, Apple Developer Connection, Cupertino, CA, 2004. http://developer.apple.com/hardware/ve http://developer.apple.com/tools/xcode
• A. J. Gibbs and G. A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur J Biochem, 16 (1970), pp. 1-11.
• E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix Program with Dynamic Threshold Control Suited for Genomic DNA and Protein-Sequence Analysis, Gene-Combis, 167 (1995), pp. 1-10.
