Parallel Processing in Sequence Matching

Friman Sánchez∗, Esther Salamí∗, Alex Ramirez∗, Mateo Valero∗

∗ Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

ABSTRACT The comparison and analysis of DNA and protein sequences are important tasks in molecular biology and bioinformatics. One of the most widely recognized algorithms for the string-matching operation in these tasks is the Smith-Waterman (SW) algorithm. However, this algorithm is computationally intensive, and for this reason many researchers have developed heuristic strategies to avoid its use. In this poster, we study several strategies for implementing the SW algorithm on processors that include 1-dimensional Single-Instruction Multiple-Data (SIMD) extensions. Additionally, an optimized implementation of the algorithm is proposed. This optimization extracts more data-level parallelism than previous implementations and reduces the execution time by 30%.

KEYWORDS: Smith-Waterman; Bioinformatics; SIMD Extension; Sequence Matching

1 Introduction

During the last decades, major advances in molecular biology and genetics have led to rapid growth in the biological information generated by the scientific community. This flow of genomic information has required not only computerized databases to store, organize, and index the data, but also specialized tools to view and analyze it. One important task is the matching of DNA and protein sequences. The Smith-Waterman algorithm [SW81] is one of the most widely recognized algorithms for quantifying the similarity of a pair of sequences. However, its computationally intensive nature is a very restrictive factor that limits its use. For this reason, many heuristic strategies, such as BLAST [AGM+90], have been proposed to reduce the computational space in search tasks. However, this reduction is obtained at the expense of sensitivity: some distantly similar sequences cannot be detected in a search using heuristic methods. General-purpose processors (GPPs) with parallel processing capabilities can be used to run the algorithms for database searching, and there have been some proposals to use GPPs with SIMD extensions to execute the SW algorithm. In this work, different implementations of the SW algorithm on GPPs have been studied. Additionally, another way of implementing the SW algorithm is discussed and evaluated. This new implementation reduces the search execution time by around 30% compared to the best implementation of the algorithm [TR00].

E-mail: {fsanchez,esalami,aramirez,mateo}@ac.upc.edu

Figure 1: Data-dependency graph in the execution of the Smith-Waterman algorithm.

2 Smith-Waterman Algorithm

SW is a dynamic programming algorithm for computing the optimal local-alignment score. It considers alignments of any length, at any location, in either sequence, and finds the one with the highest score. To quantify this process, a substitution score matrix indicates the score associated with matching one amino acid with another. Given a query sequence A of length m, a database sequence B of length n, a substitution score matrix Z, a gap-open penalty q and a gap-extension penalty r, the optimal local-alignment score T can be computed by the following recursion relations:

$$
\begin{aligned}
e_{i,j} &= \max\{e_{i,j-1},\ h_{i,j-1} - q\} - r \\
f_{i,j} &= \max\{f_{i-1,j},\ h_{i-1,j} - q\} - r \\
h_{i,j} &= \max\{h_{i-1,j-1} + Z[A[i], B[j]],\ e_{i,j},\ f_{i,j},\ 0\} \\
T &= \max_{i,j} h_{i,j}
\end{aligned}
$$

Here, $e_{i,j}$ and $f_{i,j}$ represent the maximum local-alignment score involving the first i symbols of A and the first j symbols of B, and ending with a gap in sequence B or A, respectively. The overall maximum local-alignment score involving the first i symbols of A and the first j symbols of B is represented by $h_{i,j}$. The recursion should be calculated with i going from 1 to m and j from 1 to n, starting with $e_{i,j} = f_{i,j} = h_{i,j} = 0$ for all i = 0 or j = 0. The order of computation of the values in the alignment matrix is strict, because the value of a cell cannot be computed before the values of the cells to its left and above it have been computed. Figure 1 shows the data dependencies in the calculation.
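The recursion can be sketched directly as a scalar routine. The following Python is a minimal illustration, not the code evaluated in this work. The function name `smith_waterman_score` and the flat `match`/`mismatch` scores standing in for the substitution matrix Z are assumptions made for brevity. In this sketch e is carried along j and f along i, each opened from the neighbouring h value in the same direction.

```python
def smith_waterman_score(a, b, q=10, r=1, match=5, mismatch=-4):
    """Return the optimal local-alignment score T of sequences a and b.

    q is the gap-open penalty and r the gap-extension penalty.
    match/mismatch are a hypothetical stand-in for the matrix Z.
    """
    m, n = len(a), len(b)
    # Row 0 and column 0 hold the boundary condition e = f = h = 0.
    h = [[0] * (n + 1) for _ in range(m + 1)]
    e = [[0] * (n + 1) for _ in range(m + 1)]
    f = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Gap running along j: extend e or open a new gap from h.
            e[i][j] = max(e[i][j - 1], h[i][j - 1] - q) - r
            # Gap running along i: extend f or open a new gap from h.
            f[i][j] = max(f[i - 1][j], h[i - 1][j] - q) - r
            # Substitution score for the pair A[i], B[j] (1-based in the text).
            z = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(h[i - 1][j - 1] + z, e[i][j], f[i][j], 0)
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    # Twelve matches (12 * 5) minus one gap of length two (q + 2r).
    print(smith_waterman_score("ABCDEFGHIJKL", "ABCDEFXXGHIJKL"))
```

Every cell reads its left, upper, and upper-left neighbours, which is the dependency pattern that the parallel strategies have to respect.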

Figure 2: Strategies to Execute Smith-Waterman Algorithm Using 1-Dimensional SIMD Extensions

3 Strategies to Execute Smith-Waterman Algorithm Using 1-Dimensional SIMD Extensions

Figure 2 shows some ways to exploit the parallelism in the computation. First, we can process vectors of cells parallel to the minor diagonal of the matrix (figure 2a). However, this alternative suffers from memory problems due to non-uniform access to data in the cache. Second, calculations can be made on vectors of cells parallel to the query sequence (figure 2b). This strategy has to deal with data dependencies within the vector. It takes advantage of the fact that in most cells of the matrix e and f are zero, and hence do not contribute to h. As long as h is less than the threshold q + r, which is the penalty of a single-symbol gap, e and f stay at zero along a column or row of the matrix. This characteristic of the problem saves many computations: the data dependencies in the calculation of h can be removed and the computation simplified. However, if any of the cell values is above the threshold, the full computation of the h-values must be carried out. This alternative was evaluated by Rognes [TR00]. A third alternative, which extracts more data-level parallelism (DLP) from the problem, combines the first and second: it processes a vector of cells parallel to the query sequence and, at the same time, a vector of cells from the following column. This is possible because the two vectors of cells have no dependencies between them. This alternative is shown in figure 2c. It does not eliminate the data dependencies within a group, so they must be handled as in the previous alternative.
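The threshold idea behind the second strategy can be illustrated with a scalar emulation. This is only a sketch in the spirit of that strategy, not the AltiVec code evaluated here or the implementation from [TR00]. The function name `sw_score_lazy_f`, the `width` parameter that stands in for the SIMD vector length, and the flat `match`/`mismatch` scores are all assumptions. It also assumes non-negative penalties, with e depending only on the previous column and f running down the current column.

```python
def sw_score_lazy_f(a, b, q=10, r=1, match=5, mismatch=-4, width=8):
    """Local-alignment score, column by column, in chunks of `width` query cells.

    Each chunk emulates one vector parallel to the query sequence.
    Assumes q >= 0 and r >= 0.
    """
    m = len(a)
    h_prev = [0] * (m + 1)  # h of the previous column; index 0 is the boundary
    e_prev = [0] * (m + 1)  # e of the previous column
    best = 0
    for sym in b:  # one matrix column per database symbol
        h_cur = [0] * (m + 1)
        e_cur = [0] * (m + 1)
        f_in = 0  # f entering the next chunk; any value <= 0 behaves like 0
        for start in range(1, m + 1, width):
            rows = range(start, min(start + width, m + 1))
            # "Vector" step: e and the provisional h read only the previous
            # column, so the cells of a chunk do not depend on each other.
            for i in rows:
                e_cur[i] = max(e_prev[i], h_prev[i] - q) - r
                z = match if a[i - 1] == sym else mismatch
                h_cur[i] = max(h_prev[i - 1] + z, e_cur[i], 0)
            if f_in <= 0 and max(h_cur[i] for i in rows) <= q + r:
                # Fast path: no h exceeds the single-symbol gap penalty, so
                # f cannot become positive inside this chunk and is skipped.
                f_in = 0
            else:
                # Slow path: resolve the dependency along the query in order.
                f = f_in
                for i in rows:
                    h_cur[i] = max(h_cur[i], f)
                    f = max(f, h_cur[i] - q) - r
                f_in = f
            best = max(best, max(h_cur[i] for i in rows))
        h_prev, e_prev = h_cur, e_cur
    return best


if __name__ == "__main__":
    # The result should not depend on the emulated vector width.
    for w in (1, 4, 16):
        print(w, sw_score_lazy_f("ABCDEFXXGHIJKL", "ABCDEFGHIJKL", width=w))
```

The fast path is the common case when most cells stay below the threshold. A SIMD version would replace the first inner loop by vector instructions and take the sequential branch only when the test fails.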

4 Experimental Methodology

The applications were implemented on a PowerPC 970 processor using the AltiVec SIMD extension. The evaluations were done by searching a set of 11 different protein sequences against the Swiss-Prot database. The length of the query sequences ranged from 88 to 500 amino acids. The gap-open penalty is 10 and the gap-extension penalty is 1; additionally, we used the BLOSUM62 amino-acid substitution score matrix [HP99]. In summary, the evaluated applications are: scalar1: best known scalar implementation, present in the SSEARCH program; sw_par1: parallel implementation based on the Rognes strategy (figure 2b); sw_par2: optimized parallel implementation (processing 3 columns, figure 2c); h_rognes: parallel implementation based on the Rognes heuristic [TR00]; blast: the BLAST program. Figure 3a shows the time required by each application to search for similar sequences in the database. Figure 3b is a zoom of the first figure. The horizontal axis in the figures indicates the length of the query sequences.

[Figure 3, panel (a) Execution Time [1]: series ssearch, sw-par1, sw-par2, h-rognes, blast. Panel (b) Execution Time [2]: series sw-par1, sw-par2, h-rognes, blast.]

Figure 3: Execution time for several SW implementations and some heuristic applications

5 Conclusions

As can be seen, the heuristic strategies run faster than the SW implementations. However, our SW optimization (sw_par2) is on average 30% faster than the Rognes implementation of the same algorithm. SIMD extensions are therefore useful for extracting parallelism in the SW algorithm. It would be interesting to evaluate other alternatives for parallelism in the SW algorithm using SIMD extensions; in fact, we are exploring some other strategies.
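As a rough reading of the 30% figure, assuming it denotes a reduction in execution time as stated in the abstract, the corresponding speedup over the baseline would be as follows. The symbols $S$, $t_{\text{base}}$ and $t_{\text{opt}}$ are introduced here for illustration only.

$$
S = \frac{t_{\text{base}}}{t_{\text{opt}}} = \frac{t_{\text{base}}}{(1 - 0.30)\, t_{\text{base}}} = \frac{1}{0.70} \approx 1.43
$$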

References

[AGM+90] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[HP99] J. G. Henikoff, S. Henikoff, and S. Pietrokovski. Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15, 1999.

[SW81] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[TR00] Torbjørn Rognes and Erling Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16, 2000.
