Development of Novel Data Compression Technique for ... - IEEE Xplore

2009 Third UKSim European Symposium on Computer Modeling and Simulation

Development of Novel Data Compression Technique for Accelerate DNA Sequence Alignment Based on Smith–Waterman Algorithm

Al Junid, S. A. M.; Haron, M.A.; Abd Majid, Z.; Halim, A.K.; Osman, F.N.; Hashim, H.
Faculty of Electrical Engineering, University Technology MARA (UiTM), 40450 Shah Alam
E-mail: [email protected], [email protected]

Abstract-This paper presents the development of a high-performance acceleration technique for DNA sequence alignment. The scope of the paper is the speed optimization and memory reduction of the existing technique in the initialization module. A novel optimization that uses a data compression technique to accelerate the Smith-Waterman (SW) algorithm is presented. The technique has been implemented on a hardware-based acceleration device. The development targets an Altera Cyclone II 2C70 FPGA with a 50 MHz oscillator. The code is written in Verilog HDL using Quartus II version 7.2, and the simulation is verified using the Quartus II version 7.2 simulator tool. The theoretical analysis and the simulation results based on the FPGA implementation of the design are presented, and a comparative study of the theoretical and simulation results is made to complete the analysis. The compiled design of the data compression technique for the SW algorithm consists of 73 logic elements and achieves a 93.75% reduction in memory space requirement.

Keywords-Data Compression, Smith-Waterman algorithm, DNA sequence alignment, bioinformatics, FPGA

I. INTRODUCTION

Nowadays, the demand for sensitive and high-performance DNA sequence alignment tools for DNA sequence analysis is increasing dramatically from year to year. This trend is supported by the fact that the size of genomic databases doubles every 16 months [1]. In addition, failures in the pharmaceutical industry, such as products that contain harmful or excessive chemicals for fast effect and cause bad reactions in consumers, can be detected through DNA sequence comparison. Such cases require an investigation to protect consumers and to provide evidence against an unethical pharmaceutical company in court. Across all of today's applications and demands, the complexity and sensitivity of DNA sequence alignment tools remain the main issues.

Basic sequence alignment finds the similarity between two sequences. The operation compares a particular query (search) sequence with a subject (target) sequence. This allows molecular biologists to point out sequences that share common subsequences. From a biological point of view, it leads to identifying similar functionality and determining how well two sequences align, which is very useful for finding related genes, diseases, and so on.

There are many existing tools and algorithms for sequence alignment. Among those, FASTA [3] and BLAST [10] are two commonly used ones, in which the time complexity has been reduced through heuristic algorithms. These heuristics gain efficiency but degrade the sensitivity of the search. As a result, a distantly related sequence may not be found by a heuristic search, since the search is not exhaustive. As a consequence, the use of the Smith-Waterman algorithm is becoming more and more widespread, because a Smith-Waterman search is guaranteed to find the optimal local alignment and returns only one result per comparison. Therefore, a basic rule is that the Smith-Waterman algorithm [4], which is based on dynamic programming and provides high sensitivity, should be used when exact results are more important than run time.

The Smith-Waterman algorithm is a complex and sensitive algorithm for DNA sequence alignment [4]. Its high level of accuracy and sensitivity compared to other algorithms keeps it applicable today. However, implementing the algorithm has become more challenging, because it suffers from complex memory requirements and from poor performance on long sequences as a result of its high sensitivity.

One of the biological sequence analysis solutions uses a software approach on a general-purpose computer platform [5]. However, the growth in biological sequence data from year to year has reduced its efficiency. In response, some use a cluster technique that runs the DNA sequence analysis on multiple general-purpose computers to form a supercomputing platform. However, sensitivity issues and a higher implementation cost make this less attractive in the long run.

By contrast, hardware-driven biological sequence analysis based on a Field Programmable Gate Array (FPGA) implementation is found to be more suitable and offers a lot of room for optimization, since it provides flexibility in design. FPGAs can support high-performance applications through their parallel computing features while requiring low development and production cost and a short development time.

978-0-7695-3886-0/09 $26.00 © 2009 IEEE
DOI 10.1109/EMS.2009.93


II. SMITH-WATERMAN ALGORITHM

The technique of global alignment based on dynamic programming was introduced by Needleman and Wunsch [6] and Sellers [7]. It was followed by Smith and Waterman's algorithm for identifying common molecular subsequences, which performs local rather than global alignment [4]. The Smith-Waterman algorithm incorporates evolutionary insertions and deletions into the scoring function. Gotoh [8] later introduced modifications that consider affine gap penalties, which are still used today to calculate the alignment score with the Smith-Waterman algorithm. Myers and Miller [9] modified the algorithm further to reduce the space required, introducing a quadratic-time, linear-space algorithm that optimizes Gotoh's modification. Aho, Hirschberg and Ullman [2] introduced a comparison algorithm based on equal and not-equal tests, but it still suffers from poor performance and space complexity problems.

The Smith-Waterman algorithm is rigorous and is based on dynamic programming. It compares sequences on the basis of local alignment in order to find the matches between two sequences, and it is used to compute the optimal local alignment of two sequences. The procedure consists of two steps:

Figure 1. Alignment of a 9 base-pair DNA sequence using the Smith-Waterman algorithm.

III. S-W ALGORITHM MODULE

In this section, we introduce a new method that simplifies the S-W algorithm relative to the existing procedure discussed in Section II. The two procedure steps of the S-W algorithm are still complex for the sequence alignment process. To reduce the complexity, the two steps are broken into sub-modules according to their function, using a function partitioning technique based on the divide and conquer approach. The two steps are broken up into four modules, as shown in Figure 2. The first module is the Initialization Module, which prepares the data for alignment. The Dynamic Matrix Module compares the two sequences and computes the dynamic matrix. The third module, the Optimal Path Module, determines the optimal alignment of the two DNA sequences, while the last module, the Score Calculation Module, determines the maximum score of the alignment of the two DNA sequences.

1) Fill in the dynamic programming matrix.
2) Find the maximal value (score) and trace back the path that leads to the maximal score to find the optimal local alignment.

The basis of a Smith-Waterman search is the comparison of the two DNA sequences. It uses individual pairwise comparisons between characters, as in equation (1):

For $0 \le i \le M$ and $0 \le j \le N$: $D_{i0} = D_{0j} = 0$.

For $1 \le i \le M$ and $1 \le j \le N$:

$$D_{ij} = \max \begin{cases} 0 \\ D_{i-1,j-1} + Sbt(S_i, T_j) \\ D_{i-1,j} - d \\ D_{i,j-1} - d \end{cases} \qquad (1)$$

where
d = gap penalty
Sbt = substitution matrix
S_i = i-th character of the search sequence
T_j = j-th character of the target sequence
i = matrix cell row (search sequence)
j = matrix cell column (target sequence)
M = maximum length of the search sequence
N = maximum length of the target sequence
D_ij = dynamic matrix cell
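The two steps and equation (1) can be sketched in software as follows. This is a minimal illustration rather than the hardware design described in this paper: the function name `smith_waterman` and the scoring values (match +2, mismatch -1, gap penalty 1) are assumptions chosen for the example, since the paper does not state its substitution matrix or penalty values.

```python
def smith_waterman(search, target, match=2, mismatch=-1, gap=1):
    """Fill the local-alignment matrix D and trace back one optimal alignment.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    m, n = len(search), len(target)
    # Row 0 and column 0 stay at zero (boundary condition of equation (1)).
    D = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # Step 1: fill in the dynamic programming matrix.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sbt = match if search[i - 1] == target[j - 1] else mismatch
            D[i][j] = max(0,
                          D[i - 1][j - 1] + sbt,  # align the two characters
                          D[i - 1][j] - gap,      # gap in the target sequence
                          D[i][j - 1] - gap)      # gap in the search sequence
            if D[i][j] > best:
                best, best_pos = D[i][j], (i, j)

    # Step 2: trace back from the maximal score until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and D[i][j] > 0:
        sbt = match if search[i - 1] == target[j - 1] else mismatch
        if D[i][j] == D[i - 1][j - 1] + sbt:
            top.append(search[i - 1])
            bottom.append(target[j - 1])
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] - gap:
            top.append(search[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_search, aligned_target = smith_waterman("ACCGGT", "CACGGT")
print(score)
print(aligned_search)
print(aligned_target)
```

Both loops visit every cell of the (M+1) x (N+1) matrix, which is the source of the time and memory cost that the rest of the paper aims to reduce.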

Figure 2. Simplified modules for the S-W algorithm using the function partitioning technique.

Each module focuses on its own specific function, and the modules do not overlap with one another. This technique reduces the complexity of the algorithm itself by defining a specific function for each module.

The complete alignment of two 9 base-pair DNA sequences using the Smith-Waterman algorithm is shown in Figure 1.


From all of these modules, the main objective of this paper is to narrow the focus to the initialization module and to optimize it, with the general aim of accelerating the S-W algorithm.

IV. INITIALIZATION MODULE

The function of the initialization module is to prepare the DNA sequence data for the alignment process. The problem with the existing technique is that it requires a large memory area to store the data before the alignment process can be performed. The memory space required to store the data is proportional to the length of the DNA sequences.

$$M_S \propto S_L \qquad (2)$$

The memory space requirement for the search sequence, M_S, is proportional to the length of the search sequence, S_L,

$$M_T \propto T_L \qquad (3)$$

while the memory space requirement for the target sequence, M_T, is proportional to the length of the target sequence, T_L. From equations (2) and (3), we can conclude that the memory required to store the DNA sequences is double the length of the DNA sequences, since DNA sequence alignment requires two sets of DNA sequence data: the target data and the search data. The longest DNA sequence reported to date is three billion in length [5]. Taking this as the benchmark, the memory space requirement increases accordingly. The standard DNA data format is ASCII code with an eight-bit data width, as shown in Table I, in which each character is represented by its specific ASCII code. This is the initial cause of the large memory space requirement.

TABLE I. ASCII CODE DNA CHARACTER DATA REPRESENTATION

Characters   Name       ASCII   Data Bit
A            adenine    65      0110 0101
C            cytosine   67      0110 0111
G            guanine    71      0111 0001
T            thymine    84      1000 0100

In this initialization module, we introduce two new techniques that reduce the memory space requirement and increase the performance of the S-W algorithm at the same time. The two techniques, Reduction Data Width and Data Compression Technique, are illustrated in Figure 3.

A. Reduction Data Width
The purpose of this technique is to reduce the data width from the standard ASCII format to a two-bit data format. This narrows the data width down to the minimum. It in turn reduces the amount of comparison, the memory requirement, and the size of the FPGA architecture design. Each DNA character is assigned a two-bit data representation as shown in Table II, where A is represented by "00", C by "01", G by "10" and T by "11".

TABLE II. PROPOSED DNA SEQUENCE CHARACTERS REDUCTION DATA ASSIGNMENT

Characters   Name       Reduction Data
A            adenine    00
C            cytosine   01
G            guanine    10
T            thymine    11

B. Data Compression Technique
The purpose of the Data Compression Technique (DCT) is to compress a group of data into a single set of data. Suppose we have twelve DNA data: a search sequence S = {a c c g g t} and a target sequence T = {c a c g g t}. Both sequences have a length of six. In the data compression technique, these data are compressed into sets of data with data width P. In this example, we choose P = 3. The procedure for converting the data is as follows:

1) The S and T data are converted to a two-bit data width using the Reduction Data Width technique. We have S = {00 01 01 10 10 11} and T = {01 00 01 10 10 11}.
2) S and T are divided into partitions of width P. After the partition process, we have the new data S1 = {00 01 01}, S2 = {10 10 11}, T1 = {01 00 01} and T2 = {10 10 11}.
3) All four new DNA data are stored in the memory register.

The total number of data for both sequences is twelve, but after compression the total is merged into only four data {S1, S2, T1, T2}. The process of data compression is shown in Figure 3. It starts with the Reduction Data Width technique, which converts the standard ASCII code to a two-bit data width. The second step combines the data into groups of the specified data width, P. Lastly, the merged data are stored in memory or registers as new data.


analysed according to performance and memory requirement. The performance of the data compression technique is proportional to the data width, P. The effect of P on performance is shown in Figure 5, which shows that the data compression technique can reduce the number of data involved in DNA sequence alignment by 50% up to 97%. Reducing the number of data involved in the alignment process in this way increases the performance of DNA sequence alignment.

Figure 3. Process of data compression technique for P=16
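The process in Figure 3 (Reduction Data Width followed by grouping into words of width P) can be sketched in software as follows. This is an illustrative model only, not the Verilog design of this paper. The names `CODE`, `reduce_width` and `compress` are hypothetical, and the sketch assumes that the sequence length is a multiple of P and that the first character of each group occupies the most significant bits of the word.

```python
# Two-bit assignment for each DNA character (A=00, C=01, G=10, T=11).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def reduce_width(sequence):
    """Reduction Data Width: map each character to its two-bit code."""
    return [CODE[base] for base in sequence.upper()]


def compress(sequence, p):
    """Data Compression Technique: merge every p two-bit codes into one word.

    Assumes len(sequence) is a multiple of p (an assumption of this sketch).
    """
    codes = reduce_width(sequence)
    if len(codes) % p != 0:
        raise ValueError("sequence length must be a multiple of p")
    words = []
    for start in range(0, len(codes), p):
        word = 0
        for code in codes[start:start + p]:
            word = (word << 2) | code  # append two bits to the word
        words.append(word)
    return words


p = 3
s1, s2 = compress("accggt", p)  # search sequence S
t1, t2 = compress("cacggt", p)  # target sequence T
for name, word in (("S1", s1), ("S2", s2), ("T1", t1), ("T2", t2)):
    print(name, format(word, f"0{2 * p}b"))
```

With this packing, twelve characters become four words of 2P bits each, and a group of P characters can be tested for equality with a single word comparison such as `s2 == t2`.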

The comparison of this technique with the existing technique is shown in Figure 4, which shows that this technique simplifies the data and reduces the memory used for data storage.

Figure 5. Effect of data width in data compression technique for DNA sequences alignment.

For the memory requirement, the analysis considers the number of data that can be compressed to form a single data item; the result is illustrated in Figure 6. Figure 6 shows that the Data Compression Technique provides a superior reduction compared to the existing techniques [4],[5],[11],[12]. This shows that the data compression technique can reduce the memory space requirement for DNA sequence alignment analysis by a factor of sixteen.
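The reported percentages are consistent with a simple count of stored entries: if P characters are merged into one word, the number of entries falls by a factor of P, which is a reduction of 1 - 1/P. The short sketch below works through that arithmetic. The formula is an interpretation of the reported figures rather than one stated explicitly in this paper, and the function names are hypothetical.

```python
def entry_reduction(p):
    """Fraction of stored entries saved when p characters share one word."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    return 1 - 1 / p


def width_reduction(original_bits=8, reduced_bits=2):
    """Fraction of bits saved per character by Reduction Data Width."""
    return 1 - reduced_bits / original_bits


for p in (2, 4, 8, 16):
    print(f"P = {p:2d}: {entry_reduction(p):.2%} fewer entries")
print(f"8-bit ASCII to 2-bit code: {width_reduction():.0%} fewer bits per character")
```

Under this reading, P = 2 gives the 50% lower bound and P = 16 gives the 93.75% reduction. Each word is 2P bits wide, so the saving here is in the number of entries to store and compare, while the saving in bits per character comes from the Reduction Data Width step.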

Figure 4. Comparison between Data Compression Technique and existing technique.

V. RESULT AND DISCUSSION

A. Theoretical
The theoretical results for the development of the data compression technique for DNA sequence alignment are

Figure 6. Effect of Data Compression Technique for DNA sequences alignment on memory space requirement.


B. Implementation
The Data Compression Technique for DNA sequence alignment is written in Verilog HDL using Quartus II version 7.2. The Verilog code is synthesized and targeted to the Altera Cyclone II 2C70 FPGA. The simulation is verified using the simulator tools in Quartus II version 7.2. Figure 7 shows the RTL view for two 4 base-pair DNA sequences using the Data Compression Technique. The implementation requires 73 logic elements overall.

C. Simulation
The simulation of the implementation was done using the Quartus II version 7.2 simulator tools, with the clock input set to 50 MHz and the Altera Cyclone II EP2C70 as the target device. The simulation of the initialization module, which consists of both the Reduction Data Width and the Data Compression Technique, is shown in Figure 8.

D. Comparative Study Between Theoretical and Implementation
According to the theoretical study, the Data Compression Technique shows a superior advantage compared to the existing techniques [11],[12]. The technique can be implemented on an FPGA, and the implementation experiment shows that doing so is feasible, as shown in Figure 7 and Figure 8.

VI. CONCLUSION

This paper proposes a novel Data Compression Technique for the initialization module in DNA sequence alignment. With this technique, the DNA sequence data are compressed to minimize the number of data involved in the DNA sequence alignment process and to reduce the number of alignment cycles. At most sixteen DNA sequence data can be compressed into one, giving a reduction of up to 93.75%. The implementation has verified that the Data Compression Technique can be implemented on an FPGA platform to accelerate the DNA sequence alignment process on the targeted hardware. Apart from that, this paper has achieved the design target of reducing both the memory complexity and the performance complexity of DNA sequence alignment by introducing this novel Data Compression Technique.

REFERENCES

[1] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, D.L. Wheeler, "GenBank," Nucleic Acids Res., 33(Database issue): D34-D38, 2005.
[2] A.V. Aho, D.S. Hirschberg, J.D. Ullman, "Bounds on the complexity of the longest common subsequence problem," J. of ACM, 23(1): 1-12, 1976.
[3] D.J. Lipman, W.R. Pearson, "Rapid and sensitive protein similarity searches," Science, 227: 1435-1441, 1985.
[4] T.F. Smith, M.S. Waterman, "Identification of common molecular subsequences," J. of Molecular Biology, 147(1): 195-197, 1981.
[5] F. Zhang, X-Z. Qiao, Z-Y. Liu, "A parallel smith-waterman algorithm based on divide and conquer," ICA3PP '02, 2002.
[6] S.B. Needleman, C.D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. of Molecular Biology, 48(3): 443-453, 1970.
[7] P.H. Sellers, "On the theory and computation of evolutionary distances," SIAM J. of Applied Mathematics, 26: 787-793, 1974.
[8] O. Gotoh, "An improved algorithm for matching biological sequences," J. of Molecular Biology, 162(3): 705-708, 1982.
[9] E.W. Myers, W. Miller, "Optimal alignments in linear space," Comp. App. in the Biosciences, 4(1): 11-17, 1988.
[10] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, "Basic local alignment search tool," J. of Molecular Biology, 215: 403-410, 1990.
[11] Syed Abdul Mutalib Al Junid, Zulkifli Abd Majid, Abdul Karimi Halim, "Development of DNA Sequensing Accelerator based on Smith-Waterman Algorithm with Heuristic Divide and Conquer Technique for FPGA Implementation," ICCCE 08, 2008.
[12] Syed Abdul Mutalib Al Junid, Zulkifli Abd Majid, Abdul Karimi Halim, "High Performance DNA Sequences Alignment," ICED 08, 2008.


Figure 7. RTL view of the Initialization Module for 4 base-pair DNA sequences.

Figure 8. Simulation result of the Initialization Module for 4 base-pair DNA sequences.
