2009 Third UKSim European Symposium on Computer Modeling and Simulation
Development of Novel Data Compression Technique for Accelerate DNA Sequence Alignment Based on Smith–Waterman Algorithm Al Junid, S. A. M.; Haron, M.A.; Abd Majid, Z.; Halim, A.K.; Osman, F.N.; Hashim, H. Faculty of Electrical Engineering University Technology MARA (UiTM) 40450 Shah Alam E-mail:
[email protected],
[email protected]
functionality and determine the synchronization of two sequences align that are very useful to find related genes, disease and etc. There are many existing tools and algorithm for sequence alignment. Among those, FASTA [2] and BLAST [3] are two commonly used ones, where the time complexity has been reduced through some heuristic algorithms. These heuristic algorithms obtain efficiency but degrade the sensitivity of the algorithms. As a result, a distantly related sequence may not be found in a search using these heuristic algorithms since it not details. As a consequence, the use of the Smith-Waterman algorithm is becoming more and more widespread. As the Smith-Waterman search guarantees to find optimal local alignments and returns only one result per comparison. Therefore, a basic rule is that the SmithWaterman algorithm [4] which is based on dynamic programming technique that provides high sensitivity should be used when determine for exact results are more important than time duration. The Smith-Waterman algorithm is a complex and sensitive algorithm for the DNA sequencing [4]. High level of accuracy and sensitivity compared to other algorithm make the algorithm is applicable until today. However, the implementation of the algorithm has become more challenging as the algorithm suffering for complex memory requirements and poor performance due to high sensitivity of the large sequence long. One of the biological sequencing solutions uses software analysis approach on general purpose computer platform [5]. However, the increment in biological sequences from year to year has reduced its efficiency. Pertaining to this situation, some use cluster technique for DNA sequencing analysis by running it on multiple general purpose computers for performed a supercomputing solution platform. Subsequently, sensitivity issues and higher in implementation cost make it less attractive in the long run. Nonetheless, the hardware driven biological sequencing analysis based on Field Programmable Logic Array (FPGA) implementation is found to be more suitable and offers a lot rooms for optimization since its provides flexibility in design. FPGA features can support for high performance applications with existing parallel computing features while it required low development and production cost and fast time consuming – develop.
Abstract—This paper presents the development of high performance accelerating technique for DNA sequences alignment. The scope of the paper focuses on speed optimization and memory reduction of the existing technique on initialization module. The novel development of the optimization using data compression technique for accelerates the Smith-Waterman (SW) algorithm has been revealed through this paper. This technique has been implemented on hardware based acceleration device. The development is targeted to Altera Cyclone II 2C70 FPGA and using 50MHz oscillator. The code is written in verilog HDL syntax using Quartus 2 version 7.2 and the simulation is verified using Quartus 2 version 7.2 simulator tool. The theoretical analysis and simulation result based on implementation of the design using FPGA are presented and well organized in this paper. The comparative study based on theoretical and simulation results of this technique has been made to accomplish result of analysis. The compilation result for data compression technique development of SW algorithm consisting of 73 logic elements with 93.75% reduction in memory space requirement. Keywords-Data Compression, Smith-Waterman algorithm, DNA sequence alignment, bioinformatic, FPGA
I.
INTRODUCTION
Nowadays, the demand for sensitive and high performance DNA sequence alignment tools for DNA sequences analysis study and investigation is dramatically increased from year to year. This pattern is supported and proven where the size of genomic database is double in every 16 month [1]. Besides, the evidence of proven pharmaceutical industry failure due to bad or high chemical contained for fast effect in pharmaceutical product and bad reaction to consumer can be detected via DNA sequences comparison. This phenomenon cause an investigation must be made to protect the consumer and to prove evidence of the unethical pharmaceutical company in court. For the entire today applications and demand, the complexity and sensitivity of DNA sequences alignment tools still become the main issues. Basic sequence alignment is to find the similarity between two sequences. The operation is subject to compare between a particular query (search) sequence and a subject (search) sequence. This operation allows molecular biologist to point out sequences sharing common subsequences. From a biological point of view, it leads to identifying similar 978-0-7695-3886-0/09 $26.00 © 2009 IEEE DOI 10.1109/EMS.2009.93
179 181
II.
SMITH-WATERMAN ALGORITHM
The technique of global alignment based on dynamic programming was introduced by Needleman and Wunsch [6] and Sellers [7]. It was followed by Smith and Waterman with the algorithm to identify the common molecular subsequence but the technique is on local alignment [4]. Smith-Waterman algorithm incorporates the evolutionary insertions and deletions techniques with the scoring function. After that, Gotoh [8] introduced some modifications by considering fine gap penalties and still be used until today for calculate the score alignment when involve with Smith-Waterman algorithm. Myres and Miller [9] modified this algorithm further in order to reduce the space required by introducing the quadratic time and linear space algorithm for optimization of Gotoh’s modification. Then its followed by Aho, Hirschberg and Ulman [2] who introduced the comparison algorithm based on equal and non equal but it still suffering from poor performance and space complexity problem. The Smith-Waterman algorithm is rigorous and based on dynamic programming algorithm. It compares the sequences of functions based on local alignment in order to find the match bit between two sequences. The Smith-Waterman algorithm is use to compute the optimal local alignment of two sequences. The procedure consists of two steps:
Figure 1.
Alignment of 9 based-pair DNA sequence using SmithWaterman algoritm
III.
S-W ALGORITHM MODULE
In this section, we will introduce the new method for simplifying the S-W algorithm from the existing procedure that has been discussed in section II. The two procedure steps for S-W algorithm are still complex for sequence alignment process. To reduce the complexity, these two procedures is simplified by breaking in into sub module pertaining to its function using function partitioning technique based on divide and conquer algorithm approach. From two procedures, it is broke up into four module as shown in figure 2. The first module is initialization module that prepares the data for alignment. The process of compare both of the sequences and find the dynamic matrix result is taken by dynamic matrix module. Third module known as Optimal Path Module is used to determine the optimal result for two DNA sequences alignment while the last module is Score Calculation Module that used to determine the maximum score for two DNA sequences alignment.
1) Fill in the dynamic prgramming matrix. 2) Find the maximal value (score) and trace back the patch that leads to maximal score to find the optimal local alignment. The basis of a Smith-Waterman search is comparison of the two DNA sequences. It used individual pair-wise comparison between characters such equation (1): For 0 ≤ i ≤ M , 0 ≤ j ≤ N, Di0 = D0j = 0 For 1 ≤ i ≤ M and 1 ≤ j ≤ N, Dij = 0 or
Dij = max
(1)
d = penalty Sbt = substitution matrix i = matrix cell row of search sequence j = matrix cell row of target sequence M = maximum length of search sequence N = maximum length of target sequence Dij = dynamic matrix cell
Figure 2. Simplified module for S-W a;gorithm using function partitioning technique.
All of the modules are focusing on theirs specific function and not overlap to each other. This technique has reduced the complexity of the algorithm itself by defining the specific function for each of the module.
The complete alignment of 9 based pair of two DNA sequences using Smith-Waterman algorithm as shown in figure 1.
182 180
comparison, memory requirement and size of FPGA architecture design. The DNA characters is assigned with two bit data representation as shown in Table 2, where A is represented by “00”, C with “01”, G with “10” and lastly T with “11”.
From the entire modules, the main objective of this paper is to narrow down the initialize module and optimize this module as the result to accelerate S-W algorithm as general target. IV.
INITILIZATION MODULE TABLE II.
Initialization module function is to prepare the DNA sequences data for alignment process. The problem for the existing technique is that it’s required large of memory area for store the data before alignment process can be performed. This memory space requirement for storing the data is proportion to length of DNA sequences.
ΜΣ
(2)
The memory space requirement for search sequences, Ms is proportion to the length of the search sequences, SL
ΜΤ
Name adenine cytosine guanine thymine
(3)
ASCII 65 67 71 84
Characters
adenine cytosine guanine thymine
A C G T
Reduction Data 00 01 10 11
1) The S and T data are converted into two bit data width using Reduction Data Width technique. We have S = { 00 01 01 10 10 11} and T = { 01 00 01 10 10 11}. 2) S and T is divided with specific P width. After the partition process, we have the new data S1 = {00 01 01}, S2={10 10 11} , T1 = {01 00 01} and T2 = {10 10 11}. 3) All of new four DNA are stored inside the memory register.
ASCII CODE DNA CHARACTER DATA REPRESENTATION
Characters A C G T
Name
B. Data Conmpression Technique Data Compression Technique (DCT) purposes is to compress the data from a group of data to become one set of data. Let we have twelve DNA data which search sequences S = {a c c g g t } and target sequence T = { c a c g g t}. Both of the sequences have six pair or six lengths of data. In data compression technique, this data will be compressed to become set of data with data width, P. In this example, we choose P=3. The next procedure is to convert the compressed data as follow:
While the memory space requirement for search sequences, MT is proportion to the length of the target sequences, TL. From equation (2) and (3), we can conclude that the requirement of the memory to store DNA sequences is double the length of the DNA sequences since for DNA sequences alignment is required two set of DNA sequences data which are target and search data. The maximum DNA sequences reported today is three billion [5]. By taking these statements as bench mark, the number of memory space requirement will increase. The format of the standard DNA data is written in ASCII code format with eight bit data width as shown in Table 1, in which each characters is been represent with it specific ASCII code. This initial problem contributed for large memory space requirement. TABLE I.
PROPOSED DNA SEQUENCE CHARACTERS REDUCTION DATA ASSIGNMENT
The total numbers of data for both of the sequences are twelve DNA but after the compression technique the total has been merged to only four data {S1, S2, T1 and T2}. The process of data compression is shown in Figure 3 in which it starts with Reduction Data Width technique where it converts the standard ASCII code to two bit data width. Second procedure is to combine the data with specific data width, P. Lastly, the merge data are stored inside the memory or register for a new data.
Data Bit 0110 0101 0110 0111 0111 0001 1000 0100
In this initialization module, we have introduced two new techniques to reduce the memory space requirement problem and increase the performance of the S-W algorithm simultaneously. The two techniques is illustrated in figure 3, so called Reduction Data Width and Data Compression Technique. A. Reduction Data Width The purpose of this technique is to reduce the data width from standard ASCII format to two bit width data format. This technique narrows down the data width to the minimum data width. Then, it will reduce the probability of
183 181
analysed according to performance and memory requirement. Performance of the data compression technique is proportion to the data width size, P. The effect of P for the performance is shown in Figure 5. From figure 5, it shows data compression technique can reduce the numbers of data involved in DNA sequences alignment from 50% up to 97%. This shows that data compression can increase the performance of DNA sequences alignment based on reduction numbers of data involved in alignment process.
Figure 3. Process of data compression technique for P=16
The comparison of this technique with the existing technique is shown in Figure 4 where it shows that this technique has simplified the data and reduced the size of memory usage of data storage. Figure 5. Effect of data width in data compression technique for DNA sequences alignment.
For memory requirement, it is analysed with the number of data which it can be compressed to perform single data and the result for this effect is illustrated in Figure 6. From Figure 6, the Data Compression Technique provide superior reduction compare to existing technique [4],[5],[11],[12]. These shows that the data compression technique can reduce the memory space requirement for analyse DNA sequences alignment sixteen times better then before.
Figure 4. Comparison between Data Compression Technique and existing tehcnique.
V.
RESULT AND DISCUSSION
A. Theoritical The theoretical results for development of the data compression technique for DNA sequences alignment are
Figure 6. Effect of Data Compression Technique for DNA sequences alignment on memory space requirement.
184 182
B. Implementation The Data Compression Technique for DNA sequences alignment is written in Verilog HDL by using Quartus II version 7.2 The Verilog codes are synthesized and targeted and targeted on Altera Cyclone II 2C70 FPGA. The simulation process is verified using simulator tools in Quartus II version 7.2. Figure 7 shows the evidence of RTL view of 4 based-pair of two DNA sequences using Data Compression Technique. The implementation need overall 73 logics elements.
sequences alignment process on targeted hardware. A part from that, this paper has achieved the design target for reducing the complexity of the DNA sequences alignment of both memory and performance complexity factor by introduced this novel of Data Compression Technique. REFERENCES [1]
D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, D.L. Wheeler, “GenBank, Nucleic Acids Res.”, Jan 1;33(Database issue):D34-8, 2005. [2] A.V.Aho, D.S. Hirscgberg, J.D. Ullman, “Bounds on the complexity of the longest common subsequence problem”J. of ACM, 23(1): 1-12, 1976 [3] D. J. Lipman and W. R. Pearson, "Rapid and Sensitive Protein Similarity Searches," Science, vol. 227, pp. 1435-41, 1985. [4] T.F. Smith, M.S. Waterman, “Identification of common molecular subsequences” J. of Molecular Biology, 147(1): 195-197, 1981 [5] F. Zhang, X-Z. Qiao, Z-Y. Liu, “A parallel smith-waterman algorithm based on divide and conquer,” ICA3PP ’02, 2002. [6] S.B. Needleman, C.D. Wunsch, “A general method applicable to search the similarities in the amino acid sequence of two sequences” J. of Molecular Biology, 48(3): 443-453, 1970 [7] P.H. Sellers, “On the theory and computational of evolutionary distances”, SIAM J. of Applied Mathematics, 26: 787-793, 1974 [8] O. Gotoh, “An improved algorithm for matching biological sequences,” J. of Molecular Biology, 162(3): 705-708, 1982 [9] E.W. Myres, W. Miller, “Optimal alignments in linear space,” Comp. App. in the Biosciences, 4(1): 11-17, 1988 [10] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic Local Alignment Search Tool," J Mol Biol, vol. 215, pp. 40310, 1990. [11] Syed Abdul Mutalib Al Junid, Zulkifli Abd Majid, Abdul Karimi Halim, “Development of DNA Sequensing Accelerator based on Smith-Waterman Algorithm with Heuristic Divide and Conquer Technique for FPGA Implementation” ICCCE 08, 2008. [12] Syed Abdul Mutalib Al Junid, Zulkifli Abd Majid, Abdul Karimi Halim “ High Performance DNA Sequences Alignment” ICED 08, 2008
C. Simulation The simulation for the implementation has been done using Quartus 2 version 7.2 simulator tools and the clock input is set to 50MHz and targeted device is Altera Cyclone 2 EP2C70. The simulation of the initialization module which consists of both the Reduction Data Width and Data Compression Technique is shown in Figure 8. D. Comparative Study Between Theoritical and Implementation Data Compression Technique has shown superior advantage as compare to existing technique [11],[12], according to theoretical study. This technique can be implemented in FPGA and it has been proved that implementation experiment based on this technique on FPGA is feasible as shown in Figure 7 and Figure 8. VI.
CONCLUSION
This paper proposes a novel technique in Data Compression for initialization module in DNA sequences alignment. Through this technique, the DNA sequences data has been compressed to minimize the number of data involved in DNA sequences alignment process and reduce the number of alignment cycle process. The maximum data can be compressed is sixteen DNA sequences data length with reduction up to 93.75%. The implementation has verified that Data Compression Technique can be implemented via FPGA platform to accelerate DNA
[13] .
185 183
enable
counter_out~[3..0] counter_out[0]~reg0 D
PRE
counter_out~[7..4]
SEL
Add0
SEL
DATAA
ENA
OUT0
DATAA 4' h0 --
CLR A[3..0]
counter_out[1]~reg0
OUT[3..0]
4' h1 --
counter_out~[19..16]
SEL OUT0
DATAB
DATAA DATAB
MUX21
MUX21
Q
counter_out~[15..12]
SEL 4' h0 --
MUX21
PRE
DATAA
OUT0
DATAB
DATAB
D
counter_out~[11..8]
Q
clock
MUX21
B[3..0]
counter_out~[23..20]
SEL OUT0
DATAA DATAB
MUX21
DATAA DATAB
Equal0 data2[0]~reg0
PRE OUT0
D
D
CLR
data1[0]~reg0 PRE
data2[1]~reg0 D
CLR
PRE
ENA
Q
DATAA
DATAB
OUT0
DATAB
DATAB
OUT0
DATAA
data2[7]~reg0
MUX21
PRE
data2[2]~reg0 4' h4 --
data1[0]~reg0_OUT0
SEL
OUT
CLR
DATAB
OUT0
D
Q
PRE
B[3..0]
data1~_OUT0 Add4_OUT
counter_out[3..0]
SEL OUT0
data2~[47..40]
CLR A[3..0]
ENA
DATAA
Q
ENA
Q
counter_out[2]~reg0 D
D
SEL DATAA
PRE
EQUAL
Equal6
data2~[39..32]
ENA
OUT
B[3..0]
ADDER
CLR
Q
data2~[23..16]
A[3..0]
CLR 4' h1 --
ENA
SEL
PRE Q
ENA MUX21
data2~[31..24]
counter_out[3]~reg0
SEL OUT0
D
Q
MUX21
ENA
MUX21
CLR
data2[7..0]
ENA
EQUAL
CLR
Equal2
Add2 MUX21
data2[3]~reg0 PRE D
A[3..0]
Q
A[2..0]
OUT 3' h1 --
ENA 4' h3 --
B[3..0]
OUT[2..0]
Add2_OUT
B[2..0]
CLR ADDER
data2[4]~reg0
EQUAL
PRE
Equal1
D
Q
ENA CLR A[3..0]
OUT
data2[5]~reg0 4' h2 --
B[3..0]
D
PRE
Q
ENA
EQUAL
CLR
data1~[40..33] data2[6]~reg0 data1[1]~reg0
SEL
PRE D
PRE D
Q
Q
data1~[48..41] ENA
ENA
data1~[56..49]
SEL
DATAA
CLR
OUT0
CLR
DATAA SEL OUT0
DATAB
data1~[64..57]
DATAA
data1[2]~reg0 PRE D
8' h00 --
Q
SEL DATAB
DATAB
OUT0
DATAA 6' h00 --
DATAB
MUX21 ENA
OUT0
data1[2]~reg0_OUT0
MUX21
data1~[26..21]
CLR
data1[3]~reg0
DATAA
PRE
DATAB
D
data1~[32..27]
Q
ENA
MUX21
MUX21
SEL
SEL OUT0
DATAA
data1[3]~reg0_OUT0
MUX21
OUT0
DATAB
CLR
data1[4]~reg0 PRE D
Q
data1[4]~reg0_OUT0
MUX21
Add1
data1~[10..7]
ENA
data1~[14..11]
CLR SEL DATAA
A[3..0]
data1[5]~reg0
OUT[3..0]
SEL OUT0
DATAB
DATAA
PRE D
Q
4' h1 --
B[3..0]
DATAB
MUX21
data1[5]~reg0_OUT0
OUT0
ENA ADDER
CLR
data1~[2..1] MUX21
data1[6]~reg0 PRE D
SEL
0 1
Q
1
data1~0
DATAA 2' h0 --
data1[6]~reg0_OUT0
OUT0
DATAB
ENA CLR
MUX21
data1[7]~reg0 PRE D
Q
ENA CLR
data1[7..0]
Equal4 data_in[7..0]
A[7..0] 8' h01 --
data1[7]~reg0_OUT0 Equal4_OUT data1[1]~reg0_OUT0
OUT
B[7..0]
EQUAL
Equal3
A[7..0] 8' h00 --
OUT
B[7..0]
EQUAL
Equal5
A[7..0] 8' h02 --
Equal5_OUT
OUT
B[7..0]
EQUAL
data2~_OUT0 data1~_OUT0 reset
Figure 7. RTL view of Initliazation Module for 4 based-pair DNA sequences
Figure 8. Simulation result for Initilization Module for 4based-pair DNA sequences.
186 184