USE OF THE HASH TABLE FOR BUILDING THE DISTANCE MATRIX IN A PAIR-WISE SEQUENCE ALIGNMENT
MUHANNAD A. ABU-HASHEM, NUR'AINI ABDUL RASHID, ROSNI ABDULLAH, ATHEER A. ABDULRAZZAQ AND AWSAN A. HASAN
School of Computer Sciences, Universiti Sains Malaysia, USM Pulau Pinang, 11800, Malaysia
ABSTRACT
In bioinformatics, distance matrices are used for many purposes, such as clustering sequences, representing protein structures without relying on coordinates, constructing phylogenetic trees, and building multiple sequence alignments. Pair-wise alignment plays a significant role in the construction of distance matrices because it rates the similarities and distances between sequences. The N-Gram-Hirschberg (NGH) algorithm is a fast, dynamic-programming pair-wise alignment algorithm that produces the same optimal results as the Smith-Waterman algorithm. In this paper, we present the Hash Table-N-Gram-Hirschberg (HT-NGH) method, a new and practical method for constructing a distance matrix using pair-wise alignment. The HT-NGH method is an enhancement of the NGH method; it is fast and produces the same results as the Smith-Waterman algorithm. The HT-NGH algorithm uses a hash table to enhance the transformation process of its two predecessors, NGH and Hashing-N-Gram-Hirschberg (H-NGH). The proposed enhancement improves the run time and outperforms H-NGH without sacrificing space complexity: the run time of our algorithm is lower than that of the NGH and H-NGH methods by 60% and 30%, respectively. In addition, the complexity of the transformation phase of the HT-NGH algorithm is O(NM/w), compared with O(NM) for NGH, where N is the number of sequences, M is the sequence length and w is the word length.
KEY WORDS
Pair-wise alignment, Hash table, Dynamic programming, Distance matrix, Protein sequence similarity
1 INTRODUCTION
Distance matrices are a basic representation in mathematics, computer science, graph theory, and bioinformatics. In bioinformatics, a distance matrix contains the distances between sequences, obtained by applying a pair-wise alignment. It has been used for various aims, such as building phylogenetic trees [37] and constructing multiple sequence alignments. Distance matrices play an important role in the latter: many popular methods in the field, such as ClustalW [3], DIALIGN [4-6,38-40], T-Coffee [7], MAFFT [8,9], MUSCLE [10], ProbCons [11], Probalign [12], and MSAProbs [13], build a distance matrix using a pair-wise alignment as a pre-processing step of the multiple alignment construction. Therefore, there is an increased need for faster and accurate methods for constructing the distance matrix to cope with the daily discovery of new protein sequences by biologists.
Pair-wise alignment is broadly used for calculating the distances among protein sequences and constructing distance matrices. To improve the performance of pair-wise sequence alignment, many methods using different approaches and techniques have been proposed. Some methods and techniques were novel, while others were adopted from other fields. The Smith-Waterman algorithm [14,15], a dynamic-programming-based algorithm for building sequence alignments, produces optimal results, unlike the heuristic-based sequence alignment algorithms, but it requires a lot of time and space: it builds the alignment in O(MN) time and O(MN) space. The Hirschberg algorithm [16], which is the space-saving version of the Needleman-Wunsch and Smith-Waterman algorithms, produces the alignment in O(min(M, N)) space. An enhancement of the Hirschberg algorithm, named N-Gram-Hirschberg (NGH) [17], was proposed for further space and time reduction. Recently, an extended algorithm, named Hashing-N-Gram-Hirschberg (H-NGH) [18], was proposed to further reduce the run time of the NGH algorithm. This research takes advantage of hash table capabilities to gain a further enhancement over the NGH and H-NGH algorithms.
2 RELATED WORK
A distance matrix is the set of values that gives the distances within a group of sequences; each cell in the matrix holds the distance between two sequences. These values are calculated using a pair-wise sequence alignment, which rates the similarities and distances between the sequences by finding the matched regions between them. The methods of calculating the distances and similarities between sequences using pair-wise alignment can be classified into heuristic-based methods and dynamic-programming-based methods.
A. Heuristic-based methods
The main property shared by the heuristic approaches is that they examine candidate solutions of a problem selectively rather than exhaustively and select the best solution they find. The selected solution is considered a near-optimal solution of the problem because there is no guarantee of finding the optimal solution. However, the algorithms that follow the heuristic approach are quite fast, and they produce results in a reasonable time compared to dynamic-programming-based methods. Many methods have followed this approach to take advantage of its speed.
A heuristic method for sequence search and alignment, named FASTA, was proposed in 1985. It divides the sequences into terms, called K-tuples (words), and then it locates the matched patterns between the two sequences to measure the similarity and produce the alignment [19,20,21]. The locations of all K-tuples that come from the query sequence are stored in a hash table. A similar method to FASTA, called BLAST, was proposed in 1990 and was designed to improve the sensitivity [22,23]. BLAST was also designed to improve the speed of the overall search and to place the database search on a firm statistical foundation. Later, in 2001, another word method, called SSAHA (Sequence Search and Alignment by Hashing Algorithm), was proposed. It uses a hash table to store the K-tuple occurrence positions [24].
B. Dynamic-programming-based methods
Dynamic programming, a term coined by Richard Bellman [25,26], is a technique that solves a problem by recursively dividing it into smaller sub-problems until the sub-problems become indivisible. Then, it builds the solution by combining the solutions of the sub-problems [27,28]. Dynamic programming is guaranteed to find the optimal solution of a given problem, which is an advantage over the heuristic approach, but it usually consumes more run time than the heuristic methods. Because accuracy is very important for sequence alignments, this guarantee gives the dynamic programming methods an advantage over the heuristic approach.
The Needleman-Wunsch algorithm [29], proposed by Saul B. Needleman and Christian D. Wunsch in 1970, is the pioneering method for all dynamic-programming-based methods in the sequence comparison and alignment field. It was introduced for protein sequence alignment and comparison and measures the similarity between the sequences by building a similarity matrix guided by a substitution matrix [30,31]. Building the similarity matrix requires O(MN) space, which is quite high if we consider the length of biological sequences.
The Smith-Waterman algorithm, introduced in 1981 [14], follows the same general workflow as the Needleman-Wunsch algorithm. The main difference between the two methods is that the former computes the similarity between suffixes of the sequences. In addition, the idea is to build a local alignment in which the minimum value of a score is zero. Although the Smith-Waterman algorithm generates optimal results, the time and space requirements are still large.
The complexity is O(MN) in both space and time. In 2006, Abdul Rashid proposed the N-Gram-Smith-Waterman algorithm (NGSW) [2], which extends the Smith-Waterman algorithm. The proposed algorithm outperforms the Smith-Waterman algorithm in terms of space and time requirements, and it reduces these without sacrificing the sensitivity of the original algorithm.
The Hirschberg algorithm [16] proposed a solution for the space issue of the Needleman-Wunsch algorithm, reducing the space to O(min(M, N)). Unlike the Needleman-Wunsch and Smith-Waterman algorithms, the Hirschberg algorithm divides the similarity matrix into two parts. Thus, the similarity matrix can be filled in both directions (top down and bottom up), as observed in Fig. 1. Because the algorithm keeps only one row of the array and two variables, the space used to run the algorithm is linear with respect to the length of the shorter of the two sequences [32,33].
Fig. 1 The splitting of the similarity matrix in the Hirschberg algorithm [1] (axes: Sequence A and Sequence B; the two parts are labelled Ba and Bb)
As an enhancement to the Hirschberg algorithm, the N-Gram-Hirschberg algorithm (NGH) [17] was proposed in 2008. The NGH reduces both the space and the time required to build the alignment. The enhancement has three levels. The first level reduces the protein alphabet from 20 to 10 letters. The second level divides the protein sequence into words of equal length by adopting the N-Gram method. The third level transforms the newly generated words into integers, where each word is represented by one integer, to speed up the comparison process. Because of this enhancement, the NGH method reduces the space complexity and speeds up the alignment without sacrificing the accuracy, especially when the word length is 2.
In 2009, the Hashing-N-Gram-Hirschberg (H-NGH) [18] algorithm was proposed for enhancing the N-Gram-Hirschberg algorithm. This enhancement focuses on the transformation phase of the NGH method and adopts a hash function to transform the words into integers. Because of this adoption, the H-NGH algorithm outperforms the NGH algorithm in terms of speed, but it sacrifices accuracy: its output matched that of the NGH algorithm in 93% of cases on average.
3 MATERIALS AND METHODS
The hash table method has been used in many algorithms for creating pair-wise alignments. The BLAST algorithm is considered the first algorithm that embedded the hash table concept in the pair-wise alignment field [34]. In these pair-wise alignment methods, the hash table supports the K-tuples procedure, which is equivalent in nature to the N-Gram method: they convert the whole dataset of sequences into a hash table by finding and storing the occurrences of each word in the dataset, and then they look up the occurrences of the query sequence words in the dataset. In contrast, the HT-NGH method uses the hash table to represent the words produced by applying the N-Gram method to the reduced protein sequence. In other words, the hash table is used to transform the sequences into integer values to speed up the searching and comparing process.
The proposed method can be considered an extension of the H-NGH method; both methods contribute by enhancing the performance of the NGH transformation phase. The H-NGH algorithm generates the integer values from the words by passing them through a hash function. This process requires many calculations because the value of every single word must be calculated for all of the sequences. In contrast, the HT-NGH algorithm uses a hash table to assign values to the words. It generates and fills the hash table first, and then it fetches the value that represents each word from the hash table. Fig. 2 shows the framework of the proposed method; the details of the method are below.
Fig. 2 Framework scheme of the HT-NGH algorithm
As shown in Fig. 2, we can divide the distance-matrix construction process into two main phases: the protein sequence transformation, and the construction of the alignment and building of the distance matrix.
A. Protein sequence transformation
Sequence transformation is the stage of reforming the protein sequences into a shorter and easier-to-compare form in preparation for the alignment and distance calculation process. This stage consists of reducing the protein sequence alphabet, dividing the protein sequence into words, and constructing the hash table and fetching values.
1) Reducing the protein sequence alphabet
The protein amino acid alphabet consists of 20 characters; each character indicates a different amino acid. This alphabet has been reduced by grouping the amino acids by the similarity of some of their properties. The similarity of the physicochemical properties of the amino acids is one of the criteria that has been used to reduce the alphabet [35]. The ability of amino acids to mutate at some positions within the protein sequences without affecting protein activity is also used to reduce the alphabet, because those amino acids can be grouped together [36]. Tab. 1 shows the groups of the reduced alphabet applied in this method, which were proposed and used in the NGH algorithm.
2) Dividing the protein sequence into words
The N-Gram method has been used to divide the protein sequence into words (terms) using non-overlapping windows. Each term has length N, the same as all other terms. Tab. 2 shows an example of how the sequences are cut into words.

Tab. 1 The reduced groups of the amino acids [2]
| Group code | AA alphabet |
|---|---|
| 0 | C |
| 1 | S, A, T, N |
| 2 | N, H, S, D |
| 3 | P |
| 4 | E, D, Q, K |
| 5 | G |
| 6 | H, N, Y |
| 7 | Q, E, R, K |
| 8 | L, M, I, V |
| 9 | Y, H, F, W |
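The alphabet reduction can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation. Several amino acids appear in more than one group of Tab. 1, and the text does not say how such cases are resolved, so the sketch assumes that each amino acid takes the lowest group code that lists it. The names `REDUCED_GROUPS`, `build_letter_to_code` and `reduce_sequence` are hypothetical.

```python
# Groups copied from Tab. 1 (group code -> amino acids in that group).
REDUCED_GROUPS = {
    0: "C",
    1: "SATN",
    2: "NHSD",
    3: "P",
    4: "EDQK",
    5: "G",
    6: "HNY",
    7: "QERK",
    8: "LMIV",
    9: "YHFW",
}


def build_letter_to_code(groups):
    """Map each amino acid to one group code.

    ASSUMPTION: when a letter is listed in several groups, the lowest
    group code wins. Tab. 1 does not state the tie-breaking rule.
    """
    letter_to_code = {}
    for code in sorted(groups):
        for letter in groups[code]:
            letter_to_code.setdefault(letter, code)
    return letter_to_code


LETTER_TO_CODE = build_letter_to_code(REDUCED_GROUPS)


def reduce_sequence(sequence):
    """Rewrite a protein sequence as a list of group codes (integers 0-9)."""
    return [LETTER_TO_CODE[residue] for residue in sequence]


reduced = reduce_sequence("HMKPRSWDLMNT")
print(reduced)
```

Under the stated tie-breaking assumption, all 20 amino acids receive a code, and the reduced sequence consists only of the digits 0 to 9, which is what the later table lookup relies on.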
Tab. 2 Examples of N-Gram terms for S = HMKPRSWDLMNT
| N | Example N-Gram terms |
|---|---|
| 2 | HM, KP, RS, WD, LM, NT |
| 3 | HMK, PRS, WDL, MNT |
| 4 | HMKP, RSWD, LMNT |
| 5 | HMKPR, SWDLM, NT |
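The non-overlapping split shown in Tab. 2 can be sketched in a few lines of Python. The function name `split_into_words` is hypothetical; the behaviour for a final short word (as in the N = 5 row) follows the table, which keeps the leftover letters as a shorter last term.

```python
def split_into_words(sequence, n):
    """Cut a sequence into consecutive, non-overlapping words of length n.

    The last word is shorter when len(sequence) is not a multiple of n,
    matching the N = 5 row of Tab. 2.
    """
    if n < 1:
        raise ValueError("word length must be at least 1")
    return [sequence[i:i + n] for i in range(0, len(sequence), n)]


for n in (2, 3, 4, 5):
    print(n, split_into_words("HMKPRSWDLMNT", n))
```

A sequence of length M therefore yields about M/N words, which is the quantity that appears in the complexity analysis of the transformation phase.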
3) Constructing the hash table and fetching values
After reducing the amino acid alphabet and cutting the protein sequence into words, the protein sequence is ready to be transformed. The transformation stage converts the words into the integer values that represent them. To represent the words as integers, we create a hash table of size $10^w$, where w is the word length and 10 is the size of the reduced alphabet. This size allows the assignment of values for all possible combinations of letters. Therefore, each combination of letters (word) has a unique value (integer number) that represents it in the table. The filling process is a pre-process because it happens once, before the representing stage starts. The constructed hash table has N dimensions, where N is the word length w, and each dimension is connected to one letter of the word. Because the amino acid alphabet of the sequence has already been transformed from the character representation to the integer representation, it is easy to locate the addresses of the words. The example in Fig. 3 illustrates how the transformation is performed with the hash table.
To fetch the values from the hash table, we use two important attributes of the hash table. The first attribute is that each cell in the table has an address and a value pointed to by the address. The second attribute is the hash table structure: it is an N-dimensional matrix (N is equal to the word length). The length of the word and its nature (i.e., it is a combination of integers) are used to form the addresses in the hash table. By employing these two attributes of the hash table along with those of the sequence words, we can assign a unique value to each word without going through the overhead of the calculations. Fig. 3 shows the process of converting the word to an address in the hash table.
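The table construction and lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the text only requires that every cell hold a unique integer, so the sketch assumes the stored value is the number spelled by the word's digits in base 10. The names `build_table` and `fetch` are hypothetical, and the table is modelled as nested Python lists with one nesting level per letter.

```python
from itertools import product

ALPHABET_SIZE = 10  # size of the reduced alphabet (group codes 0-9)


def build_table(word_length, alphabet_size=ALPHABET_SIZE):
    """Pre-compute a word_length-dimensional table with one unique integer per word.

    ASSUMPTION: the unique value is the base-`alphabet_size` number formed
    by the word's digits; the source does not specify the stored values.
    """
    def build(depth, prefix_value):
        if depth == word_length:
            return prefix_value  # leaf cell: the value representing the word
        return [build(depth + 1, prefix_value * alphabet_size + digit)
                for digit in range(alphabet_size)]

    return build(0, 0)


def fetch(table, word):
    """Use each digit of the word as the index into one dimension of the table."""
    cell = table
    for digit in word:
        cell = cell[digit]
    return cell


table2 = build_table(2)          # 10**2 = 100 cells, filled once
print(fetch(table2, (2, 8)))     # the reduced word "28"

# Every possible 2-letter word maps to a distinct value, so no collisions occur.
values = {fetch(table2, w) for w in product(range(ALPHABET_SIZE), repeat=2)}
print(len(values))
```

With nested lists the lookup still follows one index per dimension; in a language with true multi-dimensional arrays the same access is a single address computation, which is closer to the word-at-once fetch the text describes.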
Fig. 3 Illustration of fetching values from the hash table
As shown in Fig. 3, each word indicates a certain address in the hash table, where each letter points to a certain place (address) in one dimension of the matrix. The values are fetched by moving the pointer of the N-dimensional matrix according to the values of the letters, where each letter of the word is connected to one dimension of the matrix. In the example in Fig. 3, the word 28 points to the cell in the 3rd column and 9th row of the matrix. Because each address in the matrix has a unique value, a unique number represents each word. This method provides a fast and robust algorithm: it avoids collisions and speeds up the transformation by reducing the overhead of the calculations.
B. Constructing the alignment and building the distance matrix
Alignment construction is divided into two phases; the first is filling in the similarity matrix, and the second is the backtracking phase. To fill in the similarity matrix, we first initialise the first row and the first column by multiplying the row or column number by the gap-penalty value. The gap-penalty value is the penalty for opening a gap in the sequence for alignment. Note that in a local alignment the minimum value of each cell in the similarity matrix is zero; therefore, if the value in a cell goes below zero, it is set to zero. The second step is to fill in all of the cells in the similarity matrix row by row, taking the maximum value returned by the three equations described in Formula 1:
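$$SIM(i,j) = \max\left\{\, 0,\; SIM(i-1,j) + d,\; SIM(i-1,j-1) + S(A_i, B_j),\; SIM(i,j-1) + d \,\right\} \qquad (1)$$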
where d is the gap penalty and S( ) is the similarity value between the two amino acids taken from the substitution matrix. Algorithm 1 shows how the similarity matrix is filled. The second phase is the backtracking phase, which traces back through the similarity matrix to find the best alignment between the two sequences. After constructing the alignment, the distance between the two sequences is calculated. The distance matrix is constructed by applying all phases of the sequence alignment procedure to each pair of sequences.

    /* SIM has m rows and n columns; row 0 and column 0 are already initialised. */
    for (row = 1; row < m; row++) {
        for (col = 1; col < n; col++) {
            up   = SIM[row-1][col]   + GAP_PENALTY;           /* gap */
            diag = SIM[row-1][col-1] + COMP(A[row], B[col]);  /* substitution score */
            left = SIM[row][col-1]   + GAP_PENALTY;           /* gap */
            SIM[row][col] = MAX(0, up, diag, left);           /* never below zero */
        }
    }

Algo. 1 Filling in the values for each cell in the similarity matrix [17]
3 RESULTS AND ANALYSIS
The experiments show improved accuracy and speed for the proposed algorithm (HT-NGH) compared with the previous algorithms, NGH and H-NGH. Accuracy is measured against the results of NGH, which is the predecessor of both methods: the HT-NGH method attains a better accuracy than the H-NGH method, and its results match those of NGH 100% of the time. In addition, the new method outperformed both NGH and H-NGH in terms of speed; the time was reduced on average by approximately 60% and 30%, respectively. This section discusses the results (time, accuracy and complexities).
A. Accuracy Evaluation
The accuracy of the proposed algorithm is measured by matching the calculated distances between the sequences with the results of the predecessor algorithm. The results of the new algorithm (HT-NGH) and the NGH algorithm matched identically, whereas the matching between H-NGH and NGH averaged 93%. Accuracy is the basic and most common goal that all methods strive to improve, and here our algorithm outperforms the previous algorithm, H-NGH, and produces the same outcome as the NGH algorithm. The NGH algorithm performs as well as the Smith-Waterman algorithm, which produces optimal results, unlike the heuristic methods [17]; this indicates that our new method performs the same as the Smith-Waterman algorithm. Fig. 4 illustrates the accuracy of the H-NGH algorithm and the proposed HT-NGH method by matching their results with those of the NGH algorithm.
Fig. 4 The percentages of matching results of HT-NGH and H-NGH compared with NGH
As shown in Fig. 4, the HT-NGH method outperforms the H-NGH algorithm in all cases and provides the same accuracy as the NGH method. This accuracy was achieved by avoiding collisions, which are a well-known problem associated with the use of hashing techniques. The HT-NGH algorithm provides a simple representation in which each word is represented by a unique number.
B. Time Evaluation
The proposed algorithm for data (sequence) transformation has been tested on different datasets and compared with the predecessor methods (NGH and H-NGH). The parameters used in the experiments are the number of sequences, the database size and the word length. The number of sequences in the databases ranges from 200k to 800k, with database sizes ranging from 97 MB to 368 MB. The word length has been restricted to 2 and 3 grams because sensitivity is sacrificed when the word length increases.
The proposed method has reduced the run time and the complexity of the two predecessor methods. The run time of the proposed method is much lower than that of the NGH and H-NGH algorithms: it is reduced by 60% (on average) compared with NGH and by 30% (on average) compared with H-NGH, as shown in Fig. 5 and Fig. 6. In addition, the time complexity of the data transformation phase has been reduced from O(MN) to O(MN/w) in the HT-NGH method.
As shown in Fig. 5 and Fig. 6, the proposed method has the lowest run time for all datasets and term sizes. This time reduction is due to the way the hash table technique assigns values to the words. The algorithm calculates the values for all possible combinations of words at an early stage and saves these values in the hash table. Therefore, the value of each word is calculated only once (i.e., the same word is not calculated repeatedly). Because word redundancy in protein sequences is very high, the proposed algorithm saves time by eliminating the overhead of calculating the same word many times.
C. Complexity
1) Time Complexity
Here we discuss and analyse the big-O notation for the transformation phase of the proposed algorithm (HT-NGH) and the two previous methods (NGH and H-NGH).
NGH and H-NGH
Because the transformation process has three nested for-loops in the NGH algorithm, the complexity is as follows. The first loop makes sure that the algorithm passes through all sequences; it takes N time, where N is the number of sequences.
The second loop, which makes sure that the algorithm converts each word within the sequence, takes M/w time, where M is the sequence length and w is the word size. The last loop, which makes sure that each letter of the word is converted, takes w time. By multiplying the three loops, we obtain the big-O notation for the transformation process of NGH and H-NGH as follows:
$$N \cdot \frac{M}{w} \cdot w = NM$$
where N is the number of sequences, M is the sequence length, w is the word size, and M/w is the number of words within a sequence.
Fig. 5 Run-time results with term length = 2 (axis label: Time per second)
Fig. 6 Run-time results with term length = 3 (axis label: Time per second)
HT-NGH
The proposed algorithm has reduced the transformation phase complexity by transforming each word as one block instead of going through each letter of the word. This technique reduces the number of loops to two: the first loop takes N time, where N is the number of sequences, and the second loop takes M/w time, where M is the sequence length and w is the word size. By multiplying the time consumed by each loop, the big-O notation of HT-NGH is as follows:
$$N \cdot \frac{M}{w} = \frac{NM}{w}$$
where N is the number of sequences, M is the sequence length, w is the word length, and M/w is the number of words within a sequence.
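The two loop structures analysed above can be compared by simply counting iterations. The sketch below is illustrative only: it counts loop steps and performs no real transformation, it assumes every sequence has the same length M and that M is a multiple of w, and the function names are hypothetical.

```python
def count_steps_per_letter(num_sequences, seq_length, word_size):
    """Three nested loops (sequences, words, letters), as described for NGH and H-NGH."""
    steps = 0
    for _ in range(num_sequences):                  # N iterations
        for _ in range(seq_length // word_size):    # M/w words per sequence
            for _ in range(word_size):              # w letters per word
                steps += 1
    return steps


def count_steps_per_word(num_sequences, seq_length, word_size):
    """Two nested loops (sequences, words), as described for HT-NGH."""
    steps = 0
    for _ in range(num_sequences):                  # N iterations
        for _ in range(seq_length // word_size):    # one table fetch per word
            steps += 1
    return steps


N, M, w = 4, 12, 3
print(count_steps_per_letter(N, M, w))  # equals N * M
print(count_steps_per_word(N, M, w))    # equals N * M / w
```

For these toy values the per-letter count is w times the per-word count, which is the factor separating O(NM) from O(NM/w).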
2) Space complexity
The space required for the transformation phase of the proposed method can be calculated as follows. The algorithm requires N spaces for the database sequences, where N is the number of sequences in the database. For each sequence in the database, the algorithm reserves L/w spaces, where L is the sequence length and w is the word length, because each sequence is divided into words of length w. An additional constant space, $10^w$, is required for the hash table. Because w ranges from 2 to 3, the additional space is trivial compared with the overall required space. The complete space complexity of the transformation phase of the proposed method is:
$$\frac{NL}{w} + 10^w$$
where N is the number of sequences in the database, L is the sequence length, and w is the word length.
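As a worked illustration of why the hash table term is negligible, consider hypothetical values N = 1000 sequences of length L = 300 (these numbers are not taken from the experiments), and assume the hash table holds $10^w$ cells, one per possible word over the ten-letter reduced alphabet:

```latex
\begin{align*}
w = 2:&\quad \frac{NL}{w} + 10^{w} = \frac{1000 \cdot 300}{2} + 10^{2} = 150\,000 + 100 = 150\,100 \\
w = 3:&\quad \frac{NL}{w} + 10^{w} = \frac{1000 \cdot 300}{3} + 10^{3} = 100\,000 + 1\,000 = 101\,000
\end{align*}
```

In both cases the table accounts for at most about 1% of the total, and its share shrinks further as the database grows, because the table size does not depend on N or L.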
The space complexity of the proposed algorithm is the same as that of the NGH and H-NGH algorithms.
4 CONCLUSION
In this paper, we have presented a dynamic-programming method for building the distance matrix. The distance matrix was built using the HT-NGH algorithm, an enhancement of the NGH algorithm based on the hash table technique. The HT-NGH method demonstrates a clear accuracy improvement over the H-NGH method: the results of the HT-NGH algorithm match the NGH algorithm results 100% of the time. Additionally, the HT-NGH algorithm outperforms its two predecessors, the NGH and H-NGH algorithms, in terms of speed: the run time of the transformation phase was reduced on average by 60% and 30%, respectively. Furthermore, the HT-NGH method has reduced the time complexity of the transformation phase of the NGH and H-NGH methods to O(MN/w), without sacrificing the space complexity, which is O(NL/w).
The hash table technique contributes the most to improving the time and accuracy of the NGH and H-NGH transformation phase. It improves the time by avoiding the overhead caused by redundancy, which appears when the hash value is recalculated for repeated words. The accuracy is improved by avoiding collisions because each word has a unique representation. Our proposed method can be utilised for other problems in computational biology and text editing, such as constructing multiple sequence alignments, building phylogenetic trees and string matching.
ACKNOWLEDGMENT
This research is supported by Universiti Sains Malaysia and has been funded by the Research University grant titled "A GPU based high throughput multiple sequence alignment algorithm for protein data" (1001/PKOMP/817065).
REFERENCES
[1] A. Chan, "An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST," Final Project Report, NUS, Singapore, 2004.
[2] N. A. Rashid, R. Abdullah, Abdullah, and Z. Ali, "Fast Dynamic Programming Based Sequence Alignment Algorithm," in The 2nd International Conference on Distributed Frameworks for Multimedia Applications, 2006, pp. 1-7.
[3] J. Thompson, D. Higgins, and T. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, vol. 22, pp. 4673-4680, 1994.
[4] B. Morgenstern, A. Dress, and T. Werner, "Multiple DNA and protein sequence alignment based on segment-to-segment comparison," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, pp. 12098-12103, 1996.
[5] B. Morgenstern, K. Frech, A. Dress, and T. Werner, "DIALIGN: finding local similarities by multiple sequence alignment," Bioinformatics, vol. 14, pp. 290-294, 1998.
[6] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment," Bioinformatics, vol. 15, pp. 211-218, 1999.
[7] C. Notredame, "T-Coffee: a novel method for fast and accurate multiple sequence alignment," Journal of Molecular Biology, vol. 302, pp. 205-217, 2000.
[8] K. Katoh, K. Misawa, K.-i. Kuma, and T. Miyata, "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, vol. 30, pp. 3059-3066, 2002.
[9] K. Katoh, G. Asimenos, and H. Toh, "Multiple alignment of DNA sequences with MAFFT," Bioinformatics for DNA Sequence Analysis, vol. 537, pp. 39-64, 2009.
[10] R. Edgar, "MUSCLE: a multiple sequence alignment method with reduced time and space complexity," BMC Bioinformatics, vol. 5, 2004.
[11] C. Do, M. Mahabhashyam, M. Brudno, and S. Batzoglou, "ProbCons: Probabilistic consistency-based multiple sequence alignment," Genome Research, vol. 15, pp. 330-340, 2005.
[12] U. Roshan and D. R. Livesay, "Probalign: multiple sequence alignment using partition function posterior probabilities," Bioinformatics, vol. 22, pp. 2715-2721, 2006.
[13] Y. Liu, B. Schmidt, and D. Maskell, "MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities," Bioinformatics, vol. 26, pp. 1958-1964, 2010.
[14] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.
[15] O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology, vol. 162, pp. 705-708, 1982.
[16] D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," Commun. ACM, vol. 18, pp. 341-343, 1975.
[17] N. A. B. AbdulRashid, "Enhancement of Hirschberg Algorithm Using N-Gram and Parallel Methods for Fast Protein Homologous Search," PhD thesis, School of Computer Sciences, Universiti Sains Malaysia, 2008.
[18] M. A. Abu-Hashem and N. A. A. Rashid, "Enhancing N-Gram-Hirschberg Algorithm by Using Hash Function," in AMS 2009 Third Asia International Conference on Modelling & Simulation, Bandung, Bali, Indonesia, IEEE, 2009, pp. 282-286.
[19] W. J. Wilbur and D. J. Lipman, "Rapid Similarity Searches in Nucleic Acid and Protein Databanks," Proc. Natl. Acad. Sci. USA, vol. 80, pp. 726-730, 1983.
[20] D. J. Lipman and W. R. Pearson, "Rapid and Sensitive Protein Similarity Searches," Science, vol. 227, pp. 1435-1441, 1985.
[21] W. R. Pearson and D. J. Lipman, "Improved Tools for Biological Sequence Comparison," Proc. Natl. Acad. Sci. USA, vol. 85, pp. 2444-2448, 1988.
[22] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic Local Alignment Search Tool," J. Mol. Biol., vol. 215, pp. 403-410, 1990.
[23] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[24] Z. Ning, A. Cox, and J. Mullikin, "SSAHA: A Fast Search Method for Large DNA Databases," Genome Research, vol. 11, pp. 1725-1729, 2001.
[25] R. Bellman, "On the Theory of Dynamic Programming," in Proceedings of the National Academy of Sciences, 1952, pp. 716-719.
[26] S. Dreyfus, "Richard Bellman on the Birth of Dynamic Programming," Operations Research, vol. 50, pp. 48-51, 2002.
[27] W. Pearson, "Comparison of Methods for Searching Protein Sequence Databases," Protein Science, vol. 4, no. 6, pp. 1145-1160, 1995.
[28] K. A. Berman and J. L. Paul, Algorithms: Sequential, Parallel, and Distributed. University of Cincinnati: Thomson, 2005.
[29] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, pp. 443-453, 1970.
[30] M. Dayhoff, Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, 1965.
[31] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, pp. 10915-10919, 1992.
[32] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, and I. Parsons, "FastLSA: A Fast, Linear-Space, Parallel and Sequential Algorithm for Sequence Alignment," 2003, p. 48.
[33] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, and I. Parsons, "FastLSA: A Fast, Linear-Space, Parallel and Sequential Algorithm for Sequence Alignment," Algorithmica, vol. 45, pp. 337-375, 2006.
[34] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, pp. 473-483, 2010.
[35] C. K. Mathew and Z. E. Van Holde, Biochemistry. Benjamin Cummings, San Francisco, CA, 1995.
[36] N. Sinha and R. Nussinov, "Point mutations and sequence variability in proteins: Redistributions of preexisting populations," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, pp. 3139-3144, 2001.
[37] G. Blackshields, F. Sievers, W. Shi, A. Wilm, and D. Higgins, "Sequence embedding for fast construction of guide trees for multiple sequence alignment," Algorithms for Molecular Biology, vol. 5, p. 21, 2010.
[38] B. Morgenstern, "DIALIGN: multiple DNA and protein sequence alignment at BiBiServ," Nucleic Acids Research, vol. 32, pp. W33-W36, 2004.
[39] A. Subramanian, J. Weyer-Menkhoff, M. Kaufmann, and B. Morgenstern, "DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment," BMC Bioinformatics, vol. 6, p. 66, 2005.
[40] A. R. Subramanian, M. Kaufmann, and B. Morgenstern, "DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment," Algorithms for Molecular Biology, vol. 3, p. 6, 2008.