310 Int. J Comp Sci. Emerging Tech
Vol-2 No 5 October, 2011
Biological Sequence Alignment for Bioinformatics Applications Using MATLAB
Sonali Vijan1 and Rajesh Mehra2
1 Student, Electronics Engineering, NITTTR, Chandigarh
2 Faculty Members, Electronics Engineering, #3290, Sector 35 D, Chandigarh
Email:
[email protected]
Abstract: Biological sequence alignment is a widely used operation in the field of bioinformatics and computational biology, as it is used to determine the similarity between biological sequences. The two basic alignment algorithms, i.e. Smith Waterman for local alignment and Needleman Wunsch for global alignment, have been used in this paper. The algorithms have been developed and simulated using MATLAB for genome analysis and sequence alignment. The local and global alignments have been presented and the results are shown in the form of dot plots and local and global scores for the sequences. The proposed work is a useful tool that can aid in the exploration, interpretation and visualization of data in the field of molecular biology.
Keywords: Bioinformatics, Biological Sequence Alignment, Smith-Waterman, Needleman-Wunsch, MATLAB, local alignment, global alignment
1. Introduction
Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science. It is a union of biology and informatics, as it involves the technology that uses computers for storage, retrieval, manipulation and distribution of information related to biological macromolecules such as DNA, RNA and proteins [1]. The emphasis here is on the use of computers because most of the tasks in genomic data analysis are highly repetitive or mathematically complex. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures. Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, genome-wide association studies and modeling of association. Bioinformatics has developed out of the need to understand the code of life, DNA. Massive DNA sequencing projects have evolved and aided in the growth of the science of bioinformatics.
Biological sequence alignment is a widely used operation in the field of bioinformatics and computational biology. It aims to find out whether two or more biological sequences (e.g., DNA, RNA, or protein sequences) are related or not. This has many important real-world applications. For instance, if some information about one of the sequences is already known (e.g., the sequence represents a cancerous gene), then this information could be transferred to the other unknown sequences, which could be vital in early disease diagnosis and drug engineering. Other applications include the study of evolutionary development and the history of species and their groupings.
As individual laboratories exchange more annotated biological data through comprehensive databases such as NCBI’s retrieval system, Entrez (which integrates GenBank), researchers have recently become interested in detecting remote homologies by querying a sequence of interest against a subfamily of a distant lineage [2]. In order to unveil the structural or functional importance of an unknown sequence, one conducts, as an initial procedure, a sequence alignment in the framework of comparative computational biology. A sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify the regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. The resulting alignment yields an edit transcript of mismatches and indels, i.e., insertions and deletions, where mismatches can be interpreted as point mutations and gaps as indels. As a result, we can infer how sequences with the same origin would diverge from one another.
International Journal of Computer Science & Emerging Technologies IJCSET, E-ISSN: 2044 - 6004 Copyright © ExcelingTech, Pub, UK (http://excelingtech.co.uk/)
2. DNA Alignment
Sequence comparison lies at the heart of bioinformatics analysis. As new biological sequences are being generated at exponential rates, sequence comparison is becoming increasingly important for drawing functional and evolutionary inferences about a new protein from proteins already existing in a database. Sequence alignment is the process by which sequences are compared by searching for common character patterns and establishing residue-residue correspondence among related sequences [3], [4]. The rapid evolution of sequencing techniques, combined with the intense growth in the number of large-scale genome projects, is producing a huge amount of biological sequence data. Nevertheless, determining the genome sequence is only the first step toward deciphering the genetic message encoded in those sequences. In genome projects, newly determined sequences are first compared with those placed in genomic databases in order to discover similarities. This is done because relevant sequence similarity is evidence of a common evolutionary origin and a homology relationship. Sequence comparison is, therefore, a very basic but important step in genome projects. As a result of this step, one or more sequence alignments can be produced. A sequence alignment has a similarity score associated with it, which is obtained by placing one sequence above the other so that the correspondence between the characters is made clear.
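The idea of a similarity score obtained by placing one sequence above the other can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `alignment_score` and the match, mismatch and gap values are assumptions chosen only for illustration, and `-` is assumed to mark a gap.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned strings of equal length, column by column.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # a gap in either row is an insertion/deletion
        elif a == b:
            score += match      # identical characters in the column
        else:
            score += mismatch   # substitution (point mutation)
    return score


# One sequence written above the other, with a single gap in the lower row.
print(alignment_score("GATTACA", "GA-TGCA"))
```

The alignment algorithms discussed in this paper search over all such arrangements for the one with the highest score, rather than scoring a single given arrangement.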
2.1 Methods of Sequence Alignment
2.1.1 Global alignment: The two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from the beginning to the end of both sequences to find the best possible alignment.
2.1.2 Local alignment: This method of alignment does not assume that the two sequences are similar over their entire length [4]. It finds only the local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the remaining regions. The two sequences to be aligned can be of different lengths.
3. Alignment Algorithms
DNA sequences are strings of letters from a four-letter alphabet of nucleotides (A, C, G, T). The length of a sequence is variable, and sometimes we require the alignment of lengthy and highly variable or extremely numerous sequences. Hence, constructing algorithms that produce high-quality sequence alignments using four letters becomes a real challenge [5]. In general, computational approaches to sequence alignment are classified as either global or local alignments. By global alignment, we mean aligning the entire length of each query sequence against a reference sequence. Local alignment, on the other hand, identifies isolated regions of high similarity within the sequences, which makes the technique a better choice in some situations but a more complex one in general.
3.1 Local Alignment: The Smith Waterman Algorithm
The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. In other words, it computes the optimal local alignment, i.e. the region that the two sequences share most closely. The procedure consists of two steps:
Step 1: Fill in the dynamic programming matrix.
Step 2: Find the maximal value (score) and trace back the path that leads to the maximal score to find the optimal local alignment.
The basis of a Smith-Waterman search is the comparison of the two DNA sequences. It uses individual pair-wise comparisons between characters such as: For
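$i = 0$ or $j = 0$, $D(i,j) = 0$; otherwise, for $1 \le i \le M$ and $1 \le j \le N$,
$$D(i,j) = \max \begin{cases} 0 \\ D(i-1,j-1) + Sbt(s_i, t_j) \\ D(i-1,j) - d \\ D(i,j-1) - d \end{cases} \tag{1.1}$$
Here $s_i$ denotes the $i$-th character of the search sequence and $t_j$ the $j$-th character of the target sequence.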
where d = gap penalty, Sbt = substitution matrix, i = matrix cell row (search sequence), j = matrix cell column (target sequence), M = maximum length of the search sequence, N = maximum length of the target sequence, and D(i,j) = dynamic programming matrix cell.
3.2 Global Alignment: The Needleman Wunsch Algorithm
The Needleman-Wunsch algorithm uses dynamic programming to obtain a global alignment between two sequences. Global alignment, as the name suggests, takes into account all the elements of the two sequences while aligning them. It can also be called an “end to end” alignment. In the Needleman-Wunsch algorithm, a scoring matrix of size m*n (m being the length of the longer sequence and n being that of the shorter sequence) is first formed.
The optimal score at each matrix position is calculated by adding the current match score to previously scored positions and subtracting gap penalties. Each matrix position may have a positive, negative or zero value. For two sequences
$$S_1 = a_1 a_2 \ldots a_m \tag{1.2}$$
$$S_2 = b_1 b_2 \ldots b_n \tag{1.3}$$
where $T_{ij} = T(a_1 a_2 \ldots a_i,\ b_1 b_2 \ldots b_j)$, the element at the $(i,j)$th position of the matrix is given by
$$T_{ij} = \max \left\{ T_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1} \left( T_{i-x,j} - p_x \right),\ \max_{y \ge 1} \left( T_{i,j-y} - p_y \right) \right\} \tag{1.4}$$
where $T_{ij}$ is the score at position $i$ in the sequence $S_1$ and $j$ in the sequence $S_2$, $s(a_i, b_j)$ is the score for aligning the characters at positions $i$ and $j$, $p_x$ is the penalty for a gap of length $x$ in the sequence $S_1$, and $p_y$ is the penalty for a gap of length $y$ in the sequence $S_2$.
4. Proposed DNA Alignment
To apply global and local alignment and to get a score for each of them, the user can enter the sequences in two ways. The first way is to give the accession numbers of the sequences, which retrieves the sequences in their ORFs (Open Reading Frames). The second way is to retrieve the sequences from the web (public database) and bring the sequence information into the MATLAB environment [7]. After that we can get the global alignment (NW) and the local alignment (SW), each with a score that determines the degree of similarity. Dot plots are one of the easiest ways to find similarity between two sequences. Many dots in the dot plot line up to form diagonal lines, indicating good alignment between the two sequences. In the proposed work, the results are also presented in the form of dot plots.
5. Result Simulation and Discussion
5.1 To retrieve sequences from a database:
The sequences that have to be analyzed, aligned and read are retrieved from a public database into the MATLAB environment.
Figure 1 Open Reading Frame of Human DNA sequence
Figure 1 shows the open reading frame of the human sequence. Once the ORF for a gene or mRNA is known, the user can translate a nucleotide sequence to its corresponding amino acid sequence.
5.2 Sequence comparison by using dot plot:
The most basic sequence alignment method is the dot matrix method, also known as the dot plot method. It is a graphical way of comparing two sequences in a two-dimensional matrix [10]. In a dot matrix, the two sequences to be compared are written along the horizontal and vertical axes of the matrix. A MATLAB function has been used for this comparison. When the two sequences have substantial regions of similarity, many dots line up to form diagonal lines, which reveal the sequence alignment.
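The dot matrix idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATLAB function used in the paper: the names `dot_matrix` and `render`, and the `window` and `min_matches` filtering parameters, are hypothetical. A cell is marked when at least `min_matches` of the `window` characters starting at that position agree.

```python
def dot_matrix(horizontal, vertical, window=1, min_matches=1):
    """Return a grid of booleans: rows follow `vertical`, columns follow `horizontal`.

    With window=1 a dot marks every pair of identical characters; a larger
    window (an assumed noise filter) requires agreement over a short run.
    """
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            matches = sum(
                vertical[r + k] == horizontal[c + k] for k in range(window)
            )
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Similar sequences produce a visible main diagonal of '*' characters.
print(render(dot_matrix("ACGTAC", "ACGGAC")))
```

For long, real sequences a window larger than one is usually needed, because single-character matches between four-letter DNA strings occur by chance in roughly a quarter of the cells.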
Figure 2 Dot plot of Human and Chicken
Figure 3 Dot plot of Human and Fly
Dot plots of the human and chicken DNA sequences and of the human and fly DNA sequences are shown in fig. 2 and fig. 3 respectively. The dot plots show that the human and chicken DNA sequences align better than the human and fly DNA sequences.
5.3 Global alignment of sequences by using Needleman Wunsch Algorithm
Figure 4 Global Alignment (NW) of Human and Chicken by using blosum60 scoring matrix
Figure 5 Global Alignment (NW) of Human and Fly by using blosum60 scoring matrix
Global alignment of the human and chicken DNA sequences is shown in fig. 4 and that of the human and fly DNA sequences in fig. 5.
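For readers who want to experiment with global alignment outside MATLAB, the Needleman-Wunsch recurrence of Section 3.2 can be sketched in Python. Several simplifying assumptions apply and should be kept in mind: the gap penalty is taken to be linear in the gap length (a gap of length x costs x times `gap`), which lets each cell look only at its three neighbours; a simple match/mismatch score stands in for the blosum60 matrix used in the figures; and the function name and default values are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumes a linear gap penalty and a match/mismatch score, both of which
    are illustrative simplifications of the general recurrence.
    """
    m, n = len(a), len(b)
    # T[i][j] holds the best score for aligning a[:i] with b[:j].
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = -gap * i          # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        T[0][j] = -gap * j          # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          T[i - 1][j] - gap,     # gap in b
                          T[i][j - 1] - gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return T[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_a)
print(row_b)
```

Because the alignment is end to end, every character of both inputs appears in the output rows, and the score in the bottom-right cell is the global score. Scores from this sketch are not comparable with the values reported in this paper, which come from a different scoring scheme.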
5.4 Local alignment of sequences by using Smith Waterman Algorithm
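The two-step procedure of Section 3.1 (fill the dynamic programming matrix, then trace back from the maximal score) can be sketched in Python as follows. This is an illustration rather than the implementation behind the figures: a match/mismatch score is assumed in place of the blosum60 substitution matrix, the gap penalty `d` is assumed to be linear, and the function name and default values are hypothetical.

```python
def smith_waterman(s, t, match=2, mismatch=-1, d=2):
    """Local alignment of s and t; returns (score, aligned_s, aligned_t).

    The scoring values are illustrative assumptions, not the paper's settings.
    """
    M, N = len(s), len(t)
    # Step 1: fill the matrix; row 0 and column 0 stay at zero.
    D = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sbt = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(0,                       # start a new local alignment
                          D[i - 1][j - 1] + sbt,   # extend along the diagonal
                          D[i - 1][j] - d,         # gap in t
                          D[i][j - 1] - d)         # gap in s
            if D[i][j] > best:
                best, best_pos = D[i][j], (i, j)
    # Step 2: trace back from the maximal cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and D[i][j] > 0:
        sbt = match if s[i - 1] == t[j - 1] else mismatch
        if D[i][j] == D[i - 1][j - 1] + sbt:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif D[i][j] == D[i - 1][j] - d:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Only the shared region is reported; the flanking characters are ignored.
print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Because any global alignment is also a candidate local alignment, the optimal local score under a given scoring scheme cannot be lower than the optimal global score under the same scheme, which is consistent with the ordering of the local and global scores reported in the Conclusion.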
Figure 6 Local Alignment (SW) of Human and Fly by using blosum60 substitution matrices
Figure 7 Local Alignment (SW) of Human and Chicken by using blosum60 substitution matrices
Local alignment of the human and fly DNA sequences is shown in fig. 6 and that of the human and chicken DNA sequences in fig. 7.
6. Conclusion
In this paper, DNA sequence alignment algorithms have been developed and simulated using MATLAB. Sequence alignment results have been presented in the form of dot plots, a local alignment score obtained with the Smith Waterman algorithm and a global alignment score obtained with the Needleman Wunsch algorithm. The alignment score for the human and chicken DNA sequences is 1454.33 for global alignment and 1644.5 for local alignment. The alignment score for the human and fly DNA sequences is 49.6667 for global alignment and 176 for local alignment. The proposed work is a useful tool that can aid in the exploration, interpretation and visualization of data in the field of molecular biology.
7. References
[1] Hassan Mathkour, Muneer Ahmad, “A Comprehensive Survey on Genome Sequence Analysis”, IEEE International Conference on Bioinformatics and Biomedical Technology, pp. 14-18, 2010
[2] Changjin Hong, Ahmed H. Tewfic, “Heuristic Reusable Dynamic Programming: Efficient Updates of Local Sequence Alignment”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 6, No. 4, pp. 570-562, 2009
[3] Khaled Benkrid, Ying Liu, AbdSamad Benkrid, “A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment”, IEEE Transactions on Very Large Scale Integration Systems, Vol. 17, No. 4, pp. 561-570, 2009
[4] Azzedine Boukerche, Jan M. Correa, Alba Cristina M.A de Melo, Ricardo P. Jacobi, “A Hardware Accelerator for the Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space”, IEEE Transactions on Computers, Vol. 59, No. 6, pp. 808-821, 2010
[5] Van-Hoa Nguyen, Alexandre Cornu, Dominique Lavenier, “Implementing Protein Seed-Based Comparison Algorithm on the SGI RASC-100 Platform”, IEEE Symposium on Parallel and Distributed Processing, pp. 1-7, 2009
[6] Sanghamitra Bandyopadhayay, Ramakrishna Mitra, “A Parallel Pairwise Local Sequence Alignment Algorithm”, IEEE Transactions on Nano Bioscience, Vol. 8, No. 2, pp. 139-146, 2009
[7] Nahar, N.L. Hamel, M.S. Popstova and J.P. Gogarten, “GPX: A Tool for the Exploration and visualization of Genome Evolution”, 7th IEEE International Conference on Bioinformatics and Bioengineering, pp. 1338-1342, 2007
[8] Agrawal, A. and S.K. Khaitan, “A new heuristic for multiple sequence alignment”, IEEE International Conference on Electro/Information Technology, pp. 215-217, 2008
[9] Liu Weiguo, B. Schmidt, G. Voss and W. Muller-Wittig, “Streaming Algorithms for Biological Sequence Alignment on GPUs”, IEEE Transactions on Parallel and Distributed Systems, Vol. 18, Issue 9, pp. 1270-1281, 2007
[10] Mai S. Mabrouk, Marva Hamdy, Marva Mamdouh, Marva Aboelfotoh, Yesser M. Kadah, “BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab”, Cairo International Biomedical Engineering Conference, pp. 1-9, 2006
Author Biographies
Sonali Vijan: Sonali Vijan is currently a student at the National Institute of Technical Teachers’ Training and Research, Chandigarh, India. She is pursuing her M.E. in Electronics Engineering at NITTTR. She completed her Bachelor of Technology in Electronics Engineering at Punjab Technical University, Jalandhar.
Rajesh Mehra: Mr. Rajesh Mehra is currently Assistant Professor at the National Institute of Technical Teachers’ Training & Research, Chandigarh, India. He is pursuing his PhD at Panjab University, Chandigarh, India. He completed his M.E. at NITTTR, Chandigarh, India and his B.Tech. at NIT, Jalandhar, India. Mr. Mehra has 15 years of academic experience. He has authored more than 30 research papers in reputed international journals and 45 research papers in national and international conferences. Mr. Mehra’s areas of interest are VLSI Design, Embedded System Design, Advanced Digital Signal Processing, Wireless & Mobile Communication and Digital System Design. Mr. Mehra is a life member of ISTE.