Arab J Sci Eng DOI 10.1007/s13369-013-0760-5
RESEARCH ARTICLE - COMPUTER ENGINEERING AND COMPUTER SCIENCE
DNA Sequence Comparisons Using Codons
Khalid Thabit · Sumaia M. Al-Ghuribi · Fatima N. Al-Aswadi
Received: 17 March 2012 / Accepted: 15 September 2012 © King Fahd University of Petroleum and Minerals 2013
Abstract DNA sequence comparison is one of the significant steps in the analysis of phylogenetic relationships between species. In this paper, a fast method for comparing DNA sequences is introduced, based on the frequencies of the codons they contain. The codon frequencies are calculated in a way that avoids, as far as possible, different DNA sequences being mapped to the same frequency sequence. The method is tested on 15 species with both short and long DNA sequences. Software that uses and tests this method was implemented with wxPython. The results were compared manually with the data in NCBI, and the method was found to work well. Finally, to make the method available to specialists, the software was deployed on a cloud computing platform, Google App Engine. Keywords DNA classification model · DNA codons · Species · wxPython · Cloud computing platform · Google App Engine
K. Thabit · S. M. Al-Ghuribi (B) · F. N. Al-Aswadi Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
1 Introduction Bioinformatics is an interdisciplinary field at the intersection of life science, computer science, information science and mathematics. DNA sequence data is one of the main objects of study in bioinformatics. By analyzing DNA sequences, scientists can not only explain existing sequences but also better study new sequences and their functions, decipher the roles that sequences play in organisms, and thereby understand the nature of life [1]. The number of DNA sequences in DNA databases is increasing rapidly. One of the challenges for bioscientists is to analyze mathematically the large volume of genomic DNA sequence data [2]. As computer programmers, we propose in this paper a method for comparing DNA sequences. DNA sequence comparison remains one of the critical steps in the analysis of phylogenetic relationships between species [2]. Graphical representation methods of DNA sequences have been widely used in DNA sequence
comparison. In these methods, some mathematical invariants are applied for mutation analysis [3–5], similarity analysis [3,6–9], phylogenetic analysis [2,3,6] and sequence alignment [5,10]. The graphical representation of a DNA sequence can be based on 2D [4,11], 3D [3,10–12], 4D [8], or 5D [9] space. The most common way to describe the graphs numerically is to transform the plots into matrices [7]. These matrices can then be used for DNA sequence comparison. Although graphical representation methods have been widely used, they have some disadvantages: (1) loss of information due to overlapping and crossing of the curve representing the DNA with itself [2,8]; (2) some mathematical models that are based on complex equations and numerical computations may ignore biological information hidden between neighboring nucleotides [9]; (3) differences in the structure of DNA sequences cannot be compared or observed directly [8]. More recently, binary representations have been used for DNA sequence comparison. In [13], the authors used binary labels for the four nucleic acid bases and the “worm” curve as a template on which the binary codes were placed, to solve the compactness problem of 2D graphical representation of DNA. In [5], the authors used binary sequences reduced from primary DNA sequences for phylogenetic inference. They found that the binary sequences reduced by the purine/pyrimidine classification give a reliable phylogeny (almost the same as that given by the primary sequences). In [10], the authors proposed an extended binary coding method for RNA secondary structure alignment that converts structure alignment into sequence alignment. The RNA secondary structure is reduced to three binary digit sequences. They also used the exclusive-OR operation to search for the optimal alignment between RNA secondary structures.
They reported that their method obtains the structure alignment result quickly. Binary sequences have a much higher compression ratio than primary DNA sequences. In other words, the binary coding method reduces storage space and execution time; it also facilitates the use of some signal processing techniques in biological data analysis [5]. Given the disadvantages and complexity of graphical representation and the advantages of binary coding, we decided to develop a new method for DNA sequence comparison. Unlike the previous methods, which are known for their complex calculations, our method is fast and simple, and does not require sequence alignment or graphical representation of sequences. At the same time, it can be used to analyze both short and long DNA sequences. The main idea of this method depends on codons. In other words, we take each DNA sequence and divide it into groups of three nucleotides, as Fig. 1 shows. We took the data from the US National Center for Biotechnology Information (NCBI).
Fig. 1 DNA sequence codons
We find the frequency of each codon in a new way. Then we use logical operations to check whether an inserted sequence belongs to a selected species or not. We applied our method to 15 species; a detailed explanation is given later. The rest of this paper is organized as follows: Sect. 2 presents the idea of DNA sequence codons and describes the species included in our study. Section 3 shows the ways of calculating codon frequencies that we used to avoid similarities between sequences. Section 4 demonstrates the design and implementation of our software, Sect. 5 presents our software as a service in the cloud, and Sect. 6 concludes this work.
2 DNA Sequence Codons DNA is one of the nucleic acids. It consists of four different subunits, or nucleotides (Adenine (A), Guanine (G), Thymine (T) and Cytosine (C)), which constitute a four-letter “language” coding for every other molecule found in living organisms. Nucleotides are linked together to form chains that are, in turn, united to form chromosomes. Each segment of a chromosome that codes for a specific molecule is called a gene. To determine the function of specific genes, scientists have learned to read or “decode” the sequence of nucleotides composing DNA in a process referred to as DNA sequencing [14]. How does DNA encode the information for a protein? There are only four DNA bases, but there are 20 amino acids that can be used for proteins. So, groups of three nucleotides form a word (codon) that specifies which of the 20 amino acids goes into the protein. A 3-base codon yields 64 possible patterns (4 × 4 × 4), which is more than enough to specify
Table 1 The 64 possible DNA sequence codons
 1 AAA    9 AGA   17 TAA   25 TGA   33 GAA   41 GGA   49 CAA   57 CGA
 2 AAC   10 AGC   18 TAC   26 TGC   34 GAC   42 GGC   50 CAC   58 CGC
 3 AAG   11 AGG   19 TAG   27 TGG   35 GAG   43 GGG   51 CAG   59 CGG
 4 AAT   12 AGT   20 TAT   28 TGT   36 GAT   44 GGT   52 CAT   60 CGT
 5 ATA   13 ACA   21 TTA   29 TCA   37 GTA   45 GCA   53 CTA   61 CCA
 6 ATC   14 ACC   22 TTC   30 TCC   38 GTC   46 GCC   54 CTC   62 CCC
 7 ATG   15 ACG   23 TTG   31 TCG   39 GTG   47 GCG   55 CTG   63 CCG
 8 ATT   16 ACT   24 TTT   32 TCT   40 GTT   48 GCT   56 CTT   64 CCT
Table 2 The 15 species and Number of IDs for each one
Specie name Caraway Thyme
Number of IDs 16 33
Specie name
Number of IDs
Specie name
Rue Savory
334, 169 63
Rosemary Acanthus
940 186
16
Myrtle
413
265, 693
Pepper
11, 865
Cardamom
134
Tarragon
Turmeric
335
Wormwood
20 amino acids [14]. Table 1 shows the 64 possible codons of a DNA sequence. So, in our approach a DNA sequence is divided into codons, sets of three bases that specify an amino acid or signal the end of the protein. For example, suppose that we have the following DNA sequence: ['AGATATACGAATACCCCATCTCATCCATGTGGAAATCTTGGTTCAAACTCTTCGCTATTG']. After dividing it into triplet codons, it becomes: ['AGA', 'TAT', 'ACG', 'AAT', 'ACC', 'CCA', 'TCT', 'CAT', 'CCA', 'TGT', 'GGA', 'AAT', 'CTT', 'GGT', 'TCA', 'AAC', 'TCT', 'TCG', 'CTA', 'TTG']. Experimental DNA sequences are taken from the US National Center for Biotechnology Information (NCBI). We study 15 species; each species contains a large number of DNA sequences. Table 2 shows the chosen species and the number of sequences in each species.
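The codon-splitting step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name split_into_codons is hypothetical, and dropping trailing bases that do not fill a complete triplet is an assumption, since the text does not say how leftover bases are handled.

```python
def split_into_codons(sequence):
    """Split a DNA string into consecutive, non-overlapping triplets.

    Assumption: one or two trailing bases that do not form a complete
    codon are dropped; the paper does not specify this case.
    """
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3  # largest multiple of 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


# The 60-base example sequence from the text.
example = "AGATATACGAATACCCCATCTCATCCATGTGGAAATCTTGGTTCAAACTCTTCGCTATTG"
codons = split_into_codons(example)
print(codons)
```

For the example sequence this yields the same 20 codons that are listed in the text.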
Number of IDs
3 Ways to Calculate the Frequencies of Codons
Recently, there has been a rapid increase in the number of nucleic acid and protein sequences in the international bioinformatics databases, which increases the need for approaches and software that help bioinformatics researchers in the comparison process. In this paper, we introduce an approach for comparing DNA sequences using the concept of codon frequencies. In our approach, we define a codon dictionary that contains keys and values. The keys of the codon dictionary are the codons given above in Table 1, and the value of each key is the frequency of that codon in the sequence (the number of occurrences of the codon in the sequence). At the beginning, the codon dictionary contains the 64 codons as keys, and the value of each key is zero. codonDictionary = {'GCT':0, 'GCC':0, 'GCA':0, 'GCG':0, 'TGT':0, 'TGC':0, 'GAA':0, 'GAG':0, 'GAT':0, 'GAC':0, 'GGT':0, 'GGC':0, 'GGA':0, 'GGG':0, 'TTT':0, 'TTC':0, 'ATT':0, 'ATC':0, 'ATA':0, 'TAA':0, 'TGA':0, 'TAG':0, 'AAA':0, 'AAG':0, 'ATG':0, 'TTA':0, 'TTG':0, 'CTT':0, 'CTC':0, 'CTA':0, 'CTG':0, 'AAT':0, 'AAC':0, 'CAA':0, 'CAG':0, 'CCT':0, 'CCC':0, 'CCA':0, 'CCG':0, 'TCT':0, 'TCC':0, 'TCA':0, 'TCG':0, 'AGT':0, 'AGC':0, 'CGT':0, 'CGC':0, 'CGA':0, 'CGG':0, 'AGA':0, 'AGG':0, 'ACT':0, 'ACC':0, 'ACA':0, 'ACG':0, 'GTT':0, 'GTC':0, 'GTA':0, 'GTG':0, 'TAT':0, 'TAC':0, 'CAT':0, 'CAC':0, 'TGG':0}. We define three levels for calculating the value of each codon (its frequency), in order to find the representation with the least sequence similarity. Sequence similarity is measured by counting how many different DNA sequences in a species have the same frequency sequence in a specific representation (1, 4, 5, 6 or 8 bit).
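As an illustration of the codon dictionary, the following minimal sketch builds the 64 codons and counts their occurrences in a sequence. The names CODONS and codon_frequencies are hypothetical. Ordering the codons as in Table 1 is an assumption (the dictionary literal in the text lists them in a different order), and skipping triplets that contain characters other than A, C, G and T is likewise an assumption.

```python
from itertools import product

# Assumption: codons follow the numbering of Table 1, i.e. first and
# second base in the order A, T, G, C and third base in the order A, C, G, T.
CODONS = ["".join(bases) for bases in product("ATGC", "ATGC", "ACGT")]


def codon_frequencies(sequence):
    """Return a dict mapping each of the 64 codons to its count."""
    frequencies = dict.fromkeys(CODONS, 0)  # every codon starts at zero
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    for i in range(0, usable, 3):
        codon = sequence[i:i + 3]
        if codon in frequencies:  # assumption: ignore ambiguous triplets
            frequencies[codon] += 1
    return frequencies


freq = codon_frequencies(
    "AGATATACGAATACCCCATCTCATCCATGTGGAAATCTTGGTTCAAACTCTTCGCTATTG"
)
print({codon: n for codon, n in freq.items() if n > 0})
```

In the example sequence of Sect. 2, the codons CCA, AAT and TCT each occur twice and all other codons that are present occur once.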
Level 1: The idea behind level 1 is to look up each codon of the dictionary in the DNA sequence; if it is found, we put 1 as the value of that codon in the dictionary, regardless of how many times it is repeated, and we put 0 when the codon does not appear in the DNA sequence. The result of this level is, for each DNA sequence, a sequence of 64 bits, each either 0 (not found) or 1 (found). The following sequence is the level 1 frequency sequence for one of the caraway sequences, with ID = ‘295656260’: “1011101111111111111101110011111110111010100111111110101111110110”
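A minimal sketch of the level 1 encoding follows. The function name level1 is hypothetical and the codon order is assumed to be that of Table 1; the order actually used to produce the caraway example is not stated in the text, so this sketch is not expected to reproduce that exact bit string.

```python
from itertools import product

# Assumption: bit positions follow the codon numbering of Table 1.
CODONS = ["".join(bases) for bases in product("ATGC", "ATGC", "ACGT")]


def level1(sequence):
    """Return a 64-character string with '1' for every codon that occurs
    at least once in the sequence and '0' for every codon that does not."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    present = {sequence[i:i + 3] for i in range(0, usable, 3)}
    return "".join("1" if codon in present else "0" for codon in CODONS)


bits = level1("AGATATACGAATACCCCATCTCATCCATGTGGAAATCTTGGTTCAAACTCTTCGCTATTG")
print(bits)
```

Because repeated codons contribute only a single 1, two sequences with the same set of codons but different counts receive the same level 1 string, which is the source of the similarities discussed next.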
Fig. 2 Similarity between species in each level (x-axis: species names; y-axis: number of similar sequences)
After applying level 1 to all 15 species, we observed that the resulting sequences are not satisfactory, because there are many similarities among them (many different DNA sequences become the same binary (frequency) sequence after level 1 is applied to them). For example, in the saffron species there are 71 similar sequences; in the savory species there are 50 similar sequences; and in the pepper species there are 56 similar sequences. This tells us that within one sequence the same codon is repeated more than once. So we move to level 2, where we take into account how many times each codon appears. Level 2: In level 2, we look up each codon of the codon dictionary in the DNA sequence; if it is found, we put a hexadecimal digit that gives the number of times it appears, and if it appears more than 15 times we put 15 (f). We put 0 when the codon in the dictionary does not appear in the DNA sequence. The result of this level is, for each DNA sequence, a sequence of 64 hexadecimal digits, each either 0 (not found) or 1–9 or a–f (found), depending on how many times the codon appears. The following sequence is the level 2 frequency sequence for the caraway sequence with ID = ‘295656260’: “a014b02226125118255102230053523460a11060100211312c60403464310350”. Because level 1 is represented in binary and level 2 in hexadecimal, we decided to represent all levels in the same way, using the binary representation. In addition, OR and AND operations are used for comparison in our method, and these operations are easier to apply to binary than to hexadecimal. We now move to level 3 and then test the similarity between sequences. Level 3: In the third level each sequence contains 256 bits. First we find the value of each codon as we did in level 2; then we represent each value with 4 bits.
The following sequence is the level 3 frequency sequence for the caraway sequence with ID = ‘295656260’: “1010000000010100101100000010001000100110000100100101000100011000001001010101000100000010001000110000000001010011010100100011010001100000101000010001000001100000000100000000001000010001001100010010110001100000010000000011010001100100001100010000001101010000”. After applying level 3 to all 15 species and testing the similarities between sequences, we find a much better result than with level 1. Four species (rosemary, rue, myrtle and savory) have zero sequence similarity. Six species (caraway, turmeric, sage, tarragon, wormwood and acanthus) have 1, 2 or 3 similar sequences, which is a very small value that can be ignored compared with the huge number of sequences in the species. The other five species also have small values compared with level 1; see Fig. 2. We tried to enhance the sequence representation further by using 5 bits, restricting the number of times a codon is counted to 32. Unfortunately, the results were not as expected. The results for 12 species are almost the same, or differ in 1 or 2 sequences. However, the results for the cardamom and pepper species are completely different: the sequence similarity with 5 bits becomes large compared with 4 bits. We also tried 6 and 8 bits. The two give the same sequence similarity in 14 species and differ only in the pepper species, where the 8 bit representation gives less similarity. Finally, to choose the best representation with the least sequence similarity, we summed the number of similar sequences over the 15 species for each of the previous representations and found the following: in the 1 bit representation 1,033 sequences appeared to be similar, and in the 4 bit representation only 88 sequences appeared to be similar. The number of similar sequences in the 5 bit representation is
254 sequences, with 184 sequences in the 6 bit representation and 140 sequences in the 8 bit representation. Based on this result, we conclude that 4 bits give the best output with the least similarity, and that there is no further benefit in representing the sequence with 1, 5, 6 or 8 bits. Therefore, we chose level 3 with 4 bits to compute the frequency sequence. Figure 2 shows the similarity for each of the previous representations.
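The relation between the three levels, and the way similar sequences might be counted, can be sketched as follows. All function names are hypothetical. Two points are assumptions rather than statements from the text: counts are capped at the largest value that fits in the chosen number of bits (the text states the cap of 15 only for 4 bits), and a sequence is counted as "similar" when at least one other sequence in the same species has an identical frequency sequence.

```python
from collections import Counter


def encode_counts(counts, bits):
    """Encode codon counts as a bit string using `bits` bits per codon.

    Assumption: counts above 2**bits - 1 are capped at that value.
    """
    cap = (1 << bits) - 1
    width = "0{}b".format(bits)
    return "".join(format(min(count, cap), width) for count in counts)


def count_similar(frequency_sequences):
    """Count sequences whose encoding is shared with another sequence
    (assumed definition of 'similar sequences')."""
    occurrences = Counter(frequency_sequences)
    return sum(n for n in occurrences.values() if n > 1)


# Level 2 string of the caraway sequence with ID 295656260 (from the text).
level2 = "a014b02226125118255102230053523460a11060100211312c60403464310350"
counts = [int(digit, 16) for digit in level2]

level1 = encode_counts(counts, 1)  # 64 bits: present / absent
level3 = encode_counts(counts, 4)  # 256 bits: 4 bits per codon
print(len(level1), len(level3))
```

Applied to the level 2 string above, the 1 bit and 4 bit encodings reproduce the level 1 and level 3 strings given in the text for the same caraway sequence.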
4 Implementation of Our Software We designed our system for comparing DNA sequences based on the concepts of codons and their frequencies. We use the Python language, which supports the Bio library. The Bio library contains Entrez, through which we can fetch DNA sequences from online databases. We fetch the sequences from the nucleotide database and save them in FASTA format. Figure 3 shows the interface of our system, where the user can choose one of the 15 species mentioned above. Then, all the sequences of this species are fetched from the nucleotide database. Each sequence is divided into triplet codons, and then the frequency sequence of each fetched sequence is generated using level 3 with 4 bits. Each species has many sequences, possibly tens of thousands. With direct comparison, we would need a huge amount of memory as well as a long time to do the comparison, because of the thousands of
Fig. 3 The interface of the system
Fig. 4 The second window of the system, where a user inserts a sequence to be compared
comparisons that must be made between the DNA sequence to be checked and all the (thousands of) sequences in the selected species. Therefore, to reduce the amount of required memory and decrease the comparison time, we use the logical AND and logical OR operations as the means of comparison. The AND and OR operations are performed only once, when the sequences of the selected species are fetched from the database. This means that these operations are not repeated for each sequence to be checked. The process is as follows: after the frequency sequence of each fetched sequence is generated, the logical AND of all the frequency sequences is calculated; call it A(i). The logical OR of the same frequency sequences is also calculated; call it O(i). For a DNA sequence that we want to check for membership in the selected species, we convert it to a 256 bit string, call it X, and then apply the following condition: if (A(i) AND X = A(i)) and (O(i) OR X = O(i)), then the sequence X is of species i. As observed, only one step is needed to know whether a sequence belongs to the selected species instead of thousands of comparisons, which confirms that the logical AND and logical OR operations improve the comparison performance by reducing memory space and decreasing comparison time. The following example shows how fast and easy it is to use the previous condition. Figure 4 shows the second window of our system, where the user can insert the sequence to be compared with the
Example: Let a species consist of two sequences: seq1: 11011011 and seq2: 11000101. After applying the AND and OR operations we get A(i): 11000001 and O(i): 11011111.
CASE 1: let X = 10000101. A(i) AND X = 10000001 ≠ A(i). So the first part of the condition is not satisfied; as a result, the sequence X is not of species i.
CASE 2: let X = 11000101. A(i) AND X = 11000001 = A(i) and O(i) OR X = 11011111 = O(i). Both parts of the condition are satisfied; as a result, the sequence X is of species i.
Fig. 5 Compare the inserted DNA Sequence with caraway’s sequences
sequences of the selected species. Then the previous condition is applied; if it is satisfied, the inserted sequence is considered a sequence of the selected species. Otherwise, it is considered a new sequence for the selected species. As an example, suppose we choose the caraway species and want to compare the DNA sequence with ID = “295656260” with the caraway sequences to test whether it belongs to them or
not. Then we will insert the sequence into the system and the system will compare and give the result as shown in Fig. 5.
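The AND/OR membership condition of this section can be sketched as below. The function names build_species_masks and belongs_to_species are hypothetical, and holding the bit strings as Python integers is an implementation choice made for this sketch, not something stated in the text.

```python
from functools import reduce
from operator import and_, or_


def build_species_masks(frequency_sequences):
    """Compute A(i) and O(i): the bitwise AND and the bitwise OR of all
    frequency sequences of a species. This is done once per species."""
    values = [int(bits, 2) for bits in frequency_sequences]
    return reduce(and_, values), reduce(or_, values)


def belongs_to_species(x, and_mask, or_mask):
    """Apply the condition (A(i) AND X = A(i)) and (O(i) OR X = O(i))."""
    x_value = int(x, 2)
    return (and_mask & x_value) == and_mask and (or_mask | x_value) == or_mask


# The two-sequence example from the text.
a_i, o_i = build_species_masks(["11011011", "11000101"])
print(format(a_i, "08b"), format(o_i, "08b"))
print(belongs_to_species("10000101", a_i, o_i))  # CASE 1
print(belongs_to_species("11000101", a_i, o_i))  # CASE 2
```

With the 8 bit example, the masks are 11000001 and 11011111, CASE 1 is rejected and CASE 2 is accepted, as in the worked example. The same functions apply unchanged to the 256 bit level 3 strings, since Python integers have arbitrary precision.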
5 Our Software as a Service in Cloud Cloud computing represents a new revolution in the IT world. This area and its related technologies are worth using
Fig. 6 Home page of the website of the system
Fig. 7 An example using our system
Fig. 8 The result of the inserted sequence in Fig. 7
in IT solutions. One of these technologies is cloud computing platforms, in which users can access services through web browsers [15]. Cloud computing offers three kinds of services [16]: Infrastructure-as-a-Service (IaaS), in which providers like Amazon [17] provide machine instances to developers; Platform-as-a-Service (PaaS), in which providers like Google App Engine [18] provide a programming environment that abstracts machine instances and other technical details from developers; and Software-as-a-Service (SaaS), such as Microsoft’s (Windows Live) Hotmail, which does not interface with user information (e.g. documents). Google App Engine can be defined as a platform that allows developers to host and run web applications on Google’s infrastructure. Google App Engine offers a scalable deployment environment as traffic and data storage needs grow. Developers can access Google’s BigTable database, storage, and the same technologies for access control, security, and Web-services integration. Google App Engine supports the Python scripting language and Java [19]. As mentioned above, we use Google App Engine as the cloud computing platform and the Python language for our system. Figure 6 shows the website of our system. There, the user inserts the sequence that he/she wants to compare after selecting the species to compare with. Figure 7 shows an example in which the caraway species is chosen and one binary sequence is inserted to be compared with the caraway sequences. The result of the comparison for Fig. 7 appears in Fig. 8, which shows whether the inserted sequence belongs to the selected species or not. The service provided by our system is available through the following link: http://dnabinaryseq.appspot.com/. It is easy for any specialist to use.
6 Conclusion and Future Work In this paper, we present an approach for DNA sequence comparison using codons. The main motivation for our method is to avoid the complex calculations required by approaches such as graphical representation methods. Our method is fast and simple, and does not require sequence alignment or graphical representation of sequences. At the same time, it can be used to analyze both short and long DNA sequences. We also built software for the approach and tested it on 15 species. We compared the results manually with the data in NCBI and found that the method works well. In addition, we uploaded the software to the cloud to make it available to all specialists. The limitation of our method lies in the binary representation, which lacks positional dependence, but it is still an appropriate approach. For future work, we plan to make our software applicable to all the species in the nucleotide database. We also plan to develop the system into a classifier in addition to a comparison system.
References
1. Zhou, Q.; Jiang, Q.; Wei, D.: A new method for classification in DNA sequence. In: 6th International Conference on Computer Science Education (ICCSE), pp 218–221, 3–5 August (2011)
2. Qi, X.; Xin, X.; Li, S.: DNA sequence comparisons based on codons in the double helix. In: 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), vol 2, pp 558–562, 9–11 July (2010)
3. Yujuan, H.; Tianming, W.: New graphical representation of a DNA sequence based on the ordered dinucleotides and its application to sequence analysis. Int. J. Quantum Chem. 112, 1746–1757 (2012)
4. Liao, B.; Ding, K.: Graphical approach to analyzing DNA sequences. J. Comput. Chem. 26, 1519–1523 (2005)
5. Zheng, X.; Dou, Y.; Wang, J.: Phylogenetic inference from binary sequences reduced by primary DNA sequences. J. Math. Chem. 46(4), 1137–1148 (2008)
6. Yu, C.; Deng, M.; Yau, S.S.-T.: DNA sequence comparison by a novel probabilistic method. Inform. Sci. 181(8), 1484–1492 (2011)
7. Dorota, B.W.: Graphical and numerical representations of DNA sequences: statistical aspects of similarity. J. Math. Chem. 49(10), 2345–2407 (2011)
8. Chi, R.; Ding, K.: Novel 4D numerical representation of DNA sequences. Chem. Phys. Lett. 407(1–3), 63–67 (2005)
9. Liao, B.; Li, R.; Zhu, W.; Xiang, X.: On the similarity of DNA primary sequences based on 5-D representation. J. Math. Chem. 42(1), 47–57 (2007)
10. Cao, Z.; Liao, B.; Li, R.; Luo, J.; Zhu, W.: RNA secondary structure alignment based on an extended binary coding method. Int. J. Quantum Chem. 111(5), 978–982 (2011)
11. Yuan, C.; Liu, L.; Wang, T.; Li, C.: On property of the invariant of graphical representations of DNA sequences. J. Math. Chem. 43(3) (2008)
12. Cao, Z.; Li, R.; Chen, W.: A 3D graphical representation of DNA sequence based on numerical coding method. Int. J. Quantum Chem. 110(5), 975–980 (2010)
13. Randić, M.; Vračko, M.; Zupan, J.; Novič, M.: Compact 2-D graphical representation of DNA. Chem. Phys. Lett. 373(5–6), 558–562 (2003)
14. BCCM: http://bccm.belspo.be/newsletter/4-97/bccm01.htm#bccm4a1
15. Peng, J.; Zhang, X.; Lei, Z.; Zhang, B.; Zhang, W.; Li, Q.: Comparison of several cloud computing platforms. In: Second International Symposium on Information Science and Engineering (ISISE), pp 23–27, 26–28 December (2009)
16. Marinos, A.; Briscoe, G.: Community cloud computing. CORR, abs/0907.2485 (2009)
17. Amazon: Amazon Elastic Compute Cloud (EC2). Amazon Web Services LLC, Tech. Rep., 2009. [Online]. Available: http://aws.amazon.com/ec2/
18. Google: Google App Engine: Run your web apps on Google’s infrastructure. Google, Tech. Rep., 2009. [Online]. Available: http://code.google.com/appengine
19. Lawton, G.: Developing software online with platform-as-a-service technology. Computer 41(6), 13–15 (2008)