2011 IEEE World Congress on Services
Enabling Data Hiding for Resource Sharing in Cloud Computing Environments Based on DNA Sequences
Mohammad Reza Abbasy, Bharanidharan Shanmugam
Advanced Informatics School (AIS), International Campus, Universiti Teknologi Malaysia (UTM), Kuala Lumpur, Malaysia
To convert binary data into a DNA sequence of nucleotides, the base pairing rules must be used. In biology, nucleotides always pair according to fixed rules:
• The purine Adenine (A) always pairs with the pyrimidine Thymine (T).
• The pyrimidine Cytosine (C) always pairs with the purine Guanine (G).
These pairings arise naturally because A and T form two hydrogen bonds, whereas C and G form three (the hydrogen bonds are shown as dotted lines in Fig. 1). They are known as the Watson-Crick base pairing rules, after the researchers whose discovery of DNA's fundamental structure earned a Nobel Prize [13].

Abstract: This paper proposes an algorithm for hiding data in DNA sequences, implemented from a software point of view, to increase the confidentiality of data and the complexity faced by an attacker in cloud computing environments. The algorithm exploits several properties of DNA sequences and is based on binary coding and complementary pair rules. A DNA reference sequence is chosen and the secret data M is hidden in it. After several steps, the result M´´´ is uploaded to the cloud. When a client decides to use the data, the original data M hidden in the DNA reference sequence is identified and extracted. Finally, security issues are discussed to examine the complexity of the algorithm.
Keywords: DNA sequence; Cloud Computing; Data hiding; DNA base pairing rules; complementary rules; DNA binary coding; data confidentiality.
I. INTRODUCTION
To protect data travelling over insecure networks such as the Internet, various types of data protection are necessary. With the advent of Cloud Computing, a common problem, the confidentiality of data, has emerged [19, 20, 21, 22, 23]. Combining different ideas can help to achieve an acceptable level of confidentiality in Cloud Computing environments. One well-known way to protect data on the Internet is data hiding. Because of the increasing number of Internet users, the use of data hiding or steganographic techniques is inevitable. The ultimate goals of these techniques are to shut out the intruder and to give access only to authorized clients. The role of data hiding has therefore become more prominent.
Before the biological properties of DNA sequences were employed, the traditional way of data hiding was to embed secret data into a host image [1, 2, 3, 4, 5]. This approach has some weaknesses. The most important one is that distortions become detectable once the host image has been changed to some degree, which gives an attacker a starting point for detecting the secret data in the image. With the introduction of DNA sequences to computing, researchers have proposed new data hiding methods based on DNA sequences [6, 7, 8, 9, 10]. The key element of their work is the use of the biological characteristics of DNA sequences.
978-0-7695-4461-8/11 $26.00 © 2011 IEEE DOI 10.1109/SERVICES.2011.45
Fig.1. Synthesizing basic nucleotides naturally
In binary computing, the natural rules can be changed at will. For example, in biology A pairs with T, whereas here we may pair A with C or A with G, as we prefer. The main purpose of changing the rules is to increase the complexity of the algorithm. In this paper, the authors use A=00, T=01, C=10, and G=11 to convert binary data to DNA sequences. Another way to increase the complexity is the complementary pair rule, which assigns a unique complement to every nucleotide. As an example, a complementary rule is applied to a strand below [16]:
Complementary rule: ((AC) (CG) (GT) (TA))
DNA strand: AATGC
Applying complementary rule on DNA strand: CCATG
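The substitution in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the names `COMPLEMENTARY_RULE`, `apply_complementary_rule` and `invert_rule` are illustrative and do not come from the paper, and the rule ((AC) (CG) (GT) (TA)) is read as "A is replaced by C, C by G, G by T, T by A".

```python
# Complementary rule from the example: A->C, C->G, G->T, T->A.
COMPLEMENTARY_RULE = {"A": "C", "C": "G", "G": "T", "T": "A"}


def apply_complementary_rule(strand, rule=COMPLEMENTARY_RULE):
    """Replace every nucleotide of the strand by its complement under the rule."""
    return "".join(rule[base] for base in strand)


def invert_rule(rule):
    """Build the inverse mapping, which undoes the substitution."""
    return {complement: base for base, complement in rule.items()}


print(apply_complementary_rule("AATGC"))  # prints CCATG
```

Because every nucleotide has exactly one complement, the rule can be inverted, which is what the receiver relies on later when recovering the data.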
The main purpose of using these rules in this paper is to increase complexity. An intruder needs extra computation to find the original data, because the four letters of the alphabet can be arranged in 4×3×2×1=24 ways, giving 24 possible complementary rules for every DNA sequence. The probability of a correct guess is therefore $\frac{1}{24}$. Further information on the basic definitions can be found in molecular biology reference books [11, 12, 13].
The remainder of this paper is structured as follows. First, some related works are presented. Section III, the technical core of this paper, then discusses the proposed method, including the phases for embedding the secret data and extracting the original data. Section IV presents the security issues. Finally, Section V concludes the paper.
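The count of 24 rules stated in the introduction can be checked by enumeration. The sketch below counts every permutation of the four letters, exactly as the product 4×3×2×1 does, so it also includes mappings that send a letter to itself; whether such mappings are acceptable as complementary rules is not addressed here. The function name `all_rules` is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import permutations

NUCLEOTIDES = "ACGT"


def all_rules(alphabet=NUCLEOTIDES):
    """One substitution rule per permutation of the four-letter alphabet."""
    return [dict(zip(alphabet, image)) for image in permutations(alphabet)]


rules = all_rules()
# Chance that a single uniform guess hits the rule actually in use.
guess_probability = Fraction(1, len(rules))
print(len(rules), guess_probability)
```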
Fig.2. Assembly of DNA binary strands [8].
All binary sequences have the form s{0|1}e: an arbitrary number of DNA bits is concatenated between two terminators, s and e, which mark the start and the end, respectively. The DNA bits are concatenated between s and e by annealing and ligation. Both the terminators and the DNA bits are made by annealing complementary oligonucleotides. Sticky ends (A, Ā, X, Ȳ) are necessary for concatenation on both sides. The sticky ends A and Ā ensure the correct concatenation of bits and terminators, and the sticky ends X and Ȳ are used for subsequent cloning.
In 2000, Leier et al. [8] introduced a robust technique that uses a special key strand, which they called a primer. The primer plays the key role in decrypting a coded strand. Their DNA-based encryption technique uses a public DNA strand as a reference sequence, so the receiver must also be informed about the reference sequence. That is, the receiver receives a selected primer and an encrypted strand. An intruder cannot decrypt the binary data without knowing both the primer and the reference strand. A primer is a short strand that is complementary to a substring of a DNA strand. For example, let S = “ATGCTTAGTTCCATCGGAGACTAATGGCCTA” and take the two primers ATCAA and GATTAC. Then ATCAA and GATTAC are complementary to the substrings TAGTT and CTAATG of S, respectively. A complementary rule is needed to handle these manipulations correctly; they defined the rule A-T, T-A, C-G, and G-C. In biology, primers are combined with a DNA strand through a series of chemical mechanisms, and a fluorescent chemical substance indicates the position of the primers. Wherever hybridization occurs between a primer and a substring of the DNA reference sequence, the sequence becomes bright.
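The primer example can be checked in software. The following is a hedged sketch, not the laboratory procedure of [8]: it follows the text in taking the complement position by position with the rule A-T, T-A, C-G, G-C, and it does not model strand orientation. The names `complement` and `binding_site` are hypothetical.

```python
# Watson-Crick rule as given in the text: A-T, T-A, C-G, G-C.
WATSON_CRICK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Position-by-position complement of a strand."""
    return "".join(WATSON_CRICK[base] for base in strand)


def binding_site(reference, primer):
    """0-based start of the first substring the primer is complementary to, or -1."""
    return reference.find(complement(primer))


S = "ATGCTTAGTTCCATCGGAGACTAATGGCCTA"
for primer in ("ATCAA", "GATTAC"):
    print(primer, complement(primer), binding_site(S, primer))
```

In this sketch the positions returned by `binding_site` play the role of the places where hybridization would occur.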
Everywhere except the bright places, the sequence stays dim. A bright section represents the binary value ‘1’ and a dim section represents the binary value ‘0’. According to the above, the output of the hybridization is read off as a string of these bits.
II. RELATED WORKS
To clarify the algorithm in this paper, some background knowledge is necessary [8, 11, 12, 13]. The most important part of every DNA-based data hiding algorithm is the manipulation of four letters, which in biology are called nucleotides. The letters are A, C, G, and T, and any arrangement of them forms a sequence. For instance, two DNA sequences appear in [16], taken from the European Bioinformatics Institute (EBI) database [15]: Litmus, with 154 nucleotides, and Balsaminaceae, with 2283 nucleotides. They are shown below, respectively:
Litmus: “ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAGATAATTGTATTTAAGTGCCTGCTCGATACAATAAACGCCATTTGACC”.
Balsaminaceae: “TTTTTATTATTTTTTTTCATTTTTTTCTCAGTTTTTAGCACATATCATTACATTTTATTTTTTCATTACTTCTATCATTCTATCTATAAAATCGATTATTTTTATCACTTATTTTTCTAATTTCCATATTTCATCTAATGATTATATTACATTAAAGAAATCG”.
Although data can be represented by DNA nucleotides, a binary representation emerged only in 1999 with Rauhe et al. [17]. They represented numbers by using binary DNA sequences in figure 1. The novelty of their work lay in how they separated binary sequences from each other.
Therefore, the final secret message in binary form is “01010”. If the sender wants to make the technique more complicated, it can send only one part of the primers to the legitimate receiver. To recover the primers, the receiver must apply PCR (Polymerase Chain Reaction), another chemical scheme, which can recover the primers correctly [16].
In 2001, Peterson [9] proposed treating a certain substring of a DNA sequence as a character. By substituting three successive nucleotides for each character, data can be hidden in a DNA sequence; for instance, ‘A’=GGC, ‘B’=ATG, and so on. This scheme provides 64 symbols that can be encoded. Its main weakness is the high frequency of ‘E’ and ‘I’ in an English message: an intruder can apply cryptanalysis based on the frequencies of the most common English letters and then extract the secret message easily.
The next scheme, from 2002 [18], needs some background. An arrangement of nucleotides determines a protein, and proteins are responsible for almost every activity in cells. Transcription is the process that creates RNA, an intermediary copy of the instructions contained in DNA. RNA has four bases: adenine (A), cytosine (C), uracil (U) and guanine (G). The RNA copy is turned into mRNA (messenger RNA) by discarding the intervening sequences of the RNA. On mRNA, each codon consists of three nucleotides and specifies which amino acid must be placed at the next position. All the amino acids are listed in Table 1. They are: Phe, Leu, Ile, Val, Ser, Pro, Thr, Ala, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys, Trp, Arg, Met and Gly. Each codon binds to an anticodon, a set of three nucleotides on a tRNA (transfer RNA) molecule.
The role of tRNA is to translate nucleic acid into protein. In total, there are forty distinct tRNA molecules, each of which binds one of the amino acids, as represented in Table 1. The appropriate tRNA binds to the codon on the mRNA. When a STOP codon is encountered, translation is complete and a protein is released. Shimanovsky et al. [18] exploited codon redundancy to hide data in mRNA. Each mRNA codon is composed of three nucleotides, each of which may be ‘U’, ‘C’, ‘A’, or ‘G’. There are therefore 4×4×4=64 possible codons, whereas Table 1 contains only twenty different amino acids, because several codons may map to the same amino acid. For instance, the codons ‘UUA’, ‘CUU’, ‘CUA’ and ‘UUG’ all map to the amino acid Leu.

Table 1. The mapping of codon to amino acid [18].
UUU → Phe   UUA → Leu    CUU → Leu   CUA → Leu   AUU → Ile   AUA → Ile         GUU → Val   GUA → Val
UUC → Phe   UUG → Leu    CUC → Leu   CUG → Leu   AUC → Ile   AUG → Met/Start   GUC → Val   GUG → Val
UCU → Ser   UCA → Ser    CCU → Pro   CCA → Pro   ACU → Thr   ACA → Thr         GCU → Ala   GCA → Ala
UCC → Ser   UCG → Ser    CCC → Pro   CCG → Pro   ACC → Thr   ACG → Thr         GCC → Ala   GCG → Ala
UAU → Tyr   UAA → Stop   CAU → His   CAA → Gln   AAU → Asn   AAA → Lys         GAU → Asp   GAA → Glu
UAC → Tyr   UAG → Stop   CAC → His   CAG → Gln   AAC → Asn   AAG → Lys         GAC → Asp   GAG → Glu
UGU → Cys   UGA → Stop   CGU → Arg   CGA → Arg   AGU → Ser   AGA → Arg         GGU → Gly   GGA → Gly
UGC → Cys   UGG → Trp    CGC → Arg   CGG → Arg   AGC → Ser   AGG → Arg         GGC → Gly   GGG → Gly
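The redundancy visible in Table 1 can be made explicit by grouping codons by the amino acid they encode. The sketch below is an illustration only: it builds the standard genetic code from its usual compact 64-character form, uses one-letter amino acid codes (for example L for Leu, I for Ile, M for Met) instead of the three-letter codes of the table, and marks stop codons with "*". The names `CODON_TABLE` and `synonymous_codons` are assumptions, not part of [18].

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons ordered as
# UUU, UUC, UUA, UUG, UCU, ... (third base varies fastest); "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(table=CODON_TABLE):
    """Group the codons by the amino acid (or stop signal) they map to."""
    groups = defaultdict(list)
    for codon, amino_acid in table.items():
        groups[amino_acid].append(codon)
    return dict(groups)


groups = synonymous_codons()
amino_acids = [name for name in groups if name != "*"]
print(len(CODON_TABLE), len(amino_acids))
print(groups["L"])  # all codons that encode Leu
```

Every group with more than one codon offers a choice that does not change the encoded amino acid, and that choice is the capacity a codon-based hiding scheme can draw on.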
Embedding information in mRNA codons is feasible because of this redundancy. For example, if the codon to be encoded is ‘UUA’ and the secret message is four, the codon ‘UUG’ can replace the original, because ‘UUG’ is the fourth of the codons that map to the amino acid Leu. Although this replacement does not change the amino acid that is produced, it modifies the nucleotides of the original sequence, which might trigger unknown consequences.
Although these earlier schemes use distinct biological features to hide data in DNA sequences, they are neither cost-effective nor efficient to employ. In this paper, we hide data in DNA strands from a software point of view to improve on them. Bringing the biological aspects of DNA to computing areas such as Cloud Computing requires thinking as a software engineer, because the nature of Cloud Computing is different from biology [24].

III. PROPOSED METHOD
Our method considers a cloud environment and its clients within the same company. The clients (client 1 and client 2) want to upload data to the cloud in such a way that the confidentiality of the data is as high as possible. The clients therefore need a method that raises the level of confidentiality so that no one can read the data even after gaining access to them, intentionally or unintentionally. Figure 3 shows the flow of data together with the flow of the data hiding method.
Fig.3. Flow of data and using data hiding

First, client 1 must apply the data hiding method to the data that it wants to hide in the cloud computing environment. Data hiding is divided into two phases: the first embeds the data, and the second extracts the original data.

A. Phase 1: Embedding Secret Data
The embedding phase is best explained by dividing it into successive, clear sub-phases. They are shown below in order.

Flow: M = Data → Convert Binary to DNA Nucleotides → M´ = DNA Sequence → Applying Complementary Rules on M´ → M´´ = New Form of M´ → Finding Index of each Couple of Nucleotides in DNA Reference Sequence → M´´´ = Secret Data
Fig.4. Phase 1: embedding secret data

There is an original data M that the client decides to upload over a network to the cloud computing environment. Three sub-phases produce the final form of M, which is M´´´, and upload it to the cloud. The first sub-phase converts the data by the DNA base pairing rules (the binary coding of the nucleotides). Its product, M´, is a sequence of nucleotides. The same rules that convert the data from binary to a DNA sequence are also applied in reverse to turn the secret data back into the original.
The second sub-phase applies the complementary rules; its purpose is to increase complexity. Applying the complementary rules to M´ yields its new form, M´´. As mentioned before, both clients hold a DNA reference sequence chosen from the large number of possibilities in the EBI [15] or NCBI [14] database; that is, they have selected exactly the same DNA reference sequence. The third sub-phase extracts the numerical index of each pair of nucleotides of M´´ in the DNA reference sequence. When all the indexes have been extracted, M´´´ is complete. M´´´ is the secret data as transformed by the embedding phase. The sender can now send M´´´ to the cloud. The phase is illustrated step by step in the following example, in which the original data M=100111000011 is to be uploaded to the cloud.
DNA Reference Sequence:
AT(1) CG(2) AA(3) TT(4) CG(5) CG(6) CT(7) GA(8) GT(9) CA(10) CA(11) AT(12) TC(13) GC(14) GC(15) TG(16) AG(17) TG(18) AA(19) CC(20)
M=100111000011
Sub-phase 1 (A=00, T=01, C=10, G=11): M´=CTGAAG
Sub-phase 2 ((AC) (CG) (GT) (TA)): M´´=GATCCT
Sub-phase 3 (Indexes): M´´´=8,13,7
The embedding phase is now complete, and the sender sends 8, 13, 7 to the cloud. In the next section, client 2 applies the extracting phase, which recovers the original data in three consecutive sub-phases.

B. Phase 2: Extracting Original Data
Client 2 now receives the secret data in the form of numbers. Phase two, with its sub-phases, extracts the original data from the DNA reference sequence.

Flow: M´´´ = Secret Data → Finding Index of each Couple of Nucleotides in DNA Reference Sequence → M´´ = Previous Form of M´ → Applying Complementary Rules on M´ → M´ = DNA Sequence → Convert Binary to DNA Nucleotides → M = Data
Fig.5. Phase 2: extracting original data
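The two phases of Fig. 4 and Fig. 5 can be sketched in Python with the values of the worked example (M=100111000011, the 20-pair reference sequence, A=00, T=01, C=10, G=11 and the rule ((AC) (CG) (GT) (TA))). This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are illustrative, indexes are 1-based as in the example, and when a pair occurs several times in the reference sequence the first occurrence is used, a choice the text does not specify.

```python
BINARY_CODING = {"00": "A", "01": "T", "10": "C", "11": "G"}
COMPLEMENTARY_RULE = {"A": "C", "C": "G", "G": "T", "T": "A"}
# The 20 nucleotide pairs of the example, written as one string.
REFERENCE = "ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTGAACC"


def pairs(sequence):
    """Split a sequence into consecutive, non-overlapping two-character chunks."""
    if len(sequence) % 2:
        raise ValueError("sequence length must be even")
    return [sequence[i:i + 2] for i in range(0, len(sequence), 2)]


def embed(bits, reference=REFERENCE):
    """Phase 1: M -> M' -> M'' -> M''' (list of 1-based pair indexes)."""
    m1 = "".join(BINARY_CODING[p] for p in pairs(bits))   # binary to nucleotides
    m2 = "".join(COMPLEMENTARY_RULE[b] for b in m1)       # complementary rule
    reference_pairs = pairs(reference)
    # list.index raises ValueError if a pair does not occur in the reference.
    return [reference_pairs.index(p) + 1 for p in pairs(m2)]


def extract(indexes, reference=REFERENCE):
    """Phase 2: M''' -> M'' -> M' -> M."""
    reference_pairs = pairs(reference)
    m2 = "".join(reference_pairs[i - 1] for i in indexes)  # look up the indexes
    inverse_rule = {v: k for k, v in COMPLEMENTARY_RULE.items()}
    m1 = "".join(inverse_rule[b] for b in m2)              # undo complementary rule
    to_bits = {v: k for k, v in BINARY_CODING.items()}
    return "".join(to_bits[b] for b in m1)                 # nucleotides to binary


secret = embed("100111000011")
print(secret)           # prints [8, 13, 7]
print(extract(secret))  # prints 100111000011
```

Note that this short reference sequence does not contain every possible pair of nucleotides, so `embed` fails for some inputs; a practical reference sequence would have to contain all sixteen pairs.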
The first sub-phase operates on M´´´. Because the secret data consist of numbers (the exact positions, or indexes, of the data in the DNA reference sequence), extraction starts by looking up those indexes in the DNA reference sequence one by one, in the order in which the sender sent them. The product of the first sub-phase is M´´. The second sub-phase then applies the complementary rules to M´´ in order to recover M´, which is the last form of the data expressed in DNA nucleotides. The third sub-phase converts M´ from DNA nucleotides back to binary, which gives M. Client 2 has now extracted the original data M. These steps are demonstrated in the example below.
DNA Reference Sequence:
AT(1) CG(2) AA(3) TT(4) CG(5) CG(6) CT(7) GA(8) GT(9) CA(10) CA(11) AT(12) TC(13) GC(14) GC(15) TG(16) AG(17) TG(18) AA(19) CC(20)
M´´´=8,13,7
Sub-phase 1 (Indexes): M´´=GATCCT
Sub-phase 2 ((AC) (CG) (GT) (TA)): M´=CTGAAG
Sub-phase 3 (A=00, T=01, C=10, G=11): M=100111000011
The receiver has thus extracted the original data accurately by using a simple algorithm. The next section briefly examines the security and weaknesses of the algorithm.

IV. SECURITY ISSUES
To attack the scheme, an intruder must know all of the following information. Without it, the probability of extracting the original data is close to zero:
• DNA reference sequence: there are 163 million DNA reference sequences in the EBI database. The probability that an attacker guesses the correct one is therefore $\frac{1}{163 \times 10^{6}}$.
• Binary coding rule: as mentioned, the clients are free to select any binary equivalent for every nucleotide. That is, A can be ‘00’, ‘01’, ‘10’, or ‘11’; C can be ‘00’, and so on. In other words, there are 4×3×2×1=24 binary coding rules, so the probability that an attacker guesses correctly is $\frac{1}{24}$.
• Complementary pairing rule: as with the binary coding rule, there are 4×3×2×1=24 complementary rules over the basic nucleotides. The probability of a successful attack is therefore $\frac{1}{24}$.
The final probability of a correct and successful guess by an attacker is therefore
$$\frac{1}{163 \times 10^{6}} \times \frac{1}{24} \times \frac{1}{24} \approx 1.07 \times 10^{-11}.$$

V. CONCLUSION
One of the basic problems in cloud computing environments is data confidentiality. The characteristics of DNA bring new ideas to data hiding that raise the level of data confidentiality among clients. DNA sequences have the potential to support new data hiding techniques, or even to transform previous schemes into new ones. In this paper, a reference DNA sequence is shared among the clients. This DNA reference sequence can be retrieved from the EBI [15] or NCBI [14] databases, or simply selected from any other database. There are therefore 163 million candidates to select from, and guessing the correct DNA sequence is virtually impossible for an attacker. The crucial feature of DNA sequences is visibility: finding secret data in a DNA sequence is difficult because the visibility of the sequences is very low. As a result, an attacker cannot find out whether a sequence is a fake or not. Compared with previous techniques, such as those based on images, this method is not only difficult for an attacker to break but also formidable to detect.

Acknowledgment. We would like to express our appreciation to (AIS) Universiti Teknologi Malaysia (UTM) for providing financial support for this research.

VI. REFERENCES
1. C.C. Chang, C.C. Lin, C.S. Tseng, W.L. Tai, Reversible hiding in DCT-based compressed images, Information Sciences 177 (2007).
2. C.C. Chang, W.C. Wu, Y.H. Chen, Joint coding and embedding techniques for multimedia images, Information Sciences 178 (2008).
3. C.H. Huang, J.L. Wu, Fidelity-guaranteed robustness enhancement of blind-detection watermarking schemes, Information Sciences 179 (2009).
4. H.H. Tsai, D.W. Sun, Color image watermark extraction based on support vector machines, Information Sciences 177 (2007).
5. H.W. Tseng, C.P. Hsieh, Prediction-based reversible data hiding, Information Sciences 179 (2009).
6. C.C. Chang, T.C. Lu, Y.F. Chang, R.C.T. Lee, Reversible data hiding schemes for deoxyribonucleic acid (DNA) medium, International Journal of Innovative Computing, Information and Control 3 (2007).
7. C.T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots, Nature 399 (1999).
8. A. Leier, C. Richter, W. Banzhaf, H. Rauhe, Cryptography with DNA binary strands, BioSystems 57 (2000).
9. I. Peterson, Hiding in DNA, Muse (2001).
10. B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002).
11. B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, J.D. Watson, Molecular Biology of the Cell, Garland Publishing, New York & London (1994).
12. A.L. Lehninger, D.L. Nelson, M.M. Cox, Principles of Biochemistry, Worth, New York (2000).
13. J. Watson, N. Hopkins, J. Roberts, J. Steitz, Molecular Biology of the Gene, fourth ed., Benjamin Cummings, Menlo Park, CA (1987).
14. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
15. European Bioinformatics Institute, http://www.ebi.ac.uk/
16. H.J. Shiu, K.L. Ng, J.F. Fang, R.C.T. Lee, C.H. Huang, Information Sciences, Volume 180, Issue 11, 1 June (2010).
17. Rauhe, H., Vopper, G., Banzhaf, W., Howard, J.C., (1999).
18. B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002).
19. Minqi Zhou, Rong Zhang, Wei Xie, Weining Qian, Aoying Zhou, "Security and Privacy in Cloud Computing: A Survey," Semantics Knowledge and Grid (SKG), 2010 Sixth International Conference on, pp. 105-112, 1-3 Nov. 2010.
20. Jian Wang, Yan Zhao, Shuo Jiang, Jiajin Le, "Providing privacy preserving in cloud computing," Test and Measurement, 2009. ICTM '09. International Conference on, vol. 2, pp. 213-216, 5-6 Dec. 2009.
21. W. Itani, A. Kayssi, A. Chehab, "Privacy as a Service: Privacy-Aware Data Storage and Processing in Cloud Computing Architectures," Dependable, Autonomic and Secure Computing, 2009. DASC '09. Eighth IEEE International Conference on, pp. 711-716, 12-14 Dec. 2009.
22. F. Doelitzscher, C. Reich, A. Sulistio, "Designing Cloud Services Adhering to Government Privacy Laws," Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pp. 930-935, June 29-July 1, 2010.
23. S. Pearson, "Taking account of privacy when designing cloud computing services," Software Engineering Challenges of Cloud Computing, 2009. CLOUD '09. ICSE Workshop on, pp. 44-52, 23 May 2009.
24. Chunye Gong, Jie Liu, Qiang Zhang, Haitao Chen, Zhenghu Gong, "The Characteristics of Cloud Computing," Parallel Processing Workshops (ICPPW), 2010 39th International Conference on, pp. 275-279, 13-16 Sept. 2010.