2015 Asia-Pacific Conference on Computer Aided System Engineering
Review of Big Data Storage based on DNA Computing
Hanadi Ahmed Hakami, Zenon Chaczko and Anup Kale
Centre for Real-time Information Networks (CRIN), Faculty of Engineering and Information Technology (FEIT), University of Technology, Sydney (UTS), Sydney, NSW, Australia
Abstract—Over the past decade there has been a need to significantly scale down the media on which information is stored. Replacing hard paper copies with soft and digital versions has helped in two ways: it has increased the efficiency of data management and improved the distribution of, and access to, information. It is timely to review recent progress on the feasibility of data storage in synthetic DNA. We also consider how this breakthrough technology could dramatically change the way our information is stored. The topic of DNA-based Big Data storage is described from the earliest research to the most recent, covering advantages and disadvantages, techniques, and how it may become practice in the future. We also propose a simple method for storing data in DNA. Experimental work was done to validate the proposed approach, and the results clearly show the merits of the proposed method.
II. RELATED WORK
Deoxyribonucleic acid (DNA), found in the cells of every organism, is a double-stranded helix of nucleotides. Each nucleotide consists of a five-carbon sugar, a phosphate group and one of four nitrogenous bases: adenine, cytosine, guanine and thymine. An immense amount of information is stored in DNA, and DNA computing methods have been applied to a significant number of combinatorial problems. A few grams of DNA could potentially store all the data in the world, since 1 gram of DNA contains about 10^21 DNA bases. This DNA can also be kept in dry, cold and dark conditions. For the storage problem, there are many reasons to use DNA, including its ubiquity and its very small size. In their initial work, Lipton, Adleman [1] and several other researchers suggested that DNA-based approaches can be used to solve problems such as the satisfiability (SAT) problem and the travelling salesman problem. In 1994, Adleman indicated that DNA might be used to store information and to perform some computations. Adleman also stated in 1994 that DNA uses its four bases to represent each input parameter [1]. In 1995, Lipton described how all possible assignments of values in a SAT problem could be depicted as a graph and then translated into DNA, in the same way as in the travelling salesman example [10]. Researchers from NEC showed some results from using molecular computing to solve an NP-complete problem. Adleman and his research team implemented a graph of six vertices, with the nodes connected to one another individually. They then employed a method called "thermal cycling", which generated all possible paths between the nodes.
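The generate-and-filter idea behind this kind of experiment can be sketched in software. The snippet below is only an illustration: the six-vertex directed graph, its edge list and the function name `hamiltonian_paths` are hypothetical and are not taken from the original experiment, and a sequential loop stands in for the massively parallel formation of candidate paths in a test tube.

```python
from itertools import permutations

# Hypothetical directed graph on six vertices (not the graph used in [1]).
EDGES = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5), (3, 1)}


def hamiltonian_paths(n, edges, start, end):
    """Return every path from start to end that visits each of n vertices once."""
    found = []
    inner = [v for v in range(n) if v not in (start, end)]
    # Generate step: enumerate every candidate ordering of the inner vertices.
    for middle in permutations(inner):
        path = (start, *middle, end)
        # Filter step: keep the candidate only if every hop is a real edge.
        if all((a, b) in edges for a, b in zip(path, path[1:])):
            found.append(path)
    return found


print(hamiltonian_paths(6, EDGES, start=0, end=5))
```

For this toy graph the only candidate that survives the filter is the path 0, 1, 2, 3, 4, 5. The number of candidates grows factorially with the number of vertices, which is why the parallelism of DNA is attractive for such problems.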
In 2002, Adleman and other researchers solved the largest problem handled by a DNA computer up to that time, a 20-variable 3-SAT problem [3][4]. Using an encoding scheme comparable to the one Lipton proposed in 1995 for the SAT problem, the team of researchers performed an exhaustive search over about 1 million possibilities. In 1999, Clelland, Risca and Bancroft developed the idea of storing data in DNA by encoding the data in DNA strands [6]. To encrypt the information they used a DNA nucleotide polymer, PCR and a key. Then, the DNA sequence needs to be repeated a hundred times
Index Terms: DNA, DNA Computing, DNA-Based Storage, Coding Theory, Encoding, DNA Cloud
I. INTRODUCTION
Our information infrastructure faces an unavoidable problem: preserving the data that is stored and retrieved. The demand for storing more and more data increases day by day. In 2012, the total digital information in the world was about 2.7 zettabytes, and it is increasing by 50% every year. The journey of data storage started with rocks, bones, paper, drums, films, punched cards, magnetic tapes, gramophone records and so on. Data was later stored on CDs and DVDs, and has now shifted to portable hard drives and USB flash drives. Yet none of these techniques can cope with the rapid growth of data. Moreover, silicon and other non-biodegradable materials can pollute the environment, and they are limited resources that will one day be exhausted. By 2015, the projected data demand could rise to 8000 exabytes, whereas the maximum storage density of these devices is 1 terabyte per square inch. Because current storage technologies cannot handle this demand efficiently, libraries, corporations and file-sharing systems favour shifting to newer technologies for archival purposes.
978-1-4799-7588-4/15 $31.00 © 2015 IEEE DOI 10.1109/APCASE.2015.27
Figure 1. Identified the research work and coding approaches (Clelland and Catherine Taylor 1999)
Figure 2. Structure of DNA molecules used for data storage (Bancroft 2011)
in order to provide a strong background for storing the data. This process is called PCR. The technology used to reduce the encoded picture to the size of a period, and later to recover it from the DNA strand, is called microdot technology. By concealing the message in DNA against a complex background, privacy and security are ensured. Gel electrophoresis analysis was performed as part of hiding the information in DNA. As a result, it became clear how information can be stored using the DNA model. Research into DNA storage suggests that it can allow far more privacy and security than silicon devices. Figure 1 identifies the coding approach used to store data in the study of Clelland, Risca and Bancroft. According to Bancroft, IDNA is a similar idea for storing and encrypting data [3]. Although the same mechanism is used, IDNA uses poly-primer key sequences to access the information in DNA. IDNA relies on both DNA information and encoded data. To decode the data, sequence analysis is carried out in order, using the encryption scheme. Microdot data storage was the first type of experiment based on DNA, and it paved the way for the future of DNA computing. Its drawback, on the other hand, is that it does not provide privacy and security from inside and outside the DNA. Fig. 2 shows the structure of the DNA molecules used for storing and reading information. In 2003, research on storing data in DNA and retrieving it was published by Wong et al. [11]. To ensure the growth of the DNA and the durability of the information, they needed a vector to contain the data with DNA, because the information can be lost if the DNA strands break at the ends. The vector can store and protect the encoded strands and the synthesized gene sequences from harsh conditions. The team used Escherichia coli and Deinococcus radiodurans to increase the reproduction rate.
As in a digital arrangement, they used oligonucleotide sequences as 1s and 0s to build an ASCII-like scheme that can represent text as on silicon devices. Out of the 10 billion sequences they could find, they had to choose carefully the safe sequences that would not compromise the encoding. After this step, a restriction enzyme is used by the
Figure 3. Encoding system (Parker 2003)
team to create strands twenty bases long, which play two roles: one is to allow the required encoded data to be inserted, and the other is to take the reproduced double strands into a recombinant plasmid. PCR is then used when the data is required. Stop codons are used in order to keep and protect the message. In this way, 7 chemically synthesized DNA fragments carrying 57-99 base pairs of foreign encoded information were obtained in bacteria. Because DNA is dense, a suitable host can store a large amount of encoded data. Figure 3 shows the encoding schemes. This research therefore opened the way to using DNA to store information. It was also significant because it addressed the need for methods of protecting the desired data from environmental extremes, radiation and other conditions harmful to the delicate DNA. The period between 2003 and 2009 focused on techniques for encoding DNA. Digital data was used throughout, and in 2009 researchers began to encode this data in the form of base triplets. Over those six years, different encoding schemes could be used. There appears to be a need to design a universal scheme for mapping between DNA bases and different languages, one that can represent all known usable data and can be extended to fit new data formats. The economical use of nucleotide bases per character is one of the main requirements for the best DNA coding. Because it was shown mathematically to give a base-to-character ratio of around three, many researchers preferred, and still use, the Huffman coding scheme [9]. Craig Venter's project in 2010
Figure 4. The whole process (Church 2012)
encoded 7920 bits in the genome sequence of a Mycoplasma bacterium [7]. This was one of the largest projects, and it encoded the largest amount of data in DNA up to that date; however, digital data storage was not yet considered within the project. Moreover, it was the first time a synthetic cell had been produced, so it was a significant achievement. In 2012, Church, Gao and Kosuri published a foundational paper that reported the use of 11 JPG images and an HTML-coded draft of a 53,000-word book in their experiments [5]. To show the potential of DNA to store all kinds of data, they used a mixture of file types. They assigned adenine or cytosine to 0 and thymine or guanine to 1, which means that the binary string '10011001' could be encoded as the DNA sequence 'TAAGGCCT'. According to this scheme, a sequence read in one direction (5'-3') gives the same result as when it is read in the other direction (3'-5'). The 0s and 1s were encoded onto oligonucleotides, each holding a 96-bit data block. Because the data was huge, it was split into chunks of 96 bits each and encoded onto 54,898 159-nt oligonucleotides. This strategy was distinctive and workable because errors from synthesis and sequencing could be corrected automatically from the correct copies. Finally, they used state-of-the-art next-generation technologies for synthesis and sequencing, whose cost is about 10^5 times less than that of first-generation encoding. Figure 4 and Figure 5 show the basic principles of the system and a comparison of the work with some existing technologies.
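The one-bit-per-base mapping described above (A or C for 0, G or T for 1) can be sketched as follows. The decoder follows directly from the mapping in the text; the rule the encoder uses to choose between the two valid bases is an assumption made for illustration, and the function names are hypothetical.

```python
BIT_OF_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}
BASES_OF_BIT = {"0": "AC", "1": "GT"}


def decode_bits(dna):
    """Map each base back to its bit (A/C -> 0, G/T -> 1)."""
    return "".join(BIT_OF_BASE[base] for base in dna)


def encode_bits(bits):
    """Encode a bit string using one base per bit.

    Each bit has two valid bases. As an illustrative rule (an assumption,
    not the published procedure), pick the first candidate unless it would
    repeat the previous base, which avoids runs of identical bases.
    """
    dna = []
    for bit in bits:
        first, second = BASES_OF_BIT[bit]
        dna.append(second if dna and dna[-1] == first else first)
    return "".join(dna)


print(decode_bits("TAAGGCCT"))
print(decode_bits(encode_bits("10011001")))
```

Because every bit has two possible bases, many different DNA sequences decode to the same bit string; 'TAAGGCCT' from the text and the sequence produced by `encode_bits` both decode to '10011001'.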
In another breakthrough for DNA data storage, in 2013 Goldman and his group attempted to use DNA as storage for data and to recover this data effectively [8]. There are many options for archiving data, yet all have their drawbacks. For example, hard drives are expensive and need an electrical supply, while magnetic tape degrades over time. However, storing one petabyte of data on one gram of DNA in the future
Figure 5. Observation sketched on graph for the experiments (Church 2012)
Figure 6. Functionality of DNA Cloud (Shah and Gupta 2014, p 4)
can be a better solution for storing the immense amount of data coming from digital media. In the past we wrote data to hard drives and flash drives, which were created to handle human data storage efficiently. As the data has become too immense to handle, a need arises for a system for archiving and retrieving it. There are many approaches to storing data in DNA, and one approach developed to facilitate this storage is DNA Cloud [13]. DNA Cloud can take any kind of data (image, video or text) and save it to DNA. This software needs three steps to store data in DNA: the first is encoding the data, followed by decoding the data and finally a storage estimator, as shown in Figure 6. There are many encoding techniques for storing data in DNA sequences using DNA codes [2]. A Huffman code is one of the efficient source coding
techniques [9]; it is also unique and easy to decode. Moreover, a DNA matrix has been used to represent the metadata, followed by conversion of the DNA matrix into a Quick Response (QR) representation that offers a broad scope of practical usage [12]. DNA Cloud uses a Huffman-style encoding like the one implemented in [8]. The encoding module takes a data file in any format (for example text, .jpg or .mp3). The DNA sequence to be encoded is divided into fixed-length chunks that overlap one another, giving fourfold redundancy for error correction. The original file is converted to a base-3 Huffman code (digits 0, 1 and 2) of code length 5, and each base-3 digit is then converted to DNA code by replacing it with one of the three nucleotides that differ from the previous one. This module saves the encoded file with the extension '.dnac'. If any chunk or base of these codes is deleted, it can be recovered by reading the overlapping code sequences. To recover the data stored in DNA, the data has to be decoded. The decoding module therefore converts the base-3 Huffman DNA code back to the original data. The DNA sequences are the input to this module and the original data is the output. The input sequences can be supplied to this module as a '.dnac' file. The storage estimator has two main parts. In the first, the user chooses a file from the system, and values such as the file size in bytes and the size of the DNA strings are estimated, which helps the user decide how much memory can be used to encode and store data in DNA. The second part covers cost and biochemical properties: the user selects a '.dnac' file holding the stored data from the system and is prompted for this file.
For illustration, the module takes an input and provides the content and melting temperature values and the cost of the total DNA, then saves this estimated information. In the future, DNA may be not only the blueprint of life but also a home for storing digital data.
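Two ideas from the DNA Cloud description can be sketched in a few lines: writing each base-3 digit as a nucleotide that differs from the previous one, and cutting the result into overlapping fixed-length chunks. This is a minimal sketch under stated assumptions: the input is taken to be already in base-3 form (the Huffman step is omitted), the particular rotation order A, C, G, T and the starting base "A" are assumptions, and the function names and chunk parameters are hypothetical rather than those of DNA Cloud.

```python
BASES = "ACGT"  # assumed rotation order; the real tool may use another table


def trits_to_dna(trits, prev="A"):
    """Write each base-3 digit as a base that differs from the previous one."""
    dna = []
    for trit in trits:
        # Move 1, 2 or 3 places around the A-C-G-T cycle (never 0 or 4),
        # so the new base can never equal the previous base.
        prev = BASES[(BASES.index(prev) + 1 + trit) % 4]
        dna.append(prev)
    return "".join(dna)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna by measuring the step between neighbouring bases."""
    trits = []
    for base in dna:
        trits.append((BASES.index(base) - BASES.index(prev) - 1) % 4)
        prev = base
    return trits


def overlapping_chunks(dna, length, step):
    """Cut dna into full chunks of `length` bases, starting every `step` bases.

    With step = length // 4, bases away from the two ends fall in four
    chunks. Trailing bases that do not fill a whole chunk are ignored here.
    """
    return [dna[i:i + length] for i in range(0, len(dna) - length + 1, step)]


encoded = trits_to_dna([0, 1, 2, 2, 0])
print(encoded, dna_to_trits(encoded))
print(overlapping_chunks("ACGTACGTACGTAC", length=8, step=2))
```

Because no base is ever repeated, the output contains no homopolymer runs, and because interior bases appear in several chunks, a lost or damaged chunk can be rebuilt from its neighbours.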
III. POSSIBLE APPROACH TO THE DNA COMPUTING FOR BIG DATA
A. Process
Figure 7. Flowchart of the proposed method
B. Encoding / Decoding algorithm
1) Encoding algorithm:
1) Read data stream: A
2) Check the size of the data: [r, c, n] = size(A), where r = rows, c = columns, n = number of matrices
3) Calculate the DNA sequence size for the image
4) Create a zero matrix of the DNA sequence size
5) While the DNA sequence size has not reached max:
a) Convert each smallest piece of data to binary form
b) Insert the binary DNA code of an individual data cell into the DNA sequence
c) Continue until the max size of the DNA sequence is reached
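The encoding steps of the proposed approach can be sketched as follows, with a matching inverse included only to check the round trip. The paper does not specify the "binary DNA code", so the two-bits-per-base mapping (00 to A, 01 to C, 10 to G, 11 to T), the 8-bit cell size, the nested-list matrix and the function names are all assumptions made for illustration.

```python
BASES = "ACGT"        # assumed 2-bit code: 00->A, 01->C, 10->G, 11->T
BASES_PER_CELL = 4    # an 8-bit cell needs four 2-bit bases


def encode(matrix):
    """Encode a 2-D list of 8-bit values as a DNA string plus its shape."""
    rows, cols = len(matrix), len(matrix[0])      # step 2: check data size
    size = rows * cols * BASES_PER_CELL           # step 3: DNA sequence size
    dna = ["A"] * size                            # step 4: "zero" sequence (A = 00)
    pos = 0
    for row in matrix:                            # step 5: fill cell by cell
        for value in row:
            for shift in (6, 4, 2, 0):            # most significant bit pair first
                dna[pos] = BASES[(value >> shift) & 0b11]
                pos += 1
    return "".join(dna), (rows, cols)


def decode(dna, shape):
    """Rebuild the matrix from a DNA string produced by encode."""
    rows, cols = shape
    matrix = [[0] * cols for _ in range(rows)]    # template data matrix
    for cell in range(rows * cols):
        value = 0
        start = cell * BASES_PER_CELL
        for base in dna[start:start + BASES_PER_CELL]:
            value = (value << 2) | BASES.index(base)
        matrix[cell // cols][cell % cols] = value
    return matrix


image = [[0, 255], [27, 228]]
dna, shape = encode(image)
print(dna)
print(decode(dna, shape) == image)
```

In this sketch every cell costs four bases, so an r x c matrix of 8-bit values yields a sequence of 4rc bases; the shape is returned separately because the DNA string alone does not record the matrix dimensions.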
2) Decoding algorithm:
1) Read DNA sequence
2) Calculate size: length, size of individual cells
3) Convert DNA to real data (decode)
a) Convert one cell to data
b) Insert the converted value into the template data matrix
c) Stop when the entire DNA is converted
IV. CONCLUSION
Although research is still in its early stages, DNA storage has been shown to be very effective. Data storage based on DNA technology is no longer limited to science fiction, as it is becoming, at an increasing rate, an important and ubiquitous domain of research for many teams. DNA storage techniques show massive progress. The number of publications related to various DNA-based models and techniques increases tenfold annually. The presented methods convert every smallest possible piece of information into DNA form. This approach is therefore highly applicable to the handling and storage of massive amounts of various types of data.
REFERENCES
[1] Leonard M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266(5187):1021–1024, 1994.
[2] Masanori Arita. Writing information into DNA. In Aspects of Molecular Computing, pages 23–35, 2004.
[3] Carter Bancroft, Timothy Bowler, Brian Bloom, and Catherine Taylor Clelland. Long-term storage of information in DNA. Science, 293:1763–5, 2001.
[4] Ravinderjit S. Braich, Nickolas Chelyapov, Cliff Johnson, Paul W. K. Rothemund, and Leonard Adleman. Solution of a 20-variable 3-SAT problem on a DNA computer. Science, 296(5567):499–502, 2002.
[5] George M. Church, Yuan Gao, and Sriram Kosuri. Next-generation digital information storage in DNA. Science, 337(6102):1628, 2012.
[6] Catherine Taylor Clelland, Viviana Risca, and Carter Bancroft. Hiding messages in DNA microdots. Nature, 399(6736):533–534, 1999.
[7] Daniel G. Gibson, John I. Glass, Carole Lartigue, Vladimir N. Noskov, Ray-Yuan Chuang, Mikkel A. Algire, Gwynedd A. Benders, et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987):52–56, 2010.
[8] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, 2013.
[9] David A. Huffman. A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
[10] Richard J. Lipton. DNA solution of hard computational problems. Science, 268(5210):542–545, 1995.
[11] Jack Parker. Computing with DNA. EMBO reports, 4:7–10, 2003.
[12] Raniyah Wazirali, Zenon Chaczko, and Lucia Carrion. Bioinformatics with Genetic Steganography Technique. 2015.
[13] Siddhant Shrivastava and Rohan Badlani. Data Storage in DNA. International Journal of Electrical Energy, 2(2):119–124, 2014.