Information Sciences 180 (2010) 2196–2208
Data hiding methods based upon DNA sequences
H.J. Shiu a,d, K.L. Ng b, J.F. Fang c, R.C.T. Lee d, C.H. Huang e,*
a Department of System Development, ModioTek Co., Ltd., Science Park, HsinChu 300, Taiwan
b Department of Bioinformatics, Asia University, Wufeng, Taichung 413, Taiwan
c Department of Digital Content and Technology, National Taichung University, Taichung 403, Taiwan
d Department of Computer Science, National Chi Nan University, Puli, Nantou 545, Taiwan
e Department of Computer Science and Information Engineering, National Formosa University, Huwei, Yunlin 632, Taiwan
Article info
Article history: Received 8 March 2008 Received in revised form 19 October 2009 Accepted 30 January 2010
Keywords: DNA Data hiding Complementary pair Data recovery
Abstract
In this paper, three data hiding methods are proposed, based upon properties of DNA sequences. It is highlighted that DNA sequences possess some interesting properties which can be utilized to hide data. These three methods are: the Insertion Method, the Complementary Pair Method and the Substitution Method. For each method, a reference DNA sequence S is selected and the secret message M is incorporated into it so that S′ is obtained. S′ is then sent to the receiver and the receiver is able to identify and extract the message M hidden in S′. Furthermore, the robustness and the tightly embedded capacity analysis of the three proposed methods are demonstrated. Finally, experimental results indicate a better performance of the proposed methods compared to the performance of the competing methods with respect to several parameters such as capacity, payload and bpn.
© 2010 Elsevier Inc. All rights reserved.
1. Introduction
As access to various kinds of data through the Internet becomes more and more popular, important information must be concealed while being transmitted via the Internet so that only the authorized receiver can retrieve it. Thus, data hiding has become a new and important issue. Traditionally, data hiding approaches usually embed a secret message into host images [5,7,12,19,20]. However, this could distort the host image to some degree, and may therefore be detected and attacked by intruders.
In recent years, research work has been carried out on DNA-based data hiding schemes [6,8,15,17,18]. Most of them use the biological properties of DNA sequences. The data hiding method introduced in this paper does not make use of biological properties; instead, it uses other properties of DNA sequences which will be explained below. Firstly, however, some background knowledge should be introduced [1,14,21].
A DNA sequence is a sequence consisting of four letters: A, C, G and T. Each letter is related to a nucleotide. For instance, two DNA sequences appear as follows. The first is the DNA sequence from Litmus with 154 nucleotides retrieved from the European Bioinformatics Institute (EBI) [10]:
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAGATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACC.
The second is a segment of DNA sequence from Balsaminaceae with 2283 nucleotides:
TTTTTATTATTTTTTTTCATTTTTTTCTCAGTTTTTAGCACATATCATTACATTTTATTTTTTCATTACTTCTATCATTCTATCTATAAAATCGATTATTTTTATCACTTATTTTTCTAATTTCCAATATTTCATCTAATGATTATATTACATTAAAGAAATCG.
* Corresponding author. Tel.: +886 56315588; fax: +886 56330456. E-mail address: [email protected] (C.H. Huang).
0020-0255/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2010.01.030
H.J. Shiu et al. / Information Sciences 180 (2010) 2196–2208
In 2000, Leier et al. proposed a robust scheme using a special key sequence, called a primer, to decode an encrypted DNA sequence [15]. In their DNA-based encryption scheme, a public DNA sequence is used as a reference, which the receiver also knows. Thus, a selected primer and an encrypted sequence are sent to the receiver. Without the primers and the designated sequences it is not possible to correctly decode the binary data. A primer is a short complementary substring of a DNA sequence. Suppose the following DNA sequence:
ATGCTTAGTTCCATCGGAGACTAATGGCCTA and two primers: ATCAA and GATTAC which are the complementary substrings of TAGTT and CTAATG, respectively. The complementary rule is defined as A—T; T—A; C—G and G—C. In biology, there is a chemical mechanism to combine primers and a DNA sequence and then a fluorescent chemical substance can be used to indicate where the positions of the primers are. The fluorescence will cause the positions of hybridization of primers and substrings of DNA sequence to become bright. A bright section corresponds to the binary data ‘1’ and a dark section corresponds to ‘0’. The above case will result in the following hybridization:
ATGCT TAGTT CCATCGGAGA CTAATG GCCTA
      ATCAA            GATTAC
  0     1       0        1      0
This finally reveals the binary sequence "01010". A more complicated version of the proposed data hiding scheme only sends part of the primers to the receiver. There is another chemical scheme, known as the PCR (Polymerase Chain Reaction), which can be used to correctly recover primers. A robustness analysis was provided by the authors.
In [17], Peterson proposed a method to hide data in DNA sequences by encoding each character as three consecutive bases. For example, 'B' = CCA, 'E' = GGC, and so on. There are 64 symbols which can be encoded. However, the frequencies of the characters 'E' and 'I' appearing in an English text are quite high. Therefore, an attacker could use this property to crack the coded message.
Before the next proposed scheme from 2002 [18] is introduced, further background should be explained. The DNA sequence determines the arrangement of amino acids, which form a protein. Proteins are responsible for almost everything in the cells. Transcription is the process by which RNA, an intermediary copy of the instructions contained in DNA, is created. The four bases in RNA are: adenine (A), cytosine (C), uracil (U) and guanine (G). After the intervening sequences of RNA are discarded, the RNA copy (transcript) is referred to as mRNA (messenger RNA). On mRNA, a codon, comprising three nucleotides, indicates which amino acid will be attached next. As shown in Table 1, the distinct amino acids are: Phe, Leu, Ile, Val, Ser, Pro, Thr, Ala, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys, Trp, Arg, Met and Gly. A codon pairs with an anticodon, a group of three nucleotides on a tRNA molecule. tRNA can be treated as a medium to translate nucleic acid code into protein. There are about forty distinct tRNA molecules, and each one of them has a binding site for one of the amino acids, as shown in Table 1. The appropriate tRNA binds to the codon on the mRNA. Translation is completed when a STOP codon is encountered, and the protein is then released. Shimanovsky et al.
exploited the codon redundancy to hide data in mRNA [18]. Generally, an mRNA codon is composed of three nucleotides. The possible nucleotides are: 'U', 'C', 'A', and 'G'. Hence, there are 4^3 = 64 possible combinations to form an mRNA codon, while there are only twenty distinct amino acids shown in Table 1 encoded from the mRNA codon. It is obvious that some codons must be mapped to the same amino acid. For example, the codons 'CCU', 'CCC', 'CCA' and 'CCG' are mapped to the same amino acid Pro. The redundancy could be exploited to embed information in the mRNA codon. In their scheme, if the codon should be encoded with 'CCU', but the secret message is four, they use the codon 'CCG' to replace the original because 'CCG' is the fourth codon of the set of codons whose mapping amino acid is Pro. Although the replacement will not affect the transcription results, they modify the nucleotides of the original sequence, which might potentially cause unknown effects. As a result, a reversible hiding mechanism that can both conceal information in the DNA sequence and completely restore the original one is required.

Table 1
The mapping of codon to amino acid [18].
UUU → Phe    UUC → Phe    UCU → Ser    UCC → Ser    UAU → Tyr    UAC → Tyr    UGU → Cys    UGC → Cys
UUA → Leu    UUG → Leu    UCA → Ser    UCG → Ser    UAA → Stop   UAG → Stop   UGA → Stop   UGG → Trp
CUU → Leu    CUC → Leu    CCU → Pro    CCC → Pro    CAU → His    CAC → His    CGU → Arg    CGC → Arg
CUA → Leu    CUG → Leu    CCA → Pro    CCG → Pro    CAA → Gln    CAG → Gln    CGA → Arg    CGG → Arg
AUU → Ile    AUC → Ile    ACU → Thr    ACC → Thr    AAU → Asn    AAC → Asn    AGU → Ser    AGC → Ser
AUA → Ile    AUG → Start  ACA → Thr    ACG → Thr    AAA → Lys    AAG → Lys    AGA → Arg    AGG → Arg
GUU → Val    GUC → Val    GCU → Ala    GCC → Ala    GAU → Asp    GAC → Asp    GGU → Gly    GGC → Gly
GUA → Val    GUG → Val    GCA → Ala    GCG → Ala    GAA → Glu    GAG → Glu    GGA → Gly    GGG → Gly

Although the above preliminary works use different biological properties to hide data in DNA, it is neither economical nor efficient to implement the proposed schemes. Recently, Chang et al. proposed two schemes to hide data in DNA sequences based upon a software view [6]. The first proposed scheme is a lossless compression-based information hiding scheme. The scheme begins by compressing the decimal formatted DNA sequence using the lossless compression method. Next, the secret message is appended to the end of the compression result to form a bit stream. Then, the scheme adds a 16-bit header before the bit stream, which is used to record the size of the compression result. Finally, the bit stream is converted back to nucleotides. The second proposed hiding scheme adopts the difference expansion technique to conceal a secret bit in two neighboring words. Also, a location map is used to indicate whether a pair is expandable to hide data or not. The scheme concatenates the compressed bit stream of the location map, the collected LSBs (which are used to reconstruct the original words) and the secret message to form a bit stream. In the end, the bit stream is converted into a new DNA sequence. A capacity analysis was provided by the authors.
From real DNA sequences, it is easy to discover one special property of a DNA sequence. That is, there is almost no difference between a real DNA sequence and a fake one. This is a property which shall be exploited in this study. Another useful element is the fact that there is a large number of DNA sequences publicly available on various web-sites such as the EBI database. A rough estimation would put the number of DNA sequences publicly available at around 163 million (EMBL nucleotide sequences database release 101) [10].
The above two facts enabled three DNA-based data hiding methods to be designed. All these methods secretly select a reference sequence S from some publicly available DNA sequence databases. Only the sender and the receiver are aware of this reference sequence. The sender transforms this selected DNA sequence S into a new sequence S′ by incorporating the secret message M into the DNA sequence S.
This transformed sequence S′ is sent by the sender to the receiver. The receiver then examines the received sequence, identifies S′ and recovers the secret message M, changing S′ back to the reference sequence S. Three methods will be introduced in the following sections.
For all of these methods, it is assumed that the sender and the receiver share two schemes which are kept secret. The first one is a binary coding rule, which transforms the letters A, C, G and T into binary codes and vice versa. For instance, the following may be used as a binary coding: ((A:00)(C:01)(G:10)(T:11)). It should be noted that more digits may be used. The second scheme is a complementary rule. That is, each letter x is assigned a complement, denoted as C(x). The following may be such a rule: ((AC)(CG)(GT)(TA)), where C(A) = C, C(C) = G, C(G) = T and C(T) = A. To hide various kinds of data, it is convenient to assume that the secret message M is a binary sequence. Throughout this paper, |S| is the length of sequence S.
The paper is organized as follows. In Sections 2–4, three DNA-based data hiding methods are proposed. As stated previously, the first method is called the Insertion Method. The main idea of Method 1 is to break the secret message and the reference DNA sequence into segments, before assembling the segments one by one from the secret message and the reference sequence. Method 2, the Complementary Pair Method, is discussed in Section 3. In this method, the secret message is inserted before pairs of complementary substrings. Method 3, the Substitution Method, is proposed in Section 4. Each base pair in the reference DNA sequence S is changed according to certain conditions to hide the secret message. A number of experiments and comparisons, including current methods and preliminary related works, are outlined in Sections 5 and 6. Finally, conclusions are given in Section 7. The robustness analysis for each method is discussed in Sections 2–4.
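The two shared rules described in this section can be sketched in a few lines of Python. This is only an illustration using the example rules quoted above; the function and variable names are assumptions made for the sketch and do not come from the paper.

```python
# Example rules quoted in the text; in practice both are shared secrets.
BINARY_CODING = {"A": "00", "C": "01", "G": "10", "T": "11"}
COMPLEMENT = {"A": "C", "C": "G", "G": "T", "T": "A"}

# Inverse of the binary coding rule (2-bit code -> letter).
BINARY_DECODING = {bits: base for base, bits in BINARY_CODING.items()}


def dna_to_bits(dna: str) -> str:
    """Apply the binary coding rule letter by letter."""
    return "".join(BINARY_CODING[base] for base in dna)


def bits_to_dna(bits: str) -> str:
    """Apply the inverse of the binary coding rule to 2-bit groups."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be a multiple of 2")
    return "".join(BINARY_DECODING[bits[i:i + 2]] for i in range(0, len(bits), 2))


def complement(dna: str) -> str:
    """Replace every letter x by its complement C(x)."""
    return "".join(COMPLEMENT[base] for base in dna)


print(dna_to_bits("ACGT"))   # 00011011
print(bits_to_dna("1100"))   # TA
print(complement("AATGC"))   # CCATG
```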
2. Method 1: the Insertion Method
To simplify the discussion, the most basic version is outlined first and a simple example is given. The more complicated version of the method will be presented after the basic one is explained. All of the methods use a reference sequence, S. Suppose the secret message M is 01001100. Let S be ACGGTTCCAATGC. The method works as follows:
Step 1. Code S into a binary sequence by using the binary coding rule. Thus the sequence S will now become 00011010111101010000111001.
Step 2. Divide S into segments, whereby each segment contains k bits. Suppose k is 3. Then there are the following segments: 000, 110, 101, 111, 010, 100, 001, 110, 01.
Step 3. Insert bits from M, one at a time, into the beginning of segments of S. The result is as follows: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110, 01. Those segments without any secret message inserted should be ignored. Thus, there are the following segments: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110. Concatenating the above segments results in the following binary sequence: 00001110010101111010110000010110.
Step 4. Use the inverse function of the binary coding rule to produce the following faked DNA sequence: S′ = AATGCCCTGGTAACCG. As the reader can see, this sequence is quite different from S.
Step 5. Send the above sequence S′ to the receiver.
The above procedure is the data hiding process. The data recovery process simply reverses the data hiding process. For every received sequence, the receiver extracts a subsequence out of it, based upon the mechanism introduced below. If the extracted subsequence is not a prefix of the reference sequence S, it is ignored. If it is, the receiver knows that the secret message M has also been successfully extracted as a by-product. The recovery process is given as follows:
Step 1. Code the received S′ into a binary sequence by using the binary coding rule. Thus the sequence S′ will now become: 00001110010101111010110000010110.
Step 2. Divide the binary sequence into 4-bit segments and extract the first bit from each segment. The result is as follows: 0, 1, 0, 0, 1, 1, 0, 0. Extract the last three bits from each 4-bit segment. The result is as follows: 000, 110, 101, 111, 010, 100, 001, 110.
Step 3. Concatenate the extracted bits and the remaining segments, which results in the following two binary sequences: 01001100 and 000110101111010100001110.
The inverse function of the binary coding rule is then applied to the second binary sequence of Step 3, 000110101111010100001110, which is transformed into ACGGTTCCAATG. This DNA sequence is a prefix of S and it is recovered back to the reference sequence according to S. Thus the extracted binary sequence 01001100 is the desired secret message.
The above is the basic version of the method. In a more complicated version, S is divided into many segments using a random number generator. That is, the lengths of the segments are no longer the same. Instead, they are determined by random number seeds, which are known only to the sender and the receiver. Suppose the number sequence generated by the random number seed k is 6, 3, 2, 4. Then S is divided into segments with lengths 6, 3, 2 and 4, respectively. Note that there is also a secret message M. Similarly, the same random number generator may also be used with a different random number seed r to divide M into segments. Algorithm 1-1 shows the formal hiding algorithm for Method 1.

Algorithm 1-1. Data hiding algorithm of Method 1 (Insertion Method)
Input: A reference DNA sequence S, random number seeds k and r, a secret binary message M and a binary coding rule to code base pairs 'A', 'C', 'G' and 'T' into binary digits
Output: A faked DNA sequence S′ with the secret message M hidden
Step 1. Code S into a binary sequence $S_1$ by using the binary coding rule
Step 2. Generate the number sequence $r_1, r_2, \ldots, r_p, \ldots$ by random number seed r. Find the smallest integer t such that $\sum_{i=1}^{t} r_i > |M|$. Sequentially divide the secret message M into segments with lengths $r_1, r_2, \ldots, r_{t-1}$ in order, denote these segments by $m_1, m_2, \ldots, m_{t-1}$, and let the residual part be $m_t$
Step 3. Generate the number sequence $k_1, k_2, \ldots, k_{t-1}, \ldots$ by random number seed k. Sequentially divide $S_1$ into segments with lengths $k_1, k_2, \ldots, k_{t-1}$ in order and truncate the residual part of $S_1$. Denote these segments by $s_1, s_2, \ldots, s_{t-1}$
Step 4. Insert each $m_i$, $1 \le i \le t-1$, of M before $s_i$ of $S_1$ to produce a new binary sequence $S_2$. Finally, append $m_t$ at the end of $s_{t-1}$ to generate $S_3$
Step 5. Transform sequence $S_3$ into a faked DNA sequence S′ according to the inverse function of the binary coding rule
Step 6. Send the above sequence S′ together with other irrelevant DNA sequences to the receiver
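The hiding steps of Algorithm 1-1 can be sketched as follows. This is a minimal sketch, not the authors' implementation: the segment lengths are passed in as plain lists, `lengths_from_seed` is a hypothetical stand-in for the shared random number generator (the paper does not specify one), and the sketch simply raises an error when the resulting bit string has odd length, a case the text does not discuss.

```python
import random

CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}   # example binary coding rule
DECODE = {bits: base for base, bits in CODE.items()}


def lengths_from_seed(seed, count, max_len=6):
    """Hypothetical generator: `count` segment lengths in 1..max_len from a seed."""
    rng = random.Random(seed)
    return [rng.randint(1, max_len) for _ in range(count)]


def hide(S, M, r_lens, k_lens):
    """Insert the segments of M before the segments of S (as bits)."""
    s1 = "".join(CODE[base] for base in S)                 # Step 1
    # Step 2: smallest t such that r_1 + ... + r_t > |M|
    t, total = 0, 0
    while total <= len(M):
        total += r_lens[t]
        t += 1
    pieces, m_pos, s_pos = [], 0, 0
    for i in range(t - 1):                                 # Steps 2-4
        m_i = M[m_pos:m_pos + r_lens[i]]
        s_i = s1[s_pos:s_pos + k_lens[i]]
        if len(s_i) < k_lens[i]:
            raise ValueError("reference sequence too short for this message")
        pieces.append(m_i + s_i)                           # m_i goes before s_i
        m_pos += r_lens[i]
        s_pos += k_lens[i]
    pieces.append(M[m_pos:])                               # residual part m_t
    s3 = "".join(pieces)
    if len(s3) % 2:
        raise ValueError("odd number of bits; cannot be mapped to letters")
    # Step 5: inverse of the binary coding rule
    return "".join(DECODE[s3[i:i + 2]] for i in range(0, len(s3), 2))


# Basic version from the text: one message bit before every 3-bit segment.
print(hide("ACGGTTCCAATGC", "01001100", [1] * 9, [3] * 8))  # AATGCCCTGGTAACCG
```

For the randomized version, `r_lens` and `k_lens` would be produced by the shared generator from the seeds r and k, for example with `lengths_from_seed`.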
Notice that the sender sends S′ together with many other DNA, or DNA-like, sequences to the receiver. The receiver processes every sequence received, extracts the message sequence and recovers the original sequence. If the recovered sequence is not a prefix of the reference sequence S, the receiver tests other received sequences until the recovered sequence is exactly a prefix of S. Then the receiver knows that the secret message has been extracted. The receiver uses the following algorithm to recover the hidden message:

Algorithm 1-2. Data recovery algorithm of Method 1 (Insertion Method)
Input: A set of DNA sequences, random number seeds k and r, a reference DNA sequence S and the same binary coding rule used by the sender
Output: The hidden secret message M
Step 1. Generate two number sequences $k_1, k_2, \ldots, k_m, \ldots$ and $r_1, r_2, \ldots, r_m, \ldots$ by using the random number seeds k and r, respectively
Step 2. Choose a sequence S′ of the input sequence set and code it into a binary sequence $S_1$. Find the largest integer p such that $\sum_{i=1}^{p} (r_i + k_i) \le |S_1|$, and divide $S_1$ into binary segments with lengths $r_1 + k_1, r_2 + k_2, \ldots, r_p + k_p$. The remaining part of $S_1$ is denoted as $m_{p+1}$
Step 3. For each segment i, $1 \le i \le p$, of $S_1$, extract the first $r_i$ bits, called $m_i$
Step 4. For each segment i, $1 \le i \le p$, of $S_1$, extract the last $k_i$ bits, called $s_i$
Step 5. Concatenate all $s_i$'s, $1 \le i \le p$, to be $S_2$. Derive the corresponding DNA sequence $S_3$ by applying the inverse function of the binary coding rule to $S_2$. If $S_3$ is not a prefix of S, then go back to Step 2
Step 6. Concatenate all $m_i$'s, $1 \le i \le p+1$, to be M and recover $S_3$ back to the reference sequence according to the reference DNA sequence S
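The recovery steps of Algorithm 1-2 can be sketched in the same style. This is an illustrative sketch only: the segment lengths are again passed in as lists standing in for the outputs of the shared generator, and the function names are assumptions. The example replays the basic version (one message bit before every 3-bit segment) with one decoy sequence in the received set.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}   # example binary coding rule
DECODE = {bits: base for base, bits in CODE.items()}


def recover(candidate, S, r_lens, k_lens):
    """Return the hidden bits if `candidate` decodes to a prefix of S, else None."""
    s1 = "".join(CODE[base] for base in candidate)
    pos, m_parts, s_parts = 0, [], []
    for r, k in zip(r_lens, k_lens):           # Step 2: whole (r_i + k_i)-bit segments
        if pos + r + k > len(s1):
            break
        m_parts.append(s1[pos:pos + r])          # Step 3: first r_i bits
        s_parts.append(s1[pos + r:pos + r + k])  # Step 4: last k_i bits
        pos += r + k
    m_parts.append(s1[pos:])                     # remaining part m_{p+1}
    s2 = "".join(s_parts)
    if not s2 or len(s2) % 2:
        return None
    s3 = "".join(DECODE[s2[i:i + 2]] for i in range(0, len(s2), 2))  # Step 5
    if not S.startswith(s3):
        return None                              # not the hidden carrier
    return "".join(m_parts)                      # Step 6


def recover_from_set(candidates, S, r_lens, k_lens):
    """Try every received sequence until one decodes to a prefix of S."""
    for candidate in candidates:
        message = recover(candidate, S, r_lens, k_lens)
        if message is not None:
            return message
    return None


received = ["TTTTTTTTTTTTTTTT", "AATGCCCTGGTAACCG"]
print(recover_from_set(received, "ACGGTTCCAATGC", [1] * 9, [3] * 9))  # 01001100
```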
In order for an intruder to retrieve the secret message, they must achieve the following. Firstly, the reference sequence must be identified. There are roughly 163 million DNA sequences available publicly. Thus, the probability of an attacker's success is $\frac{1}{1.63 \times 10^{8}}$. Secondly, the random number generator and the two seeds are required. Thirdly, the intruder has to know the binary coding rule.
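Fig. 1. The relationship between m (s) and the $r_i$'s ($k_i$'s). [Figure: a binary sequence drawn as blocks labelled $k_1, r_1, k_2, r_2, \ldots, k_p, r_p$ and $m_{p+1}$; the $k_i$ blocks make up a prefix of S with size s, and the $r_i$ blocks together with $m_{p+1}$ make up the secret message with size m.]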
For the second situation: suppose a binary sequence $S_1$ of size n is handled during the data recovery stage. $S_1$ is composed of the secret message M and a prefix of the reference sequence S. The sizes of M and of the prefix of S are defined to be m and s, respectively. It is difficult for an attacker to know m and s. It can be imagined that an attacker could guess the sizes m and s first. It is known that $m + s = n$ with $m, s, n \ge 1$, so there will be
$$\binom{2 + (n-2) - 1}{n-2} = \binom{n-1}{n-2} = n - 1$$
possibilities here. For instance, assuming $n = 10$, there will be
$$\binom{2 + (10-2) - 1}{10-2} = \binom{9}{8} = 9$$
possibilities as follows:
m = 1, s = 9
m = 2, s = 8
m = 3, s = 7
m = 4, s = 6
m = 5, s = 5
m = 6, s = 4
m = 7, s = 3
m = 8, s = 2
m = 9, s = 1
The probability of an attacker successfully guessing m and s is $\frac{1}{n-1}$. However, this is not enough for the attacker to recover the data. The problem is that the attacker does not know the number sequences generated by random number seeds r and k, denoted as $r_1, r_2, \ldots, r_p$ and $k_1, k_2, \ldots, k_p$, respectively, which are used to break the secret message and the reference sequence S. The summation of the $r_i$'s and $m_{p+1}$ is equal to m, and the summation of the $k_i$'s is equal to s. Fig. 1 indicates the relationship between m (s) and the $r_i$'s ($k_i$'s). Notice that it is hard for an attacker to know how many segments there are. Thus, they will need to try two segments, three segments, then four segments and so on. Let the length of the residual part in Step 2 of Algorithm 1-1 be q. Then, there may be the following cases:
$$\begin{aligned}
r_1 + q &= m, & r_1 &\ge 1,\ q \ge 0\\
r_1 + r_2 + q &= m, & r_1, r_2 &\ge 1,\ q \ge 0\\
r_1 + r_2 + r_3 + q &= m, & r_1, r_2, r_3 &\ge 1,\ q \ge 0\\
r_1 + r_2 + r_3 + r_4 + q &= m, & r_1, r_2, r_3, r_4 &\ge 1,\ q \ge 0\\
&\ \ \vdots\\
r_1 + r_2 + r_3 + \cdots + r_m + q &= m, & r_1, r_2, r_3, \ldots, r_m &\ge 1,\ q \ge 0
\end{aligned}$$
The number of solutions to the first formula is $\binom{2+(m-1)-1}{m-1} = \binom{m}{m-1}$. The number of solutions to the second one is $\binom{3+(m-2)-1}{m-2} = \binom{m}{m-2}$. The number of solutions to the third one is $\binom{4+(m-3)-1}{m-3} = \binom{m}{m-3}$. The number of solutions to the fourth one is $\binom{5+(m-4)-1}{m-4} = \binom{m}{m-4}$. The number of solutions to the last one is $\binom{(m+1)+(m-m)-1}{m-m} = \binom{m}{0} = 1$. The above numbers could be summarized as the following formula:
$$\binom{m}{m-1} + \binom{m}{m-2} + \binom{m}{m-3} + \cdots + \binom{m}{0} = \sum_{i=0}^{m-1} \binom{m}{m-1-i} = 2^{m} - 1.$$
Thus, the probability of an attacker making a successful guess at this stage is $\frac{1}{2^{m}-1}$. Similarly, the probability of an attacker making a successful guess for s at this stage is $\frac{1}{2^{s}-1}$.
For the third situation: the number of the binary coding rules is 4! = 24. The probability of an attacker making a successful guess at this stage is $\frac{1}{24}$.
Finally, the probability of an attacker making a successful guess at Method 1 is given by the following Lemma:
Lemma 1. The probability of an attacker making a successful guess at the Insertion Method is
$$\frac{1}{1.63 \times 10^{8}} \times \frac{1}{n-1} \times \frac{1}{2^{m}-1} \times \frac{1}{2^{s}-1} \times \frac{1}{24}.$$
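The counting argument behind Lemma 1 can be checked numerically. The sketch below compares the binomial sum with a brute-force enumeration of the ways to split a message of m bits into one or more segments plus a residual part, and then evaluates the product in Lemma 1 as an exact fraction. The function names and the default of 163 million reference sequences as a parameter are choices made for this illustration.

```python
from fractions import Fraction
from math import comb


def count_by_formula(m):
    """Sum of C(m, m-j) over j = 1..m segments, as in the text."""
    return sum(comb(m, m - j) for j in range(1, m + 1))


def count_by_enumeration(m):
    """Count tuples (r_1, ..., r_j, q) with j >= 1, r_i >= 1, q >= 0 summing to m."""
    def rec(remaining):
        total = 0
        for first in range(1, remaining + 1):
            # Either stop here (the rest is q) or add further segments.
            total += 1 + rec(remaining - first)
        return total
    return rec(m)


def lemma1_probability(n, m, s, n_references=163_000_000):
    """Product from Lemma 1 for a received bit string of size n = m + s."""
    if m + s != n or m < 1 or s < 1:
        raise ValueError("expected m + s = n with m, s >= 1")
    return (Fraction(1, n_references) * Fraction(1, n - 1)
            * Fraction(1, 2 ** m - 1) * Fraction(1, 2 ** s - 1) * Fraction(1, 24))


print(count_by_formula(5), count_by_enumeration(5))  # 31 31
print(lemma1_probability(10, 4, 6))
```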
3. Method 2: the Complementary Pair Method
This section illustrates Method 2, the Complementary Pair Method. The detailed meaning of the base pairs can be found in a molecular biology textbook [1,14,21]. For this method, complementary pairs will be defined. That is, a unique counterpart is assigned for each base pair. For instance, the following complementary rule may apply:
((AC)(CG)(GT)(TA))
as discussed in Section 1. Then the complementary string of AATGC will be CCATG. For instance, in the sequence "ATCTGAATGCTTGTCTACCATGTCAAT", there is a pair of complementary substrings of length five, namely AATGC and CCATG. To find the longest complementary substrings, the dynamic programming approach may be used [2,13].
Let it be assumed that there is the secret message M which has an even number of bits. Note that it is always reasonable to assume so because an even number of bits may always be used to code. To give an example, assume that M = 0110. Again, as in Method 1, assume that there is a reference DNA sequence S = ACGGTTCCAATGC. It is easy to see that the longest complementary substring pair in S is (TT AA). Thus, the length of the longest complementary substrings in S is two. The method must therefore insert complementary substring pairs with a length of three into S to ensure that the longest complementary pairs of substrings are the newly inserted ones. Method 2 works as follows:
Step 1. Divide M into segments such that each segment contains an even number of bits. In this case there are 01 and 10. By using the binary coding rule given in the previous section, 01 and 10 will be coded as C and G, respectively. Suppose $M = m_1, m_2, \ldots, m_p$. Therefore, in this case, $m_1$ = C and $m_2$ = G.
Step 2. Artificially generate two complementary string pairs with a length of three, padded with the character 'T' before and after each string, and insert them one by one into S without overlapping. Assume the following two complementary string pairs are used: (AGC CTG) and (CCT GGA). The character 'T' is padded in the hope that (AGC CTG) and (CCT GGA) will remain the longest complementary pairs in the resulting string. After padding a character 'T' before and after them, the four substrings are TAGCT, TCTGT, TCCTT and TGGAT. The sequence S may now become $S_1$ = ACGTAGCTGTTCTGTTCTCCTTCATGGATATGC.
Step 3. For each pair of the longest complementary substrings $a_g$ and $a'_g$ in $S_1$, insert $m_g$ immediately before $T a_g T$. Thus, $S_1$ will be a new sequence, called S′:
S′ = ACGCTAGCTGTTCTGTTCGTCCTTCATGGATATGC.
Step 4. Check the longest complementary substrings of S′. If they are not (AGC CTG) and (CCT GGA), go back to Step 2. Otherwise, send the above sequence S′ to the receiver.
Note that the character 'T' in Step 2 can be replaced by any other nucleotide. However, even with the padding of 'T', it still cannot be guaranteed that the newly inserted complementary pairs will not produce other, even longer complementary pairs in S′. That is why there is Step 4, in which it is possible to check whether the newly inserted substrings are indeed the longest complementary substring pairs.
The sender may send S′ together with many other DNA, or DNA-like, sequences to the receiver. The receiver processes every sequence received, finds all the longest complementary substrings, extracts the secret message and tries to recover the original sequence. If the recovered sequence is equal to the reference sequence S, the secret message is correctly extracted. The recovery process is as follows:
Step 1. For the next DNA sequence S′ in the set, use the dynamic programming strategy to discover the longest complementary substring. If the substring is not of the correct length, go back to Step 1. In the above case, S′ = ACGCTAGCTGTTCTGTTCGTCCTTCATGGATATGC, the longest complementary substrings are (AGC CTG) and (CCT GGA) starting at positions (6, 13) and (21, 28):
ACGCT[AGC]TGTT[CTG]TTCGT[CCT]TCAT[GGA]TATGC
Step 2. For the pairs (AGC CTG) and (CCT GGA), extract the letter before each of TAGCT and TCCTT. In the above case, they are 'C' and 'G':
ACG[C]TAGCTGTTCTGTTC[G]TCCTTCATGGATATGC
Step 3. After extracting the secret letters, delete the above two pairs of substrings (AGC CTG) and (CCT GGA) and the padding character 'T' before and after them in S′. The resulting sequence is ACGGTTCCAATGC.
Step 4. The sequence ACGGTTCCAATGC is equal to the reference sequence S. If it is not, go back to Step 1.
Step 5. Concatenate 'C' and 'G' to be a sequence CG and apply the inverse function of the binary coding rule to CG. The secret message is 0110.
In Algorithm 2-1, the formal hiding algorithm for Method 2 is presented.

Algorithm 2-1. Data hiding algorithm of Method 2 (Complementary Pair Method)
Input: A reference DNA sequence S, a secret binary message M with an even length, a binary coding rule to code base pairs 'A', 'C', 'G' and 'T' into binary digits and a complementary rule
Output: A faked DNA sequence S′ with the secret message M hidden
Step 1. Let the length of the longest complementary substring in S be k. Divide M into $\frac{|M|}{2}$ segments with the same size, and denote $p = \frac{|M|}{2}$. Code these segments to be base pairs by using the binary coding rule. Let $M = m_1, m_2, \ldots, m_p$
Step 2. Generate a set $A = \{a_1 a'_1, a_2 a'_2, \ldots, a_p a'_p\}$, where each $(a_i\ a'_i)$, $1 \le i \le p$, is a complementary string pair with length $k + 1$
Step 3. For each $a_g$ and $a'_g$, $1 \le g \le p$, pad them with the character 'T' such that $a_g$ and $a'_g$ become $T a_g T$ and $T a'_g T$. Insert each string into S one by one without overlapping. Call the resulting sequence $S_1$
Step 4. For each pair of complementary substrings $a_g$ and $a'_g$ in $S_1$, insert $m_g$ before $T a_g T$. Thus, $S_1$ will be a new sequence, called S′
Step 5. If the longest complementary substrings in S′ are not the same as set A, then go back to Step 2; otherwise, send S′ to the receiver, amid many other irrelevant sequences
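Steps 1 to 4 of Algorithm 2-1 can be sketched as follows. This is a rough illustration under several assumptions: the complementary pairs and the insertion offsets are passed in by the caller (the paper does not say how they are chosen, so `pairs` and `positions` are hypothetical parameters), a pair is taken to mean two non-overlapping substrings x and y with y = C(x), and the Step 5 re-check is left out. The length function ignores the 'T' padding, so it measures any complementary pair in a string, not only inserted ones.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}      # example binary coding rule
DECODE = {bits: base for base, bits in CODE.items()}
COMPLEMENT = {"A": "C", "C": "G", "G": "T", "T": "A"}    # example complementary rule


def longest_pair_length(s):
    """Length of the longest non-overlapping pair x, y in s with y = C(x)."""
    n, best = len(s), 0
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if i != j and COMPLEMENT[s[i - 1]] == s[j - 1]:
                cur[j] = prev[j - 1] + 1          # run of matches ending at (i, j)
                # Two substrings ending at i and j overlap unless length <= |i - j|.
                best = max(best, min(cur[j], abs(i - j)))
        prev = cur
    return best


def hide(S, M, pairs, positions):
    """pairs: [(a_g, a_g'), ...]; positions: increasing 0-based offsets in S,
    two per pair (where T a_g T and T a_g' T are inserted)."""
    letters = [DECODE[M[i:i + 2]] for i in range(0, len(M), 2)]   # Step 1
    inserts = []
    for (a, a_prime), m_g in zip(pairs, letters):
        if "".join(COMPLEMENT[c] for c in a) != a_prime:
            raise ValueError("not a complementary pair")
        inserts.append(m_g + "T" + a + "T")       # Steps 3-4: m_g right before T a_g T
        inserts.append("T" + a_prime + "T")
    out, last = [], 0
    for pos, ins in zip(positions, inserts):
        out.append(S[last:pos])
        out.append(ins)
        last = pos
    out.append(S[last:])
    return "".join(out)


S = "ACGGTTCCAATGC"
print(longest_pair_length(S))                                     # 2
print(hide(S, "0110", [("AGC", "CTG"), ("CCT", "GGA")], [3, 5, 7, 9]))
# ACGCTAGCTGTTCTGTTCGTCCTTCATGGATATGC
```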
S′, together with many other DNA, or DNA-like, sequences is sent to the receiver. The receiver is able to identify the particular sequence with M hidden in it and ignore all of the other sequences (see Algorithm 2-2).

Algorithm 2-2. Data recovery algorithm of Method 2 (Complementary Pair Method)
Input: A set of DNA sequences, a reference DNA sequence S, the length k of the longest complementary substrings in S, a binary coding rule and a complementary rule used in Algorithm 2-1
Output: The hidden secret message M
Step 1. For the next DNA sequence S′ in the set, use the dynamic programming strategy to discover the longest complementary substring. If the substring is not of the correct length $k + 1$, go back to Step 1
Step 2. For each pair of the longest complementary substrings $a_g$ and $a'_g$, extract $n_g$, which is a character before $T a_g T$
Step 3. Concatenate all segments $n_g$'s to generate N, then delete them from S′
Step 4. Delete all the longest complementary substrings and the padding character 'T', that is, $T a_g T$ and $T a'_g T$, in S′. Let the resulting sequence be $S_1$
Step 5. If $S_1$ is not equal to S, then go back to Step 1
Step 6. Derive the secret message M by applying the inverse function of the binary coding rule to N
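Steps 2 to 6 of Algorithm 2-2 can be sketched as below. The sketch assumes that Step 1 has already produced the list of longest complementary pairs, so `pairs` is a hypothetical input rather than something the code discovers, and it locates each padded string by its first occurrence. It is meant only to replay the worked example.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}   # example binary coding rule


def recover(candidate, S, pairs):
    """Return the hidden bits if removing the padded pairs from `candidate`
    gives back the reference S, else None."""
    letters, work = [], candidate
    for a, a_prime in pairs:
        idx = work.find("T" + a + "T")
        if idx < 1:                       # need one letter in front of T a_g T
            return None
        letters.append(work[idx - 1])     # Step 2: the secret letter n_g
        # Steps 3-4: drop n_g together with T a_g T
        work = work[:idx - 1] + work[idx + len(a) + 2:]
        idx2 = work.find("T" + a_prime + "T")
        if idx2 < 0:
            return None
        work = work[:idx2] + work[idx2 + len(a_prime) + 2:]
    if work != S:                         # Step 5
        return None
    return "".join(CODE[c] for c in letters)   # Step 6: letters back to bits


s_prime = "ACGCTAGCTGTTCTGTTCGTCCTTCATGGATATGC"
print(recover(s_prime, "ACGGTTCCAATGC", [("AGC", "CTG"), ("CCT", "GGA")]))  # 0110
```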
In order for an intruder to discover the secret message, he will need the following information: (1) the reference DNA sequence S; (2) the complementary rule; (3) the binary coding rule; and (4) how to find the longest complementary substrings in a string.
For (1): There are roughly 163 million DNA sequences available publicly. Thus, the probability of an attacker making a successful guess is $\frac{1}{1.63 \times 10^{8}}$.
For (2): There are four possibilities of the complementary letter for each letter of DNA sequences. The total number of possible complementary rules is 4 × 3 × 2 × 1 = 24, hence the probability of making a correct guess is $\frac{1}{24}$.
For (3): The number of the binary coding rules is 4 × 3 × 2 × 1 = 24. The probability of an attacker making a successful guess at this stage is $\frac{1}{24}$.
Finally, the probability of an attacker making a successful guess at Method 2 is given by the following Lemma:
Lemma 2. The probability of an attacker making a successful guess at the Complementary Pair Method is $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{24^{2}}$, in addition to the cost for retrieving the longest complementary pairs of substrings in a string.
Instead of a complementary pair of substrings, palindromes [2–4,9,11,16] may also be used in Method 2. A palindrome is a string of the form aa′ where a′ is the reverse of a. For instance, AACGTTGCAA is a palindrome where a = AACGT. There are several methods proposed to find the longest palindrome in a string [2–4,9,11].
The approach using palindromes is quite similar to that using complementary pairs. Let $a = a_1 a_2 \ldots a_h$ be a substring, $a'_i = C(a_i)$, $1 \le i \le h$, and $a' = a'_h a'_{h-1} \ldots a'_1$. Then $aa'$ is a complementary palindrome. For example, assume the complementary rule is ((AC)(CG)(GT)(TA)). Then ACCTAGGC is a complementary palindrome. Method 2 may use complementary palindromes because methods to find palindromes can be easily extended to find the complementary palindromes.
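The two definitions can be checked on the examples from the text with a short sketch. The function names are illustrative, and the check covers only even-length strings of the form aa′ as defined above; it does not search for the longest palindrome in a string.

```python
COMPLEMENT = {"A": "C", "C": "G", "G": "T", "T": "A"}   # example complementary rule


def is_palindrome(s):
    """True if s = a a' where a' is the reverse of a."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[half:] == s[:half][::-1]


def is_complementary_palindrome(s):
    """True if s = a a' where a' is the reversed, complemented copy of a."""
    half, rem = divmod(len(s), 2)
    if rem:
        return False
    a_prime = "".join(COMPLEMENT[c] for c in reversed(s[:half]))
    return s[half:] == a_prime


print(is_palindrome("AACGTTGCAA"))                # True
print(is_complementary_palindrome("ACCTAGGC"))    # True
print(is_complementary_palindrome("AACGTTGCAA"))  # False
```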
4. Method 3: the Substitution Method

This method also uses a reference DNA sequence S. Let it be assumed that S = ACGGAATTGCTTCAG and the secret message $M = m_1, m_2, \ldots, m_p$ is 0111010. The length of S is 15, which is longer than p, the length of M, which is 7 in this case. It is stipulated that the complementary rule used in Method 3 must satisfy the following property: for each letter x of a DNA sequence, x, C(x), C(C(x)) and C(C(C(x))) are all different, where C(x) is the complement of x. This property guarantees that the complementary rule is an injective mapping. For instance, the following complementary rule may apply:

(AT)(CA)(GC)(TG).

Let the reference DNA sequence be $S = s_1, s_2, s_3, \ldots, s_m$ and the secret message be $M = m_1, m_2, \ldots, m_p$. The method works as follows:

Step 1. Suppose the length of the reference sequence S is 15. Select p distinct numbers randomly from 1 to 15. Assume that p = 7 and these selected p numbers are 2, 3, 5, 10, 12, 13 and 15; then let $A = \{A_1, A_2, \ldots, A_p\} = \{2, 3, 5, 10, 12, 13, 15\}$.
Step 2. Transform S into $S'$ by the following rule. For every integer i from 1 to 15:
    if i is equal to some $A_j$ and $m_j$ is 1, $1 \le j \le p$, set $s_i$ to be $C(s_i)$;
    else if i is equal to some $A_j$ and $m_j$ is 0, do not change $s_i$;
    else if i is not equal to any $A_j$, set $s_i$ to be $C(C(s_i))$.
    Thus $S'$ = GCCATGCCAACTAGG.
Step 3. Send $S'$ to the receiver.

The receiver knows the reference sequence S and the complementary rule. Let the ith characters of $S'$ and S be denoted as $s'_i$ and $s_i$, respectively. The recovery process is as follows:

Step 1. Initialize i and j to be 1.
Step 2. For i from 1 to 15:
    if $s'_i$ is the same as $s_i$, then set $m_j = 0$ and $j = j + 1$;
    else if $s'_i$ is the same as $C(s_i)$, then set $m_j = 1$ and $j = j + 1$.
Step 3. Concatenate all $m_k$'s, $1 \le k \le j - 1$, to be M. M is the secret message. Set all $s'_i$'s to be $s_i$'s to recover $S'$ back to the reference sequence S.
Method 3 has the most economic space utilization. Besides, the receiver would not need to know set A in advance. In Algorithm 3-1, the formal hiding algorithm for Method 3 is presented.

Algorithm 3-1. Data hiding algorithm for Method 3 (Substitution Method)
Input: A sequence $S_1 = s_1, s_2, s_3, \ldots, s_n$, which is a prefix of the reference DNA sequence S, a secret binary message $M = m_1, m_2, \ldots, m_p$, where $p \le n$, and a complementary rule
Output: A faked DNA sequence $S'$ with the secret message M hidden
Step 1. Use a random number generator to generate p distinct integers, called set A, where every integer is no larger than n
Step 2. Sort set A in increasing order
Step 3. Initialize i to 1
Step 4. For each element $s_i$ of $S_1$, do the following operation:
    if i is equal to $A_j$ and $m_j$ is 1, $1 \le j \le p$, change $s_i$ to be $C(s_i)$;
    else if i is equal to $A_j$ and $m_j$ is 0, do not change $s_i$;
    else if i is not equal to any $A_j$, set $s_i$ to be $C(C(s_i))$;
Step 5. Send the above sequence $S'$ to the receiver
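The hiding algorithm can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `hide`, the optional `positions` argument (used to replay the worked example with a fixed set A) and the use of Python's `random` module as the random number generator are all assumptions made here.

```python
import random

# Complementary rule (AT)(CA)(GC)(TG) from the text: C(A)=T, C(C)=A, C(G)=C, C(T)=G.
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def hide(prefix, bits, rule=RULE, positions=None, rng=None):
    """Sketch of Algorithm 3-1. `bits` is a string of '0'/'1';
    positions are 1-based, as in the text."""
    n, p = len(prefix), len(bits)
    if p > n:
        raise ValueError("the message must not be longer than the sequence prefix")
    if positions is None:
        rng = rng or random.Random()
        positions = rng.sample(range(1, n + 1), p)      # Step 1: p distinct integers <= n
    positions = sorted(positions)                       # Step 2: sort set A
    if len(set(positions)) != p or any(not 1 <= a <= n for a in positions):
        raise ValueError("set A must hold p distinct positions between 1 and n")
    bit_at = {a: bits[j] for j, a in enumerate(positions)}
    fake = []
    for i, s in enumerate(prefix, start=1):             # Steps 3-4: scan the prefix
        if i in bit_at:
            fake.append(rule[s] if bit_at[i] == "1" else s)
        else:
            fake.append(rule[rule[s]])                  # unselected position: C(C(s_i))
    return "".join(fake)                                # Step 5 would send this string


# Replaying the worked example with A = {2, 3, 5, 10, 12, 13, 15}.
print(hide("ACGGAATTGCTTCAG", "0111010", positions=[2, 3, 5, 10, 12, 13, 15]))
# GCCATGCCAACTAGG
```

When `positions` is omitted, set A is drawn at random, so two runs on the same input generally produce different fake sequences that carry the same message.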
The receiver uses the following algorithm (see Algorithm 3-2) to recover the hidden message.

Algorithm 3-2. Data recovery algorithm for Method 3 (Substitution Method)
Input: A faked DNA sequence $S' = s'_1, s'_2, \ldots, s'_n$, a reference DNA sequence $S = s_1, s_2, s_3, \ldots, s_m$, where $m \ge n$, and the complementary rule
Output: The hidden secret message M
Step 1. Initialize i and j to be 1
Step 2. For i from 1 to n:
    if $s'_i$ is the same as $s_i$, then set $m_j = 0$ and $j = j + 1$;
    else if $s'_i$ is the same as $C(s_i)$, then set $m_j = 1$ and $j = j + 1$;
Step 3. Concatenate all $m_k$'s, $1 \le k \le j - 1$, to be M and set all $s'_i$'s to be $s_i$'s
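A matching sketch of the recovery algorithm is given below. The function name `recover` and the choice to return both the message and the restored prefix are assumptions made here for illustration.

```python
# Complementary rule (AT)(CA)(GC)(TG) from the text: C(A)=T, C(C)=A, C(G)=C, C(T)=G.
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def recover(fake, reference, rule=RULE):
    """Sketch of Algorithm 3-2: compare S' with S character by character."""
    if len(reference) < len(fake):
        raise ValueError("the reference sequence must be at least as long as S'")
    bits = []
    for s_fake, s_ref in zip(fake, reference):   # Step 2: i runs from 1 to n
        if s_fake == s_ref:
            bits.append("0")                     # unchanged character hides a 0
        elif s_fake == rule[s_ref]:
            bits.append("1")                     # C(s_i) hides a 1
        # any other character is C(C(s_i)): the position carries no bit
    message = "".join(bits)                      # Step 3: concatenate the bits
    restored = reference[:len(fake)]             # Step 3: set every s'_i back to s_i
    return message, restored


print(recover("GCCATGCCAACTAGG", "ACGGAATTGCTTCAG"))
# ('0111010', 'ACGGAATTGCTTCAG')
```

The receiver never needs set A because, under a legal complementary rule, $C(C(s_i))$ differs from both $s_i$ and $C(s_i)$, so unselected positions identify themselves.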
For an intruder to discover the secret message, the following information is necessary: (1) the reference DNA sequence and (2) the complementary rule.
For (1): There are roughly 163 million DNA sequences available publicly. Thus, the probability of an attacker making a successful guess is $\frac{1}{1.63 \times 10^{8}}$.
For (2): The number of legal complementary rules should be considered. A legal complementary rule is defined as the following: for each letter x of a DNA sequence, x, C(x), C(C(x)) and C(C(C(x))) are all different, where C(x) is the complement of x. There are six legal complementary rules as follows:

(AT)(TC)(CG)(GA), (AT)(TG)(GC)(CA), (AC)(CT)(TG)(GA), (AC)(CG)(GT)(TA), (AG)(GT)(TC)(CA), and (AG)(GC)(CT)(TA).

The probability of making the correct guess is $\frac{1}{6}$. Finally, the probability of an attacker making a successful guess at Method 3 is as follows:

$\frac{1}{1.63 \times 10^{8}} \times \frac{1}{6}$.
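The count of six legal complementary rules can be checked by brute force over all 24 permutations of the four letters. The helper names below are assumptions introduced for this sketch.

```python
from itertools import permutations

BASES = "ACGT"


def is_legal(rule):
    """A rule is legal when x, C(x), C(C(x)) and C(C(C(x))) are all different."""
    for x in BASES:
        orbit = {x, rule[x], rule[rule[x]], rule[rule[rule[x]]]}
        if len(orbit) != 4:
            return False
    return True


def legal_rules():
    """Enumerate every permutation of the bases and keep the legal ones."""
    rules = []
    for image in permutations(BASES):
        rule = dict(zip(BASES, image))
        if is_legal(rule):
            rules.append(rule)
    return rules


for rule in legal_rules():
    print("".join(f"({x}{rule[x]})" for x in BASES))
print(len(legal_rules()))  # 6
```

The legal rules are exactly the permutations that cycle through all four letters, which is why only 6 of the 24 candidates survive.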
Let another case be considered. Suppose the attacker does not know the reference sequence S and guesses a complementary rule C(x). Thus, according to the algorithm, the attacker should consider three possible cases for each character between $S'$ and S.
Case 1: $s_i$ is x and $s'_i$ is x, too. In this case, $m_i = 0$.
Case 2: $s_i$ is x and $s'_i$ is C(x). In this case, $m_i = 1$.
Case 3: $s_i$ is x and $s'_i$ is C(C(x)). In this case, i is not an element of A.
If the length of $S'$ is n, there will be $3^{n}$ possibilities. The probability of the attacker making a successful guess is $\frac{1}{3^{n}}$.
Lemma 3. The probability of an attacker making a successful guess at the Substitution Method is either $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{6}$ or $\frac{1}{3^{n}}$.
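The two probabilities in Lemma 3 can be compared numerically. The following sketch (the function name is an assumption made here for illustration) finds the smallest sequence length n for which guessing among the $3^{n}$ per-character cases is less likely to succeed than guessing the reference sequence together with the complementary rule.

```python
def crossover_length(num_sequences=163_000_000, num_rules=6):
    """Smallest n with 3**n > num_sequences * num_rules."""
    target = num_sequences * num_rules
    n = 0
    while 3 ** n <= target:
        n += 1
    return n


print(crossover_length())  # 19
```

Under these figures, for any fake sequence of 19 or more characters the $\frac{1}{3^{n}}$ term is the smaller of the two probabilities.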
5. Experiments and comparisons I

This section proceeds to a series of experiments to evaluate the performance of the methods and make comparisons among them. The previous sections have demonstrated the robustness of the three proposed schemes, which is summarized in Table 2. In the following, other concerns in the field of data hiding such as capacity, payload and bpn will be discussed. The capacity, denoted by C, is the total length of the increased reference sequence after the secret message is hidden within it. The payload, denoted by P, is the remaining length of the new sequence after extracting out the reference DNA sequence. The bpn is the number of bits hidden per character.

Table 2
Security of each method.

| Method | Cracking probability |
|---|---|
| Insertion | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{n-1} \times \frac{1}{2^{m}-1} \times \frac{1}{2^{s}-1} \times \frac{1}{24}$ |
| Complementary Pair | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{24^{2}}$ |
| Substitution | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{6}$ or $\frac{1}{3^{n}}$ |

Before the above parameters are calculated, several notations should first be defined. Let S be a DNA sequence and $|S|$ its length, while M is the bit string to be hidden and $|M|$ is the length of the secret message. Since a DNA sequence is composed of nucleotides, each nucleotide is represented by two bits, such as 'A' = 01. In Method 1, $C = |S| + \frac{|M|}{2}$. In Method 2, other information called complementary substrings helps to hide the secret. In order to record the additional information, some deliberately inserted artificial complementary strings are necessary, denoted by I. In this scheme, the length of the longest complementary pairs in the reference DNA sequence is found first. Suppose it is k. Then, for each secret character, an artificial complementary pair of strings, each of length $k + 1$, with two characters 'T' before and after it, is inserted together with the secret character, until all secret characters are hidden. The length of the secret M should be even and it should be transformed into a DNA sequence instead of a binary one. For each character of the hidden secret message, $2(k + 3)$ additional characters are needed in total, which include a $(k + 1)$-long complementary pair of strings with two 'T's at its head and tail. Thus, the length of the additional information I is $\frac{|M|}{2} \times 2(k + 3) = |M|(k + 3)$. Therefore, $C = |S| + \frac{|M|}{2} + |M|(k + 3) = |S| + |M|\left(k + 3 + \frac{1}{2}\right)$ is the capacity of Method 2. Method 3 only substitutes each nucleotide. Thus, additional information and the capacity of the secret are not required. The total capacity is $C = |S|$. The payload P is the remaining length after extracting out the space of the medium, which is $\frac{|M|}{2}$, $|M|\left(k + 3 + \frac{1}{2}\right)$ and 0 for the three proposed methods, respectively. The value of bpn is equal to $\frac{|M|}{C}$. Table 3 shows the corresponding capacity, payload, and bpn of the three methods.

Table 3
Performance of each method.

| Method type | capacity C | payload P | bpn |
|---|---|---|---|
| Insertion | $\lvert S \rvert + \frac{\lvert M \rvert}{2}$ | $\frac{\lvert M \rvert}{2}$ | $\frac{\lvert M \rvert}{\lvert S \rvert + \frac{\lvert M \rvert}{2}}$ |
| Complementary Pair | $\lvert S \rvert + \lvert M \rvert \left(k + 3 + \frac{1}{2}\right)$ | $\lvert M \rvert \left(k + 3 + \frac{1}{2}\right)$ | $\frac{\lvert M \rvert}{\lvert S \rvert + \lvert M \rvert \left(k + 3 + \frac{1}{2}\right)}$ |
| Substitution | $\lvert S \rvert$ | 0 | $\frac{\lvert M \rvert}{\lvert S \rvert}$ |

As shown in Table 4, eight DNA sequences were used as the test sample [22], where 'Mus musculus' is the scientific name of house mice and 'Bos taurus' is the scientific name of cows. The secret message comprised 20k bytes (160k bits) of randomly selected data.
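The capacity, payload and bpn formulas of this section can be evaluated with a small sketch. The function names are assumptions introduced here, and the example values are the length of AC153526 (200,117 nucleotides), its longest complementary pair length k = 10 and the 160,000-bit secret message used in the experiments.

```python
def insertion(s_len, m_bits):
    """Method 1: every two message bits become one inserted nucleotide."""
    if m_bits % 2:
        raise ValueError("the message length in bits must be even")
    payload = m_bits // 2
    capacity = s_len + payload
    return capacity, payload, m_bits / capacity


def complementary_pair(s_len, m_bits, k):
    """Method 2: |M|/2 secret characters plus |M|(k + 3) padding characters."""
    if m_bits % 2:
        raise ValueError("the message length in bits must be even")
    payload = m_bits // 2 + m_bits * (k + 3)
    capacity = s_len + payload
    return capacity, payload, m_bits / capacity


def substitution(s_len, m_bits):
    """Method 3: characters are only substituted, so nothing is appended."""
    return s_len, 0, m_bits / s_len


S_LEN, M_BITS, K = 200_117, 160_000, 10
for name, (capacity, payload, bpn) in [
    ("Insertion", insertion(S_LEN, M_BITS)),
    ("Complementary Pair", complementary_pair(S_LEN, M_BITS, K)),
    ("Substitution", substitution(S_LEN, M_BITS)),
]:
    print(f"{name}: C={capacity}, P={payload}, bpn={bpn:.2f}")
```

The substitution formula is only meaningful when $|M| \le |S|$, since Algorithm 3-1 requires $p \le n$.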
Table 4
The tested DNA sequences.

| Locus | Number of nucleotides | Species definition |
|---|---|---|
| AC153526 | 200,117 | Mus musculus 10 BAC RP23-383C2 |
| AC166252 | 149,884 | Mus musculus 6 BAC RP23-100G10 |
| AC167221 | 204,841 | Mus musculus 10 BAC RP23-3P24 |
| AC168874 | 206,488 | Bos taurus clone CH240-209N9 |
| AC168897 | 200,203 | Bos taurus clone CH240-190B15 |
| AC168901 | 191,456 | Bos taurus clone CH240-185I1 |
| AC168907 | 194,226 | Bos taurus clone CH240-195I7 |
| AC168908 | 218,028 | Bos taurus clone CH240-195K23 |
Table 5
The results of using the Insertion Method to hide 20k bytes secret message within the tested DNA sequence.

| Sequence | Number of nucleotides | capacity C | payload P | $\text{bpn} = \frac{\lvert M \rvert}{C}$ |
|---|---|---|---|---|
| AC153526 | 200,117 | 280,117 | 80,000 | 0.57 |
| AC166252 | 149,884 | 229,884 | 80,000 | 0.70 |
| AC167221 | 204,841 | 284,841 | 80,000 | 0.56 |
| AC168874 | 206,488 | 286,488 | 80,000 | 0.56 |
| AC168897 | 200,203 | 280,203 | 80,000 | 0.57 |
| AC168901 | 191,456 | 271,456 | 80,000 | 0.59 |
| AC168907 | 194,226 | 274,226 | 80,000 | 0.58 |
| AC168908 | 218,028 | 298,028 | 80,000 | 0.54 |
Table 6
The corresponding lengths of the longest complementary pairs of the tested DNA sequences.

| Sequence | Number of nucleotides | k | k + 1 | $P = \lvert M \rvert \left(k + 3 + \frac{1}{2}\right)$ |
|---|---|---|---|---|
| AC153526 | 200,117 | 10 | 11 | 2,160,000 |
| AC166252 | 149,884 | 11 | 12 | 2,320,000 |
| AC167221 | 204,841 | 9 | 10 | 2,000,000 |
| AC168874 | 206,488 | 8 | 9 | 1,840,000 |
| AC168897 | 200,203 | 9 | 10 | 2,000,000 |
| AC168901 | 191,456 | 12 | 13 | 2,480,000 |
| AC168907 | 194,226 | 11 | 12 | 2,320,000 |
| AC168908 | 218,028 | 10 | 11 | 2,160,000 |
Table 7
The results of using the Complementary Pair Method to hide 20k bytes secret message within the tested DNA sequences.

| Sequence | Number of nucleotides | capacity C | payload P | $\text{bpn} = \frac{\lvert M \rvert}{C}$ |
|---|---|---|---|---|
| AC153526 | 200,117 | 2,360,117 | 2,160,000 | 0.07 |
| AC166252 | 149,884 | 2,469,884 | 2,320,000 | 0.06 |
| AC167221 | 204,841 | 2,204,841 | 2,000,000 | 0.07 |
| AC168874 | 206,488 | 2,046,488 | 1,840,000 | 0.08 |
| AC168897 | 200,203 | 2,200,203 | 2,000,000 | 0.07 |
| AC168901 | 191,456 | 2,671,456 | 2,480,000 | 0.06 |
| AC168907 | 194,226 | 2,514,226 | 2,320,000 | 0.06 |
| AC168908 | 218,028 | 2,378,028 | 2,160,000 | 0.07 |
Table 5 shows the results of using Method 1 to hide the secret message within the eight tested DNA sequences. The average bpn is 0.58. For Method 2, a Java program was written to calculate the length of the longest complementary pairs and the payload P in a DNA sequence. The results are shown in Table 6. Table 7 shows the results of using Method 2 to hide the secret message within the tested DNA sequences. The average bpn is 0.07. Table 8 shows the results of using Method 3 to hide the secret message within the tested DNA sequences. The average bpn is 0.82. Table 9 lists the average bpns of the three approaches. The corresponding data hiding parameters for the three proposed methods are listed in Table 10. According to Table 10, Method 3 shows the best capacity, while Method 2 offers the worst. Method 1 is the most robust and Method 3 the least. Furthermore, in Methods 1 and 2 many insignificant DNA sequences could be sent together with the encrypted reference sequence during the transmission, which would confuse the attacker when choosing which one to crack. However, this cannot be done in Method 3 because the receiver cannot exactly identify the DNA sequence with secret messages hidden among the transmitted sequences. In this section, the approaches were compared under data hiding parameters such as robustness and capacity. The next section will discuss the differences between these methods and other proposed hiding schemes via DNA sequences.
Table 8
The results of using the Substitution Method to hide 20k bytes secret message within the tested DNA sequences.

| Sequence | Number of nucleotides | capacity C | payload P | $\text{bpn} = \frac{\lvert M \rvert}{C}$ |
|---|---|---|---|---|
| AC153526 | 200,117 | 200,117 | 0 | 0.80 |
| AC166252 | 149,884 | 149,884 | 0 | 1.00 |
| AC167221 | 204,841 | 204,841 | 0 | 0.78 |
| AC168874 | 206,488 | 206,488 | 0 | 0.77 |
| AC168897 | 200,203 | 200,203 | 0 | 0.80 |
| AC168901 | 191,456 | 191,456 | 0 | 0.84 |
| AC168907 | 194,226 | 194,226 | 0 | 0.82 |
| AC168908 | 218,028 | 218,028 | 0 | 0.73 |
Table 9
Average bpn of each scheme by secretly hiding 20k bytes.

| Method | Insertion | Complementary Pair | Substitution |
|---|---|---|---|
| Average bpn | 0.58 | 0.07 | 0.82 |
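Table 10
The comparison of each approach.

| Method | Insertion | Complementary Pair | Substitution |
|---|---|---|---|
| Cracking probability | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{n-1} \times \frac{1}{2^{m}-1} \times \frac{1}{2^{s}-1} \times \frac{1}{24}$ | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{24^{2}}$ | $\frac{1}{1.63 \times 10^{8}} \times \frac{1}{6}$ or $\frac{1}{3^{n}}$ |
| capacity C | $\lvert S \rvert + \frac{\lvert M \rvert}{2}$ | $\lvert S \rvert + \lvert M \rvert \left(k + 3 + \frac{1}{2}\right)$ | $\lvert S \rvert$ |
| payload P | $\frac{\lvert M \rvert}{2}$ | $\lvert M \rvert \left(k + 3 + \frac{1}{2}\right)$ | 0 |
| bpn | $\frac{\lvert M \rvert}{\lvert S \rvert + \frac{\lvert M \rvert}{2}}$ | $\frac{\lvert M \rvert}{\lvert S \rvert + \lvert M \rvert \left(k + 3 + \frac{1}{2}\right)}$ | $\frac{\lvert M \rvert}{\lvert S \rvert}$ |
| Experimental results of average bpn | 0.58 | 0.07 | 0.82 |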
Table 11
The comparison of the average bpn between Chang's and our approaches.

| Provider | Approaches | Average bpn |
|---|---|---|
| Chang's approaches | Lossless compression-based | 0.78 |
| Chang's approaches | Difference expansion-based | 0.11 |
| Current approaches | Insertion Method | 0.58 |
| Current approaches | Complementary Pair Method | 0.07 |
| Current approaches | Substitution Method | 0.82 |
Table 12
A comparison of the current methods with the previous hiding schemes based upon DNA sequences.

| Property | Leier | Peterson | Shimanovsky | Chang | The Authors |
|---|---|---|---|---|---|
| View | Biological | Biological | Biological | Software | Software |
| Reversibility | Yes | Yes | No | Yes | Yes |
| Robustness analysis | Yes | Not safe enough | Not given | Not given | Yes |
| Capacity analysis (bpn) | Not given | Not given | Not given | Yes | Yes |
6. Experiments and comparisons II

As mentioned in Section 1, plenty of data hiding approaches were proposed in which the secret message was embedded into the host images [5,7,12,19,20]. However, this could distort the host image to some degree, and the distortion could be detected and attacked by the intruders. A number of related works have been proposed using DNA properties [6,15,17,18], as also mentioned in Section 1. This section draws comparisons between these previous works and the current methods.

Leier et al. proposed a robust scheme using a special key sequence called a primer to decode an encrypted DNA sequence [15]. A robustness analysis was provided by the authors, but a capacity analysis was not provided. In [17], Peterson proposed a method to hide data in DNA sequences by substituting three consecutive bases as a character. However, it is considered unsafe, as previously explained in Section 1. Shimanovsky et al. exploited the codon redundancy to hide data in mRNA [18], but it is an unrecoverable scheme, as also mentioned in Section 1. Chang et al. proposed two schemes to hide data in DNA sequences based upon software views [6]. The bpn values of both schemes are good and are listed in their paper, while a robustness analysis is not mentioned.

Table 11 shows the difference in the average bpn between Chang's and the current approaches under the same test sample (Table 4). Clearly, the Substitution Method offers the best average bpn and the Complementary Pair Method the worst. The Lossless compression-based scheme is better than both the Insertion Method and the Complementary Pair Method, while the Difference expansion-based scheme is only better than the Complementary Pair Method. To sum up, the conclusions of this section are listed in Table 12.

7. Concluding remarks

In this paper, it has been demonstrated that DNA sequences have special properties which can be utilized for data hiding purposes.
Three methods have been proposed for data hiding, which are all based upon a reference DNA sequence known only to the sender and the receiver. This reference sequence can be selected from any DNA database. Since there are roughly 163 million publicly available DNA sequences, it is virtually impossible to guess this sequence. It is difficult for an attacker to detect whether or not there are secret messages hidden in a DNA sequence. Even though s/he may know that secret messages are present in the fake DNA sequence, it is still virtually impossible for the sequence to be correctly recovered. In contrast with many traditional image media schemes, all three proposed methods are easy to implement and hard to detect.

Acknowledgements

The authors would like to thank the referees for their valuable suggestions. The work of C.H. Huang and K.L. Ng is supported by the National Science Council of the Republic of China under Grants NSC 98-2221-E-150-062 and NSC 98-2221-E468-013, respectively. Our gratitude goes to Dr. Timothy Williams, Asia University, for his help with proofreading the manuscript.

References

[1] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, J.D. Watson, Molecular Biology of the Cell, Garland Publishing, New York & London, 1994.
[2] A. Apostolico, D. Breslauer, Z. Galil, Optimal parallel algorithms for periods, palindromes and squares, in: Proceedings of the International Colloquium on Automata, Languages, and Programming, 1992, pp. 296–307.
[3] A. Apostolico, D. Breslauer, Z. Galil, Parallel detection of all palindromes in a string, Theoretical Computer Science 141 (1995) 163–173.
[4] D. Breslauer, Z. Galil, Finding all periods and initial palindromes of a string, Algorithmica 14 (1995) 355–366.
[5] C.C. Chang, C.C. Lin, C.S. Tseng, W.L. Tai, Reversible hiding in DCT-based compressed images, Information Sciences 177 (2007) 2768–2786.
[6] C.C. Chang, T.C. Lu, Y.F. Chang, R.C.T. Lee, Reversible data hiding schemes for deoxyribonucleic acid (DNA) medium, International Journal of Innovative Computing, Information and Control 3 (2007) 1–16.
[7] C.C. Chang, W.C. Wu, Y.H. Chen, Joint coding and embedding techniques for multimedia images, Information Sciences 178 (2008) 3543–3556.
[8] C.T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots, Nature 399 (1999) 533–534.
[9] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2002.
[10] European Bioinformatics Institute.
[11] Z. Galil, Real-time algorithms for string-matching and palindrome recognition, STOC (1976) 161–173.
[12] C.H. Huang, J.L. Wu, Fidelity-guaranteed robustness enhancement of blind-detection watermarking schemes, Information Sciences 179 (2009) 791–808.
[13] R.C.T. Lee, R.C. Chang, S.S. Tseng, Y.T. Tsai, Introduction to the Design and Analysis of Algorithms, A Strategic Approach, McGrawHill, 2005.
[14] A.L. Lehninger, D.L. Nelson, M.M. Cox, Principles of Biochemistry, Worth, New York, 2000.
[15] A. Leier, C. Richter, W. Banzhaf, H. Rauhe, Cryptography with DNA binary strands, BioSystems 57 (2000) 13–22.
[16] G. Manacher, A new linear-time on-line algorithm for finding the smallest initial palindrome of the string, Journal of the ACM 22 (1975) 346–351.
[17] I. Peterson, Hiding in DNA, Muse (2001) 22.
[18] B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002) 373–386.
[19] H.H. Tsai, D.W. Sun, Color image watermark extraction based on support vector machines, Information Sciences 177 (2007) 550–569.
[20] H.W. Tseng, C.P. Hsieh, Prediction-based reversible data hiding, Information Sciences 179 (2009) 2460–2469.
[21] J. Watson, N. Hopkins, J. Roberts, J. Steitz, Molecular Biology of the Gene, fourth ed., Benjamin Cummings, Menlo Park, CA, 1987.
[22] Website, NCBI Database.