Involution codes: with application to DNA coded languages Nataša Jonoska (
[email protected]) Department of Mathematics, University of South Florida, Tampa FL 33620, USA
Kalpana Mahalingam (
[email protected]) Department of Mathematics, University of South Florida, Tampa FL 33620, USA
Junghuei Chen (
[email protected]) Department of Chemistry and Biochemistry, University of Delaware, Newark DE 19716, USA
Abstract. For an involution $\theta : \Sigma^* \to \Sigma^*$ over a finite alphabet $\Sigma$ we consider involution codes: θ-infix, θ-comma-free, θ-k-codes and θ-subword-k-codes. These codes arise from questions on DNA strand design. We investigate conditions under which both $X$ and $X^+$ are the same type of involution code. General methods for generating such involution codes are given. The information capacity of these codes is shown to be optimized in most cases. A specific set of these codes was chosen for experimental testing and the results of these experiments are presented.
Keywords: codes, Watson-Crick involution, DNA codes
© 2005 Kluwer Academic Publishers. Printed in the Netherlands.
nature.tex; 7/01/2005; 23:43; p.1
1. Introduction
In bio-molecular computing, and in particular in DNA based computations and DNA nanotechnology, one of the main problems is the design of oligonucleotides such that mismatched pairing due to the Watson-Crick complementarity is minimized. In laboratory experiments, non-specific hybridizations pose potential problems for the results of the experiment. Many authors have addressed this problem and proposed various solutions. A common approach has been to use the Hamming distance as a measure of uniqueness (Baum, 1996; Deaton et al., 1997; Faulhammer et al., 2000; Garzon et al., 2000; Liu et al., 2003). Deaton et al. (Deaton et al., 1997; Garzon et al., 2000) used genetic algorithms to generate a set of DNA sequences that satisfy a predetermined Hamming distance. Marathe et al. (Marathe et al., 1999) also used the Hamming distance to compute combinatorial bounds for DNA sequences, and they used dynamic programming for the design of the strands used in (Liu et al., 2003). Seeman's program (Seeman, 1990) generates sequences by testing overlapping subsequences to enforce uniqueness. This program is designed for producing sequences that are suitable for complex three-dimensional DNA structures, and the generation of suitable sequences is not as automatic as in the other proposed programs. Feldkamp et al. (Feldkamp et al., 2002) also use the test for uniqueness of subsequences and rely on tree structures in generating new sequences. Ruben et al. (Ruben et al., 2002) use a random generator for the initial sequence design, and afterwards check for unique subsequences with predetermined properties based on the Hamming distance. One of the first theoretical observations about the number of DNA code words satisfying minimal Hamming distance properties was made by Baum (Baum, 1996). Experimental separation of "good" codes that avoid intermolecular cross hybridization on a big pool of random strands was reported in (Deaton et al., 2003). In (Hussini et al., 2003), the authors introduce a theoretical approach to the problem of designing code words. Based on these ideas and code-theoretic properties, a computer program for generating code words is being developed (Jonoska et al., 2002; Kephart and Lefevre, 2004). Another algorithm for generating such code words, based on backtracking, was developed by Li (Li, 2002).
Every biomolecular protocol involving DNA or RNA generates molecules whose sequences of nucleotides form a language over the four-letter alphabet $\Delta = \{A, G, C, T\}$. The Watson-Crick complementarity of the nucleotides defines a natural involution mapping θ, $A \mapsto T$ and $G \mapsto C$, which is an anti-morphism of $\Delta^*$. Undesirable Watson-Crick bonds (undesirable hybridizations) can be avoided if the language satisfies certain coding properties. In particular, for DNA code words, no involution of a word is a subword of another word, or no involution of a word is a subword of a composition of two words. These properties are called θ-infix and θ-comma-free respectively. The case when a DNA strand may form a hairpin (i.e., when a word contains a reverse complement of a subword) was introduced in (Jonoska et al., 2002) and was called θ-subword-k-code.
For words representing DNA sequences we use the following convention. A word $u$ over $\Delta$ denotes a DNA strand in its 5′ → 3′ orientation. The Watson-Crick complement of the word $u$, also in orientation 5′ → 3′, is denoted by $\overleftarrow{u}$. For example, if $u = AGGC$ then $\overleftarrow{u} = GCCT$. There are two types of unwanted hybridizations: intramolecular and intermolecular. The intramolecular hybridization happens when two sequences, one being a reverse complement of the other, appear within the same DNA strand (see Figure 1). In this case the DNA strand forms a hairpin. Two particular intermolecular hybridizations are of interest (see Figure 2). In Figure 2 (a) the strand labeled $u$ is a reverse complement of a subsequence of the strand labeled $v$, and Figure 2 (b) represents the case when $u$ is the reverse complement of a portion of a concatenation of $v$ and $w$.
Involution Codes – DNA Coded Languages
Figure 1. Intramolecular hybridization (θ-subword-k-code): (a) the reverse complement is at the beginning of the 5′ end, (b) the reverse complement is at the end of the 3′ end. The 3′ end of the DNA strand is indicated with an arrow.
Figure 2. Two types of intermolecular hybridization: (a) (θ-infix) one code word is a reverse complement of a subword of another code word, (b) (θ-comma-free) a code word is a reverse complement of a subword of a concatenation of two other code words. The 3′ end is indicated with an arrow.
We start the paper with definitions of languages with coding properties that avoid intermolecular and intramolecular cross hybridizations. The definitions of θ-infix and θ-comma-free languages are the same as the ones introduced in (Hussini et al., 2003). Here we also consider intramolecular hybridizations and subword hybridizations. Hence, we have two additional coding properties: θ-subword-k-code and θ-k-code. In (Jonoska and Mahalingam, 2004) necessary and sufficient conditions for preserving these properties under splicing were obtained. Here we make several observations about the closure properties of the code word languages. In particular, we concentrate on properties of languages that are preserved under concatenation. If a set of DNA strands has "good" coding properties that are preserved under concatenation, then the same properties will be preserved under arbitrary ligation of the strands. Section 3 provides necessary and sufficient conditions for a finite set of words to generate (by concatenation) an infinite set of code words. In practice, besides their use in ligation, these conditions provide a way to generate new "good" code words starting from a small set of initial "good" code words, and as such might facilitate the otherwise difficult task of strand design. Section 4 describes several ways to generate "good" code words. These sets of code words also provide sufficient informational entropy that they can be used to encode binary strings by mapping bits to symbols. Sets of molecules obtained by the described methods were tested for cross-hybridization experimentally. The results are shown in Section 5. Experiments showed that the designed θ-k-codes had no visible cross hybridizations. The other two sets (θ-comma-free, θ-subword-k-code) of code words avoided one specific way of annealing, but in general, stronger properties like θ-k-code are needed for reliable experiments. We end with a few concluding remarks.
2. Definitions
An alphabet $\Sigma$ is a finite non-empty set of symbols. We will denote by $\Delta$ the special case when the alphabet is $\{A, G, C, T\}$, representing the DNA nucleotides. A word $u$ over $\Sigma$ is a finite sequence of symbols in $\Sigma$. We denote by $\Sigma^*$ the set of all words over $\Sigma$, including the empty word 1, and by $\Sigma^+$ the set of all non-empty words over $\Sigma$. We note that with the word concatenation, $\Sigma^*$ is the free monoid and $\Sigma^+$ is the free semigroup generated by $\Sigma$. The length of a word $u = a_1 \cdots a_n$ is $n$ and is denoted by $|u|$. Throughout the rest of the paper, we concentrate on sets $X \subseteq \Sigma^+$ that are codes, meaning that every word in $X^+$ can be written uniquely as a product of words in $X$ (i.e., $X^+$ is a free semigroup generated by $X$). For the background on codes we refer the reader to (Berstel and Perrin, 1985). We will need the following definitions:
\[
\begin{aligned}
\mathrm{Pref}(w) &= \{u \mid \exists v \in \Sigma^*,\ uv = w\} \\
\mathrm{Suff}(w) &= \{u \mid \exists v \in \Sigma^*,\ vu = w\} \\
\mathrm{Sub}(w) &= \{u \mid \exists v_1, v_2 \in \Sigma^*,\ v_1 u v_2 = w\} \\
\mathrm{PPref}(w) &= \mathrm{Pref}(w) \setminus \{w\} \\
\mathrm{PSuff}(w) &= \mathrm{Suff}(w) \setminus \{w\} \\
R_X(w) &= \{x \in \Sigma^* \mid wx \in X\} \\
L_X(w) &= \{x \in \Sigma^* \mid xw \in X\}
\end{aligned}
\]
where $R_X(w)$ and $L_X(w)$ are respectively the right and the left context of $w$ in $X$. We extend these definitions to the set of prefixes, suffixes and subwords of a set of words. Similarly, we have $\mathrm{Suff}_k(w) = \mathrm{Suff}(w) \cap \Sigma^k$, $\mathrm{Pref}_k(w) = \mathrm{Pref}(w) \cap \Sigma^k$ and $\mathrm{Sub}_k(w) = \mathrm{Sub}(w) \cap \Sigma^k$.
We follow the definitions initiated in (Hussini et al., 2003) and used in (Jonoska and Mahalingam, 2004; Jonoska et al., 2002; Kari et al., 2003). An involution $\theta : \Sigma \to \Sigma$ of a set $\Sigma$ is a mapping such that $\theta^2$ equals the identity mapping, $\theta(\theta(x)) = x$ for all $x \in \Sigma$.
The mapping $\nu : \Delta \to \Delta$ defined by $\nu(A) = T$, $\nu(T) = A$, $\nu(C) = G$, $\nu(G) = C$ is an involution on $\Delta$ and can be extended to a morphic involution of $\Delta^*$. Since the Watson-Crick complementarity appears in a reverse orientation, we consider another involution $\rho : \Delta^* \to \Delta^*$ defined inductively: $\rho(s) = s$ for $s \in \Delta$ and $\rho(us) = \rho(s)\rho(u) = s\rho(u)$ for all $s \in \Delta$ and $u \in \Delta^*$. This involution is an antimorphism such that $\rho(uv) = \rho(v)\rho(u)$. The Watson-Crick complementarity is then the antimorphic involution obtained with the composition $\nu\rho = \rho\nu$. Hence for a DNA strand $u$ we have $\rho\nu(u) = \nu\rho(u) = \overleftarrow{u}$. The involution $\rho$ reverses the order of the letters in a word and as such is used in the rest of the paper.
For the general case, we concentrate on morphic and antimorphic involutions of $\Sigma^*$ that we denote by θ. The notions of θ-infix and θ-comma-free in items 3 and 4 of Definition 2.1 below were called θ-compliant and θ-free respectively in (Hussini et al., 2003; Jonoska et al., 2002; Jonoska and Mahalingam, 2004; Kari et al., 2003). Here we use the terms θ-infix and θ-comma-free since, when θ is the identity mapping, these notions correspond to infix code and comma-free code. Recall that $X \subseteq \Sigma^*$ is an infix code if $X \cap (\Sigma^* X \Sigma^+ \cup \Sigma^+ X \Sigma^*) = \emptyset$, and $X \subseteq \Sigma^*$ is a comma-free code if $X^2 \cap \Sigma^+ X \Sigma^+ = \emptyset$. Various other intermolecular possibilities for cross hybridizations were considered in (Kari et al., 2003) (see Fig. 3). All of these properties are included in θ-k-code (item 5 of Definition 2.1).
DEFINITION 2.1. Let $\theta : \Sigma^* \to \Sigma^*$ be a morphic or antimorphic involution.
1. The set $X$ is called θ-subword-k-m-code if for all $u \in \Sigma^*$ such that $|u| = k$ we have $\Sigma^* u \Sigma^i \theta(u) \Sigma^* \cap X = \emptyset$ for all $1 \le i \le m$.
2. The set $X$ is called θ-subword-k-code if for all $u \in \Sigma^*$ such that $|u| = k$ we have $\Sigma^* u \Sigma^i \theta(u) \Sigma^* \cap X = \emptyset$ for all $i \ge 1$.
3. The set $X$ is called θ-infix if $\Sigma^* \theta(X) \Sigma^+ \cap X = \emptyset$ and $\Sigma^+ \theta(X) \Sigma^* \cap X = \emptyset$.
4. The set $X$ is called θ-comma-free if $X^2 \cap \Sigma^+ \theta(X) \Sigma^+ = \emptyset$.
5. The set $X$ is called θ-k-code for some $k > 0$ if $\mathrm{Sub}_k(X) \cap \mathrm{Sub}_k(\theta(X)) = \emptyset$.
6. The set $X$ is called strictly θ if $X' \cap \theta(X') = \emptyset$ where $X' = X \setminus \{1\}$.
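For a finite set of DNA words, items 3 to 6 of Definition 2.1 can be checked directly by enumeration. The following is a minimal sketch, assuming θ is the Watson-Crick involution ρν on ∆; the function names (`theta`, `sub_k`, `is_theta_infix`, and so on) are illustrative choices and do not come from the paper.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    """Watson-Crick involution: reverse the word and complement each letter."""
    return "".join(COMPLEMENT[s] for s in reversed(word))

def sub_k(words, k):
    """Sub_k of a set: all subwords of length k of any word in the set."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}

def is_theta_infix(X):
    """Item 3: no theta(y) is a proper subword of a word x in X."""
    return not any(theta(y) in x and len(y) < len(x) for x in X for y in X)

def is_theta_comma_free(X):
    """Item 4: no theta(z) occurs in xy with a non-empty word on both sides."""
    images = {theta(z) for z in X}
    # Dropping the first and last letter of xy forces both contexts to be non-empty.
    return not any(t in (x + y)[1:-1] for x, y in product(X, repeat=2) for t in images)

def is_theta_k_code(X, k):
    """Item 5: Sub_k(X) and Sub_k(theta(X)) are disjoint."""
    return sub_k(X, k).isdisjoint(sub_k({theta(x) for x in X}, k))

def is_strictly_theta(X):
    """Item 6: no non-empty word of X is the image of a word of X."""
    nonempty = {x for x in X if x}
    return nonempty.isdisjoint({theta(x) for x in nonempty})

X = {"AAC", "ACC"}
print(theta("AGGC"), is_theta_infix(X), is_theta_comma_free(X), is_theta_k_code(X, 2))
```

The example set uses only the letters A and C, so its images under θ use only G and T and every check succeeds; a self-complementary word such as AGCT fails both the strictness check and the θ-2-code check.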
The notions of θ-prefix, θ-suffix (subword) code can be defined naturally from the notions described above, but since this paper does not investigate these properties separately, we don’t list the formal definitions here. We have the following observations:
Figure 3. Various cross hybridizations of molecules, one of which contains a subword of length k and the other its complement.
OBSERVATION 2.2. In the following we assume that $k \le \min\{|x| : x \in X\}$.
1. When θ is the identity, $X$ is an infix (comma-free) code iff $X$ is θ-infix (comma-free).
2. When $X$ is such that $X = \theta(X)$, then $X$ is θ-infix (comma-free) iff $X$ is infix (comma-free).
3. $X$ is strictly θ-infix iff $\Sigma^* \theta(X) \Sigma^* \cap X = \emptyset$.
4. If $X$ is strictly θ-comma-free then $X$ and $\theta(X)$ are strictly θ-infix and $\theta(X)$ is θ-comma-free.
5. If $X$ is strictly θ-infix then $X^*$ is both θ-prefix and θ-suffix code.
6. If $X$ is θ-k-code, then $X$ is θ-k′-code for all $k' > k$.
7. $X$ is a θ-k-code iff $\theta(X)$ is a θ-k-code.
8. If $X$ is strictly θ such that $X^2$ is θ-subword-k-code, then $X$ is strictly θ-k-code.
9. If $X$ is a θ-k-code then both $X$ and $\theta(X)$ are θ-infix, θ-subword-k-code, θ-prefix-k and suffix-k-code for any $m \ge 1$. If $k \le |x|/2$ for all $x \in X$ then $X$ is θ-comma-free and hence avoids the cross hybridizations shown in Fig. 1 and 2.
10. If $X$ is a θ-k-code then $X$ and $\theta(X)$ avoid all cross hybridizations of length $k$ shown in Fig. 3, and so all cross hybridizations presented in Fig. 2 of (Kari et al., 2003).
It is clear that θ-subword-k-m-code implies θ-prefix-k-m-code and θ-suffix-k-m-code. We note that when $\theta = \rho\nu$, the θ-subword-k-m-code property of the code words $X \subseteq \Delta^*$ does not allow intramolecular hybridization as in Fig. 1 for a predetermined $k$ and $m$. The maximal length of a word that together with its reverse complement can appear as subwords of code words is limited by $k$. The length of the hairpin, i.e. the "distance" between the word and its reverse complement, is bounded between 1 and $m$. The values of $k$ and $m$ would depend on the laboratory conditions (e.g., the melting temperature and the length of the code words). In order to avoid intermolecular hybridizations as presented in Fig. 2, $X$ has to be θ-infix and θ-comma-free. Most applications would require $X$ to be strictly θ. The most restrictive and valuable properties are obtained with a θ-k-code, and the analysis of this type of code is also the most difficult. When $X$ is a θ-k-code, all intermolecular hybridizations presented in Fig. 3 are avoided. We include several observations in the next section.
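The hairpin condition of the θ-subword-k-m-code (item 1 of Definition 2.1) can be tested on a finite set of strands by scanning every window of length k and every loop length from 1 to m. This is a minimal sketch assuming θ = ρν; the names `has_hairpin` and `is_theta_subword_k_m_code` are hypothetical helpers, not part of the paper.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    """Watson-Crick involution: reverse the word and complement each letter."""
    return "".join(COMPLEMENT[s] for s in reversed(word))

def has_hairpin(word, k, m):
    """True if word contains u s theta(u) with |u| = k and 1 <= |s| <= m."""
    n = len(word)
    for start in range(n - k + 1):
        target = theta(word[start:start + k])
        for gap in range(1, m + 1):
            pos = start + k + gap          # where theta(u) would have to begin
            if pos + k > n:
                break                      # longer loops run past the end of the word
            if word[pos:pos + k] == target:
                return True
    return False

def is_theta_subword_k_m_code(X, k, m):
    """Item 1 of Definition 2.1 for a finite set X of strands."""
    return not any(has_hairpin(x, k, m) for x in X)

strand = "AAG" + "TTTT" + "CTT"            # u = AAG, loop TTTT, theta(u) = CTT
print(has_hairpin(strand, 3, 4), has_hairpin(strand, 3, 3))
```

In the example the only stem of length 3 is separated by a loop of length 4, so the strand violates the property for m = 4 but satisfies it for m = 3.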
3. Generating infinite sets of code words
It is easy to note that, other than the θ-comma-free codes, none of the θ-infix, θ-k-code and θ-subword-k-code properties is closed under arbitrary concatenation. In this section we investigate the properties of a finite set of "good" code words $X$ that can generate an infinite set of code words $X^+$ with the same "good" properties. In practice, it is much easier to generate a relatively small set of code words that has certain properties (e.g., in the case of DNA or RNA, mismatched hybridization is avoided), and if we know that any concatenation of such words also satisfies the requirements, the process of generating code words can be considerably simplified. Hence, we give necessary and sufficient conditions on $X$ such that $X^+$ is θ-(subword)-k-code or θ-infix. The lemma below shows that if $X$ is θ-comma-free then $X^2$ is "almost" θ-infix. The difference is in $+$ vs. $*$ in item 3 of Definition 2.1. All properties below refer to a finite set $X \subseteq \Sigma^+$.
LEMMA 3.1. If $X$ is θ-comma-free then $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$.
Proof. Suppose the lemma does not hold. Then there are $x_1, x_2, y_1, y_2 \in X$ such that $x_1 x_2 = a\theta(y_1 y_2)b$ with $a, b \in \Sigma^+$. When θ is morphic, $x_1 x_2 = a\theta(y_1)\theta(y_2)b$, and when θ is antimorphic, $x_1 x_2 = a\theta(y_2)\theta(y_1)b$. In both cases $X$ would not be θ-comma-free. Hence $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$. □
Note that the converse of the above need not be true. For example, consider $X = \{a^3 b, a^2 b^2\}$ with θ being the morphism $a \mapsto b$, $b \mapsto a$. Then $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$ since all words in $X^2$ are of length 8, but $X$ is not θ-comma-free since $a^2 b^2 a^3 b = a^2 \theta(a^2 b^2) ab$.
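The counterexample to the converse of Lemma 3.1 is small enough to confirm by enumeration. The sketch below encodes the morphic involution $a \mapsto b$, $b \mapsto a$ with `str.translate`; the helper `strictly_inside` is an illustrative name for membership in $\Sigma^+ t \Sigma^+$.

```python
from itertools import product

SWAP = str.maketrans("ab", "ba")

def theta(word):
    """Morphic involution a -> b, b -> a (letter order is preserved)."""
    return word.translate(SWAP)

def strictly_inside(t, w):
    """True if w = s t r with both s and r non-empty."""
    return t in w[1:-1]

X = {"aaab", "aabb"}
X2 = {x + y for x, y in product(X, repeat=2)}

# Condition of Lemma 3.1: no theta(v), v in X^2, lies strictly inside a word of X^2.
lemma_condition = not any(strictly_inside(theta(v), w) for w in X2 for v in X2)

# theta-comma-free: no theta(z), z in X, lies strictly inside a word of X^2.
comma_free = not any(strictly_inside(theta(z), w) for w in X2 for z in X)

print(lemma_condition, comma_free)
```

The first condition holds trivially because a word of length 8 cannot sit strictly inside another word of length 8, while the second fails on the witness $a^2 b^2 \cdot a^3 b$ given in the text.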
LEMMA 3.2. If $X$ is strictly θ-infix then $X^n$ is strictly θ-infix for all $n > 1$.
Proof. If the lemma does not hold then $X^n$ is not strictly θ-infix for some $n$. This means that there are $x, y \in X^n$ such that $x = s\theta(y)t$ for some $s, t \in \Sigma^*$ (not both equal to 1). Let $x = x_1 \cdots x_n = s\theta(y_1 \cdots y_n)t$ with $x_i, y_i \in X$. Then one of the $\theta(y_i)$ is a subword of some $x_j$, which contradicts $X$ being θ-infix. □
The Kleene $*$ closure of $X$ contains the union of all $X^n$, and in order for it to be θ-infix we need stronger properties. The next proposition is a stronger version of Lemma 1(ii) and Proposition 1 in (Hussini et al., 2003).
PROPOSITION 3.3. The following are equivalent:
(i) $X$ is strictly θ-comma-free
(ii) $X^+$ is strictly θ-infix
(iii) $X^+$ is strictly θ-comma-free
Proof. (i)⇒(ii). Suppose $X$ is strictly θ-comma-free. Hence, by item 4 of Observation 2.2, $X$ is strictly θ-infix. By Lemma 3.2, $X^n$ is strictly θ-infix for all $n \ge 1$. Suppose $X^+$ is not strictly θ-infix. Then there exist $x, y \in X^+$ such that $x = a\theta(y)b$ for some $a, b \in \Sigma^*$ (not both equal to 1). Let $x = x_1 \cdots x_n$ and $y = y_1 \cdots y_m$ for $x_i, y_j \in X$. Hence for some $y_i$ either (a) $\theta(y_i)$ is a subword of $x_j$ for some $j$, (b) $\theta(y_i)$ is a subword of $x_j x_{j+1}$, or (c) $\theta(y_i)$ is a subword of $x_{j-1} x_j x_{j+1}$. The cases (a) and (c) contradict the fact that $X$ is θ-infix, and (b) contradicts $X$ being θ-comma-free. Hence $X^+$ is strictly θ-infix.
(ii)⇒(iii). Given that $X^+$ is strictly θ-infix, suppose $X^+$ is not θ-comma-free. Then there exist $x, y, z \in X^+$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Let $x = x_1 \cdots x_n$, $y = y_1 \cdots y_m$ and $z = z_1 \cdots z_r$ with $x_i, y_i, z_i \in X$. Then $\theta(z_i)$ is a subword of one of $x_j x_{j+1}$, $x_j$, $y_s$, $y_s y_{s+1}$ or $x_n y_1$. All cases contradict that $X^+$ is θ-infix. Hence $X^+$ is strictly θ-comma-free.
(iii)⇒(i). Obvious, since $X$ is a subset of $X^+$. □
The following definition is the same as the one given in (Hussini et al., 2003).
DEFINITION 3.4. For $X \subseteq \Sigma^+$ define
1. $X_{is} = \mathrm{PSuff}(\theta(X)) \cap \mathrm{PPref}(X)$
2. $X_{ip} = \mathrm{PSuff}(X) \cap \mathrm{PPref}(\theta(X))$
3. $X_s = \bigcup_{x \in \mathrm{PSuff}(X)} R_{\theta(X)}(x)$
4. $X_p = \bigcup_{x \in \mathrm{PPref}(X)} L_{\theta(X)}(x)$
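The sets $X_{is}$ and $X_{ip}$ of Definition 3.4 are finite whenever $X$ is, so the equivalence stated in Proposition 3.5 (θ-infix together with $X_{ip} X_{is} \cap \theta(X) = \emptyset$, versus θ-comma-free) can be spot-checked exhaustively on small sets. The sketch below assumes the two-letter alphabet {a, b} and the antimorphic involution that reverses a word and swaps the letters; all function names are illustrative.

```python
from itertools import combinations, product

SWAP = str.maketrans("ab", "ba")

def theta(word):
    """Antimorphic involution on {a, b}: reverse the word and swap the letters."""
    return word[::-1].translate(SWAP)

def image(X):
    return {theta(x) for x in X}

def proper_prefixes(X):
    # PPref(X): every prefix except the word itself (the empty word is included).
    return {x[:i] for x in X for i in range(len(x))}

def proper_suffixes(X):
    # PSuff(X): every suffix except the word itself (the empty word is included).
    return {x[i:] for x in X for i in range(1, len(x) + 1)}

def x_is(X):
    return proper_suffixes(image(X)) & proper_prefixes(X)

def x_ip(X):
    return proper_suffixes(X) & proper_prefixes(image(X))

def is_theta_infix(X):
    return not any(t in x and len(t) < len(x) for x in X for t in image(X))

def is_theta_comma_free(X):
    return not any(t in (x + y)[1:-1] for x in X for y in X for t in image(X))

def prop_3_5_left_side(X):
    """theta-infix and X_ip X_is disjoint from theta(X)."""
    products = {u + v for u, v in product(x_ip(X), x_is(X))}
    return is_theta_infix(X) and products.isdisjoint(image(X))

# All sets of at most three words of length 1 to 3 over {a, b}.
WORDS = ["".join(t) for n in (1, 2, 3) for t in product("ab", repeat=n)]
SMALL_SETS = [set(c) for r in (1, 2, 3) for c in combinations(WORDS, r)]
agree = all(prop_3_5_left_side(X) == is_theta_comma_free(X) for X in SMALL_SETS)
print(agree)
```

An exhaustive check over small sets is evidence rather than proof, but it is a cheap way to catch an off-by-one in the handling of proper prefixes and suffixes before applying the criterion to real strand sets.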
The following two propositions are similar to Proposition 6 and Proposition 9 in (Hussini et al., 2003).
PROPOSITION 3.5. Let $X \subseteq \Sigma^+$. Then $X$ is θ-infix and $X_{ip} X_{is} \cap \theta(X) = \emptyset$ iff $X$ is θ-comma-free.
Proof. Let $X$ be θ-infix and $X_{ip} X_{is} \cap \theta(X) = \emptyset$. Suppose $X$ is not θ-comma-free, so there are $x, y, z \in X$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Since $X$ is θ-infix, $\theta(z)$ is not a subword of $x$ or $y$. Let $\theta(z) = z_1 z_2$ such that $a z_1 = x$ and $z_2 b = y$. But $z_1 z_2 \in \theta(X)$, so $z_2 b \in X$ implies $z_2 \in X_{is}$ and $a z_1 \in X$ implies $z_1 \in X_{ip}$. Hence $z_1 z_2 \in X_{ip} X_{is} \cap \theta(X)$, which contradicts the hypothesis. Conversely, let $X$ be θ-comma-free. Suppose $xy \in X_{ip} X_{is} \cap \theta(X)$. Then $x \in X_{ip}$ implies that there are $u, v \in \Sigma^+$ such that $ux \in X$ and $xv \in \theta(X)$. For $y \in X_{is}$ there exist $w, r \in \Sigma^+$ such that $wy \in \theta(X)$ and $yr \in X$. Hence $uxyr \in X^2$ with $xy \in \theta(X)$, which contradicts $X$ being θ-comma-free. □
PROPOSITION 3.6. If $X$ is θ-infix and $X_p X_s \cap \theta(X) = \emptyset$ then $X$ is θ-comma-free.
Proof. Suppose $X$ is not θ-comma-free. Then there exist $x, y, z \in X$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Since $X$ is θ-infix, $\theta(z)$ is not a subword of $x$ or $y$. Let $\theta(z) = z_1 z_2$ for some $z_1, z_2$ such that $a z_1 = x$ and $z_2 b = y$, which implies $z_2 \in X_s$ and $z_1 \in X_p$. Hence $z_1 z_2 \in X_p X_s \cap \theta(X)$, which contradicts the initial assumption. □
The converse of the above proposition is not true. For example, $X = \{bba, bbab\}$ is θ-comma-free for an antimorphic θ mapping $a \mapsto b$ and $b \mapsto a$. But $X_s = \{baa, a, aa\}$ and $X_p = \emptyset$, which implies $baa \in X_p X_s \cap \theta(X)$.
The following proposition investigates the case when the property of subword-k-m-codes is preserved under Kleene $*$. It turns out that the conditions under which a subword-k-m-code is closed under concatenation with itself are somewhat more demanding than the ones for θ-comma-free and θ-infix. Considering items 8, 9 and 10 in Observation 2.2, these properties might turn out to be quite important.
PROPOSITION 3.7.
Let $k, m$ be positive integers and $X \subseteq \Sigma^*$ be such that for every word $x \in X$, $|x| \ge 2k + m$. Let
\[
L = \bigcup_{l=1}^{m} \ \bigcup_{i+j=2k+l} \mathrm{Suff}_i(X)\,\mathrm{Pref}_j(X) \tag{1}
\]
Then $X^*$ is θ-subword-k-m-code if and only if $X$ is θ-subword-k-m-code and for all $y \in L$, $\mathrm{Pref}_k(y) \cap \mathrm{Suff}_k(\theta(y)) = \emptyset$.
Proof. Assume that $X$ is θ-subword-k-m-code and for all $y \in L$, $\mathrm{Pref}_k(y) \cap \mathrm{Suff}_k(\theta(y)) = \emptyset$. Suppose $X^*$ is not θ-subword-k-m-code. Then there exists $x \in X^*$ such that $x = x_1 u s \theta(u) x_2$ where $|s| = l \le m$ and $|u| = k$. We claim that this is impossible. If $us\theta(u)$ is a subword of some $y \in X$, then this contradicts the property that $X$ is θ-subword-k-m-code. If $us\theta(u)$ is a subword of some $x_1 x_2$ for $x_1, x_2 \in X$, then there is a $p \in L$ such that $\mathrm{Pref}_k(p) \cap \mathrm{Suff}_k(\theta(p)) \ne \emptyset$, which again contradicts the hypothesis. If $us\theta(u)$ is a subword of some $x_1 x_2 x_3$ for $x_1, x_2, x_3 \in X$, then $u_2 s u_3 = x_2$ for some $u_1 u_2 = u$ and $u_3 u_4 = \theta(u)$; since $|u_2| < k$, $|u_3| < k$ and $|x_2| \ge 2k + m$, this implies $|s| > m$, which is a contradiction. Hence $X^*$ is θ-subword-k-m-code.
Conversely, note that if $X^*$ is θ-subword-k-m-code then $X$ is θ-subword-k-m-code. Suppose there exists $x \in L$ such that $\mathrm{Pref}_k(x) \cap \mathrm{Suff}_k(\theta(x)) \ne \emptyset$. Then $x$ is either a subword of some $y \in X$ or a subword of some $y_1 y_2$ with $y_1, y_2 \in X$. Both cases contradict the fact that $X^*$ is θ-subword-k-m-code. □
The conditions under which codes with one of the coding properties in Definition 2.1 are closed under Kleene $*$ are discussed above. But when is a language θ-subword-k-code, θ-infix and θ-comma-free all at once? The next two propositions try to answer this question. The condition in the first proposition is quite strong due to the strong requirements. However, the condition is only sufficient, and may not be necessary.
PROPOSITION 3.8. Let $X$ be a θ-2-code such that for all $x \in X$, $|x| \ge 3$. Then both $X$ and $X^+$ are strictly
1. θ-subword-k-code for $k \ge 3$.
2. θ-infix and θ-comma-free.
Proof. 1. By induction on the powers of $X$. Since $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$ and $k \ge 3$, $\Sigma^* u \Sigma^m \theta(u) \Sigma^* \cap X = \emptyset$ for all $u \in \Sigma^k$ and for $m \ge 1$. Hence $X^1 = X$ is θ-subword-k-code.
Consider $X^2$ and suppose that there exists an $x \in X^2$ such that $x = y_1 u y_2 \theta(u) y_3$ for some $y_1, y_3 \in \Sigma^*$, $u \in \Sigma^k$ and $y_2 \in \Sigma^m$ for $m \ge 1$. Let $x = x_1 x_2$ for $x_1, x_2 \in X$.
It is not the case that $x_1 = y_1$ and $x_2 = u y_2 \theta(u) y_3$, since $X$ is θ-subword-k-code. So suppose that $x_1 = y_1 u_1$ and $x_2 = u_2 y_2 \theta(u) y_3$ for some $u_1, u_2$ such that $u_1 u_2 = u$ and $u_1 \ne 1$. Since $k > 2$, one of $u_1$ or $u_2$ has length at least 2. This contradicts our assumption that $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$. The inductive step is done similarly. Assume $X^n$ is θ-subword-k-code. Suppose there exists an $x = x_1 \cdots x_{n+1} \in X^{n+1}$ such that $x = y_1 u y_2 \theta(u) y_3$ for some $y_1, y_3 \in \Sigma^*$, $u \in \Sigma^k$, $y_2 \in \Sigma^m$, $m \ge 1$. Suppose $u \in \mathrm{Sub}(x_i x_{i+1} \cdots x_{i+t})$ with $|u| > 2$. Then there is a subword of $u$, say $u_j \in \mathrm{Sub}(x_j)$, such that $|u_j| \ge 2$ and $\theta(u_j) \in \mathrm{Sub}(X)$. This contradicts the condition that $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$. So for every $i$, $X^i$ is θ-subword-k-code. Hence $X^+ = \bigcup_{i=1}^{\infty} X^i$ is θ-subword-k-code.
2. Since $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$, $X$ is both θ-infix and θ-comma-free, and by Proposition 3.3 $X^+$ is θ-infix and θ-comma-free. □
The following observation is straightforward.
PROPOSITION 3.9. Let $X$ be a θ-k-code for $k \le \min\{|x| : x \in X\}$. Then $X^+$ is strictly θ-k-code iff $X^2$ is a strictly θ-k-code.
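Proposition 3.9 reduces a property of the infinite set $X^+$ to a finite test on $X^2$: when $k \le \min\{|x| : x \in X\}$, a window of length $k$ in a concatenation of code words overlaps at most two consecutive words, so it already occurs in $X^2$. The sketch below illustrates this with θ = ρν; the helper `power` and the two example sets are assumptions made for illustration only.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    """Watson-Crick involution: reverse the word and complement each letter."""
    return "".join(COMPLEMENT[s] for s in reversed(word))

def sub_k(words, k):
    """All subwords of length k of any word in the set."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}

def is_theta_k_code(X, k):
    return sub_k(X, k).isdisjoint(sub_k({theta(x) for x in X}, k))

def power(X, n):
    """X^n: all concatenations of n words of X."""
    return {"".join(t) for t in product(X, repeat=n)}

# A theta-3-code whose square is not one: AGA.AGA contains AAG = theta(CTT).
bad = {"AGA", "CTT"}
print(is_theta_k_code(bad, 3), is_theta_k_code(power(bad, 2), 3))

# A set over {A, C} only: every power stays a theta-2-code.
good = {"AAC", "ACC"}
print([is_theta_k_code(power(good, n), 2) for n in (1, 2, 3, 4)])
```

The first example shows why testing $X$ alone is not enough, and the second shows the test on $X^2$ agreeing with higher powers.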
4. Methods to generate good code words
In the previous section we described necessary and sufficient conditions under which concatenations of "good" code words produce new "good" code words. With the constructions in this section we show several ways to generate such codes. Many authors have realized that in the design of DNA strands it is helpful to consider three out of the four bases. This was the case with several successful experiments (Braich et al., 2002; Faulhammer et al., 2000; Liu et al., 2003). It turns out that this technique, or a variation of it, can be generalized such that codes with some of the desired properties can be easily constructed. In this section we concentrate on providing methods to generate "good" code words $X$ such that $X^+$ has the same property. For each code $X$ the entropy of $X^+$ is computed. The entropy measures the information capacity of the codes, i.e., the efficiency of these codes when used to represent information.
The standard definition of entropy of a code $X \subseteq \Sigma^+$ uses a probability distribution over the symbols of the alphabet of $X$ (see (Berstel and Perrin, 1985)). However, for a $p$-symbol alphabet, the maximal entropy is obtained when each symbol appears with the same probability $\frac{1}{p}$. In this case the entropy essentially counts the average number of words of a given length as subwords of the code words (Keane, 1991). From the coding theorem, it follows that $\{0, 1\}^+$ can be encoded by $X^+$ with $\Sigma \mapsto \{0, 1\}$ if the entropy of $X^+$ is at least $\log 2$ ((Adler et al., 1983), see also Theorem 5.2.5 in (Lind and Marcus, 1999)). The codes for θ-comma-free, strictly θ-comma-free, and θ-k-codes designed in this section have entropy larger than $\log 2$ when the alphabet has $p = 4$ symbols. Hence, such DNA codes can be used for encoding bit-strings.
We start with the entropy definition as given in (Lind and Marcus, 1999).
DEFINITION 4.1. Let $X$ be a code. The entropy of $X^+$ is defined by
\[
\hbar(X) = \lim_{n \to \infty} \frac{1}{n} \log |\mathrm{Sub}_n(X^+)|.
\]
If $G$ is a deterministic automaton or an automaton with a delay that recognizes $X^+$ and $A_G$ is the adjacency matrix of $G$, then by Perron-Frobenius theory $A_G$ has a maximal positive eigenvalue $\bar{\mu}$ and the entropy of $X^+$ is $\log \bar{\mu}$ (see Chapter 4 of (Lind and Marcus, 1999)). We use this fact in the following computations of the entropies of the designed codes.
In (Hussini et al., 2003), Proposition 16, the authors designed a set of DNA code words that is strictly θ-comma-free. The following propositions show that in a similar way we can construct codes with additional "good" properties. In what follows we assume that $\Sigma$ is a finite alphabet with $|\Sigma| \ge 3$ and $\theta : \Sigma \to \Sigma$ is an involution which is not the identity. We denote by $p$ the number of symbols in $\Sigma$. We also use the fact that $X$ is (strictly) θ-comma-free iff $X^+$ is (strictly) θ-comma-free (Proposition 3.3).
PROPOSITION 4.2. Let $a, b \in \Sigma$ be such that for all $c \in \Sigma \setminus \{a, b\}$, $\theta(c) \notin \{a, b\}$. Let $X = \bigcup_{i=1}^{\infty} a^m (\Sigma \setminus \{a, b\})^i b^m$ for a fixed integer $m \ge 1$. Then $X$ and $X^+$ are θ-comma-free. The entropy of $X^+$ is such that $\log(p - 2) < \hbar(X^+) < \log(p - 1)$.
Proof. Let $x_1, x_2, y \in X$ be such that $x_1 x_2 = s\theta(y)t$ for some $s, t \in \Sigma^+$, with $x_1 = a^m p\, b^m$, $x_2 = a^m q\, b^m$ and $y = a^m r\, b^m$ for $p, q, r \in (\Sigma \setminus \{a, b\})^+$. Since θ is an involution, if $\theta(a) \ne a, b$, then there is a $c \in \Sigma \setminus \{a, b\}$ such that $\theta(c) = a$, which is excluded by assumption. Hence, either $\theta(a) = a$ or $\theta(a) = b$. When θ is morphic $\theta(y) = \theta(a^m)\theta(r)\theta(b^m)$, and when θ is antimorphic $\theta(y) = \theta(b^m)\theta(r)\theta(a^m)$. So $\theta(y) = a^m \theta(r) b^m$ or $\theta(y) = b^m \theta(r) a^m$. Since $x_1 x_2 = a^m p\, b^m a^m q\, b^m = s\, b^m \theta(r) a^m t$ or $x_1 x_2 = a^m p\, b^m a^m q\, b^m = s\, a^m \theta(r) b^m t$, the only possibilities for $r$ are $\theta(r) = p$ or $\theta(r) = q$. In the first case $s = 1$ and in the second case $t = 1$, which contradicts the definition of θ-comma-free. Hence $X$ is θ-comma-free.
Let $A = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 2m + 1\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:
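\[
\lambda(i, s, j) =
\begin{cases}
a, & \text{for } 1 \le i \le m,\ j = i + 1 \\
b, & \text{for } m + 2 \le i \le 2m,\ j = i + 1, \text{ and } i = 2m + 1,\ j = 1 \\
s, & \text{for } i = m + 1, m + 2,\ j = m + 2,\ s \in \Sigma \setminus \{a, b\}
\end{cases}
\]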
Then the adjacency matrix for $A$ is a $(2m+1) \times (2m+1)$ matrix with $ij$th entry equal to the number of edges from vertex $i$ to vertex $j$. The characteristic polynomial can be computed to be $\det(A - \mu I) = (-\mu)^{2m}(p - 2 - \mu) + (-1)^{2m}(p - 2)$. The eigenvalues are solutions of the equation $\mu^{2m}(p - 2) - \mu^{2m+1} + p - 2 = 0$, which gives $p - 2 = \mu - \frac{\mu}{\mu^{2m} + 1}$. Hence $0 < \frac{\mu}{\mu^{2m} + 1} < 1$, i.e., $p - 2 < \mu < p - 1$. □
In the case of the DNA alphabet, $p = 4$ and for $m = 1$ the above characteristic equation becomes $\mu^3 - 2\mu^2 - 2 = 0$. The largest real value of $\mu$ is approximately 2.3593, which means that the entropy of $X^+$ is greater than $\log 2$.
EXAMPLE 4.3. Consider the DNA alphabet $\Delta$ with $\theta = \rho\nu$. Let $m = 2$ and choose $A$ and $T$ such that $X \subseteq \bigcup_{i=1}^{n} A^2 \{G, C\}^i T^2$. Then $X$, and so $X^+$, is θ-comma-free.
PROPOSITION 4.4. Choose distinct $a, b, c \in \Sigma$ such that $\theta(a) \ne b, c$ and $\theta(a) \ne a$. Let $X = \bigcup_{i=1}^{\infty} a^m (\Sigma^{m-1} c)^i b^m$ for some $m \ge 2$. Then $X$, and so $X^+$, is strictly θ-comma-free. The entropy of $X^+$ is such that $\log(p^{\frac{m-1}{m}}) < \hbar(X^+) < \log((p^{m-1} + 1)^{\frac{1}{m}})$.
Proof. The proof that $X$ is strictly θ-comma-free is not difficult and we just give a sketch. Suppose there are $x, x_1, x_2 \in X$ such that $s\theta(x)t = x_1 x_2$ for some $s, t \in \Sigma^+$. Let $x = a^m s_1 c\, s_2 c \cdots s_k c\, b^m$. Then $\theta(x)$ is either $\theta(a^m)\theta(s_1 c \cdots s_k c)\theta(b^m)$ or $\theta(b^m)\theta(s_1 c \cdots s_k c)\theta(a^m)$, which cannot be a proper subword of $x_1 x_2$ for any $x_1, x_2 \in X$. Hence $X$ is θ-comma-free.
Let $A = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 3m\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ is the set of edges and $\lambda : E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:
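\[
\lambda(i, s, j) =
\begin{cases}
a, & \text{for } 1 \le i \le m,\ j = i + 1 \\
b, & \text{for } 2m \le i \le 3m - 1,\ j = i + 1, \text{ and } i = 3m,\ j = 1 \\
c, & \text{for } i = 2m,\ j = m + 1 \\
s, & \text{for } m + 1 \le i \le 2m - 1,\ j = i + 1,\ s \in \Sigma
\end{cases}
\]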
Note that this automaton is not deterministic, but it has delay 1, hence the entropy of $X^+$ can be obtained from its adjacency matrix. Let $A$ be the adjacency matrix of this automaton. The characteristic equation for $A$ is $-\mu^{3m} + \mu^{2m} p^{m-1} + p^{m-1} = 0$. This implies $p^{m-1} = \frac{\mu^{3m}}{\mu^{2m} + 1} = \mu^m - \frac{\mu^m}{\mu^{2m} + 1}$. Since $p$ is an integer and $0 < \frac{\mu^m}{\mu^{2m} + 1} < 1$, we have $p^{\frac{m-1}{m}} < \mu < (p^{m-1} + 1)^{\frac{1}{m}}$. □
For the DNA alphabet, $p = 4$ and for $m = 2$ the above characteristic equation becomes $\mu^6 - 4\mu^4 - 4 = 0$. Solving for $\mu$, the largest real value of $\mu$ is 2.055278539. Hence the entropy of $X^+$ is greater than $\log 2$.
EXAMPLE 4.5. Consider $\Delta$ and $\theta = \rho\nu$, and let $m = 2$, $a = A$, $c = C$, $b = G$. Then $X = \bigcup_{i=1}^{\infty} AA(\Delta C)^i GG$ and $X^+$ are strictly θ-comma-free.
With the following propositions we consider ways to generate θ-subword-k-codes and θ-k-codes.
PROPOSITION 4.6. Let $a, b \in \Sigma$ be such that $\theta(a) = b$ and let $X = \bigcup_{i=1}^{\infty} a^{k-1} ((\Sigma \setminus \{a, b\})^{k-2} b)^i$. Then $X$ is θ-subword-k-code for $k \ge 3$. Moreover, when θ is morphic then $X$ is a θ-k-code. The entropy of $X^+$ is such that $\log((p - 2)^{\frac{k-2}{k-1}}) < \hbar(X^+) < \log(((p - 2)^{k-2} + 1)^{\frac{1}{k-1}})$.
Proof. Suppose there exists $x \in X$ such that $x = r u s \theta(u) t$ for some $r, t \in \Sigma^*$, $u, \theta(u) \in \Sigma^k$ and $s \in \Sigma^m$ for $m \ge 1$. Let $x = a^{k-1} s_1 b s_2 b \cdots s_n b$ where $s_i \in (\Sigma \setminus \{a, b\})^{k-2}$. Then the following are all possible cases for $u$:
1. $u$ is a subword of $a^{k-1} s_1$,
2. $u$ is a subword of $a s_1 b$,
3. $u$ is a subword of $s_1 b s_2$,
4. $u$ is a subword of $b s_i b$ for some $i \le n$.
In all these cases, since $\theta(a) = b$, $\theta(u)$ is not a subword of $x$. Hence $X$ is θ-subword-k-code.
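Let $A = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 2k - 2\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:
\[
\lambda(i, s, j) =
\begin{cases}
a, & \text{for } 1 \le i \le k - 1,\ j = i + 1 \\
b, & \text{for } i = 2k - 2,\ j = k, \text{ and } i = 2k - 2,\ j = 1 \\
s, & \text{for } k \le i \le 2k - 3,\ j = i + 1,\ s \in \Sigma \setminus \{a, b\}
\end{cases}
\]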
This automaton has delay 1. Let $A$ be the adjacency matrix of this automaton. The characteristic equation for $A$ is $(-\mu)^{2k-2} - (p - 2)^{k-2} \mu^{k-1} - (p - 2)^{k-2} = 0$. So $(p - 2)^{k-2} = \mu^{k-1} - \frac{\mu^{k-1}}{\mu^{k-1} + 1}$. Since $0 < \frac{\mu^{k-1}}{\mu^{k-1} + 1} < 1$, we have $(p - 2)^{\frac{k-2}{k-1}} < \mu < ((p - 2)^{k-2} + 1)^{\frac{1}{k-1}}$. □
For the DNA alphabet, $p = 4$ and for $k = 3$ the above characteristic equation becomes $\mu^4 - 2\mu^2 - 2 = 0$. Solving for $\mu$, the largest real value is 1.6528, which is greater than the golden mean (1.618) but less than 2. The asymptotic value for $\mu$ is 2 when $k$ approaches infinity.
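The numerical values quoted for the DNA alphabet can be reproduced from the characteristic equations of Propositions 4.2, 4.4 and 4.6. Each of these polynomials has a single sign change in its coefficients, so by Descartes' rule of signs it has exactly one positive root, which plain bisection on $[0, p]$ locates. The sketch below is one way to do this; the function names are illustrative and bisection is a choice made here, not the method used in the paper.

```python
def positive_root(f, lo, hi, tol=1e-12):
    """Bisection for the single sign change of f on [lo, hi]; needs f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mu_prop_4_2(p, m):
    # mu^(2m+1) - (p-2) mu^(2m) - (p-2) = 0
    return positive_root(lambda x: x**(2*m + 1) - (p - 2) * x**(2*m) - (p - 2), 0.0, p)

def mu_prop_4_4(p, m):
    # mu^(3m) - p^(m-1) mu^(2m) - p^(m-1) = 0
    c = p**(m - 1)
    return positive_root(lambda x: x**(3*m) - c * x**(2*m) - c, 0.0, p)

def mu_prop_4_6(p, k):
    # mu^(2k-2) - (p-2)^(k-2) mu^(k-1) - (p-2)^(k-2) = 0
    c = (p - 2)**(k - 2)
    return positive_root(lambda x: x**(2*k - 2) - c * x**(k - 1) - c, 0.0, p)

print(round(mu_prop_4_2(4, 1), 4))   # Proposition 4.2 with p = 4, m = 1
print(round(mu_prop_4_4(4, 2), 9))   # Proposition 4.4 with p = 4, m = 2
print(round(mu_prop_4_6(4, 3), 4))   # Proposition 4.6 with p = 4, k = 3
```

Since the entropy is $\log \mu$, it exceeds $\log 2$ exactly when the computed root exceeds 2, which is the case for the first two constructions and not for the third at $k = 3$.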
EXAMPLE 4.7. Consider $\Delta$ with $\theta = \rho\nu$ and choose $k = 3$. Then $X = \bigcup_{i=1}^{\infty} AA(\{G, C\}T)^i$ is θ-subword-3-code.
As other authors have observed, it is easy to get a θ-k-code if one of the symbols in the alphabet is completely ignored in the construction of the code $X$.
PROPOSITION 4.8. Assume that $\theta(a) \ne a$ for all symbols $a \in \Sigma$. Let $b, c \in \Sigma$ be such that $\theta(b) = c$ and let $X = \bigcup_{i=1}^{\infty} a^{k-1} ((\Sigma \setminus \{c\})^{k-2} b)^i$ for $k \ge 3$. Then $X$ and $X^+$ are θ-k-codes. The entropy of $X^+$ is such that $\log((p - 1)^{\frac{k-2}{k-1}}) < \hbar(X^+) < \log(((p - 1)^{k-2} + 1)^{\frac{1}{k-1}})$.
Proof. The fact that $X^+$ is a θ-k-code is straightforward, since every subword of $x \in X$ of length $k$ is either a power of $a$ or contains the symbol $b$. Let $A = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 2k - 2\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:
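\[
\lambda(i, s, j) =
\begin{cases}
a, & \text{for } 1 \le i \le k-1,\ j = i+1, \\
b, & \text{for } i = 2k-2,\ j = k, \text{ and } i = 2k-2,\ j = 1, \\
s, & \text{for } k \le i \le 2k-3,\ j = i+1,\ s \in \Sigma \setminus \{c\}.
\end{cases}
\]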
This automaton has delay 1. Let A be the adjacency matrix of this automaton. Its characteristic equation is $\mu^{2k-2} - \mu^{k-1}(p-1)^{k-2} - (p-1)^{k-2} = 0$. This implies
\[
(p-1)^{k-2} = \mu^{k-1} - \frac{\mu^{k-1}}{\mu^{k-1}+1}.
\]
We are interested in the largest real value for µ. Since µ > 0, we have $0 < \frac{\mu^{k-1}}{\mu^{k-1}+1} < 1$ [...] 2. Hence the entropy of $X^+$ in this case is greater than log 2.
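As an illustrative check of the bounds in Proposition 4.8 for the DNA alphabet, the characteristic equation can be solved in closed form. The worked calculation below is our own and is not taken from the text; the shorthand $c$ and $y$ is introduced only for this purpose, and the decimal values are rounded.

```latex
% Shorthand (ours): c = (p-1)^{k-2}, y = \mu^{k-1}.
\[
\mu^{2k-2} - c\,\mu^{k-1} - c = 0
\;\Longleftrightarrow\;
y^2 - c\,y - c = 0
\;\Longrightarrow\;
\mu = \left( \frac{c + \sqrt{c^2 + 4c}}{2} \right)^{\frac{1}{k-1}} .
\]
% DNA alphabet, p = 4:
\[
k = 3:\quad c = 3,\quad \mu = \left( \tfrac{3 + \sqrt{21}}{2} \right)^{\frac{1}{2}} \approx 1.947 < 2,
\qquad
k = 4:\quad c = 9,\quad \mu = \left( \tfrac{9 + \sqrt{117}}{2} \right)^{\frac{1}{3}} \approx 2.148 > 2 .
\]
% The lower bound of Proposition 4.8 already exceeds 2 once k >= 4:
\[
(p-1)^{\frac{k-2}{k-1}} = 3^{\frac{k-2}{k-1}} \ge 3^{\frac{2}{3}} \approx 2.080 > 2
\quad \text{for } k \ge 4 .
\]
```

Under these assumptions the entropy exceeds log 2 for p = 4 whenever k ≥ 4, while k = 3 falls just short.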
PROPOSITION 4.9. Assume that θ(a) ≠ a for all symbols a ∈ Σ. Let b, c ∈ Σ such that θ(b) = c and let $X = \bigcup_{i=1}^{\infty} ((\Sigma \setminus \{c\})^{k-1} b)^i$ for k ≥ 3. Then X and $X^+$ are θ-k-codes. The entropy of $X^+$ is such that $\hbar(X^+) = \log((p-1)^{\frac{k-1}{k}})$.

Proof. The fact that $X^+$ is a θ-k-code is straightforward, since every subword of x ∈ X of length k contains the symbol b. Let $\mathcal{A} = (V, E, \lambda)$ be an automaton that recognizes $X^+$ where V = {1, ..., k} is the set of vertices, E ⊆ V × Σ × V and λ : E → Σ (with (i, s, j) ↦ s) is the labeling function defined in the following way:
\[
\lambda(i, s, j) =
\begin{cases}
b, & \text{for } i = k,\ j = 1, \\
s, & \text{for } 1 \le i \le k-1,\ j = i+1,\ s \in \Sigma \setminus \{c\}.
\end{cases}
\]
Let A be the adjacency matrix of this automaton. Its characteristic equation is $\mu^k - (p-1)^{k-1} = 0$. This implies that $\mu = (p-1)^{\frac{k-1}{k}}$. □

For the DNA alphabet, p = 4 and for k = 5 the above estimate says that $\mu = 3^{\frac{4}{5}}$. Hence the entropy of $X^+$ in this case is greater than log 2.

The above propositions show that in most cases we can generate a “good” set of code words that carries enough entropy to encode bit strings by a simple symbol → symbol map. The following proposition provides conditions under which a given set of strings can be used as a base to obtain good DNA code words by injectively mapping bits → DNA words.

PROPOSITION 4.10. Let Σ1 and Σ2 be finite alphabet sets and let f be an injective morphism or antimorphism from Σ1 to $\Sigma_2^*$. Let X be a code over $\Sigma_1^*$. Then f(X) is a code over $\Sigma_2^*$. Let $\theta_1 : \Sigma_1^* \to \Sigma_1^*$ and $\theta_2 : \Sigma_2^* \to \Sigma_2^*$ be both morphisms or antimorphisms respectively such that $f(\theta_1(x)) = \theta_2(f(x))$. Let $P = \mathrm{Pref}(\theta_2(f(X)))$ and $S = \mathrm{Suff}(\theta_2(f(X)))$.
1. Let $(A^+ P \cup S A^+) \cap f(\Sigma_1^+) = \emptyset$ and $A^+ P A^+ \cap f(\Sigma_1) = \emptyset$ where $A = \Sigma_2^* \setminus f(\Sigma_1^*)$. If X is strictly θ1-infix (comma-free) then f(X) is strictly θ2-infix (comma-free).
2. Let r be such that |f(a)| = r for all a ∈ Σ1 and $\Sigma_2^+ \mathrm{Sub}_{(k-1)r}(f(X)) \Sigma_2^+ \cap \mathrm{Sub}_{kr}(f(X)) = \emptyset$.
a) If X is a θ1-subword-k-code then f(X) is a θ2-subword-(k + 1)r-code.
b) If X is a θ1-k-code then f(X) is a θ2-(k + 1)r-code.

Proof: 1. Let X be θ1-infix. Suppose f(X) is not θ2-infix. Then there exist x, y ∈ f(X) such that $x = a\theta_2(y)b$ for some $a, b \in \Sigma_2^*$, and $x_1, y_1 \in X$ such that $x = f(x_1)$ and $y = f(y_1)$. Suppose $a, b \in f(\Sigma_1^*)$; then there exist $a_1, b_1 \in \Sigma_1^*$ such that $f(x_1) = f(a_1) f(\theta_1(y_1)) f(b_1)$, which implies $x_1 = a_1 \theta_1(y_1) b_1$. This is a contradiction with the second hypothesis in 1. Suppose $a \notin f(\Sigma_1^*)$. Then either there exists $y' \in P$ such that $ay' \in f(\Sigma_1^+)$ or there exists a $b'$ such that $b = b'b''$ with $a\theta_2(y)b' \in f(\Sigma_1)$. This is a contradiction with the first hypothesis in 1. Hence f(X) is θ2-infix. A similar proof works when X is θ1-comma-free.

2. Suppose f(X) is not a θ2-(k + 1)r-code. Then there are $x, y \in \mathrm{Sub}_{(k+1)r}(f(X))$ such that $x = \theta_2(y)$. First assume $x, y \in f(\Sigma_1^+)$ and there are $x_1, y_1 \in \mathrm{Sub}_{k+1}(X)$ such that $f(x_1) = x$ and $f(y_1) = y$. Hence $f(x_1) = \theta_2(f(y_1))$, which implies $x_1 = \theta_1(y_1)$. This contradicts X being a θ1-k-code. Otherwise, if $x = cx'd$ such that $x' \in f(\Sigma_1^+)$ and c, d ∈ A, then $|x'| \le kr$ and $x' = f(x_1)$ for some $x_1 \in \mathrm{Sub}_k(X)$. Similarly $y = ey'g$ such that e, g ∈ A and $y' = f(y_1)$ for some $y_1 \in \mathrm{Sub}_k(X)$. If $x' = \theta_2(y')$ then $f(x_1) = \theta_2(f(y_1))$, which implies $x_1 = \theta_1(y_1)$, a contradiction. Suppose $x' \neq \theta_2(y')$; then $x'$ is a subword of $ey'g$, which contradicts $\Sigma_2^+ \mathrm{Sub}_{(k-1)r}(f(X)) \Sigma_2^+ \cap \mathrm{Sub}_{kr}(f(X)) = \emptyset$. Hence f(X) is a θ2-(k + 1)r-code. A similar proof works for θ1-subword-k-codes. □
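The compatibility hypothesis $f(\theta_1(x)) = \theta_2(f(x))$ of Proposition 4.10 can be tested by brute force on short words. The Python sketch below uses an example of our own choosing, not one from the paper: binary words are mapped to DNA by the hypothetical block code f(0) = AC, f(1) = GT (so r = 2), θ2 is assumed to be the Watson-Crick involution (reverse complement), and θ1 is the antimorphic involution that swaps 0 and 1.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
SWAP = {"0": "1", "1": "0"}
F = {"0": "AC", "1": "GT"}  # hypothetical block code with |f(a)| = r = 2


def theta2(w: str) -> str:
    """Antimorphic involution on DNA words: reverse complement."""
    return "".join(COMPLEMENT[ch] for ch in reversed(w))


def theta1(w: str) -> str:
    """Antimorphic involution on binary words: swap 0 and 1, then reverse."""
    return "".join(SWAP[ch] for ch in reversed(w))


def f(w: str) -> str:
    """Morphism from binary words to DNA words induced by F."""
    return "".join(F[ch] for ch in w)


def commutes(max_len: int) -> bool:
    """Check f(theta1(w)) == theta2(f(w)) for all binary words up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            if f(theta1(w)) != theta2(f(w)):
                return False
    return True


print(commutes(8))
```

The check works here because the two letter images are reverse complements of each other, so the letter swap on the binary side mirrors the Watson-Crick involution on the DNA side. It only tests the commutation hypothesis; the intersection conditions in parts 1 and 2 of the proposition would need to be verified separately.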
5. Experimental Results
Using the methods provided in the previous section, a set of 20 sequences of length 20 base pairs was designed and tested experimentally. Among these, 10 were θ-5-codes (designed by methods from Proposition 4.9), five were θ-subword-3-codes (Proposition 4.6) and the remaining five were θ-comma-free codes (Proposition 4.4). Purified oligonucleotides were purchased from Integrated DNA Technologies. In addition to the 20 sequences, the Watson-Crick complement of one of the sequences (K1) was ordered to act as a control. The sequences K1-K10 are the θ-5-codes; S1-S5 and F1-F5 are, respectively, the θ-subword-3-codes and the θ-comma-free codes. Non-crosshybridization and secondary structure were measured by running the sequences on TAE polyacrylamide non-denaturing gels (15%) at 4 °C. The annealing reaction was done by heating the molecules at 95 °C for 5 minutes and cooling them down to room temperature.

Table 1: Sequences
Name   Sequence
K1     aatacatcacatttctaccc
K2     actactacacacctcttacc
K3     atcaccacccatcacactac
K4     ttaccatctctatacatctc
K5     ctatctattcctctcacatc
K6     tacacataactccactcatc
K7     tcccacatcccatactaatc
K8     ctaacatacctacacactac
K9     acctctacacacttcacaac
K10    tatacatacctcaaccactc
S1     aagtgtctgtctctgtctgt
S2     aactgtctctgtctgtctct
S3     aagtctgtgtctctctgtct
S4     aagtctctgtgtctctgtgt
S5     aactctgtctgtgtctctct
F1     aagcggaatctcggaaacgg
F2     aatcggaagcgcgcaaacgg
F3     aaccggaatcacgcaatcgg
F4     aaacggaaacacggaagcgg
F5     aatcggaatctcggaagcgg
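The combinatorial properties claimed for the K and S sequences can be checked directly on the Table 1 data. The Python sketch below is ours and rests on two stated assumptions: θ is taken to be the Watson-Crick involution (reverse complement), and the two properties are read as in the proofs of Section 4, namely that no length-k subword of the set is the θ-image of another (θ-k-code), and that no word contains u and θ(u) with |u| = k separated by at least one symbol (θ-subword-k-code). The function names are hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def theta(w: str) -> str:
    """Assumed Watson-Crick involution: reverse complement."""
    return "".join(COMPLEMENT[ch] for ch in reversed(w))


def is_theta_k_code(words, k: int) -> bool:
    """True if no length-k subword of the set is theta of a length-k subword."""
    kmers = {w[i:i + k] for w in words for i in range(len(w) - k + 1)}
    return all(theta(u) not in kmers for u in kmers)


def is_theta_subword_k_code(words, k: int) -> bool:
    """True if no word has the form r u s theta(u) t with |u| = k and |s| >= 1."""
    for w in words:
        for i in range(len(w) - k + 1):
            target = theta(w[i:i + k])
            # theta(u) must start at least one symbol after the end of u.
            if w.find(target, i + k + 1) != -1:
                return False
    return True


K = [
    "aatacatcacatttctaccc", "actactacacacctcttacc", "atcaccacccatcacactac",
    "ttaccatctctatacatctc", "ctatctattcctctcacatc", "tacacataactccactcatc",
    "tcccacatcccatactaatc", "ctaacatacctacacactac", "acctctacacacttcacaac",
    "tatacatacctcaaccactc",
]
S = [
    "aagtgtctgtctctgtctgt", "aactgtctctgtctgtctct", "aagtctgtgtctctctgtct",
    "aagtctctgtgtctctgtgt", "aactctgtctgtgtctctct",
]

print("K1-K10 form a theta-5-code:", is_theta_k_code(K, 5))
print("S1-S5 are theta-subword-3-codes:", is_theta_subword_k_code(S, 3))
```

Both checks pass on this data. For K1-K10 the reason is visible by inspection: none of the sequences contains g, while the reverse complement of any 5-mer containing c must contain g. These are purely combinatorial tests and say nothing about free energy or melting temperature.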
The results of the experiment are shown in Figure 4. The top and the bottom bands on the gel are the dye.

Figure 4. A 15% acrylamide non-denaturing gel of the tested DNA codes.

Table 2: Contents of the Gel

Lane   Content
1      Ladder
2-11   K1-K10
12     All θ-comma-free codes
13     All θ-subword-3-codes
14     All θ-5-codes
15     All θ-5-codes with all θ-subword-3-codes
16     All θ-subword-3-codes with all θ-comma-free codes
17     All θ-comma-free codes with all θ-5-codes
18     Double stranded molecule (K1 with its Watson-Crick complement)
Observed result: No duplexes were detected in lane 14, which contains the annealed solution of all θ-5-codes. Their speed on the gel coincides with that of the single-stranded molecules in lanes 2-11. This shows that no cross-hybridization is observed for these strands. Also, no secondary structures were detected in any of lanes 2-11. All molecules are at the same level as the 20bp mark of the ladder in lane 1. The ladder is not clearly visible because the gel was stained for a long time in order to make the DNA visible. The double-stranded molecule in lane 18 runs at a different level than the single-stranded DNA molecules (K1-K10) in lanes 2-11. However, some cross-hybridization was observed in the θ-comma-free and θ-subword-3-codes (lanes 12, 13, 15, 16, 17).
6. Concluding remarks
In this paper we investigated theoretical properties of languages that consist of DNA based code words. In particular we concentrated on the intermolecular and intramolecular cross hybridizations that can occur when a Watson-Crick complement of a (sub)word of a code word is also a (sub)word of a code word. These conditions are necessary for the design of good codes, but may not be sufficient. For example, the algorithms used in the programs developed by Seeman (Seeman, 1990), Feldkamp (Feldkamp et.al., 2002) and Ruben (Ruben et.al., 2002) all check for uniqueness of k-length subsequences in the code words. Unfortunately, none of the properties from Definition 2.1 ensures uniqueness of k-length words. Such code word properties remain to be investigated.

The observations in Section 3 provide a general way to obtain, from a small set of code words with a desired property, arbitrarily large sets of code words with similar properties by concatenating the existing words. We hope that the general methods of designing such code words will simplify the search for “good” codes. Better characterizations of good code words that are closed under the Kleene ∗ operation may provide even faster ways of designing such code words. The most challenging question, that of characterizing and designing good θ-k-codes, remains open.

The information capacity of the codes designed by the methods from Section 4 shows that in most cases we obtain sets of code words that can encode binary strings symbol → symbol. In (Arita and Kobayashi, 2002) the authors use binary strings with certain properties to generate DNA codes (assigning bases to bits). It remains to be investigated how powerful (in the sense of information capacity) the transition from binary strings to DNA code words can be.
Our experimental results show that θ-k-codes provide a good base for developing new code words; however, the other properties, taken in isolation, were not shown to provide sufficient distinction between coded DNA strands in laboratory experiments. Our approach to the question of designing “good” DNA codes has been from the formal language theory aspect. Many issues that are involved in designing such codes have not been considered. These include (and are not limited to) the free energy conditions, melting temperature, and Hamming distance conditions. All these remain challenging problems, and a procedure that includes all or the majority of these aspects would be desirable in practice.

Acknowledgment
The first two authors have been partially supported by grants EIA-0086015 and EIA-0074808 from the National Science Foundation, USA. The third author is supported by EIA-0130385 and MCB-9980092 from the National Science Foundation, USA.
References

R.L. Adler, D. Coppersmith, M. Hassner, Algorithms for sliding block codes - an application of symbolic dynamics to information theory, IEEE Trans. Inform. Theory 29 (1983) 5-22.
M. Arita, S. Kobayashi, DNA sequence design using templates, New Generation Comput. 20 (3) (2002) 263-277. (Available as a sample paper at http://www.ohmsha.co.jp/ngc/index.htm.)
E.B. Baum, DNA sequences useful for computation, unpublished article, available at: http://www.neci.nj.nec.com/homepages/eric/seq.ps (1996).
J. Berstel, D. Perrin, Theory of codes, Academic Press, Inc., Orlando, Florida, 1985.
R.S. Braich, N. Chelyapov, C. Johnson, P.W.K. Rothemund, L. Adleman, Solution of a 20-variable 3-SAT problem on a DNA computer, Science 296 (2002) 499-502.
R. Deaton, J. Chen, H. Bi, M. Garzon, H. Rubin, D.F. Wood, A PCR-based protocol for in vitro selection of non-crosshybridizing oligonucleotides, DNA Computing: Proceedings of the 8th International Meeting on DNA Based Computers (M. Hagiya, A. Ohuchi editors), Springer LNCS 2568 (2003) 196-204.
R. Deaton et al., A DNA based implementation of an evolutionary search for good encodings for DNA computation, Proc. IEEE Conference on Evolutionary Computation ICEC-97 (1997) 267-271.
D. Faulhammer, A.R. Cukras, R.J. Lipton, L.F. Landweber, Molecular computation: RNA solutions to chess problems, Proceedings of the National Academy of Sciences, USA 97 (4) (2000) 1385-1389.
U. Feldkamp, S. Saghafi, H. Rauhe, DNASequenceGenerator - A program for the construction of DNA sequences, DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers (N. Jonoska, N.C. Seeman editors), Springer LNCS 2340 (2002) 23-32.
M. Garzon, R. Deaton, D. Renault, Virtual test tubes: a new methodology for computing, Proc. 7th Int. Symposium on String Processing and Information Retrieval, A Coruña, Spain, IEEE Computing Society Press (2000) 116-121.
T. Head, Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors, Bull. Math. Biology 49 (1987) 737-759.
T. Head, Gh. Paun, D. Pixton, Language theory and molecular genetics, Handbook of formal languages, Vol. II (G. Rozenberg, A. Salomaa editors), Springer Verlag (1997) 295-358.
S. Hussini, L. Kari, S. Konstantinidis, Coding properties of DNA languages, Theoretical Computer Science 290 (2003) 1557-1579.
N. Jonoska, D. Kephart, K. Mahalingam, Generating DNA code words, Congressus Numerantium 156 (2002) 99-110.
N. Jonoska, K. Mahalingam, Languages of DNA based code words, Proceedings of the 9th International Meeting on DNA Based Computers (J. Chen, J. Reif editors), Springer LNCS 2943 (2004) 61-73.
N. Jonoska, K. Mahalingam, Methods for constructing coded DNA languages, Aspects of Molecular Computing (N. Jonoska, G. Paun, G. Rozenberg editors), Springer LNCS 2950 (2004) 241-253.
L. Kari, S. Konstantinidis, E. Losseva, G. Wozniak, Sticky-free and overhang-free DNA languages, Acta Informatica 40 (2003) 119-157.
L. Kari, S. Konstantinidis, P. Sosik, On properties of bond-free languages, submitted.
M.S. Keane, Ergodic theory and subshifts of finite type, in Ergodic theory, symbolic dynamics and hyperbolic spaces (T. Bedford et al. editors), Oxford Univ. Press, Oxford, 1991, pp. 35-70.
D. Kephart, J. Lefevre, Codegen: The generation and testing of DNA code words, submitted.
Z. Li, Construct DNA code words using backtrack algorithm, preprint (2002).
D. Lind, B. Marcus, An introduction to Symbolic Dynamics and Coding, Cambridge University Press, Cambridge, United Kingdom, 1999.
Q. Liu et al., DNA computing on surfaces, Nature 403 (2000) 175-179.
A. Marathe, A.E. Condon, R.M. Corn, On combinatorial word design, Preliminary Preproceedings of the 5th International Meeting on DNA Based Computers, Boston (1999) 75-88.
Gh. Paun, G. Rozenberg, A. Salomaa, DNA Computing, new computing paradigms, Springer Verlag, 1998.
A.J. Ruben, S.J. Freeland, L.F. Landweber, PUNCH: An evolutionary algorithm for optimizing bit set selection, DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers (N. Jonoska, N.C. Seeman editors), Springer LNCS 2340 (2002) 150-160.
N.C. Seeman, De Novo design of sequences for nucleic acid structural engineering, J. of Biomolecular Structure & Dynamics 8 (3) (1990) 573-581.