Natural Computing (2005) 4: 141–162 DOI: 10.1007/s11047-004-4009-9

© Springer 2005

Involution codes: with application to DNA coded languages

NATAŠA JONOSKA1,*, KALPANA MAHALINGAM1 and JUNGHUEI CHEN2

1 Department of Mathematics, University of South Florida, Tampa FL 33620, USA (*Author for correspondence, e-mail: [email protected]); 2 Department of Chemistry and Biochemistry, University of Delaware, Newark DE 19716, USA

Abstract. For an involution $\theta: \Sigma^* \to \Sigma^*$ over a finite alphabet $\Sigma$ we consider involution codes: $\theta$-infix, $\theta$-comma-free, $\theta$-$k$-codes and $\theta$-subword-$k$-codes. These codes arise from questions on DNA strand design. We investigate conditions under which both $X$ and $X^+$ are the same type of involution code. General methods for generating such involution codes are given. The information capacity of these codes is shown to be optimized in most cases. A specific set of these codes was chosen for experimental testing and the results of these experiments are presented.

Key words: codes, DNA codes, Watson–Crick involution

1. Introduction

In bio-molecular computing, and in particular in DNA based computations and DNA nanotechnology, one of the main problems is associated with the design of oligonucleotides such that mismatched pairing due to the Watson–Crick complementarity is minimized. In laboratory experiments non-specific hybridizations pose potential problems for the results of the experiment. Many authors have addressed this problem and proposed various solutions. A common approach has been to use the Hamming distance as a measure for uniqueness (Baum, unpublished; Deaton et al., 1997; Faulhammer et al., 2000; Garzon et al., 2000; Liu et al., 2003). Deaton et al. (Deaton et al., 1997; Garzon et al., 2000) used genetic algorithms to generate a set of DNA sequences that satisfy a pre-determined Hamming distance. Marathe et al. (1999) also used Hamming distance to compute combinatorial bounds on DNA sequences, and they used dynamic programming for the design of the strands used in Liu et al. (2003). Seeman's program (Seeman, 1990) generates sequences by testing overlapping subsequences to enforce uniqueness.


This program is designed for producing sequences that are suitable for complex three-dimensional DNA structures, and the generation of suitable sequences is not as automatic as in the other proposed programs. Feldkamp et al. (2002) also use a test for uniqueness of subsequences and rely on tree structures in generating new sequences. Ruben et al. (2002) use a random generator for initial sequence design, and afterwards check for unique subsequences with predetermined properties based on Hamming distance. One of the first theoretical observations about the number of DNA code words satisfying minimal Hamming distance properties was made by Baum (Baum, unpublished). Experimental separation of "good" codes that avoid intermolecular cross hybridization from a big pool of random strands was reported in Deaton et al. (2003). In Hussini et al. (2003), the authors introduce a theoretical approach to the problem of designing code words. Based on these ideas and code-theoretic properties, a computer program for generating code words is being developed (Jonoska et al., 2002). Another algorithm for generating such code words, based on backtracking, was developed by Li (Li, preprint). Every bio-molecular protocol involving DNA or RNA generates molecules whose sequences of nucleotides form a language over the four letter alphabet $\Delta = \{A, G, C, T\}$. The Watson–Crick complementarity of the nucleotides defines a natural involution mapping $\theta$, with $A \mapsto T$ and $G \mapsto C$, which is an antimorphism of $\Delta^*$. Undesirable Watson–Crick bonds (undesirable hybridizations) can be avoided if the language satisfies certain coding properties. In particular, for DNA code words, no involution of a word is a subword of another word, and no involution of a word is a subword of a composition of two words. These properties are called $\theta$-infix and $\theta$-comma-free, respectively. The case when a DNA strand may form a hairpin (i.e., when a word contains a reverse complement of a subword) was introduced in Jonoska et al. (2002) and was called $\theta$-subword-$k$-code. For words representing DNA sequences we use the following convention. A word $u$ over $\Delta$ denotes a DNA strand in its $5' \to 3'$ orientation. The Watson–Crick complement of the word $u$, also in orientation $5' \to 3'$, is denoted by $\bar{u}$. For example, if $u = AGGC$ then $\bar{u} = GCCT$. There are two types of unwanted hybridizations: intramolecular and intermolecular. An intramolecular hybridization happens when two sequences, one being a reverse complement of the other, appear within the same DNA strand (see Figure 1). In this case the DNA strand forms a hairpin.


Two particular intermolecular hybridizations are of interest (see Figure 2). In Figure 2(a) the strand labeled $u$ is a reverse complement of a subsequence of the strand labeled $v$, and Figure 2(b) represents the case when $u$ is the reverse complement of a portion of a concatenation of $v$ and $w$. We start the paper with definitions of languages with coding properties that avoid intermolecular and intramolecular cross-hybridizations. The definitions of $\theta$-infix and $\theta$-comma-free languages are the same as the ones introduced in Hussini et al. (2003). Here we also consider intramolecular hybridizations and subword hybridizations. Hence, we have two additional coding properties: $\theta$-subword-$k$-code and $\theta$-$k$-code. In Jonoska and Mahalingam (2004) necessary and sufficient conditions for preserving these properties under splicing were obtained. Here we make several observations about the closure properties of the code word languages. In particular, we concentrate on properties of languages that are preserved under concatenation. If a set of DNA strands has "good" coding properties that are preserved under concatenation, then the same properties will be preserved under arbitrary ligation of the strands. Section 3 provides necessary and sufficient conditions for a finite set of words to generate (by concatenations) an infinite set of code words. In practice, besides their use in ligation, these conditions provide a way to generate new "good" code words starting from a small set of initial "good" code words, and as such might facilitate the otherwise difficult task of strand design. Section 4 describes several ways to generate "good" code words. These sets of code words also provide sufficient informational entropy such that they can be used to encode binary strings by a bit-to-symbol mapping. Sets of molecules obtained by the described methods were tested for cross-hybridization experimentally. The results are shown in Section 5. The experiments showed that the designed $\theta$-$k$-codes had no visible cross hybridizations. The other two sets of code words ($\theta$-comma-free, $\theta$-subword-$k$-code) avoided one specific way of annealing, but in general, stronger properties like the $\theta$-$k$-code are needed for reliable experiments. We end with a few concluding remarks.


Figure 1. Intramolecular hybridization ($\theta$-subword-$k$-code): (a) the reverse complement is at the beginning of the 5′ end, (b) the reverse complement is at the end of the 3′ end. In the depicted configuration the strand is $w = u v \bar{u} x$ with $|u| = k$ and $|v| = m$. The 3′ end of the DNA strand is indicated with an arrow.



Figure 2. Two types of intermolecular hybridization: (a) ($\theta$-infix) one code word is a reverse complement of a subword of another code word, (b) ($\theta$-comma-free) a code word is a reverse complement of a subword of a concatenation of two other code words. The 3′ end is indicated with an arrow.

2. Definitions

An alphabet $\Sigma$ is a finite non-empty set of symbols. We will denote by $\Delta$ the special case when the alphabet is $\{A, G, C, T\}$, representing the DNA nucleotides. A word $u$ over $\Sigma$ is a finite sequence of symbols in $\Sigma$. We denote by $\Sigma^*$ the set of all words over $\Sigma$, including the empty word $1$, and by $\Sigma^+$ the set of all non-empty words over $\Sigma$. We note that with word concatenation, $\Sigma^*$ is the free monoid and $\Sigma^+$ is the free semigroup generated by $\Sigma$. The length of a word $u = a_1 \cdots a_n$ is $n$ and is denoted by $|u|$. Throughout the rest of the paper we concentrate on sets $X \subseteq \Sigma^+$ that are codes, meaning that every word in $X^+$ can be written uniquely as a product of words in $X$ (i.e., $X^+$ is a free semigroup generated by $X$). For background on codes we refer the reader to Berstel and Perrin (1985). We will need the following definitions:

$\mathrm{Pref}(w) = \{u \mid \exists v \in \Sigma^*,\ uv = w\}$,
$\mathrm{Suff}(w) = \{u \mid \exists v \in \Sigma^*,\ vu = w\}$,
$\mathrm{Sub}(w) = \{u \mid \exists v_1, v_2 \in \Sigma^*,\ v_1 u v_2 = w\}$,
$\mathrm{PPref}(w) = \mathrm{Pref}(w) \setminus \{w\}$,
$\mathrm{PSuff}(w) = \mathrm{Suff}(w) \setminus \{w\}$,
$R_X(w) = \{x \in \Sigma^* \mid wx \in X\}$,
$L_X(w) = \{x \in \Sigma^* \mid xw \in X\}$,

where $R_X(w)$ and $L_X(w)$ are respectively the right and the left context of $w$ in $X$. We extend these definitions to the set of prefixes, suffixes and subwords of a set of words. Similarly, we have $\mathrm{Suff}_k(w) = \mathrm{Suff}(w) \cap \Sigma^k$, $\mathrm{Pref}_k(w) = \mathrm{Pref}(w) \cap \Sigma^k$ and $\mathrm{Sub}_k(w) = \mathrm{Sub}(w) \cap \Sigma^k$. We follow the definitions initiated in Hussini et al. (2003) and used in Jonoska and Mahalingam (2004), Jonoska et al. (2002), Kari et al. (2003).
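The word operations above translate directly into code. The following Python sketch (our own illustration, not from the paper; the function names simply mirror the notation) realizes $\mathrm{Pref}$, $\mathrm{Suff}$, $\mathrm{Sub}$ and their length-$k$ restrictions on strings:

```python
def pref(w):
    """Pref(w): all prefixes of w, including the empty word '' and w itself."""
    return {w[:i] for i in range(len(w) + 1)}

def suff(w):
    """Suff(w): all suffixes of w, including '' and w."""
    return {w[i:] for i in range(len(w) + 1)}

def sub(w):
    """Sub(w): all subwords (factors) of w."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

def ppref(w):
    """PPref(w) = Pref(w) \\ {w}."""
    return pref(w) - {w}

def psuff(w):
    """PSuff(w) = Suff(w) \\ {w}."""
    return suff(w) - {w}

def sub_k(X, k):
    """Sub_k(X): length-k subwords of a set of words X."""
    return {x[i:i + k] for x in X for i in range(len(x) - k + 1)}
```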


An involution $\theta: \Sigma \to \Sigma$ of a set $\Sigma$ is a mapping such that $\theta^2$ equals the identity mapping, $\theta(\theta(x)) = x$ for all $x \in \Sigma$. The mapping $\nu: \Delta \to \Delta$ defined by $\nu(A) = T$, $\nu(T) = A$, $\nu(C) = G$, $\nu(G) = C$ is an involution on $\Delta$ and can be extended to a morphic involution of $\Delta^*$. Since the Watson–Crick complementarity appears in a reverse orientation, we consider another involution $\rho: \Delta^* \to \Delta^*$ defined inductively by $\rho(s) = s$ for $s \in \Delta$ and $\rho(us) = \rho(s)\rho(u) = s\rho(u)$ for all $s \in \Delta$ and $u \in \Delta^*$. This involution is an antimorphism, i.e., $\rho(uv) = \rho(v)\rho(u)$. The Watson–Crick complementarity then is the antimorphic involution obtained as the composition $\nu\rho = \rho\nu$. Hence for a DNA strand $u$ we have $\rho\nu(u) = \nu\rho(u) = \bar{u}$. The involution $\rho$ reverses the order of the letters in a word and as such is used in the rest of the paper. For the general case, we concentrate on morphic and antimorphic involutions of $\Sigma^*$, which we denote by $\theta$. The notions of $\theta$-infix and $\theta$-comma-free in items 3 and 4 of Definition 2.1 below were called $\theta$-compliant and $\theta$-free respectively in Hussini et al. (2003), Jonoska et al. (2002), Jonoska and Mahalingam (2004), Kari et al. (2003). Here we use the notions of $\theta$-infix and $\theta$-comma-free since, when $\theta$ is the identity mapping, these notions correspond to infix code and comma-free code. Recall that $X \subseteq \Sigma^*$ is an infix code if $X \cap (\Sigma^* X \Sigma^+ \cup \Sigma^+ X \Sigma^*) = \emptyset$ and $X \subseteq \Sigma^*$ is a comma-free code if $X^2 \cap \Sigma^+ X \Sigma^+ = \emptyset$. Various other intermolecular possibilities for cross hybridizations were considered in Kari et al. (2003) (see Figure 3). All of these properties are included in the $\theta$-$k$-code (item 5 of Definition 2.1).

Definition 2.1. Let $\theta: \Sigma^* \to \Sigma^*$ be a morphic or antimorphic involution.

1. The set $X$ is called a $\theta$-subword-$k$-$m$-code if for all $u \in \Sigma^*$ such that $|u| = k$ we have $\Sigma^* u \Sigma^i \theta(u) \Sigma^* \cap X = \emptyset$ for all $1 \le i \le m$.
2. The set $X$ is called a $\theta$-subword-$k$-code if for all $u \in \Sigma^*$ such that $|u| = k$ we have $\Sigma^* u \Sigma^i \theta(u) \Sigma^* \cap X = \emptyset$ for all $i \ge 1$.
3. We say that $X$ is $\theta$-infix if $\Sigma^* \theta(X) \Sigma^+ \cap X = \emptyset$ and $\Sigma^+ \theta(X) \Sigma^* \cap X = \emptyset$.
4. The set $X$ is called $\theta$-comma-free if $X^2 \cap \Sigma^+ \theta(X) \Sigma^+ = \emptyset$.
5. The set $X$ is called a $\theta$-$k$-code for some $k > 0$ if $\mathrm{Sub}_k(X) \cap \mathrm{Sub}_k(\theta(X)) = \emptyset$.
6. The set $X$ is called strictly $\theta$ if $X' \cap \theta(X') = \emptyset$, where $X' = X \setminus \{1\}$.

The notions of $\theta$-prefix and $\theta$-suffix (subword) codes can be defined naturally from the notions described above, but since this paper does not investigate these properties separately, we do not list the formal definitions here.
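To make the definitions concrete, here is a naive brute-force check of the $\theta$-infix, $\theta$-comma-free and $\theta$-$k$-code conditions for a small finite set over the DNA alphabet, with $\theta = \rho\nu$ the Watson–Crick involution. This is our own illustrative sketch (quadratic-time string matching, nothing optimized), not an implementation used by the authors:

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def theta(u):
    """Antimorphic Watson-Crick involution: reverse complement."""
    return "".join(COMPLEMENT[s] for s in reversed(u))

def is_theta_infix(X):
    """Definition 2.1(3): no theta(y) is a factor of x with a non-empty context."""
    for x in X:
        for y in X:
            t = theta(y)
            for i in range(len(x) - len(t) + 1):
                if x[i:i + len(t)] == t and (i > 0 or i + len(t) < len(x)):
                    return False
    return True

def is_theta_comma_free(X):
    """Definition 2.1(4): X^2 contains no theta(y) with non-empty context on both sides."""
    for w in (x1 + x2 for x1 in X for x2 in X):
        for y in X:
            t = theta(y)
            if any(w[i:i + len(t)] == t for i in range(1, len(w) - len(t))):
                return False
    return True

def sub_k(X, k):
    return {x[i:i + k] for x in X for i in range(len(x) - k + 1)}

def is_theta_k_code(X, k):
    """Definition 2.1(5): Sub_k(X) and Sub_k(theta(X)) are disjoint."""
    return sub_k(X, k).isdisjoint(sub_k({theta(x) for x in X}, k))

# Two of the experimental theta-5-code words from Table 1 in Section 5
X = {"aatacatcacatttctaccc", "actactacacacctcttacc"}
print(is_theta_k_code(X, 5), is_theta_infix(X), is_theta_comma_free(X))  # True True True
```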



Figure 3. Various cross hybridizations of molecules, one of which contains a subword of length $k$ and the other its complement.

We have the following observations.

Observation 2.2. In the following we assume that $k \le \min\{|x| : x \in X\}$.

1. When $\theta$ is the identity, $X$ is an infix (comma-free) code iff $X$ is $\theta$-infix ($\theta$-comma-free).
2. When $X$ is such that $X = \theta(X)$ then $X$ is $\theta$-infix ($\theta$-comma-free) iff $X$ is infix (comma-free).
3. $X$ is strictly $\theta$-infix iff $\Sigma^* \theta(X) \Sigma^* \cap X = \emptyset$.
4. If $X$ is strictly $\theta$-comma-free then $X$ and $\theta(X)$ are strictly $\theta$-infix and $\theta(X)$ is $\theta$-comma-free.
5. If $X$ is strictly $\theta$-infix then $X^*$ is both a $\theta$-prefix and a $\theta$-suffix code.
6. If $X$ is a $\theta$-$k$-code, then $X$ is a $\theta$-$k'$-code for all $k' > k$.
7. $X$ is a $\theta$-$k$-code iff $\theta(X)$ is a $\theta$-$k$-code.
8. If $X$ is strictly $\theta$ such that $X^2$ is a $\theta$-subword-$k$-code, then $X$ is a strictly $\theta$-$k$-code.
9. If $X$ is a $\theta$-$k$-code then both $X$ and $\theta(X)$ are $\theta$-infix, $\theta$-subword-$k$-$m$-code for any $m \ge 1$, and $\theta$-prefix-$k$- and $\theta$-suffix-$k$-codes. If $k \le |x|/2$ for all $x \in X$ then $X$ is $\theta$-comma-free and hence avoids the cross hybridizations shown in Figures 1 and 2.
10. If $X$ is a $\theta$-$k$-code then $X$ and $\theta(X)$ avoid all cross hybridizations of length $k$ shown in Figure 3, and so all cross hybridizations presented in Figure 2 of Kari et al. (2003).

It is clear that the $\theta$-subword-$k$-$m$-code property implies the $\theta$-prefix-$k$-$m$-code and $\theta$-suffix-$k$-$m$-code properties. We note that when $\theta = \rho\nu$, the $\theta$-subword-$k$-$m$-code property of the code words $X \subseteq \Delta^*$ does not allow intramolecular hybridization as in Figure 1 for a pre-determined $k$ and $m$. The maximal length of a word that, together with its reverse complement, can appear as a subword of code words is limited by $k$.


The length of the hairpin, i.e., the "distance" between the word and its reversed complement, is bounded between 1 and $m$. The values of $k$ and $m$ would depend on the laboratory conditions (e.g., the melting temperature and the length of the code words). In order to avoid intermolecular hybridizations as presented in Figure 2, $X$ has to satisfy the $\theta$-infix and $\theta$-comma-free properties. Most applications would require $X$ to be strictly $\theta$. The most restrictive and valuable properties are obtained with a $\theta$-$k$-code, and the analysis of this type of code is also the most difficult. When $X$ is a $\theta$-$k$-code, all intermolecular hybridizations presented in Figure 3 are avoided. We include several observations in Section 3.

3. Generating infinite sets of code words

It is easy to note that, other than the $\theta$-comma-free codes, none of the $\theta$-infix, $\theta$-$k$-code and $\theta$-subword-$k$-code properties are closed under arbitrary concatenation. In this section we investigate the properties of a finite set of "good" code words $X$ that can generate an infinite set of code words $X^+$ with the same "good" properties. In practice, it is much easier to generate a relatively small set of code words that has certain properties (i.e., in the case of DNA or RNA, mismatched hybridization is avoided), and if we know that any concatenation of such words would also satisfy the requirements, the process of generating code words could be considerably simplified. Hence, we give necessary and sufficient conditions for $X$ such that $X^+$ is a $\theta$-(subword-)$k$-code or $\theta$-infix. The lemma below shows that if $X$ is $\theta$-comma-free, then $X^2$ is "almost" $\theta$-infix. The difference is in $\Sigma^+$ vs. $\Sigma^*$ in item 3 of Definition 2.1. All properties below refer to a finite set $X \subseteq \Sigma^+$.

Lemma 3.1. If $X$ is $\theta$-comma-free, then $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$.

Proof. Suppose the lemma does not hold. Then there are $x_1, x_2, y_1, y_2 \in X$ such that $x_1 x_2 = a\theta(y_1 y_2)b$ with $a, b \in \Sigma^+$. When $\theta$ is morphic, $x_1 x_2 = a\theta(y_1)\theta(y_2)b$, and when $\theta$ is antimorphic, we have $x_1 x_2 = a\theta(y_2)\theta(y_1)b$. In both cases $X$ would not be $\theta$-comma-free. Hence $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$. □

Note that the converse of the above need not be true. For example, consider $X = \{a^3 b, a^2 b^2\}$ with $\theta$ being the morphism $a \mapsto b$, $b \mapsto a$. Then $X^2 \cap \Sigma^+ \theta(X^2) \Sigma^+ = \emptyset$ since all words in $X^2$ are of length 8, but $X$ is not $\theta$-comma-free since $a^2 b^2 a^3 b = a^2 \theta(a^2 b^2) a b$.
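The counterexample can be verified mechanically. A small Python check (our own illustration, in the brute-force style of the earlier sketches) confirms both claims:

```python
def theta_m(w):
    """Morphic involution swapping a and b (letter-wise, no reversal)."""
    return w.translate(str.maketrans("ab", "ba"))

def occurs_properly_inside(t, w):
    """True if t occurs in w with at least one symbol on each side."""
    return any(w[i:i + len(t)] == t for i in range(1, len(w) - len(t)))

X = {"aaab", "aabb"}                      # a^3 b and a^2 b^2
X2 = {x + y for x in X for y in X}

# Lemma 3.1 style condition: X^2 has no factor from theta(X^2) with non-empty context
lemma_condition = not any(occurs_properly_inside(theta_m(z), w)
                          for z in X2 for w in X2)

# theta-comma-free condition for X: X^2 has no factor from theta(X) with non-empty context
comma_free = not any(occurs_properly_inside(theta_m(y), w)
                     for y in X for w in X2)

print(lemma_condition)  # True  (all words in X^2 have length 8)
print(comma_free)       # False (a^2 b^2 a^3 b = a^2 theta(a^2 b^2) a b)
```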


Lemma 3.2. If $X$ is strictly $\theta$-infix then $X^n$ is strictly $\theta$-infix for all $n > 1$.

Proof. If the lemma does not hold then $X^n$ is not strictly $\theta$-infix for some $n$. This means that there are $x, y \in X^n$ such that $x = s\theta(y)t$ for some $s, t \in \Sigma^*$ (not both equal to 1). Let $x = x_1 \cdots x_n = s\theta(y_1 \cdots y_n)t$ with $x_i, y_i \in X$; then one of the $\theta(y_i)$ is a subword of some $x_j$, which is a contradiction to $X$ being $\theta$-infix. □

The Kleene $*$ closure of $X$ contains the union of all $X^n$, and in order for it to be $\theta$-infix we need stronger properties. The next proposition is a stronger version of Lemma 1(ii) and Proposition 1 in Hussini et al. (2003).

Proposition 3.3. The following are equivalent: (i) $X$ is strictly $\theta$-comma-free, (ii) $X^+$ is strictly $\theta$-infix, (iii) $X^+$ is strictly $\theta$-comma-free.

Proof. (i)⇒(ii). Suppose $X$ is strictly $\theta$-comma-free. Hence, by Observation 2.2(4), $X$ is strictly $\theta$-infix. By Lemma 3.2, $X^n$ is strictly $\theta$-infix for all $n \ge 1$. Suppose $X^+$ is not strictly $\theta$-infix. Then there exist $x, y \in X^+$ such that $x = a\theta(y)b$ for some $a, b \in \Sigma^*$ (not both equal to 1). Let $x = x_1 \cdots x_n$ and $y = y_1 \cdots y_m$, for $x_i, y_j \in X$; hence for some $y_i$, either
(a) $\theta(y_i)$ is a subword of $x_j$ for some $j$,
(b) $\theta(y_i)$ is a subword of $x_j x_{j+1}$, or
(c) $\theta(y_i)$ is a subword of $x_{j-1} x_j x_{j+1}$.

Cases (a) and (c) contradict the fact that $X$ is $\theta$-infix, and case (b) contradicts $X$ being $\theta$-comma-free. Hence $X^+$ is strictly $\theta$-infix.

(ii)⇒(iii). Given that $X^+$ is strictly $\theta$-infix, suppose $X^+$ is not $\theta$-comma-free. Then there exist $x, y, z \in X^+$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Let $x = x_1 \cdots x_n$, $y = y_1 \cdots y_m$ and $z = z_1 \cdots z_r$ with $x_i, y_i, z_i \in X$. Then $\theta(z_i)$ is a subword of one of $x_j x_{j+1}$, $x_j$, $y_s$, $y_s y_{s+1}$ or $x_n y_1$. All cases contradict the fact that $X^+$ is $\theta$-infix. Hence $X^+$ is strictly $\theta$-comma-free.

(iii)⇒(i). Obvious, since $X$ is a subset of $X^+$. □

The following definition is the same as the one given in Hussini et al. (2003).

Definition 3.4. For $X \subseteq \Sigma^+$ define
1. $X_{is} = \mathrm{PSuff}(\theta(X)) \cap \mathrm{PPref}(X)$,
2. $X_{ip} = \mathrm{PSuff}(X) \cap \mathrm{PPref}(\theta(X))$,

3. $X_s = \bigcup_{x \in \mathrm{PSuff}(X)} R_{\theta(X)}(x)$,
4. $X_p = \bigcup_{x \in \mathrm{PPref}(X)} L_{\theta(X)}(x)$.

The following two properties are observations similar to Propositions 6 and 9 in Hussini et al. (2003).

Proposition 3.5. Let $X \subseteq \Sigma^+$. Then $X$ is $\theta$-infix and $X_{ip} X_{is} \cap \theta(X) = \emptyset$ iff $X$ is $\theta$-comma-free.

Proof. Let $X$ be $\theta$-infix and $X_{ip} X_{is} \cap \theta(X) = \emptyset$. Suppose $X$ is not $\theta$-comma-free and there are $x, y, z \in X$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Since $X$ is $\theta$-infix, $\theta(z)$ is not a subword of $x$ or $y$. Let $\theta(z) = z_1 z_2$ such that $az_1 = x$ and $z_2 b = y$. But $z_1 z_2 \in \theta(X)$, so $z_2 b \in X$ implies $z_2 \in X_{is}$ and $az_1 \in X$ implies $z_1 \in X_{ip}$. Hence $z_1 z_2 \in X_{ip} X_{is} \cap \theta(X)$, which contradicts the hypothesis.

Conversely, let $X$ be $\theta$-comma-free. Suppose $xy \in X_{ip} X_{is} \cap \theta(X)$. Then $x \in X_{ip}$ implies that there are $u, v \in \Sigma^+$ such that $ux \in X$ and $xv \in \theta(X)$. For $y \in X_{is}$ there exist $w, r \in \Sigma^+$ such that $wy \in \theta(X)$ and $yr \in X$. Hence $uxyr \in X^2$ with $xy \in \theta(X)$, which is a contradiction with $X$ being $\theta$-comma-free. □

Proposition 3.6. If $X$ is $\theta$-infix and $X_p X_s \cap \theta(X) = \emptyset$ then $X$ is $\theta$-comma-free.

Proof. Suppose $X$ is not $\theta$-comma-free. Then there exist $x, y, z \in X$ such that $xy = a\theta(z)b$ for some $a, b \in \Sigma^+$. Since $X$ is $\theta$-infix, $\theta(z)$ is not a subword of $x$ or $y$. Let $\theta(z) = z_1 z_2$ for some $z_1, z_2$ such that $az_1 = x$ and $z_2 b = y$, which implies $z_2 \in X_s$ and $z_1 \in X_p$. Hence $z_1 z_2 \in X_p X_s \cap \theta(X)$, which is a contradiction with the initial assumption. □

The converse of the above proposition is not true. For example, $X = \{bba, bbab\}$ is $\theta$-comma-free for an antimorphic $\theta$ mapping $a \mapsto b$ and $b \mapsto a$. But $X_s = \{baa, a, aa\}$ and $X_p = \emptyset$, which implies $baa \in X_p X_s \cap \theta(X)$.

Proposition 3.7 investigates the case when the property of being a $\theta$-subword-$k$-$m$-code is preserved under Kleene $*$. It turns out that the conditions under which a $\theta$-subword-$k$-$m$-code is closed under concatenation with itself are somewhat more demanding than the ones for $\theta$-comma-free and $\theta$-infix. Considering items 8, 9 and 10 of Observation 2.2, these properties might turn out to be quite important.

Proposition 3.7. Let $k$, $m$ be positive integers and let $X \subseteq \Sigma^*$ be such that for every word $x \in X$, $|x| \ge 2k + m$. Let

$$L = \bigcup_{l=1}^{m} \ \bigcup_{i+j=2k+l} \mathrm{Suff}_i(X)\,\mathrm{Pref}_j(X). \qquad (1)$$

Then $X^*$ is a $\theta$-subword-$k$-$m$-code if and only if $X$ is a $\theta$-subword-$k$-$m$-code and for all $y \in L$, $\mathrm{Pref}_k(y) \cap \mathrm{Suff}_k(\theta(y)) = \emptyset$.

Proof. Assume that $X$ is a $\theta$-subword-$k$-$m$-code and for all $y \in L$, $\mathrm{Pref}_k(y) \cap \mathrm{Suff}_k(\theta(y)) = \emptyset$. Suppose $X^*$ is not a $\theta$-subword-$k$-$m$-code. Then there exists $x \in X^*$ such that $x = x_1 u s \theta(u) x_2$, where $|s| = l \le m$ and $|u| = k$. We claim that this is impossible. If $us\theta(u)$ is a subword of some $y \in X$, then this contradicts the property that $X$ is a $\theta$-subword-$k$-$m$-code. If $us\theta(u)$ is a subword of some $x_1 x_2$ with $x_1, x_2 \in X$, then there is a $p \in L$ such that $\mathrm{Pref}_k(p) \cap \mathrm{Suff}_k(\theta(p)) \ne \emptyset$, which is again a contradiction with the hypothesis. If $us\theta(u)$ is a subword of some $x_1 x_2 x_3$ with $x_1, x_2, x_3 \in X$, then $u_2 s u_3 = x_2$ for some $u_1 u_2 = u$ and $u_3 u_4 = \theta(u)$, and since $|u_2| < k$, $|u_3| < k$ and $|x_2| \ge 2k + m$, this implies $|s| > m$, which is a contradiction. Hence $X^*$ is a $\theta$-subword-$k$-$m$-code.

Conversely, note that if $X^*$ is a $\theta$-subword-$k$-$m$-code then $X$ is a $\theta$-subword-$k$-$m$-code. Suppose there exists $x \in L$ such that $\mathrm{Pref}_k(x) \cap \mathrm{Suff}_k(\theta(x)) \ne \emptyset$. Then $x$ is either a subword of some $y \in X$ or a subword of some $y_1 y_2$ with $y_1, y_2 \in X$. Both cases contradict the fact that $X^*$ is a $\theta$-subword-$k$-$m$-code. □

The conditions under which codes with one of the coding properties in Definition 2.1 are closed under Kleene $*$ are discussed above. But when is a language a $\theta$-subword-$k$-code, $\theta$-infix and $\theta$-comma-free all at once? Propositions 3.8 and 3.9 try to give an answer to this question. The condition in the first proposition is quite strong; it is only sufficient, and may not be necessary.

Proposition 3.8. Let $X$ be a $\theta$-2-code such that for all $x \in X$, $|x| \ge 3$. Then both $X$ and $X^+$ are strictly
1. $\theta$-subword-$k$-code for $k \ge 3$,
2. $\theta$-infix and $\theta$-comma-free.

Proof. 1. By induction on the powers of $X$. Since $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$ and $k \ge 3$, $\Sigma^* u \Sigma^m \theta(u) \Sigma^* \cap X = \emptyset$ for all $u \in \Sigma^k$ and all $m \ge 1$. Hence $X^1 = X$ is a $\theta$-subword-$k$-code. Consider $X^2$ and suppose that there exists an $x \in X^2$ such that


$x = y_1 u y_2 \theta(u) y_3$ for some $y_1, y_3 \in \Sigma^*$, $u \in \Sigma^k$ and $y_2 \in \Sigma^m$ for $m \ge 1$. Let $x = x_1 x_2$ for $x_1, x_2 \in X$. It is not the case that $x_1 = y_1$ and $x_2 = u y_2 \theta(u) y_3$, since $X$ is a $\theta$-subword-$k$-code. So suppose that $x_1 = y_1 u_1$ and $x_2 = u_2 y_2 \theta(u) y_3$ for some $u_1, u_2$ such that $u_1 u_2 = u$ and $u_1 \ne 1$. Since $k > 2$, one of $u_1$ or $u_2$ has length at least 2. This contradicts our assumption that $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$. The inductive step is done similarly. Assume $X^n$ is a $\theta$-subword-$k$-code. Suppose there exists an $x = x_1 \cdots x_{n+1} \in X^{n+1}$ such that $x = y_1 u y_2 \theta(u) y_3$ for some $y_1, y_3 \in \Sigma^*$, $u \in \Sigma^k$, $y_2 \in \Sigma^m$, $m \ge 1$. Suppose $u \in \mathrm{Sub}(x_i x_{i+1} \cdots x_{i+t})$ with $|u| > 2$. Then there is a subword of $u$, say $u_j \in \mathrm{Sub}(x_j)$, such that $|u_j| \ge 2$ and $\theta(u_j) \in \mathrm{Sub}(X)$. This is a contradiction to the condition that $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$. So for every $i$, $X^i$ is a $\theta$-subword-$k$-code. Hence $X^+ = \bigcup_{i=1}^{\infty} X^i$ is a $\theta$-subword-$k$-code.

2. Since $\mathrm{Sub}_2(X) \cap \mathrm{Sub}_2(\theta(X)) = \emptyset$, $X$ is both $\theta$-infix and $\theta$-comma-free, and by Proposition 3.3, $X^+$ is $\theta$-infix and $\theta$-comma-free. □

The following observation is straightforward.

Proposition 3.9. Let $X$ be a $\theta$-$k$-code for $k \le \min\{|x| : x \in X\}$. Then $X^+$ is a strictly $\theta$-$k$-code iff $X^2$ is a strictly $\theta$-$k$-code.

4. Methods to generate good code words

In Section 3 we described the necessary and sufficient conditions under which concatenations of "good" code words produce new "good" code words. With the constructions in this section we show several ways to generate such codes. Many authors have realized that in the design of DNA strands it is helpful to consider three out of the four bases. This was the case with several successful experiments (Faulhammer et al., 2000; Braich et al., 2002; Liu et al., 2003). It turns out that this technique, or a variation of it, can be generalized such that codes with some of the desired properties can be easily constructed. In this section, we concentrate on providing methods to generate "good" code words $X$ such that $X^+$ has the same property. For each code $X$ the entropy of $X^+$ is computed. The entropy measures the information capacity of the codes, i.e., the efficiency of these codes when used to represent information. The standard definition of the entropy of a code $X \subseteq \Sigma^+$ uses a probability distribution over the symbols of the alphabet of $X$ (see Berstel and Perrin, 1985). However, for a $p$-symbol alphabet, the maximal entropy is obtained when each symbol appears with the


same probability $1/p$. In this case the entropy essentially counts the average number of words of a given length that appear as subwords of the code words (Keane, 1991). From the coding theorem it follows that $\{0,1\}^+$ can be encoded by $X^+$ (with $\Sigma \mapsto \{0,1\}$) if the entropy of $X^+$ is at least $\log 2$ (Adler et al., 1983); see also Theorem 5.2.5 in Lind and Marcus (1999). The codes for $\theta$-comma-free, strictly $\theta$-comma-free, and $\theta$-$k$-codes designed in this section have entropy larger than $\log 2$ when the alphabet has $p = 4$ symbols. Hence, such DNA codes can be used for encoding bit-strings. We start with the entropy definition as given in Lind and Marcus (1999).

Definition 4.1. Let $X$ be a code. The entropy of $X^+$ is defined by
$$\bar{h}(X) = \lim_{n \to \infty} \frac{1}{n} \log |\mathrm{Sub}_n(X^+)|.$$

If $G$ is a deterministic automaton, or an automaton with a delay, that recognizes $X^+$ and $A_G$ is the adjacency matrix of $G$, then by the Perron–Frobenius theory $A_G$ has a maximal positive eigenvalue $\lambda$, and the entropy of $X^+$ is $\log \lambda$ (see Chapter 4 of Lind and Marcus (1999)). We use this fact in the following computations of the entropies of the designed codes. In Hussini et al. (2003), Proposition 16, the authors designed a set of DNA code words that is strictly $\theta$-comma-free. The following propositions show that in a similar way we can construct codes with additional "good" properties. In what follows we assume that $\Sigma$ is a finite alphabet with $|\Sigma| \ge 3$ and $\theta: \Sigma \to \Sigma$ is an involution which is not the identity. We denote by $p$ the number of symbols in $\Sigma$. We also use the fact that $X$ is (strictly) $\theta$-comma-free iff $X^+$ is (strictly) $\theta$-comma-free (Proposition 3.3).

Proposition 4.2. Let $a, b \in \Sigma$ be such that for all $c \in \Sigma \setminus \{a, b\}$, $\theta(c) \notin \{a, b\}$. Let $X = \bigcup_{i=1}^{\infty} a^m (\Sigma \setminus \{a, b\})^i b^m$ for a fixed integer $m \ge 1$. Then $X$ and $X^+$ are $\theta$-comma-free. The entropy of $X^+$ is such that $\log(p-2) < \bar{h}(X^+) < \log(p-1)$.

Proof. Let $x_1, x_2, y \in X$ be such that $x_1 x_2 = s\theta(y)t$ for some $s, t \in \Sigma^+$, with $x_1 = a^m p' b^m$, $x_2 = a^m q b^m$ and $y = a^m r b^m$, for $p', q, r \in (\Sigma \setminus \{a, b\})^+$. Since $\theta$ is an involution, if $\theta(a) \ne a, b$, then there is a $c \in \Sigma \setminus \{a, b\}$ such that $\theta(c) = a$, which is excluded by assumption. Hence either $\theta(a) = a$ or $\theta(a) = b$. When $\theta$ is morphic, $\theta(y) = \theta(a^m)\theta(r)\theta(b^m)$, and when $\theta$ is antimorphic, $\theta(y) = \theta(b^m)\theta(r)\theta(a^m)$. So $\theta(y) = a^m \theta(r) b^m$ or $\theta(y) = b^m \theta(r) a^m$.


Since $x_1 x_2 = a^m p' b^m a^m q b^m = s b^m \theta(r) a^m t$ or $x_1 x_2 = a^m p' b^m a^m q b^m = s a^m \theta(r) b^m t$, the only possibilities for $r$ are $\theta(r) = p'$ or $\theta(r) = q$. In the first case $s = 1$ and in the second case $t = 1$, which is a contradiction with the definition of $\theta$-comma-free. Hence $X$ is $\theta$-comma-free.

Let $\mathcal{A} = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 2m+1\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda: E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:

$\lambda(i, s, j) = a$ for $1 \le i \le m$, $j = i + 1$;
$\lambda(i, s, j) = b$ for $m + 2 \le i \le 2m$, $j = i + 1$, and for $i = 2m + 1$, $j = 1$;
$\lambda(i, s, j) = s$ for $i = m + 1, m + 2$, $j = m + 2$, $s \in \Sigma \setminus \{a, b\}$.

Then the adjacency matrix for $\mathcal{A}$ is a $(2m+1) \times (2m+1)$ matrix whose $ij$-th entry equals the number of edges from vertex $i$ to vertex $j$. The characteristic polynomial can be computed to be $\det(A - \lambda I) = (-\lambda)^{2m}(p - 2 - \lambda) + (-1)^{2m}(p - 2)$. The eigenvalues are solutions of the equation $\lambda^{2m}(p - 2) - \lambda^{2m+1} + p - 2 = 0$, which gives $p - 2 = \lambda - \lambda/(\lambda^{2m} + 1)$. Hence $0 < \lambda/(\lambda^{2m} + 1) < 1$, i.e., $p - 2 < \lambda < p - 1$. □

In the case of the DNA alphabet, $p = 4$, and for $m = 1$ the above characteristic equation becomes $\lambda^3 - 2\lambda^2 - 2 = 0$. The largest real root is approximately 2.3593, which means that the entropy of $X^+$ is greater than $\log 2$.
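As a numerical illustration (our own sketch using numpy, not part of the paper), one can build the adjacency matrix of the automaton above and recover the bound: for $m = 1$ and $p = 4$ the spectral radius is $\approx 2.3593$, so $\bar{h}(X^+) = \log\lambda > \log 2$:

```python
import numpy as np

def entropy_from_adjacency(A):
    """Entropy of the recognized language: log of the largest eigenvalue
    (Perron-Frobenius) of the automaton's adjacency matrix."""
    return np.log(max(abs(np.linalg.eigvals(A))))

def adjacency_prop_4_2(m, p):
    """Adjacency matrix of the (2m+1)-state automaton recognizing X+ for
    X = union of a^m (Sigma \\ {a,b})^i b^m over a p-symbol alphabet."""
    n = 2 * m + 1
    A = np.zeros((n, n))
    for i in range(m):              # states 1..m read 'a' and move right
        A[i, i + 1] = 1
    for i in range(m + 1, 2 * m):   # states m+2..2m read 'b' and move right
        A[i, i + 1] = 1
    A[2 * m, 0] = 1                 # state 2m+1 reads the last 'b', back to state 1
    A[m, m + 1] = p - 2             # state m+1: p-2 edges into the loop state
    A[m + 1, m + 1] = p - 2         # state m+2: p-2 self-loops
    return A

A = adjacency_prop_4_2(m=1, p=4)
lam = max(abs(np.linalg.eigvals(A)))
print(round(float(lam), 4))                     # ~2.3593
print(entropy_from_adjacency(A) > np.log(2))    # True
```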


$\lambda(i, s, j) = a$ for $1 \le i \le m$, $j = i + 1$;
$\lambda(i, s, j) = b$ for $2m \le i \le 3m - 1$, $j = i + 1$, and for $i = 3m$, $j = 1$;
$\lambda(i, s, j) = c$ for $i = 2m$, $j = m + 1$;
$\lambda(i, s, j) = s$ for $m + 1 \le i \le 2m - 1$, $j = i + 1$, $s \in \Sigma$.

Note that this automaton is not deterministic, but it has delay 1, hence the entropy of $X^+$ can be obtained from its adjacency matrix. Let $A$ be the adjacency matrix of this automaton. The characteristic equation for $A$ is $-\lambda^{3m} + \lambda^{2m} p^{m-1} + p^{m-1} = 0$. This implies
$$p^{m-1} = \frac{\lambda^{3m}}{\lambda^{2m} + 1} = \lambda^m - \frac{\lambda^m}{\lambda^{2m} + 1}.$$
Since $p$ is an integer and $0 < \lambda^m/(\lambda^{2m} + 1) < 1$,
$$p^{(m-1)/m} < \lambda < (p^{m-1} + 1)^{1/m}. \qquad \Box$$

For the DNA alphabet, $p = 4$, and for $m = 2$ the above characteristic equation becomes $\lambda^6 - 4\lambda^4 - 4 = 0$. Solving for $\lambda$, the largest real root is approximately 2.0553. Hence the entropy of $X^+$ is greater than $\log 2$.

Example 4.5. Consider $\Delta$ with $\theta = \rho\nu$ and let $m = 2$, $a = A$, $c = C$, $b = G$. Then $X = \bigcup_{i=1}^{\infty} AA(\Delta C)^i GG$ and $X^+$ are strictly $\theta$-comma-free.

With the following propositions we consider ways to generate $\theta$-subword-$k$-codes and $\theta$-$k$-codes.

Proposition 4.6. Let $a, b \in \Sigma$ be such that $\theta(a) = b$ and let $X = \bigcup_{i=1}^{\infty} a^{k-1} ((\Sigma \setminus \{a, b\})^{k-2} b)^i$. Then $X$ is a $\theta$-subword-$k$-code for $k \ge 3$. Moreover, when $\theta$ is morphic then $X$ is a $\theta$-$k$-code. The entropy of $X^+$ is such that $\log((p-2)^{(k-2)/(k-1)}) < \bar{h}(X^+) < \log(((p-2)^{k-2} + 1)^{1/(k-1)})$.

Proof. Suppose there exists $x \in X$ such that $x = r u s \theta(u) t$ for some $r, t \in \Sigma^*$, $u, \theta(u) \in \Sigma^k$ and $s \in \Sigma^m$ for $m \ge 1$. Let $x = a^{k-1} s_1 b s_2 b \cdots s_n b$, where $s_i \in (\Sigma \setminus \{a, b\})^{k-2}$. Then the following are all possible cases for $u$:

1. $u$ is a subword of $a^{k-1} s_1$,
2. $u$ is a subword of $a s_1 b$,
3. $u$ is a subword of $s_1 b s_2$,
4. $u$ is a subword of $b s_i b$ for some $i \le n$.


In all these cases, since $\theta(a) = b$, $\theta(u)$ is not a subword of $x$. Hence $X$ is a $\theta$-subword-$k$-code.

Let $\mathcal{A} = (V, E, \lambda)$ be the automaton that recognizes $X^+$, where $V = \{1, \ldots, 2k-2\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda: E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:

$\lambda(i, s, j) = a$ for $1 \le i \le k - 1$, $j = i + 1$;
$\lambda(i, s, j) = b$ for $i = 2k - 2$, $j = k$, and for $i = 2k - 2$, $j = 1$;
$\lambda(i, s, j) = s$ for $k \le i \le 2k - 3$, $j = i + 1$, $s \in \Sigma \setminus \{a, b\}$.

This automaton has delay 1. Let $A$ be the adjacency matrix of this automaton. The characteristic equation for $A$ is $\lambda^{2k-2} - (p-2)^{k-2}\lambda^{k-1} - (p-2)^{k-2} = 0$. So $(p-2)^{k-2} = \lambda^{k-1} - \lambda^{k-1}/(\lambda^{k-1} + 1)$. Since $0 < \lambda^{k-1}/(\lambda^{k-1} + 1) < 1$,
$$(p-2)^{(k-2)/(k-1)} < \lambda < ((p-2)^{k-2} + 1)^{1/(k-1)}. \qquad \Box$$
For the DNA alphabet, $p = 4$, and for $k = 3$ the above characteristic equation becomes $\lambda^4 - 2\lambda^2 - 2 = 0$. Solving for $\lambda$, the largest real root is approximately 1.6529, which is greater than the golden mean (1.618) but less than 2. The asymptotic value for $\lambda$ is 2 as $k$ approaches infinity.

Example 4.7. Consider $\Delta$ with $\theta = \rho\nu$ and choose $k = 3$. Then $X = \bigcup_{i=1}^{\infty} AA(\{G, C\}T)^i$ is a $\theta$-subword-3-code.
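As a quick sanity check of Example 4.7 (our own illustration, brute force), one can enumerate a few words of $X$ and verify that none contains a factor $u s \theta(u)$ with $|u| = 3$ and $|s| \ge 1$:

```python
from itertools import product

def theta(u):
    """Watson-Crick involution (reverse complement)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[s] for s in reversed(u))

def has_hairpin(x, k):
    """True if x contains u s theta(u) with |u| = k and |s| >= 1."""
    for i in range(len(x) - k + 1):
        u = x[i:i + k]
        rest = x[i + k + 1:]              # at least one symbol in between
        if theta(u) in rest:
            return True
    return False

# Words of X = AA ({G,C}T)^i from Example 4.7, for i = 1..3
words = ["AA" + "".join(block) for i in range(1, 4)
         for block in product(*[["GT", "CT"]] * i)]

print(all(not has_hairpin(w, 3) for w in words))   # True: no 3-hairpins
```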

As other authors have observed, note that it is easy to obtain a $\theta$-$k$-code if one of the symbols of the alphabet is completely ignored in the construction of the code $X$.

Proposition 4.8. Assume that $\theta(a) \ne a$ for all symbols $a \in \Sigma$. Let $b, c \in \Sigma$ be such that $\theta(b) = c$ and let $X = \bigcup_{i=1}^{\infty} a^{k-1} ((\Sigma \setminus \{c\})^{k-2} b)^i$ for $k \ge 3$. Then $X$ and $X^+$ are $\theta$-$k$-codes. The entropy of $X^+$ is such that $\log((p-1)^{(k-2)/(k-1)}) < \bar{h}(X^+) < \log(((p-1)^{k-2} + 1)^{1/(k-1)})$.

This automaton has delay 1. Let $A$ be the adjacency matrix of this automaton. Its characteristic equation is $\lambda^{2k-2} - \lambda^{k-1}(p-1)^{k-2} - (p-1)^{k-2} = 0$. This implies $(p-1)^{k-2} = \lambda^{k-1} - \lambda^{k-1}/(\lambda^{k-1} + 1)$. We are interested in the largest real value of $\lambda$. Since $\lambda > 0$, we have $0 < \lambda^{k-1}/(\lambda^{k-1} + 1) < 1$, which implies
$$(p-1)^{(k-2)/(k-1)} < \lambda < ((p-1)^{k-2} + 1)^{1/(k-1)}. \qquad \Box$$
For the DNA alphabet $p = 4$, and for $k = 4$ the above estimate says that $\lambda > 3^{2/3} > 2$. Hence the entropy of $X^+$ in this case is greater than $\log 2$.

Proposition 4.9. Assume that $\theta(a) \ne a$ for all symbols $a \in \Sigma$. Let $b, c \in \Sigma$ be such that $\theta(b) = c$ and let $X = \bigcup_{i=1}^{\infty} ((\Sigma \setminus \{c\})^{k-1} b)^i$ for $k \ge 3$. Then $X$ and $X^+$ are $\theta$-$k$-codes. The entropy of $X^+$ is $\bar{h}(X^+) = \log((p-1)^{(k-1)/k})$.

Proof. The fact that $X^+$ is a $\theta$-$k$-code is straightforward, since every subword of $x \in X$ of length $k$ contains the symbol $b$ but not the symbol $c$, while every subword of $\theta(x)$ of length $k$ contains $c = \theta(b)$. Let $\mathcal{A} = (V, E, \lambda)$ be an automaton that recognizes $X^+$, where $V = \{1, \ldots, k\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda: E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the labeling function defined in the following way:

$\lambda(i, s, j) = b$ for $i = k$, $j = 1$;
$\lambda(i, s, j) = s$ for $1 \le i \le k - 1$, $j = i + 1$, $s \in \Sigma \setminus \{c\}$.

Let $A$ be the adjacency matrix of this automaton. Its characteristic equation is $\lambda^k - (p-1)^{k-1} = 0$. This implies that $\lambda = (p-1)^{(k-1)/k}$. □
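For the DNA alphabet this construction is easy to instantiate. The sketch below is our own illustration; the concrete choice $b = C$, $c = G$ is just one instantiation compatible with $\theta(C) = G$ (it mirrors the structure of the sequences K1–K10 used in Section 5). It generates a few words of $X = \bigcup_i ((\Delta \setminus \{G\})^4 C)^i$ and checks the $\theta$-5-code condition:

```python
import random

def theta(u):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[s] for s in reversed(u))

def sub_k(words, k):
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}

def random_word(num_blocks, k=5, b="C", excluded="G"):
    """One word of X = ((Delta \\ {excluded})^(k-1) b)^i  (Proposition 4.9)."""
    letters = [s for s in "ACGT" if s != excluded]
    blocks = ["".join(random.choice(letters) for _ in range(k - 1)) + b
              for _ in range(num_blocks)]
    return "".join(blocks)

random.seed(0)
X = {random_word(num_blocks=4) for _ in range(10)}       # 20-mers, like K1-K10

# theta-5-code: length-5 subwords of X and of theta(X) are disjoint
assert sub_k(X, 5).isdisjoint(sub_k({theta(x) for x in X}, 5))
print(sorted(X)[:3])
```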

For the DNA alphabet, $p = 4$, and for $k = 5$ the above gives $\lambda = 3^{4/5} \approx 2.41$. Hence the entropy of $X^+$ in this case is greater than $\log 2$.

The above propositions show that in most cases we can generate a "good" set of code words that carries enough entropy to encode bit-strings by a simple bit-to-symbol map. Proposition 4.10 provides conditions under which a given set of strings can be used as a base to obtain good DNA code words by injectively mapping bit-strings to DNA words.

Proposition 4.10. Let $\Sigma_1$ and $\Sigma_2$ be finite alphabets and let $f$ be an injective morphism or antimorphism from $\Sigma_1^*$ to $\Sigma_2^*$. Let $X$ be a code over $\Sigma_1^*$. Then $f(X)$ is a code over $\Sigma_2^*$. Let $\theta_1: \Sigma_1^* \to \Sigma_1^*$ and $\theta_2: \Sigma_2^* \to \Sigma_2^*$ both be morphisms or antimorphisms, respectively, such that $f(\theta_1(x)) = \theta_2(f(x))$.


Let $P = \mathrm{Pref}(\theta_2(f(X)))$ and $S = \mathrm{Suff}(\theta_2(f(X)))$.

1. Let $(A^+ P \cup S A^+) \cap f(\Sigma_1^+) = \emptyset$ and $A^+ P A^+ \cap f(\Sigma_1^*) = \emptyset$, where $A = \Sigma_2^* \setminus f(\Sigma_1^*)$. If $X$ is strictly $\theta_1$-infix (comma-free) then $f(X)$ is strictly $\theta_2$-infix (comma-free).
2. Let $r$ be such that $|f(a)| = r$ for all $a \in \Sigma_1$ and $\Sigma_2^+ \mathrm{Sub}_{(k-1)r}(f(X)) \Sigma_2^+ \cap \mathrm{Sub}_{kr}(f(X)) = \emptyset$.
(a) If $X$ is a $\theta_1$-subword-$k$-code then $f(X)$ is a $\theta_2$-subword-$(k+1)r$-code.
(b) If $X$ is a $\theta_1$-$k$-code then $f(X)$ is a $\theta_2$-$(k+1)r$-code.

Proof. 1. Let $X$ be $\theta_1$-infix. Suppose $f(X)$ is not $\theta_2$-infix. Then there exist $x, y \in f(X)$ such that $x = a\theta_2(y)b$ for some $a, b \in \Sigma_2^*$, and $x_1, y_1 \in X$ such that $x = f(x_1)$ and $y = f(y_1)$. Suppose $a, b \in f(\Sigma_1^*)$; then there exist $a_1, b_1 \in \Sigma_1^*$ such that $f(x_1) = f(a_1)\theta_2(f(y_1))f(b_1) = f(a_1)f(\theta_1(y_1))f(b_1)$, which implies $x_1 = a_1 \theta_1(y_1) b_1$. This is a contradiction with the second hypothesis in 1. Suppose $a \notin f(\Sigma_1^*)$. Then either there exists $y' \in P$ such that $ay' \in f(\Sigma_1^+)$, or there exists a $b'$ such that $b = b'b''$ with $a\theta_2(y)b' \in f(\Sigma_1^*)$. This is a contradiction with the first hypothesis in 1. Hence $f(X)$ is $\theta_2$-infix. A similar proof works when $X$ is $\theta_1$-comma-free.

2. Suppose $f(X)$ is not a $\theta_2$-$(k+1)r$-code. Then there are $x, y \in \mathrm{Sub}_{(k+1)r}(f(X))$ such that $x = \theta_2(y)$. First assume $x, y \in f(\Sigma_1^+)$; then there are $x_1, y_1 \in \mathrm{Sub}_{k+1}(X)$ such that $f(x_1) = x$ and $f(y_1) = y$. Hence $f(x_1) = \theta_2(f(y_1))$, which implies $x_1 = \theta_1(y_1)$. This contradicts $X$ being a $\theta_1$-$k$-code. Otherwise, if $x = c x' d$ with $x' \in f(\Sigma_1^+)$ and $c, d \in A$, then $|x'| \ge kr$ and $x' = f(x_1')$ for some $x_1' \in \mathrm{Sub}_k(X)$. Similarly $y = e y' g$ with $e, g \in A$ and $y' = f(y_1')$ for some $y_1' \in \mathrm{Sub}_k(X)$. If $x' = \theta_2(y')$ then $f(x_1') = \theta_2(f(y_1'))$, which implies $x_1' = \theta_1(y_1')$, a contradiction. If $x' \ne \theta_2(y')$ then $x'$ is a subword of $e y' g$, which contradicts $\Sigma_2^+ \mathrm{Sub}_{(k-1)r}(f(X)) \Sigma_2^+ \cap \mathrm{Sub}_{kr}(f(X)) = \emptyset$. Hence $f(X)$ is a $\theta_2$-$(k+1)r$-code. A similar proof works for the $\theta_1$-subword-$k$-code case. □

Table 1. Sequences

K1   aatacatcacatttctaccc     S1   aagtgtctgtctctgtctgt
K2   actactacacacctcttacc     S2   aactgtctctgtctgtctct
K3   atcaccacccatcacactac     S3   aagtctgtgtctctctgtct
K4   ttaccatctctatacatctc     S4   aagtctctgtgtctctgtgt
K5   ctatctattcctctcacatc     S5   aactctgtctgtgtctctct
K6   tacacataactccactcatc     F1   aagcggaatctcggaaacgg
K7   tcccacatcccatactaatc     F2   aatcggaagcgcgcaaacgg
K8   ctaacatacctacacactac     F3   aaccggaatcacgcaatcgg
K9   acctctacacacttcacaac     F4   aaacggaaacacggaagcgg
K10  tatacatacctcaaccactc     F5   aatcggaatctcggaagcgg
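Returning to Proposition 4.10: the compatibility condition $f(\theta_1(x)) = \theta_2(f(x))$ is easy to arrange when mapping bit-strings to DNA words. The sketch below is our own illustration; the block length $r = 3$ and the images $f(0) = ACC$, $f(1) = GGT$ are assumed choices (picked so that $f(1) = \theta_2(f(0))$), not values taken from the paper:

```python
def theta2(u):
    """Watson-Crick involution on DNA words (antimorphic)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[s] for s in reversed(u))

def theta1(x):
    """Antimorphic involution on bit-strings: reverse and flip each bit."""
    return "".join("1" if b == "0" else "0" for b in reversed(x))

F = {"0": "ACC", "1": "GGT"}      # assumed images; note F["1"] == theta2(F["0"])

def f(x):
    """Injective morphism from {0,1}* to DNA words (block length r = 3)."""
    return "".join(F[b] for b in x)

# Check f(theta1(x)) == theta2(f(x)) on a few bit-strings
for x in ["0", "1", "0110", "10011"]:
    assert f(theta1(x)) == theta2(f(x))
print("compatibility holds")
```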


Figure 4. A 15% acrylamide non-denaturing gel of the tested DNA codes.

5. Experimental results

Using the methods provided in Section 4, a set of 20 sequences of length 20 base pairs was designed and tested experimentally. Among these, 10 were $\theta$-5-codes (designed by the methods of Proposition 4.9), five were $\theta$-subword-3-codes (Proposition 4.6) and the remaining five were $\theta$-comma-free codes (Proposition 4.4).

Table 2. Contents of the gel

Lane    Content
1       Ladder
2–11    K1–K10
12      All $\theta$-comma-free codes
13      All $\theta$-subword-3-codes
14      All $\theta$-5-codes
15      All $\theta$-5-codes with all $\theta$-subword-3-codes
16      All $\theta$-subword-3-codes with all $\theta$-comma-free codes
17      All $\theta$-comma-free codes with all $\theta$-5-codes
18      Double stranded molecule (K1 with $\overline{K1}$)


Purified oligonucleotides were purchased from Integrated DNA Technologies. In addition to the 20 sequences, the Watson–Crick complement of one of the sequences (K1) was ordered to act as a control. The sequences K1–K10 are the $\theta$-5-codes, while S1–S5 and F1–F5 are, respectively, the $\theta$-subword-3-codes and the $\theta$-comma-free codes. Non-cross-hybridization and secondary structure were assessed by running the sequences on TAE polyacrylamide non-denaturing gels (15%) at 4 °C. The annealing reaction was done by heating the molecules at 95 °C for 5 min and cooling them down to room temperature. The results of the experiment are shown in Figure 4. The top and the bottom bands on the gel are the dye.

Observed result: no duplexes were detected in lane 14, which contains the annealed solution of all $\theta$-5-codes. Their mobility on the gel coincides with that of the single stranded molecules in lanes 2–11. This shows that no cross-hybridization is observed for these strands. Also, no secondary structures were detected in any of lanes 2–11. All molecules are at the same level as the 20 bp mark of the ladder in lane 1. The ladder is not clearly visible because the gel was stained for a long time in order to make the DNA visible. The double stranded molecule in lane 18 runs at a different level than the single stranded DNA molecules (K1–K10) in lanes 2–11. However, some cross hybridization was observed for the $\theta$-comma-free and $\theta$-subword-3-codes (lanes 12, 13 and 15–17).

6. Concluding remarks

In this paper we investigated theoretical properties of languages that consist of DNA based code words. In particular, we concentrated on intermolecular and intramolecular cross hybridizations that can occur when a Watson–Crick complement of a (sub)word of a code word is also a (sub)word of a code word. These conditions are necessary for the design of good codes, but certainly may not be sufficient. For example, the algorithms used in the programs developed by Seeman (Seeman, 1990), Feldkamp (Feldkamp et al., 2002) and Ruben (Ruben et al., 2002) all check for uniqueness of $k$-length subsequences in the code words. Unfortunately, none of the properties from Definition 2.1 ensures uniqueness of $k$-length words. Such code word properties remain to be investigated. The observations in Section 3 provide a general way


to obtain, from a small set of code words with a desired property and by concatenating the existing words, arbitrarily large sets of code words with similar properties. We hope that the general methods of designing such code words will simplify the search for "good" codes. Better characterizations of good code words that are closed under the Kleene $*$ operation may provide even faster ways of designing such code words. The most challenging questions, of characterizing and designing good $\theta$-$k$-codes, remain open. The information capacity of the codes designed by the methods from Section 4 shows that in most cases we obtain sets of code words that can encode binary strings by a symbol-to-symbol map. In Arita and Kobayashi (2002), the authors use binary strings with certain properties to generate DNA codes (assigning bases to bits). It remains to be investigated how powerful (in the sense of information capacity) the transition from binary strings to DNA code words can be. Our experimental results show that when $\theta$-$k$-codes are used, they provide a good base for developing new code words; however, the other properties taken in isolation were not shown to provide sufficient distinction between coded DNA strands in laboratory experiments. Our approach to the question of designing "good" DNA codes has been from the formal language theory aspect. Many issues that are involved in designing such codes have not been considered. These include (but are not limited to) free energy conditions, melting temperature, as well as Hamming distance conditions. All these remain challenging problems, and a procedure that includes all or a majority of these aspects would be desirable in practice.

Acknowledgements

The first two authors have been partially supported by grants EIA-0086015 and EIA-0074808 from the National Science Foundation, USA. The third author is supported by EIA-0130385 and MCB-9980092 from the National Science Foundation, USA.

References

Adler RL, Coppersmith D and Hassner M (1983) Algorithms for sliding block codes – an application of symbolic dynamics to information theory. IEEE Trans Inform Theory 29: 5–22


Arita M and Kobayashi S (2002) DNA sequence design using templates. New Generation Comput 20(3): 263–277 (Available as a sample paper at http://www.ohmsha.co.jp/ngc/index.htm.)
Berstel J and Perrin D (1985) Theory of Codes. Academic Press Inc., Orlando, Florida
Braich RS, Chelyapov N, Johnson C, Rothemund PWK and Adleman L (2002) Solution of a 20-variable 3-SAT problem on a DNA computer. Science 296: 499–502
Deaton R, Chen J, Bi H, Garzon M, Rubin H and Wood DF (2003) A PCR-based protocol for in vitro selection of non-crosshybridizing oligonucleotides. In: Hagiya M and Ohuchi A (eds) DNA Computing: Proceedings of the 8th International Meeting on DNA Based Computers, Springer LNCS 2568: 196–204
Deaton R et al. (1997) A DNA based implementation of an evolutionary search for good encodings for DNA computation. Proceedings of the IEEE Conference on Evolutionary Computation ICEC-97, pp. 267–271
Faulhammer D, Cukras AR, Lipton RJ and Landweber LF (2000) Molecular computation: RNA solutions to chess problems. Proceedings of the National Academy of Sciences, USA 97(4): 1385–1389
Feldkamp U, Saghafi S and Rauhe H (2002) DNA sequence generator – a program for the construction of DNA sequences. In: Jonoska N and Seeman NC (eds) DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers, Springer LNCS 2340: 23–32
Garzon M, Deaton R and Renault D (2000) Virtual test tubes: a new methodology for computing. Proceedings of the 7th International Symposium on String Processing and Information Retrieval, A Coruña, Spain, IEEE Computer Society Press, pp. 116–121
Head T (1987) Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors. Bull Math Biology 49: 737–759
Head T, Paun Gh and Pixton D (1997) Language theory and molecular genetics. In: Rozenberg G and Salomaa A (eds) Handbook of Formal Languages, Vol. II, Springer Verlag, pp. 295–358
Hussini S, Kari L and Konstantinidis S (2003) Coding properties of DNA languages. Theoretical Computer Science 290: 1557–1579
Jonoska N, Kephart D and Mahalingam K (2002) Generating DNA code words. Congressus Numerantium 156: 99–110
Jonoska N and Mahalingam K (2004) Languages of DNA based code words. In: Chen J and Reif J (eds) Proceedings of the 9th International Meeting on DNA Based Computers, Springer LNCS 2943: 61–73
Jonoska N and Mahalingam K (2004) Methods for constructing coded DNA languages. In: Jonoska N, Paun G and Rozenberg G (eds) Aspects of Molecular Computing, Springer LNCS 2950: 241–253
Kari L, Konstantinidis S, Losseva E and Wozniak G (2003) Sticky-free and overhang-free DNA languages. Acta Informatica 40: 119–157
Keane MS (1991) Ergodic theory and subshifts of finite type. In: Bedford T et al. (eds) Ergodic Theory, Symbolic Dynamics and Hyperbolic Spaces, pp. 35–70, Oxford University Press, Oxford
Li Z (2002) Construct DNA code words using backtrack algorithm. Preprint
Lind D and Marcus B (1999) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, Cambridge, United Kingdom
Liu Q et al. (2000) DNA computing on surfaces. Nature 403: 175–179


Marathe A, Condon AE and Corn RM (1999) On combinatorial word design. Preliminary Preproceedings of the 5th International Meeting on DNA Based Computers, Boston, pp. 75–88
Paun Gh, Rozenberg G and Salomaa A (1998) DNA Computing, New Computing Paradigms. Springer Verlag
Ruben AJ, Freeland SJ and Landweber LF (2002) PUNCH: an evolutionary algorithm for optimizing bit set selection. In: Jonoska N and Seeman NC (eds) DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers, Springer LNCS 2340: 150–160
Seeman NC (1990) De novo design of sequences for nucleic acid structural engineering. Journal of Biomolecular Structure & Dynamics 8(3): 573–581
