Constructing Orthogonal de Bruijn Sequences

1 downloads 0 Views 299KB Size Report
Small σ – We prove that a pair of orthogonal de Bruijn sequences exist for ... Large σ – Through an algorithmic construction, we prove that there are at ..... Czar, M.J., Anderson, J.C., Bader, J.S., Peccoud, J.: Gene synthesis demystified. ... Math- ematical Proceedings of the Cambridge Philosophical Society 44, 463–482 ...
Constructing Orthogonal de Bruijn Sequences Yaw-Ling Lin1, , Charles Ward2 , Bharat Jain2 , and Steven Skiena2, 1

Dept. of Comp. Sci. and Info. Eng., Providence University [email protected] 2 Dept. of Comp. Sci., Stony Brook University [email protected], {bkjain,skiena}@cs.sunysb.edu

Abstract. A (σ, k)-de Bruijn sequence is a minimum length string on an alphabet set of size σ which contains all σ k k-mers exactly once. Motivated by an application in synthetic biology, we say a given collection of de Bruijn sequences are orthogonal if no two of them contain the same (k + 1)-mer; that is, the length of their longest common substring is k. In this paper, we show how to construct large collections of orthogonal de Bruijn sequences. In particular, we prove that there are at least σ/2 mutually-orthogonal order-k de Bruijn sequences on alphabets of size σ for all k. Based on this approach, we present a heuristic which proves capable of efficiently constructing optimal collections of mutuallyorthogonal sequences for small values of σ and k, which supports our conjecture that σ − 1 mutually-orthogonal de Bruijn sequences exist for all σ and k. Keywords: de Bruijn graphs, de Bruijn sequences, orthogonal sequences, Eulerian cycles, DNA synthesis.

1

Introduction

New technologies create new questions about classical objects. Here we introduce a natural class of combinatorial sequence design problems in response to rapid advances in DNA synthesis technology. An exciting new field of synthetic biology is emerging with the goal of designing novel organisms at the genetic level. DNA sequencing technology can be thought of as reading DNA molecules, so as to describe them as strings on {A, C, G, T } for computational analysis. DNA synthesis is the inverse operation, where one can take any desired string and construct DNA molecules to specification with exactly the given sequence. Indeed, commercial vendors such as GeneArt and Blue Heron today charge under 40 cents per base, or only a few thousand dollars to synthesize virus-length sequences, and prices are rapidly dropping [4,6]. In May 2010, Venter’s group announced the first synthetic life form, a feat requiring synthesizing a 1.08 megabase bacterial chromosome [8]. The advent of cheap  

This work is supported in part by the National Science Council (NSC-99-2632-E126-001-MY3), Taiwan, ROC. Supported in part by NIH Grant 5R01AI07521903, NSF Grants IIS-1017181 and DBI-1060572, and IC Postdoctoral Fellowship HM1582-07-BAA-0005.

F. Dehne, J. Iacono, and J.-R. Sack (Eds.): WADS 2011, LNCS 6844, pp. 595–606, 2011. c Springer-Verlag Berlin Heidelberg 2011 

596

Y.-L. Lin et al.

synthesis will have many exciting new applications throughout the life sciences (including our own work on synthetic vaccines [5,16,15]): the freedom to design new sequences to specification leads to a variety of new algorithmic problems on sequences. We are particularly concerned with the problem of designing novel sequences encoding large collections of target patterns. A collaboration with microbiologists Bruce Futcher (Stony Brook University) and Lucas Carey (Weizmann Institute) revealed the need for several distinct sequence designs each containing representative binding sites for a large set of transcription factors. Three distinct objectives of these designs needed to be satisfied: 1. Minimum Length – Since synthesis cost increases as a function of length, these sequence designs must be as short as possible. 2. Variability – Several equivalent designs were needed, in order to control for the possibility of innoportunely inserting larger signal patterns, and to provide additional confirmatory evidence of the findings. 3. Orthogonality – The final goal is that each pair of our designed sequences avoid common (or reverse-complement) sequences to minimize the chances of cross-hybridization. Such molecular interactions could lead to a variety of problems, including blocking the very binding sites the molecules were specifically designed to include. A particularly relevant class of DNA sequences contain occurrences of all 4k patterns of length k. Simple concatenation would yield a string of length k4k , but considerable length reduction is possible by overlapping patterns. Indeed, de Bruijn sequences [7,11] yield optimal circular strings of length 4k containing each pattern exactly once. These can be linearized, as in the case of the string AACAGAT CCGCT GGT T A containing all 16 two-base nucleotide sequences (as shown in Figure 1(left)). Such compression can have a big impact on synthesis costs: realizing all 6-mers using the concatenation design would require 24,576 bases, as opposed to the linearized de Bruijn sequence of only 4,101 bases. Further, the number of distinct de Bruijn sequence of length (order) k and alphabet σ grow exponentially as a function of k and σ, addressing the second criteria. In this paper we deal with the third criteria: orthogonality. In particular, we define two order-k de Bruijn sequences as orthogonal if they contain no (k + 1)mer in common – even though they both must contain all k-mers in common. A set S of de Bruijn sequences are mutually orthogonal if each pair is orthogonal, meaning that no (k + 1)-mer occurs more than once in S. For example, the following three sequences: AAACAAGCACCAGACGAGTCATACTCCCGCCTAGGCGTGATCTGTAATTTGCTTCGGTTATGGGAA AAAGAATGAGGATAGTATCGACAGCGGGTGGCATTGTCTACGCTCAACCCTGCCGTTCCACTTTAA AAATAACTATTACATCACGTAGATGTTTCTTGACCTCGCAGTGCGAAGGGCTGGTCCGGAGCCCAA

each contain all 43 possible DNA triplets exactly once, but have no 4-base sequence in common. In this paper we show how to construct large sets of orthogonal de Bruijn sequences. In particular:

Constructing Orthogonal de Bruijn Sequences

597

– Small σ – We prove that a pair of orthogonal de Bruijn sequences exist for all orders k ≥ 1 for σ ≥ 3, and further that this is the smallest alphabet which permits orthogonal de Bruijn sequences. Similarly, a pair of orthogonal sequences always exist on the related Kautz graphs for σ ≥ 4. – Large σ – Through an algorithmic construction, we prove that there are at least σ/2 mutually orthogonal order-k de Bruijn sequences for all k ≥ 1. Since there can be at most σ−1 mutually orthogonal sequences, this family is at least half as large as optimal. Further our result generalizes to constructing edge-pair disjoint Eulerian cycles of σ-regular directed graphs, and hence can be used to construct large sets of orthogonal sequences of difference types (such as Kautz sequences [13]). We conjecture that σ − 1 mutually-orthogonal de Bruijn sequences exist for all k ≥ 1, σ ≥ 3. However, significantly improving our result seems intimately connected with a well-known conjecture on Hamiltonian decompositions of line graphs [1], and hence appears difficult. – Optimal constructions for fixed σ and k – The technique employed in our combinatorial result suggests a heuristic procedure for constructing even larger sets of orthogonal sequences. We couple this heuristic with exhaustive search to construct maximal sets of sequences for all reachable k on σ ≤ 26. These constructions support our conjecture that σ − 1 mutually-orthogonal sequences exist for all k. We have made these constructions available at http://www.algorithm.cs. sunysb.edu/orthogonal, along with the programs that identified them, where we anticipate they will suffice for all near-term DNA synthesis applications. Indeed, a proposal is now pending to synthesize our three mutually-orthogonal σ = 4, k = 6 sequence designs as plasmids within a host such as yeast or E. coli, where they can be easily grown as needed for new experiments without additional costly synthesis. These plasmids will be made available to the research community, where we anticipate they will find a wide variety of applications. – Kautz Sequences – Many of our approaches in de Bruijn graphs can be applied on the Kautz graphs as well. In particular, we find that it is possible to find σ − 1 orthogonal Kautz Sequences on alphabet size with σ > 3, for a modest size of k. These results suggest that it is possible to find Hamiltonian decompositions of Kautz graphs with σ > 3, a very nice property for a network topology. Further, we also discover a simple deduction on counting the number of different Kautz sequences, independent of previous approaches using spectral graph theory [18]. – Specialized Constructions – K´ asa [12] conjectures that there are orthogonal de Bruijn sequences of size σ − 1 for σ ≥ 3, k ≥ 2, that can be constructed from one de Bruijn sequence s and series of morphism operations. We implement tools for constructing orthogonal de Bruijn sequences of this kind for relatively small σ and k. In this vein, we propose a conjectured form of orthogonal de Bruijn sequence sets of size σ − 1. These sets have the property of being uniquely described by a single input sequence. These sequences seem relatively abundant

598

Y.-L. Lin et al.

for small values of σ and k, but grow increasing more scarce as the problem size increases. We show that such sequences exist for σ up to 9 when k = 2, and up to k = 6 when σ = 4. This paper is organized as follows. Section 2 introduces previous work on constructing de Bruijn sequences. Analytical results are presented in Section 3. Our heuristic construction and experimental results are presented in Section 4. Open problems, particularly related to d-orthogonal sequences, are described in Section 5.

2

Preliminaries

The de Bruijn sequences are well studied combinatorial objects. There are a number of useful characterizations for the construction of de Bruijn sequences, including as an Eulerian Tour of the de Bruijn Graph [7,11] and as the output of a feedback shift register [10,17]. We will focus on the former characterization here, and review a number of useful known properties of the de Bruijn sequences. Definition 1. A de Bruijn graph of order k on an alphabet of size σ, denoted by Gσ,k , is a directed graph with vertices representing all σ k−1 sequences of length (k−1), and σ k edges joining pairs of vertices whose sequences have the maximum possible overlap (length k − 2). See, for example, Figure 1(middle). The standard method for constructing de Bruijn sequences is to construct an Eulerian cycle of this graph. Every Eulerian cycle on the de Bruijn graph Gσ,k defines a distinct, minimum length de Bruijn sequence containing all σ k k-mers exactly once. Note that this Eulerian cycle approach can be applied to any pattern set whose overlap graph contains vertices of equal in- and out-degree. ˆ∗ The de Bruijn graph Gσ,2 is isomorphic to the directed complete graph K σ (with σ self loops). Every Eulerian cycle of Gσ,k defines a distinct, minimum length de Bruijn (circular) sequence containing all σ k k-mers exactly once. We

Fig. 1. (left) An order-2, σ = 4 de Bruijn sequence representing the shortest DNA sequence containing all 16 distinct bi-nucleotide patterns. (middle) The order-4, σ = 2 de Bruijn graph, Eulerian tours of which correspond to de Bruijn sequences. (right) The order-4, σ = 3 Kautz graph, discussed in section 2.1.

Constructing Orthogonal de Bruijn Sequences

599

say a given collection of de Bruijn sequences are orthogonal if any two sequences maximally differ in sequence composition; that is, their longest common substring is length k. As an example, the collection {001122021, 002211012} is an orthogonal collection of de Bruijn sequences on G3,2 ; while {0011223313032021, 0023212203113301, 0033221101312302} is an orthogonal collection of de Bruijn sequences on G4,2 . The directed line graph L(G) of a directed graph G = (V, E) contains a vertex x for each edge (u, v) ∈ G, and directed edges (x, y) ∈ L(G) iff x = (u, v) and y = (v, w), where u, v, w ∈ V and x, y ∈ E. Observation 2. Gσ,k+1 is the directed line graph of Gσ,k . Then any Eulerian cycle in Gσ,k corresponds to a Hamiltonian cycle in Gσ,k+1 . Note that two orthogonal de Bruijn sequences correspond to two Eulerian cycles in Gσ,k such that these two cycles do not share any pair of consecutively incident edges. It follows directly that: Observation 3. An orthogonal collection of de Bruijn sequences on Gσ,k corresponds to a set of Eulerian cycles in Gσ,k such that no two cycles share any common adjacent pair of edges. By the line graph property of Observation 2, these correspond to a set of edge-disjoint Hamiltonian cycles in Gσ,k+1 . Let us define Ω(σ, k) to be the maximum size of all orthogonal collections of de Bruijn sequences on Gσ,k . Observation 2 tells us that: Corollary 4. The size of an orthogonal collection of de Bruijn sequences on Gσ,k can not exceed σ − 1. That is, Ω(σ, k) ≤ σ − 1. Proof. An orthogonal collection of de Bruijn sequences on Gσ,k corresponds to a set of edge-disjoint Hamiltonian cycles in Gσ,k+1 . However, the minimum in/out degree of any vertex in Gσ,k+1 is σ − 1 plus a self-loop. As the self-loop cannot contribute towards an additional edge-disjoint cycle, there can be at most σ − 1 mutually edge-disjoint Hamiltonian cycles on Gσ,k+1 , and therefore at most σ −1 orthogonal de Bruijn sequences on Gσ,k .  It is conjectured [3,12] that Ω(σ, k) = σ − 1, for k ≥ 2. In the case of k = 1, note that an orthogonal collection of de Bruijn sequences over Gσ,1 , actually corresponds to a Hamiltonian decomposition of the complete directed graph Kσ∗ (without self-loops). It is known that all Kn∗ admit a (directed) Hamiltonian decomposition of size n − 1, except for the case of K4∗ and K6∗ [1]. These results directly lead to Ω(σ, 1) = σ − 1 when σ = 4 or 6 (Ω(4, 1) = 2 and Ω(6, 1) = 4). The conjecture remains plausible for the case of k ≥ 2, and we will later give compelling experimental evidence for this. Subsequent to submitting this work for publication, we discovered Fleischner and Jackson’s [9] result on compatible Eulerian circuits which implies our Theorem 8. Rowley and Bose [20,19] use feedback shift registers [10,17] to obtain σ − 1 orthogonal sequences when σ is a power of 2, and σ orthogonal sequences on a modified de Bruijn graph when σ is the power of a prime.

600

Y.-L. Lin et al.

2.1

Kautz Graphs

A Kautz graph [13] is a labeled graph, very similar to the de Bruijn graph. Like the de Bruijn graph, vertices of Kzσ,k are labeled by strings of length k over an alphabet Σ with σ letters, but with the additional restriction that every two consecutive letters in the string must be different. As in the de Bruijn graph, there is a directed edge from a vertex u to another vertex v if it is possible to transform the string of u into the string of v by removing the first letter and appending a letter to it. See figure 1(right) for an example of a Kautz graph. For a fixed degree and number of vertices, the Kautz graph has the smallest diameter of any possible directed graph; furthermore, a degree-δ Kautz graph has δ disjoint paths from any node u to any other node v. These properties suggest that the Kautz graph can be a nice candidate of the network topology for connecting processors in interconnection networks [2]. Note that Kautz graphs have in-degree equal to out-degree, both being σ − 1, for each node. It thus follows that all Kautz graphs are Eulerian. As with the de Bruijn sequences, a Kautz sequence naturally corresponds to a Euler path in the underlying Kautz graph. Thus, a Kautz sequence is a sequence of minimal length that produces all possible length-(k + 1) sequences, but with the restriction that every two consecutive letters in the sequences must be different. A set of three mutually orthogonal Kautz sequences for σ = 4 and k = 3 are: 01202303021320312313010323212102013101 01213020212323013210102310313120303201 01230232020321313231212010130310210301 Observation 5. Kzσ,k+1 , is the directed line graph of Kzσ,k . Thus, an Eulerian cycle in Kzσ,k , corresponds to a Hamiltonian cycle in Kzσ,k+1 . Therefore, both Kautz and de Bruijn graphs can be considered as families of iterated line graphs. Kzσ,k are obtained from k −1 iterated line graph operations ˆ σ∗ (with self-loops.) on Kσ∗ , while Gσ,k are the (k−2)-th iterated line graph from K

3

Analytical Results

To be Eulerian, each vertex of a directed graph must have the same in-degree as out-degree. Furthermore, an Euler tour in G corresponds to a pairing of each inedge to its out-edge for each vertex v ∈ G. Such an edge-pairing defines a perfect matching between input edges to output edges of v. We call such an edge-pairing an (edge) wiring of v. Two wirings of a vertex are disjoint if the corresponding matchings are edge-disjoint. Note that an Euler tour defines a specific wiring for each vertex in G; however, a set of arbitrary wirings for vertices of G usually ends up with several disconnected (edge) cycles. As a restatement of Observation 3, we have Observation 6. Two Euler tours in G are orthogonal to each other if and only if the two induced wirings for each vertex of G are disjoint.

Constructing Orthogonal de Bruijn Sequences

601

K˝ onig [23] proved that an r-regular bipartite graph can be decomposed into r edge-disjoint perfect matchings, implying that it is hopeful to find an orthogonal collection of Euler tour with size δ in an Eulerian graph with minimum in/outdegree δ. However, it is easily verified that we cannot find two orthogonal Euler tours in the directed complete graph K3∗ (without self-loops), whose minimum ˆ 3∗ with 3 selfin/out-degree is δ = 2. In contrast, the directed complete graph K ˆ ∗. loops, has minimum degree 3, but there exist two orthogonal Euler tours in K 3 We now show that there exist at least two orthogonal Euler tours in digraph G whenever δ ≥ 3: Theorem 7. There exists an orthogonal collection of Euler tours of size 2 in any Eulerian digraph G with minimum degree at least 3. Hence, there exists an orthogonal collection of de Bruijn sequences on Gσ,k with size 2 if σ ≥ 3. That is, Ω(σ, k) ≥ 2 if σ ≥ 3 Proof. Let C be an arbitrary Euler tour of G, defining a specific wiring for each vertex in G. We can rewire each vertex v in G such that the new wiring is disjoint and still forms an Euler tour. Note that the initial wiring of v partition edges of G into δ disjoint nonempty paths, namely {P1 , P2 , . . . , Pδ }, with C = (P1 P2 · · · Pδ ) in circular order. Let ai (bi ) denote the first (last) edge of Pi . Note that the vertex v wires bi to a1+i mod δ . It is easily verified that the newly constructed wiring of v by connecting bi to a2+i mod δ produces a disjoint Euler tour (Pδ Pδ−1 · · · P2 P1 ). Note that the argument fails for δ = 2 where (P1 P2 ) = (P2 P1 ). Figure 2 shows this diagrammatically for an example for σ = 3.  The idea of rewiring the edge-disjoint perfect matchings in Theorem 7 can be extended so that we can find more orthogonal Euler tours, if possible, by repeatedly rewiring the edge matching while maintaining the one connected Euler tour:

Fig. 2. An example of rewiring a vertex with σ = 3 for the proof of Theorem 7

Theorem 8. Given an Eulerian digraph G with minimum degree δ, there exists an orthogonal collection of Euler tours in G of size δ/2. Therefore, there exists an orthogonal collection of de Bruijn sequences of Gσ,k with size at least σ/2; that is, σ/2 ≤ Ω(σ, k) ≤ σ − 1. Proof. Given an orthogonal collection of m Euler tours on G, {C1 , C2 , . . . , Cm }, 1 ≤ m < δ/2, we will show that it is always possible

602

Y.-L. Lin et al.

to rewire each vertex v ∈ G such that the new wiring is disjoint and maintains an Euler tour. Observe that these m disjoint Euler tours induce m forbidden bipartite matchings between the input and output edges for each vertex for the rewiring in the (m+1)-st sequence. Similar to the analysis of Theorem 7, take an arbitrary Euler tour with respect to a vertex v, partition edges of Gσ,k into δ disjoint nonempty paths, namely {P1 , P2 , . . . , Pδ }, with the Eulerian tour C = (P1 P2 · · · Pδ ). Let ai (bi ) denote the first (last) edge of Pi , and thus the original wiring of v wires bi to a1+i mod δ , as, again, in Figure 2. We define the permissible wiring graph of v as the digraph Wv with vertex set {P1 , P2 , . . . , Pδ } & edge set {(Pi , Pj )| (bi , aj ) is not an edge of Cx , 1 ≤ x ≤ m}. That is, two path vertices are connected by an edge if they have not been wired together in a previous tour. We now wish to find a Hamiltonian path on this graph, as it will correspond to a wiring which is both permissible and preserves an Eulerian tour of the graph. Observe that both the in-degree and out-degree of each vertex of the wiring graph are at least δ − 1 − m ≥ δ/2. By the Ghouila-Houri theorem [23], a directed graph D with n vertices has a Hamiltonian cycle if both the in-degree and out-degree of every vertex of D are at least n/2. Thus, it follows that our wiring graph has a Hamilton cycle, and thus every vertex v can be successfully rewired, maintaining an Euler tour in G, while having the property that the edge-matching in v is orthogonal to those of the previous m tours.  3.1

Special Orthogonal Families

K´ asa [12] observed small samples of orthogonal de Bruijn sequences following a special pattern. In particular, a de Bruijn sequence can be transformed into another by means of a morphism function, μ, where μ(0) = 0, μ(σ − 1) = 1, and μ(i) = 1 + i for 1 ≤ i ≤ σ − 2. K´ asa conjectured that for σ ≤ 3, k ≤ 2, sets of orthogonal de Bruijn sequences of size σ − 1 can be constructed from one de Bruijn sequence s and morphism operations ν i (s), 1 ≤ i ≤ σ − 2. For example, the following are a set of K´ asa’s orthogonal de Bruijn sequences on σ = 4, k = 2, 00102113230331220 00203221310112330 00301332120223110 We implement tools for constructing orthogonal de Bruijn sequences of this kind under small σ and k. K´ asa’s orthogonal de Bruijn sequences collection is interesting since it provides a simple construction of σ − 1 orthogonal de Bruijn sequences. In terms of routing in interconnection networks, this corresponds to a set of disjoint Hamiltonian cycles in the line graph, de Bruijn graph Gσ,k+1 . Unfortunately, verifying and/or disproving K´ asa’s conjecture is not easy mostly because of the abundance of de Bruijn sequences to be tested, and we were unable to find additional empirical evidence to support or disprove the conjecture. We also define a new class of orthogonal de Bruijn sequences. Specifically, we note that there seem to exist sets of orthogonal de Bruijn sequences s1 , . . . , sσ−1

Constructing Orthogonal de Bruijn Sequences

603

with the property that, for each k-mer X which is followed in si by the symbol y (i.e., yielding the k+1-mer Xy), in sequence si+1 the k-mer X will be followed by the symbol y + 1(mod σ). For example, the sequences below have this property: 00102231121320330 00232103011331220 00313020123322110 These sets have the property of being uniquely determined by a single starting sequence. We have no guarantee that any de Bruijn sequence will yield such a set of sequences, although if they exist, the construction itself guarantees their orthogonality. If a randomly-generated de Bruijn sequence has this property, construction of such a set of orthogonal sequences is immediate. Unfortunately, the probability that a random de Bruijn sequence will have this property seems to decrease exponentially with increasing σ and k. Through extensive computational experiments, we have verified the existence of such sets of sequences up to σ = 9 for k = 2 and up to k = 6 for σ = 4. 3.2

Counting Eulerians

The matrix-tree theorem for graphs (see [21,22,23]), states that the number of Eulerian circuits in a labeled Eulerian digraph G is equal to ε(G) = T · Πv∈G (deg(v) − 1)! where T is the number of (directed) spanning trees rooted at any particular vertex of G. Note that the number of directed spanning trees in an Eulerian digraph does not depend on the vertex where it is rooted. Plugging in the terms k−1 T = σ σ −k , deg(v) = σ, and |V (G)| = σ k−1 , it follows that number of Eulerian k−1 k−1 k−1 · σ σ −k or (σ!)σ /σ k . circuits in Gσ,k is [(σ − 1)!]σ To count the number of Eulerian circuits in Kautz graphs, and thus the number of Kautz sequences, van Aardenne-Ehrenfest’s formula [22] again can be applied here. The interesting part is to calculate the number of rooted directed spanning tree on Kautz graphs. Here we use an induction on k; the inductive base is easily derived from Cayley’s formula T (Kzσ,1 ) = σ σ−1 . The inductive hypothesis can be derived from Knuth’s line graph spanning tree recursion [14]: T (L(G)) = T (G)Πv∈G deg+ (v)deg



(v)−1

where T (G) denote the number of spanning trees on G; L(G) is the line graph k−1 of G. It follows that T (Kzσ,k+1 ) = T (Kzσ,k ) · (σ − 1)(σ−2)σ(σ−1) ; note that k is decreased by 1 at deducing the final form. It follows that number of Eulerian k−1 circuits in Kzσ,k is σ σ−2 [(σ−1)!]σ(σ−1) /(σ−1)σ+k−1 . The formula can also be derived by the graph spectral theory in characteristic polynomial and permanent of the arc-graph [18]. The abundance of Kautz sequences suggests an orthogonal collection of Kautz sequences with large size.

604

4

Y.-L. Lin et al.

Heuristic Construction of Orthogonal Sequences

The algorithm described in Theorem 8 gives no fewer than σ/2 orthogonal de Bruijn sequences. In order to find the σ − 1 − σ/2 orthogonal sequences conjectured to remain, we extend the idea used in the proof of Theorem 8. Note that the algorithm, in visiting every vertex v of G, expects the wiring graph of v to be Hamiltonian connected. Failing to satisfy the condition, the algorithm stops and reports the certified orthogonal sequences. However, it is possible (and generally the case, beyond the first σ/2 sequences) that an orthogonal sequence can only be found by rewiring two or more vertices simultaneously, while fixing any one of these vertices alone does not render an orthogonal sequence. Thus, the refined approach rewires one vertex at a time, according to the temporarily fixed matches of other vertices of the graph, only this time, instead of trying to connect all paths incident to the vertex in one step, the algorithm picks a good wiring according to a heuristic, and continues to proceed to other vertices. Hopefully, in later stages of the algorithm, some vertex can be rewired and lead to an disjoint Eulerian cycle, rendering another orthogonal sequence. To justify a good wiring, recall that the wiring graph of v, a digraph Wv , in essence, defines the permissible connectivity between the paths Q = {P1 , P2 , . . . , Pσ } incident to vertex v. Furthermore, a permissible (perfect) matching between σ pairs of in-edges and out-edges of v connects the paths of Q, forming disjoint cycles made of Q. Originally, the idea is to find a matching that forms a single loop. Here we define a good match is the one that maximizes the length of the smallest cycle. A good wiring makes sure that more disconnected vertices on the smallest cycle have a higher chance being fixed in the later stage. In order to find an orthogonal de Bruijn sequence, no disjoint cycle left on the graph is allowed. We have experimented with other heuristics, such as maximizing the longest cycle; however, these approaches do not successfully lead to finding all σ − 1 orthogonal de Bruijn sequences as efficiently. This algorithm allows us to efficiently find de Bruijn sequences up to useful values of σ and k; the running times for our algorithm finding σ − 1 orthogonal de Bruijn sequences for various values of σ and k are illustrated in Figure 3. It is straightforward to apply our algorithm to find orthogonal Eulerians in Kautz graphs as well. When applied to the Kautz graph, thus, we obtain orthogonal Kautz sequences. The first σ−2 Kautz sequences are found by the algorithm in time somewhat less than that of the equivalent de Bruijn sequences; however, the algorithm frequently fails to find the (σ − 1)st Kautz sequence, because the final sequence is entirely determined by the preceding sequences. Figure 3 presents runtimes for finding σ − 1 orthogonal Kautz sequences for realizable values of σ and k. Based on the empirical results of this program, we conjecture that Ω(Kzσ,k ) = σ − 1 for all σ ≥ 4, and, moreover, that this is true for σ = 3 when k ≥ 5. A Python implementation of this algorithm, as well as sets of orthogonal de Bruijn and Kautz sequences, are available at our website http://www.algorithm.cs.sunysb.edu/orthogonal.

Constructing Orthogonal de Bruijn Sequences

605

de Bruijn sequences max total seq σ length k 2 26 16,900 12 19,008 3 7 14,406 4 5 12,500 5 4 12,288 6 3 4,374 7 3 13,122 8 3 39,366 9

Fig. 3. (left)Run-time of our algorithm for various values of σ and k, finding both orthogonal de Bruijn and Kautz sequences. (right) Largest achievable value of σ for each fixed k using our heuristic, given a small running time limit (1000 seconds).

5

Open Problems

We have proven that large families of orthogonal de Bruijn sequences of any order exist for all σ ≥ 3. Further, we give optimal constructions for a large number of finite cases, particular those of interest in synthetic biology. Several open problems remain. We say that two order-k de Bruijn sequences are d-orthogonal if they contain no (k + d)-mer in common, generalizing the notion beyond d = 1. The most compelling problem concerns tightening the bounds on the number of d-orthogonal (σ,k) de Bruijn sequences as a function of d, σ, and k. In particular, we know very little for d > 1. We also envision a second class of diverse de Bruijn sequence families, designed to identify motifs from fragment length assays (e.g. electrophoresis) instead of sequencing. Suppose we seek to identify the cutter sequence of a specific restriction enzyme. If we design two sequences such that the resulting fragment lengths are distinct for each possible cutter, the cutter identity follows directly from observation. The applications for these sequences are more speculative than those comprising the body of this paper, but the algorithmics appear interesting.

References 1. Bermond, J.-C.: Hamiltonian decompositions of graphs, directed graphs and hypergraphs. Ann. Discrete Math. 3, 21–28 (1978); Pr´esent´eau Cambridge Combinatorial Conf., Advances in Graph Theory , Trinity College, Cambridge, England (1977) 2. Bermond, J.-C., Darrot, E., Delmas, O., Perennes, S.: Hamilton circuits in the directed wrapped butterfly network. Discrete Applied Mathematics 84(1), 21–42 (1998)

606

Y.-L. Lin et al.

3. Bond, J., Iv´ anyi, A.: Modelling of interconnection networks using de bruijn graphs. In: Iv´ anyi, A. (ed.) Third Conference of Program Designer, Budapest (1987) 4. Bugl, H., Danner, J.P., Molinari, R.J., Mulligan, J.T., Park, H.-O., Reichert, B., Roth, D.A., Wagner, R., Budowle, B., Scripp, R.M., Smith, J.A.L., Steele, S.J., Church, G., Endy, D.: DNA synthesis and biological security. Nature Biotechnology 25, 627–629 (2007) 5. Coleman, J.R., Papamichial, D., Futcher, B., Skiena, S., Mueller, S., Wimmer, E.: Virus attenuation by genome-scale changes in codon-pair bias. Science 320, 1784–1787 (2008) 6. Czar, M.J., Anderson, J.C., Bader, J.S., Peccoud, J.: Gene synthesis demystified. Trends in Biotechnology 27(2), 63–72 (2009) 7. de Bruijn, N.G.: A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen 49, 758–764 (1946) 8. Gibson, D., et al.: Creation of a bacterial cell controlled by a chemically synthesized genome. Science (2010), doi:10.1125./science.1190719 9. Fleischner, H., Jackson, B.: Compatible euler tours in eulerian digraphs. In: Cycles and Rays, Proceeding Colloquium Montreal, 1987. ATO ASI Ser. C, pp. 95–100. Kluwer Academic Publishers, Dordrecht (1990) 10. Golomb, S.W.: Shift Register Sequences. Holden-Day (1967) 11. Good, I.J.: Normal recurring decimals. J. London Math. Soc. 21, 167–172 (1946) 12. K´ asa, Z.: On arc-disjoint hamiltonian cycles in de Bruijn graphs. CoRR abs/1003.1520 (2010) 13. Kautz, W.H.: Bounds on directed (d,k) graphs. In: Theory of Cellular Logic Networks and Machines, AFCKL-68-0668 Final Rep., vol. 24, pp. 20–28 (1968) 14. Knuth, D.E.: Oriented subtrees of an arc digraph. Journal of Combinatorial Theory 3, 309–314 (1967) 15. Montes, P., Memelli, H., Ward, C., Kim, J., Mitchell, J., Skiena, S.: Optimizing restriction site placement for synthetic genomes. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 323–337. Springer, Heidelberg (2010) 16. Mueller, S., Coleman, R., Papamichail, D., Ward, C., Nimnual, A., Futcher, B., Skiena, S., Wimmer, E.: Live attenuated influenza vaccines by computer-aided rational design. Nature Biotechnology 28 (2010) 17. Ronse, C.: Feedback Shift Registers. Springer, Berlin (1984) 18. Rosenfeld, V.R.: Enumerating Kautz sequences. Kragujevac Journal of Mathematics 24, 19–41 (2002) 19. Rowley, R., Bose, B.: Edge-disjoint Hamiltonian cycles in de Bruijn networks. In: Distributed Memory Computing Conference, pp. 707–709 (1991) 20. Rowley, R., Bose, B.: On the number of arc-disjoint Hamiltonian circuits in the de Bruijn graph. Parallel Processing Letters 3(4), 375–380 (1993) 21. Tutte, W.T.: The dissection of equilateral triangles into equilateral triangles. Mathematical Proceedings of the Cambridge Philosophical Society 44, 463–482 (1948) 22. van Aardenne-Ehrenfest, T., de Bruijn, N.G.: Circuits and trees in oriented linear graphs. Simon Stevin: Wisen Natuurkundig Tijdschrift 28, 203–217 (1951) 23. West, D.: Introduction to Graph Theory, 2nd edn. Prentice-Hall, Englewood Cliffs (2000)