Automated Design of Assemblable, Modular, Synthetic Chromosomes

Sarah M. Richardson(1,2), Brian S. Olson(3), Jessica S. Dymond(1,4), Randal Burns(6), Srinivasan Chandrasegaran(5), Jef D. Boeke(1,4), Amarda Shehu(3), and Joel S. Bader(1,7)

1 High Throughput Biology Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
2 McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
3 Department of Computer Science, George Mason University, Fairfax, VA 22030
4 Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
5 Department of Environmental Health Sciences, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
6 Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218
7 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218
Abstract. The goal of the Saccharomyces cerevisiae v2.0 project is the complete synthesis of a re-designed genome for baker’s yeast. The resulting organism will permit systematic studies of eukaryotic chromosome structure that have been impossible to explore with traditional gene-at-a-time experiments. The efficiency of chemical synthesis of DNA does not yet permit direct synthesis of an entire chromosome, although it is now feasible to synthesize multi-kilobase pieces of DNA that can be combined into larger molecules. Designing a chromosome-sized sequence that can be assembled from smaller pieces has to date been accomplished by biological experts in a laborious and error-prone fashion. Here we pose DNA design as an optimization problem and obtain optimal solutions with a parallelizable dynamic programming algorithm.
1 Introduction
Synthetic biology requires careful design of nucleotide sequences. Practical applications of synthetic biology include modifying proteins through amino acid changes and redesigning multi-gene pathways. Synthetic biology enables the study of genes that are difficult to manipulate by traditional means, such as ancestral genes inferred from phylogenetic studies of extant biological sequences. Design tools for synthetic DNA sequences have been limited primarily to gene-length sequences. For example, GeneDesign assists gene editing and synthesis at the physical level of nucleotides and oligos that can be ordered and used in-house for inexpensive gene assembly [1]. Other software packages perform logical-level checks of the syntax and grammar of well-formed transcriptional units, such as whether genes have promoters, start codons, non-inhibitory secondary structures, and properly located termination sequences [2,3].

While full genomes are an attractive synthesis target, scaling up synthetic routes from genes (thousands of nucleotides) to genomes (millions of nucleotides) is proving difficult [4]. Some groups have avoided this problem by focusing on genomes so small that they can be put together exactly as though they were merely very large genes, as the Endy group did when refactoring part of the 39 kilobase (kb) genome of bacteriophage T7 [5]. This approach is limited to viral genomes [6] or bacterial plasmids and is impractical for groups interested in larger prokaryotes or eukaryotes. Other groups are using a top-down approach, which involves taking an existing genome and editing it in place as the Blattner group did for Escherichia coli [7], with edits usually limited to deletions rather than insertions or substitutions. Alternatively, a bottom-up approach was used to synthesize the entire bacterial Mycoplasma genitalium genome [8]. Overlapping oligos were assembled into larger cassettes, which were combined in yeast into a complete genome. Ultimately the synthetic genome must be transplanted as a complete genome into the host cell, which has been achieved for natural but not synthetic DNA [9]. The obligate parasite M. genitalium has a single 582 kb circular genome, much smaller than the 4.7 megabase (Mb) genome of E. coli. The bottom-up approach is unlikely to scale to larger synthetic targets and delays integration and testing to the very end.

Our group has launched an effort to synthesize the genome of the yeast Saccharomyces cerevisiae. Yeast has a 12 Mb genome comprising sixteen linear chromosomes, ten of which are each larger than the entire genome of M. genitalium, and the smallest of which is five times larger than the genome of T7. As a eukaryote, yeast has more chromosomal features than viral and bacterial genomes.

R. Wyrzykowski et al. (Eds.): PPAM 2009, Part II, LNCS 6068, pp. 280–289, 2010. © Springer-Verlag Berlin Heidelberg 2010
Our synthetic target is an edited version of the native yeast genome that includes the biological equivalent of debug statements that will allow us to identify and remove the DNA-equivalent of dead code and probe the cryptic function of non-protein-coding regulatory sequences, many of which await discovery by methods such as ours. The synthetic strategy combines bottom-up synthesis with in-place editing, which permits it to scale to whole chromosomes and genomes. In a hierarchical procedure, DNA sequences of 60 nucleotides are ordered and combined using PCR and genetic engineering into larger synthetic pieces that can then be introduced into yeast to replace cognate wild-type sequences through homologous recombination. The experimental workflow is sufficiently streamlined for adaptation to undergraduate teaching laboratories [10]. This strategy should therefore be a valuable addition to the field of synthetic biology. Our synthetic strategy, and genetic engineering in general, requires the use of restriction enzymes to cut and recombine DNA molecules at enzyme-specific recognition sites, also termed restriction or cut sites. Requirements for the occurrence of suitable restriction sites introduce constraints on any target sequence we wish to synthesize. If a synthetic target does not satisfy the constraints, it is possible to modify the target by editing its DNA sequence, at the cost of
possibly introducing unwanted and unpredictable changes to biological function. Restriction enzymes have varying prices, and less expensive enzymes often perform better. Furthermore, solutions with roughly equal spacing between cut sites are preferred to solutions with widely varying spacings. Optimal design has been a practical problem for our in-house project because the combinatorial complexity prevents human experts from reliably generating acceptable error-free solutions, and there are too many sub-optimal solutions for a brute-force computational enumeration.

Our design requires that a chromosome be split into 10 kb chunks delimited by restriction enzyme cut sites. We permit the chunk boundaries to vary within a 1000 nt window to include protein-coding regions where we can introduce restriction enzyme cut sites by synonymous recoding of protein-coding sequences. The yeast genome is 70% protein-coding, and each 1 kb interval contains approximately 700 protein-coding nucleotides, or codons for 233 amino acids. Due to genetic code redundancy, a restriction enzyme with a 6-bp recognition site will usually recognize at least 6 different amino acid pairs (three reading frames, two directions). The probability that all 6 pairs out of the $20^2 = 400$ total possible pairs are absent in a 233 amino acid sequence is approximately $e^{-233 \times 6/400}$, or 3%. This theoretical expectation that each possible 6-cutter is individually feasible for each cut site is borne out in practice. With over 100 restriction enzymes with 6-nt recognition readily available, there are over $100^{26}$ possible solutions for a small yeast chromosome. A larger chromosome requiring 130 cut sites could have a search space of $100^{130}$ solutions. We found that top-down, greedy approaches are inefficient because feasible solutions are sparse, and even algorithms with long look-aheads lead to dead ends.
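The back-of-envelope estimates in this paragraph can be checked numerically. The snippet below is only a sketch of that arithmetic: the function names are ours, and the exponential is the usual approximation for the chance that none of the matching amino acid pairs occurs in a window, treating pair occurrences as independent.

```python
import math

TOTAL_PAIRS = 20 ** 2      # ordered amino acid pairs: 400
RESIDUES_PER_WINDOW = 233  # ~700 coding nt per 1 kb window, divided by 3
PAIRS_PER_SITE = 6         # pairs matched by one 6-bp site (3 frames x 2 strands)


def prob_site_infeasible(residues=RESIDUES_PER_WINDOW,
                         pairs_per_site=PAIRS_PER_SITE,
                         total_pairs=TOTAL_PAIRS):
    """Approximate probability that none of the matching pairs occurs in the window."""
    return math.exp(-residues * pairs_per_site / total_pairs)


def log10_search_space(enzymes, cut_sites):
    """log10 of enzymes**cut_sites, the naive count of enzyme assignments."""
    return cut_sites * math.log10(enzymes)


if __name__ == "__main__":
    print(f"P(infeasible) ~ {prob_site_infeasible():.3f}")
    print(f"small chromosome: 10^{log10_search_space(100, 26):.0f} assignments")
    print(f"large chromosome: 10^{log10_search_space(100, 130):.0f} assignments")
```

Working in log10 keeps the search-space sizes readable and avoids formatting very large integers.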
Here we formulate optimal design as a constrained optimization problem for a combined cost reflecting the edit distance and the efficiency of the synthetic strategy. We present an algorithm that computes the optimal solution. The algorithm uses a parallelizable indexing step to catalog all edits that are unlikely to affect gene function (i.e., those that involve purely synonymous base substitutions). Next, the algorithm uses dynamic programming to divide the search into optimal sub-problems that can be computed in parallel. It uses dead-end elimination to remove sub-optimal paths from the search tree. A recent human design for 90 kb assisted with available computational tools required over 40 man-hours of work. Scaling to the 12 Mb yeast genome at this rate would require nearly 3 years of dedicated work by this expert. In contrast, our implementation takes 2.5 minutes for the same 90 kb region, roughly a 1000× speed-up, and only 5 to 6 hours for the entire genome. Furthermore, the algorithm produces output that is superior to all of our expert-generated results, allowing us to quickly create several plans of action for inspection and evaluation, and perhaps for concurrent synthesis. The algorithm is implemented in Perl and Python for compatibility with our existing synthetic biology software and for access to core I/O and visualization functionality from the Generic Model Organism Database project [11].
2 Biological Constraints
Restriction enzymes are proteins that cleave DNA at exquisitely predictable recognition sites. We use a set of 121 commercially-available enzymes from REBASE [12], a restriction enzyme database. Genetic engineering techniques often require that an enzyme has one, and only one, recognition site in a large piece of DNA. Sites must therefore be rare (to avoid multiple cuts) but not too rare (to avoid the absence of a site altogether). Synthetic biology offers the ability to manipulate the distribution of recognition sites in target DNA molecules by recoding the native sequence to add or remove recognition sites. That is, we can place n − 1 restriction enzyme recognition sites very precisely in our synthetic sequence, and then, rather than synthesizing a single long molecule, we can synthesize and assemble n shorter molecules, cut them with the appropriate enzymes, and ligate them together into the long molecule we wish to obtain.

For instance, the enzyme BamHI will cut any double-stranded DNA molecule that contains its recognition site 5′-GGATCC-3′, striking asunder both strands between G and GATCC:

    5′-G|GATCC-3′
    3′-CCTAG|G-5′

The two double-stranded cleavage products can be ligated back to their original partners in a process called re-ligation, or ligated to other pieces of DNA that have been cut with enzymes that leave the same “overhang”. In the case of BamHI, cleavage leaves two DNA molecules, both with overhangs that read 5′-GATC-3′. Because this overhang sequence is palindromic (note that palindromes in DNA are distinct from natural language due to antiparallel DNA strands), the two species of cleavage products may ligate to themselves or to each other, yielding a mix of three ligation products. Mixed products are undesirable because they reduce the yield of the desired product and produce potentially inhibitory or deleterious side-products.
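To make the notion of a DNA palindrome concrete, the sketch below tests whether an overhang equals its own reverse complement and counts the ligation products it allows. The helper names and the 5-nt example overhangs are illustrative assumptions, not part of the published software; each overhang is written 5′ to 3′ on the strand that carries it.

```python
from itertools import combinations_with_replacement

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(overhang):
    """A DNA palindrome reads the same on both antiparallel strands."""
    return overhang == reverse_complement(overhang)


def can_ligate(a, b):
    """Two single-stranded overhangs anneal when one is the reverse complement of the other."""
    return a == reverse_complement(b)


def ligation_products(ends):
    """All unordered pairs of fragment ends (self-pairs included) that can anneal.

    `ends` maps a fragment-end name to its overhang sequence.
    """
    names = sorted(ends)
    return [(x, y) for x, y in combinations_with_replacement(names, 2)
            if can_ligate(ends[x], ends[y])]


if __name__ == "__main__":
    bamhi = {"left": "GATC", "right": "GATC"}        # palindromic 4-nt overhang
    asymmetric = {"left": "GTCAC", "right": "GTGAC"}  # hypothetical 5-nt overhang pair
    print(ligation_products(bamhi))       # three products, as described for BamHI
    print(ligation_products(asymmetric))  # only the intended left-right junction
```

An odd-length overhang can never be its own reverse complement, because its central base would have to pair with itself.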
Enzymes leaving non-palindromic overhangs are much more desirable because they ensure a single canonical product from an assembly reaction. Our algorithm filters enzymes by relevant criteria, including price, but most importantly, by the kind of overhang left after cleavage. Of the 121 available enzymes, 55 are capable of generating non-palindromic cleavage sites. Not all enzymes are alike in efficiency, availability, or price, and the algorithm uses a real-valued cost to rank the overall performance of each enzyme. The algorithm also requires a chromosome whose protein-coding regions are annotated. The annotations are used to enforce a hard constraint imposed by our limited knowledge of biology; we have implemented a design rule that all DNA changes to introduce or remove restriction sites must be accomplished within protein-coding regions. Edits to non-protein-coding DNA are not permitted because intergenic and intronic sequences house yet-uncharacterized regulatory sequences that could be disrupted by a single base change. Algorithms to evaluate the cost of different edits therefore require knowledge of the boundaries between introns (sequences transcribed from DNA to RNA but then spliced out), exons (sequences remaining in the processed RNA transcript including all protein-coding sequences), and intergenic sequences (DNA sequence that is not transcribed but which may contain important regulatory or structural features).
Feasible edits to protein-coding sequences are achieved by substituting synonymous codons, three-nucleotide sequences that encode the same amino acid. Synonymous edits exploit the redundancy of the genetic code. Isolated edits that leave the amino acid sequence unchanged usually do not affect protein expression or activity. Widespread edits, however, are more likely to change RNA secondary structure or affect protein activity by changing the speed of translation, which can in turn alter protein levels or generate mis-folded protein. When genes overlap, edits must retain the amino acid sequences of both proteins. We use the standard Generic Feature Format (.gff), which consists of tab-delimited entries annotating features followed by actual nucleotide sequence in FASTA format. GFF files for many different organisms are easily obtained from public genome repositories on the Internet; the yeast chromosomes were taken from the Saccharomyces Genome Database [13].
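The idea of a synonymous edit can be illustrated with a brute-force sketch that enumerates every recoding of a short in-frame coding sequence and keeps those containing a given recognition site. This is not the paper's indexing method (which uses a suffix tree over translated sites); the function names are ours, the standard genetic code is assumed, and exhaustive enumeration is only practical for a handful of codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
AA_TO_CODONS = {}
for _codon, _aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(_aa, []).append(_codon)


def translate(cds):
    """Translate an in-frame, upper-case coding sequence (length divisible by 3)."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_recodings(cds):
    """Yield every nucleotide sequence that encodes the same peptide as `cds`."""
    choices = [AA_TO_CODONS[CODON_TO_AA[cds[i:i + 3]]]
               for i in range(0, len(cds), 3)]
    for combo in product(*choices):
        yield "".join(combo)


def recodings_with_site(cds, site):
    """Synonymous recodings of `cds` that contain `site` in any reading frame."""
    return [s for s in synonymous_recodings(cds) if site in s]


if __name__ == "__main__":
    # Met-Gly-Ser-Lys: recoding Gly-Ser as GGA TCC creates a BamHI site (GGATCC).
    for seq in recodings_with_site("ATGGGTAGCAAA", "GGATCC"):
        print(seq, translate(seq))
```

Because the substring test ignores codon boundaries, sites that straddle codons in the other two reading frames are found as well.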
3 Data Collection and Indexing
Once the algorithm is provided with a list of enzyme recognition sites and an annotated chromosome sequence, it generates a database of every single restriction enzyme site in the chromosome. Every intergenic sequence is parsed for existing recognition sites. Those that are found are treated as immutable; they may be used as boundaries if they are fortuitously placed, but nothing can be done to improve their placement or the overhangs they leave. A suffix tree is created from all possible 6-frame translations of restriction enzyme recognition sites, such that each node in the tree is an amino acid string that may be reverse translated to be a recognition site. Every exonic sequence in the chromosome is then searched with the suffix tree both for existing recognition sites and for sites where a recognition site could be introduced without changing the protein sequence of the gene. As long as they occur within protein-coding genes, existing and potential recognition sites may be manipulated to yield any of several different overhangs, all of which are computed by the algorithm. It is necessary that every extant site be indexed so that later in the algorithm, when potential sites are considered, the number of sites that must be modified is a known contribution to the cost. The construction of the enzyme database may be executed in parallel because parsing each gene and intergenic sequence for restriction sites is completely independent; no processing element working on a region need communicate with any other processing element. We implemented this collection step in multi-threaded Perl for compatibility with existing code.
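The independence of regions during indexing can be sketched as follows. This is an illustration only: it swaps in a naive string scan for the suffix tree, uses Python's standard thread pool rather than the multi-threaded Perl described above, and covers only existing sites with unambiguous recognition sequences. The region tuples and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(region, enzymes):
    """Scan one region for existing recognition sites on both strands.

    `region` is (name, chromosome_offset, sequence); returns a list of
    (enzyme, chromosome_position, strand) tuples sorted by position.
    """
    _name, offset, seq = region
    hits = []
    for enzyme, site in enzymes.items():
        patterns = {site: "+"}
        rc = reverse_complement(site)
        if rc != site:  # non-palindromic sites must be searched on both strands
            patterns[rc] = "-"
        for pattern, strand in patterns.items():
            pos = seq.find(pattern)
            while pos != -1:
                hits.append((enzyme, offset + pos, strand))
                pos = seq.find(pattern, pos + 1)
    return sorted(hits, key=lambda hit: hit[1])


def index_regions(regions, enzymes, workers=4):
    """Index every region independently; no worker needs another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda r: find_sites(r, enzymes), regions))
    return {region[0]: hits for region, hits in zip(regions, results)}


if __name__ == "__main__":
    enzymes = {"BamHI": "GGATCC", "FokI": "GGATG"}
    regions = [("geneA", 0, "AAGGATCCTT"), ("intergenic1", 10, "CATCCAAGGATG")]
    for name, hits in index_regions(regions, enzymes).items():
        print(name, hits)
```

In CPython, threads mainly show the structure of the decomposition; a process pool or separate machines would be needed for real CPU parallelism on a pure-Python scan.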
4 Landmark Selection
The current implementation of our global assembly scheme requires restriction enzyme sites as landmarks that divide the chromosome into segments of about 10 kb, which may then be built up from oligos [10]. To enforce hierarchical modularity, some of these landmarks will have additional uniqueness requirements. Landmark 1 sites will be 10 kb landmarks that can also divide the chromosome
into 100 kb segments; landmark 2 sites will be 10 kb landmarks that can also divide the chromosome into 30 kb segments. Every other 10 kb landmark is a landmark 3 site (Fig. 1). The innermost landmark 3 sites are not permitted to occur anywhere else within their flanking landmark 1 or 2 sites; likewise, the landmark 2 sites may not appear anywhere within their flanking landmark 1 sites, and any consecutive landmark 1 sites may not be the same. The overhangs left by any two consecutive landmark sites should be different and non-palindromic. These constraints prevent enzymes from cutting the DNA at unwanted locations or yielding a mix of ligation products from cross-reactions.

The goal is to select the optimal permutation of landmark enzymes for a full complement of evenly spaced landmark sites. From now on we will refer to a permutation of restriction enzyme recognition sites and locations as a plan. A valid plan must meet the adjacency and uniqueness constraints described above and is then given a real-valued cost reflecting editing costs and penalties. The optimization goal is to identify the plan with the lowest cost. Each landmark separates two regions of size L (L = 100 kb for landmark 1 sites, 30 kb for landmark 2 sites, and 10 kb for landmark 3 sites). The algorithm checks the database for all eligible sites E within the upstream and downstream regions, a window of 2L. The cost for enzyme E at location X is

$$\mathrm{Cost}(E) = \ln \mathrm{Price}(E) + A \times \max\left[0, \frac{|X - X_0|}{L} - \Delta\right] + B_0 n_0 + B_1 n_1 + B_2 n_2 . \qquad (1)$$
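As a minimal sketch, Eq. 1 can be written as a small function using the parameter values given in this section (A = 15, Δ = 0.1, B0 = 0.1, B1 = 0.6, B2 = 1.1). We read the offset term as |X − X0| divided by the region size L, which matches the stated 10% tolerance; the function and argument names are ours.

```python
import math

# Parameter values stated in this section.
A = 15.0
DELTA = 0.1
B0, B1, B2 = 0.1, 0.6, 1.1


def landmark_cost(price, x, x0, region_len, n0=0, n1=0, n2=0):
    """Cost of placing an enzyme with unit price `price` at position `x` (Eq. 1).

    x0         -- ideal landmark position (midpoint of the 2L window)
    region_len -- L, the size of the regions the landmark separates
    n0, n1, n2 -- edited genes that are non-essential, slow-growth, essential
    """
    # Offsets within DELTA (10% of L) of the ideal position are not penalized.
    offset_penalty = A * max(0.0, abs(x - x0) / region_len - DELTA)
    return math.log(price) + offset_penalty + B0 * n0 + B1 * n1 + B2 * n2


if __name__ == "__main__":
    # Hypothetical landmark 3 site (L = 10 kb), 2 kb from the ideal position,
    # requiring an edit in one essential gene.
    print(landmark_cost(price=50.0, x=12_000, x0=10_000,
                        region_len=10_000, n2=1))
```

Taking the logarithm of the price means that a doubling in price always adds the same amount to the cost, regardless of the starting price.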
The optimal position of the landmark, X0, is at the midpoint of the 2L window. In order to penalize changes to critical genes more heavily, counts of edited genes are categorized as non-essential (n0), slow growth (n1), and essential (n2). The cost of the enzyme in US dollars per unit is Price(E). We use the parameters A = 15, Δ = 0.1, B0 = 0.1, B1 = 0.6, and B2 = 1.1. This objective function was selected to match generally with the intuition of biological experts. To keep costs on a uniform scale, enzyme prices are sensitive to fold-ratios and optimal positions are calculated relative to an allowed 10% variation. The fixed parameter values, and even the entire form of the objective function, are open to re-analysis subsequent to experimental tests of designed sequences.

4.1 Dynamic Programming
Brute-force enumeration of all valid plans fails because the time to enumerate grows exponentially as $O(m^n)$, where $m$ is the number of possible enzyme choices per landmark (55 commercially available enzymes capable of leaving non-palindromic overhangs) and $n$ is the number of landmarks required (scaling linearly with chromosome size, with one landmark per 10,000 nucleotides). For yeast chromosomes, $n$ ranges from approximately 25 to 180. We have developed an efficient approach that employs dynamic programming and dead-end elimination to run in $O(nm^6)$ time.

Since the chromosome is divided into 100 kb regions by landmark 1 sites (Fig. 1), the optimal plan for the chromosome additively combines the optimal plans for each of these regions. The problems of finding the optimal plan for each region are overlapping sub-problems because consecutive regions share landmark 1 sites. Optimal sub-structure and overlapping sub-problems are the hallmarks of dynamic programming [15].

[Fig. 1. Each sub-problem is bounded by a pair of flanking landmark 1 sites. Figure labels: Flanking Landmark 1 Pairs; Landmark 2 Pairs; Landmark 3 Pairs; ~100 kilobases; Nested Overlapping Sub-Problems.]

A few iterations of the algorithm’s progress through the dynamic programming matrix are displayed (Fig. 2). Columns 0 to n − 1 correspond to regions. Rows correspond to pairs of flanking landmark 1 enzymes. Each cell in the matrix maintains the optimal cost, opt-cost(i, j), for a particular row i and a particular column j. Each cell is initialized with an ‘x’ to represent an invalid or infinite cost. The costs are computed as

$$\text{opt-cost}(i, j) = \text{cost}(i, j) + \min_{\text{compatible } i'} \{\text{opt-cost}(i', j - 1)\} \qquad (2)$$
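The recurrence in Eq. 2 can be sketched as below. Here each column holds a dictionary from a flanking landmark 1 enzyme pair to the pre-computed cost of the best sub-plan for that region; a missing pair plays the role of an ‘x’ cell, and a pair is compatible with a predecessor only if the shared landmark uses the same enzyme. The function name and the example costs are hypothetical and are not the values shown in Fig. 2.

```python
import math


def solve_chromosome(cost):
    """Minimum-cost chain of flanking enzyme pairs across consecutive regions.

    cost[j] maps (left_enzyme, right_enzyme) -> sub-plan cost for region j.
    Returns (total_cost, [pair for region 0, pair for region 1, ...]).
    """
    n = len(cost)
    if n == 0:
        return 0.0, []
    opt = [dict() for _ in range(n)]   # opt[j][pair] = opt-cost(pair, j)
    back = [dict() for _ in range(n)]  # back[j][pair] = best predecessor pair
    opt[0] = dict(cost[0])
    for j in range(1, n):
        for (e3, e4), c in cost[j].items():
            best_prev, best_val = None, math.inf
            for (e1, e2), value in opt[j - 1].items():
                # Flanking constraint: the shared landmark must use one enzyme.
                if e2 == e3 and value < best_val:
                    best_prev, best_val = (e1, e2), value
            if best_prev is not None:  # otherwise the cell stays invalid
                opt[j][(e3, e4)] = c + best_val
                back[j][(e3, e4)] = best_prev
    if not opt[-1]:
        return math.inf, []            # no valid plan exists
    last = min(opt[-1], key=opt[-1].get)
    total = opt[-1][last]
    plan = [last]
    for j in range(n - 1, 0, -1):      # trace back from the optimum
        plan.append(back[j][plan[-1]])
    plan.reverse()
    return total, plan


if __name__ == "__main__":
    example = [
        {("BstEII", "DraIII"): 1.0, ("PasI", "BstEII"): 4.0},
        {("DraIII", "PasI"): 2.0, ("BstEII", "DraIII"): 3.0,
         ("DraIII", "BstEII"): 5.0},
        {("PasI", "BstEII"): 1.0, ("DraIII", "PasI"): 6.0},
    ]
    print(solve_chromosome(example))
```

Storing only valid cells keeps the inner loop proportional to the number of feasible predecessor pairs rather than to all possible enzyme pairs.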
In Eq. 2, cost(i, j) refers to the cost associated with a particular region j and a particular pair of enzymes i, and “compatible i′” refers only to those rows i′ that are compatible with row i. Compatibility here refers to the flanking constraint: the enzyme pair $(e_1, e_2)$ at column j − 1 is compatible with the enzyme pair $(e_3, e_4)$ at column j if and only if $e_2 = e_3$. The cost associated with the optimal plan for the chromosome will be found in the cell with the minimum value in the last column of the matrix. As in classic dynamic programming, the plan associated with the minimum cost can be recovered by tracing back from the optimum.

[Fig. 2. A small example of the dynamic programming algorithm. All cells were initialized as null. The optimal path is displayed in red. Cost is as computed by Eq. 2. The figure shows two snapshots of the matrix: rows are flanking landmark pairs drawn from BstEII, DraIII, and PasI, and columns are regions 0 to 2.]

Each sub-problem requires computing the optimal cost, opt-cost(i, j), for a given region and a given selection of enzymes for the flanking landmark 1 sites. A brute-force approach to the sub-problems yields a runtime of $O(nm^{10})$, where the exponent 10 refers to the total number of landmark sites: two for landmark 1, two for landmark 2, and six for landmark 3 (Fig. 1). With the sub-problem costs pre-computed, the complexity of the global problem becomes $O(nm^2)$ plus the cost of pre-computing the sub-problems. As we explain below, the complexity is actually bounded by the computation of the sub-problems. For this reason pre-computing the sub-problems is ideal as it facilitates parallelization.

Sites for landmarks 2 and 3 can be grouped into pairs, and a dynamic programming algorithm similar to the one described for landmark 1 sites can then be applied. The order of enzymes in a sub-plan matters for the cost, but not for the uniqueness constraints. Therefore the algorithm need only store the cost associated with combinations rather than permutations of the sub-plans. This yields an overall running time of $O(m^6)$ for each sub-problem, since each sub-problem contains $m^2$ choices for the flanking landmark 1 pair, $m^2$ choices for the landmark 2 pair, and $3m^2$ choices for the landmark 3 pairs. Furthermore, the number of sub-problems is linear with respect to the size of the chromosome, making the overall running time of our dynamic programming algorithm

$$O(\text{sub-problems} + \text{global problem}) = O(nm^6 + nm^2) = O(nm^6). \qquad (3)$$

4.2 Dead-End Elimination
Although $O(nm^6)$ time improves on exponential scaling, it is still inconvenient for m = 55 possible restriction enzymes. We have developed a dead-end elimination algorithm to reduce the effective size of m. We observed that landmarks often have enzyme choices that are not constrained by any other sites (Fig. 3). In this example, landmark site 9 may use enzymes AccI and BanI regardless of which enzymes are chosen for sites 8, 10, 11, 14, and 17. Since AccI and BanI are independent, either choice may be used regardless of all other landmarks. Therefore, there is no need to keep any enzyme choice that has a cost greater than the minimum of cost(AccI) and cost(BanI) (in this case AccI). For any landmark plan that uses BanI, DrdI, or FokI in landmark site 9, we can always improve the cost by substituting AccI; using BanI, DrdI, or FokI is guaranteed to produce a sub-optimal result. In practice, the dead-end elimination algorithm reduces the size of m by 25% or more on the yeast chromosomes considered in this work, speeding the run time by a factor of $(4/3)^6$, or about 5.6.

4.3 Parallel Implementation
Since the performance of the enzyme selection algorithm is bounded by the pre-processing of independent sub-problems, we used the PP (Parallel Python) module to facilitate parallelization across multiple independent machines. This module allows seamless addition of processor cores across multiple physical systems. This approach scales up to n cores, where n is the number of landmark 1 delimited regions in the chromosome. Each landmark 1 region is sent to a different core. For the biggest yeast chromosome, the run time decreased from 155 sec to 30 sec, a 5× speedup, using 16 cores in a cluster of quad-core 2.6 GHz Opteron processors with 4 GB of RAM each. In practice, the speed-up from parallelization is limited by the region with the highest effective m value, that is, the single region that takes the most time to compute.

[Fig. 3. The enzyme choice for site 9 is constrained by the choices for 8, 10, 11, 14, and 17. Using Dead-end Elimination, we can prune BanI, DrdI, and FokI from site 9. Candidates and costs shown for site 9: AccI −5, BanI −4, DrdI −3, FokI −4, SfiI −7; landmark spacing ~10 kb.]
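The effective m that bounds the slowest region is exactly what dead-end elimination (Sect. 4.2) reduces, so a minimal sketch of that pruning rule may help here. It assumes that an enzyme is independent at a site when it does not appear among the candidates of any site that constrains it; the site 9 costs follow Fig. 3, while the neighbouring candidate lists and the function names are illustrative.

```python
def dead_end_eliminate(site_costs, constraining_sites):
    """Prune enzyme choices that can never appear in an optimal plan.

    site_costs         -- dict: enzyme -> cost at this landmark site
    constraining_sites -- list of enzyme collections, one per site whose
                          choice constrains this site
    Returns the surviving enzyme -> cost mapping.
    """
    constrained = set().union(*constraining_sites)
    independent = [e for e in site_costs if e not in constrained]
    if not independent:
        return dict(site_costs)  # nothing can be safely pruned
    # Any choice costlier than the best independent enzyme can always be
    # replaced by that enzyme without violating a constraint.
    bound = min(site_costs[e] for e in independent)
    return {e: c for e, c in site_costs.items() if c <= bound}


def pruning_speedup(fraction_removed, exponent=6):
    """Run-time factor gained when m shrinks by `fraction_removed` in O(n m^6)."""
    return (1.0 / (1.0 - fraction_removed)) ** exponent


if __name__ == "__main__":
    site9 = {"AccI": -5, "BanI": -4, "DrdI": -3, "FokI": -4, "SfiI": -7}
    neighbours = [  # hypothetical candidate lists for the constraining sites
        ["DrdI", "FokI", "SfiI"],
        ["PfoI", "DraIII", "BstEII"],
        ["BseYI", "BsiYI", "AvaII"],
        ["DrdI", "FokI", "BseYI", "AvaII"],
    ]
    print(dead_end_eliminate(site9, neighbours))
    print(round(pruning_speedup(0.25), 1))
```

SfiI survives even though it is constrained, because its cost is below the bound set by AccI, so substituting AccI would not be an improvement.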
5 Results and Discussion
Manual design of even the smallest yeast chromosome is a laboriously slow and error-prone process that does not guarantee an optimal solution. We have transformed this synthetic biology design problem into a formal optimization problem which we solve with an efficient implementation using suffix-tree indexing, dynamic programming, and dead-end elimination. Our approach produces optimal designs with a 1000× speed advantage over human experts. The input is an annotated target chromosome and list of restriction enzyme specifications and costs. The output is a minimally edited synthetic target, and existing downstream software can convert this into oligonucleotides for ordering from a vendor. Designs generated by this algorithm are currently being synthesized and experimentally tested as part of a project to create a yeast cell with synthetic DNA [10]. We hope that our code will be useful to others planning similar projects at any scale. All source code is available from the authors’ website, www.baderzone.org, under an open source BSD license.

Acknowledgments. We thank Pamela Meluh, Yan Qi, and John Kloss for careful reading of the manuscript and discussion. S.M.R. was supported by Department of Energy Computational Science Graduate Fellowship DE-FG02-97ER25308. This project was supported in part by a Microsoft Research Award to J.S.B. and J.D.B., and grant MCB 0718846 from the National Science Foundation to J.D.B., J.S.B., and S.C.
References

1. Richardson, S.M., Wheelan, S.J., Yarrington, R.M., Boeke, J.D.: GeneDesign: rapid, automated design of multikilobase synthetic genes. Genome Res. 16, 550–556 (2006)
2. Cai, Y., Hartnett, B., Gustafsson, C., Peccoud, J.: A syntactic model to design and verify synthetic genetic constructs derived from standard biological parts. Bioinformatics 23, 2760–2767 (2007)
3. Villalobos, A., Ness, J.E., Gustafsson, C., Minshull, J., Govindarajan, S.: Gene Designer: a synthetic biology tool for constructing artificial DNA segments. BMC Bioinformatics 7, 285–293 (2006)
4. Czar, M.J., Anderson, J.C., Bader, J.S., Peccoud, J.: Gene synthesis demystified. Trends Biotechnol. 27, 63–72 (2009)
5. Chan, L.Y., Kosuri, S., Endy, D.: Refactoring bacteriophage T7. Mol. Sys. Bio. 1 (2005), doi:10.1038/msb4100025
6. Cello, J., Paul, A.V., Wimmer, E.: Chemical synthesis of poliovirus cDNA: generation of infectious virus in the absence of natural template. Science 297, 1016–1018 (2002)
7. Pósfai, G., Plunkett, G., Fehér, T., Frisch, D., Keil, G.M., Umenhoffer, K., Kolisnychenko, V., Stahl, B., Sharma, S.S., Arruda, M., Burland, V., Harcum, S.W., Blattner, F.R.: Emergent properties of reduced-genome Escherichia coli. Science 312, 1044–1046 (2006)
8. Gibson, D.G., Benders, G.A., Andrews-Pfannkoch, C., Denisova, E.A., Baden-Tillson, H., Zaveri, J., Stockwell, T.B., Brownley, A., Thomas, D.W., Algire, M.A., Merryman, C., Young, L., Noskov, V.N., Glass, J.I., Venter, J.C., Hutchison, C.A., Smith, H.O.: Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220 (2008)
9. Lartigue, C., Glass, J.I., Alperovich, N., Pieper, R., Parmar, P.P., Hutchison, C.A., Smith, H.O., Venter, J.C.: Genome transplantation in bacteria: changing one species to another. Science 317, 632–638 (2007)
10. Dymond, J., Scheifele, L., Richardson, S.M., Lee, P., Chandrasegaran, S., Bader, J.S., Boeke, J.D.: Teaching Synthetic Biology, Bioinformatics, and Engineering to Undergraduates: The Interdisciplinary Build-a-Genome Course. Genetics 181, 13–21 (2009)
11. Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., Lewis, S.: The Generic Genome Browser: A Building Block for a Model Organism Database. Genome Res. 12, 1599–1610 (2002)
12. Roberts, R.J., Vincze, T., Posfai, J., Macelis, D.: REBASE–enzymes and genes for DNA restriction and modification. Nucl. Acids Res. 35, D269–D270 (2007)
13. Fisk, D.G., Ball, C.A., Dolinski, K., Engel, S.R., Hong, E.L., Issel-Tarver, L., Schwartz, K., Sethuraman, A., Botstein, D., Michael, C.J.: Saccharomyces cerevisiae S288C genome annotation: a working hypothesis. Yeast 23, 857–865 (2006)
14. Mattson, T.G., Sanders, B.A., Massingill, B.L.: Patterns for Parallel Programming, 1st edn. Addison-Wesley, Reading (2004)
15. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. McGraw-Hill, New York (2001)