review
Massive parallelism, randomness and genomic advances
J. Craig Venter, Samuel Levy, Tim Stockwell, Karin Remington & Aaron Halpern
© 2003 Nature Publishing Group http://www.nature.com/naturegenetics
doi:10.1038/ng1114
In reviewing the past decade, it is clear that genomics was, and still is, driven by innovative technologies, perhaps more so than any other scientific area in recent memory. From the outset, computing, mathematics and new automated laboratory techniques have been key components in allowing the field to move forward rapidly. We highlight some key innovations that have come together to nurture the explosive growth that makes a new era of genomics a reality. We also document how these new approaches have fueled further innovations and discoveries.
In 1987 Victor McKusick and Frank Ruddle launched a journal titled Genomics. This was the first widespread use of the term, which was actually coined by T.H. Roderick of the Jackson Laboratory1. ‘Genomics’ was used to describe a field of science differing from genetics in its focus on the study of DNA from a broader standpoint, that of the entire complement of genetic material. Genomics would primarily consider the full haploid set of chromosomes in an organism rather than studying a single gene or a family of functionally or structurally related genes. As is now clear, genomics has built on many of the important discoveries in genetics, beginning with the elucidation of the double helix structure of DNA by Watson and Crick in the 1950s, the discovery of reverse transcriptase in the late 1960s, the discovery of recombinant DNA and restriction enzymes in the 1970s and the discovery of polymerase chain reaction (PCR) in the early 1980s. The method developed in the 1970s by Sanger et al.2 to sequence longer stretches of DNA, dideoxynucleotide sequencing, now known simply as Sanger sequencing, was a guiding force in the development of genomics. This technique was revolutionary at the time, and in fact, was the only method used to determine the base sequence of DNA for many years. It was the combination of these breakthroughs that allowed researchers to even contemplate sequencing a large eukaryotic genome, such as that of humans. Though the field of genetics was dominated by laborious hunts for single disease genes during the 1980s and early 1990s, this was also the era when active discussions and planning were underway for development of the large, multinational, publicly funded Human Genome Project (HGP). It soon became clear that the research community would need an additional outlet to publish what was sure to be a myriad of new discoveries from the rapid advances in genetics and the burgeoning field of genomics. 
In response to this need, a new journal was launched in 1992—Nature Genetics. The inaugural volume of the new journal contained, among other contributions, two articles describing the results of the first human genome test sequencing projects using the Applied Biosystems
373 automated DNA sequencer3,4. These papers detailed the work of a team at the National Institute of Neurological Disorders and Stroke (NINDS) that involved sequencing three cosmids each from human chromosome 19 and chromosome 4. The cosmids, from regions thought to be associated with myotonic dystrophy and Huntington disease, were deliberately selected with the help of collaborators who had long been looking for the genes associated with these diseases. These two projects were representative of the state of the art in the community at the time, which was focused on a strategy in which genomic sequencing commenced only after cosmid mapping had been done. When these projects began in 1989, the fledgling discipline of genomics was clearly more an abstraction than a practical endeavor. The HGP settled on a cosmid mapping strategy because the pervading assumption of the time was that clones larger than cosmids (35 kb) were not readily sequenceable because of the limitations of shotgun sequencing. In the late 1980s there was considerable debate and skepticism concerning the use of automated DNA sequencers. Much of this discussion was due to the belief that single machines were expected to sequence the genome. For example, early attention centered on claims from Japan about a million-base-pair DNA sequencing machine built by Wada and his team5,6. This team was confident that this machine would give Japan the ability to sequence the human genome, but ultimately the program was unsuccessful. Sequencers at the time were very limited; in fact, the initial model of automated DNA sequencers in the United States could handle only 16 templates per day and produce roughly 300 bp per template. The papers published in the inaugural issue of Nature Genetics were not revolutionary, but they did represent an important first step in the application of automated DNA sequence analysis to unknown areas of the human genome. 
This was a critical turning point for subsequent genome sequencing and required de novo analysis of raw human genomic DNA sequence where the gene content was unknown.
The Center for the Advancement of Genomics, 1901 Research Blvd., Rockville, Maryland 20850, USA. Correspondence should be addressed to J.C.V. (e-mail:
[email protected]).
nature genetics supplement • volume 33 • march 2003
For the next few years, genomics progressed linearly. The plan for the public genome project by the United States National Human Genome Research Institute (NHGRI) and Department of Energy (DOE) was to sequence six genomes by 2005: those of Escherichia coli, Saccharomyces cerevisiae (yeast), Drosophila melanogaster, Caenorhabditis elegans, mouse and human7. By 2003, however, at least 99 genomes have already been sequenced and published (Fig. 1), including the original six in the plan, and hundreds of others are in progress. Not surprisingly, this exponential growth in sequenced genomes contributed to the exponential growth in GenBank data (Fig. 2). The first four genomes sequenced and published were not part of the original NHGRI/DOE plan. Instead, they were completed by independent, not-for-profit institutes in the United States and Japan. The first genome completed by the publicly funded sequencing consortium was that of yeast8. The E. coli genome, the first to receive public funding, was not completed until 1997 (ref. 9).
In this review, we assess the changes that have caused the dramatic acceleration in sequencing genomes from the original plan of just six completed genomes by 2005. We highlight some key advances that have allowed researchers in both the private and public sectors of genomics to make remarkable progress in sequencing and analyzing the genomes of more organisms. Two dominant themes have driven such innovations in genomics: first, the advent and adoption of massively parallel systems, both in sequencing and computing; and second, the subtle, yet equally important, philosophical change in the way genomic projects have been conceived and organized. Early genomic projects were large and distributed and proceeded in a linear, methodical fashion, whereas current projects are based on smaller, multi-dimensional teams and are organized in a quality-controlled environment to take advantage of the flexibility of random sampling. In this review, we follow these themes from the development of expressed-sequence tags (ESTs) and bacterial artificial chromosomes (BACs), which were key breakthroughs in gene discovery efforts and for mapping and sequencing in clone-by-clone methods, through to whole-genome shotgun (WGS) sequencing.

Fig. 1 Genomes of living organisms sequenced between 1995 and 2002. Vertical axis: completed genomes. Key: eubacteria, eukaryote, Archaea.
1995: Haemophilus influenzae; Mycoplasma genitalium.
1996: Mycoplasma pneumoniae; Synechocystis; Saccharomyces cerevisiae; Methanococcus jannaschii.
1997: Bacillus subtilis; Borrelia burgdorferi; Escherichia coli K12; Helicobacter pylori 26695; Methanobacterium thermoautotrophicum; Archaeoglobus fulgidus.
1998: Aquifex aeolicus; Chlamydia trachomatis serovar D; Mycobacterium tuberculosis H37Rv; Rickettsia prowazekii; Treponema pallidum; Caenorhabditis elegans; Pyrococcus horikoshii.
1999: Chlamydophila pneumoniae CWL029; Deinococcus radiodurans; Helicobacter pylori J99; Thermotoga maritima; Aeropyrum pernix.
2000: Bacillus halodurans; Campylobacter jejuni; Chlamydia muridarum; Chlamydophila pneumoniae AR39; Chlamydophila pneumoniae J138; Mesorhizobium loti; Neisseria meningitidis MC58; Neisseria meningitidis Z2491; Pseudomonas aeruginosa; Ureaplasma urealyticum; Xylella fastidiosa 9a5c; Vibrio cholerae; Arabidopsis thaliana; Drosophila melanogaster; Halobacterium; Thermoplasma acidophilum; Thermoplasma volcanium.
2001: Anabaena; Agrobacterium tumefaciens str. C58 (Cereon); Agrobacterium tumefaciens str. C58 (U. Washington); Buchnera aphidicola sp. APS; Caulobacter crescentus; Clostridium acetobutylicum; Corynebacterium glutamicum ATCC 13032; Escherichia coli O157:H7 EDL933; Escherichia coli O157:H7; Listeria innocua; Lactococcus lactis; Listeria monocytogenes; Mycobacterium leprae; Mycoplasma pulmonis; Pasteurella multocida; Rickettsia conorii; Salmonella enterica subsp. enterica serovar Typhi; Salmonella typhimurium LT2; Sinorhizobium meliloti; Staphylococcus aureus subsp. aureus Mu50; Staphylococcus aureus subsp. aureus N315; Streptococcus pneumoniae R6; Streptococcus pneumoniae TIGR4; Streptococcus pyogenes M1 GAS; Yersinia pestis CO92; Encephalitozoon cuniculi; Guillardia theta; Homo sapiens (Celera); Homo sapiens (IHGSC); Pyrococcus abyssi; Sulfolobus solfataricus; Sulfolobus tokodaii.
2002: Bacillus anthracis; Bifidobacterium longum; Brucella melitensis; Brucella suis; Buchnera aphidicola sp. SGS; Chlorobium tepidum; Clostridium perfringens; Escherichia coli CFT073; Fusobacterium nucleatum; Leptospira interrogans; Mycobacterium tuberculosis CDC1551; Mycoplasma penetrans; Oceanobacillus iheyensis; Pseudomonas putida KT2440; Ralstonia solanacearum; Shewanella oneidensis; Shigella flexneri; Staphylococcus aureus subsp. aureus MW2; Streptomyces coelicolor; Streptococcus agalactiae; Streptococcus mutans; Streptococcus pyogenes MGAS315; Streptococcus pyogenes MGAS8232; Thermoanaerobacter tengcongensis; Thermosynechococcus elongatus; Wigglesworthia glossinidia; Xanthomonas axonopodis; Xanthomonas campestris; Yersinia pestis KIM; Anopheles gambiae; Ciona intestinalis; Fugu rubripes; Magnaporthe grisea; Mus musculus (Celera); Mus musculus (MGSC); Oryza sativa L. ssp. indica; Oryza sativa L. ssp. japonica; Plasmodium falciparum; Plasmodium yoelii yoelii; Schizosaccharomyces pombe; Methanopyrus kandleri; Methanosarcina acetivorans; Methanosarcina mazei; Pyrobaculum aerophilum; Pyrococcus furiosus.

Expressed-sequence tags
Throughout the 1980s, the process of discovering genes and producing DNA sequences was extremely labor-intensive and time-consuming. But the discovery in the early 1990s of a new way to detect genes radically changed the pace and scope of gene discovery. A paper published in 1991 (ref. 10) established the usefulness of expressed-sequence tags (ESTs) with the discovery of several hundred new genes. Though the use of ESTs has now become commonplace, before 1991 as the idea was being developed, the usefulness of an approach based on random, partial cDNAs was far from clear to many in the scientific community. Early in the planning phase of the HGP, Sydney Brenner and Paul Berg, as well as several other internationally recognized researchers, made strong arguments to include a large cDNA effort in the initial stage of the HGP. But many involved in this area of science maintained that mRNA expression would provide only a small number of highly abundant transcripts and
would leave most human genes undetected, and some speculated that as little as 8–9% of genes could be uncovered using the EST method11. It was an innovative idea using random sampling, motivated by attempts to annotate de novo human genome sequence, that ultimately proved this conventional wisdom wrong.
After generating the sequence data from chromosomes 19 and 4 in 1989, the NINDS team discovered that interpreting and annotating the sequences required more time than obtaining the sequence. Using de novo (meaning sequence-based rather than evidence-based) gene-prediction software of the era that applied a neural-network technique12, only 4 of 10 exons in the known gene ERCC1 were accurately predicted, and false predictions outnumbered correct exon hits10. Six-frame translations from all predicted exons were screened against GenBank data, but owing to the dearth of available data at the time, few matches were found, with the exception of exact matches to known cDNA clones. Therefore, to verify exons, PCR primers were synthesized and used for PCR amplification from human brain and placental cDNA libraries3. Amplified PCR products were then sequenced to verify the correct annotation and to provide evidence for transcription of the genes. It became clear that without sequenced cDNA clones from human libraries, the annotation of the human genome would be a slow, laborious, error-prone and very expensive task.

Fig. 2 The growth of GenBank since 1992. Cumulative sequences (in millions) are shown as filled diamonds. Cumulative base pairs of DNA (in billions) are shown as filled area. (See URLs.) Axes: DNA (billion bp) and sequences (millions) versus year, 1992 to 2002.

The key idea behind the EST publication was this: rather than sequencing 1,000 fragments from a cosmid clone, which would yield at most one human gene but could be entirely fruitless, why not do the same 1,000 sequencing reactions but target clones randomly picked from a cDNA library instead? With the unique automated DNA sequencing and bioinformatics capabilities of the NINDS lab in 1990, this idea could readily be tested. Mark Adams was recruited to do the experiment, which focused on neurotransmission by using a human brain cDNA library. When the results came off the sequencers and were checked against the database, it was clear that a winning idea had been born. Before publishing their results, the researchers concentrated on optimization of the methods and computational analysis of the data. They tested subtractive cDNA libraries to reduce redundancy and mapped a considerable number of the newly discovered genes back to the human genome to demonstrate the bi-directional nature of the method. The new method was named expressed-sequence tags, derived from the term sequence-tagged site13, as the group had successfully mapped sequences, derived from expressed genes, in the genome.
From the first 1,000 cDNA clones sequenced, 373 new human genes expressed in the brain were discovered. Though there had been previous small-scale attempts by other researchers to apply random cDNA sequencing, one early study on partial cDNA sequencing from skeletal muscle seemed to confirm early notions that only a few over-expressed transcripts would be found by cDNA sequencing14.
Although the results of the EST experiment initially met with mixed reviews, successful early applications, such as the discovery of new DNA mismatch-repair enzymes linked to colon cancer15,16 and substantial gene discovery in the plant field with Arabidopsis thaliana17, helped to earn its ultimate acceptance. Figure 3 shows the growth of EST sequencing, with more than 70 organisms now represented by more than 10,000 publicly available EST entries. From the initial deposit of 9,388 EST sequences to GenBank in 1992, today’s collection contains more than 15 million sequences, comprising over 7.6 billion base pairs across 484 organisms, making the EST method the principal method of gene discovery. Citation of Adams et al.10 has increased throughout the last decade, and Figure 4 shows the increase in the number of publications that report use of the method. It is clear that many of today’s most exciting genomic technologies owe at least a nod to the EST method (Fig. 5). One of the conclusions of Adams et al.10 was that ESTs had potential to form the primary basis of genome annotation. This has clearly been the case for annotating the human genome18,19, and ESTs contributed substantially to ordering and assembling the HGP human genome as well19.

Fig. 3 The growth of EST sequencing since 1992. Stacked columns indicate the cumulative number of organisms with over 10,000 ESTs in GenBank. Animals shown in white, plants in green, fungi in red, alveolate in blue, euglenozoa in purple, mycetozoa (slime mold) in yellow and red algae in gray. Axes: number of organisms versus year, 1993 to 2002.

Fig. 4 The number of publications listed in PubMed that contain the expressions ‘EST’ (shown as red squares; ref. 10), ‘BAC’ (green diamonds; ref. 93) and ‘YAC’ (blue triangles; ref. 94). Data shown are from the year in which the seminal paper describing each technique was published (for ‘EST’ and ‘BAC’) or from 1992 (for YAC). Each data point was ‘normalized’ for spurious matches by subtracting the number of occurrences of the search term for the year before the original publication. Data for 2002 are partial-year data. Axes: number of publications versus publication year, 1992 to 2002.

In this review, we use citation trees to elucidate ‘idea trails’ and thus identify significant research milestones that were reached after the publication of certain seminal articles. This approach is one indicator of how a particular article may have contributed to the subsequent development of new ideas in the scientific community. Figure 5 shows one such citation tree for the original paper by Adams et al. on ESTs10. This tree was generated by ranking all publications that cited the original article according to the number of times they themselves have been cited in subsequent biological literature. The targeted paper was cited 930 times between 1991 and 2002, and 49 citing publications have each been cited more than 150 times.
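The ranking step behind such a citation tree can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used to build Fig. 5: the function name `build_citation_tree`, the record fields and the manually assigned `concept` labels are all hypothetical, and the last record is an invented entry included only to show the cutoff. The other citation counts are those printed in Fig. 5.

```python
from collections import defaultdict


def build_citation_tree(citing_papers, min_citations=150):
    """Group highly cited citing papers by concept, most cited first."""
    ranked = sorted(citing_papers, key=lambda p: p["citations"], reverse=True)
    tree = defaultdict(list)
    for paper in ranked:
        # Keep only papers cited more than the cutoff (150 in the text).
        if paper["citations"] > min_citations:
            tree[paper["concept"]].append((paper["label"], paper["citations"]))
    return dict(tree)


citing_papers = [
    {"label": "Velculescu et al. 1995", "concept": "SAGE", "citations": 947},
    {"label": "Nicolaides et al. 1994",
     "concept": "gene discovery/functional assignment", "citations": 876},
    {"label": "Papadopoulos et al. 1994",
     "concept": "gene discovery/functional assignment", "citations": 1137},
    {"label": "Duggan et al. 1999", "concept": "cDNA microarray", "citations": 405},
    # Hypothetical record, not from the source; it falls below the cutoff.
    {"label": "hypothetical low-cited paper", "concept": "ESTs", "citations": 12},
]

tree = build_citation_tree(citing_papers)
for concept, papers in tree.items():
    print(concept, papers)
```

Sorting before grouping keeps each branch ordered by citation count, so the most influential follow-up work appears first within its concept.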
The tree is organized according to the major concepts that are addressed in these publications. It highlights the development of three distinct methods to detect levels of gene expression: the serial analysis of gene expression (SAGE) sequence tag method20, high-density oligonucleotide microarrays21,22 and cDNA microarrays23. These methods have been used extensively since their development because they can, in a highly parallel fashion, detect the expression of many genes. Coupled with an ongoing effort to sequence ESTs from different species (Fig. 5; refs. 24–30), these methods have enhanced our understanding of tissue-specific gene expression. The EST approach, in concert with gene maps31, has more accurately estimated gene numbers in humans, C. elegans and A. thaliana. The growth of sequence databases has improved the efficiency and sophistication of computational search methods32. Ultimately, the combination of greater amounts of sequencing information and computational analysis has contributed significantly to experiments aimed at determining the functions of individual genes and gene sets (Fig. 5; refs. 15,16,33–35).

Fig. 5 Citation tree for the original human EST paper by Adams et al.10. The 930 articles that have cited this publication were ranked according to the number of citations each received. The tree was then constructed from the concepts elaborated by the publications that have been cited more than 150 times. The citation frequency of each publication is shown in red.
Root: human EST - Adams et al. 1991 (ref. 10) (930)
cDNA microarray: human - Duggan et al. 1999 (ref. 23) (405)
computational methods: database searching - Altschul et al. 1994 (ref. 32) (393)
ESTs: C. elegans - Waterston et al. 1992 (ref. 24) (315); human - Okubo et al. 1992 (ref. 25) (307), Adams et al. 1993 (ref. 26) (240); A. thaliana - Newman et al. 1994 (ref. 28) (442); human - Adams et al. 1995 (ref. 29) (463), Hillier et al. 1996 (ref. 30) (247)
gene discovery/functional assignment: DNA mismatch repair - Papadopoulos et al. 1994 (ref. 15) (1137), Nicolaides et al. 1994 (ref. 16) (876); apoptosis - Fernandes-Alnemri et al. 1994 (ref. 33) (663); CNS development - Messersmith et al. 1995 (ref. 35)
genome sequencing: H. influenzae - Fleischmann et al. 1995 (ref. 37) (2294); human - Lander et al. 2001 (ref. 19) (1014), Venter et al. 2001 (ref. 20) (1258)
human STS/gene maps: Boguski et al. 1995 (ref. 92) (210); Schuler et al. 1996 (ref. 31) (667)
oligonucleotide microarray: human subset - Lockhart et al. 1996 (ref. 21) (805), Schena et al. 1996 (ref. 93) (531); yeast whole genome - Wodicka et al. 1997 (ref. 22) (437)
SAGE: Velculescu et al. 1995 (ref. 20) (947)

Whole-genome shotgun sequencing
With the formation in 1992 of The Institute for Genomic Research (TIGR), a not-for-profit research institute, EST sequencing grew substantially from the original effort at NIH. TIGR’s initial project was called the Human Gene Anatomy Project and was aimed at sequencing cDNA clones from libraries made from every major human tissue and organ. By establishing a high-throughput DNA factory to effectively generate and analyze the hundreds of thousands of EST sequences, this team had unknowingly created a new paradigm for genomic facilities that would eventually be the model for many large-scale DNA sequencing projects, including the human genome effort at Celera Genomics. A key component of this model was a new mathematical algorithm created by Granger Sutton. Sutton’s algorithm (packaged in the TIGR Assembler36) was designed to assemble ESTs into clusters to reduce the redundancy of sequences, enable assembly of full-length cDNA clones and provide better estimates for the total number of human genes. These innovations allowed TIGR to publish the first comprehensive catalog of human genes and chromosome maps29. Realizing that this algorithm was a powerful tool, the TIGR team began to seek ways to use it
more broadly in their research. Thus, the idea for whole-genome shotgun (WGS) sequencing was born, and it debuted with the publication of the genome of Haemophilus influenzae37.
As was the case with ESTs, the WGS innovation relied on a random sampling approach. Its widespread use for large projects is due to its suitability for massively parallel data collection and processing. Although shotgun sequencing of small segments of sequence had been used since the 1970s, this was the first time that shotgun sequencing had been used on the entire genome of a free-living organism. The group selected the genome of H. influenzae as the target both because its GC content is similar to that of the human genome and because they wanted to examine an organism about which virtually nothing was known at the genomic level. By the spring of 1995, the genome was complete, all gaps closed and annotated, with publication soon after37.
This paper was catalytic and enabled innovations in several other areas. As shown in Figure 6, the paper has been cited 2,294 times since its publication in 1995, and 53 of these citing publications themselves have been cited more than 150 times each. Among the major developments based on the whole-genome sequencing of H. influenzae was the sequencing of other prokaryotic species. This increase in sequenced genomes stimulated comparative sequence analysis, which in turn significantly aided in the functional annotation of genes. This was done primarily in the prokaryotes owing to the wealth of genome data available38–41 but later included D. melanogaster, C. elegans and yeast42. More recently these comparative methods have been applied to the human and mouse genomes43,44 and provide a means of characterizing protein function by virtue of identifying orthologs across several genomes45. The complete genome sequences of model organisms such as yeast allow the selective and sequential ‘knock-out’ of individual genes as a means of establishing a molecular basis for phenotype46. Such studies are clearly aided by high-throughput technologies, such as cDNA microarrays to monitor yeast gene expression47, as well as the promise of highly parallel protein analysis48,49. Furthermore, the use of sequenced genomes as a means to better understand the molecular biology of organisms is evident in the H. influenzae citation tree; for example, numerous subsequent studies concentrated on host–pathogen interactions with various bacterial species serving as the pathogen50–53.

Fig. 6 Citation tree for the WGS sequencing of H. influenzae37, produced in the same manner as the tree in Fig. 5. W, WGS sequencing methodology; C, clone-by-clone sequencing methodology; R, review article.
Root: WGS, H. influenzae - Fleischmann et al. 1995 (ref. 37) (2294)
genome sequencing: Methanococcus jannaschii - Bult et al. 1996 (ref. 94) (1468W); Streptomyces coelicolor Chr A3 - Redenbach et al. 1996 (ref. 95) (229C); S. cerevisiae - Goffeau et al. 1996 (ref. 8) (662C); E. coli K12 - Blattner et al. 1997 (ref. 9) (1747C); Helicobacter pylori - Tomb et al. 1997 (ref. 96) (1172W); Bacillus subtilis - Kunst et al. 1997 (ref. 97) (1111); Archaeoglobus fulgidus - Klenk et al. 1997 (ref. 98) (713W); Borrelia burgdorferi - Fraser et al. 1997 (ref. 99) (550W); Rickettsia prowazekii - Andersson et al. 1998 (ref. 100) (385); Aquifex aeolicus - Deckert et al. 1998 (ref. 101) (453); Treponema pallidum - Fraser et al. 1998 (ref. 102) (304W); Plasmodium falciparum (Chr2) - Gardner et al. 1998 (ref. 103) (243); Drosophila melanogaster - Adams et al. 2000 (ref. 67) (1073); Neisseria meningitidis - Tettelin et al. 2000 (ref. 104) (194); Homo sapiens - Venter et al. 2001 (ref. 18) (1258W), Lander et al. 2001 (ref. 19) (1014C)
comparative genomics: Mycoplasma species - Himmelreich et al. 1996 (ref. 38) (472); Mycobacteria species - Philipp et al. 1996 (ref. 40) (177); Archaeal and eubacterial species - Nelson et al. 1999 (ref. 39) (347), Brown et al. 1997 (ref. 41) (166); D. melanogaster, C. elegans and yeast - Rubin et al. 2000 (ref. 42) (383); H. influenzae and E. coli - Tatusov et al. 1997 (ref. 45) (346)
gene discovery/functional assignment: proteasome - Coux et al. 1996 (ref. 105) (860R); DNA topoisomerases - Wang 1996 (ref. 106) (786R); multidrug efflux systems - Paulsen et al. 1996 (ref. 107) (300R), Nikaido et al. 1996 (ref. 108) (226R); DNA glycosylases - Krokan et al. 1997 (ref. 109) (230); establishing gene function - Miklos et al. 1996 (ref. 110) (219R); metal resistance in Bacteria - Silver et al. 1996 (ref. 111) (166R)
host pathogen interactions: cystic fibrosis - Govan et al. 1996 (ref. 50) (269); Rhizobium/Legumes - Freiberg et al. 1997 (ref. 51) (259), Zumft 1997 (ref. 52) (237R); nitrogen regulation - Merrick et al. 1995 (159R)
minimal genome: Fraser et al. 1995 (ref. 85) (1088); Mushegian et al. 1996 (ref. 86) (161)
proteomics: Yates 1998 (ref. 49) (212); Blackstock 1999 (ref. 48) (183)
tRNA detection: computational - Lowe et al. 1997 (ref. 112) (182)
yeast reverse genetics: Dujon 1996 (ref. 46) (223)
bacterial identification: DNA typing - Gurtler et al. 1996 (ref. 113) (193)
protein classification: COGs - Tatusov et al. 1997 (ref. 45) (344)
DNA microarray: yeast ORFs - Lashkari et al. 1997 (ref. 47) (193)

Despite the advances leading to the H. influenzae publication and the ever-increasing numbers of completed microbial genomes, by 1998 there was some concern in the scientific community that the one large-scale, publicly funded sequencing project, the HGP, had reached a saturation point. Total finished bases amounted to just 3% of the genome, and scientists at major sequencing centers admitted concern and uncertainty about how they would achieve their goals54. Major weaknesses in the sequencing technology of the time, including lane tracking problems, inadequate sequencer throughput and high sequencing error rates, seemed daunting. Though the H. influenzae paper had concluded with the prediction that WGS sequencing could be useful in the human genome sequencing effort, the feasibility of this idea remained hotly debated55,56, even under the assumption that sequencing technology would ultimately improve.

Technology changes that advanced genomics
In addition to some incremental changes in technology, including increased automation and miniaturization57,58, improved computational resources and various browsers and search tools for scientists to better access the genomes59, there are a few developments that warrant special acclaim for having enabled the sequencing of large eukaryotic genomes, whether by the clone-by-clone or the WGS approach.
Capillary DNA sequencers. With greater capacity of lanes per run and runs per day and smaller sample volumes, capillary DNA sequencers (most notably the ABI 3700 sequencing machines) vastly increased the throughput that could be achieved for a fixed cost. These machines eliminated the need to pour gels by hand, thus reducing the amount of labor; permitted the generation of 400 kb of sequence in a day on a single machine with as little as 15 minutes for maintenance and sample loading; permitted unprecedented read lengths and improved sequencing accuracy; and, through a vast improvement in lane tracking, enabled approaches to sequence assembly that otherwise would have been unthinkable.
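The scale of this throughput gain can be checked with simple arithmetic. The sketch below uses only two figures quoted in this review: 16 templates per day at roughly 300 bp each for the first automated sequencers, and 400 kb per day for a single capillary machine. The 3-billion-bp genome size and the single-pass (1x) coverage are round-number assumptions chosen for illustration, and `machine_days` is a hypothetical helper name.

```python
EARLY_BP_PER_DAY = 16 * 300       # first automated sequencers: 4,800 bp/day
CAPILLARY_BP_PER_DAY = 400_000    # one capillary machine: 400 kb/day


def machine_days(genome_bp, coverage, bp_per_day):
    """Days one machine needs to read genome_bp at the given fold coverage."""
    return genome_bp * coverage / bp_per_day


GENOME_BP = 3_000_000_000  # assumed round figure for a human-sized genome

speedup = CAPILLARY_BP_PER_DAY / EARLY_BP_PER_DAY
early_days = machine_days(GENOME_BP, 1, EARLY_BP_PER_DAY)
capillary_days = machine_days(GENOME_BP, 1, CAPILLARY_BP_PER_DAY)

print(f"per-machine speedup: {speedup:.1f}x")
print(f"1x coverage, early machine: {early_days:,.0f} machine-days")
print(f"1x coverage, capillary machine: {capillary_days:,.0f} machine-days")
```

Under these assumptions a single pass over the genome would take about 625,000 machine-days on an early sequencer against 7,500 on a capillary machine. Either figure is far beyond one instrument, which is consistent with the review's emphasis on running many machines in parallel.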
New clones for genomics. Although heroic efforts resulted in maps dense enough for cosmid tiling in some regions of the human genome60, construction of maps suitable for a directed sequencing effort proved elusive. Indeed, Saccharomyces cerevisiae may be the only organism whose genome was sequenced by first producing a map and then sequencing. The first genome project, for E. coli, was initially based on sequencing lambda clones that had been mapped in a three-year effort. Slow progress essentially forced an abandonment of the effort to sequence sequential lambda clones9. Yeast artificial chromosomes (YACs), which for some time were considered the successor to the cosmid, today have been almost completely abandoned. Because significant efforts were diverted toward the use of YACs, they may have delayed progress in genomic sequencing for several years. It was the use of bacterial artificial chromosomes (BACs) that enabled progress both in mapping and sequencing in a clone-by-clone fashion. The decline of the YAC and rise of the BAC is shown in Figure 4. The acceptance of BACs into the genome community led to a new round of map construction, assigning BACs to locations on existing maps. Additionally, the introduction of BAC-end sequencing61 made possible a clone-by-clone approach that did not require sequencing to wait for map construction yet maintained the efficiency of a directed approach. The draft version of the HGP human genome was not obtained by carefully choosing which clone to work on next, although sequencing was conducted mostly on a clone-by-clone basis. This is arguably because the capacity of the sequencers was so great relative to the rate of mapping that it was not practical to leave the sequencers idle while the next set of clones was determined and samples prepared.
Base-calling with quality values. In the late 1990s significant progress had been made in high-throughput, automated DNA sequencing, as noted above, with the introduction of capillary-based sequencing machines. Though these machines certainly improved the ability to process and assess base pairs through software that was packaged with the machine, it was clear that additional software programs would be useful in determining accuracy of the bases of DNA. Most notable of these efforts were the Phred base-calling program62,63 and TraceTuner. These two chromatogram base-calling programs included an estimated quality value for each individual base in the raw sequence. These quality values allowed users to ‘quality trim’ individual DNA sequence reads64 and to determine how well fragments overlap and select the most likely consensus sequence during the process of assembly with software tools such as phrap and CAP4.
Advances in computing. The SPARCcenter 2000 that was used to assemble the genome of H. influenzae in 1995 was the last model of a generation of computers that was limited by an architecture capable of addressing only 2 gigabytes of random access memory (RAM). Revolutionary biology would demand revolutionary computing: 64-bit hardware and operating systems that could use 32 GB of RAM and beyond. In 1992 DIGITAL Equipment Corporation introduced the first 64-bit processor, the Alpha, together with a 64-bit operating system. By 1998, the Alpha microprocessor was already nearly 100 times more powerful than the processor used in the 1995 SPARCcenter. Processors offered today by Intel, Hewlett Packard and IBM are almost 200 times more powerful. But processors and operating systems are only part of the story; the cost of storage has dropped dramatically. In 1992 one terabyte of disk-space cost $1,000,000; in 2002 the cost has dwindled to near $10,000. It is possible that computational biology would have stalled in the early 1990s had disk storage not dropped dramatically in
40
35
12
30
10
25
8
billion bp
number of genomes
© 2003 Nature Publishing Group http://www.nature.com/naturegenetics
review
20
6
15
4
10
2
0
5
WGS
other
0 1995
1996
1997
1998
1999
2000
2001
2002
year Fig. 7 The number of genomes sequenced each year since 1995. a, The number of genomes sequenced using WGS is shown as blue triangles, and the number of genomes sequenced with other strategies is shown as yellow squares. b, The stacked columns indicate the cumulative sizes (in billions of base pairs) of genomes sequenced with WGS and other methods. H. sapiens (Celera Genomics) shown in white, Mus musculus (Celera Genomics) in green, M. musculus (Mouse Genome Sequencing Consortium) in red, Oryza sativa L. ssp. indica in blue, O. sativa L. ssp. japonica in orange, Fugu rubripes in black, Anopheles gambiae in yellow, D. melanogaster in brown, all other genomes sequenced with WGS in purple, H. sapiens (International Human Genome Sequencing Consortium) in gray and all other genomes sequenced with methods other than WGS (including those of A. thaliana, C. elegans, Plasmodium falciparum and Schizosaccharomyces pombe) in pink.
224
nature genetics supplement • volume 33 • march 2003
© 2003 Nature Publishing Group http://www.nature.com/naturegenetics
cost while significantly increasing in performance. In 1992, 10 megabits (Mb) per second was the state of the art in networks. Today, even some laptops use gigabit network interfaces that are a hundred times faster. The very nature of large-scale computing has changed from systems relying on one or a few powerful custom-designed processors to scalable parallel systems or farms of computers. Complex problems are now broken down into a set of smaller jobs that are run concurrently, an approach that clearly enabled genome assembly.
Genome assembly. In 1989, the GEL sequencing program was limited to 1,000 reads and required extensive human curation of the results3,4. By 1995, Sutton’s TIGR Assembler had proved its ability to assemble microbial genomes. The program ran for some 30 hours on a SPARCcenter 2000, using slightly under 64 MB of RAM (G.G. Sutton, pers. comm.). Total coding time, including code originally written for assembling ESTs, was on the order of one programmer-year. The bulk of the compute time was spent doing fragment overlaps, a process that is naively quadratic. The roughly 27 million fragments making up the human genome assembly were nearly 1,000 times as many as the 28,000 fragments that made up the H. influenzae assembly and required roughly 1,000,000 times as many fragment pairs to be compared. Moreover, repetitive elements longer than a single read often required manual assembly and finishing. Clearly, improvements in computer hardware alone would not be sufficient to allow WGS assembly of large genomes. In the same time frame, Eugene Myers proposed a novel formulation of the fragment assembly problem65 that involved breaking the problem down into the determination of unique regions (unitigs) followed by resolution of repetitive elements by use of mate pairs. This idea would later become the foundation for the WGS assembler developed at Celera Genomics66. Weber and Myers55 suggested that sequencing all fragments as paired ends would significantly enhance the capabilities of such an assembler. In 1999, Myers and Sutton, now leading the assembly effort at Celera, provided a proof-of-concept for the new assembly algorithm with their team’s assembly of the D. melanogaster genome67. Ultimately, WGS assemblies of the human, mouse43 and mosquito68 genomes were obtained with an enhanced version of this prototype. The human genome was assembled in approximately 20,000 CPU hours on Compaq Alpha chips running at 500–667 MHz and using a peak of slightly under 32 GB of RAM and approximately 0.5 terabytes of storage18. In the last two years, several alternate assemblers for large-scale sequencing projects have been described. In the WGS assembly context, these include ARACHNE69, JAZZ70 and Phusion71, each following the basic model of the Celera assembler66. Clone-by-clone projects have assembled reads from single clones into contigs, using programs such as phrap, followed by whole-genome scale ordering and orienting of contigs into scaffolds based on bridging information from a variety of data, as shown with the GigAssembler72. Other assembly programs have introduced some important innovations such as ‘error correction’, which we believe was first described in connection with the CAP assembler73, and the introduction of the Eulerian formulation of the assembly problem74, though these assemblers remain to be tested in the context of a full-scale eukaryotic sequencing project. Advances in assembly methods will undoubtedly continue as long as WGS sequencing remains the method of choice for genome sequencing (Fig. 7) or until sequence read length eliminates the need for assembly algorithms.
Fig. 8 Citation tree for the WGS sequencing of E. coli9. All publications that cited the original paper and have themselves been cited more than 100 times since their publication are shown.
genome sequencing, E. coli K12 (root of the tree)
  Blattner et al. 1997 (ref. 9) (1747)
genome sequencing
  Mycobacterium tuberculosis - Cole et al. 1998 (ref. 114) (1223)
  Bacillus subtilis - Kunst et al. 1997 (ref. 97) (1114)
  H. sapiens (Chr22) - Dunham et al. 1999 (ref. 115) (491)
  Aquifex aeolicus - Deckert et al. 1998 (ref. 101) (455)
  Rickettsia prowazekii - Andersson et al. 1998 (ref. 100) (387)
  Pseudomonas aeruginosa PAO1 - Stover et al. 2000 (ref. 116) (336)
  E. coli O157 - Perna et al. 2001 (ref. 117) (183)
  Staphylococcus aureus - Kuroda et al. 2001 (ref. 118) (114)
  Xylella fastidiosa - Simpson et al. 2000 (ref. 119) (113)
  E. coli E2348/69 (LEE) - Elliott et al. 1998 (ref. 120) (147)
comparative genomics
  Archaeal and Eubacterial species - Nelson et al. 1999 (ref. 39) (347)
  H. influenzae and E. coli - Tatusov et al. 1997 (ref. 45) (346)
  gene transfer - Jain et al. 1999 (ref. 121) (122)
  membrane proteins - Wallin et al. 1998 (ref. 122) (116)
  conservation of gene order - Dandekar et al. 1998 (ref. 123) (104)
gene discovery/functional assignment
  recombinases - Nunes-Duby et al. 1998 (ref. 124) (118)
  PCR-based gene disruption - Datsenko et al. 2000 (ref. 125) (116)
  quorum sensing gene regulation - Surette et al. 1999 (ref. 126) (112)
  proteasome - Bochtler et al. 1999 (ref. 127) (111)
  secretory pathway - Sargent et al. 1998 (ref. 128) (110); Weiner et al. 1998 (ref. 129) (131)
proteomics
  mass spectrometry - Yates 1998 (ref. 49) (212)
  X-ray crystallography - Hung et al. 1998 (ref. 130) (201); Locher et al. 1998 (ref. 131) (145)
  protein-protein 2-hybrid screen (C. elegans) - Walhout et al. 2000 (ref. 132) (119)
  yeast 2-hybrid screen - Ito et al. 2001 (ref. 133) (153)
computational tools/methods
  protein interactions - Marcotte et al. 1999 (ref. 134) (186)
  sequence alignment - Schwartz et al. 2000 (ref. 135) (129)
  gene finding - Lukashin et al. 1998 (ref. 136) (127)
  whole genome assembly - Myers et al. 2000 (ref. 66) (101)
physiology
  DNA replication - Seigneur et al. 1998 (ref. 79) (163)
  biosynthetic pathways - Takahashi et al. 1998 (ref. 80) (143)
  pathogenicity - Razin et al. 1998 (ref. 81) (142); Perna et al. 1998 (ref. 82) (126)
  eukaryotic metabolism - Martin et al. 1998 (ref. 83) (202)
minimal genome
  Hutchison et al. 1999 (ref. 84) (118)
genome evolution
  gene transfer - Lawrence et al. 1998 (ref. 88) (189)
  mutation rates - Drake et al. 1998 (ref. 87) (170)
transcription and translation
  translation initiation - Kozak 1999 (ref. 137) (235)
  DNA microarray - Richmond et al. 1999 (ref. 138) (116); Tao et al. 1999 (ref. 139) (106)
medical consequences
  Collins 1999 (ref. 89) (153R)
Conclusions
Clearly, the scientific community is benefiting greatly from numerous published genomes across a broad spectrum of organisms (Fig. 1) that could only have been completed with the aforementioned advances. Milestone genomes include the first fully sequenced living organism, H. influenzae; the first archaeon, Methanococcus jannaschii37; the first eukaryote, S. cerevisiae8; the first multi-cellular organism, C. elegans75; the first large eukaryote, D. melanogaster67; Homo sapiens18,19; and the first plant, A. thaliana76. This wealth of information was hard to imagine just a few years ago. For perspective, in 1994 the largest sequenced genome was a virus (human cytomegalovirus, at roughly 230 kb; GenBank NC_001347;
ref. 77), and that high-water mark just barely exceeded what had been reached nearly a decade earlier (Epstein–Barr virus, at roughly 170 kb; GenBank NC_001345; ref. 78). Each completed genome has had a catalytic effect on its field (for example, Fig. 6). Citation evidence for E. coli (Fig. 8) suggests that studies undertaken since the completion of this organism's genome sequence have provided a better understanding of the mechanisms regulating cellular physiology, including DNA replication, biosynthetic pathways and metabolism79–83. The sequencing of this microorganism's genome has had its most positive impact on the study of the ‘minimal genome’, the smallest set of genes that can support the necessary biological mechanisms of self-replication and fructification84–86. The E. coli citation tree also indicates that as the sequences of more genomes accumulate, molecular evolutionary methods will afford more accurate measures of mutation rates between populations87,88, and that ultimately the medical consequences of genomics will be realized in the fields of basic and applied biomedical research89. In the next decade of genomics we expect new technologies to accelerate the field by orders of magnitude and to have an unimaginable, but clearly catalytic, impact on biology and medicine.
URLs. Information on TraceTuner is available at http://www.paracel.com/publications/tracetuner1_092100.pdf and on CAP4 at http://www.paracel.com/publications/cap4_092200.pdf. Documentation for phrap and cross_match is available at http://www.phrap.org/phrap.docs/phrap.html. Information on the growth of GenBank is available at http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.
Acknowledgments
The authors thank D. Borton, H. Kowalski and M. Peterson for their help with this manuscript.
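Note on the overlap scaling argument. The Genome assembly section states that fragment overlap detection is naively quadratic, and that moving from the 28,000 fragments of H. influenzae to roughly 27 million human fragments multiplies the number of fragment pairs by about 1,000,000. The following is a minimal sketch in Python that checks this arithmetic and makes the all-against-all loop concrete. It is an illustration only, not the method used by TIGR Assembler or the Celera assembler; the function names, the toy reads and the minimum overlap length are hypothetical, and only the two fragment counts are taken from the text.

```python
def naive_pair_count(n_fragments: int) -> int:
    """Ordered fragment pairs examined by an all-against-all overlap step."""
    return n_fragments * (n_fragments - 1)


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    min_len must be >= 1. Real assemblers tolerate sequencing errors;
    this toy version requires exact matches.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_pairs_overlaps(reads, min_len: int = 3):
    """Compare every ordered pair of reads: the naively quadratic step."""
    overlaps = {}
    comparisons = 0
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            comparisons += 1
            k = suffix_prefix_overlap(a, b, min_len)
            if k:
                overlaps[(i, j)] = k
    return overlaps, comparisons


# Fragment counts quoted in the text.
H_INFLUENZAE_FRAGMENTS = 28_000
HUMAN_FRAGMENTS = 27_000_000

fragment_ratio = HUMAN_FRAGMENTS / H_INFLUENZAE_FRAGMENTS
pair_ratio = naive_pair_count(HUMAN_FRAGMENTS) / naive_pair_count(H_INFLUENZAE_FRAGMENTS)

# Invented toy reads, used only to exercise the quadratic loop.
toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]
toy_overlaps, toy_comparisons = all_pairs_overlaps(toy_reads)

if __name__ == "__main__":
    print(f"fragment ratio: about {fragment_ratio:,.0f}x")
    print(f"pair ratio: about {pair_ratio:,.0f}x")
    print(f"toy comparisons: {toy_comparisons}, overlaps found: {toy_overlaps}")
```

Under these assumptions the fragment ratio comes out a little below 1,000 and the pair ratio a little below 1,000,000, consistent with the rounded figures in the text. The pair count grows with the square of the number of fragments, which is why faster hardware alone could not close the gap.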
1. Kuska, B. Beer, Bethesda, and biology: how “genomics” came into being. J. Natl. Cancer Inst. 90, 93 (1998).
2. Sanger, F., Nicklen, S. & Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467 (1977).
3. Martin-Gallardo, A. et al. Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nat. Genet. 1, 34–39 (1992).
4. McCombie, W.R. et al. Expressed genes, Alu repeats and polymorphisms in cosmids sequenced from chromosome 4p16.3. Nat. Genet. 1, 348–353 (1992).
5. Wada, A. The practicability of and necessity for developing a large-scale DNA-base sequencing system: toward the establishment of international super DNA-sequencing centers. Basic Life Sci. 46, 119–130 (1988).
6. Wada, A. Fundamental significance of DNA mass-sequencing factory for biological sciences in future. Adv. Biophys. 30, 85–103 (1994).
7. Understanding Our Genetic Inheritance: The Human Genome Project, The First Five Years, FY 1991–1995. NIH Report 90–1590 (1990).
8. Goffeau, A. et al. Life with 6000 genes. Science 274, 563–567 (1996).
9. Blattner, F.R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474 (1997).
10. Adams, M.D. et al. Complementary DNA sequencing: expressed-sequence tags and human genome project. Science 252, 1651–1656 (1991).
11. Roberts, L. Gambling on a shortcut to genome sequencing. Science 252, 1618–1619 (1991).
12. Uberbacher, E.C. & Mural, R.J. Locating protein-coding regions in human DNA sequences by a multiple sensor–neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261–11265 (1991).
13. Olson, M., Hood, L., Cantor, C. & Botstein, D. A common language for physical mapping of the human genome. Science 245, 1434–1435 (1989).
14. Putney, S.D., Herlihy, W.C. & Schimmel, P. A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature 302, 718–721 (1983).
15. Papadopoulos, N. et al. Mutation of a mutL homolog in hereditary colon cancer. Science 263, 1625–1629 (1994).
16. Nicolaides, N.C. et al. Mutations of 2 Pms homologs in hereditary nonpolyposis colon cancer. Nature 371, 75–80 (1994).
17. Somerville, C. & Somerville, S. Plant functional genomics. Science 285, 380–383 (1999).
18. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
19. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
20. Velculescu, V.E., Zhang, L., Vogelstein, B. & Kinzler, K.W. Serial analysis of gene expression. Science 270, 484–487 (1995).
21. Lockhart, D.J. et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680 (1996).
22. Wodicka, L., Dong, H.L., Mittmann, M., Ho, M.H. & Lockhart, D.J. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat. Biotechnol. 15, 1359–1367 (1997).
23. Duggan, D.J., Bittner, M., Chen, Y.D., Meltzer, P. & Trent, J.M. Expression profiling using cDNA microarrays. Nat. Genet. 21, 10–14 (1999).
24. Waterston, R. et al. A survey of expressed genes in Caenorhabditis elegans. Nat. Genet. 1, 114–123 (1992).
25. Okubo, K. et al. Large-scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nat. Genet. 2, 173–179 (1992).
26. Adams, M.D., Kerlavage, A.R., Fields, C. & Venter, J.C. 3,400 new expressed-sequence tags identify diversity of transcripts in human brain. Nat. Genet. 4, 256–267 (1993).
27. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed-sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 4, 373–386 (1993).
28. Newman, T. et al. Genes galore—a summary of methods for accessing results from large-scale partial sequencing of anonymous Arabidopsis cDNA clones. Plant Physiol. 106, 1241–1255 (1994).
29. Adams, M.D. et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377, 3–174 (1995).
30. Hillier, L. et al. Generation and analysis of 280,000 human expressed-sequence tags. Genome Res. 6, 807–828 (1996).
31. Schuler, G.D. et al. A gene map of the human genome. Science 274, 540–546 (1996).
32. Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. Issues in searching molecular sequence databases. Nat. Genet. 6, 119–129 (1994).
33. Fernandes-Alnemri, T., Litwack, G. & Alnemri, E.S. Cpp32, a novel human apoptotic protein with homology to Caenorhabditis elegans cell-death protein Ced-3 and mammalian interleukin-1β-converting enzyme. J. Biol. Chem. 269, 30761–30764 (1994).
34. Simonet, W.S. et al. Osteoprotegerin: a novel secreted protein involved in the regulation of bone density. Cell 89, 309–319 (1997).
35. Messersmith, E.K. et al. Semaphorin III can function as a selective chemorepellent to pattern sensory projections in the spinal cord. Neuron 14, 949–959 (1995).
36. Sutton, G.G., White, O., Adams, M.D. & Kerlavage, A.R. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. 1, 9–19 (1995).
37. Fleischmann, R.D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995).
38. Himmelreich, R. et al. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24, 4420–4449 (1996).
39. Nelson, K.E. et al. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329 (1999).
40. Philipp, W.J. et al. An integrated map of the genome of the tubercle bacillus, Mycobacterium tuberculosis H37Rv, and comparison with Mycobacterium leprae. Proc. Natl. Acad. Sci. USA 93, 3132–3137 (1996).
41. Brown, J.R. & Doolittle, W.F. Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61, 456–502 (1997).
42. Rubin, G.M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
43. Mural, R.J. et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296, 1661–1671 (2002).
44. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
45. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. A genomic perspective on protein families. Science 278, 631–637 (1997).
46. Dujon, B. The yeast genome project: what did we learn? Trends Genet. 12, 263–270 (1996).
47. Lashkari, D.A. et al. Yeast microarrays for genome-wide parallel genetic and gene-expression analysis. Proc. Natl. Acad. Sci. USA 94, 13057–13062 (1997).
48. Blackstock, W.P. & Weir, M.P. Proteomics: quantitative and physical mapping of cellular proteins. Trends Biotechnol. 17, 121–127 (1999).
49. Yates, J.R. Mass spectrometry and the age of the proteome. J. Mass Spectrom. 33, 1–19 (1998).
50. Govan, J.R.W. & Deretic, V. Microbial pathogenesis in cystic fibrosis: mucoid Pseudomonas aeruginosa and Burkholderia cepacia. Microbiol. Rev. 60, 539–574 (1996).
51. Freiberg, C. et al. Molecular basis of symbiosis between Rhizobium and legumes. Nature 387, 394–401 (1997).
52. Zumft, W.G. Cell biology and molecular basis of denitrification. Microbiol. Mol. Biol. Rev. 61, 533–616 (1997).
53. Merrick, M.J. & Edwards, R.A. Nitrogen control in bacteria. Microbiol. Rev. 59, 604–622 (1995).
54. Pennisi, E. DNA sequencers’ trial by fire. Science 280, 814–817 (1998).
55. Weber, J.L. & Myers, E.W. Human whole-genome shotgun sequencing. Genome Res. 7, 401–409 (1997).
56. Green, P. Against a whole-genome shotgun. Genome Res. 7, 410–417 (1997).
57. Meldrum, D. Automation for genomics, part one: preparation for sequencing. Genome Res. 10, 1081–1092 (2000).
58. Meldrum, D. Automation for genomics, part two: sequencers, microarrays, and future trends. Genome Res. 10, 1288–1303 (2000).
59. Stein, L.D. et al. The generic genome browser: a building block for a model organism system database. Genome Res. 12, 1599–1610 (2002).
60. Doggett, N.A. et al. An integrated physical map of human chromosome 16. Nature 377, 335–365 (1995).
61. Venter, J.C., Smith, H.O. & Hood, L. A new strategy for genome sequencing. Nature 381, 364–366 (1996).
62. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
63. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
64. Chou, H.H. & Holmes, M.H. DNA sequence quality trimming and vector removal. Bioinformatics 17, 1093–1104 (2001).
65. Myers, E.W. Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 2, 275–290 (1995).
66. Myers, E.W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000).
67. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
68. Holt, R.A. et al. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298, 129–149 (2002).
69. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177–189 (2002).
70. Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301–1310 (2002).
71. Mullikin, J.C. & Ning, Z. The phusion assembler. Genome Res. 13, 81–90 (2003).
72. Kent, W.J. & Haussler, D. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 11, 1541–1548 (2001).
73. Huang, X. & Madan, A. CAP3: a DNA sequence assembly program. Genome Res. 9, 868–877 (1999).
74. Pevzner, P.A., Tang, H. & Waterman, M.S. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98, 9748–9753 (2001).
75. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
76. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
77. Bankier, A.T. et al. The DNA sequence of the human cytomegalovirus genome. DNA Seq. 2, 1–12 (1991).
78. Baer, R. et al. DNA sequence and expression of the B95-8 Epstein–Barr virus genome. Nature 310, 207–211 (1984).
79. Seigneur, M., Bidnenko, V., Ehrlich, S.D. & Michel, B. RuvAB acts at arrested replication forks. Cell 95, 419–430 (1998).
80. Takahashi, S., Kuzuyama, T., Watanabe, H. & Seto, H. A 1-deoxy-D-xylulose 5-phosphate reductoisomerase catalyzing the formation of 2-C-methyl-D-erythritol 4-phosphate in an alternative non-mevalonate pathway for terpenoid biosynthesis. Proc. Natl. Acad. Sci. USA 95, 9879–9884 (1998).
81. Razin, S., Yogev, D. & Naot, Y. Molecular biology and pathogenicity of mycoplasmas. Microbiol. Mol. Biol. Rev. 62, 1094–1156 (1998).
82. Perna, N.T. et al. Molecular evolution of a pathogenicity island from enterohemorrhagic Escherichia coli O157:H7. Infect. Immun. 66, 3810–3817 (1998).
83. Martin, W. & Muller, M. The hydrogen hypothesis for the first eukaryote. Nature 392, 37–41 (1998).
84. Hutchison, C.A. et al. Global transposon mutagenesis and a minimal mycoplasma genome. Science 286, 2165–2169 (1999).
85. Fraser, C.M. et al. The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403 (1995).
86. Mushegian, A.R. & Koonin, E.V. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. USA 93, 10268–10273 (1996).
87. Drake, J.W., Charlesworth, B., Charlesworth, D. & Crow, J.F. Rates of spontaneous mutation. Genetics 148, 1667–1686 (1998).
88. Lawrence, J.G. & Ochman, H. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95, 9413–9417 (1998).
89. Collins, F.S. Shattuck lecture—medical and societal consequences of the human genome project. N. Engl. J. Med. 341, 28–37 (1999).
90. Shizuya, H. et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci. USA 89, 8794–8797 (1992).
91. Burke, D.T., Carle, G.F. & Olson, M.V. Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806–812 (1987).
92. Boguski, M.S. & Schuler, G.D. Establishing a human transcript map. Nat. Genet. 10, 369–371 (1995).
93. Schena, M. et al. Parallel human genome analysis: microarray-based expression monitoring of 1,000 genes. Proc. Natl. Acad. Sci. USA 93, 10614–10619 (1996).
94. Bult, C.J. et al. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273, 1058–1073 (1996).
95. Redenbach, M. et al. A set of ordered cosmids and a detailed genetic and physical map for the 8 Mb Streptomyces coelicolor A3(2) chromosome. Mol. Microbiol. 21, 77–96 (1996).
96. Tomb, J.F. et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539–547 (1997).
97. Kunst, F. et al. The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390, 249–256 (1997).
98. Klenk, H.P. et al. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370 (1997).
99. Fraser, C.M. et al. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580–586 (1997).
100. Andersson, S.G.E. et al. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133–140 (1998).
101. Deckert, G. et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358 (1998).
102. Fraser, C.M. et al. Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281, 375–388 (1998).
103. Gardner, M.J. et al. Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science 282, 1126–1132 (1998).
104. Tettelin, H. et al. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287, 1809–1815 (2000).
105. Coux, O., Tanaka, K. & Goldberg, A.L. Structure and functions of the 20S and 26S proteasomes. Annu. Rev. Biochem. 65, 801–847 (1996).
106. Wang, J.C. DNA topoisomerases. Annu. Rev. Biochem. 65, 635–692 (1996).
107. Paulsen, I.T., Brown, M.H. & Skurray, R.A. Proton-dependent multidrug efflux systems. Microbiol. Rev. 60, 575–608 (1996).
108. Nikaido, H. Multidrug efflux pumps of gram-negative bacteria. J. Bact. 178, 5853–5859 (1996).
109. Krokan, H.E., Standal, R. & Slupphaug, G. DNA glycosylases in the base excision repair of DNA. Biochem. J. 325, 1–16 (1997).
110. Miklos, G.L.G. & Rubin, G.M. The role of the genome project in determining gene function: insights from model organisms. Cell 86, 521–529 (1996).
111. Silver, S. & Phung, L.T. Bacterial heavy metal resistance: new surprises. Annu. Rev. Microbiol. 50, 753–789 (1996).
112. Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
113. Gurtler, V. & Stanisich, V.A. New approaches to typing and identification of bacteria using the 16S-23S rDNA spacer region. Microbiology 142, 3–16 (1996).
114. Cole, S.T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998).
115. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
116. Stover, C.K. et al. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406, 959–964 (2000).
117. Perna, N.T. et al. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409, 529–533 (2001).
118. Kuroda, M. et al. Whole-genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet 357, 1225–1240 (2001).
119. Simpson, A.J.G. et al. The genome sequence of the plant pathogen Xylella fastidiosa. Nature 406, 151–157 (2000).
120. Elliott, S.J. et al. The complete sequence of the locus of enterocyte effacement (LEE) from enteropathogenic Escherichia coli E2348/69. Mol. Microbiol. 28, 1–4 (1998).
121. Jain, R., Rivera, M.C. & Lake, J.A. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96, 3801–3806 (1999).
122. Wallin, E. & von Heijne, G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 7, 1029–1038 (1998).
123. Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–328 (1998).
124. Nunes-Duby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T. & Landy, A. Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic Acids Res. 26, 391–406 (1998).
125. Datsenko, K.A. & Wanner, B.L. One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA 97, 6640–6645 (2000).
126. Surette, M.G., Miller, M.B. & Bassler, B.L. Quorum sensing in Escherichia coli, Salmonella typhimurium, and Vibrio harveyi: a new family of genes responsible for autoinducer production. Proc. Natl. Acad. Sci. USA 96, 1639–1644 (1999).
127. Bochtler, M., Ditzel, L., Groll, M., Hartmann, C. & Huber, R. The proteasome. Annu. Rev. Biophys. Biomol. Struct. 28, 295–317 (1999).
128. Sargent, F. et al. Overlapping functions of components of a bacterial Sec-independent protein export pathway. EMBO J. 17, 3640–3650 (1998).
129. Weiner, J.H. et al. A novel and ubiquitous system for membrane targeting and secretion of cofactor-containing proteins. Cell 93, 93–101 (1998).
130. Hung, L.W. et al. Crystal structure of the ATP-binding subunit of an ABC transporter. Nature 396, 703–707 (1998).
131. Locher, K.P. et al. Transmembrane signaling across the ligand-gated FhuA receptor: crystal structures of free and ferrichrome-bound states reveal allosteric changes. Cell 95, 771–778 (1998).
132. Walhout, A.J.M. et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116–122 (2000).
133. Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98, 4569–4574 (2001).
134. Marcotte, E.M. et al. Detecting protein function and protein–protein interactions from genome sequences. Science 285, 751–753 (1999).
135. Schwartz, S. et al. PipMaker—a Web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586 (2000).
136. Lukashin, A.V. & Borodovsky, M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115 (1998).
137. Kozak, M. Initiation of translation in prokaryotes and eukaryotes. Gene 234, 187–208 (1999).
138. Richmond, C.S., Glasner, J.D., Mau, R., Jin, H.F. & Blattner, F.R. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 27, 3821–3835 (1999).
139. Tao, H., Bausch, C., Richmond, C., Blattner, F.R. & Conway, T. Functional genomics: expression analysis of Escherichia coli growing on minimal and rich media. J. Bact. 181, 6425–6440 (1999).