Plant Physiology, April 1999, Vol. 119, pp. 1151–1155, www.plantphysiol.org © 1999 American Society of Plant Physiologists
Update on Genomics in the USA
Genes, Genomes, Genomics. What Can Plant Biologists Expect from the 1998 National Science Foundation Plant Genome Research Program?1
Virginia Walbot*
Department of Biological Sciences, Stanford University, Stanford, California 94305–5020
cDNAs provide the quickest route to gene definition. However, with the exceptions of Arabidopsis and rice, there are few plant cDNA sequences available in the public domain. To rectify this, many projects will sequence cDNAs to generate EST databases (Table I). The time frames for these projects range from 1 year (major organs of maize) to 3 years for the majority of projects. In most cases, non-normalized libraries will be sequenced, resulting in a
biased representation of highly and moderately expressed genes. With a goal of 30,000 unique ESTs, the largest database proposed is for soybean (soybean growers contributed the funds to initiate a public EST project for this crop). A critical component of all EST projects is quick and accurate data delivery to GenBank and project-maintained websites. Bioinformatics services within each project will report details of cDNA library preparation and sampling strategies and then derive consensus sequences and assemble overlaps (for ESTs sequenced multiple times). For many projects, preliminary homology searching in public databases will be part of the EST annotation; this will automatically link new information to the Arabidopsis and rice ESTs and genomic sequencing data for those ESTs with recognizable matches. For some projects, ESTs will also be mapped to chromosomes or to BACs, and this information also will be available as part of the annotation. These projects will also produce and distribute new tools such as hybridization arrays for global gene-expression studies. Widespread adoption of these new EST hybridization arrays should allow ready comparison between experiments and between laboratories using the same or closely related organisms. The technology for arrays is changing rapidly: The current low-tech methods involve ESTs spotted onto nylon membranes, whereas the newer high-tech methods use microarrays on glass slides. The NSF has funded one project to explore a new technology using gold-bead-labeled hybridization probes to simplify signal detection (PI Nina Fedoroff). In all hybridization approaches, accurately arraying ESTs of known sequence is the most critical step. Rumors abound of completed gene-expression studies compromised by the realization that errors occurred during array construction.
The NSF-funded projects share a special responsibility to produce high-quality EST arrays and guidelines for data interpretation, because an entire community of plant biologists will become dependent on these new tools. At least for the near future, it is unlikely that there will be a commercial source of hybridization arrays for plants.
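The bias introduced by sequencing non-normalized libraries can be illustrated with a small calculation. The sketch below is only a toy model: the three abundance classes, their gene counts, and each class's share of the mRNA pool are hypothetical values chosen for illustration and are not taken from any of the funded projects. It assumes that every EST is an independent random draw from the library, so a gene whose transcripts make up a fraction p of the library is seen at least once among n ESTs with probability 1 - (1 - p)^n.

```python
# Toy model of EST sampling from a non-normalized cDNA library.
# All class sizes and mRNA shares below are hypothetical illustration values.

# (class name, number of genes, fraction of total mRNA contributed by the class)
ABUNDANCE_CLASSES = [
    ("high", 100, 0.40),
    ("moderate", 2_000, 0.40),
    ("low", 20_000, 0.20),
]


def expected_genes_sampled(n_genes, class_share, n_ests):
    """Expected number of genes in a class hit by at least one of n_ests reads.

    Assumes every gene in the class is equally abundant and every EST is an
    independent random draw from the library.
    """
    p_gene = class_share / n_genes            # per-gene transcript fraction
    p_seen = 1.0 - (1.0 - p_gene) ** n_ests   # P(gene sampled at least once)
    return n_genes * p_seen


def coverage_report(n_ests):
    """Return {class name: fraction of the class expected to be represented}."""
    return {
        name: expected_genes_sampled(n_genes, share, n_ests) / n_genes
        for name, n_genes, share in ABUNDANCE_CLASSES
    }


if __name__ == "__main__":
    for n_ests in (5_000, 50_000):
        report = coverage_report(n_ests)
        summary = ", ".join(f"{name}: {frac:.0%}" for name, frac in report.items())
        print(f"{n_ests} ESTs -> {summary}")
```

Under these assumed numbers, a tenfold increase in sequencing depth still leaves a majority of the low-abundance class unsampled, which is the kind of bias toward highly and moderately expressed genes that the text describes.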
1 Research support in maize genomics is from National Science Foundation grant no. 98-72657.
* E-mail [email protected]; fax 1–650–725–8221.
Abbreviations: BAC, bacterial artificial chromosome; EST, expressed sequence tag; T-DNA, transferred DNA from Agrobacterium.
The recent 85-million-dollar investment by the National Science Foundation (NSF) in plant genomics projects promises an enormous reservoir of new information and many new tools for the rapid analysis of gene-expression patterns, for gene tagging, and for map-based projects. The individual goals of the 23 funded projects are summarized on-line at www.nsf.gov/bio/pubs/awards/genome98.htm. Soon, many of the projects will have their own websites, where project scopes will be more fully described (e.g. http://www.zmdb.iastate.edu/). A useful introduction to the goals, new methods, and vocabulary in genomics research is provided in a recent Update in Plant Physiology (Bouchez and Höfte, 1998). An excellent summary of the promise of genomics for plant biology and agriculture can be found in the 16 papers published in the Proceedings of the National Academy of Sciences (1998, volume 95) as the report of a colloquium entitled “Protecting Our Food Supply: The Value of Plant Genome Initiatives.” In this Update, I provide an introduction to the NSF-funded projects, based on information distributed by project coordinators during an October 1998 meeting at the NSF. The projects can be divided into four major categories: (a) cDNA sequencing and hybridization arrays for gene-expression studies; (b) knockout mutant collections and categorizing plant phenotypes; (c) genomic mapping with a goal of reconciling physical and genetic maps; and (d) comparative and evolutionary analysis of gene families or genomes.
GENE DISCOVERY VIA cDNA SEQUENCING AND EXPRESSION ANALYSIS
Table I. Scope of cDNA sequencing and analysis
Species | Size/Special Focus
Dicots
Tomato | 90,000/Fruit development
Arabidopsis | Genes up- or down-regulated by ozone and pathogens and new hybridization detection technology
Arabidopsis, rice, Dunaliella, and Mesembryanthemum | 10,000 per species/Response to osmotic treatments
Cotton | Fiber development
Medicago truncatula | Major organs plus roots challenged with rhizobia, mycorrhizae, and pathogens
Soybean | 30,000 unique/Major organs
Monocots
Maize | 50,000/Major organs
Maize | Embryo development in lines with novel oil or starch production
Sorghum | 50,000/Major organs
An Arabidopsis virtual center (PI Pamela Green) has the largest component of microarray development and service. The new funding in functional genomics complements the genome-sequencing effort in Arabidopsis by providing tools to analyze gene-expression patterns. The center will not only construct EST microarrays but will also perform hybridizations with user-supplied samples. In addition, the co-PIs will provide baseline data by assessing hybridization patterns to RNA from “core” tissues. The project will also identify plant-specific genes based on the available Arabidopsis ESTs and genomic sequence, and it will construct a microarray grid to monitor expression of these genes under a wide variety of conditions. Project data will be available on a website and should set the standard for interpretation of hybridization arrays for this plant. A policy issue for Plant Physiology and other journals to consider is whether gene-expression analysis of individual genes will suffice, and, if so, for how long? When will analysis of global expression patterns be required for publication? Even if the focus of a report is a single gene, should authors be expected to define gene-expression differences in isogenic nonmutant and null allele stocks or in transgenic overexpression and antisense or suppressed examples? The desire for more robust information must be tempered by the knowledge that the new tools will be limited in supply, at least initially, and could be more costly than northern-blot hybridization analysis, for example. Second, in what format will array hybridization data be published? A few examples will illustrate points in a paper, but the greater impact comes from clever analysis of patterns and comparisons between data sets. Papers that synthesize vast amounts of information into comprehensible patterns will be essential to guide the field. A public repository of hybridization data files may become as important as public release of DNA sequence information. 
Should authors be obligated to maintain accessible files of
PI/E-mail (Table I, in row order)
Tomato: J. Craig Venter/[email protected] and Steven D. Tanksley/[email protected]
Arabidopsis: Nina V. Fedoroff/[email protected]
Arabidopsis, rice, Dunaliella, and Mesembryanthemum: Hans Bohnert/[email protected]
Cotton: Thea A. Wilkins/[email protected]
Medicago truncatula: Douglas Cook/[email protected]
Soybean: Lila Vodkin/[email protected]
Maize: Virginia Walbot/[email protected]
Maize: Bertrand Lemieux/[email protected]
Sorghum: Andrew H. Paterson/[email protected]
their hybridization data, or should journals provide such information to their subscribers?

It is also important to consider that the planned hybridization tools will be built from cDNA collections that represent moderately and highly expressed genes very well but contain only a subset of genes expressed at lower levels or under unique conditions. Thus, the data are “global” but incomplete for each species. The problem is compounded if comparisons are made between arrays constructed from different EST pools. If two species are compared using arrays that are 75% complete (100% complete for high-expression genes and 50% complete for low-expression genes), there might be only a 25% overlap of the low-expression classes. Ultimately, all of the transcripts from a species could be defined and a comprehensive hybridization array devised. There is certainly room for new projects to produce arrays drawing on deeper collections of ESTs from specific developmental stages or physiological treatments to ensure better coverage of the rare transcript class.

MUTATIONAL APPROACH TO UNDERSTANDING GENE FUNCTION

The cDNA approach to gene discovery suffers from the limitation that it may be nearly impossible to recover a cDNA representing each gene. Random mutagenesis automatically provides a “normalized” approach to gene discovery and is therefore a second route to functional genomics. In this approach, the consequences of eliminating or altering the contribution of a single gene are explored. Several projects propose generating and sharing collections of mutagenized plants, as well as reporting phenotypic analysis of individual mutants at project websites. For dicots, the main tool will be T-DNA insertion mutants, with the most effort in Arabidopsis. For example, the analysis of osmotic responses (PI Hans Bohnert) will generate
up to 200,000 T-DNA-tagged lines. An independent resource of the same magnitude will be generated by PI Pamela Green and her collaborators. The latter project will sequence the T-DNA/Arabidopsis junction fragments for many of the insertion events, facilitating the identification of mutations in specific genes.

Transposons are the choice for insertional mutagenesis in maize. In the first strategy, PI Hugo Dooner exploits the tendency of Ac to transpose to closely linked sites. With local hopping, the project can perform saturation mutagenesis from several locations and perfect PCR techniques for rapid recovery and analysis of large collections of insertions. A second goal is to produce transgenic maize lines with genetically engineered Ac elements at diverse locations. This collection of lines will facilitate future gene-discovery and mutagenesis efforts using the methods developed in the first phase of the project.

Two projects will use Mu elements as transposon tags; these elements are thought to transpose preferentially into genes but randomly with regard to chromosomes. Using standard Mutator lines with many copies of mobile Mu elements, the Cold Spring Harbor Laboratory collaborative project (PI Rob Martienssen) will grow and self more than 40,000 Mutator plants. This will generate a public mutant collection that should contain more than 1 million insertions and therefore many different mutations of most genes. Mutants of interest will be identified by PCR screening using DNA samples prepared from the original plants. One primer reads out of the highly conserved Mu termini and a second primer is gene specific. PCR screening will be provided as a service, and collaborators will receive seed packets (selfed progeny) of plants identified as carrying the mutation. This project is a public version of the TUSC (Trait Utility System for Corn) gene-discovery method developed at Pioneer Hi-Bred International (Bensen et al., 1995).
A second Mu-tagging project will generate fewer mutations but will create an immortalized collection of the mutations in Escherichia coli libraries. Genetically engineered RescueMu elements will be used in a gene-tagging and -sequencing project. RescueMu elements contain pBluescript; therefore, when total maize DNA is transformed into E. coli, only the tagged genes are maintained. PI Virginia Walbot and workers at six other maize genetics laboratories will grow fields of 2304 plants (a 48-row × 48-column grid) and collect leaf punches along each row and each column. Each pool of leaf samples will produce a single RescueMu clone library; the 48-row and 48-column libraries from one grid of plants fit into a 96-well plate. Because the diversity of plasmids in a single library is low (approximately 50–500), only a dozen PCR cycles using a Mu readout and a gene-target primer are required to detect a product band on an agarose gel. Germinal insertions are distinguished from somatic events by amplification of the same fragment in both a row and a column library; the two positive results also define the plant that contained the mutation of interest. The fee for receiving selfed seed from that plant is submission of 1 kb of DNA sequence adjacent to the RescueMu insertion in a target gene. In addition to user-generated genomic sequences, the project will sequence approximately 150,000 RescueMu-maize junctions
to aid in the discovery of genes that are not defined by ESTs.

Phenotypic analysis of public mutant collections will require an unprecedented effort by geneticists, molecular biologists, and physiologists to record plant traits in a common format. A first pass at plant description will be made by the project teams, and these data, including photographs, will be publicly available. At least for maize and Arabidopsis, uniform scoring criteria and a data sheet for recording information will be required if searchable databases are to be developed. In addition, the Arabidopsis project headed by Pamela Green and the Cold Spring Harbor Laboratory Mutator project intend to incorporate additional, more detailed phenotypic information into their databases; this information will be supplied by community members who receive seed. For this system to work, users must agree to contribute to project goals by providing complete and timely information in a usable format. The incentive to do so is that materials will be supplied only to cooperative collaborators.

PLACING GENES AND OTHER MARKERS ON CHROMOSOMAL MAPS

Gene discovery and expression analysis as described in the previous two sections will have an immediate impact on the way many plant biologists design experiments. Map construction is a more abstract goal for investigators with a single gene focus; however, for deeper insight into genome structure, evolution, and definition of candidate loci for quantitative trait loci, maps are invaluable guides (McCouch, 1998). The first round of NSF plant genome projects includes many mapping projects with diverse long-term goals but similar experimental strategies. Larger projects are using a combination of marker-assisted mapping strategies to place cDNAs, simple sequence repeats, and other DNA segments onto chromosome locations and BACs. Within these projects and in all of the smaller projects, novel technologies will be tested to repackage large genomes into more manageable units.
BACs have thus far been the most stable method for propagating large segments of plant chromosomes. The Clemson Center managed by PI Rod A. Wing will continue to be instrumental in generating and characterizing BAC libraries. Their materials and new BAC libraries prepared by individual projects will be essential for integrating physical and genetic maps. BAC fingerprinting underlies much of the physical mapping. With this technique, the restriction-fragment patterns of thousands of BACs are compared, and those with shared patterns are assembled into contigs. By hybridization, DNA markers can be placed on a BAC and even more finely mapped within a BAC. If such markers are also mapped to the chromosomes, then BACs can be assembled along the chromosome map. Sequencing the ends of BACs and mapping these to chromosomes is another approach for placing the BACs onto existing genetic maps. As summarized in Table II, physical maps based on BACs are proposed for rice, sorghum, and soybean; two or more projects will generate information for these species.
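The fingerprint comparison described above can be sketched in a few lines. This is a toy illustration, not the procedure used by any of the funded projects: the BAC names, fragment sizes, matching tolerance, and shared-band threshold are all hypothetical, and real fingerprinting software relies on statistical overlap scores rather than a fixed cutoff.

```python
# Toy BAC fingerprint contig assembly. Clone names, fragment sizes (bp),
# tolerance, and threshold are hypothetical illustration values.
from itertools import combinations

FINGERPRINTS = {
    "bacA": [1200, 2500, 3100, 4800, 7600],
    "bacB": [2500, 3100, 4800, 5300, 9100],
    "bacC": [4800, 5300, 9100, 10400, 650],
    "bacD": [800, 1500, 2200, 6100, 8700],
}


def shared_bands(frags_a, frags_b, tolerance=10):
    """Count fragments of A that match a not-yet-used fragment of B within tolerance bp."""
    unused = sorted(frags_b)
    count = 0
    for size in sorted(frags_a):
        for candidate in unused:
            if abs(candidate - size) <= tolerance:
                unused.remove(candidate)  # each band of B may be matched once
                count += 1
                break
    return count


def assemble_contigs(fingerprints, min_shared=3):
    """Link clones sharing >= min_shared bands and merge links transitively."""
    parent = {name: name for name in fingerprints}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(fingerprints, 2):
        if shared_bands(fingerprints[a], fingerprints[b]) >= min_shared:
            parent[find(a)] = find(b)

    contigs = {}
    for name in fingerprints:
        contigs.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in contigs.values())


if __name__ == "__main__":
    for members in assemble_contigs(FINGERPRINTS):
        print(members)
```

In this made-up data set, bacA and bacC share only one band directly but fall into the same contig because each overlaps bacB; bacD shares no bands with the others and remains a singleton.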
Table II. Physical and genetic mapping of plant chromosomes
Species | Approaches → Goal
Maize, sorghum, rice, sugarcane, barley, cotton, tomato, and soybean | Fingerprint and end-sequence BACs; array EST clones → Affordable access to new technologies
Legumes | Sequence BACs containing gene markers from several legumes → Composite legume map
Arabidopsis and tobacco | In vivo tagging of chromatin DNA → Three-dimensional map of nuclear genome and transcriptional and replication activities
Arabidopsis and Chlamydomonas reinhardtii | Tetrad analysis to map centromere functions → Plant artificial chromosomes
Soybean | BAC-end sequences for 1000 positions → Anchors for future physical maps
Soybean | Fingerprint BACs and place economically important genes → Physical map suitable for marker-assisted breeding and gene cloning
Sorghum, rice, and other grasses | Assemble BAC contigs, cross-link maps of rice and sorghum, and place individual BACs from other grasses → Physical map for grasses
Maize and Arabidopsis | Create rodent cell lines with plant chromosome translocations; use cre-lox recombination to select deletions → Facilitate physical mapping
Maize | Fluorescence in situ hybridization to pachytene chromosomes → Physical map
Maize and sorghum | Fingerprint and map BACs to chromosomes; radiation hybrid rat lines → Integrated genetic and physical map
Maize | Maize chromosomes stably maintained in oat → Mapping to individual chromosomes possible without polymorphisms
For legumes and grasses, the extent of synteny within each group will be evaluated as maps are assembled for multiple species. If the order of genes and other markers is preserved between species, many types of chromosome walking are greatly facilitated (Gale and Devos, 1998). In addition, the genes within the syntenic block are likely to be homologs, allowing accurate comparisons of gene changes. Because the entire genome sequence of Arabidopsis will be known within a few years, synteny with other plants, even distantly related plants, may be recognizable. To test this idea, maps for tomato (PI Steven Tanksley) and Medicago truncatula (PI Douglas Cook) will be cross-referenced to Arabidopsis. Mapping genes and DNA sequences is a fundamental activity in genetics. Scorable phenotypes (alternative alleles at the visible or DNA-sequence level) are required for traditional mapping, and resolution depends on the number of recombinant chromosomes examined. Several physical approaches to circumvent the requirement for polymorphisms and recombination have been funded. Whole chromosomes can be purified, as exemplified by introgression of individual maize chromosomes into oat lines. An existing panel of seven viable oat lines carrying individual maize chromosomes will be expanded to include all 10 maize chromosomes (PI Ronald L. Phillips). Maize probes
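PI/E-mail (Table II, in row order)
Maize, sorghum, rice, sugarcane, barley, cotton, tomato, and soybean: Rod A. Wing/[email protected]
Legumes: Douglas Cook/[email protected]
Arabidopsis and tobacco: Eric Lam/[email protected]
Arabidopsis and Chlamydomonas reinhardtii: Daphne K. Preuss/[email protected]
Soybean: Lila Vodkin/[email protected]
Soybean: David A. Lightfoot/[email protected]
Sorghum, rice, and other grasses: Andrew Paterson/[email protected]
Maize and Arabidopsis: Z. Renee Sung/[email protected]
Maize: W. Zacheus Cande/[email protected]
Maize and sorghum: Edward H. Coe/[email protected]
Maize: Ronald L. Phillips/[email protected]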
that hybridize in situ to only one oat addition line are placed on that chromosome. A second approach to placing maize genes involves analyzing quantitative hybridization differences. DNA samples from monosomics (2N-1), trisomics (2N+1), deletion stocks, and translocation stocks (enriched representation of parts of chromosome arms) will be arrayed onto nylon filters (PI Virginia Walbot).

Sophisticated repackaging of plant chromosomes will be attempted using tricks from mammalian somatic cell genetics. The goal of this research is to prepare stable mammalian cell lines carrying approximately 1% of a plant genome. These cell lines would be the starting material for mapping or sequencing projects. Plant chromosome segments, generated by radiation or cre-lox-mediated recombination, will be fused onto rodent chromosomes. DNA prepared from these lines and in situ hybridization can be used to map unknown genes to segments within chromosomes (PIs Z. Renee Sung and Edward H. Coe).

Centromeres are required for orderly chromosome behavior during mitosis and meiosis, and yet we know little about the DNA required to organize a centromere. PI Daphne Preuss and her collaborators will use tetrad analysis in Arabidopsis and C. reinhardtii to map centromere regions. The elements will be sequenced and then tested for function in artificial chromosomes. For researchers interested in introducing multiple new traits (stacking) into transgenic plants, building a custom chromosome may ultimately provide the best route.

The physical placement of chromosomes in vivo, monitored using fluorescent hybridization or protein tags, will be explored in two projects. PI W. Zacheus Cande will develop a physical map of marker positions along individual maize chromosome arms. This reality check of where markers are located in vivo will be useful in verifying maps developed from BAC contigs. PI Eric Lam will develop a set of fluorescent beacons to triangulate chromosome positions in Arabidopsis and tobacco, and will then map how these positions change as a function of the cell cycle and the state of cellular differentiation. Both of these projects will provide new kinds of information about chromosome behavior, packing, and interaction with nuclear landmarks that transcend our current conception of genomics efforts.

Table III. Genetic diversity and the evolution of functions
Species | Approaches → Goals
Maize, cotton, and Arabidopsis | Identify CelA and related genes involved in cellulose synthase → Characterize mutants
Arabidopsis, rice, Dunaliella, and Mesembryanthemum | Characterize osmotic stress-induced genes and expression patterns; analyze mutants in signal transduction → Discover control logic of a key physiological process
Maize and teosinte | Measure genetic diversity and DNA sequence polymorphism on chromosomes 1 and 3; identify candidate agronomic genes → Statistical analysis of maize-teosinte divergence and current diversity
Tomato and Arabidopsis | Map tomato cDNAs in both species to determine to what extent synteny exists → Ability to clone tomato genes based on Arabidopsis gene position
PI/E-mail (Table III, in row order)
Maize, cotton, and Arabidopsis: Deborah P. Delmer/[email protected]

COMPARATIVE AND EVOLUTIONARY APPROACHES TO GENE FUNCTION

It is a paradox of current molecular analysis that we can simultaneously appreciate that each species is distinguished by subtle and profound differences from all other taxa while using sequence matches to argue for identity of function. At least for some genes, the sequence differences must hold the explanation for the obvious species-specific differences used in classification. But which changes are important in speciation? Have structural or regulatory genes been more likely to acquire novel functions? Have coding regions or regulatory regions been primarily responsible for morphological and functional differences? Is there one pattern, e.g. novel promoters in regulatory genes, or will each trait examined reflect a unique suite of alterations? New analytical methods, variously termed horizontal genomics or phylogenomics, seek answers to these questions (Table III). These methods require large DNA-sequence data sets in which the same gene is sequenced from many accessions.
Furthermore, the specimens used must be phylogenetically instructive examples, and investigators must know the impact of allelic variation of the chosen genes on specific traits.
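As a concrete illustration of what a data set with the same gene sequenced from many accessions makes possible, the sketch below computes average pairwise nucleotide diversity, a standard summary of sequence polymorphism. It is a minimal example under stated assumptions: the accession names and aligned sequences are invented, the alignment is assumed to be gap-free and of equal length, and nothing here is claimed to be the statistical method that any of the funded projects will use.

```python
# Average pairwise nucleotide diversity for one gene sequenced from several
# accessions. Accession names and sequences are hypothetical illustration data.
from itertools import combinations

ALIGNMENT = {
    "accession_1": "ATGGCCATTG",
    "accession_2": "ATGGCCATTG",
    "accession_3": "ATGACCATTG",
    "accession_4": "ATGACCTTTG",
}


def pairwise_differences(seq_a, seq_b):
    """Number of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nucleotide_diversity(alignment):
    """Mean number of differences per site over all pairs of sequences."""
    seqs = list(alignment.values())
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(seqs, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * len(seqs[0]))


if __name__ == "__main__":
    print(f"nucleotide diversity = {nucleotide_diversity(ALIGNMENT):.4f}")
```

Comparing such a value between groups of accessions, or between coding and noncoding segments of the same gene, is one simple way to ask where change has accumulated.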
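PI/E-mail (Table III, continued)
Arabidopsis, rice, Dunaliella, and Mesembryanthemum: Hans J. Bohnert/[email protected]
Maize and teosinte: John F. Doebley/[email protected]
Tomato and Arabidopsis: Steven D. Tanksley/[email protected]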
In the first round of NSF plant genome projects, the nature of divergence between maize and teosinte will be examined by assessing diversity within each group and allelic variation along chromosomes 1 and 3 using 50 to 100 markers (PI John F. Doebley). The pace of change in regulatory regions and exons and in noncoding regions can be assessed, as well as the pattern of change along the chromosomes. This project is likely to contribute new statistical approaches to data evaluation and simulation of models. With the new analytical framework, data generated from the EST and mapping projects in other plants could be used in future projects directed at understanding the fixation of particular traits within or between genera. Complementing the narrow focus of the maize-teosinte approach, comparative studies will examine genes expressed under osmotic stress (PI Hans Bohnert) and those involved in cellulose synthesis (PI Deborah P. Delmer). Diverse species were picked for these analyses, because previous work highlighted special features of their biology. Cogent evolutionary arguments will likely await the intercalation of data from key genes from a more representative set of flowering plants. On the other hand, the genes and regulatory regions found in common in the diverse plants examined could highlight the backbone of conserved features. In these projects, mutants will be used to test the role of individual genes, and this analysis will strengthen the interpretation of conserved roles across broad taxonomic categories.

Received January 7, 1999; accepted January 15, 1999.
LITERATURE CITED
Bensen RJ, Johal GS, Crane VC, Tossberg JT, Schnable PS, Meeley RB, Briggs SP (1995) Cloning and characterization of the maize an1 gene. Plant Cell 7: 75–84
Bouchez D, Höfte H (1998) Functional genomics in plants. Plant Physiol 118: 725–732
Gale MD, Devos KM (1998) Comparative genetics in the grasses. Proc Natl Acad Sci USA 95: 1971–1974
McCouch S (1998) Toward a plant genomics initiative: thoughts on the value of cross-species and cross-genera comparisons in the grasses. Proc Natl Acad Sci USA 95: 1983–1985