
© 1999 Nature America Inc. • http://structbio.nature.com

news and views

Structural genomics: keystone for a Human Proteome Project
Gaetano T. Montelione and Stephen Anderson


A natural extension of the genome sequencing projects involves analysis of the corresponding three-dimensional protein structures. This emerging field of ‘structural genomics’ has the potential to unify our understanding of the structural basis of protein function.

The genome sequencing projects are providing vast quantities of new information that is changing the way biological research is performed¹. These genomic data represent molecular blueprints for a wide range of organisms from microbes to humans. Products of these genes are widely recognized as the next generation of protein therapeutics and targets for the development of pharmaceuticals. Genomic sequence information is invaluable for characterizing evolutionary and functional relationships between related genes, as well as for identification of gene products that are involved in human disease or are unique to specific pathogenic organisms. However, primary sequence data are only part of the complete structural characterization required for understanding mechanisms of protein function and the development of pharmaceuticals.

The next challenge in genomic research involves characterization of the biological functions of these gene products and analysis of the corresponding three-dimensional structures. These genome-scale enterprises have been referred to as functional and structural genomics, respectively.

The primary enabling technology for sequence-based genomic and bioinformatic analysis is high-throughput sequencing of expressed and genomic DNA sequences. By comparison, three-dimensional structure determination of proteins by X-ray crystallography or NMR spectroscopy has been a relatively slow and expensive process. An initiative in structural genomics will require technological advances analogous to rapid DNA-sequencing methods that can provide high-throughput three-dimensional structure determinations.

For protein crystallography, several key advances that have matured only recently have the potential to greatly accelerate the structural analysis process. These include high-level recombinant systems for producing selenomethionyl proteins,

systematic methods for crystallization screening, cryo-crystallography, undulator sources of synchrotron radiation, charge-coupled device (CCD) detectors, multiwavelength anomalous diffraction (MAD) phasing, and software for automatic generation of protein structures from MAD data². The impact of these technologies is just beginning to be realized, but in combination they have the potential for allowing high-throughput analysis of protein structures by X-ray crystallography. Indeed, it has been proposed by some researchers that these combined technologies could be used to design a facility based on one synchrotron beam line with an average throughput approaching one protein structure determination per day.

Protein NMR spectroscopy will also play a key role in structural genomics. Although high-resolution NMR structure determinations are currently limited to proteins or protein domains with molecular weights less than ~30 kD, many important structural-genomic targets are in this size range. Information derived from NMR studies, including chemical-shift perturbation data for deriving structure-activity relationships and internal dynamic data derived from nuclear relaxation studies, is complementary to crystallographic data. Solid-state NMR may also have particular value for structural analysis of integral membrane proteins. Recent advances in NMR technologies have the potential to greatly accelerate the process of solution structure determination. These developments include high-level protein production systems for biosynthetic isotope enrichment, triple-resonance methods for determining resonance assignments, combined measurements of NOE, scalar coupling, and residual dipolar coupling data that provide many local and global conformational constraints, software for automated analysis of 3D structures from NMR data, and sensitivity enhancements provided by random deuteration, 800–900 MHz magnets, superconducting probes, and TROSY detection methods³. Using such technologies in combination, some researchers have already begun to design ‘NMR Parks’, in which a set of 12–20 NMR spectrometers could be configured to have a combined average throughput approaching one structure per day.

Realizing the tremendous opportunities provided by the genomic sequence data, several groups around the world have initiated structural genomics and bioinformatics efforts aimed at characterizing structures of genomically-defined targets. Most efforts involve opportunistic strategies, in which many candidate targets are first screened for suitability for structural analysis, and only those that provide good quality data are pursued. It is estimated that there are 1000–1500 common fold families in nature, of which ~700 (50–70%) are represented in the current protein structure database. Accordingly, one reasonable goal is to determine a ‘basis set’ of folds that includes a representative structure from each domain sequence family⁴⁻⁶. With such a database, most of the remaining structures would be within ‘homology-modeling distance’ of a characterized fold. It has been estimated that one structure from each of some 10,000 domain sequence families will have to be determined in order to sample all of the fold families. However, such estimates depend on many unknown factors including the robustness of comparative modeling methods, the definitions of distinct fold classes, the actual number of such classes in nature, and the methods used to cluster gene families from which representative members will be selected.

A complementary goal of the emerging efforts in structural genomics involves using three-dimensional structural similarities between proteins to identify candidate biochemical functions for novel gene products. Structural similarity is


often preserved over longer evolutionary distances than overall sequence similarity. Bona fide homologous relationships identified by sequence and/or structural similarities can provide crucial clues regarding the biochemical function(s) of a novel gene product. For these reasons it is sometimes possible to apply structure-based functional genomics methods in the analysis of novel gene products, using structural homology to derive clues to potential biochemical functions which can then be tested by appropriate biochemical assays. As one of the primary goals of genomics involves the identification and understanding of biochemical and cellular functions of all gene products, this would seem to be a particularly valuable application of structural genomics.

Efforts to determine structures for representative members of sequence families, particularly for proteins or domains with known biochemical functions, complement efforts to determine structures of proteins with unknown functions aimed at identifying homologous relationships and potential biochemical functions. From an analysis of structures in the current Protein Data Bank, one can expect that one-half to two-thirds of the protein domain structures generated in a structural genomics initiative will map to known folds (and often provide useful clues to biochemical function) while one-third to one-half of structure determinations will identify new fold families that are not yet represented in the current database. Such novel structures would contribute to our overall understanding of the structures of genomic products. In cases where biochemical function is not known, these novel structures could provide testable functional hypotheses based on the clustering of conserved residues in the core and on the surface of the protein, and allow the generation of structure-based profiles that can be used to identify other homologous genes.
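The fold-family figures quoted in this piece can be cross-checked with simple arithmetic. The sketch below is not part of the original analysis: it assumes only the numbers stated in the text (about 700 fold families already represented, out of an estimated 1000–1500 in nature), and the function name is purely illustrative.

```python
def coverage_bounds(known_folds, total_low, total_high):
    """Return (lowest, highest) fraction of fold families already represented.

    The lowest fraction corresponds to the largest estimate of the total
    number of fold families, and the highest to the smallest estimate.
    """
    return known_folds / total_high, known_folds / total_low


if __name__ == "__main__":
    # Figures taken from the text: ~700 known folds, 1000-1500 estimated in total.
    low, high = coverage_bounds(700, 1000, 1500)
    print(f"known folds: {low:.0%} to {high:.0%}")
    print(f"new folds:   {1 - high:.0%} to {1 - low:.0%}")
```

Under these assumptions roughly 47–70% of fold families are already represented, leaving about 30–53% to be discovered, which is broadly in line with the expectation that one-half to two-thirds of new structures will map to known folds.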
Discussions of structural genomics consistently evoke many concerns and caveats. What is the realistic scale of a structural genomics initiative? Will it really be possible, as some researchers have suggested, to integrate the required technologies sufficiently to provide high-throughput analysis of 10,000 protein domain structures in a five-year period? What about the fold families that are missed in such an opportunistic approach, and integral membrane proteins which constitute 15–25% of genomic sequences and some 50% of the pharmaceutically-important receptors? Can we reasonably enumerate folds, or will we eventually learn that the landscape of fold space is continuous and that our fold ‘bins’ are in some ways arbitrary? Is it naive to suggest that a database of three-dimensional protein-domain structures will be sufficient to model all the structural information that we require for understanding biological functions? For example, in multidomain proteins and macromolecular complexes the domain structural units are organized in three-dimensional space in ways that are not easy to predict, and crucial new insights are provided by experimental determinations of the structures of macromolecular assemblies even when the structures of the individual components are already known. How valuable is a database of homology-modeled structures, particularly if many modeled structures are based on a single or a few representatives from the corresponding sequence family? While some functional information can indeed be gleaned just by characterizing a protein’s fold family, other kinds of functional and mechanistic understanding require high-resolution structural information that may not be available from comparative modeling. For such cases, it will still be necessary to have high-resolution experimental structural data. Perhaps 10,000 structures is just a starting point.

Genomic thinking represents a paradigm shift in biology that has just begun to impact on structural biology. Although significant technology development and implementation is still required, structural genomics has certainly come of age⁴⁻⁹.
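The question of scale can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration rather than anything proposed in the text: it assumes a 365-day working year and takes the "one structure per day" figure quoted for a synchrotron beam line or an NMR Park as the per-facility rate. Because that figure is described as something to be approached, the result is a lower bound. The function names are hypothetical.

```python
import math


def required_daily_throughput(n_structures, years, days_per_year=365):
    """Average number of structures per day needed to finish on schedule."""
    return n_structures / (years * days_per_year)


def facilities_needed(n_structures, years, per_facility_per_day=1.0):
    """Minimum number of facilities, each sustaining the given daily rate."""
    rate = required_daily_throughput(n_structures, years)
    return math.ceil(rate / per_facility_per_day)


if __name__ == "__main__":
    # Target taken from the text: 10,000 domain structures in five years.
    rate = required_daily_throughput(10_000, 5)
    print(f"required rate: {rate:.2f} structures per day")
    print(f"facilities at one structure per day: {facilities_needed(10_000, 5)}")
```

On these assumptions the target works out to about 5.5 structures per day, or at least six facilities running continuously at the quoted rate, before allowing for failed targets or downtime.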
Despite its shortcomings, the potential impact of structural genomics in biology is enormous. A database of experimental and homology-modeled structures would have tremendous value, and greatly accelerate the analysis of the remaining structures. The effort will generate not only atomic coordinates, but expression vectors and protocols for preparing large quantities of the corresponding proteins. The technological spin-offs will impact all areas of biology, particularly structural biology and biotechnology. This research enterprise, a key component of a Human Proteome Project, will expand our understanding of the atomic basis of life⁶, provide significant shortcuts to understanding gene function, and generate the most fundamental genomic data besides the sequence information itself. More importantly, as in most adventures of discovery, the most valuable benefits of structural genomics are the unanticipated surprises: outcomes that cannot yet be predicted.

Acknowledgments
We thank E. Arnold, R. Bruccoleri, K. Gunsalus and A. Stock for stimulating discussions related to this report. Support from the New Jersey Commission on Science and Technology and the Merck Genome Research Institute is gratefully acknowledged.

Gaetano Montelione and Stephen Anderson are in the Center for Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, New Jersey 08854-5638, USA. email: [email protected] or [email protected]

1. Tilghman, S. M. Lessons learned, promises kept: A biologist’s view of the Genome Project. Genome Res. 6, 773–780 (1996).
2. Hendrickson, W. & Ogata, C. Phase determination from multiwavelength anomalous diffraction measurements. Meth. Enz. 276, 494–523 (1997).
3. Wüthrich, K. The second decade—into the third millennium. Nature Struct. Biol. 5, 492–495 (1998).
4. Kim, S.-H. Shining a light on structural genomics. Nature Struct. Biol. 5, 643–645 (1998).
5. Pennisi, E. Taking a structured approach to understanding proteins. Science 279, 978–979 (1998).
6. Terwilliger, T. C. et al. Class-directed structure determination: Foundation for a Protein Structure Initiative. Prot. Sci. 7, 1851–1856 (1998).
7. Gaasterland, T. Structural genomics: bioinformatics in the driver’s seat. Nature Biotech. 16, 625–627 (1998).
8. Shapiro, L. & Lima, C. D. The Argonne structural genomics workshop: Lamaze class for the birth of a new science. Structure 6, 265–267 (1998).
9. Sali, A. 100,000 protein structures for the biologist. Nature Struct. Biol. 5, 1029–1032 (1998).

nature structural biology • volume 6 number 1 • january 1999