BIOLOGICAL AND MEDICAL PHYSICS, BIOMEDICAL ENGINEERING
BIOLOGICAL AND MEDICAL PHYSICS BIOMEDICAL ENGINEERING Editor-in-Chief: EliasGreenbaum, Oak RidgeNational Laboratory, Oak Ridge,Tennessee, USA
Volumes Published in This Series: The Physics ofCerebrovascular Diseases, Hademenos, G.J., and Massoud, T.E, 1997
Lipid Bilayers: Structureand Interactions, Katsaras, J., 1999 Physics with IllustrativeExamplesfrom Medicine and Biology: Mechanics, Second Edition, Benedek,G.B., and Villars, EM.H., 2000 Physics with IllustrativeExamplesfrom Medicine and Biology: Statistical Physics, Second Edition, Benedek,G.B., and Villars, EM.H., 2000 Physics with IllustrativeExamplesfrom Medicine and Biology: Electricity and Magnetism, Second Edition, Benedek,G.B., and Villars, EM.H., 2000 Physics ofPulsatile Flow, Zamir, M., 2000 Molecular Engineering ofNanosystems, Rietman, E.A., 2001 Biological Systems UnderExtreme Conditions: Structure and Function, Taniguchi, Y. et al., 2001 IntermediatePhysicsfor Medicine and Biology, Third Edition, Hobbie, R.K., 2001 Epilepsy as a Dynamic Disease, Milton, J., and lung, P. (Eds), 2002 Photonics ofBiopolymers, Vekshin, N.L., 2002 Photocatalysis:Science and Technology, Kaneko, M., and Okura, I., 2002 E. coli in Motion, Berg, H.C., 2004
Biochips: Technology and Applications, Xing,W.-L., and Cheng, J. (Eds.), 2003 Laser-TissueInteractions: Fundamentals and Applications, Niemz, M., 2003 Medical Applications ofNuclear Physics, Bethge,K., 2004 Biological Imaging and Sensing, Furukawa, T. (Ed.), 2004 Biomaterials and TissueEngineering, Shi, D., 2004 Biomedical Devices and TheirApplications, Shi, D., 2004 MicroarrayTechnology and Its Applications, Muller, U.R.,and Nicolau, D.V. (Eds), 2004 Emergent Computation: Emphasizing Bioinformatics, Simon, M., 2005 Molecular and Cellular Signaling, Beckerman,M., March 22, 2005 The Physics ofCoronaryBlood Flow, Zamir, M., May, 2005 The Physics ofBirdsong Mindlin, G.B., Laje, R., August, 2005 Radiation Physicsfor Medical Physicists Podgorsak, E.B., September 2005 Neutron Scattering in Biology-Techniques and Applications Fitter, J., Gutberlet, T., Katsaras, J. (Eds.), January 2006 Forthcoming Titles Topology in Molecular Biology: DNA and Proteins Monastyrsky, M.1. (Ed.), 2006 Optical Polarization in Biomedical Applications Tuchin, V.Y., Wang, L. (et al.), 2006 ContinuedAfter Index
Ying Xu, Dong Xu, and He Liang (Eds.)
Computational Methods for Protein Structure Prediction and Modeling Volume 2: Structure Prediction
~ Springer
YingXu Department of Biochemistry and Molecular Biology University of Georgia 120 Green Street Athens, GA 30602 USA email:
[email protected]
Dong Xu Department of Computer Science Digital Biology Laboratory University of Missouri-Columbia 201 Engineering Building West Columbia, MO 65211 USA email:
[email protected]
Iie Liang Department of Bioengineering Center for Bioinformatics University of Illinois at Chicago 851 S. Morgan Street Chicago, IL 60607-063 USA email:
[email protected]
Library of Congress Control Number: 2006929615 ISBN 10: 0-387-33321-5 ISBN 13: 978-0387-33321-2 Printed on acid-free paper. © 2007 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC,233 Spring Street, New York,NY 10013, USA), except for briefexcerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. 9876543 2 1
springer. com
Preface
An ultimate goal of modern biology is to understand how the genetic blueprint of cells (genotype) determines the structure, function, and behavior ofa living organism (phenotype). At the center of this scientific endeavor is characterizing the biochemical and cellular roles ofproteins, the working molecules of the machinery of life. A key to understanding of functional proteins is the knowledge of their folded structures in a cell, as the structures provide the basis for studying proteins' functions and functional mechanisms at the molecular level. Researchers working on structure determination have traditionally selected individual proteins due to their functional importance in a biological process or pathway ofparticular interest. Major research organizations often have their own protein X-ray crystallographic orland nuclear magnetic resonance facilities for structure determination, which have been conducted at a rate of a few to dozens of structures a year. Realizing the widening gap between the rates ofprotein identification (through DNA sequencing and identification of potential genes through bioinformatics analysis) and the determination of protein structures, a number of large scientific initiatives have been launched in the past few years by government funding agencies in the United States, Europe, and Japan, with the intention to solve protein structures en masse, an effort called structural genomics. A number of structural genomics centers (factory-like facilities) have been established that promise to produce solved protein structures in a similar fashion to DNA sequencing. These efforts as well as the growth in the size of the community and the substantive increases in the ease of structure determination, powered with a new generation of technologies such as synchrotron radiation sources and high-resolution NMR, have accelerated the rate of protein structure determination over the past decade. As of January 2006, the protein structure database PDB contained rv34,500 protein structures. The role of structure for biological sciences and research has grown considerably since the advent of systems biology and the increased emphasis on understanding molecular mechanisms from basic biology to clinical medicine. Just as every geneticist or cell biologist needed in the 1990s to obtain the sequence of the gene whose product or function they were studying, increasingly, those biologists will need to know the structure of the gene product for their research programs in this century. One can anticipate that the rate of structure determination will continue to grow. However, the large expenses and technical details of structure determination mean that it will remain difficult to obtain experimental structures for more than a small fraction of the proteins of interest to biologists. In contrast, DNA sequence determination has doubled routinely in output for a couple of decades. The genome projects have led to the production of 100 gigabytes of DNA data in Genbank, and
v
vi
Preface
as the cost of sequencing continues to drop and the rate continuesto accelerate, the scientific community anticipates a day when every individual has the genes of their interest and the genomes of all related major organisms sequenced. Structure determination of proteins began before nucleic acids could be sequenced, which now appears almost ironic. As microchemistry technologies continue to mature, ever more powerful DNA sequencing instruments and new methods for preparation of suitable quantities of DNA and cheaper, higher sequencing throughput, while enabling a revolution in the biological and biomedical sciences, also left structure determination way behind. As sequencing capacity matured in the last few decades of the twentieth century, DNA sequences exceeded protein structures by IO-fold, then IOO-fold, and now there is a IOOO-fold difference between the number of genes in Genbank and the number of structures in the PDB. The order of magnitude difference is about to jump again, in the era ofmetagenomics, as the analyses of communities of largely unculturable organisms in their natural states come to dominate sequence production. The 1. Craig Venter Institute's Sargasso Sea experiment and other early metagenomics experiments at least doubled the number of known open reading frames (ORFs) and potential genes, but the more recent ocean voyage data (or GOS) multipled the number on the order of another IO-fold,probably more. The rate of discovery of novel genes and correspondingly novel proteins has not leveled off, since nearly half of new microbial genomes tum out to be novel. Furthermore, in the metagenomics data, new families ofproteins are discovered directly proportional to the rate of gene (ORF) discovery. The bottom line is quite simple. Despite the several fold reduction in cost in structure determination due to the structural genomics projects-the NIH Protein Structure Initiative and comparable initiatives around the world-and the steady increase in the rate of protein structure determination, the number of proteins with unknown structures will continue to grow vastly faster. At an early structural genomics meeting in Avalon, New Jersey, the experimental community voted in favor ofexperimentally solving 100,000 structures ofproteins with less than 30% sequence identity to proteins with known structures. This seemed to some theoreticians at the time as solving "the protein structure problem" and removing the need for theory, simulation, and prediction. Now, while it appears that this goal is aiming too high for just the initiative alone, certainly, the structural community will have 100,000 structures in the PDB not long after the end of this decade-and probably sooner than expected as costs continue to go down and technologies continue to advance. Yet, those 100,000 structures will be significantly less than 1% of the known ORFs genes! The problem, therefore, is not about having structures to predict, but having robust enough methods to make predictions that are useful at deep levels in biology, from helping us infer function and directing experimental efforts to providing insight into ligand binding, molecular recognition, drug discovery, and so on. The kind of success in terms of"reasonable" accuracy for "most" targets has been the grand success of the CASP competition (see Chapter 1) but is completely inadequate for the biology ofthe twenty-first century and the expectations ofboth basic and applied life sciences. Prediction is not at the requisite level ofcomprehensive robustness yet, and therein is one of the features of critical importance of the discussions in this book.
Preface
vii
Computational methods for predicting protein structure have been actively pursued for some time. Their acceptance and importance grew rapidly after the establishment of a blind competition for predicting protein structure, namely, CAS~ CASP involves theoreticians predicting then-unknown protein structures and their verification and analysis following subsequent experimental determination. The validation of the general approach both enhanced funding and brought participants to the field and pointed to the limitations of current methods and the value of extensive research into advanced computational tools. Overall, the rapidly growing importance of structural data for biology fueled the emergence ofa new branch ofcomputational biology and of structural biology, an interface between the methods ofbioinformatics and molecular biophysics, namely, structuralbioinformatics. Similar to genomic sequence analysis, bioinformatic studies of protein structures could lead to both deep and general or broad insights about aspects such as the folding, evolution, and function of proteins, the nature of protein-ligand and protein-protein interactions, and the mechanisms by which proteins act. The success of such studies could have immense impacts not just on science but on the whole society through providing insight into the molecular etiology of diseases, developing novel, effective therapeutic agents and treatment regimens, and engineering biological molecules for novel or enhanced biochemical functions. As one of the most active research fields in bioinformatics, structural bioinformatics addresses a wide spectrum of scientific issues, including the computational prediction of protein secondary and tertiary structures, protein docking with small molecules and with macromolecules (i.e., DNA, RNA, and proteins), simulation of dynamic behaviors of proteins, protein structure characterization and classification, and study of structure-function relationships. While proteins were viewed as essentially static three-dimensional structures up until the 1980s, the establishment of computational methods, and subsequent advances in experimental probes that could provide data at suitable time scales, led to a revolution in how biologists think about proteins. Indeed, over the past few decades, computational studies using molecular dynamics simulations of protein structure have played essential roles in understanding the detailed functional mechanisms of proteins important in a wide variety of biological processes. Within the applied life sciences, protein docking has been extensively applied in the drug discovery pipeline in the pharmaceutical and biotech industry. Protein structure prediction and modeling tools are becoming an integral part of the standard toolkit in biological and biomedical research. Similar to sequence analysis tools, such as BLAST for sequence comparison, the new methods for structure prediction are now among the first approaches used when starting a biological investigation, conducted prior to actual experimental design. That computational analysis would become the first step for experimentalists represents a major paradigm shift that is still occurring but is clearly essential to deal with the maturation of the field, the large quantities of data, and the complexity of biology itself as reflected in the requirement for today's powerful experimental probes used to address sophisticated questions in biology. This paradigm shift was noted first by Wally Gilbert, in a prescient article fifteen years ago ("Toward a new paradigm for molecular biology,"
viii
Preface
Nature 1991, 349:99), who asserted that biologists would have to change their mode of approach to studying nature and to begin each experimental project with a bioinformatics analysis of extant literature and other computational approaches. This paradigm shift is deeply interconnected with the increased emphasis on computational tools and the expectation for robust methods for structure prediction. Similar to other fields ofbioinformatics, structural bioinformatics is a rapidly growing science. New computational techniques and new research foci emerge every few months, which makes the writing of textbooks a challenging problem. While a number of books have been published covering various aspects of protein structure prediction and modeling, it is widely recognized that the field lacks a comprehensive and coherent overview ofthe science of "protein structure prediction and modeling," which span a range from very basic problems (around physical and chemical properties and principles), such as the potential function and free energies that determine the folded shape ofa protein, to the algorithmic techniques for solving various structure prediction problems, to the engineering aspects of implementation of computer prediction software, and to applications of prediction capabilities for investigations focused on functional properties. As educators at universities, we feel that there is an urgent need for a well-written, comprehensive textbook, one that proverbially goes from soup to nuts, and that this requirement is most critical for beginners entering this field as young students or as experienced researchers coming from other disciplines. This book is an attempt to fill this gap by providing systematic expositions ofthe computational methods for all major aspects ofprotein structure analysis, prediction, and modeling. We have designed the chapters to address comprehensively the main topics of the field. In addition, chapters have been connected seamlessly through a systematic design of the overall structure of the book. We have selected individual topics carefully so that the book would be useful to a broad readership, including students, postdoctoral fellows, research scientists moving into the field, as well as professional practitionerslbioinformatics experts who want to brush up on topics related to their own research areas. We expect that the book can be used as a textbook for upper undergraduate-level or graduate-level bioinformatics courses. Extensive prior knowledge is not required to read and comprehend the information presented. In other words, a dedicated reader with a college degree in computational, biological, or physical science should be able to follow the book without much difficulty. To facilitate learning and to articulate clearly to the reader what background is needed to obtain the maximum benefit from the book, we have included four appendices describing the prerequisites in (1) biology, (2) computer science, (3) physics and chemistry, and (4) mathematics and statistics. If a reader lacks knowledge in a particular area, he or she could benefit by starting from the references provided in the corresponding appendix. While the chapters are organized in a logical order, each chapter in the book is a self-contained review of a specific subject. Hence, a reader does not need to read through the chapters sequentially. Each chapter is designed to cover the following material: (1) the problem definition and a historical perspective, (2) a mathematical or computational formulation of the problem, (3) the computational methods and
Preface
ix
algorithms, (4) the performance results, (5) the existing software packages, (6) the strengths, pitfalls, and challenges in current research, and (7) the most promising future directions. Since this is a rapidly developing field that encompasses an exceptionally wide range of research topics, it is difficult for any individual to write a comprehensive textbook on the entire field. We have been fortunate in assembling a team of experts to write this book. The authors are actively doing research at the forefront of the major areas of the field and bring extensive experience and insight into the central intellectual methods and ideas in the subdomain and its difficulties, accomplishments, and potential for the future. Chapter 1 (A Historical Perspective and Overview of Protein Structure Prediction) gives a perspective on the methods for the prediction ofprotein structure and the progress that has been achieved. It also discusses recent advances and the role of protein structure modeling and prediction today, as well as touching briefly on important goals and directions for the future. Chapter 2 (Empirical Force Fields) addresses the physical force fields used in the atomic modeling ofproteins, including bond, bond-angle, dihedral, electrostatic, van der Waals, and solvation energy. Several widely used physical force fields are introduced, including CHARMM, AMBER, and GROMOS. Chapter 3 (Knowledge-Based Energy Functions for Computational Studies of Proteins) discusses the theoretical framework and methods for developing knowledge-based potential functions essential for protein structure prediction, protein-protein interaction, and protein sequence design. Empirical scoring functions including single-body energy function, statistical method for pairwise interaction between amino acids, and scoring function based on optimization are addressed. Chapter 4 (Computational Methods for Domain Partitioning of Protein Structures) covers the basic concept of protein structural domains and practical applications. A number of computational techniques for domain partition are described, along with their applications to protein structure prediction. Also described are a few, widely used, protein domain databases and associated analysis tools. Chapter 5 (Protein Structure Comparison and Classification) discusses the basic problem ofprotein structure comparison and applications, and computational approaches for aligning two protein structures. Applications of the structure-structure alignment algorithms to protein structure search against the PDB and to protein structural motif search in the PDB are also discussed. Chapter 6 (Computation of Protein Geometry and Its Applications: Packing and Function Prediction) treats protein structures as 3D geometrical objects, and discusses structural issues from a geometric point of view, such as (1) the union of ball models, molecular surface, and solvent-accessible surface, (2) geometric constructs such as Voronoi diagram, Delaunay triangulation, alpha shape, surface geometry (including cavities and pockets) and their computation, (3) local surface similarity measure in terms of shape and sequence, and (4) function prediction based on protein surface patterns. Also described are the application issues of these computational techniques. Chapter 7 (Local Structure Prediction of Proteins) covers protein secondary structure prediction, supersecondary structure prediction, prediction of disordered
x
Preface
regions, and applications to tertiary structure prediction. A number of popular prediction software packages are described. Chapter 8 (Protein Contact Maps Prediction) describes the basic principles for residue contact predictions, and computational approaches for prediction ofresidueresidue contacts. Also discussed is the relevance to tertiary structure prediction. A number of popular prediction programs are introduced. Chapter 9 (Modeling Protein Aggregate Assembly and Structure) describes the basic problem of structure misfolding and implications, experimental approach for data collection in support of computational modeling, computational approaches to prediction of misfolded structures, and related applications. Chapter 10 (Homology-Based Modeling of Protein Structure) presents the foundation for homology modeling, computational methods for sequence-sequence alignment and constructing atomic models, structural model assessment, and manual tuning ofhomology models. A number ofpopular modeling packages are introduced. Chapter 11 (Modeling Protein Structures Based on Density Maps at Intermediate Resolutions) discusses methods for constructing atomic models from density maps ofproteins at intermediate resolution, such as those obtained from electron cryomicrosopy. Details of application of computational tools for identifying a-helices, B-sheets, as well as geometric analysis are described. Chapter 12 (Protein Structure Prediction by Protein Threading) describes the threading approach for predicting protein structure. It discusses the basic concepts of protein folds, an empirical energy function, and optimal methods for fitting a protein sequence to a structural template, including the divide-and-conquer, the integer programming, and tree-decomposition approaches. This chapter also gives practical guidance, along with a list of resources, on using threading for structure prediction. Chapter 13 (De Novo Protein Structure Prediction) describes protein folding and free energy minimization, lattice model and search algorithms, off-lattice model and search algorithms, and mini-threading. Benchmark performance ofvarious tools in CASP is described. Chapter 14 (Structure Prediction of Membrane Proteins) covers the methods for prediction of secondary structure and topology of membrane proteins, as well as prediction oftheir tertiary structure. A list ofuseful resources for membrane protein structure prediction is also provided. Chapter 15 (Structure Prediction of Protein Complexes) describes computational issues for docking, including protein-protein docking (both rigid body and flexible docking), protein-DNA docking, and protein-ligand docking. It covers computational representation for biomolecular surface, various docking algorithms, clustering docking results, scoring function for ranking docking results, and start-of-theart benchmarks. Chapter 16 (Structure-Based Drug Design) describes computational issues for rational drug design based on protein structures, including protein therapeutics based on cytokines, antibodies, and engineered enzymes, docking in structurebased drug design as a virtual screening tool in lead discovery and optimization, and ligand-based drug design using pharmacophore modeling and quantitative
Preface
xi
structure-activity relationship. A number of software packages for structure-based design are compared. Chapter 17 (Protein Structure Prediction as a Systems Problem) provides a novel systematic view on solving the complex problem of protein structure prediction. It introduces consensus-based approach, pipeline approach, and expert system for predicting protein structure and for inferring protein functions. This chapter also discusses issues such as benchmark data and evaluation metrics. An example of protein structure prediction at genome-wide scale is also given. Chapter 18 (Resources and Infrastructure for Structural Bioinformatics) describes tools, databases, and other resources of protein structure analysis and prediction available on the Internet. These include the PDB and related databases and servers, structural visualization tools, protein sequence and function databases, as well as resources for RNA structure modeling and prediction. It also gives information on major journals, professional societies, and conferences of the field. Appendix 1 (Biological and Chemical Basics Related to Protein Structures) introduces central dogma of molecular biology, macromolecules in the cell (DNA, RNA, protein), amino acid residues, peptide chain, primary, secondary, tertiary, and quaternary structure of proteins, and protein evolution. Appendix 2 (Computer Science for Structural Informatics) discusses computer science concepts that are essential for effective computation for protein structure prediction. These include efficient data structure, computational complexity and NP-hardness, various algorithmic techniques, parallel computing, and programming. Appendix 3 (Physical and Chemical Basis for Structural Bioinformatics) covers basic concepts of our physical world, including unit system, coordinate systems, and energy surfaces. It also describes biochemical and biophysical concepts such as chemical reaction, peptide bonds, covalent bonds, hydrogen bonds, electrostatic interactions, van der Waals interactions, as well as hydrophobic interactions. In addition, this chapter discusses basic concepts from thermodynamics and statistical mechanics. Computational sampling techniques such as molecular dynamics and Monte Carlo method are also discussed. Appendix 4 (Mathematics and Statistics for Studying Protein Structures) covers various basic concepts in mathematics and statistics, often used in structural bioinformatics studies such as probability distributions (uniform, Gaussian, binomial and multinomial, Dirichlet and gamma, extreme value distribution), basics of information theory including entropy, relative entropy, and mutual information, Markovian process and hidden Markov model, hypothesis testing, statistical inference (maximum likelihood, expectation maximization, and Bayesian approach), and statistical sampling (rejection sampling, Gibbs sampling, and Metropolis-Hastings algorithm). YingXu Dong Xu Jie Liang John Wooley April 2006
Acknowledgments
During the editing of this book, we, the editors, have received tremendous help from many friends, colleagues, and families, to whom we would like to take this opportunity to express our deep gratitude and appreciation. First we would like to thank Dr. Eli Greenbaum of Oak Ridge National Laboratory, who encouraged us to start this book project and contacted the publisher at Springer on our behalf. We are very grateful to the following colleagues who have critically reviewed the drafts of the chapters of the book at various stages: Nick Alexandrov, Nir Ben-Tal, Natasja Brooijmans, Chris Bystroff, Pablo Chacon, Luonan Chen, Zhong Chen, Yong Duan, Roland Dunbrack, Daniel Fischer, Juntao Guo, Jaap Heringa, Xiche Hu, Ana Kitazono, loan Kosztin, Sandeep Kumar, Xiang Li, Guohui Lin, Zhijie Liu, Hui Lu, Alex Mackerell, Kunbin Qu, Robert C. Rizzo, Ilya Shindyalov, Ambuj Singh, Alex Tropsha, Iosif Vaisman, Ilya Vakser, Stella Veretnik, Bjorn Wallner, Jin Wang, Zhexin Xiang, Yang Dai, Xin Yuan, and Yaoqi Zhou. Their invaluable input on the scientific content, on the pedagogical style, and on the writing style helped to improve these book chapters significantly. We also want to thank Ms. Joan Yantko of the University of Georgia for her tireless help on numerous fronts in this book project, including taking care of a large number of email communications between the editors and the authors and chasing busy authors to get their revisions and other material. Last but not least, we want to thank our families for their constant support and encouragement during the process of us working on this book project.
xiii
Contents
Contributors. ............................................................................. xvii 12 Protein Structure Prediction by Protein Threading...................... Ying Xu, Zhijie Liu, Liming Cai, and Dong Xu
1
13 De Novo Protein Structure Prediction..... .... ... .... ............... ....... .. Ling-Hong Hung, Shing-Chung Ngan, and Ram Samudrala
43
14
65
Structure Prediction of Membrane Proteins................... XicheHu
15 Structure Prediction of Protein Complexes................................. Brian Pierce and Zhiping Weng
109
16 Structure-Based Drug Design.. Kunbin Qu and Natasja Brooijmans
135
17 Protein Structure Prediction as a Systems Problem. ..................... 177 Dong Xu and Ying Xu 18 Resources and Infrastructure for Structural Bioinformatics.. Dong Xu, Jie Liang, and Ying Xu Appendix 1 Biological and Chemical Basics Related to Protein Structures Hong Guo and Haobo Guo Appendix 2 Computer Science for Structural Informatics....... Guohui Lin, Dong Xu, and Ying Xu
207
229
241
Appendix 3 Physical and Chemical Basis for Structural Bioinformatics ...................................................................... 267 Hui Lu, Ognjen Perisic, and Dong Xu
xv
xvi
Contents
Appendix 4 Mathematics and Statistics for Studying Protein Structures................................................................. Yang Dai and Jie Liang
299
Index........................................................................................ 317
Contributors
Natasja Brooijmans Chemical and Screening Sciences Wyeth Research Pearl River, New York 10965
Christopher Bystroff
University of Tennessee Knoxville, Tennessee 37996
Hong Guo Department of Biochemistry and Cellular and Molecular Biology
Department of Biology Rensselaer Polytechnic Institute
University of Tennessee
Troy, New York 12180
Knoxville, Tennessee 37996
Liming Cai Department of Computer Science University of Georgia Athens, Georgia 30602-7404
Jun-tao Guo Department of Biochemistry and Molecular Biology University of Georgia
Orhan Camoglu
Athens, Georgia 30602-7229
Department of Computer Science
Carol K. Hall
University of California Santa Barbara Santa Barbara, California 93106
Department of Chemical and Biomolecular Engineering
YangDai Department of Bioengineering
North Carolina State University Raleigh, North Carolina 27695
University of Illinois at Chicago Chicago, Illinois 60607-7052
HaoboGuo Department of Biochemistry and
Jaap Heringa Centre for Integrative Bioinformatics Vrije Universiteit 1081 HV Amsterdam, The
Cellular and Molecular Biology
Netherlands xvii
Contributors
xviii
XicheHu
HuiLu
Departmentof Chemistry
Departmentof Bioengineering
University of Toledo
University of Illinois at Chicago
Toledo, Ohio 43606
Chicago, Illinois 60607-7052
Ling-Hong Hung
JianpengMa
Departmentof Microbiology
Departmentof Biochemistry and
University of Washington Seattle,Washington 98195-7242
MolecularBiology Baylor Collegeof Medicine Houston, Texas 77030
Xiang Li
Departmentof Bioengineering University of Illinois at Chicago Chicago, Illinois 60607-7052
and Department of Bioengineering Rice University Houston, Texas 77005 Alexander D. MacKerell, Jr.
Jie Liang Department of Bioengineering
University of Illinois at Chicago Chicago,Illinois 60607-7052
Departmentof Pharmaceutical Chemistry School of Pharmacy University of Maryland Baltimore, Maryland21201
GuohuiLin
Departmentof Computing Science
Shing-Chung Ngan
University of Alberta
Department of Microbiology
Edmonton, Alberta T6G 2E8, Canada
University of Washington Seattle,Washington 98195-7242
Zhijie Liu
Departmentof Biochemistryand MolecularBiology
Ognjen Perisic
University of Georgia
Departmentof Bioengineering University of Illinois at Chicago
Athens, Georgia 30602-7229
Chicago, Illinois 60607-7052
xix
Contributors
Brian Pierce
Stella Veretnik
Department of Biomedical
San Diego Supercomputer Center
Engineering
University of California San Diego
Boston University
San Diego, California 92093-0505
Boston, Massachusetts 02215
Zhiping Weng Kunbin Qu Department of Chemistry
Department of Biomedical Engineering
Rigel Pharmaceuticals, Inc.
Boston University
San Francisco, California 94080
Boston, Massachusetts 02215
Ram Samudrala
Ronald B. Wetzel
Department of Microbiology
Department of Structural Biology
University of Washington
Pittsburgh Institute for
Seattle, Washington 98195-7242
Neurodegenerative Diseases University of Pittsburgh School of
Ilya Shindyalov San Diego Supercomputer Center
Medicine Pittsburgh, Pennsylvania 15260
University of California SanDiego San Diego, California 92093-0505
Victor A. Simossis
John C. Wooley Associate Vice Chancellor for Research
Centre for Integrative Bioinformatics
University of California San Diego
Vrije Universiteit
San Diego, California 92093-0043
1081 HV Amsterdam, The Netherlands
Zhexin Xiang Ambuj K. Singh
Center for Molecular Modeling
Department of Computer Science
Center for Information Technology
University of California Santa Barbara
National Institutes of Health
Santa Barbara, California 93106
Bethesda, Maryland 20892-5624
xx
Contributors
Dong Xu
Yuzhen Ye
Computer Science Department
Bioinformatics and Systems Biology
University of Missouri-Columbia Columbia, Missouri 65211-2060
Department The Burnham Institute for Medical Research
YingXu Institute ofBioinformatics and Department of Biochemistry and Molecular Biology
La Jolla, California 92037
Xin Yuan Department of Computer Science
University of Georgia
Florida State University
Athens, Georgia 30602-7229
Tallahassee, Florida 32306
12 Protein Structure Prediction by Protein Threading Ying Xu, Zhijie Liu, Liming Cai, and Dong Xu
12.1 Introduction The seminal work of Bowie, Liithy, and Eisenberg (Bowie et aI., 1991) on "the inverse protein folding problem" laid the foundation ofprotein structure prediction by protein threading. By using simple measures for fitness ofdifferent amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold. Their follow-up work (Elofsson et aI., 1996; Fischer and Eisenberg, 1996; Fischer et aI., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et aI., 1992) on protein fold recognition led to the development of a new brand ofpowerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many ofthe proteins encoded in the hundreds of genomes that have been sequenced up to now. What has made protein threading particularly attractive as a protein structure prediction tool is the observation that the number ofunique structural folds in nature is a few orders ofmagnitude smaller than the number ofproteins in nature (Finkelstein andPtitsyn, 1987; BairochetaI., 2005). Although this is still not a fully resolved issue, both theoretical and statistical studies (Murzin et aI., 1995; Brenner et aI., 1996; Li et aI., 1996, 1998,2002; Wang, 1996; Orengo et aI., 1997; Holm and Sander, 1996a; Zhang and DeLisi, 1998) suggest that the number ofunique structural folds in nature ranges from a few hundred to a few thousand. Clearly this is a significantly smaller number than the number ofproteins in nature - as we understand now, the number of different living organisms on earth could range from millions to hundreds ofmillions (May, 1988). Since each organism often has at least thousands of different proteins encoded in its genome, the total number of different proteins in nature is at least in the tens ofbillions or possibly significantly higher, even without considering protein variants such as alternatively spliced proteins. This disparity suggests an effective paradigm for possibly solving all protein structures through combining experimental and computational approaches, that is, to solve structures of proteins with unique structural folds using the expensive and time-consuming experimental techniques
2
Ying Xu et ale
and computationally model the rest ofthe proteins using the experimental structures as templates. This is the key strategy currently being employed by the worldwide Structural Genomics efforts (Gaasterland, 1998; Skolnick et aI., 2000; Baker and Sali, 2001). The basic idea of protein threading is to place (align or thread) the amino acids of a query protein sequence, following their sequential order and allowing gaps, into structural positions of a template structure in an optimal way measured by fitness scores outlined above. This procedure will be repeated against a collection of previously solved protein structures for a given query protein. These sequencestructure alignments, i.e., the query sequence against different template structures, will be assessed using statistical or energetic measures for the overall likelihood of the query protein adopting each ofthe structural folds. The "best" sequence-structure alignment provides a prediction of the backbone atoms of the query protein, based on their placements in the template structure. Currently, protein threading is being widely used in molecular biology and biochemistry labs, often for initial studies of target proteins, as it may quickly provide structural and functional information, which could be used to guide further experimental design and investigation. As a prediction technique, protein threading has a number of highly challenging computational and modeling problems. These include (a) how to effectively and accurately measure the fitness of a sequence placed in a template structure, (b) how to accurately and efficiently find the best alignment between a query sequence and a template structure based on a given set of fitness measures, (c) how to assess which sequence-structure alignment among the ones against different template structures represents a correct fold recognition and an accurate (backbone) structure prediction, and (d) how to identify which parts of a predicted structure are accurate and which parts are not. As researchers find more effective solutions to these and other challenging problems, we expect that protein threading will play an increasingly significant role in structural and functional studies of proteins.
12.2 Protein Domains, Structural Folds, and Structure Space As of now, over a million protein sequences have been determined (Bairoch et aI., 2005), among which ~30,000 have had their tertiary structures experimentally solved (Dutta and Berman, 2005). Given that there could be at least tens of billions of different proteins in nature as discussed above, one interesting question, particularly relevant to the idea of protein threading, is how many unique protein structures or structural folds these proteins might have adopted. To answer this question, we need to first look at the basic structural units of proteins, called protein domains (Wetlaufer, 1973; Richardson, 1981). Protein domain is extensively discussed in Chapter 4 of this book. Here we describe briefly the concept of a domain from the perspective of threading. A structural domain is a distinct and compact structural unit that could fold independently of other domains.
12. Protein Structure Prediction by Protein Threading
3
While many proteins are single-domain proteins, there are proteins with two, three, or even more structural domains (Ekman et aI., 2005). Our study shows that in the FSSP (Fold classification based on Structure-Structure alignment of Proteins) nonredundant set (Holm and Sander, 1996b) of the PDB, 67% of the proteins have single domains, 21% have two domains, 7% have three domains, and the remaining 5% have four or more domains (Xu et aI., 2000a). This distribution may not necessarily reflect the actual domain distribution for all the proteins in nature as the set of known protein structures in PDB might have overrepresented small proteins due to the relative ease in solving these structures. For eukaryotic organisms, it was estimated that at least two-thirds oftheir proteins are multidomain proteins (Gerstein and Hegyi, 1998; Gerstein, 1997, 1998; Doolittle, 1995; Apic et aI., 2001a,b). This number is somewhat smaller for bacterial and archaeal organisms but still represents a significant percentage of all their proteins. Previously domain partition of multidomain proteins was typically done manually. To keep up with the rate at which protein structures are being solved, there is a clear need for more automated domain-partitioning methods to process the newly solved structures. Currently computer programs are being used for partitioning a protein structure into individual domains. Popular programs for this purpose include DALI (Dietmann and Holm, 2001), DomainParser (Xu et aI., 2000a), and PDP (Alexandrov and Shindyalov, 2003). A protein domain could be part of different protein structures, through combining with other domains. Figure 12.1 shows an example of a domain in different proteins. Both proteins, RNA 3'-terminal phosphate cyclase and glutathione S-transferase, have the thioredoxin fold domain, which has two layers, one with two a-helices and one with four antiparallel r3-strands, although some details differ in the two proteins. The parts other than the thioredoxin fold domain in the two proteins have no structural relationship. Since domains are the basic structural units of proteins, current studies on the number of unique structural folds in nature have been carried out on protein domains rather than whole proteins (hereafter, the term "proteins" will refer to single-domain proteins for simplicity of discussion). To estimate the number of unique folds of proteins, one popular approach is through examining all protein families and the relationships between protein families and unique structural folds. Using the definition of SCOP classification scheme (Murzin et aI., 1995; Brenner et al., 1996), a proteinfamily represents a group of orthologous proteins (Makarova et aI. 1999; Gerlt and Babbitt, 2000; Tatusov et aI., 2000; Gelfand et al., 2000). The number ofprotein families in nature could possibly be estimated through finding orthologous gene groups covering all the genes in the genomes that have been sequenced. One such estimate suggests that there are 23,100 such protein families (Orengo et al., 1994). This number has been used in later studies on estimating the number of unique structural folds in nature. Other studies estimate this number to be in the tens of thousands (Koonin et al., 2002). Whether it is 23,100 or some other number ranging from 10,000 to 50,000, the idea is that all proteins in nature fall into one of these families. The estimation on the number of unique structural folds is obtained through estimating the number of
4
Ying Xu et al.
Fig. 12.1 An example ofthe same domain (thioredoxin fold domain shown in red, thick ribbons) appearing in different protein structures. (a) RNA 3'-terminal phosphate cyclase (PDB code Iqmi, chain A), with the thioredoxin fold domain in residues 185 to 279. (b) Glutathione S-transferase (PDB code IkOd, chain A), with the thioredoxin fold domain in residues 109 to 200.
families that each structural fold covers and studying the coverage distribution by all the known structural folds. One of the estimates by Zhang and DeLisi (1998) suggests that the number N of unique structural folds is probably around 700. This estimation is based on the observation that the number of structural folds covering X number of protein families follows a power-law distribution, withXbeing a variable. In essence it says that a few structural folds each cover many families (e.g., TIM barrel fold covers 31 protein families) while many structural folds each cover only a small number of families ; more generally, the number of structural folds decreases as their coverage of protein families increases . Specifically, Zhang and DeLisi proposed a formula
12. Protein Structure Prediction by Protein Threading
5
which matches well with the known protein families and structural folds. Let M and N represent the number of families and the number of unique folds in nature, respectively. The probability that a fold covers exactly x families is given by
P,
= (1 -
MjN)X-1 MIN.
Let M, and N, be the numbers of protein families and unique structural folds currently having solved structures, respectively. Through a simple algebraic transformation, Zhang and DeLisi showed that
which is used to estimate the number of unique structural folds. Using the known numbers of M, = 736 and N, = 361 at the time of the estimation, they estimated that N is roughly about 700, which many researchers will argue to be too small (see following for more discussion). Similar estimations have been made by other researchers, estimating the size of N ranging from a few hundred to a few thousand (Orengo et aI., 1994; Wang, 1996), based on somewhat different assumptions. Coulson and Moult (2002) recently developed a new model for estimating N, based on the work of Zhang and DeLisi. Using the more recent data on the numbers of genes, gene families, and structural folds, they argued that there are two "special" cases that have not been treated well by previous estimation models. Based on their argument, they consider that there are three classes of structural folds, which are termed unifolds, mesofolds, and superfolds. Unifolds represent structural folds that each covers only one family of proteins, superfolds represent structural folds, each ofwhich covers many structural folds, and mesofolds represent structural folds in between. For example, TIM barrel covers 31 families, while many unifolds exist in SCO~ Based on their observation, they argued that previous models such as Zhang and DeLisi did not fit well with the data of unifolds and superfolds. So a new piecewise model was then developed which treats unifolds, mesofolds, and superfolds, separately. Using this new model, Coulson and Moult (2002) estimated that less than 20% of the protein families belong to unifolds, while 20% of the families belong to a few dozen superfolds and the rest of the protein families belong to mesofolds. Considering that the estimated number of protein families ranges from 10,000 to 50,000 (or 23,100 as one of the popular estimates suggests), we can infer that the number of unifolds ranges from 2,000 to 10,000. The number of mesofolds could be estimated using the Zhang and DeLisi model, based on the sizes of M, and Ns , after excluding the unifolds and superfolds. Hence, Coulson and Moult concluded that the most probable size for the number ofmesofolds is about 400. The number of superfolds is believed to be very small, possibly in the range of low dozens. Overall this model suggests that over 80% ofthe protein families fold into a little over 400
6
Ying Xu et al,
structuralfolds, the majorityofwhichare alreadyknown, whilethe restoftheprotein families each belongs to a unique unifold. The implication of this estimation is that about 80% of the protein families are amenable for structural modeling using protein threading techniques, assuming that at least one protein in each of the meso- and superfolds has its structure solved. If experimental facilities for structure solution will strategically select their solution targets to maximally cover all the meso- and superfolds, we could expect that at least 80% of the protein families will be structurally modelable in the near future. This is exactly the strategy that the National Institute of Health (NIH) is using in its Protein Structure Initiative (http://www.nigms.nih.gov/psi/). For the remaining 20% of protein families, it might take some time to have at least one solved structure in each of the unifolds. Hence, the threading technique will be less applicable for this class of proteins, at least in the near future. There are a number of popular schemes and associated databases for classification of proteins into structural folds, including SCOP (Murzin et aI., 1995), CATH (Orengo et aI., 1997), and FSSP (Holm and Sander, 1996b). These classification schemes classify all solved protein structures into different structural folds and subclasses of structural folds. The classification of protein structures is essentially achieved through grouping protein structures into clusters ofsimilar structures, which can be done computationally through structure-structure alignments (Holm and Sander, 1996a). SCOP (Murzin et aI., 1995; Brenner et aI., 1996; Andreeva et aI., 2004) groups all protein structures essentially into a three-level classification tree. At the top level, SCOP (SCOP1.65) currently consists of about 800 structural folds, each ofwhich is further divided into superfamilies and then into families. While a family represents a group of orthologous proteins, a superfamily represents a group of homologous proteins, possibly made ofmultiple families. Currently SCOP consists ofabout 1300 superfamilies and about 2400 families. Among the 800 structural folds, 489 have only one family each, which might represent unifolds; and 9 cover a large number offamilies each, which are considered as superfolds by Coulson and Moult. One thing worth noting is that among the 800 SCOP folds, only 36 represent membrane proteins. This is a reflection of the fact that only 2% of all the solved protein structures are membrane proteins (http://blanco.biomol.uci.edu/MembraneYroteins...xtal.html). This suggests that threading is generally not applicable to structure prediction of membrane proteins, at the present time. SCOP's hierarchical classification ofstructural folds provides a convenient tool for applications of threading methods, as query proteins falling into a SCOP protein family are generally expected to have accurate structure predictions, while proteins with structural homologues in a SCOP superfamily will have a good chance to have the correct structural folds identified and some portions oftheir backbone structures predicted accurately. In general, it still represents a challenge to correctly identify the structural fold by a threading method if a query protein only has a structural analogue (i.e., similar structure but not homologous) in SCO~
12. Protein Structure Prediction by Protein Threading
7
12.3 Fitting a Protein Sequence onto a Protein Structure The realization that protein structures are clustered into structural folds in the structure space and the number ofsuch clusters is possibly quite small has led to a new way ofpredicting protein structures in a more efficient and effective manner. The general belief is that different proteins fold into similar 3D shapes because at some level, they share similar interaction patterns among their residues and between the residues and the environments. It has been shown that these interaction patterns could possibly be captured using simple statistics-based energy models as exemplified by the earlier work of Eisenberg and colleagues (Bowie et al., 1991; Fischer et al., 1996a,b; Fischer and Eisenberg, 1996), the work of Sippl and colleagues (Sippl, 1990), and others (Jones et al., 1992; Rost et al., 1997). These simple statistics-based energy functions have been used, for many cases, to distinguish the correct structural folds from the incorrect ones and to distinguish the correct placements of the residues in a query protein into the structural positions of a correct structural template from the incorrect ones. Placing the (backbone atoms of) residues of a query protein into the correct structural positions in a correct structural fold gives a prediction of the backbone structure of the query protein. To accomplish this, one would need two capabilities: (a) an energy function whose global minimum will correspond to the correct placement of residues into the correct structural template, and (b) a computational algorithm that can find the global minimum of the given energy function. We explain the basic idea of developing such energy functions in this section and leave the algorithmic issues to the next one. In their earlier work (Bowie et al., 1991; Fischer et al., 1996a,b, Fischer and Eisenberg, 1996), Eisenberg and colleagues demonstrated that simple residue-based, instead of atom-based, energy functions could provide substantial discriminating power in separating good from poor placements of individual residue types into different structural environments, justifying the usage ofresidue-based energy functions. In their work, structural environments are simply defined in terms oftwo parameters, solvent accessibility sol and secondary structure ss. Specifically the quantity sol of solvent accessibility is discretized into a number of intervals, say 30--40% exposed to the solvent. A secondary structure could be a helix, a beta-strand, or a loop, or it could be defined in terms of more refined categories of secondary structures, say including different types oftums. Then a structural environment for each residue in a template structure could be defined using (sol, s), say (0-10% exposed, alpha-helix). Statistics could be collected from a collection of solved protein structures about how frequent a particular type of amino acid appears in a particular structural environment as we just defined. This can be done by going through all protein structures under consideration to count the number of occurrences of each amino acid type in each encountered structural environment. If we consider, say, three levels of solvent accessibility, {exposed, intermediately exposed, buried}, and three types of secondary structures, we will have nine types of structural environment. Under this assumption, the result of counting the numbers of occurrences
Ying Xu et al,
8
above will be a 20 by 9 table, with each of the 20 rows representing an amino acid type and each of the 9 columnsrepresenting a structuralenvironment. Based on the collected statistics, we can build a simple preference model to measure how preferred a particular amino acid type is to a particular structural environment. This can be done using the following measure: -In(O·i.]·IE·i.j.))» where Oi,} represents the observed frequency of amino acid type i in structural environment j and Ei,l represents the expected frequency of amino acid type i in structuralenvironment j. In the work of Eisenberg and colleagues, Ei,l is estimated using the frequency of amino acid type i in all proteinsunder consideration. Hence, if an aminoacidtype i has a higherfrequency in a particularstructuralenvironment j than its frequency overall, it willbe assigneda negative score-In(Oi,}I Ei,l); otherwise it will get a positivescore or zero (when Oi,l = Ei,j). The higher Oi,l is comparedto E i.j» the more negative the corresponding energy is. A popular name for this type of energyfunction is singletonenergy. By performingstatistical analysis on a database, one can get the scoring functionusing the above formulation for the 20 amino acid types appearing in the nine structuralenvironments. Table 12.1 shows sucha scoring function, as describedin Xu et al. (1998). Table 12.1 The scoring function for the 20 amino acid types in three secondary structure types with three solvent accessibility levels. The three solvent accessibility types are buried, intermediate, and exposed, from the left column to the right for each secondary structure type. Residue A R N D C
Q
E G H I L K
M F
p
S T W
Y V
Helix -0.741 1.010 0.840 1.113 -0.389 0.587 1.074 0.317 0.335 -0.686 -0.902 2.021 -0.858 -0.585 1.126 0.328 0.302 -0.468 -0.149 -0.491
-0.007 -0.687 0.221 0.243 0.539 -0.428 -0.442 1.234 -0.217 -0.001 -0.178 -0.212 0.378 0.089 0.698 0.454 0.247 -0.270 -0.285 0.251
Sheet
Loop
-0.181 -0.430 0.642 1.367 0.223 0.235 0.070 1.406 -0.199 -0.190 -0.344 0.995 -0.479 0.340 0.042 0.344 0.743 0.483 0.656 -0.247 -0.795 -0.376 0.907 0.250 0.853 0.850 - 0.228 -0.783 2.179 -1.208 0.492 2.651 -0.489 0.119 1.249 0.221 1.220 -0.573 0.618 0.058 0.067 -0.449 -0.842 1.168 -0.029 0.282 1.773 0.218 -0.527 1.010 -0.161 0.714 1.167 -0.116 0.040 -0.849 0.199 -0.127 -0.265 0.848 0.519 -0.387 - 0.117 1.198 1.316 -0.156 1.295 0.408 0.996 0.182 0.987 0.930 0.256 1.672 0.009 0.229 0.977 0.743 1.818 - 0.209 -0.008 2.611 0.107 0.696 1.215 0.872 -0.825 0.230 0.208 0.242 0.692 1.146 -1.040 -0.168 1.085 -0.060 0.079 0.911 0.723 0.704 0.845 1.287 -0.041 -0.621 -0.646 0.111 -0.079 0.088 0.445 0.373 -0.073 -0.612 0.341 - 0.211 -0.419 0.126 0.350 -0.014 -0.377 1.493 -0.816 -0.307 1.843 - 0.082 0.797 0.041 0.991 -0.690 -0.704 0.919 0.461 - 0.292 0.723 1.053 -1.352 -0.324 0.983 0.196 0.737 0.382
12. Protein Structure Prediction by Protein Threading
9
When building such statistics-based energy functions, one needs to be careful in selectingthe data set for statistics collection. Forexample, someproteinsin SCOP (or in PDB) havemore homologous structuresthan the others,which could possibly leadto biased statistics. Toremove this type of statisticbias in our data set, one needs to remove homologous structuresin the data set for statisticscollection. There are a numberof databases forthis purpose,suchas the nonredundant sequencerepresentativesin FSSP, PDB-select(Hobohmet al., 1992), and PISCES(Wang and Dunbrack, 2003),whereno two proteinshavehigher than a certainlevelof sequencesimilarity. Another statistics-based energy functionwidelyused in threadingprogramsis often called pairwise interaction energy. It measures the preference of having two particulartypes of amino acids that are spatiallyclose. One particular form of such an energy function was developed by Sippl (1990). The basic idea of this energy function is to compare the observed frequency of a pair of amino acids within a certain distancein solvedprotein structureswith the expectedfrequency of this pair of aminoacid types in a protein structure. The basic idea of such an energyfunction comes from statistical mechanics where the probability Pi} of having a pairwise interaction betweenresidues i and j has the Boltzmanncorrelationto its energy gij (Gibbs free energy), definedas g
Pij
= exp
( *)j -gij
kT
Z,
where k is the Boltzmannconstant, T is the temperature, andZ is a partitionfunction. When using a residue-averaged state as a reference state g(P), a knowledge-based potential can be derivedusing gij = gij -
g = -kT In
( P~; ) .
Specifically, if N o(i, j) and N E(i, j) represent, the observedand expectednumbers of amino acid types i andj within a certain distance, respectively, we can use the following to measurethe preferenceof havingthese two types of amino acids close to each other: -In(No(i, j)/NE(i, j)).
While N o(i, j) canbe collectedby goingthroughtheproteinstructuresin the sample set, an accurate estimation of NE(i, j) represents a challenge. There have been a number of proposed models for estimatingthis quantity. Among these models, the simplestone is the "independentreference state" model (Xu et al., 1998),in which N E(i, j) is estimatedas follows:
10
Ying Xu et ale Table 12.2 Pairwise scoring functions between 20 amino acid types
A -14 R 27 -2 N 10 -9 D 22 -62 C 33 7 Q 3 -6 E 12 -56 G 1 -8 H 6 -26 I -11 11 L -18 26 K 12 31 M - 7 30 F -6 6 P 17 -3 S 17 - 8 T 6 6 W 5 -15 Y 5 -13 V-II 17
-43 -42 11 -20 -14 -10 6 35 36 -20 31 20 -21 -22 -23 -2 5 30
2 28 -192 7 19 -11 14 12 1 7 -27 9 -7 -3 -29 -45 -19 27 37 - 7 -45 32 15 24 29 18 29 - 33 37 24 2 25 24 20 -16 -28 -56 25 -18 -67 10 5 19 18 21 5 3 14 1 -1 -1 -11 28 3 7 23 11 16 -10 -20 -3 11 -8 -10 -7 -6 37 22 -30 1 -16 -21 -19 -13 21 27 -20 37 -15 -21 -7 -24 11 22 10 5 -1 16 -7 -21 -2 8 27 6 -9 27 6 3 -16 -9 43 20 18 23 20 20 -23 -22
ARND
C
QEGH
I
12 30 -2 -5 - 6 -2 3 -31 27
-49 -27 3 19 16 -1 -17 -5
-21 -2 11 28 3 -1 -4
LKMF
-21 -16 -18 -10 -22 -21 -43 13 9-2 -8 10 16 -10 -1 5 27 7 10 -11 -32 P
S
TWYV
Table 12.2 shows a scoring function for the preference between 20 types of amino acids using the above formulation (Xu et aI., 1998). There are more sophisticated models for defining the reference state, one of which is the uniform distribution model (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), as discussed later. These more complex models take more factors into consideration in building the reference state, hence making the energy models more likely to be accurate. In addition to using physics- or statistics-based energy functions, researchers have incorporated evolutionary information into energy model building. One of the earlier major improvements in energy function modeling is the incorporation of sequence profile information (Panchenko et aI., 2000; Zhou and Zhou, 2005) derived from homologous proteins into the energy models outlined above. It was noticed that when using a sequence profile ofa protein family (or superfamily) rather than a single (query) protein sequence, threading accuracy could be significantly improved (Panchenko et aI., 2000; Zhou and Zhou, 2005). The very basic idea ofthis generalized approach is that rather than asking the question "will protein sequence A fit well with structural fold B?" we ask the more general question "will the whole family of protein A fit well with structural fold B?" Clearly if done properly, this approach could iron out some of the spurious predictions, caused by the appearance ofspecific individual sequences. Now a threading problem becomes a fitting problem between a sequence profile and a structural fold. A sequence profile is defined in terms ofa multiple sequence alignment ofthe members ofthe query protein's family (or superfamily), with each element being the frequency distribution ofthe 20 amino acid types in this aligned position rather than a specific amino acid. To generalize the aforementioned energy functions to take into account the profile information, we can simply use the relative frequency of each amino acid in the position-dependent distribution as a weight factor when calculating the energy values for each amino acid
11
12. Protein Structure Prediction by Protein Threading
or amino acid pairs, and then sum over all possible amino acids or amino acid pairs. Specifically,let Pi be the relative frequency ofamino acid type i in a particular aligned position with L Pi = 1.0. Then the Eisenberg type of energy can be calculated as i
- '""' L...J ~-ln(OZ l,j-IE-Z,j -). i
Similarly, pairwise interaction energy could be generalized as follows:
- L PiP In(No (i, j)/NE(i, j)). j
i.]
Other types of energy functions have also been used in existing threading programs, including fitness scores between specific amino acids and the secondary structures in the template structure and threading alignment gap penalties. Typically these energy scores are combined using a simple weighted sum while often the scaling factors are empirically determined, based on some training data. It has been observed that distance-dependent pairwise interaction energy could provide more accurate threading results than distance-independent models as outlined above. A distance-dependent energy could be estimated as follows:
u«. __ lnNo(i,j,r) (z,j,r) -
NE (.1,},. r )'
where r is the distance between residues i and j (possibly measured between their C-beta atoms), No(i, j, r) is the observed number of pairs of residues (i, j) within a distance bin from r - ~r 12 to r + ~r /2 in a database of folded structures for some bin width Sr, and NE(i, j, r) is the expected number of pairs (i, j) within the same distance bin. The challenging issue in accurately estimating the interacting energy u(i,j,r) is how to estimate NE(i, j, r). Under the assumption that we are dealing with an ideal infinite liquid-state system within a volume V and residues are distributed uniformly (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), NE(i, j, r) can be estimated using
NE(i,j,r) = Ni · Nj
·
(41Tr2~r)
IV,
where N, and N, are the numbers of amino acid types i and j in the protein database, respectively. Researchers have realized that this model needs to be corrected when dealing with finite systems like a protein structure, to make the model more accurate when used in threading programs. Twoparticular corrections are made in the DFIRE (distance-scaled ideal gas reference state) energy model (Zhou and Zhou, 2002), a popular energy function for threadirtg. In the first correction, DFIRE used r" instead of r 2 , considering that the number of interaction pairs in a finite system could not actually reach the level of r 2 as in an infinite system, where a < 2 is
12
YingXu et ale
determined throughminimizing the distribution fluctuation of interaction distanceon a set of trainingdata. In the secondcorrection, DFIREassumes that only short-range interactions need to be considered. That is, interaction energy becomes zero when the distance betweenthe interacting pairs is beyonda cutoffdistancercut. Afterthese corrections, the interactionenergy could be estimatedusing the following formula: No(i,j,r) -11 n ( r )Ct t::,.r , --No (i, j, rcut) rcut t::,.rcut
I
U(i,j,r)
=
o
r > rcut
where constant 11 is related to the systemtemperature and can be determinedempirically. These simple energy models have played key roles in making threading programs as popular and as useful as we have seen today. While they have been used to help to solve many structurepredictionproblems, the limitations of these simple models have also become quite clear as we have seen from the recent CASP prediction results-the improvement in predictionaccuracyhas been only incremental in the past few CASPs (Kinch et aI., 2003). One of the key reasons for the incremental improvement comes from the crudeness of the threading energy functions. Currently, the algorithmic techniques for protein threadinghaveadvanced to a stage that should be able to handle more sophisticated energy models in the threading framework, which could lead to more accuratepredicted structures. We can expect that more detailedand more physics-based energyfunctions will emergein the near future as the field is clearly in need of more accurate energy models for protein threading.
12.4 Calculating Optimal Sequence-Structure Alignments The general form of threadingenergy function could be written as follows: fiE s
+ BE p + E gap ,
where E s measuresthe overall fitness of puttingindividual residuetypesinto specific structuralenvironments, E p measuresthetotalinteraction energyamongpairswithin the cutoffdistance, and E gap represents the total penalty for the gaps in a sequencestructurealignment. The scalingfactors fi andj3 are typicallydeterminedempirically through optimizingthe performance of a threadingprogram on a representative set of proteinpairs. Withthe optimizedfi andj3values, the goal of threadingis to findan alignment(or placement) betweena queryprotein sequence and a templatestructure that optimizesthe energy function. In a sense,proteinthreadingis like sequence-sequence alignmentas it finds an alignmentbetweena sequence of aminoacids and a sequence of structuralpositions
12. Protein Structure Prediction by Protein Threading
13
in a 3D structure. What makes threading more difficult than a sequence alignment problem is the pairwise interaction energy term Ep. For a sequence alignment problem, a simple dynamic programming can guarantee to find the global optimal alignment between two sequences as the problem formulation follows the principle of optimality (Brassard and Bratley, 1996). This type of simple dynamic programming algorithm does not work for a threading problem as the global optimal threading alignment could not be easily reduced to a small number of optimal threading alignments for the partial problems as in a sequence alignment problem. Intuitively, a simple dynamic programming approach, like the one used for sequence alignment that goes from the starts of the sequences to their ends and extends partial optimal alignments to include more elements at each step, will not work for protein threading as at each current point, we do not know what residues will be available to be assigned to future structural positions, which might have interactions with the previously aligned positions. Such interactions will have a global impact on a sequence-structure alignment. It is such a global nature of the problem that makes the threading problem more challenging from the algorithmic point of VIew. There have been a number of studies attempting to understand the "intrinsic" difficulty, or computational complexity, ofthe threading problem. Under the assumption that all pairwise interactions need to be considered, the threading problem was proved to be an NP-hard problem by a number of authors (Lathrop, 1994; Calland, 2003). While these mathematical proofs provide some evidence that the problem is computationally difficult, they might not be particularly relevant to the true difficulty of a threading problem as previous studies have shown that pairwise interactions beyond certain cutoffdistances (e.g., 10-12 A between C-beta atoms) do not contribute to fold recognition and threading alignment (Lund et al., 1997; Melo and Feytmans, 1997; Xu et al., 1998; Zhang and Skolnick, 2004), and hence need not be considered. As of now, it remains an open problem regarding whether the threading problem, using a distance cutoff for pairwise interactions, is polynomial-time solvable. The challenging issue in studying the "intrinsic" computational complexity of a threading problem with a cutoff for pairwise interactions is that we do not have a good and realistic characterization for the overall interaction patterns for such a threading problem. Because of the algorithmic challenges, earlier threading programs have employed various heuristic strategies for solving the optimal sequence-structure alignment problem. One particular strategy is called "frozen approximation" (Westhead et al., 1995). The basic idea is that it uses a dynamic programming approach to find a sequence-structure alignment and uses an approximation scheme to calculate the interaction energy. When the algorithm assigns an amino acid to a specific structural position from the beginning to the end of the query sequence during the dynamic programming procedure, it calculates the relevant interaction energy using the amino acids in the template structure rather than assigned amino acids from the query protein, for all the unassigned structural positions up to the current point of the dynamic programming procedure, within a certain cutoffdistance. Intuitively the
14
YingXu et al,
algorithm should work to some degree in capturing some ofthe interaction "patterns" encoded in the query protein sequence as some of the position-equivalent residues between the native structure and the native-like template structure should have similar physicochemical properties, suggesting the validity ofthe frozen approximation scheme. Practical applications have also confirmed that the frozen approximation, while not guaranteeing global optimal threading alignment, does have an advantage over threading programs that do not consider pairwise interactions (Westhead et al., 1995; Skolnick and Kihara, 2001; Zhang et al., 1997). A number ofrigorous threading algorithms have been developed that guarantee to find the global optimal threading alignments, measured in terms of energy functions that consider pairwise interactions. These include a divide-and-conquer algorithm employed in the PROSPECT threading program (Xu and Xu, 2000) and an integer programming algorithm used in the RAPTOR program (Xu et aI., 2003a,b). It was convincingly demonstrated, through applications ofthese programs at the CASP contests (Xu et aI., 2001; Xu and Li, 2003), that threading programs with guaranteed global optimality do have an advantage over programs without this property. One particular advantage of such programs is that they can be used to rigorously benchmark a proposed energy function. When using programs without such a guarantee to assess an energy function, it will be difficult to decide whether it is the energy function or the lack of rigor in the threading algorithm that has resulted in a subpar performance by a particular energy function. The following provides some detailed information about three types of threading algorithms.
12.4.1
PROSPECT
The basic idea of the divide-and-conquer algorithm in PROSPECT (Xu and Xu, 2000) can be outlined as follows. The algorithm first divides the query protein sequence into two subsequences and also divides the template structure into two substructures by cutting at one of its loop regions. Then it tries to find the globally optimal threading alignments between the first subsequence/substructure pair and between the second subsequence/substructure pair, respectively. When calculating pairwise interaction energy for each "half" of the sequence-structure alignment problem, we might need information about which amino acids are assigned to which structural positions in the other "half" of the problem. To facilitate this calculation, the algorithm uses a simple data structure for each structural position L in the current substructure, that keeps a list of structural positions in the other substructure that are close enough to L to have interactions with the amino acid to be assigned to it. Figure 12.2 depicts schematically the situation where each of the two substructures has a number ofstructural positions, which are close enough to structural positions in the other substructure so that their alignments with amino acids need to be considered when calculating the interaction energy in the other substructure. These structural positions can be considered as extended parts of the other substructure (shown as extended arms for each substructure in Fig. 12.2). The difference between these extended parts and the original positions ofa substructure is that when doing alignment
12. Protein Structure Prediction by Protein Threading
15
Fig.I2.2 Schematic illustration of the divide-and-conquer algorithm of PROSPECT. The blocks represent coresecondary structure elements, andcurvedlinesindicate interactionsbetween linked secondary structures.
between the substructure and the corresponding subsequence, we do not have any knowledge about which amino acids are assigned to these positions in the other substructure. Hence, the optimal threading alignment for each substructure/subsequence pair depends on the optimal threading alignment for other substructure/subsequence pair. This codependence relationship makes the problem challenging. In PROSPECT, this problem is overcome using the following strategy : consider all possible combinations of amino acids possibly assigned to these extended positions (in the other "half" of the problem); and then solve an optimal threading alignment problem for each substructure/subsequence pair under each possible combination of amino acid assignments to these extended structural positions. Assume that we can solve the optimal threading problem for each pair of (extended) substructure/subsequence and for each combination of such an assignment. Then it can be checked that the global optimal threading alignment for the whole structure and sequence must be the union of two optimal threading alignments for the two extended subproblems, under one specific combination of the amino acid assignment to the extended part of each subproblem. This realization lays the foundation for the divide-and-conquer algorithm of PROSPECT as it allows reducing a whole threading problem to two smaller threading problems. If we can solve the smaller threading problems, the whole threading problem can be solved by simply going through all combinations of the amino acid assignments to the extended parts for each subproblem to find the one that gives the overall best combined score. During the "conquer" step, it needs to make sure that the two optimal solutions, to be combined, to the subproblems is consistent. To solve a smaller threading problem, we can apply the same divide-and-conquer strategy to reduce it to even smaller problems. This procedure can continue until the size of the problem is small enough that it can be solved using a brute force exhaustive search strategy. The trick is how to make this algorithm run efficiently.Note that ifnot done carefully, the number of possible combinations of assignments needed to be considered could be very large. In PROSPECT, (sub)structures are divided in such a way which minimizes such a number ofcombinations, through cutting at the "weakest" link with
16
YingXu et aI.
the least interactions between two substructures (Xu et aI., 1999). The overall computational complexity of the divide-and-conquer algorithm is dominantly determined by the thickest link among all weakest links throughout the bipartitioning of a protein structure into a series of small structures during the whole divide-and-conquer scheme. Intriguingly, we found that this thickest link is generally a small number for the vast majority ofthe solved protein structures (Xu et aI., 1999), making the actual computing time of PROSPECT threading practically acceptable. This observation has also raised an interesting question: "Is there something special about the topology of protein structures, which could be further and more rigorously exploited for efficient threading algorithm development?"
12.4.2 RAPTOR RAPTOR uses a more general framework to rigorously solve the threading problem (Xu et aI., 2003a,b) than PROSPECT. It formulates a threading problem as a linear integer programming (LIP) problem. It uses an integer variable ("0" or "1") to represent if a particular residue in the query protein sequence is assigned to a particular structural position. Then the set ofall feasible solutions to a threading problem could be defined in terms of a set of equalities or inequalities (called constraints), each of which is defined in terms of the above and other integer variables. The global optimal threading problem is then defined to find a feasible threading alignment that optimizes a given energy function. Generally a linear integer programming problem requires an exponential computing time to find an optimal solution, and hence intractable for large-size problems. Branch-and-bound represents a popular technique for solving linear integer programming problems. Typically, an integer programming problem is first relaxed to a linear programming problem, i.e., variables could take real values as possible solutions. There are efficient algorithms for solving linear programming problems as they are polynomial-time solvable (Papadimitriou and Christos, 1998). If by chance the solution to the relaxed linear programming problem is all integral, a solution to the original linear integer programming is found. Otherwise the linear solution will be used to constrain the search space, through fixing one variable to "0" or"1" and then the algorithm iterates this process until all solutions have integral values. An interesting observation made is that for the vast majority of threading problems, this relaxation procedure stops after a few iterations, solving the threading problem efficiently and also indicating that threading problems seem to have a special structure in terms of the integer programming formulation. Such special characteristics could possibly be utilized for developing more efficient threading algorithms, using more specialized algorithmic techniques.
12.4.3 Tree-Decomposition-Based Threading Algorithm One particularly interesting technique which is being actively investigated by a number of researchers is based on the idea of tree decomposition of an interaction
12. Protein Structure Prediction by Protein Threading
17
graph representing possible alignments between a query sequence and a template structure (Song et al., 2005; Xu et al., 2005). In a sense this type of technique is a generalization of the divide-and-conquer outlined above. Tree decomposition technique has been widely used for various graph-related optimization problems, for example finding the maximum independent set and dominating set (Arnborg and Proskurowski, 1989). We now provide a detailed description of one such algorithm for solving the protein threading problem. Using a tree decomposition algorithm, both the template structure and the query sequence are represented as graphs; vertices denote core secondary structures (or simply cores) and edges represent interactions between cores (two cores are considered to be in interaction if their shortest distance is within a predefined cutoff distance). A sequence-structure alignment problem essentially corresponds to finding an isomorphic mapping from the structure graph to a subgraph of the sequence graph. The efficiency of the alignment hinges on the tree width of the structure graph. Intuitively, the tree width of a graph measures how much the graph is "treelike." A graph can be represented as a "tree having thick trunks," where the "trunk thickness" is quantified by the tree width of the graph. This technique of "treelike" representation for graphs is called tree decomposition. In a tree decomposition of a graph, vertices of the graph are grouped into possibly overlapping subsets, each of which is associated with a node in the tree. The maximum size of such a subset corresponds to the tree width of the tree decomposition. Given a tree decomposition of a structure graph with tree width t, a dynamic programming algorithm can be employed to find the optimal sequence-structure alignment in time Oik: N 2 ) , for some small integer parameter k and N being the number of amino acids in the template structure. The alignment algorithm is very efficient since the tree width for such structure graphs is small in general (by the nature of protein structures). For example, among 3890 protein tertiary structure templates compiled using PISCES (Wang and Dunbrack, 2003) only 0.8% of them have tree width t > 10 and 92 % have t < 6, when using a 7.5-A C~ - C~ distance cutoff for defining pairwise interactions [see Fig. 12.5(a)]. We now provide the details of a tree-decompositionbased threading algorithm.
12.4.3.1 Graph Representation A sequence-structure alignment can be formulated as a generalized subgraph isomorphism optimization problem, for which both the template structure and the query sequence are represented as mixed graphs that contain both directed and undirected edges. We use V(G), E(G), and A(G) to denote the vertex set, the undirected edge set, and the directed edge (arc) set of a mixed graph G, respectively. The graph H for the template structure is constructed as follows: each vertex in V(H) represents a core, each undirected edge in E(H) represents the interaction between two cores, and each directed edge (arc) in A(H) represents the loop between two consecutive cores (from the N-terminal to the C-terminal). For technical
18
Ying Xu et al,
(a)
(b) __
N
1
...-
----l ~ _ -
2
3
_ .
6
•
C
Fig. 12.3 (a) Folded chain B of proteinkinaseC (PDB-ID IAV) proteinwitheightcoresecondary structures; (b) its corresponding structure graph.
convenience, both N- and C-terminals are presented as vertices. Figure 12.3 gives a protein tertiary structure and its corresponding structure graph representation. A query sequence is preprocessed so that for each core in the template, all substrings (called candidates) of the query sequence that align well with the core are identified (Xu et aI., 2000) . By representing each candidate as a vertex, a query sequence can also be represented as a mixed graph. That is, each edge in E(G) connects a pair of candidates that may possibly interact but do not overlap in the sequence, and each arc in A( G) connects two candidates (from the N-terminal to the C-terminal) that do not overlap. As in a structure graph, both N- and C-terminals are represented as vertices in the sequence graph . Figure 12.4 illustrates the sequence graph with a simple example. The relationship between a core v in the template and its candidates in the query sequence can be constrained using a mapping function M such that M(v) contains all possible candidates of v. The less restricted M is, the more accurate the alignment is expected to be and the more time it may take to compute. Xu et a1. (1998) used a similar approach in their divide-and-conquer threading algorithm, which can find all suitable candidates for each core. The maximum size k = 1M(v) lover all cores v is called the map width ofM, an important parameter for the alignment algorithm. Now a sequence-structure alignment problem can be formulated as a problem offinding an isomorphism mapping f between the structure graph and a subgraph of
19
12. Protein Structure Prediction by Protein Threading
(a)
c
N
Fig. 12.4 (a) Five alignment candidates of cores found after a preprocessing of the sequence, among which overlapping candidates are , , and , respectively; (b) the corresponding sequence graph (edges are not shown, and arcs incident on N- and C-terminals are not shown).
the sequence graph G such that the following sum ofthe alignment energy functions
L
Ecore(u, f(u))
+
UE V(H)
L Epair«u, v), (f(u), f(v))) L Ezoop« u, v >, < f(u), f(v) »
(u, v)EE(V)
+
E
achieves the minimum, where Eeore is the alignment energy score (singleton type of energy) between a core u in the template and its candidate feu) in the sequence, Epair represents interaction energy (pairwise interaction energy) between residues (f(u),f(v)) assigned to cores (u, v), and El oop is the alignment score between the loop < u, v > in the template and the corresponding -.
(1)
U
=
8
10
12
~
-'-
CJl
~
~
~
= 2" a a =n· 5· =
=t
CIJ
~
;.
~
~
a
~
~ ~
~
52
Ling-Hong Hung et al. 9
8
~
lr)
c,
.....0
~
'S
5
8 ..d .§o
4
(l)
3
"'d 0
•
•
••
..
.--.4
~
••
6
0
(l)
•
•
7
•
2
0 0
2
• ••• •••, • • 4
6
••
,• 8
•• • • 10
12
14
en RMSD of best conformer in original set (A) Fig. 13.4 The enrichment of conformations after selection using a set of energy functions for CASP6 targets. Fourteen different energy functions are used to iteratively rank the minimized conformers. The number of conformers retained and the order in which the energies are applied were previously optimized for a large set of different target sequences. The enrichment index is computed as follows: Let a be the number of the selected conformations which are in the top 10% in terms oftheir en RMSD relative to the native structures. Let b be the expected number in a random distribution. The enrichment index is the ratio a/b. Although the energy functions are related to the target function minimized during generation, the use of multiple functions produces significant enrichment of the top conformers in most cases.
the correct fold and to each other. Thus, the conformers that are the most similar to the others, i.e., near the center of the conformational distribution, will tend to be the correct ones. The conformational center is found by using clustering methods. Metrics used include pairwise root-mean-square deviation (RMSD), pairwise RMSD with cutoffs, and number of neighbors (Simons et aI., 1999; Wang et aI., 2004). When there are no systematic errors, treating the entire ensemble as a single cluster gives the best signal. Even in a relatively uniform distribution, outliers can still bias and skew the determination of the conformational center. Our iterative density protocol addresses the problem by removing outliers, recalculating the conformational center of the remaining set, and repeating until the set is well-conditioned and no outliers remain. Alternatively, patchiness due to systematic errors can be addressed by dividing the conformers into different groups using K-means or hierarchical clustering. Iterative sampling schemes can also be used to allow very large sets to be clustered quickly (Zhang and Skolnick, 2004b). When multiple clusters
53
13. De Novo Protein Structure Prediction
100
•
~
IIQ
e
e .C...
=
~
90 80 70
~
IIQ
......... •....= ..... ~
~
~
C.
e ..... ~ e
......... ~
~
= • ~ ~ ~
~
•
• • • •• • • • • • •.~ • • • • ••• • • • •• • •• • • • •
• •• • • •
~
60 50 40 30 20 10
• •
•
0 0
20000
40000
60000
80000
100000 120000
140000 160000
Number of structures generated Fig.13.5 Rankingof filteredconformations by iterative density. The percentileranking(relative to this filteredset of conformers) forthe singlebest conformerchosenby iterativedensityis shown. Even though all the conformers have been minimizedfor the originaltarget function and ranked by 14 selectionfunctions at this stage,iterative densityis sufficiently orthogonalto these functions to be able to choosea top conformer. The accuracyand consistency of the selectionalso improves with the number of conformers generated.
are used, the conformers at the center of the largest cluster or the cluster with the lowest energies are then chosen. Clustering can be used in conjunction with an initial energy-based screening ofconformers. This combination is particularly effective because clustering is relatively orthogonal to energy functions. In addition, the noise reduction achieved by clustering increases with the number of conformations generated and thus automatically takes advantage of increased computational resources. This observation is illustrated in Fig. 13.5, which shows a positive correlation between the percentile ranking of conformations selected by iterative density and the number of candidate conformations generated.
13.2.8 PROTINFO, an Example de Novo Prediction Protocol When given a target sequence, PROTINFO first uses PsiPred to predict the secondary structure. The secondary structure propensities are used to derive a set of 1.25
RMSD Improvement, A
Fig. 15.5 Ligand interface Co RMSD changes for ZDOCK predictions after 6D rigid-body
minimization. Near-native predictions «6.0 A ligand interface RMSD) for 39 Benchmark 1.0 test cases were minimized . The leftmost column in each case indicates the complexes that became worse upon minimization; clearly overall the predictions were improved. Top: using the convex global underestimator (CGU) for minimization . Bottom: using the Monte Carlo minimization with simplex (MCMS)
128
Brian Pierce and Zhiping Weng
one group. In addition, the overall predictive success ofgroups has improved over the rounds, indicating progress in the algorithms and improved ability to select correct predictions from those algorithms (Janin, 2005). Several lessons can be drawn from CAPRI about what works for docking and what does not (Wodak and Mendez, 2004). The use ofknown biological information to constrain or filter a docking search clearly helps, as the predictions for targets that had such data available had a better success rate. It is also clear from performance on some targets that backbone flexibility is an important next step in protein docking developments.
15.5.2 New Developments In addition to the lessons learned, the CAPRI experiment has led to various specific improvements and developments in docking. For instance, CAPRI Target 10 required docking ofa tick-borne encephalitis viral protein to form a symmetric trimer. This has led to various algorithms for protein docking specifically geared toward predicting symmetric multimers (Pierce et aI., 2005; Schneidman-Duhovny et al., 2005). CAPRI has also been a useful testing ground for the many new protein docking servers that are available on the Web (Table 15.2). There are several more exciting directions where protein docking is heading. One is the combination of binary docking predictions to produce predictions of macromolecular complexes (Inbar et aI., 2005). In addition, homology modeling of proteins and using the models for docking is being actively explored (Tovchigrechko et al., 2002); although some models do not have atomic-level precision (which influences the success rate of most docking programs), this technique opens up many more possible complexes that can be predicted. In contrast to protein docking ofmodels, several groups have used modeling ofhomologues onto known structures of complexes to predict interactions ofproteins on a genomic scale (Aloy et aI., 2004; Aloy and Russell, 2002; Lu et aI., 2003). The field ofprotein docking is rapidly evolving, with extensions ofwell-known methods being complemented by new docking techniques imported from other fields of research. Dockers are now poised to tackle more difficult problems, such as including more backbone flexibility in the searching algorithm. This will allow for prediction of more difficult test cases, and increase the scope of protein complexes that can be successfully predicted through docking. So far, the success seen to date is encouraging; clearly there is much more to be done that will yield meaningful results for years to come.
Recommended Reading Books Protein-Protein Recognition (Kleanthous, 2000) An Introduction to Protein Architecture: The Structural Biology ofProteins (Lesk, 2001)
15. Structure Prediction of Protein Complexes
129
Review Articles "Principles of protein-protein interactions" (Jones and Thornton, 1996) "Prediction of protein-protein interactions by docking methods" (Smith and Sternberg, 2002) "Protein-protein docking: Is the glass half-full or half-empty?" (Vajda and Camacho, 2004) "Structural basis of macromolecular recognition" (Wodak and Janin, 2002) "Principles of docking: An overview of search algorithms and a guide to scoring functions" (Halperin et al., 2002) "Prediction of protein-protein interactions: The CAPRI experiment, its evaluation and implications" (Wodak and Mendez, 2004)
References Abagyan, R., Totrov, M., and Kuznetsov, D. 1994. ICM-A new method for protein modeling and design-Applications to docking and structure prediction from the distorted native conformation. J. Comput. Chem. 15:488-506. Aloy, P., Bottcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., et al. 2004. Structure-based assembly of protein complexes in yeast. Science 303:2026-2029. Aloy, P, and Russell, R.B. 2002. Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. USA 99:5896-5901. Bartel, P.L., and Fields, S. 1995. Analyzing protein-protein interactions using twohybrid system. Methods Enzymol. 254:241-263. Betts, M.l, and Sternberg, M.l 1999. An analysis of conformational changes on protein-protein association: Implications for predictive docking. Protein Eng. 12:271-283. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.I, Swaminathan, S., and Karplus, M. 1983. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4:187-217. Canutescu, A.A., Shelenkov, A.A., and Dunbrack, R.L., Jr. 2003. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 12:2001-2014. Carugo, 0., and Franzot, G. 2004. Prediction of protein-protein interactions based on surface patch comparison. Proteomics 4:1727-1736. Chen, R., Li, L., and Weng, Z. 2003a. ZDOCK: An initial-stage protein-docking algorithm. Proteins 52:80-87. Chen, R., Mintseris, 1, Janin, 1, and Weng, Z. 2003b. A protein-protein docking benchmark. Proteins 52:88-91. Choe, H., Burtnick, L.D., Mejillano, M., Yin, H. L., Robinson, R.C., and Choe, S. 2002. The calcium activation of gelsolin: Insights from the 3A structure of the G4-G6/actin complex. J. Mol. Bioi. 324:691-702. Chothia, C., and Janin, 1 1975. Principles of protein-protein recognition. Nature 256:705-708.
130
Brian Pierce and Zhiping Weng
Clackson, T., and Wells, J. A. 1995. A hot spot of binding energy in a hormonereceptor interface. Science 267:383-386. Clore, G. M. 2000. Accurate and rapid docking of protein-protein complexes on the basis of intermolecular nuclear Overhauser enhancement data and dipolar couplings by rigid body minimization. Proc. Nat. Acad. Sci. USA 97:90219025. Clore, G.M., and Schwieters, C.D. 2003. Docking of protein-protein complexes on the basis of highly ambiguous intermolecular distance restraints derived from IH/15N chemical shift mapping and backbone 15N-IH residual dipolar couplings using conjoined rigid body/torsion angle dynamics. J. Am. Chern. Soci. 125:2902-2912. Comeau, S.R., Gatchell, D.W, Vajda, S., and Camacho, C.l 2004. ClusPro: An automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20:45-50. Connolly, M. L. 1983. Solvent-accessible surfaces of proteins and nucleic acids. Science 221:709-713. Connolly, M. L. 1986. Shape complementarity at the hemoglobin alpha 1 beta 1 subunit interface. Biopolymers 25:1229-1247. Dellis, S., Strickland, K.C., McCrary, W J., Patel, A., Stocum, E., and Wright, C.F. 2004. Protein interactions among the vaccinia virus late transcription factors. Virology 329:328-336. Dill, K.A., Phillips, A.T., and Rosen, J. B. 1997. Protein structure and energy landscape dependence on sequence using a continuous energy function. J. Comput. Bioi. 4:227-239. Dominguez, C., Boelens, R., and Bonvin, A.M. 2003. HADDOCK: A proteinprotein docking approach based on biochemical or biophysical information. J. Am. Chern. Soc. 125:1731-1737. Duan, Y., Reddy, B.~, and Kaznessis, Y.N. 2005. Physicochemical and residue conservation calculations to improve the ranking of protein-protein docking solutions. Protein Sci. 14:316-328. Dunbrack, R. L., Jr. 2002. Rotamer libraries in the 21st century. Curr. Opine Struct. Bioi. 12:431-440. Fischer, D., Lin, S.L., Wolfson, H.L., and Nussinov, R. 1995. A geometry-based suite of molecular docking processes. J. Mol. Bioi. 248:459-477. Fitzjohn, ~W, and Bates, P A. 2003. Guided docking: First step to locate potential binding sites. Proteins 52:28-32. Gabb, H.A., Jackson, R.M., and Sternberg, M.l 1997. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Bioi. 272:106-120. Gardiner, E.l, Willett.P, and Artymiuk, P 1. 2001. Protein docking using a genetic algorithm. Proteins 44:44-56. Gray, IJ., Moughon, S., Wang, C., Schueler-Furman, 0., Kuhlman, B., Rohl, C.A., et al. 2003. Protein-protein docking with simultaneous optimization ofrigid-body displacement and side-chain conformations. J. Mol. Bio!. 331:281-299.
15. Structure Prediction of Protein Complexes
131
Greer, I, and Bush, B.L. 1978. Macromolecular shape and surface maps by solvent exclusion. Proc. Natl. Acad. Sci. USA 75:303-307. Halperin, I., Ma, B., Wolfson, H., and Nussinov, R. 2002. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 47:409-443. Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., et al. 2002. Systematic identification ofprotein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180-183. Honig, B., and Nicholls, A. 1995. Classical electrostatics in biology and chemistry. Science 268: 1144-1149. Inbar, Y.,Benyamini, H., Nussinov, R., and Wolfson, H.I 2005. Combinatorial docking approach for structure prediction of large proteins and multi-molecular assemblies. Phys. Bioi. 2:S156-S165. Janin, I 1997. The kinetics of protein-protein recognition. Proteins 28:153-161. Janin, I 2005. Assessing predictions of protein-protein interaction: The CAPRI experiment. Protein Sci. 14:278-283. Janin, I, Henrick, K., Moult, I, Eyck, L.T., Sternberg, M.I, Vajda, S., et al. 2003. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins 52:2-9. Jiang, F.,and Kim, S.H. 1991. "Soft docking": Matching ofmolecular surface cubes. 1 Mol. Bioi. 219:79-102. Jones, S., and Thornton, 1M. 1996. Principles ofprotein-protein interactions. Proc. Natl. Acad. Sci. USA 93:13-20. Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A.A., Aflalo, C., and Vakser, LA. 1992. Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci. USA 89:2195-2199. Kleanthous, C. 2000. Protein-Protein Recognition. London, Oxford University Press. Klupp, B.G., Bottcher, S., Granzow, H., Kopp, M., and Mettenleiter, T.C. 2005. Complex formation between the UL 16 and UL21 tegument proteins ofpseudorabies virus. J. Viro/. 79: 1510-1522. Knegtel, R.M., Kuntz, I.D., and Oshiro, C.M. 1997. Molecular docking to ensembles of protein structures. J. Mol. Bio/. 266:424-440. Kortemme, T., Morozov, A.V., and Baker, D. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. 1 Mo/. Bio/. 326: 1239-1259. Kuntz, I.D., Blaney, 1M., Oatley, s.i, Langridge, R., and Ferrin, T.E. 1982. A geometric approach to macromolecule-ligand interactions. 1 Mol. Bio/. 161:269288. Lamdan, Y., Schwartz, IT., and Wolfson, H. I 1990. Affine invariant model-based object recognition. IEEE Trans. Robotics Autom. 6:578-589. Lee, R.H., and Rose, G.D. 1985. Molecular recognition. I. Automatic identification of topographic surface features. Biopolymers 24:1613-1627.
132
Brian Pierce and Zhiping Weng
Lesk, A. M. 2001. Introduction to Protein Architecture: The Structural Biology of Proteins. London, Oxford University Press. Li, L., Chen, R., and Weng, Z. 2003. RDOCK: Refinement of rigid-body protein docking predictions. Proteins 53:693-707. Lu, L., Arakaki, A.K., Lu, H., and Skolnick, 1 2003. Multimeric threadingbased prediction of protein-protein interactions on a genomic scale: Application to the Saccharomyces cerevisiae proteome. Genome Res. 13:11461154. Mandell, IG., Roberts, Y.A., Pique, M.E., Kotlovyi, Y., Mitchell, IC., Nelson, E., et al. 2001. Protein docking using continuum electrostatics and geometric fit. Protein Eng. 14:105-113. Mendez, R., Leplae, R., De Maria, L., and Wodak, s.i 2003. Assessment of blind predictions ofprotein-protein interactions: Current status of docking methods. Proteins 52:51-67. Meng, E.C., Gschwend, D.A., Blaney, 1M., and Kuntz, J.D. 1993. Orientational sampling and rigid-body minimization in molecular docking. Proteins 17:266278. Mintseris, 1, Wiehe, K., Pierce, B., and erson, R., Chen, R., Janin, 1, et al. 2005. Protein-Protein Docking Benchmark 2.0: An update. Proteins 60:214-216. Mitchell, IC., Rosen, lB., Phillips, A.T., and Ten Eyck, L.F. 1999. Coupled optimization in protein docking. Paper presented at the RECOMB. Miyazawa, S., and Jernigan, R.L. 1996. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Bioi. 256:623-644. Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, WE., Belew, R.K., et al. 1998. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 19:1639-1662. Mullaney, B.~, and Pallavicini, M.G. 2001. Protein-protein interactions in hematology and phage display. Exp. Hem atoI. 29: 1136-1146. Norel, R., Petrey, D., Wolfson, H.I, and Nussinov, R. 1999. Examination of shape complementarity in docking of unbound proteins. Proteins 36:307-317. Palma, ~N., Krippahl, L., Wampler, IE., and Moura, 11 2000. BiGGER: A new (soft) docking algorithm for predicting protein interactions. Proteins 39:372384. Phillips, A.T., Rosen, lB., and Walke, Y.H. 1995. Molecular structure determination by convex global underestimation of local energy minima. In P M. Pardalos, D. Shalloway, and G. Xue (eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science (Vol. 23, pp. 181-198). Providence, RI, American Mathematical Society. Pierce, B., Tong, W, and Weng, Z. 2005. M-ZDOCK: A grid-based approach for Cn symmetric multimer docking. Bioinformatics 21:1472-1478. Press, W H. 2002. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. London, Cambridge University Press.
15. Structure Prediction of Protein Complexes
133
Puig, 0., Caspary, F., Rigaut, G., Rutz, B., Bouveret, E., Bragado-Nilsson, E., et al. 2001. The tandem affinity purification (TAP) method: A general procedure of protein complex purification. Methods (Duluth) 24:218-229. Ritchie, D.W, and Kemp, G.I 2000. Protein docking using spherical polar Fourier correlations. Proteins 39: 178-194. Sandak, B., Wolfson, H.I, and Nussinov, R. 1998. Flexible docking allowing induced fit in proteins: Insights from an open to closed conformational isomers. Proteins 32:159-174. Schneidman-Duhovny, D., Inbar, Y., Nussinov, R., and Wolfson, H.I 2005. PatchDock and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Res. 33(Web Server issue): W363-W367. Shoichet, B.K., and Kuntz, LD. 1991. Protein docking and complementarity. J. Mol. Bioi. 221:327-346. Smith, G.R., and Sternberg, M.I 2002. Prediction ofprotein-protein interactions by docking methods. Curro Opine Struct. Bioi. 12:28-35. Smith, G.R., Sternberg, M.I, and Bates, ~A. 2005. The relationship between the flexibility of proteins and their conformational states on forming proteinprotein complexes with an application to protein-protein docking. J. Mol. Bioi. 347:1077-1101. Tovchigrechko, A., Wells, C.A., and Vakser, LA. 2002. Docking of protein models. Protein. Sci. 11:1888-1896. Vajda, S., and Camacho, C.I 2004. Protein-protein docking: Is the glass half-full or half-empty? Trends Biotechnol. 22(3): 110-116. Vakser, L A. 1995. Protein docking for low-resolution structures. Protein Eng. 8:371377. Wodak, S.l, and Janin, I 1978. Computer analysis of protein-protein interaction. J. Mol. Bioi. 124:323-342. Wodak, S.l, and Janin, 1 2002. Structural basis ofmacromolecular recognition. Adv. Protein Chern. 61:9-73. Wodak, S.l, and Mendez, R. 2004. Prediction of protein-protein interactions: The CAPRI experiment, its evaluation and implications. Curro Opine Struct. Bioi. 14:242-249. Zacharias, M. 2003. Protein-protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci. 12:1271-1282. Zacharias, M., and Sklenar, H. 1999. Harmonic modes as variables to approximately account for receptor flexibility in ligand-receptor docking simulations: Application to DNA minor groove ligand complex. J. Comput. Chern. 20:287-300. Zhang, C., Liu, S., Zhou, H., and Zhou, Y. 2004. An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci. 13:400--411. Zhang, C., Vasmatzis, G., Cornette, IL., and DeLisi, C. 1997. Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Bioi. 267:707-726.
134
Brian Pierce and Zhiping Weng
Zhou, H., and Zhou, Y. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stabilityprediction. Protein Sci. 11 :2714-2726. Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone.P, et al. 2001. Global analysis of protein activities using proteome chips. Science 293:21012105. Zielenkiewicz, P, and Rabczenko, A. 1984. Protein-proteinrecognition: Methodfor finding complementary surfacesof interacting proteins. J. Theor. Bioi. 111: 1730.
16 Structure-Based Drug Design Kunbin Qu and NatasjaBrooijrnans
16.1 Introduction to Modern Drug Discovery To cure or contain a disease is to selectively eliminate or restrain the disorder caused by the pathogen, the human body, or both, and at the same time to minimize damage to the healthy parts. The quest of drug discovery is to identify the causes of the phenotype of the disease, and to interfere with the abnormal gene or gene product in such a manner that the disease is cured. Selectivity for just the aberrant gene products remains the critical point throughout the modem drug discovery process. Since ancient times, people have been using medicinal products, obtained from plant, animal, and mineral sources, to treat diseases. Some of these treatments were effective, but many of them had strong side effects. Starting in the late nineteenth century, modern drug discovery has become an interdisciplinary effort between chemistry, pharmacology, and biology. The phenomena ofdifferent binding affinities of dyes for biological tissues observed by Paul Ehrlich led to the hypothesis of the "chemoreceptor," which would exist distinctively in different organisms and disease tissues. Ehrlich then hypothesized that therapeutic effects can be achieved based on their differences (Drews, 2000). Shortly after Ehrlich's stipulation, the enzyme activity theory from Emil Fischer led to the "lock-and-key" concept, which provided an opening for structure-based drug design (SBDD): the key can be specifically designed when the lock, i.e., the target structure, is known (Fischer, 1894). In the 1950s this was modified by Koshland to the "induced fit" theory, where the lock and the key have to undergo conformational changes to fit optimally (Koshland, 1958). Generally the best-fitting conformers are not the lowest-energy conformers (Foote and Milstein, 1994). We can divide drugs into two major therapeutic categories: chemotherapeutics, historically the bread-and-butter of pharmaceutical companies, and protein therapeutics, an avenue originally taken by the biotechnology industry. More recently, other methods have emerged as promising avenues to disrupt a particular disease process. Gene therapy aims to correct the disease process by restoring, modifying, or enhancing cellular functions through the introduction of a functional gene into a target cell (Anderson, 1992). Small-interfering RNAs (siRNAs) are incorporated into an RNA-induced silencing complex that targets and cleaves mRNA complementary to the siRNAs, and thus leads to the knockdown of the targeted mRNA (Fire et aI., 1998). Recent rapid progress in RNA synthesis and structure determination also made RNA itself an attractive target since RNA is involved in all stages of cell life, 135
Kunbin Qu and Natasja Brooijmans
136
due to its key sequence motifs and also its intricate three-dimensional folds (Hermann and Westhof, 1998). Nevertheless, protein therapeutics and chemotherapeutics remain the dominant methods in current pharmaceutical research and development, due to their relatively high success rate, the abundant scientific knowledge, and the well-established acceptance ofproteins and small molecules as drugs. In this chapter, we will focus on how SBDD facilitates the discovery and development of protein therapeutics and small chemical therapeutics, by optimizing the key for the lock.
16.1.1
Current Drug DiscoveryProcess
The modem drug discovery process starts with target identification and validation. This operation is the first critical step in the drug discovery pipeline. It has to be shown that a molecular target, generally a protein, plays a crucial role in causing the symptoms of a human disease, and that by disrupting or enhancing its function, normally by gene knock-outs or artificially synthesized molecules, such as a dominant negative gene, siRNA, or small chemical, a positive effect on the disease process can be achieved. The complexity of the validation process has been witnessed by the slow turnout rate of novel targets, despite the recent genomic and proteomic revolution (Drews, 2003). The complexity lies in the fact that most diseases originate at the genetic level, and all the genes and gene products are interconnected in a complicated network of different pathways. Once the link between the disease and the target has been established and clearly understood, the second critical step in drug discovery is to find a way to modify that target. The development of both chemo- and protein therapeutics rely heavily on the atomic details of protein-protein and protein-small ligand interactions. In Section 16.2 we will focus on the SBDD of protein drugs, whereas in Sections 16.3 and 16.4 we will focus on the SBDD of small chemical compounds. Protein drugs can be very specific and effective, like for example monoclonal antibodies, whereas a small molecule can intervene with a single site on a multifunctional protein, leaving the other functions untouched. Beyond their initial in vitro on-target biological effects, two other factors, pharmacokinetics and pharmacodynamics, have significant impact on their endpoint efficacy in animals, and ultimately in humans. One can contrast pharmacokinetics to what the body does to the drug, and pharmacodynamics to what the drug does to the body. The processes that are studied and quantified in pharmacokinetics are often described by the mnemonic ADMET, for: absorption, distribution, metabolism, excretion, and toxicity. A study by Van de Waterbeemd in the 1990s showed that poor ADMET properties were often responsible for pharmaceutical development failures (Van de Waterbeemd and Gifford, 2003).
16.1.2 Role of Protein Structure in Modern Pharmaceutical Sciences The ability ofa protein to perform its function in the cell depends in part on its ability to assume and retain its proper functional conformation or tertiary structure. The
16. Structure-Based Drug Design
137
advances in structural biology and high-speed computing have greatly facilitated the elucidation of the three dimensional structure of medically relevant proteins on a large scale. In the middle of 2005, the number of structures in the Protein Data Bank (PDB) (Berman et al., 2000) reached 33,000 entries. All these are beneficial to SBDD, because they show the arrangements of the atoms of the protein in three dimensions, which often allows for the identification of protein and small-molecule binding sites. Knowledge about binding sites is critical in SBDD. In addition, if the structure of the protein-ligand complex is solved, these structures provide insight into how the ligand interacts with the protein and how to possibly optimize the ligand for a better fit with the protein. Despite the rapid growth in the number of solved protein structures, the conformational changes of the protein, especially in response to ligand/protein binding ("induced fit"), remain an issue. Most of the structures are solved by X-ray crystallography, which is a single snapshot of the dynamic structure in an unnatural environment. X-ray structures thus do not reveal much of the dynamic behavior of proteins, but flexibility of the protein structure plays an important role in SBDD, as will be discussed in Section 16.3.
16.1.3 Structure-Based Drug Design SBDD attempts to discover and aid in the discovery and optimization of therapeutic molecules by taking advantage of the three-dimensional structure of the protein and/or ligand (Kuntz, 1992). In most cases molecular recognition between the active site and interacting molecules is explored. In the case of protein therapeutics, the structure of the protein itself is explored to improve its properties to address pharmacokinetic and pharmacodynamic issues, such as stability and solubility. Based on the atomic details of the protein structure, SBDD can rapidly generate robust hypotheses about ligand or protein binding, which can be tested in iterative cycles with experimentally determined co-crystal structures and physiologically relevant biological assays. Here we need to emphasize the term "iterative," which means the whole process involves the integrated application of traditional biology and medicinal chemistry (Fig. 16.1), along with an array of advanced technologies, including X-ray crystallography, combinatorial chemistry, SBDD, and protein biophysical chemistry. Focus is on the three-dimensional molecular structure and the interactions of the active site of the protein target and the lead compounds. Receptor-based SBDD uses geometric criteria (steric) to match the receptor and ligand while exploring conformational space of the ligand. The location of the active site on the protein surface is usually known from X-ray crystallography or NMR, or can be predicted based on the three-dimensional structure. The receptorbased approach will be discussed in detail in Section 16.3. Algorithms that do not require knowledge about the binding site or the receptor structure, but use ligand structures instead, have also been developed. This so-called ligand-based approach will be discussed in Section 16.4.
Kunbin Qu and Natasja Brooijmans
138
Enhancement of SAR & binding modes
Experimental testing
Fig.16.1 Illustration of structure-based drug design process. Initial leads can be found through virtual screening or high-throughput screening. Once leads have been found, a cyclic process is started where analogues of the initial leads are made using insight from the binding mode. New analogues are tested, and the additional binding data can be used to enhance the SAR and models of the binding modes. Another iteration is initiated.
16.2 Protein Therapeutics Protein therapeutics can be mainly grouped into three categories based on their broad family classification: (1) cytokines, (2) antibodies, (3) engineered enzymes. The history of using proteins as therapeutic agents dates back to the 1950s when Isaacs and Lindenmann first discovered that interferon can make cells more resistant to virus infection without adverse effects on the uninfected cells (Isaacs, 1957). Since then, research and development of protein drugs has continued to rise. There are currently more than 30 protein drugs on the market for diseases as varied as cancer, anemia, rheumatoid arthritis, HI~ and severe oral mucositis. However, natural proteins have not evolved for utilization as drugs, and optimization is necessary. The optimization involves many aspects that integrate the traditional goals of protein design to improve the physicochemical properties that relate to the pharmacokinetic and pharmacodynamic features, which have a close relationship with the tertiary structure of the protein. Using the three-dimensional structure of the target, hypotheses can be generated as to how to improve those properties. In this section we will go through the three major classes of protein therapeutics and the role of SBDD in the optimization process.
16.2.1 Cytokines Cytokines are regulatory proteins secreted by white blood cells and several other cell types in the body and are involved in both short- and long-range cell-eell
16. Structure-Based Drug Design
139
communication. Their pleiotropic actions include the regulation of the innate and adaptive immune response, control of the development and proliferation of various types of blood cells, and modulation of the inflammatory response. Cytokines have been shown to be effective therapeutics for a wide variety of diseases. However, the efficacy ofcytokine therapeutics is often limited by their rapid clearance from serum, as well as by their immunogenicity, poor solubility, and low stability, something that can be addressed by SBDD. One simple stabilization strategy is to replace free surface cysteines, thereby preventing the formation of unwanted intermolecular disulfide bonds. Cysteine-toserine mutations have been introduced to granulocyte colony-stimulating factor (GCSF) and interferon (IFN) ~ 1b. More dramatic improvements in global stability can be obtained by optimizing packing interactions and hydrophobic burial in the protein core. Optimizing secondary structure propensity, hydrogen bonds, and electrostatic interactions can also improve protein stability substantially. Luo and colleagues selected the buried core to optimize G-CSF stability and improved thermal stability significantly. A 5- to 10-fold increase in shelf life was achieved, while maintaining biological activity. Choosing buried regions for modification minimizes changes to the surface, and thus maintains the active site and limits the designed protein's potential for antigenicity (Luo et al., 2002). An additional stabilization strategy is reduction of proteolytic susceptibility. If a specific site in the protein is known to be especially prone to proteolysis, it can be modified so that it no longer matches the substrate specificity of the putative protease. The protease cleavage sites are often found in the flexible loops on the surface that interconnect helices and beta-sheets, and mutations can be introduced that decrease the flexibility ofthe loop. The detailed effects of such modifications are discussed in Chapter 7 of this book. To achieve a therapeutic response, the drug has to have the appropriate concentration at the site of action. In a recent study, a soluble receptor based on vascular endothelial growth factor receptor 1 (VEGFRI) was optimized to confer improved injection-site absorption. The initial soluble receptor, constructed from the first three extracellular immunoglobulin domains ofVEGFRl and an antibody Fe region (see Fig. 16.2 for antibody structure), associated with negatively charged proteoglycans at the injection site rather than being absorbed efficiently into the bloodstream. To improve bioavailability, Holash and colleagues constructed a set of domain-deletion and domain-replacement variants with progressively lower isoelectric point (pI) values. Analysis of these variants indicated that lowering the pI decreased interactions with the extracellular matrix and improved bioavailability (Holash et al., 2002). Another important factor that can influence a drug's effectiveness is the rate at which a protein therapeutic is cleared from the body by mechanisms such as proteolysis, kidney filtration, and receptor-mediated endocytosis. Different approaches have been used to increase the pharmacokinetic properties of protein therapeutics. Attachment of polyethylene glycol groups to proteins (PEGylation) or to the Fe region of antibodies (see Fig. 16.2) has been proven to improve the serum half-life of biotherapeutics (Bailon et al., 2001; Ravetch and Bolland, 2001). Increased glycosylation can decrease kidney filtration and affect receptor-mediated processing. Elliott
140
Kunbin Qu and Natasja Brooijmans
~CDRs
Disulfide Bond
Fig. 16.2 Structure of an antibody. All the antibodies have a "Y" shape with two identical light chains (L, 23 kDa) and two identical heavy chains (H, 53-75 kDa). Immunoglobulins can be cleaved into three fragments through limited proteolysis: two identical Fab fragments and one Fe fragment. The Fab fragment contains the immunoglobulin antigen-binding sites ("ab" stands for antigen-binding), and forms the arms of the Y-shaped immunoglobulin. It consists of an entire L chain and the N-terminal half of an H chain. The Fe fragment (fragment ready for Crystallization) derives from the stem of the Y and consists of the identical C-terminal segments of two H chains. Both the heavy chain and the light chain consist of a variable region and a constant region made of domains around 110-130 residues. The variable regions for both heavy (VH) and light chains (VL) have one such domain, so does the constant region ofthe light chain (CL ) . The heavy chain constant region has three such domains, named CHI, CH2, and CH3. Between CHI and CH2 is the flexible hinge region (Hg). The variable regions constitute the antigen binding site. Within the variable region, there are three highly variable segments that really determine the specificity and contribute to the diversity of the molecule. They are called complementarity-determining regions (CDRs).
and colleagues engineered several cytokine therapeutics with novel glycosylation sites and enhanced pharmacokinetic properties by incorporating the consensus sequence for N-linked glycosylation (Elliott et aI., 2003). In another study, variants of G-CSF were designed such that after internalization by receptor-mediated endocytosis, the cytokines would preferentially go through the recycling path rather than being degraded. The histidine mutations allowed the variants to be released from the receptor at the low pH found in endocytic vesicles more efficiently than the native protein while maintaining native binding affinity at serum pH; this results in a significantly enhanced serum half-life (Sarkar et aI., 2002).
16.2.2 Antibodies Antibodies are part of a class of proteins called immunoglobulins (Ig). As one of the human body's most important defenses against disease, these naturally occurring proteins are produced by the immune system in response to antigens. All the
16. Structure-Based Drug Design
141
antibodies have a "Y"-shaped common structural scaffold with some unique substructures called "variable regions," which enable them to bind specifically to their corresponding antigens (see Fig. 16.2). The interaction between the antibody and the antigen enables phagocytic cells to ingest and destroy the invading microorganisms and activates the complement system of the immune system. About 30 years ago, the potential of antibodies as therapeutics for the treatment of many diseases was recognized. Depending on how the antibody was produced, it can be classified as a monoclonal or a polyclonal antibody. A monoclonal antibody is an antibody produced artificially from a single cell clone and therefore consists of a single type of immunoglobulin. It recognizes only one antigen. Polyclonal antibodies are produced by various different cells and are therefore not homogeneous in structure, increasing the chances of toxicity. The first antibody therapeutics was produced by extracting human or animal serum exposed to the antigen of interest, leading to nonspecific heterogeneous polyclonal antibodies. The development ofthe hybridoma technology made the production oflarge amounts ofmonoclonal antibodies possible against virtually any antigen of interest (Kohler and Milstein, 1975). The hybridoma technique fuses a mousederived antibody-producing lymphocyte with a transformed immortal myeloma cell to enable the continuous production of murine antibodies. However, patients often rejected mouse-generated monoclonal antibodies because their immune systems recognized them as foreign as they differed from human proteins. The human anti-mouse antibody response can also cause significant toxicity with subsequent administrations ofmouse antibodies. Early rational design and engineering of antibodies focused on how to effectively humanize the mouse antibodies to limit the immune response to the administered antibody. "Chimeric antibodies" were developed which contain about 33% ofmouse protein sequences with the remainder coming from human protein sequences. Subsequently, CDR-grafted or "humanized" antibodies were developed which contain approximately 5 % to 100/0 mouse protein sequences. More recently, recombinant DNA technologies, such as transgenic mice and phage display, have been used to create antibodies that are fully human (100% human protein sequences) (Kellermann and Green, 2002). Due to the unique modular structure ofantibodies and the well-defined function of each structural domain, rational design and engineering have been able to dissect the information to design a variety of architectures that are useful for therapeutic purposes, as shown in Fig. 16.3. Genetic engineering can generate a fragment that only contains the variable region from both the heavy chain and light chain with an amino acid linker. This so-called single chain variable region fragment (scfv) has become a key player in the engineering antibody field, because with a size of r-v 25 kDa these molecules are both effective targeting vehicles and versatile building blocks for larger antibody-based reagents, such as diabodies [an architecture of (scF v )2] and cytokine fused antibody [scf'v-cytokine] as shown in Fig. 16.3 (Holliger and Hudson, 2005). Fusing the cytokine with an antibody helps the cytokine reach the intended target, diminishing the systemic toxicity caused by cytokinealone treatment. In cancer treatments a similar approach is used by linking antibodies to
Kunbin Qu and Natasja Brooijmans
142
(SCFv )2 Bispecific diabody
scFv·cytokine
Fig. 16.3 Differentarchitectures of antibodies used in therapeutics. Ovalsin differentshadesand stripesare light chain and heavychain of antibody. Squareand circlesrepresentantigens. Triangle representscytokine.
toxins (immunotoxin). The immunotoxin has to be internalized into the endocytic compartment where the toxin is cleaved from the antibody, allowing it to enter the targeted cell for killing. The immunotoxin causes less severe side effects than treatment with the toxin alone (Smallshaw et aI., 2003). Another type of engineered antibody is the bispecific antibody that is capable of binding to two different epitopes, most often on two different antigens as shown in Fig. 16.3. Currently there are several bispecific antibodies in clinical trials, for example anti-CD30 x anti-Fey R antibodies that have one arm targeting Fe gamma receptors (Fcv R) and the other CD30. These bispecific antibodies were designed to recruit immune system effector cells to kill tumor cells. The current focus ofvariable region engineering is to increase affinity, stability, and solubility. Worn and Pluckthun reviewed the progress in rational engineering, and how structural information could be successfully applied to improve the stability of the antibody (Worn and Pluckthun, 2001). Mutations in the CDRs demonstrated over 100-fold better affinity for the antigen compared to the parent antibody (Chen et aI., 1999). The affinity of the antibody for the receptor can also be improved by engineering of the Fe region. In IgG, the Asn297-linked carbohydrate comprises a core oligosaccharide, including fucose that may contain various additional monosaccharides attached to the core. Shields et aI. showed that by removing the fucose on human IgG 1 N-linked oligosaccharide increased the binding to human Fcv RIll 50-fold and enhanced the antibody-dependent cellular toxicity (Shields et aI., 2002). Another approach is to modify the amino acid sequence ofFc that contributes with the interaction with the receptor. Shields et al. mapped the binding site on human IgG 1 for Fc)' RI, Fc)' RII, Fc)' RIll, and FcRn and designed IgG 1 variants with improved binding to the Fcv R and better therapeutic efficacy (Shields et aI., 2001).
16.2.3 Engineered Enzymes Besides the widely prescribed monoclonal antibodies and cytokines, a variety of enzyme drugs are also used to promote or inhibit blood clotting, replace deficient
16. Structure-Based Drug Design
143
factors, combat cancer, and treat infectious diseases. Engineering has been used to modify the enzymatic features of these protein drugs, such that catalytic efficiency and substrate specificity are increased, and improvements in solubility and serum half-life are made. One example of enzyme engineering is the design of long-acting variants of activated protein C (APC), a serine protease that inactivates clotting factors VIlla and Va. Berg and colleagues used modeling to design and produce a number of APC substrate derivatives with decreased susceptibility to proteolytic degradation. One particular variant had a single mutation at Leu-194 and exhibits a 4- to 6-fold reduction in the rate of inactivation in human plasma and a substantially improved pharmacokinetic profile compared with wild-type APC (Berg et al., 2003). Sun et al. altered the substrate specificity of butylcholinesterase, thereby generating a cocaine hydrolase that could be used to treat cocaine addiction or overdose. In order to improve degradation of the natural cocaine stereoisomer, which was poorly recognized by native butylcholinesterase, a pair of residues in the catalytic site were identified by molecular modeling to be mutated. The modified enzyme was indeed more effective than the wild-type at eliminating cocaine from mice in vivo (Sun et al., 2001). Besides modification ofexisting enzymes, de novo protein design has generated promising results. Dahiyat and Mayo used an algorithm based on physical chemical potential functions and stereochemical constraints to screen a combinatorial library of 1.9 x 1027 possible amino acid sequences for novel zinc finger sequences with higher stability. They found computationally, and verified by NMR, compact structures that were more stable than the native zinc finger. These novel zinc fingers showed no appreciable sequence similarity to any ofthe known zinc fingers (Dahiyat and Mayo, 1997).
16.2.4
Summary of Protein Therapeutics
The use of protein therapeutics is much younger than small-molecule therapeutics. It has started to rise in the past few decades due to breakthroughs in cell biology, molecular biology, structural biology, and protein engineering. Protein products as therapeutic agents are not only diverse but also very complex. They include, but are not limited to, cytokines, antibodies, growth factors, and various enzymes. Due to the complicated intrinsic biology, structure-based design of protein therapeutics is far less well defined as in small-molecule drugs, as will be discussed in Section 16.3. The challenges faced in assessing the safety and efficacy of protein drugs have included problems associated with pharmacokinetic and pharmacodynamic parameters such as stability, solubility, host contaminants, cross-activity, rapid plasma clearance, and these products' immunoregulatory nature, which has often resulted in a complicated array of biological effects. Many of those problems are related to the structure of the protein, and SBDD can be applied to provide concrete and testable hypotheses rather than relying on random sequence perturbations. SBDD has proven to be an indispensable tool in the protein therapeutic industry and makes drug discovery more cost effective.
144
Kunbin Qu and Natasja Brooijmans
16.3 Receptor-Based Drug Design The availability of three-dimensional structures of receptors of pharmacological interest, in combination with the lock-and-key theories of Fischer, led to the field of receptor-based drug design. Several approaches can be used, namely, (molecular) docking, interactive modeling, and de novo design. The drug discovery process consists of several stages. In the first stage of drug discovery the goal is to find one or several lead molecules that interact with the target ofinterest. In the next stage, the leads are optimized to increase their affinity for the receptor and to optimize other properties critical for drugs, namely, specificity and ADMET properties (Kerns and Di,2003). The goal of docking is to predict how two structures interact with each other. In drug discovery the interest is usually in protein-ligand complex prediction, where the ligand is a small molecule (molecular weight < 600 Da). Protein-protein docking is discussed in more detail in Chapter 15. Small-molecule docking can be used both in lead discovery and in lead optimization by for example testing how well real or "virtual" (not-yet-synthesized) compounds fit into the active site ofthe protein target. Interactive modeling can also be used to explore how changes to the lead influence the affinity and!or specificity of the ligand for the target, which is important in lead optimization. In de novo design (Honma, 2003), atoms or fragments are placed in the active site and grown by adding atoms or fragments to create a molecule that is complementary to the receptor. As illustrated in Fig. 16.1, lead molecules are identified through virtual screening (VS) or high-throughput screening (HTS). Once the binding mode of each lead has been established, modifications can be proposed and explored computationally and experimentally. Using the new binding data, the binding hypothesis can be further enhanced, and more modifications can be proposed. In this section we will focus on the use of docking in SBDD. First, a background on docking algorithms will be given, followed by a description on the use of docking as a virtual screening (VS) tool in lead discovery. Lead optimization will also be discussed, and some of the comparison studies of docking algorithms will be summarized.
16.3.1 Docking There are two main components to docking algorithms. The first part is the search algorithm that will generate different binding modes of the ligand in the active site. The second part is the scoring function, which will rank the different binding modes based on their complementarity to the active site. The pose corresponding to the mode in the crystal structure should be ranked the lowest. A successful binding mode generation is defined by the root-mean-square deviation (RMSD) ofthe docked pose compared to the crystal pose at a convention 2-A cutoff. The quantity of interest for the scoring function is the binding free energy ~Gbind, because it is directly related
16. Structure-Based Drug Design
145
to the experimental-observable equilibrium constant K eq as follows:
kon !1Gbind = -RTlnKeq = -RT InkOff
(16.1)
The docking problem is a very complex problem due to the many degrees of freedom , especially as the number of rotatable bonds of the ligand increases. The original docking programs only explored the three translational and three rotational degrees of freedom of the ligand ("rigid docking"). To further limit the number of degrees offreedom, only the receptor active site is explored rather than the full protein surface, which is still the standard procedure (Fig. 16.4). The conformation of the ligand(s) was often obtained from the Cambridge Structural Database (Allen, 2002) , which contains crystal structures of small molecules. Despite this simplification, early results were encouraging (Desjarlais et al., 1990), and with the increase in computer power, ligand flexibility was introduced and became the standard. In recent
Fig. 16.4 Complexstructure of c-Abl kinase domain with Gleevec (I IEP.pdb). Secondarystructurerepresentation of kinasedomainof c-Ablwithactivesiteresidues is shownexplicitly. Hydrogen bonds between Gleevec, protein atoms, and water molecules are represented by dotted lines. The box marks the receptor active site, which will be used by many docking algorithms to mark the boundariesfor the translationaland rotational search.
146
Kunbin Qu and Natasja Brooijmans
years, different treatments of receptor flexibility have been explored, although this is not yet common practice. For a more detailed review we refer to Brooijmans and Kuntz (2003). Different docking programs also use different approaches to rank poses of the ligand in the active site. Calculating binding free energies is a complex issue due to the many factors playing a role in protein-ligand interactions, as will be presented in Section 16.3.1.2.
16.3.1.1 Search Algorithms Search algorithms can be divided into three approaches. The first approach is a systematic search, where each degree of freedom is explored based on a grid of values. The grid values for each of the degrees of freedom are searched combinatorially. Grid values for torsional space are based on observed minima for different types of torsions. Grid values for translational and rotational degrees of freedom are established by the size of the binding site. To prevent combinatorial explosions, termination criteria are implemented to avoid searching along a path that will lead to the wrong solution. DOCK (Ewing et aI., 2001), FlexX (Rarey et aI., 1996), Hammerhead (Welch et aI., 1996), and Surflex (Jain, 2003) are docking programs that use systematic searches. The second type of search is a stochastic or random search. In stochastic searches, random changes are made to (generally) one degree of freedom at a time. Examples of stochastic searches are Monte Carlo (Me) algorithms and genetic algorithms (GAs). The biggest downside of stochastic searches is that convergence is not guaranteed. To increase the chances of convergence, several independent searches can be performed in the case ofGAs. In the case ofMC searches, different cycles can be run at different temperatures (from high to low), or by preventing new structures to be accepted when they are too similar to previously generated solutions, which are kept in a list ("Tabu search") (Baxter et aI., 1998). AutoDock (Goodsell and Olson, 1990), ICM (biased MC) (Totrov and Abagyan, 1997), MCDOCK (Liu and Wang, 1999), Prodock (biased MC) (Trosset and Scheraga, 1998), and PRO_LEADS (Tabu search) (Baxter et aI., 1998) use MC searches. The latest version ofAutoDock (Morris et aI., 1998) uses a GA, as does GOLD (Jones et aI., 1997). Deterministic searches are the last type of search algorithms. Molecular dynamics (MD) and energy minimization methods are deterministic, meaning that the initial state determines where to go next, generally to a lower or equal energy state. Because deterministic searches cannot cross energy barriers larger than 1-2 kT, these searches are local searches and often get trapped in local minima. For this reason, deterministic searches, mainly energy minimizations, are generally used in combination with systematic or stochastic search methods. This way one combines a global search with a local search. Rather than searching ligand conformational flexibility during docking, one can also generate a conformationally expanded ligand database. This database has to be generated only once, and can then be used on all targets of interest. During the dock
16. Structure-Based Drug Design
147
run, only translational and rotational degrees of freedom have to be explored ("rigid docking"). In order for this approach to be successful, the bioactive conformation of the ligand has to be part ofthe conformational ensemble (Bostroem, 2001; Good and Cheney, 2003). The bioactive conformation is rarely the lowest-energy conformation as found in solution or the gas phase due to induced fit (Perola and Charifson, 2004), which is one of the reasons why the bioactive conformation is not guaranteed to be part of the generated ensemble. Both systematic and stochastic search methods are implemented in conformation generators. Some of the most widely used programs are Catalyst, Confort, MacroModel, MOE, and Omega. Recently, several algorithms have been presented that use conformationally expanded databases to dock ensembles of ligands into the active site, rather than a single ligand/conformer at a time, which leads to significant time savings. Initially, the Shoichet group presented a flexible ligand docking algorithm in which the different conformers of each ligand are superimposed on each other based on the largest rigid fragment (Lorber and Shoichet, 1998). Joseph-McCarthy et ai. presented a method that overlays different ligands and conformers based on pharmacophore points (see Section 16.4.1). Again, all the ligands and conformers that have the same 3D pharmacophore, get docked into the receptor site at the same time, and are then evaluated separately (Joseph-McCarthy et al., 2003). The pharmacophores are used to place the ligand into the active site, and are not an interpretation of what features lead to the activity of the molecule in this setting.
16.3.1.2 Scoring Functions Protein-ligand complex formation is a complicated process, which involves van der Waals interactions, Coulombic interactions, hydrogen bonding, polar and nonpolar desolvation effects, changes in translational, rotational, and vibrational entropy ofthe complex and its components, and changes in the internal energy and conformational entropy ofthe receptor and the ligand. Accurate calculation ofthe binding free energy directly through statistical mechanics is possible, but computationally too expensive for use in drug discovery (Head et al., 1997; Luo and Sharp, 2002). For a review on binding free energies see Kollman (1993, 1994), and for more background on scoring functions see Brooijmans and Kuntz (2003). Even calculating all the forces described above separately is too demanding computationally. As a result, scoring functions used in docking programs are crude estimates of the binding free energy. Depending on the approach used, some of the forces are calculated explicitly while others are completely ignored, or their effects are taken into account implicitly. There are three types of scoring functions that are used in docking, namely, force-field-based functions (see also Chapter 2), empirical scoring functions (see also Chapter 3), and knowledge-based scoring functions. Force field scoring functions can further be divided into functions that are purely force field based, and functions that have additional terms, making these functions more semiempirical, DOCK (Meng et al., 1992) and the original AutoDock
148
Kunbin Qu and Natasja Brooijmans
(Goodsell and Olson, 1990) use force field terms only, namely, the intermolecular terms of a force field to estimate the complementarity of each ligand pose to the receptor. Van der Waals interactions are estimated by the 6-12 or Lennard-Jones potential, Coulombic interactions are taken into account by the Coulomb equation. Solvent effects can be crudely taken into account by using a distance-dependent dielectric constant in the Coulombic part ofthe scoring function. Entropic terms and intramolecular terms are completely ignored by these functions. The GOLDScore (Jones et al., 1995, 1997) and the scoring function in ICM (Totrov and Abagyan, 1997) combine force field terms with empirical terms to account for some of the deficiencies in pure force-field-based scoring functions. Empirical scoring functions are derived by collecting terms that play a role in protein-ligand interactions and modeling the equation to reproduce experimental binding free energies. In order to do this, a set ofprotein-ligand complexes is needed for which both the experimental binding free energy (Kd) and the crystal structure are known. Many of the newer functions are developed based on the pioneering work by Bohm, who developed LUDI (Bohm, 1994). FlexX (Rarey et al., 1996), the latest version of AutoDock (Morris et aI., 1998), and Glide (Friesner et aI., 2004) use empirical scoring functions. Gold (Verdonk et al., 2003) recently implemented a second scoring function, namely, an empirical scoring function based on ChemScore (Eldridge et al., 1997). Due to the modeling ofthe functional form to known proteinligand complexes, only interactions that are part of the training set can be modeled by the scoring function. In addition, while the terms in empirical scoring functions make intuitive sense, many ofthe effects that playa role in protein-ligand interactions are not modeled explicitly, but are taken into account by the other terms implicitly. This makes it difficult to analyze why these functions sometimes fail. The terms in empirical scoring functions are often easier to calculate than force field terms, giving them an important advantage. Knowledge-based scoring functions are derived based on observed frequencies of interactions seen in protein-ligand complexes. These frequencies are translated into energies using Boltzmann distributions (the more frequently observed, the more important the interaction and thus the lower the energy assigned to it). Because there is much more protein-ligand crystallographic data than combined complex structural and experimental binding data, knowledge-based functions are probably more general than empirical scoring functions. Again, only interactions part of the training set can be properly accounted for. Several functions have been developed for protein-ligand docking, namely, BLEEP (Mitchell et al., 1999), PLP (Verkhivker et al., 1995), SmoG (DeWitte and Shakhnovich, 1996; Ishchenko and Shakhnovich, 2002), PMFScore (Muegge and Martin, 1999), and DrugScore (Gohlke et al., 2000). In recent years there have been two interesting developments regarding scoring. The first is "consensus scoring," proposed by the Vertex group (Charifson et al., 1999). In consensus scoring poses are scored and ranked using two or more different scoring functions, and those poses that get ranked well by multiple functions
16. Structure-Based Drug Design
149
are kept. It has been shown that this kind of scheme reduces the number of false positives (Charifson et al., 1999; Bissantz et al., 2000; Stahl and Rarey, 2001; Clark et al., 2002). For this scheme to work well all the scoring functions used should be complementary to each other and the overall success rates for each of the functions should not differ too much. If one function performs significantly worse than the others, the results will deteriorate. The second development is that of "tiered scoring," where in the stage of pose generation, a simple function is used to quickly get rid of obviously bad poses. To rank the poses that are left, a more rigorous (generally force field based) function that includes solvent effects can be used. ICM (Schapira et al., 1999), Glide (Friesner et al., 2004), and Fred (McGann et al., 2003) are examples of docking algorithms that have implemented such schemes.
16.3.1.3 Input Receptor Structures Ideally, the input receptor structure used was solved by X-ray crystallography at high resolution, although NMR structures, or even homology models (see Chapter 11) can be used. McGovern and Shoichet showed, by comparing 10 different enzymes that have holo (ligand-bound) and apo (free) X-ray structures available, as well as homology models from a public database, that the holo structures proved to be most effective for virtual screening, followed by the apo structures, and the homology models (McGovern and Shoichet, 2003). Even when a crystal structure of the target of interest is available, care should be taken when using it, even at high resolution. The main thought to keep in mind is that the crystal structure is only a single model or interpretation of the electron density. Work done by the Blundell group illustrates it is possible to arrive at many, significantly different, solutions for a crystal structure, all of which fit the electron density data equally well (DePristo et al., 2004). When an experimental structure is not available, a homology model can be built. Despite the inherently lower resolution of a homology model compared to the template crystal structure (Marti-Renom et al., 2000), several successful virtual screening examples against homology models have been published (see Table 16.1). It has been shown that a sequence identity of 50% or more was required to get consistently better enrichment than random (Oshiro et al., 2004). In cross-docking experiments one tries to predict a protein-ligand complex using a protein structure, which was crystallized in the presence of another ligand. Thus, the protein is not in exactly the correct configuration for the ligand that is being docked. The success rates in cross-docking experiments are significantly lower compared to docking experiments with the correct configuration of the receptor for the ligand (Knegtel et al., 1997; Murray et al., 1999). Further evidence that taking into account receptor flexibility is important was illustrated by a study in which VS hit rates were compared for docking against an X-ray structure, an NMR(apo) ensemble, and an NMR(holo) ensemble. The virtual hit rates were highest for both
DOCK
FlexX
Homology ICM
X-ray
Nuclear receptor
Retinoic acid receptor (Schapira et aI., 2001) Carbonic anhydrase (Grueneberg et al., 2002) Aldose reductase (Rastelli et al., 2002)
DOCK
Lideaus
Homology ICM
Aldo- keto reductase
Nuclear receptor
Thyroid hormone receptor (Schapira et aI., 2003)
X-ray
X-ray
Kinase
BCR-ABL(Peng et al., 2003)
X-ray
Docking program Score
Selection criteria for experimental screening
Score; clustering; Rule of5 '"'V150K Score; distance restraint; visual inspection '"'VI50K Score; visual inspection '"'V 100 (3 Score; visual inspection prefilters ) '"'V127K Score; clustering; visual inspection
'"'V200K
'"'V50K
Library size
Examples of successful virtual screens
Structure
Carbonic anhydrase
Kinase
Protein family
CDK2 (Wu et al., 2003)
Target
Table 16.1
Enzyme assay & X-ray Enzyme assay & structure-based lead optimization
2 cell assays & structure-based lead optimization Cell assay
Enzyme assay & X-ray Cell assay
Experimental confirmation
42f..LM
0, from a starting distribution po(e). (2)Fort=1,2, ...
(a) Sample a candidate point e* from a jumping distribution at time t, ~(e* I et-I). The jumping distribution must be symmetric; that is, ~(ea I eb) = ~(eb I ea) for all ea and eb, and t. (b) Calculate the ratio of the densities
= p(e*ly)/p(et-Ily).
(A4.44)
with probability min{r, I} otherwise.
(A4.45)
r
(c) Set t
{
e*
e = et - I
A4. Mathematics and Statistics for Studying Protein Structures
313
In the Metropolis algorithm, a parameter e* with larger posterior probability than the current t - 1 is always accepted, and the one with lower probability is accepted randomly with probability r . The Metropolis-Hastings algorithm generalizes the basic Metropolis algorithm in two ways. First, J, is not required to be symmetric. Second, to correct for the asymmetry in the jumping rule, the ratio r in (A4.44) is replaced by a ratio of importance ratios:
e
p(e* Iy)j Jt(e* I et - 1) r - -t------- p(e 1 Iy)j ~(et-l I e*)·
(A4.46)
Allowing asymmetric jumping rules can be useful in increasing the speed of the convergence of the Markov chain. In fact, the Metropolis-Hastings algorithm can be further generalized, if the ratio r is taken as
8(et- 1 , e*)j Jt(e* I et -
1 )
r=--------
p(e t -
1
Iy)
(A4.47)
where 8(et- 1 I e*) can be any function as long as it is a symmetric function in et- 1 and e*, and makes r :::: 1 for all t - 1 and e*.
e
Gibbs sampling The Gibbs sampling is used to approximate a probability model with many variables. It samples from distribution obtained by keeping all variables fixed except one. Given the starting point (eiO), eiO), ... , 6JO)), the algorithm iterates the following steps:
(i+l)
(d) Sample ed
(i+l)
from p(6 d
(i) - 1 ,y). leI' ... ,e d(i+l)
The vector e(O), e(2), ... , e(t), ... represents realization of a Markov chain, with transition probability from e' to e,
K(e', e) = p(el I e~,
p(e3 1 e~, e~, e~,
, e~, y)p(e21 e~, ... , e~, y) , e~, y) x p(ed I e~, ... , e~-I' y).
(A4.48)
Under rather general conditions, the Markov chain generated by the Gibbs sampling algorithm converges to the target density as the number of iterations become large.
314
Yang Dai and Jie Liang
Further Reading Casella and Berger's (2002) book provides a thorough introduction on probability distributions and general approaches of statistical inference. Basic concepts of entropy, relative entropy, mutual information, and their applications are presented in Cover and Thomas (1991). The book by Gelman et al. (1995) gives a comprehensive description of various aspects of the Bayesian Paradigm. The computational algorithms in Bayesian analysis are introduced in a unified manner by Tanner (1996). The importance and rejection samplings were originated in von Neumann (1951) and Marshall (1956). The original ideas on Metropolis and Metropolis-Hastings algorithms can be found in Metropolis et al. (1953) and Hastings (1970) and overall sampling methods are described in detail in Tanner and Wang (1987). Liu (2001) provides a comprehensive treatment of the Monte Carlo method and its application in scientific computing. The original Baum-Welch method is presented in Baum et al. (1970). A unified treatment of the three algorithms associated with HMMs can be found in Rabiner (1989).
References Baum, L., Petrie, T., Soules, G., and Weiss, N. 1970. A maximization technique occurring in the statistical analysis ofprobabilistic functions ofMarkov chains. Ann. Math. Stat. 41:164-171. Casella, G., and Berger, R.L. 2002. Statistical Inference. Belmont, CA, Duxbury Press. Cover, T., and Thomas, J. 1991. Elements ofInformation Theory. New York, John Wiley & Sons. Ewens, W J., and Grant, G. R. 2001. Statistical Methods in Bioinformatics: An Introduction. Berlin, Springer. Gelman, A., Carlin, lB., Stem, H.S., and Rubin, B.D. 1995. Bayesian Data Analysis. London, Chapman & Hall/CRC. Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 12:609-628. Hastings, WK. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109. Liu, J.S. 2001. Monte Carlo Strategies in Scientific Computing. Berlin, SpringerVerlag. Marshall, A. 1956. The use ofmulti-stage sampling schemes in Monte Carlo computations. In Meyer, M. (Ed.), Symposium on Monte Carlo Methods, pp. 123-140. Metropolis, N., Rosenbluth, A.W, Rosenbluth, M.N., Teller, A.H., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys.21:1087-1091. Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77:257-286.
A4. Mathematics and Statistics for Studying Protein Structures
315
Tanner, M.A. 1996. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd edition. Berlin, Springer. Tanner, M.A., and Wong, WHo 1987. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82:528-549. von Neumann, 1. 1951. Various techniques used in connection with random digits. NBS Appl. Math. Sere 12:36-38.
Index
7, 66-67, 70, 85, 97, 235, 237 66,68,70,85,97,235,237-238 ~-mrn 77-78,235-236 310 Helix 237
a-helix
~-sheet
255-257 ab initio 43-44,160,177-178,181,193,200, 218,222,241,267 adenine 231 Ad~ET 136, 144, 155, 159 alignment accuracy 29, 32, 193-194, 199 amphipathicity 85-87
A * search
antibody complementarity-determining regions (CDRs)
122, 141-142 F ab fragments 140 Fe fragment 140
Monoclonal
136, 141-142
polyclonal 141 anti-parallel J3-sheet 237 Atomic Contact Energy (ACE)
118-119,
122-124 autocorrelation function 292 autodock 116,146-148 backbone structure
2,6-7,23,30,32,183,
189 back-propagation algorithm 82,87 bacteriorhodopsin 65-67, 72, 99 base pair 232-233, 276 Bayesian 45,299,310,314 binding free energy 144, 147-148,
151 binomial distribution 301-302 bioperl 219, 262-263 biosynthesis 233-234 bipartite matching 248-249,259 BLOCKS 191,216 Boltzmann constant 9,278,293 Boltzmann factor 293 branch-and-bound 16, 28, 251, 255-257 CAFASP 194 CAPRJ 121-122, 126, 128
Cartesian coordinate 270,272 CASP 12, 14, 29, 33, 44, 122,
180-181 catalytic site 143, 191, 211 CATH 6,210 CHA~~
45,119,123-124,179
chemotherapeutics 135-136 CHIME 209,212 Clusters ofOrthologous Group (COG) 216 coarse-grained model 268 comparative modeling 177, 199, 210 confidence factors 187 conformation deterministic conformational search 146-147 bioactive 147, 163-164 ensemble 52, 147 stochastic conformational search 146-147 systematic conformational search 146-147 conformational sampling 169 confusion matrix 91-92 consensus prediction 94 consensus-based approach 180-182 constrained bipartite matching 248-249 Convex global underestimator (CGU) 126 Coulomb's Law 118 covalent bond 273-276 Critical Assessment of Structure Prediction (CASP) cross-correlation function 292 C-termini 68, 83, 237 cytokine 138-142 cytosine 231 DALI 3 dead-end-elimination algorithm 257 decoys 24,99,211 degrees of freedom 46,114,120,124, 145-147,
151,270,278,281 deoxyribonucleic acid 231 desolvation 110, 116, 116, 123-124, 147 DFIRE 11-12,118-119 Dirichlet distribution 303 disulfide bond 139, 192,209,231,276 divide-and-conquer algorithm 14-16, 26-27,
260
317
Index
318 DOCK 112,147 docking flexible ligand 218 MM/PBSA 151 receptor flexibility 146, 149, 151 rigid ligand 114 validation 152-153 domain 2-3,29,56,81,100,139,141,179,
181-182,184,210-212,216-217,238, 263 DOT 76,118 downhill simplex 124, 126 dynamic programming 13, 17, 26, 28, 78, 82-83,87-88,89,251,253,254 electrostatic interaction 98, 139, 164, 275-276 electrostatics 110, 116, 118, 123-124 enthalpy 284, 288 entropy 82,147,277,280,282 e~e
67,110,135,138,142-143,149,155
enzyme structures 211 Euler angles 122, 293-294 EVA 194 expectation maximization 78,299,311 expert system 180, 186-189, 192 fast Fourier Transform (FFT) 112, 115 finite difference method 290 fold recognition 1-2, 13, 26, 30, 32-33, 47,
55-56, 180, 183-184, 187, 190-194 folding landscape 189 freeenergy 9,69-70,74,97,144,147-148,
151,269,277,285-289 free energy of transfer 69-70,74 frozen approximation 13-14 FSSP (Fold classification based on Structure-Structure alignment of Proteins)
3,6,9,210 FTDock 115 functional motifs
190
Gamma distribution 302 Gaussian distribution 24, 301 GenBank 216 Gene Ontology (GO) 183, 216 genetic code 232 GenTHREADER 28 geometric distribution 302 Gibbs free energy 9, 287-288 Gibbs sampling 299,313 globular protein 68, 77, 83, 182, 229, 235 glycophorin A 100
GRAMM 115 GRAMM-X 113 greedy algorithm 257-258 guanine 231-232 HADDOCK 119 hash table 116, 243-244, 256 heap 243, 246-247 helix-helix interaction 97-98 Helmholtz free energy 285, 288 hidden Markov model 68,81-82,88-89,94,
193,216,299,305-306 32, 97,109,128,177,218,222,257 hydrogen bonding 28, 46, 98, 110, 119, 147, 229,231,235,237,276 hydrogen-bond 156, 231, 237 hydropathy analysis 68, 72, 74-75, 77, 79-80, 87,94-95 hydrophilic 32, 74, 77, 85-86, 237 hydrophilicity 74 hydrophobic 32, 44, 65-68, 70, 72, 75-77, 82-83,85 hydrophobic moment 45, 50, 54, 70, 72, 86 hydrophobicity 45, 68-70, 72-78, 85-86, 94 hydrophobicity analysis 76-77 homology modeling iterative density function
ICM 146,148-149,151 importance sampling 311-312 independent reference state 9 induced fit 135,137,147 information entropy 280, 283-284 integer programming 14, 16, 22, 26, 28, 251,
254-255,259 Jess expert system shell
187
kinetic barrier 288-289,296 knowledge-based 9, 45-47, 50, 55-57,
147-148 lattice representation 46 lead identification enrichment factor 151, 153 virtual screening 144, 149, 152, 154 lead optimization structure activity relationship 155-156, 158,
164 leap-frog algorithm 291 Lennard-Jones potential 44, 148, 275 linear integer programming 16, 28 lipid bilayer 65, 68, 74-75, 89
Index LiveBench 194 local ratio technique 258 longest common subsequence 253 machine learning algorithm 80-81, 87 macromolecule 67, 112, 118, 210-211, 233 Markovianprocess 299, 305 maximum likelihood 88,299,310 membraneprotein 6,29-30,34,65-100 membranetopology 76, 78 mesofold 5 meta-server 181-182 Metropolismethod 293 Metropolis-Hastings algorithm 299, 312-314 minimum spanningtree 246, 248, 257 mmCIF format 208 moleculardynamics 30, 32, 45, 57, 112, 123-124,146,151,218,261,267,276, 278,289-292 moleculardynamicssimulation 32,45,210,218, 261,267,278,289-290,292 molecular structure 137, 198 MolFit 113 Molscript 212 Monte Carlo sampling 116, 126 Monte Carlo simulatedannealing 47, 49 multinomialdistribution 301-303 multiple sequencealignment 10, 78-80, 82, 95, 216 mutation 139-140, 142-143, 178, 233, 260 mutual information 299,304-305 neural network (NN) 26,68,76,80,82,87-89, 99 Newtonianlaws 289 non-redundant sequencerepresentatives 9 NP-complete 157, 250 NP-hard 13, 241, 251, 253, 255, 259-261 N-termini 237 nucleic acid 208, 210, 215, 222, 231 obligateprotein complexes 116 optimizationproblem 17, 57, 161, 250-251 pairwise interaction 9, 11, 13-14, 17, 19, 27-28,98,241,260-261 parallel ~-Sheet 237 partition function 9,280,286-287 PatchDock 113 PDB 3,9,25,56,119,137,165,177-178, 181,190-191,194,208 PDB-select 9, 25
319 peptide 29, 77, 80, 83, 179, 182, 229, 234 peptide bond 46,229,231 per-residueaccuracy 91-94 per-segmentaccuracy 92-93, 95 Pfam 216 pharmacodynamics 136 pharmacokinetics 136 pharmacophore clique detection 157-158 feature 156, 158 maximumcommonsubstructure 158 model 155, 157-158 physicochemical 58, 72, 85-87, 89 physics-based 12, 26, 45-46, 55-56 pKa 229,231 Poisson-Boltzmann Equation 118 polar clamp 97-98 polar coordinate 270 polynomialtime reduction 250 polypeptide 74, 76, 88 porin 67,88 positive-inside rule 71-72, 76-77, 82, 85 potential energy surface 269-270, 274 potential of mean-force 267, 270, 296 primary structure 235, 237 principal of optimality 13-14, 21 PRINTS 191,216 probabilitydensity function 299 probabilitydistributions 299, 304 ProDom 216 propensity 46,68,76-80 PROSITE 191-192,216 PROSPECT 13-16,27-28 PROSPECT-PSPP pipeline 195 protein backbone 66, 70, 120 protein binding 118, 137, 178 protein classification 216 protein complexstructure 34, 109, 218 Protein Data Bank (PDB) 65,99,137,177 protein docking backbone searching 120 benchmark 121 bound 121 clustering 52-53, 55, 119-120 predictionevaluation 120 refinement 32, 34, 56, 113 side chain searching 112, 114 symmetricmultimers 128 unbound 112,117 protein folding 44,47,67,97,119,122,126 protein geometryanalysis 217 protein interaction 44
320
Index
protein modeling 119, 212 protein sequence databases 224 protein structure determination 252 protein structure prediction benchmark 199 evaluation 77, 93, 120 protein therapeutics 135-139, 143 protein threading 1-34 protein-protein docking 109, 112, 119, 124 Quantitative Structure Activity Relationship (QSAR) 3D-QSAR (CoMFA) 159,163-164 descriptor selection 159 validation 136, 152-154, 159, 162-164 quantum mechanics 97,160,267-268 quaternary structure 210, 238 Ramachandran plot 235, 274 RAPTOR 14,16,28 Rasmol 207,209,212 RDOCK 122-124 rejection sampling 299,311-312, 314 relative entropy 299, 304 Research Collaboratory for Structural Bioinformatics (RCSB) 208 residue-specific all-atom probability discriminatory function (RAPDF) 45, 50, 54, 56 Rete algorithm 187 ribonucleic acid 231-232 rigid-body docking 116, 121, 218 Robetta 49,56,181 Root Mean Square Deviation (RMSD) 52,
120-121,144,212 RosettaDock 117 rotamers 46, 57, 120, 151, 257 rules based system 86, 164 SCOP 189-190,193,210 scoring function 147-149 SCWRL 120 secondary structure 7, 11, 47, 50 segment overlap (SOV) 93-94 sequence fingerprints 259 sequence similarity 9,143,177,182,184 sequence-structure alignment 2, 12-14, 17 shape complementarity 110, 112, 116-117,
simulated annealing
47, 49, 99, 116, 161, 181,
268,293 singleton energy 8, 27 solvation 46, 110, 116, 118, 123-124 statistical significance 23-25,28-29, 33 statistics based energy 7, 9-10, 26 structural domain 2-3, 141, 182 structural fold space 191 structural motifs 191 structural refinement 181, 267 structure based drug design (SBDD) ligand-based 137 receptor-based 137 structure comparison 43, 218, 238, 255 structure families 21 0 structure quality assessment 184, 210, 218 structure-based function prediction 180 subgraph isomorphism 17 suffix tree 243-245 superfold 5-6 SwissProt 82,190,215-217
2,7,11-14,17,21,23-25,
template structure
27 tertiary structure
2, 17-18, 65, 67-68, 93, 97,
136 threading algorithm
14, 16-18, 22-23, 25-26,
28-29,33 65,67,78, 95,
three-dimensional structure
137-138,144 thymine 231-232 torsional space 46, 146 transient protein complexes 124 tree decomposition 19-22, 26 T-test 309 unifold 5-6 uniform distribution uracil 231-232
10, 284, 294, 300
van der Waals interactions
98, 100, 147-148,
151,275 van der Waals potential 1 00 Verlet algorithm 291-292 VMD 212 worst-case performance ratio
252
123 26, 32-33, 46-47, 57, 69-70, 83, 86,98,110,113,120-121,123 signal peptide 83, 179, 182, 184 side chain
ZDOCK 115,122-124 Z-score 26,29,188,190,196 Z-Test 309
(continued from page ii) Physics ofthe Human Body: A Physical View ofPhysiology Herman, J.P.,
2006
Intermediate Physics for Medicine and Biology Hobbie, R.K.,Roth, B., 2006 Computational Methods for Protein Structure Prediction and Modeling (2 volume set) Xu,Y., Xu, D., Liang, J. (Eds.) 2006 Artificial Sight: Basic Research, Biomedical Engineering, and Clinical Advances, Humayun, Weiland, Greenbaum., 2006 Physics and Energy Landscapes ofProteins Fraunfelder, H., Austin, R., Chan. S., 2006 Biological Membrane Ion Channels Chung, S.H., Anderson, O.S., Krishnamurthy, V.V.,
2006
Cell Motility, Lenz,P., 2007 Applications ofPhysics in Radiation Oncology, Goitein, M., 2007 Statistical Physics ofMacromolecules, Khokhlov,A., Grosberg, A.Y., Pande, V.S. Biological Physics, Benedek, G., Villars, F.,
2007
2007
Protein Structure Protein Modeling, Kurochikina, N., 2007 Three-Dimensional Shape Perception, Zaidi, Q., 2007 Structural Approaches to Sequence Evolution, Bastolla,U. Radiobiologically Optimized Radiation Therapy, Brahme, A. Biological Optical Microscopy, Cheng, P. Microscopic Imaging, Gu, M. Deciphering Complex Signatures: Applications in Life Sciences, Morfill, G. Biomedical Opto-Acoustics, Oraevsky, A.A. Mathematical Methods in Biology: Mathematics for Ecology and Environmental Sciences, Takeuchi,Y. Mathematical Methods in Biology: Mathematics for Life Science and Medicine, Takeuchi,Y. In Vivo Optical Biopsy ofthe Human Retina, Drexler, W.) Fujimoto, J. Tissue Engineering: Scaffold Material, Design and Fabrication Principles, Hutmacher, D.W. Ion Beam Therapy, Kraft, G.H. Biomaterials Engineering: Implants, Materials and Tissues, Helsen, J.A.