USING COMPUTATIONAL PROTEIN DOCKING TO MODEL THE STRUCTURE AND SPECIFICITY OF PROTEIN INTERACTIONS

by SIDHARTHA CHAUDHURY

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of DOCTOR OF PHILOSOPHY

Baltimore, Maryland February 2010

© Copyright by Sidhartha Chaudhury 2010 All Rights Reserved


To my wife, Monisha Cherayil, my sisters, Rupsa and Aurelia Chaudhury and my parents, Abhijit and Banani Chaudhury


ACKNOWLEDGMENTS

Jeff Gray has been an excellent mentor to me in so many ways. When I joined the Gray lab in the spring of 2006, I came in with an open mind but without a clear idea of my thesis research. Jeff gave me a concrete project to get started with while also giving me the creative freedom to look into interesting side-projects, all of which allowed me to get a good grasp of molecular modeling and computational biology. Throughout my time in the Gray lab, Jeff has always been there encouraging me to explore new ideas and pushing me to run new simulations, test new hypotheses, and publish. It is a rare thing to find a supervisor, a mentor, and a friend in the same person.

I’d like to thank Tom Woolf for serving as my official second reader, and Phil Cole and Bob Schlief for serving as additional members of my thesis committee. Phil, Tom, and Bob have provided valuable insight and advice regarding my research for the past few years through annual thesis reviews.

The research presented in this dissertation was the product of collaborations with many scientists. Aroop Sircar and Arvind Sivasubramanian, two members of the Gray lab, were heavily involved in the research that became Chapters 2 and 3. In Chapter 2, they were both integral parts of the Gray lab’s team for CAPRI, and Chapter 2 is derived from an article to which Aroop and I contributed equally. In Chapter 3, Aroop and Arvind developed the Rosetta Antibody homology modeling algorithm that generated the ensemble of homology models that were used for docking. In Chapter 4, the early phase of the HIV-1 protease specificity research was carried out by an undergraduate, Tiara Byrd. In Chapter 5, Sergey Lyskov, a software developer in the Gray lab, was instrumental in developing and maintaining PyRosetta, and Paul Nunley, an undergraduate, ran many of the simulations that made up the performance benchmark for Rosetta v3. Finally, my research would not have been possible without the involvement of the entire Rosetta community in developing and maintaining the Rosetta molecular modeling package; my contributions to Rosetta were built upon a solid foundation of previous work and my hope is that my work will, in turn, serve as a solid base for future research.

I am grateful for the financial support from the National Institutes of Health (NIH) training grant for the Program in Molecular Biophysics and the NIH R01 research grant awarded to my advisor. The training grant enabled me to participate in laboratory rotations in my first year, which I feel was invaluable to my scientific education. The R01 research grant funded my graduate research beyond the second year and paid for the ample computational resources that are available to the Gray lab.

I have thoroughly enjoyed my time here at Johns Hopkins. I came to Hopkins as an undergraduate in fall of 2001 and will be finishing graduate school here in spring of 2010, concluding almost a decade of being a Hopkins student. Throughout this time I have enjoyed the vibrant research community and academic atmosphere that ultimately propelled me towards a career in science.

My experience as a biophysics graduate student at Hopkins has been excellent. The biophysics staff, Gerald Levin, Ranice Crosby, and Kathie Kolish, were so helpful in the administrative tasks of being a student, allowing us to concentrate on our research. Our coursework educated us in a wide range of topics in biophysics, and I found the physical chemistry classes, taught by Bertrand Garcia-Moreno and Dilip Asthigiri, as well as the organic chemistry classes, taught by Phil Cole and Jim Stivers, to be particularly useful. My first two rotations in Dan Leahy’s and Phil Cole’s labs were invaluable experiences.

The Gray lab has been my home for the past five years. Current and former members Aroop Sircar, Dave Masica, Mike Daily, and Arvind Sivasubramanian have frequently contributed valuable insight and experience. Monica Berrondo has been both a close friend and co-worker for more than four years through all the ups and downs of graduate school. It has been a valuable experience for me both to learn from and to teach the newer members of the lab, Krishna Kilambi and Brian Weitzner. The work in the Gray lab would not be possible without a dedicated staff of system administrators including Alainna Wonders, Jan vandenBerg, and Josh Greenberg.

I have truly enjoyed my time in Baltimore, having lived here for longer than any other city in my life, which is a tribute to the friends and family here. In particular, I’d like to thank fellow PMB graduate students Josh Friedman, Bill Bocik, and Jeff Werbin for stimulating discussions of science and life during our semi-weekly beer tasting events. Josh Friedman, in particular, has been one of my best friends throughout grad school and I will always cherish our year living together and all our shared grad school experiences. My friends from my undergraduate days at Hopkins, Steve and Jennifer Matsumoto, Nate and Bronwyn Bates, and Mark Zabawa, have all grown up together with me, gone to each other’s weddings, and shared in each other’s successes and failures over the past decade here in Baltimore. Debjoy Mallick, my lifelong best friend and de facto little brother, who joined Hopkins as an undergraduate in the fall of 2005, and my sisters, Rupsa Chaudhury, who joined Hopkins as a medical student in fall of 2007, and Aurelia Chaudhury, who visits frequently, have shown me love and support in the way only family can and have truly made Baltimore my home.

Monisha Cherayil, now my wife, and forever the love of my life, has long been the force behind all of my personal and professional accomplishments. When I was an undergraduate, it was Monisha who convinced me to follow my dreams and apply to graduate school. The many landmarks of our lives together are scattered throughout my tenure in graduate school: the first time we lived in the same metropolitan area, our first apartment, our engagement, and our marriage. I will always look back on this period of our lives fondly.

Finally, I would like to thank my parents, Abhijit and Banani Chaudhury. It is hard to conceive of a couple more devoted to their children’s education, future, and happiness. Thanks to you all.


ABSTRACT

USING COMPUTATIONAL PROTEIN DOCKING TO MODEL THE STRUCTURE AND SPECIFICITY OF PROTEIN INTERACTIONS

FEBRUARY 2010

SIDHARTHA CHAUDHURY, B.A., JOHNS HOPKINS UNIVERSITY
Ph.D., JOHNS HOPKINS UNIVERSITY

Directed by: Professor Jeffrey J. Gray

Computational methods that predict the structure and specificity of protein-protein interactions can yield deep insight into the structural biology of many biochemical pathways. Through high-resolution structures of protein interactions we can identify the structural mechanisms of diseases, engineer proteins towards specific functions, and design drugs that disrupt pathogenesis. Challenges in accurately modeling protein interactions include efficiently sampling the conformational space available for two proteins to interact, and adequately approximating the free energy of the conformational landscape to correctly predict the structure and specificity of the protein interaction. In this thesis I detail my work on developing new methods in flexible protein and peptide docking and applying them in a number of areas, both to predict the atomic-scale structure of protein interactions and to model their specificity.

First, I introduce our early efforts in blind predictive protein docking and demonstrate the limitations of the prevailing rigid-body approach to docking while outlining the challenges in modeling binding-induced conformational changes in docking. Second, I introduce a new approach to modeling flexibility in protein docking called ensemble docking, inspired by the conformer selection model for protein binding. I initially apply this method to address relatively modest binding-induced conformational changes from crystal structures of unbound proteins, but then expand it towards addressing structural inaccuracies from NMR structures and homology models and finally towards blind predictive docking. Third, I extend the use of flexible docking towards modeling of interaction specificity in the case of HIV-1 protease. HIV-1 protease is a well-studied enzyme-substrate system that is critical to HIV-1 pathogenesis and thus a major antiretroviral drug target. I developed a flexible peptide docking algorithm that predicts the structure of the enzyme-substrate complex and calculates the energetics of the enzyme-substrate interaction. Then, using statistical analysis and decomposition of the interaction energies, I identify enzyme and substrate residues that are most important for determining substrate specificity. We use the results to expand our understanding of substrate specificity and drug resistance in HIV-1 protease and extend the use of the algorithm towards other enzyme-substrate systems. Fourth, I look at our efforts towards the future of modeling protein-protein interactions, first through the newly updated RosettaDock v3, which incorporates additional features, such as small molecules and co-factors, that were unavailable in the earlier version, and second through PyRosetta, a Python script-based implementation of Rosetta that is intended to make molecular modeling more accessible to the structural biology community.

In summary, I have developed a number of flexible protein docking methods that have had success in modeling the structure and specificity of protein interactions in a wide range of applications, from blind predictive protein docking, to antibody-antigen homology modeling and docking, to the identification of specificity determinants in enzyme-substrate interactions.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
   1.1 Introduction to proteins
   1.2 Protein-protein complexes
   1.3 Molecular modeling using Rosetta
   1.4 A primer on protein docking

2. EARLY EFFORTS IN INCORPORATING SPECIFICITY AND FLEXIBILITY IN PROTEIN DOCKING
   2.1 Summary
   2.2 Introduction
   2.3 Results
       2.3.1 Computational mutagenesis
       2.3.2 Flexible docking
       2.3.3 Standard docking
   2.4 Discussion
   2.5 Conclusion

3. STRUCTURE PREDICTION OF PROTEIN INTERFACES USING FLEXIBLE DOCKING
   3.1 Summary
   3.2 Introduction
   3.3 Methods
   3.4 Results
       3.4.1 Docking algorithms
       3.4.2 Energy function and discrimination
       3.4.3 Case Study: Acetylcholinesterase-Fasciculin II
       3.4.4 Flexible docking outperforms rigid-body docking
       3.4.5 Ensemble docking with NMR targets
       3.4.6 Ensemble docking with Antibody homology models
       3.4.7 Ensemble docking in CAPRI
   3.5 Discussion
   3.6 Conclusion

4. MODELING ENZYME-SUBSTRATE SPECIFICITY USING FLEXIBLE DOCKING
   4.1 Summary
   4.2 Introduction
   4.3 Methods
   4.4 Results
       4.4.1 Structure prediction of protease-substrate complexes
       4.4.2 Energetic discrimination of cleavable substrates
       4.4.3 Identification of specificity-determining residues
       4.4.4 Evaluating the substrate envelope hypothesis
       4.4.5 Other determinants of substrate specificity
       4.4.6 Extending the method to other enzyme-substrate interactions
   4.5 Discussion
   4.6 Conclusions

5. FUTURE DIRECTIONS IN MODELING PROTEIN INTERACTIONS
   5.1 Summary
   5.2 Introduction
   5.3 Rosetta v3
       5.3.1 Introduction
       5.3.2 Methods
       5.3.3 Results
   5.4 PyRosetta - a script-based interface to molecular modeling using Rosetta
       5.4.1 Introduction
       5.4.2 PyRosetta Features
   5.5 Conclusions

SUPPLEMENTARY INFORMATION
REFERENCES
CURRICULUM VITAE

LIST OF TABLES

Table 2.1 Summary of Targets
Table 2.2 Computational prediction of hotspot mutations for T14
Table 3.1 Crystal structure docking targets
Table 3.2 NMR docking targets
Table 3.3 Docking results for crystal structure targets
Table 3.4 Docking Results for NMR targets
Table 4.1 Accuracy using side-chain packing and peptide docking
Table 4.2 Residue type selectivity at each subsite
Table 4.3 Specificity-determining residues
Table 5.1 Protein docking incorporating small molecules in Rosetta v3


LIST OF FIGURES

Figure 1.1 Crystal structure of α-chymotrypsin in complex with inhibitor eglin-C
Figure 1.2 Spatial and internal coordinate representation in Rosetta
Figure 1.3 The conformational search space in protein docking
Figure 2.1 Successful RosettaDock predictions
Figure 2.2 Conformations of T24 generated by the flexible docking
Figure 3.1 The flexible docking algorithm
Figure 3.2 Ligand ensemble generation and conformer selection
Figure 3.3 Different energy models for docking
Figure 3.4 Backbone variability of FAS2
Figure 3.5 Binding energy vs. L_rmsd for the docking methods
Figure 3.6 Lowest energy structure for docking 1FSS with the CS/IF method
Figure 3.7 Backbone sampling and discrimination for 1FSS
Figure 3.8 Histogram of hit quality for each docking method
Figure 3.9 NMR docking results for 1ACB using the CS method
Figure 3.10 Ensemble docking with antibody homology models
Figure 3.11 Ensemble generation recovers loop motion in Target 29
Figure 3.12 Ensemble docking of NMR structures in Target 41
Figure 4.1 Peptide Docking Algorithm
Figure 4.2 Protease peptide structure prediction
Figure 4.3 Energy distribution of cleavable and Non-cleavable Peptides
Figure 4.4 Specificity-determining residues in HIV-1 protease
Figure 4.5 Structural and energetic mechanisms for specificity
Figure 4.6 Evaluation of the substrate envelope hypothesis
Figure 4.7 Substrate and binding envelope in the protease active-site
Figure 4.8 Distribution of active-site geometry substrate peptides
Figure 4.8 Modeling substrate specificity for hSET8
Figure 5.1 Basic organization of Rosetta v3.x
Figure 5.2 Local docking of the FNR-Fn ternary complex
Figure 5.3 Simple Monte Carlo peptide folding algorithm in PyRosetta


CHAPTER 1 INTRODUCTION

The formation of highly specific protein complexes is a fundamental process in biology and an integral component of all major biochemical pathways. The structures of these protein complexes can provide detailed insight into the mechanisms of function, from disease pathogenesis, to drug action, to innate biochemical processes. Using the computational tools of molecular modeling, we are beginning to develop methods to predict high-resolution structures for protein complexes and postulate how their functions arise from their structures. In this introduction, I will outline the biophysics that underlies protein interactions and the formation of protein complexes, the fundamentals behind molecular modeling methods and the Rosetta software package, and finally, their combined application towards the specific field of protein docking.

1.1 Introduction to proteins

Proteins are one of the major classes of biological macromolecules, and are responsible for a variety of biochemical functions, from structure, to signaling, to catalyzing essential biochemical reactions. Proteins are polymers of amino acids that are encoded by genes. In the process of translation, an mRNA transcript from the nucleus is used to create the protein chain at the ribosome. After translation, this amino acid polymer (also known as a polypeptide chain) adopts the lowest free-energy conformation in solution through a process called protein folding. Most naturally occurring proteins fold into a very specific shape or structure, and their function is a direct result of this structure. A typical folding energy for a protein is -10 kcal/mol, which means that well over 99% of the protein molecules in solution adopt this lowest free-energy conformation. Like protein folding, almost all protein functions occur as a result of this basic thermodynamic principle: the protein will adopt the lowest energy conformation in a given environment. In almost all cases, this lowest energy conformation is a biologically evolved, highly specific structure. For example, in protein-protein binding, in which two proteins are free in solution, the lowest energy conformation will be a highly specific complex between the two partners. The energy of a given conformation is a function of the various molecular forces that act on that conformation, including electrostatics, van der Waals interactions, hydrogen bonding, and solvation energies.
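The link between a folding free energy of -10 kcal/mol and a folded population of well over 99% can be checked with a short calculation. The sketch below assumes a simple two-state (folded/unfolded) equilibrium at room temperature; the function name `folded_fraction` and the default temperature are illustrative choices, not part of the original text.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def folded_fraction(delta_g_fold, temperature=298.15):
    """Folded population for an assumed two-state system.

    delta_g_fold is G(folded) - G(unfolded) in kcal/mol;
    negative values favor the folded state.
    """
    # Boltzmann weight of the unfolded state relative to the folded state
    unfolded_weight = math.exp(delta_g_fold / (R_KCAL * temperature))
    return 1.0 / (1.0 + unfolded_weight)

if __name__ == "__main__":
    # Folding energy quoted in the text: -10 kcal/mol
    print(f"fraction folded: {folded_fraction(-10.0):.8f}")
```

Under these assumptions the unfolded state is populated at well below one part in a million, consistent with the statement above.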

Protein Structure

The monomers of a polypeptide chain, called residues, form a polymer through peptide bonds. The polypeptide chain itself (hereafter referred to as the backbone) is made up entirely of an N-Cα bond, a Cα-C bond, and a C-N bond, repeated for each amino acid. The specificity of protein structure and function arises from its amino acid sequence. Although the backbone atoms of all residues are identical, each residue in the polypeptide chain is one of 20 amino acids. Amino acids differ from one another through their side-chains, defined as those non-backbone atoms bonded to the Cα atom. The energy of a given protein conformation is a result of molecular forces that act on and between all non-bonded atoms in that conformation. In protein biophysics, the most significant energy contributions that favor one conformation over another are van der Waals energies, steric repulsions, electrostatic energies, solvation energies, and hydrogen bonding energies. These component energies are themselves functions of the atomic positions and atom types of the atoms in the protein structure.

1.2 Protein-protein complexes

Just as proteins generally fold into highly specific structures, proteins generally interact with each other by forming highly specific complexed structures. The advances in genetics and molecular biology in the past twenty years have led to a deluge of data identifying potential biochemical pathways, important for basic biological functions as well as disease pathogenesis, that involve the interactions of hundreds of proteins. Inhibiting specific protein-protein interactions, and thus disrupting these pathways, forms the basis for today’s target-based drug development methods, where small-molecule compounds or biologics are designed to interact with a key target protein in the pathogenic pathway. A structural understanding of the protein complexes involved in these pathways can yield insight into both their function and potential mechanisms for disruption.

Experimental characterization of protein complexes

There are a number of experimental methods that are used to characterize protein-protein interactions. At the very earliest stage, affinity purification assays or yeast two-hybrid assays can be used to identify potential binding partners. Both these methods typically use the target protein as a ‘bait’ and expose it to a solution containing a mix of cellular proteins. Those proteins that bind to the bait are purified from the rest of the mix and identified, either genetically or through analytical chemistry methods. Once binding partners have been identified, binding titration assays, where the concentration of one protein is varied relative to the other while the total complexed concentration is measured, are used to determine the energetics of binding, or binding affinity (ΔGbinding), for that particular complex. More rigorous thermodynamic methods, such as isothermal titration calorimetry, which measures the heat transfer associated with complex formation, may be used to break down the binding affinity into its enthalpic and entropic components (ΔH, TΔS). Finally, techniques such as stopped-flow fluorescence can be used to determine the kinetics of the binding reaction.

There are two predominant methods of structure determination of protein-protein interactions: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. In X-ray crystallography, a solution of complexed proteins is allowed to approach extremely high concentrations at which, under certain very specific conditions, the protein in solution forms crystals. Then, through the process of X-ray diffraction, X-rays are passed through this crystal, and the resulting diffraction pattern, which is a function of the atomic structure of the crystallized protein, is deconvoluted to yield the protein complex structure. In NMR spectroscopy, various NMR measurements yield pair-wise distance information between specific residues in the complex. These distance measurements are then used in conjunction with other molecular modeling methods to generate a number of physically realistic structural models that best fit the experimental data.
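The thermodynamic quantities obtained from binding titration and calorimetry are related by two standard expressions, ΔG = RT ln(Kd) (with Kd relative to a 1 M standard state) and ΔG = ΔH - TΔS. The following is a minimal sketch of that arithmetic; the function names and the numerical inputs (a 1 nM dissociation constant and a binding enthalpy of -15 kcal/mol) are hypothetical values chosen for illustration and are not measurements from this work.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses dG = RT * ln(Kd / 1 M); tighter binding gives a more negative dG.
    """
    return R_KCAL * temperature * math.log(kd_molar)

def entropic_term(delta_g, delta_h):
    """Return T*dS in kcal/mol, rearranged from dG = dH - T*dS."""
    return delta_h - delta_g

if __name__ == "__main__":
    dg = binding_free_energy(1e-9)   # hypothetical 1 nM complex
    tds = entropic_term(dg, -15.0)   # hypothetical calorimetric dH
    print(f"dG = {dg:.2f} kcal/mol, T*dS = {tds:.2f} kcal/mol")
```

In this hypothetical case the binding is enthalpically driven and partially offset by an unfavorable entropic term.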

Structure determination through NMR spectroscopy is typically limited to complexes of fewer than 200 residues, making it appropriate for only relatively small complexes. X-ray crystallography does not have a strict size limitation; however, the process of crystallizing a protein complex is extremely unpredictable and time-consuming, and most protein complexes simply won’t crystallize in solution. Determined structures are stored in a central online repository called the Protein Data Bank (PDB),1 in a PDB file format, which contains the atomic coordinates of every atom in the structure.

Quantitative thermodynamic and kinetic measurements give us an overall macroscopic description of how two proteins behave in solution (i.e. how quickly they associate, how strongly they bind, etc.). Structure determination and structure prediction give us a microscopic, or atomic-scale, description of how two proteins interact. Molecular modeling can be used to bridge the gap between the microscopic and macroscopic descriptions of an interacting pair of proteins. It allows us to identify the structural mechanisms that guide the protein interaction, that is, the microscopic properties of the protein interaction that give rise to its macroscopic behavior in solution.
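As a concrete illustration of how atomic coordinates are laid out in a PDB file, the sketch below reads the fixed-width fields of a single ATOM record using the column positions defined by the PDB format. The sample record is assembled field by field so that the column alignment is explicit; its coordinate values are made up for illustration and do not come from a real entry, and `parse_atom_line` is a hypothetical helper name.

```python
# Sample ATOM record built from its fixed-width fields (values are illustrative).
SAMPLE_LINE = (
    "ATOM  "     # columns 1-6: record name
    "    1"      # columns 7-11: atom serial number
    " "          # column 12: blank
    " N  "       # columns 13-16: atom name
    " "          # column 17: alternate location indicator
    "MET"        # columns 18-20: residue name
    " "          # column 21: blank
    "A"          # column 22: chain identifier
    "   1"       # columns 23-26: residue sequence number
    "    "       # columns 27-30: insertion code and blanks
    "  27.340"   # columns 31-38: x coordinate (Angstroms)
    "  24.430"   # columns 39-46: y coordinate
    "   2.614"   # columns 47-54: z coordinate
)

def parse_atom_line(line):
    """Extract identity and Cartesian coordinates from one ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Real PDB files contain many other record types (for example HEADER, SEQRES, and TER), which this sketch simply skips by returning None.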

Common structural features of protein-protein interfaces

Protein complexes are characterized by a number of structural features. First, the protein partners typically display a very high level of geometric complementarity across the interacting surfaces. Second, the overall conformations of the individual proteins in the complex do not vary dramatically from the unbound conformations of those proteins. Third, the surface area buried by a protein-protein interface is usually substantial, typically ranging from 1500 to 3000 Å². Figure 1.1 shows an example of a protein-protein complex, that of the enzyme α-chymotrypsin bound to an inhibitor, eglin-C.2 A deeper structural analysis of the protein interface reveals that complexes are often stabilized by hydrogen bonds and electrostatic interactions between specific residues across the interface. Subsequent mutation of such residues often disrupts or eliminates the protein interaction.2

Figure 1.1 Crystal structure of α-chymotrypsin in complex with inhibitor eglin-C. The crystal structure of the enzyme α-chymotrypsin (green) in complex with the inhibitor eglin-C (orange) demonstrates the high degree of surface complementarity of the interface between the two proteins, as well as the general complexity of the interface.

The high surface complementarity of interacting proteins and the general lack of significant conformational change in proteins upon binding have historically supported the lock-and-key model of protein binding (see Chapter 3). These complexes were typically pairwise protein interactions with very high binding affinity. Recently, many of these conventions have been challenged by more contemporary structural biology research. Protein complex structures are increasingly being determined that show significant conformational changes upon binding, especially in the case of immune system antibody proteins and signal transduction proteins. More recently, proteins that undergo dramatic disorder-to-order transitions upon binding have been identified.3 Finally, many proteins are now thought to interact through a large number of very weak pair-wise interactions in large multi-body complexes consisting of tens of proteins. These emerging trends underscore the need to develop novel molecular modeling methods that can faithfully reproduce these behaviors.

1.3 Molecular modeling and the Rosetta software package

Sampling protein conformations

Protein structure can be represented in primarily two ways: Cartesian (spatial) coordinates and internal coordinates. In spatial coordinates, each atom in the protein is given a Cartesian coordinate (x, y, z); this is how protein structure is stored in a PDB file. In internal coordinates, the position of each atom is a function of the bond lengths, bond angles, and torsion angles of the polypeptide chain that makes up that protein. In Rosetta, the bond lengths and angles are held fixed to ideal values.4 This means that the internal coordinates describing the system consist of the backbone torsion angles of the polypeptide chain (φ, ψ, ω) and the torsion angles of the side-chains for each residue (χ1, χ2, χ3…). Finally, many protein systems in biology include more than a single protein chain. A protein complex, for example, by definition involves the interaction of two proteins, or two protein chains. The spatial position and orientation of one polypeptide chain relative to another can be described exactly using three dimensions of translation (T) and three dimensions of rotation (R), based on a given fixed spatial reference frame.5
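The relationship between the two representations can be illustrated with a short Python sketch. It is not Rosetta code: the function names, the four atom positions, and the rotation and translation are all made up for illustration. It computes one internal coordinate (a torsion angle) from Cartesian coordinates and shows that a rigid-body move (R, T) changes every Cartesian coordinate while leaving the torsion angle unchanged.

```python
import numpy as np

def torsion_angle(p0, p1, p2, p3):
    """Torsion (dihedral) angle in degrees defined by four consecutive atoms."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def rigid_body_move(coords, rotation, translation):
    """Apply a 3x3 rotation matrix and a translation vector to an (N, 3) array."""
    return coords @ rotation.T + translation

# Four hypothetical atom positions (in Å), chosen only for illustration.
atoms = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.5],
                  [0.0, 1.0, 1.5]])

# A 90-degree rotation about the z axis followed by an arbitrary translation.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
moved = rigid_body_move(atoms, rot_z_90, np.array([5.0, -2.0, 3.0]))

torsion_before = torsion_angle(*atoms)
torsion_after = torsion_angle(*moved)  # same value: internal coordinates are unchanged
```

The design point is that sampling in internal coordinates separates the degrees of freedom: torsion moves change the shape of a chain, while (R, T) moves change only where one chain sits relative to another.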


Calculating Protein Energies

The approximate free energy of a protein conformation is typically calculated using the spatial coordinates that define that conformation. The main energy terms used in docking (Eq 1) are the Van der Waals, solvation, electrostatic, hydrogen bonding, and side-chain and backbone conformational energies. In some cases, the atomic coordinates are converted into higher-order geometries, such as the bond angle, length, and torsion in the case of hydrogen bonding, and torsion angles in the case of the side-chain and backbone energies; other times the energies are calculated from the pairwise inter-atomic distances that are directly generated from the spatial coordinates, as in the case of Van der Waals, solvation, or electrostatics. Some terms are calculated using mathematical models describing the underlying physical forces, such as the Van der Waals term, which uses the Lennard-Jones potential, or the electrostatic term, which uses the Coulomb potential. Other terms are calculated from potentials derived from a statistical analysis of observed structures in the PDB, including the hydrogen bonding term, which is based on various hydrogen bonding geometry statistics, the backbone conformational term, which is based on the observed distribution of phi/psi angles, and the side-chain conformational term, which is based on the observed distribution of side-chain torsion angles in the PDB for each amino acid. Again, in all cases, every term is a function of the structure of the protein.
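As a rough illustration of how the physics-based terms depend only on inter-atomic distances, the following Python sketch sums a 12-6 Lennard-Jones term and a Coulomb term over all atom pairs and combines them with weights. The well depth, atomic radius, dielectric constant, and weights are placeholder assumptions, and the functional forms are the textbook ones; the actual Rosetta score function uses its own parameterization and additional terms.

```python
import numpy as np

COULOMB_CONSTANT = 332.06  # kcal·Å/(mol·e^2), standard conversion factor

def lennard_jones(r, epsilon=0.2, sigma=3.5):
    """12-6 Lennard-Jones energy; epsilon and sigma are placeholder values."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy (kcal/mol) for charges in units of e and distance in Å."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

def score(coords, charges, w_vdw=1.0, w_elec=1.0):
    """Weighted sum of pairwise energy terms computed from Cartesian coordinates."""
    coords = np.asarray(coords, dtype=float)
    charges = np.asarray(charges, dtype=float)
    # All pairwise distances, then keep each pair (i < j) once.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(coords), k=1)
    r = dist[i, j]
    e_vdw = lennard_jones(r).sum()
    e_elec = coulomb(r, charges[i], charges[j]).sum()
    return w_vdw * e_vdw + w_elec * e_elec

# Two hypothetical atoms with opposite unit charges, separated by sigma.
example_score = score([[0.0, 0.0, 0.0], [3.5, 0.0, 0.0]], [1.0, -1.0])
```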


Molecular modeling in Rosetta

Figure 1.2 Spatial coordinate and internal coordinate representation in Rosetta. [Diagram: Cartesian coordinates, (Natoms) x (x, y, z), are generated by refolding from internal coordinates, (Nres) x (φ, ψ, ω, χi) and (Nchains - 1) x (R, T).] The pose object is responsible for synchronizing the internal coordinates, which are typically used for structural perturbations, with the spatial coordinates, which are used in energy calculations.

In Rosetta, a protein system is represented by an object called a pose.5 The pose contains all the structural information necessary to completely define the system in both spatial coordinates and internal coordinates (Figure 1.2). Within the pose object, the Cartesian coordinates and internal coordinates are synchronized, and both are used in Rosetta. Sampling is done primarily through manipulation of the internal coordinates (for example, perturbing a φ or ψ torsion angle), while the energy is calculated primarily from the Cartesian coordinates (for example, calculating the electrostatic interaction between two atoms, using their respective spatial coordinates). In a process called “refolding”, new Cartesian coordinates are generated by ‘rebuilding’ the polypeptide chain, side-chain conformation, and rigid-body orientations from the internal coordinates. Likewise, when a pose is first instantiated, it is typically created from a PDB file, which contains only spatial coordinates, and the internal coordinates are generated from the spatial coordinates automatically. At all times, the spatial coordinates and internal coordinates are synchronized within the pose.

Molecular modeling in Rosetta for structure prediction and design relies on the thermodynamic principle that the configuration of a biomolecular system at equilibrium tends towards that which is the lowest in free energy. The free energy of a given configuration (structure and sequence) is approximated using a score function that uses mathematical models of the major biophysical forces (Van der Waals energies, hydrogen bonding, electrostatics, solvation energies, etc.) as a function of the configuration. Since it is impossible to exhaustively sample the entire configurational space accessible to the system because of its size and complexity, the starting structures and sampling strategies vary across different modeling applications. Rosetta protocols generally sample the relevant configurational space for a given modeling application by running a large number of relatively short Monte Carlo trajectories starting from random or semi-random starting configurations, storing the lowest-energy structures from each trajectory (called decoys), and then selecting the lowest-energy decoys as predictions. To tackle a wide range of biomolecular modeling problems, it is necessary to precisely define the relevant configurational space for sampling, the search strategy, and the score function used for both sampling and identifying the lowest-energy structures.
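The general strategy described above (many short Monte Carlo trajectories, one stored decoy per trajectory, lowest-energy decoys chosen as predictions) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a Rosetta protocol: the Metropolis acceptance rule is the standard one, and the one-dimensional "conformation", its quadratic energy, the step size, the temperature, and the trajectory counts are all hypothetical stand-ins.

```python
import math
import random

def metropolis_trajectory(energy, perturb, start, n_steps, kT, rng):
    """Run one Monte Carlo trajectory and return its lowest-energy state (a decoy)."""
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(n_steps):
        trial = perturb(current, rng)
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Metropolis criterion: always accept downhill moves, sometimes uphill ones.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best

def generate_decoys(energy, perturb, random_start, n_trajectories,
                    n_steps=500, kT=1.0, seed=0):
    """Run independent trajectories and return (decoy, energy) pairs, best first."""
    rng = random.Random(seed)
    decoys = [metropolis_trajectory(energy, perturb, random_start(rng),
                                    n_steps, kT, rng)
              for _ in range(n_trajectories)]
    return sorted(decoys, key=lambda decoy: decoy[1])

# Toy stand-ins for a score function, a perturbation, and a random start.
def toy_energy(x):
    return (x - 2.0) ** 2

def toy_perturb(x, rng):
    return x + rng.uniform(-0.5, 0.5)

def toy_start(rng):
    return rng.uniform(-10.0, 10.0)

decoys = generate_decoys(toy_energy, toy_perturb, toy_start, n_trajectories=20)
prediction, predicted_energy = decoys[0]  # lowest-energy decoy
```

In a real application the three callables are where the modeling decisions live: `random_start` and `perturb` define the configurational space and the search strategy, and `energy` is the score function.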

For each Rosetta algorithm it is essential to understand three things:
1. What conformational space is being searched? What is held fixed? What is allowed to vary?
2. What is the conformational search strategy within this conformational space?
3. What is the energy function/score function that is being used to identify the lowest-energy conformation?

1.4 A primer on protein docking

The introduction of Chapter 3 presents a much more detailed overview of protein docking methods. Here I outline the overall approach and philosophy of protein docking. The goal of protein docking is to predict the structure of the bound complex of two proteins from the unbound component structures. A cursory glance through complexed structures in the Protein Data Bank shows that, generally, proteins maintain their conformation upon binding, and that the binding surface is characterized by a high degree of surface complementarity with the partner, even in the unbound conformation. Early protein docking approaches relied on the surface complementarity in the unbound structures to identify the bound conformation by searching for the rigid-body position and orientation of the two partners that had the greatest degree of complementarity between them.

Examples of these geometric-based methods include Fourier Transform-based docking,6 which calculates the Fourier correlation of two grids representing each protein surface for all rigid-body orientations, searching for the orientation that displays the greatest surface complementarity.

Docking methods that exclusively rely on surface complementarity are moderately successful at docking targets with extremely high complementarity, such as high-affinity enzyme-inhibitor complexes where very prominent, rigid protrusions on the inhibitors bind to deep, pronounced clefts in the enzymes. Most protein-protein interactions do not display this degree of surface complementarity, and other structural features at the interface, such as salt bridges and hydrogen bonds, are critical for recognition.

The limitations of relying on surface complementarity, as well as the coarse-graining necessary for geometric-based docking, have given rise to a range of molecular mechanics-based methods, which use a physically realistic energy function with explicit terms for Van der Waals interactions, hydrogen bonding, electrostatics, and solvation. Molecular mechanics-based (MM) docking methods generally search the conformational space of a system consisting of the two protein partners, while identifying low-energy docked orientations using an energy function that is intended to reflect the free energy of the system. These methods are primarily ‘rigid-body’ docking methods, where the conformation of the partners is held fixed and only different orientations are sampled. Unlike FFT-based methods, which use a geometric representation of a protein conformation, MM-based methods use an explicit protein structure, either as an all-atom structure or through some degree of coarse-graining. This allows MM-based methods to manipulate the conformations of these proteins in physically realistic ways during docking (Figure 1.3). Common MM-based docking methods include RosettaDock,7 ICM,8 and Haddock.9 Unlike geometric-based docking methods, which can exhaustively sample all possible rigid-body orientations, MM-based methods are much more computationally expensive and typically sample only a small fraction of the conformational space. For local docking applications, where some information about the interacting residues is present, this limited sampling is sufficient; for global docking, MM-based methods are often used to refine initial hits generated by geometric-based methods.
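The reason geometric methods can afford an exhaustive translational search is that the overlap score for every shift of one grid against the other can be obtained from a single Fourier correlation. The Python sketch below illustrates only that idea, under simplifying assumptions: the grids are tiny hypothetical occupancy grids, boundaries are cyclic, and the outer loop over rotations and the interior-clash penalties used by practical FFT docking programs are omitted.

```python
import numpy as np

def translation_scan(receptor_grid, ligand_grid):
    """Return score[t] = sum_x receptor[x] * ligand[x - t] for every cyclic shift t.

    The correlation theorem gives all shifts at once from three FFTs.
    """
    fr = np.fft.fftn(receptor_grid)
    fl = np.fft.fftn(ligand_grid)
    return np.fft.ifftn(fr * np.conj(fl)).real

def best_translation(receptor_grid, ligand_grid):
    """Return the ligand shift with the highest correlation score, and that score."""
    scores = translation_scan(receptor_grid, ligand_grid)
    index = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in index), float(scores.max())

# Hypothetical 8x8x8 grids: a favorable region on the receptor and a ligand blob.
receptor = np.zeros((8, 8, 8))
receptor[4:6, 4:6, 4:6] = 1.0
ligand = np.zeros((8, 8, 8))
ligand[1:3, 1:3, 1:3] = 1.0

shift, top_score = best_translation(receptor, ligand)
```

In a full geometric docking run this scan would be repeated for each sampled rotation of the ligand grid, keeping the best-scoring rotation and translation pairs as initial hits.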

Figure 1.3 The conformational search space in protein docking. The conformational search space of protein docking consists of the rigid-body orientation of the two partner proteins, the side-chain conformation of each residue on each partner, and the backbone conformations of each partner in the complex.

Flexible docking

Even the most rigid proteins undergo minor conformational changes upon binding. In most cases, the backbone conformation of the protein is relatively unchanged, while the side-chain conformations of the interacting residues may adopt alternate rotameric states from the unbound conformation. These conformational changes underscore an intrinsic challenge in protein docking: unbound structures are always suboptimal conformations for their bound docked orientations. In early docking methods that relied on surface complementarity, the suboptimality of unbound structures was addressed by ‘softening’ the surface, allowing some degree of protrusion. Most contemporary MM-based methods use explicit side-chain packing and down-weight the Van der Waals repulsive component.

Many current docking challenges include not only the docking of relatively rigid unbound proteins, but also proteins that undergo significant conformational changes upon binding, as well as unbound structures generated from NMR data or homology models. In all of these cases, the suboptimality of the unbound structure is a result of major differences in the backbone conformation between the unbound and bound structure. In the case of binding-induced conformational change, this is a direct result of complex formation itself. In the case of homology models, and to a lesser extent NMR structures, this is because of inaccuracies in the structure prediction of the true unbound conformation, in addition to whatever binding-induced conformational changes may occur to that unbound structure. Chapters 2 and 3 will focus on our efforts to incorporate backbone conformational sampling into docking.

Downstream Applications of Docking

The structure of protein-protein interfaces can yield deep insight into the mechanisms that guide their function. Towards that end, protein docking is a tool with which we can predict the structure of these interfaces. Ultimately, the success of protein docking will be determined not only by its ability to predict complex structures, but also by its ability to provide structural and biological insight into the interactions that are being modeled. In Chapter 4, we seek to use protein docking to identify the structural and energetic mechanisms of specificity in HIV-1 protease.


CHAPTER 2
EARLY EFFORTS IN INCORPORATING SPECIFICITY AND FLEXIBILITY IN PROTEIN DOCKING

Previously published as: Chaudhury, S.*, Sircar, A.*, Sivasubramanian, A., Berrondo, M., Gray, J.J. (2007) Incorporating biochemical information and backbone flexibility in RosettaDock for CAPRI rounds 6-12. Proteins. 69:793-800.

* These authors contributed equally to this publication.

Reprinted with the permission of the publisher, Wiley, Inc., with minor revisions.

2.1 Summary

Critical Assessment of PRedicted Interactions (CAPRI) is a blind protein complex structure prediction competition that has long served as a venue both for testing the capabilities of existing docking methods and for developing novel methods. This chapter introduces our early efforts in incorporating biochemical specificity information and using different models of backbone flexibility in predictive protein docking.

Abstract

In CAPRI rounds 6-12, RosettaDock successfully predicted 2 of 5 unbound-unbound targets to medium accuracy. Improvement over the previous method was achieved with computational mutagenesis to select decoys that match the energetics of experimentally determined hotspots. In the case of Target 21, Orc1/Sir1, this resulted in a successful docking prediction where RosettaDock alone or with simple site constraints failed. Experimental information also helped limit the interacting region of TolB/Pal, producing a successful prediction of Target 26. In addition, we docked multiple loop conformations for Target 20, and we developed a novel flexible docking algorithm to simultaneously optimize backbone conformation and rigid-body orientation to generate a wide diversity of conformations for Target 24. Continued challenges included docking of homology targets that differ substantially from their template (sequence identity < 50%) and accounting for large conformational changes upon binding. Despite a larger number of unbound-unbound and homology model binding targets, Rounds 6-12 reinforced that RosettaDock is a powerful algorithm for predicting bound complex structures, especially when combined with experimental data.

2.2 Introduction

Protein-protein interactions are fundamental to cellular life, and the structure of protein complexes can provide tremendous insight into the mechanisms of protein function. Given the enormous number of undetermined complex structures and the difficulties in structure determination of protein complexes through X-ray crystallography or nuclear magnetic resonance, there is a clear need for accurate computational methods to predict protein complex structures. In such methods, homology models or crystal structures of the unbound protein partners are ‘docked’ with each other to predict the complexed structure. The challenges in docking include both sufficient sampling of putative complexed conformations and discrimination of the correct complexed conformation (near-native).10

The Critical Assessment of PRedicted Interactions11 (CAPRI) is a blind, community-wide test of protein docking methods. CAPRI targets are newly solved crystal structures that are not released to the public until after the participants have submitted their predictions, to ensure that all predictions are blind. We have used RosettaDock, a Monte Carlo-based docking program, in CAPRI since its inception and performed well in previous rounds.12; 13

CAPRI rounds 6-12 proved significantly more challenging than the previous rounds, largely because of an increase in the number of homology targets and unbound-unbound targets, which reflect predictive situations more relevant to the broader biology community.

In rounds 1-5,12; 13 RosettaDock primarily relied upon structural information for docking. Protein backbones were held fixed, and rigid-body displacement and side-chain conformations were optimized simultaneously to generate docking predictions.7 We applied this approach with moderate success in past CAPRI rounds, but as the difficulty of the targets increased, its limitations have become more apparent. In rounds 6-12 we address these limitations with continued development in RosettaDock in two critical areas: the inclusion of experimental biochemical data and the modeling of backbone conformational changes. We outline our development in these two areas and summarize our results for the other CAPRI targets.

The utility of incorporating experimental information, such as mutational analysis, in structure prediction is well-established in CAPRI (e.g., refs. 14; 15; 16). In standard RosettaDock, such information is typically incorporated as site constraints, which specify certain residues that should appear in the interface, and distance constraints, which specify residues that should be within a certain distance of each other. Both of these constraints serve to decrease the docking conformational space and enhance the discrimination of near-native decoys by precluding false positives. We sought to test whether mutational information could yield additional discrimination ability through explicit energetic calculations. We recently used such a computational mutagenesis technique to predict the binding of a therapeutic antibody, using a homology model, to a large flexible receptor antigen.17 For Target 21, we exploited this technique to take advantage of mutational analysis data available in the literature, ultimately producing a moderately accurate prediction.

As CAPRI continues, the need to accommodate backbone conformational variability in docking is becoming increasingly apparent. This is due primarily to two trends. First, rounds 6-12 included a smaller number of the more rigid enzyme-substrate and antibody-antigen targets and a larger number of targets that show significant conformational changes upon binding, such as signaling proteins. Second, homology targets, which often show significant backbone differences from the homology template to the bound structure, comprise an increasing portion of CAPRI targets. The shift in targets is partially due to the greater number from structural genomics initiatives, such as the Structural Genomics Consortium (SGC).18 These trends are clearly illustrated in Target 24, where there was uncertainty in the structure and position of two large regions in the homology model of one of the partners. We developed a simultaneous conformational sampling and docking algorithm to optimize the conformation of these regions while docking. Although we did not achieve a successful prediction, it served as an interesting case study for modeling large conformational variability in docking.


2.3 Results

In CAPRI rounds 6-12, we used RosettaDock to successfully predict two targets to medium accuracy according to the criteria used by Mendez et al.19 We obtained predictions using global docking, local docking, and docking constraints, when applicable, with standard RosettaDock as described previously.7; 13; 20 Docking was complemented with computational mutagenesis in Target 21, and backbone flexibility was modeled in Targets 20 and 24. All homology models were generated using the Robetta structure prediction server.21 Table 2.1 shows a summary of the results for all targets. The three metrics used to assess the quality of predictions are the root mean square deviation (RMSD) of the smaller partner’s (ligand) backbone atoms (N, Cα, C, O) after superposition of the larger partner, the RMSD of the interface backbone atoms after optimal superposition, and the fraction of native residue-residue contacts, each compared to the crystal structure of the complex.19 Individual methods and results for each target are detailed below.
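The three assessment metrics can be made concrete with a short Python sketch. It assumes the backbone coordinates have already been extracted into matched NumPy arrays and that residue-residue contacts are supplied as sets of residue pairs; the contact distance cutoff and the definition of interface residues are set by the assessment criteria and are not reproduced here. Superposition uses the standard Kabsch algorithm, and all function names are illustrative.

```python
import numpy as np

def kabsch(mobile, target):
    """Return rotation R and translation t that best superimpose mobile onto target."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    H = (mobile - mobile_center).T @ (target - target_center)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = target_center - R @ mobile_center
    return R, t

def rmsd(a, b):
    """Root mean square deviation between two matched (N, 3) coordinate arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def ligand_rmsd(model_receptor, model_ligand, native_receptor, native_ligand):
    """L_RMSD: superimpose the larger partner, then measure the smaller partner."""
    R, t = kabsch(model_receptor, native_receptor)
    return rmsd(model_ligand @ R.T + t, native_ligand)

def interface_rmsd(model_interface, native_interface):
    """I_RMSD: RMSD of interface backbone atoms after their optimal superposition."""
    R, t = kabsch(model_interface, native_interface)
    return rmsd(model_interface @ R.T + t, native_interface)

def fraction_native_contacts(native_contacts, model_contacts):
    """F(nat): fraction of native residue-residue contacts present in the model."""
    return len(native_contacts & model_contacts) / len(native_contacts)
```

As a usage example, a model whose receptor matches the native exactly but whose ligand is displaced by 3 Å along one axis gives an L_RMSD of 3 Å regardless of the overall frame in which the model is expressed, because the receptor superposition removes that frame.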


Target  Complex        Type/SeqID  Nres  Best Model  F(nat)  L_RMSD (Å)  I_RMSD (Å)  Quality  Notes
20      HemK/eRF1      U-H30       636   6           0.04    38.59       7.15        -        Standard local + multiple loops
21      Orc1/Sir1      U-U         308   3           0.46    7.62        1.92        medium   Standard local + computational mutagenesis
24      Arf1/ARHGap10  U-H40       272   10          0.15    34.02       11.94       -        Standard local + flexible docking
25      Arf1/ARHGap10  U-B         288   7           0.00    26.94       10.66       -        Standard local
26      TolB/Pal       U-U         505   6           0.47    2.70        1.24        medium   Standard local + site constraints
27-1    E2-25K/Ubc9    U-U         359   7           0.00    48.55       9.28        -        Standard global + local
27-2    E2-25K/Ubc9    U-U         359   7           0.00    31.85       13.15       -        Standard global + local
28      NEDD4L Dimer   H53-H53     756   2           0.01    65.56       14.53       -        Symmetry-constrained global + local

Table 2.1 Summary of Targets. Type indicates the origin of the starting monomer coordinates: bound (B), unbound (U), or homologous (H) structures. Percent sequence identity is provided for homology targets (SeqID). Best Model is the rank of the submitted model of highest quality, detailed in the remaining columns: F(nat) is the fraction of native residue-residue contacts predicted correctly, L_RMSD is the RMSD of the backbone atoms of the smaller protein after optimal superposition of the backbone atoms of the larger protein, I_RMSD is the RMSD of the backbone atoms in the interface after optimal superposition, and Quality ratings follow the criteria of Mendez et al.19


Figure 2.1 Successful RosettaDock predictions. (a) Medium accuracy prediction (model 3) of T21 using the BAH domain of Orc1 (blue) and the SRD domain of Sir1 (green), superposed on the native structure (gray, 1ZHI22) by aligning interface atoms (all atoms within 8 Å of the partner). (b) Computational mutagenesis results for model 3 of T21. Blue: Orc1 BAH domain; gray: Sir1 SRD domain. Interface residues are shown as spheres, with successfully predicted binding-loss mutations in red and binding-loss mutations that were not predicted in yellow. (c) Medium accuracy prediction of T26 using Pal (green) and the β-propeller region of TolB (blue), superposed on the native structure (gray, 2HQS23) by aligning to TolB in the native structure.


2.3.1 Computational Mutagenesis

Target 21: Orc1-Sir1

The Bromo-Adjacent Homology (BAH) domain of Orc1 is sufficient for interaction with the Silencer Recognition Defective (SRD) region of Sir1, and biochemical experiments had identified both binding-loss mutations and non-binding-loss mutations.24 We locally docked the unbound structures of the isolated BAH and SRD domains (Protein Data Bank1 codes 1M4Z25 and 1Z1A22) with an additional constraint that penalized the presence of either truncated region in the interface. We then subjected the top-scoring decoys to computational mutagenesis using RosettaInterface26 and selected for submission those decoys whose mutagenesis patterns most closely matched experimental results.

In experimental mutagenesis, the effect of a mutation on binding of a protein can be determined qualitatively, as a loss or gain of binding, or quantitatively, as a difference in binding free energy (ΔΔG). In computational mutagenesis using RosettaInterface,27 the ΔΔG of binding is determined for a structural model by calculating the difference in interaction energy before and after a mutation is made to one of the partners:

\[ \Delta\Delta G = \left( \Delta G^{*}_{AB} - \Delta G^{*}_{A} - \Delta G^{*}_{B} \right) - \left( \Delta G_{AB} - \Delta G_{A} - \Delta G_{B} \right) \]

where $\Delta G_{AB}$ is the free energy of the wild type complex, $\Delta G_{A}$ and $\Delta G_{B}$ are the free energies of the solvated wild type monomers A and B, and the asterisks indicate the corresponding quantities for the mutant species. After optimizing the side-chain conformations using a rotamer library, the free energy is estimated using a calibrated combination of the Rosetta scoring function components dominated by van der Waals, solvation, and hydrogen bond energies.26 The calculation is used to categorize mutations as binding-loss or non-binding-loss, depending on whether the ΔΔG estimate is greater or less than 1 kcal/mol, respectively. RosettaInterface will not capture energetics that arise from backbone conformational changes or dynamics.26 For each of the top-scoring decoys, we simulated the experimental mutations of Bose et al.24 and selected the decoy whose pattern of binding-loss and non-binding-loss mutations most closely matched the experiment.

Model 3, shown in Figure 2.1A, replicated all non-binding-loss mutations and 3 of 8 binding-loss mutations. This structure was our best prediction and was categorized as medium quality with 7.62 Å ligand RMSD, 1.92 Å interface RMSD, and 46% of the native residue-residue contacts from the crystal structure. The structure ranked 2nd among all submitted predictions by interface RMSD. The importance of the mutagenesis technique is underscored by the fact that Wang et al., using the standard RosettaDock approach without RosettaInterface, were unable to achieve an acceptable prediction.28

To validate the accuracy of the computational mutagenesis method in identifying binding-loss mutations from the complex structure, we retrospectively applied this technique to the released crystal structure for this complex (1ZHI22). Table 2.2 shows the computational mutagenesis results on both our best submitted prediction and the crystal structure (mutation sites shown in Figure 2.1B). There is a qualitative match between the mutagenesis results of the best predicted structure and the crystal structure, indicating that decoy discrimination with this method was optimal. For the loss-of-binding mutations in the crystal structure, 6 of 8 showed a positive ΔΔG, but only 3 passed the threshold to be classified as a binding-loss mutation. For the non-binding-loss mutations in the crystal structure, 6 of 8 showed no change in ΔΔG, and all 8 were under the threshold for a binding-loss mutation. Although RosettaInterface predicted the correct direction of ΔΔG in 12 of 16 mutations, inaccuracies remain in the calculation of ΔΔG for categorically identifying binding-loss mutations.

Non-binding-loss mutations
Mutation  ΔΔG Model  Match  ΔΔG Native  Match
E487A     0.52       √      0.00        √
E488A     0.00       √      0.28        √
E506A     0.00       √      0.89        √
E507A     0.00       √      0.00        √
K513A     0.00       √      0.00        √
D514A     0.00       √      0.00        √
K522A     0.00       √      0.00        √
D523A     0.00       √      0.00        √

Binding-loss mutations
Mutation  ΔΔG Model  Match  ΔΔG Native  Match
Y489S     1.52       √      2.10        √
V490D     0.00       X      0.00        X
R493G     1.04       √      3.41        √
F494S     0.00       X      0.00        X
D503N     0.48       X      0.56        X
A505T     0.62       X      0.97        X
A505S     -0.33      X      0.01        X
A505D     2.80       √      2.34        √

Table 2.2 Computational prediction of the change in binding free energy (kcal/mol) for the best model and crystal structure of Orc1-Sir1. A calculated ΔΔG upon mutation that is greater than 1 kcal/mol in RosettaInterface is indicative of binding loss. The checks/crosses indicate qualitative agreement/disagreement between experimental and computational mutagenesis. Model indicates the best prediction, Model 3, and Native indicates the crystal structure, 1ZHI.22
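The classification step behind Table 2.2 can be reproduced with a few lines of Python. This sketch is not RosettaInterface: it takes the ΔΔG values listed in Table 2.2 as given, applies the 1 kcal/mol threshold, and counts how many of the 16 experimental classifications are matched. The helper `ddg_binding` illustrates the mutant-minus-wild-type difference in interaction energy described in the text, assuming each species is summarized as a hypothetical (complex, monomer A, monomer B) energy triple.

```python
THRESHOLD = 1.0  # kcal/mol; ΔΔG above this value is classified as binding loss

def ddg_binding(mutant, wild_type):
    """ΔΔG from (complex, monomer A, monomer B) free energies of each species."""
    m_ab, m_a, m_b = mutant
    w_ab, w_a, w_b = wild_type
    return (m_ab - m_a - m_b) - (w_ab - w_a - w_b)

def is_binding_loss(ddg, threshold=THRESHOLD):
    return ddg > threshold

# (ΔΔG Model, ΔΔG Native) in kcal/mol, copied from Table 2.2.
NON_BINDING_LOSS = {
    "E487A": (0.52, 0.00), "E488A": (0.00, 0.28), "E506A": (0.00, 0.89),
    "E507A": (0.00, 0.00), "K513A": (0.00, 0.00), "D514A": (0.00, 0.00),
    "K522A": (0.00, 0.00), "D523A": (0.00, 0.00),
}
BINDING_LOSS = {
    "Y489S": (1.52, 2.10), "V490D": (0.00, 0.00), "R493G": (1.04, 3.41),
    "F494S": (0.00, 0.00), "D503N": (0.48, 0.56), "A505T": (0.62, 0.97),
    "A505S": (-0.33, 0.01), "A505D": (2.80, 2.34),
}

def count_matches(column):
    """Count agreement with experiment; column 0 is Model, column 1 is Native."""
    matches = 0
    for values in NON_BINDING_LOSS.values():
        matches += not is_binding_loss(values[column])
    for values in BINDING_LOSS.values():
        matches += is_binding_loss(values[column])
    return matches

model_matches = count_matches(0)
native_matches = count_matches(1)
```

With the tabulated values, both the model and the crystal structure match all 8 non-binding-loss mutations and 3 of the 8 binding-loss mutations, which is the basis for selecting decoys by how closely their pattern agrees with experiment.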


2.3.2 Flexible docking

Target 20: HemK-eRF1

Mutational analysis of eRF1 revealed that the 9-residue GGQ loop in eRF1 interacts with HemK, and this loop undergoes a significant conformational change upon binding.29 Therefore, we modeled this flexibility by generating a diverse set of loop conformations with a loop-building protocol30 to represent dominant conformations in a pre-existing equilibrium ensemble. We created a homology model of eRF1 using the previously determined crystal structure of eRF2 (1GQE31) as a template, which has 40% sequence identity. We then generated a large set of loop conformations in the eRF1 homology model in the absence of HemK, selected the five lowest-energy loops from this set, and grafted each of these separately to the eRF1 homology model. We locally docked each eRF1 model independently to the unbound structure of HemK (1NV832). The top-scoring structures from the two largest clusters for each of the five eRF1 models were selected for submission. The released structure of the complex (2B3T33) showed a large change in the GGQ loop conformation, suggesting that capturing loop movement was essential, but five loop models insufficiently sampled the conformational space, with the closest loop at 5.0 Å RMSD after aligning over residues 226-242 of eRF1, which includes the GGQ loop. Although we did not achieve any acceptable predictions for this target, this method served as a prototype for loop-ensemble docking methods currently being developed in our lab and underscored the need for simultaneous conformational refinement while docking.
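The selection rule used here, and again for later targets, is to take the top-scoring structure from each of the largest clusters. A minimal Python sketch of that rule follows. It assumes the decoys have already been scored and assigned cluster identifiers by some clustering step that is not shown, that lower scores are better, and that the decoy names and values are invented for illustration.

```python
from collections import defaultdict

def select_from_largest_clusters(decoys, n_clusters=2):
    """Return the best-scoring decoy name from each of the n largest clusters.

    decoys: iterable of (name, score, cluster_id); lower score means lower energy.
    """
    clusters = defaultdict(list)
    for name, score, cluster_id in decoys:
        clusters[cluster_id].append((score, name))
    # Rank clusters by population, largest first.
    largest = sorted(clusters.values(), key=len, reverse=True)[:n_clusters]
    return [min(members)[1] for members in largest]

# Hypothetical decoys: (name, score, cluster id).
toy_decoys = [
    ("decoy_a", -10.0, 1), ("decoy_b", -12.0, 1), ("decoy_c", -15.0, 2),
    ("decoy_d", -11.0, 1), ("decoy_e", -9.0, 3), ("decoy_f", -8.0, 3),
]
selected = select_from_largest_clusters(toy_decoys)
```

Note that in this toy data the single lowest-scoring decoy sits alone in its own cluster and is not selected; favoring well-populated clusters trades raw score for evidence that a region of conformational space was found repeatedly.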


Target 24: Arf1-GTP - ARHGAP10

We used the Robetta server to generate five homology models of the Arf1-binding domain of ARHGAP10 (ArfBD) using the template 1BTN,34 which has 30% sequence identity, to dock with the unbound structure of Arf1-GTP (1O3Y35). Visual inspection of the homology models of ArfBD revealed two regions of large variability between the models: a 15-residue loop and a 33-residue C-terminal region. Therefore, we developed and implemented a novel flexible docking method within RosettaDock to account for the uncertainties in the homology model for this target. Our goal was to produce a large diversity of backbone conformations in the context of the protein-protein complex.

We added flexible docking to RosettaDock by integrating a loop modeling method within Rosetta30 into the standard RosettaDock algorithm using a ‘fold tree’ representation of the protein complex,36 which allows for simultaneous treatment of conformational sampling and docking. A fold tree provides a framework for sampling local backbone conformations while maintaining a fixed overall global structure by limiting the propagation of backbone torsion angle perturbations through chain breaks at specified ‘cut points’, which are subsequently repaired. The fold tree consists of ‘edges’, defined as structurally continuous backbone segments, and ‘jumps’, which define the spatial relationship of the edges. Flexible regions of the protein are designated as conformationally ‘variable regions’ and must be adjacent to at least one cut point in the fold tree to allow conformational sampling (Figure 2.2, bottom). In the low-resolution stage of RosettaDock, backbone conformations of the variable regions were sampled during docking using a Monte Carlo algorithm that iteratively applied 3-residue fragment-based torsion angle moves37 and closed the resulting chain breaks using the cyclic coordinate descent (CCD) algorithm.38 In the high-resolution refinement phase of RosettaDock, the backbone torsion angles of these variable regions were optimized during docking using continuous gradient-based energy minimization following each side-chain repacking step.

A crystal structure of Arf1 bound to a related partner, Sec7,39 revealed that residues 73-81 in the Switch 2 region of Arf1 were at the interface; mutational analysis in a separate study40 implicated residues 1052-1096 in the C-terminal region of ArfBD. Therefore, we manually oriented the C-terminal region of ArfBD towards the Switch 2 region of Arf1-GTP, and we locally docked each ArfBD homology model to Arf1 using standard RosettaDock. We then clustered the results and selected the best-scoring structures from the two largest clusters to serve as starting structures for flexible backbone docking. Since the homology models exhibited variation in areas thought to be important for binding, residues 973-988 and 1044-1064 in ArfBD were designated as ‘variable regions’ in the fold tree, with Arf1 represented by a single edge and ArfBD represented by two edges (Figure 2.2). For our final predictions we selected top-scoring structures from clusters with a diverse set of ArfBD C-terminal conformations. We were unable to achieve an acceptable prediction or recover a near-native ArfBD backbone conformation. However, we did succeed in generating a diverse set of conformations of the variable regions from the starting ArfBD model (Figure 2.2, top).


Figure 2.2 Conformations of T24 generated by flexible docking. Red: ArfBD core; blue: Arf1; green: loop region of ARHGap10 treated as flexible during docking; orange: conformers of the C-terminus of ARHGap10 built from fragments during docking. (Bottom) Fold tree prescribing the propagation of conformational changes. The colors in the fold tree match their respective regions in the complex structure. The loop jump is shown with a solid line to indicate that the loop stems are held fixed relative to each other. The docking jump is shown as a dotted line to indicate that the two partners are free to move relative to each other. Residue numbering corresponds to the Mus musculus Arf binding-domain of ArfGap10 (chain A) and Homo sapiens Arf1 (chain B).
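How a fold tree limits the propagation of a backbone torsion change can be shown with a deliberately simplified Python model. This is a toy data structure, not the Rosetta fold tree implementation: peptide edges are directed residue ranges, jumps are rigid connections between two residues, a cut point is simply where one peptide edge ends, and the residue numbering (a 20-residue chain with a loop at residues 8-12 and a 10-residue partner) is hypothetical rather than that of Target 24.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ToyFoldTree:
    peptide_edges: list  # (start, stop): residues built in order from start to stop
    jumps: list          # (parent, child): rigid-body connection between residues

    def _children(self):
        peptide, jump = defaultdict(list), defaultdict(list)
        for start, stop in self.peptide_edges:
            step = 1 if stop >= start else -1
            for residue in range(start, stop, step):
                peptide[residue].append(residue + step)
        for parent, child in self.jumps:
            jump[parent].append(child)
        return peptide, jump

    def downstream(self, residue):
        """Residues whose coordinates move when a backbone torsion at `residue` changes."""
        peptide, jump = self._children()
        moved, stack = set(), list(peptide[residue])
        while stack:
            r = stack.pop()
            if r not in moved:
                moved.add(r)
                stack.extend(peptide[r])
                stack.extend(jump[r])
        return moved

# No cut point: one continuous edge for chain A, a docking jump to partner B (21-30).
uncut = ToyFoldTree(peptide_edges=[(1, 20), (21, 30)], jumps=[(1, 21)])

# Cut point after residue 12: residues 13-20 are attached by a loop jump from residue 7.
cut = ToyFoldTree(peptide_edges=[(1, 12), (13, 20), (21, 30)],
                  jumps=[(7, 13), (1, 21)])

moved_without_cut = uncut.downstream(9)  # residues 10 through 20
moved_with_cut = cut.downstream(9)       # residues 10 through 12 only
```

In this toy model the change at residue 9 stops at the cut point, so the rest of the chain and the docking partner keep their positions; the price is a chain break between residues 12 and 13 that a closure step, such as the CCD algorithm mentioned in the text, would then need to repair.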

Although this method generated complexes with good surface complementarity and hydrogen bonding, visual inspection of the structures revealed that the variable regions of ArfBD tended to unfold and refold around Arf1, leading to unusual extended backbone conformations of ArfBD. Since RosettaDock creates encounter complexes and then optimizes their total energies, when given freedom to modify the conformation of large regions of backbone in a complex, it can significantly improve intermolecular energies at the expense of intramolecular energies. Consequently, the free energy of the monomeric state of the putative bound conformation is far worse than the free energy of the unbound conformation. This contradicts existing models of protein binding, which suggest that the free energies of bound and unbound conformations cannot differ drastically for binding to be kinetically possible.41 Large induced changes in the overall backbone structure also conflict with the general observation that proteins retain much of their unbound structure when bound.42; 43 Therefore, a conformational search algorithm designed to model variability of large regions of the protein must remain limited to realistic backbone changes from the unbound state by accounting for the unbound energy of the binding proteins, the intramolecular energies, and/or compactness.

2.3.3 Standard Docking

Target 26: TolB-Pal

Biochemical experiments suggested that residues 89-130 of Pal interacted with the C-terminal region of TolB, predicted to be a β-propeller.44 We locally docked the unbound structures of Pal and the C-terminal region of TolB (1CRZ45 and 1OAP46), manually orienting residues 89-130 of Pal towards the β-propeller of TolB, with an additional constraint penalizing interfaces near the truncated N-terminal region of TolB. Model 6, shown in Figure 2.1C, was our best prediction, with a ligand RMSD of 2.70 Å, an interface RMSD of 1.24 Å, and 47% of the native residue-residue contacts from the released crystal structure (2HQS23). As a medium-quality prediction, it ranked 3rd among all submitted predictions by ligand RMSD.

Target 25: Arf1-GTP-ARHGAP10

We globally docked the bound structure of ArfBD (2J5947, chain M) with the unbound structure of Arf1-GTP (1O3Y35) and then clustered the results, submitting top-scoring predictions from the largest clusters that agreed with experimental information used for Target 24. Unfortunately, a later-revealed bug in the Rosetta source tree prevented accurate predictions on this target.

Target 27: Ubc9-Hip2

We globally docked the unbound structures of Ubc9 and Hip2 (1A3S48 and 1YLA49), clustered the results, and selected the top-scoring structure from each of the two largest clusters as starting points for local docking and refinement. The largest cluster agreed with a previous study implicating a large positively charged region on Ubc9 as being important in protein interaction.48 A substrate-enzyme interaction has been reported in which Hip2 is SUMOylated by Ubc9 on Lys 14 of the N-terminal α-helix.50 The substrate-interacting residues on Ubc9 included residues 130-136 and 126 along with the catalytic Cys93.51 Since no cluster in the global docking reflected these findings, we incorporated this biochemical data as site constraints in local docking to produce additional models, one of which had 7% of native contacts.52 The results of the two largest clusters from the global run exemplify recent findings that the RosettaDock scoring function is biased towards interfaces of large surface area, even at the cost of low surface complementarity.53 The failure of local docking with site constraints may also be attributed to the differing importance of electrostatics in true enzyme-substrate interactions relative to enzyme-inhibitor or other protein-protein interfaces.

Target 28: HECT-E3 ligase homodimer

We used Robetta to generate five homology models from two different templates (1ND754 and 1ZVD55) and executed symmetry-constrained global docking56 using the top-scoring model for each template, generating 10^5 decoys. The 200 highest-scoring structures were clustered for each template and the five largest clusters were used for local, symmetry-constrained docking and refinement. The literature suggested both significant deviation of the HECT-domain orientation from the template structure and substantial conformational flexibility along that region,54; 57 which was later confirmed by the released structure,58 but we were unable to integrate our flexible docking method from Target 24 with symmetry-constrained docking in time for the predictions.

2.4 Discussion

We used RosettaInterface for Target 21 to successfully identify 38% of binding-loss mutations and 100% of non-binding-loss mutations in a blindly docked model structure. Of the six residues that were shown to lead to binding loss upon mutation, three can be classified as interface residues, i.e., residues in which at least one side-chain atom is within 4 Å of an atom belonging to the other partner in the complex. In our computational mutagenesis of the Orc1-Sir1 complex, all three interface binding-loss mutations were identified, while all three non-interface binding-loss mutations were missed (Figure 2.1B). This pattern is seen in previous results,26 where RosettaInterface successfully identified 79% of interface binding-loss mutations and only 0.7% of non-interface binding-loss mutations for a set of 19 protein complexes of known structure. Rosetta’s energy function is short-ranged and only local side-chain motions are captured. Therefore, non-interface mutations, which may alter ΔΔG through protein stability, changes in backbone conformation and packing, or entropy, are beyond the scope of the algorithm. Since experimental mutational information is often unavailable for CAPRI targets, RosettaInterface could be adapted to use evolutionarily conserved surface residues as a proxy for binding-loss mutation residues, or hot spots. Kortemme et al. demonstrated that computational hot spot residues have the least tolerance for substitution.26 Since RosettaInterface’s overall false positive rate is low (16%), it may be relatively robust to the risk of erroneously identifying false-positive decoys from bioinformatics-identified hot spot residues.26

Our approaches to flexible docking for Targets 20 and 24 reflect two opposing views of protein binding: the pre-existing equilibrium hypothesis and the induced-fit hypothesis. In Target 20, our approach was inspired by the pre-existing equilibrium hypothesis of conformational states: that the bound conformation would be a low-energy structure in the ensemble of unbound conformations. We generated a large ensemble of loop conformations in the unbound eRF1 homology model and selected the five lowest-energy structures for docking. In hindsight, five structures were probably insufficient to sample the large number of low-energy conformations available to that loop, suggesting that larger ensembles may be necessary. In Target 24, we followed an induced-fit approach, in which conformational changes occur simultaneously with binding, by first docking ArfBD to Arf1 and then using our flexible docking algorithm to simultaneously optimize the backbone conformation and rigid-body orientation of ArfBD with respect to Arf1. In this case, the results revealed structures where the intermolecular energy was significantly improved at the expense of intramolecular energy, leading to unrealistic bound backbone conformations.

Our two approaches for flexible docking may be combined to complement each other. In such a method, a diverse pre-generated ensemble of structures would be used with smaller-scale backbone refinement during docking. The backbone conformational search is then restricted to small continuous regions of conformational space centered on discrete low-energy structures in the ensemble that are searched stochastically during docking and refinement. Ideally this would reduce both the size of the ensemble needed to represent conformational space in a purely ensemble-docking approach and the amount of sampling needed to include diverse conformations while docking. This combined approach is representative of the 3-step mechanism for protein binding consisting of diffusion, free conformer selection, and refolding, proposed by Grunberg et al.,41 and we are currently implementing and testing new flexible backbone ensemble docking approaches.

2.5 Conclusion

In CAPRI rounds 6-9 we addressed a variety of challenges in computational protein-protein docking. We used biochemical data to quantitatively discriminate near-native decoys based on interface energetics, modeled multiple loop conformations in docking, and developed a novel method for flexible docking. In these and other respects, CAPRI continues to be an excellent testing ground for pushing development of novel methods in RosettaDock.

CHAPTER 3 STRUCTURE PREDICTION OF PROTEIN INTERFACES USING FLEXIBLE DOCKING

Previously published as: Chaudhury, S. & Gray, J.J., (2008) Conformer selection and induced fit in flexible backbone protein-protein docking using computational and NMR ensembles. Journal of Molecular Biology. 381(4):1068-1087.

Reprinted with the permission of the publisher, Elsevier, Inc. with minor revisions.

3.1 Summary

Our experience in CAPRI rounds 6-9 underscored both the need for accommodating backbone flexibility in protein docking and the challenges in doing so. The unbound structures in over half of the targets were either homology models with substantial inaccuracies in backbone conformation, or underwent significant binding-induced conformational changes. In Chapter 2 we looked at two complementary approaches to modeling backbone flexibility: implicitly, where flexibility was modeled as a set of pre-generated loop conformations, and explicitly, where the backbone conformation is sampled and optimized during the docking simulation itself. In this chapter we use different kinetic models of protein binding to explore the implementation of these two types of flexibility as conformational search strategies in docking.

Abstract

Accommodating backbone flexibility continues to be the most difficult challenge in computational docking of protein-protein complexes. Towards that end, we simulate four distinct biophysical models of protein binding in RosettaDock, a multi-scale Monte Carlo-based algorithm that uses a quasi-kinetic search process to emulate the diffusional encounter of two proteins and identify low-energy complexes. The four binding models are: 1) key-lock model (KL) using rigid-backbone docking, 2) conformer selection model (CS) using a novel ensemble docking algorithm, 3) induced fit model (IF) using energy gradient-based backbone minimization, and 4) a combined conformer selection/induced fit model (CS/IF). Backbone flexibility was limited to the smaller partner of the complex, structural ensembles were generated using Rosetta refinement methods, and docking consisted of local perturbations around the complexed conformation using unbound component crystal structures for a set of 21 target complexes. The lowest-energy structure contained more than 30% of the native residue-residue contacts for 9, 13, 13, and 14 targets for KL, CS, IF and CS/IF docking respectively. When applied to 15 targets using NMR ensembles of the smaller protein, the lowest-energy structure recovered at least 30% native residue contacts in 3, 8, 4 and 8 targets for KL, CS, IF and CS/IF docking respectively. CS/IF docking of the NMR ensemble performed equally well or better than KL docking with the unbound crystal structure in 10 of 15 cases. The marked success of CS and CS/IF docking shows that ensemble docking can be a versatile and effective method for accommodating conformational plasticity in docking and serves as a demonstration for the conformer selection theory: that binding-competent conformers exist in the unbound ensemble and can be selected based on their favorable binding energies.

3.2 Introduction

The formation of highly specific protein complexes is a fundamental process in biology, and the structures of these complexes provide detailed insight into the mechanisms of protein function. Given the enormous number of undetermined complex structures and the difficulties in determination of structures through x-ray crystallography or nuclear magnetic resonance (NMR), there is a need for accurate computational methods to predict protein complex structures. Complex structure prediction, also known as ‘protein docking,’ consists of determining the structure of the bound complex using the unbound structures of its partners (for review see Gray 2006).10 Although protein docking could be viewed as primarily an engineering problem (without need to follow the physical basis of protein binding), physical models of biomolecular interactions have long served as starting points for development. Furthermore, the success or failure of docking algorithms based on these models may provide insight into the biophysical models themselves.

A comparison of bound and unbound structures often reveals significant changes in backbone conformation upon binding,42; 59 which confound current docking methods and represent the single greatest challenge to predictive protein docking. Accurate modeling of backbone conformational changes in docking is difficult because of the enormous complexity of backbone conformational space both in size and degrees of freedom (for review see Bonvin 2006, ref. 60), making exhaustive sampling impossible.

Furthermore, since changes in backbone conformation affect both the intra- and intermolecular energies of putative complexes, they create additional challenges in the discrimination of near-native structures. Effective strategies for both backbone conformational sampling and discrimination are needed before flexible docking is feasible for predictive applications.

Four biophysical models of protein binding suggest distinct conformational sampling strategies in flexible protein docking. First, the key-lock (KL) model of protein interactions, proposed by Emil Fischer in 1894,61 states that proteins interact purely via surface complementarity of their rigid unbound structures. The KL model is the most influential model in the development of protein docking algorithms, underlying the original grid-based62 and fast-Fourier transform63 (FFT) techniques. Since most modern docking methods include side-chain motions in some capacity, for the purposes of this paper we define the KL model in terms of protein backbone flexibility. After allowing side-chain motions, KL docking strategies include, among others, ZDOCK/RDOCK,6; 64 CLUSPRO,65 and RosettaDock.7 These methods are moderately successful at predicting complexes for proteins that undergo minimal backbone conformational change upon binding, but perform progressively worse as the magnitude of backbone conformational change increases.66 Furthermore, these methods are far more successful when starting from bound structures than unbound structures,7; 67; 68 even in cases with minor conformational changes, indicating that backbone flexibility is an important component of protein binding in general and that accurate modeling of backbone flexibility may improve docking for all complexes, not just those that exhibit greater flexibility.

Second, the conformational selection (CS) model proposed by Monod et al. (1965)69 for protein allostery and later adapted to protein interactions by Kumar et al. (2000)70 is a statistical mechanical view of protein binding. The unbound state of a protein is represented by an ensemble of low-energy conformations, or conformers, one of which is the bound conformation. During the binding process, the bound-like conformers are selected over the other conformers in the ensemble as a result of their favorable binding energies. Thus, for docking, backbone flexibility is modeled implicitly as a pre-generated ensemble of rigid structures generated from the unbound structure. Previous CS-docking examples include both FFT-based18; 41; 71 and Monte Carlo (MC)-based ensemble docking.72 Both Grunberg et al. (2004)41 and Smith et al. (2005)18 used molecular dynamics (MD) to create an unbound ensemble that contained conformations that in some ways resemble the bound conformation, but both groups were unable to recover the bound structure in its entirety. Subsequent FFT-based cross-docking of an ensemble of structures from the MD simulations showed substantially improved sampling near the bound structure in the former study and marginal improvements in docking accuracy in the latter. Bastard et al. carried out ensemble docking with an ensemble of loop conformations and successfully discriminated near-native structures when the bound loop conformation was deliberately added to the ensemble.72

Third, in Koshland’s induced-fit (IF) model,73 two proteins recognize each other to form an encounter complex, and then mutually alter their structures to form the intricate surface complementarity observed in bound structures. This model dictates that the bound conformation of a protein exists in response to the presence of the partner in complex, so the backbone conformational space must be sampled explicitly during docking in response to local energetics of the interface. Explicit backbone flexibility has been modeled primarily using MD,9; 74; 75 energy minimization,74; 76 or with gradient-based methods in Monte Carlo minimization (MCM),5; 77 but not FFT-based methods. Wang et al. have shown impressive results in docking proteins in which a loop undergoes moderate to large conformational changes upon binding using explicit backbone flexibility, but their methods are extremely computationally intensive and require prior knowledge of the flexible regions, which limits their use in blind structure prediction.5; 78 Krol et al. carried out both MD relaxation and energy minimization of docking poses and showed significant increases in the fraction of native residue contacts recovered.74

The fourth model is a hybrid conformational-selection/induced-fit (CS/IF) model proposed by Grunberg et al.,41 where binding is a two-stage process that begins with conformational selection to form an encounter complex, followed by an induced-fit or ‘refolding’ step that leads to the final bound conformation. The conformations that are selected to form the encounter complex need not resemble either the bound or unbound structure. A docking algorithm based on this model would combine both ensemble docking and explicit backbone flexibility during docking.

Krol et al.71 employed a docking strategy that combines FFT-based cross-docking of MD-generated ensembles with MD refinement of top-ranked decoys on a limited set of targets. Their approach improved docking accuracy as measured by the number of native residue contacts recovered but decreased accuracy by other measures, such as root mean square deviation (RMSD) of interface residues. HADDOCK combines both implicit and explicit backbone flexibility while incorporating biochemical information,9 and it is capable of using ensembles from a wide variety of sources including MD, homology modeling, or NMR structures. When HADDOCK was used to dock MD-derived ensembles for two sets of CAPRI targets, significant improvements over rigid-body docking were observed.16; 76 In several specific applications, docking of ensembles of NMR models using HADDOCK successfully generated structures of unknown protein complexes that have been subsequently validated by biochemical experiments.79; 80; 81

Our goal in this work is to implement the four binding models as conformational sampling strategies in a common algorithm (RosettaDock7) and evaluate the differing abilities and limitations of each approach. We aim to create an optimal docking algorithm that allows backbone flexibility. As this is an ambitious goal, we limit this study to docking targets with small to moderate amounts of conformational change upon binding and we restrict backbone motion to the smaller of the two proteins (ligand). At a minimum, a flexible backbone algorithm must be able to recover correct docked complex structures (near-native) for these simpler targets. Insights we gain in this study can later be extended toward the goal of capturing large-scale conformational change in docking.

Furthermore, we test our new ensemble-based methods using the sets of structural models provided by NMR solution state studies (hereafter referred to as NMR ensembles). Compared to computational ensembles, NMR ensembles are typically more diverse and arguably provide a better representation of the unbound state.82 Although NMR structures make up roughly 15% of known structures, a systematic study showing successful docking using NMR ensembles has, to our knowledge, never been done.

| Complex PDB | Receptor PDB | Receptor | Ligand PDBX | Ligand | E_size | Ubxtal | Ensmin | Ensmax |
|---|---|---|---|---|---|---|---|---|
| 2PTC(E:I) | 2PTN | B-Trypsin | 6PTI | Pancreatic trypsin inhibitor | 10 | 0.28 | 0.82 | 1.3 |
| 2JEL*(LH:P) | 2JEL(LH) | Jel42 Fab fragment | 1POH | A06 Phosphotransferase | 10 | 0.31 | 0.39 | 1.0 |
| 1BVK(DE:F) | 1BVL(LH) | Antibody Hulys11 Fv | 3LZT | Lysozyme | 10 | 0.38 | 0.71 | 1.8 |
| 1BQL*(LH:Y) | 1BQL(LH) | Hyhel-5 Fab | 1DKJ | Lysozyme | 10 | 0.40 | 1.7 | 2.4 |
| 1DFJ(E:I) | 2BNH | Ribonuclease inhibitor | 7RSA | Ribonuclease A | 10 | 0.40 | 1.7 | 2.4 |
| 2KAI(AB:I) | 2PKA(XY) | Kallikrein A | 6PTI | Pancreatic trypsin inhibitor | 10 | 0.40 | 0.67 | 1.1 |
| 2BTF*(A:P) | 2BTF(A) | B-actin | 1PNE | Profilin | 10 | 0.44 | 0.91 | 1.5 |
| 1BRS(A:D) | 1A2P(B) | Barnase | 1A19(A) | Barstar | 10 | 0.47 | 0.77 | 1.7 |
| 1BRC(E:I) | 1BRA | Trypsin | 1AAP(A) | APPI | 10 | 0.57 | 0.99 | 1.6 |
| 2SIC(E:I) | 1SUP | Subtilisin BPN | 3SSI | Subtilisin inhibitor | 10 | 0.65 | 0.97 | 2.2 |
| 1MLC(AB:E) | 1MLB(AB) | IgG1 D44.1 Fab fragment | 1LZA | Lysozyme | 10 | 0.73 | 0.69 | 1.7 |
| 2SNI(E:I) | 1SUP | Subtilisin Novo | 2CI2(I) | Chymotrypsin inhibitor 2 | 10 | 0.82 | 0.96 | 1.9 |
| 1WQ1(G:R) | 1WER | RAS activating domain | 5P21 | RAS | 10 | 0.85 | 1.0 | 1.5 |
| 1UGH(E:I) | 1AKZ | uracil-DNA glycosylase | 1UGI(A) | UDG inhibitor | 10 | 0.90 | 0.92 | 1.3 |
| 1CHO(E:I) | 5CHA(A) | a-chymotrypsin | 2OVO | ovomucoid third domain | 10 | 0.95 | 1.1 | 1.8 |
| 1ACB(E:I) | 5CHA(A) | a-chymotrypsin | 1CSE(I) | Eglin-C | 10 | 1.0 | 0.79 | 1.8 |
| 1CSE(E:I) | 1SCD | subtilisin Carlsberg | 1ACB(I) | Eglin-C | 10 | 1.0 | 0.79 | 1.8 |
| 1MAH(A:F) | 1MAA(B) | acetylcholinesterase | 1FSC | Fasciculin II | 10 | 1.0 | 0.87 | 1.4 |
| 1FSS(A:B) | 2ACE(E) | Acetylcholinesterase | 1FSC | Fasciculin II | 10 | 1.2 | 0.90 | 1.6 |
| 1TGS(Z:I) | 2PTN | Trypsinogen | 1HPT | pancreatic trypsin inhibitor | 10 | 1.6 | 1.8 | 2.5 |
| 1CGI(E:I) | 1CHG | a-chymotrypsinogen | 1HPT | pancreatic trypsin inhibitor | 10 | 1.8 | 1.8 | 2.3 |

Table 3.1 Crystal structure docking targets. PDB codes include chain identifiers, E_size is the number of conformers in each ensemble, Ubxtal is the BB_rmsd of the unbound crystal structure to the bound crystal structure, UbNMR is the BB_rmsd of the first model in the NMR structure for NMR targets, and Ensmin and Ensmax are the minimum and maximum BB_rmsd of conformers in the ensemble. All RMSD measurements are listed in Ångstroms.

| Complex PDB | Receptor PDB | Receptor | Ligand PDBX | Ligand PDBN | Ligand | E_size | Ubxtal | UbNMR | Ensmin | Ensmax |
|---|---|---|---|---|---|---|---|---|---|---|
| 1AY7 | 1RGH(B) | Ribonuclease SA | 1A19 | 1BTB | Barstar | 30 | 0.62 | 0.86 | 0.57 | 1.4 |
| 1BRS | 1A2P(B) | Barnase | 1A19 | 1BTB | Barstar | 30 | 0.52 | 0.94 | 0.72 | 1.3 |
| 1EAW | 1EAX(A) | Matripase | 9PTI | 1PIT | Pancreatic trypsin inhibitor | 20 | 0.62 | 0.97 | 0.76 | 1.7 |
| 1AK4 | 2CPL | cyclophilin | 1E6J | 1OCA | HIV capsid | 20 | 0.50 | 0.97 | 0.78 | 2.0 |
| 2KAI | 2PKA(XY) | Kallikrein A | 6PTI | 1PIT | Pancreatic trypsin inhibitor | 20 | 0.40 | 1.1 | 0.74 | 2.0 |
| 2PTC | 2PTN | B-Trypsin | 6PTI | 1PIT | Pancreatic trypsin inhibitor | 20 | 0.28 | 1.2 | 0.84 | 2.2 |
| 1KTZ | 1TGK | TGFB | 1M9Z | 1PLO | TGFB receptor II | 10 | 0.39 | 1.2 | 0.96 | 1.6 |
| 2PCC | 1CCP | Cyt C peroxidase | 1YCC | 2HV4 | Cytochrome C | 35 | 0.35 | 1.3 | 0.88 | 2.6 |
| 2BTF* | 2BTF(A) | Actin | 1PNE | 1PFL | Profilin | 20 | 0.44 | 1.8 | 0.92 | 2.6 |
| 1BVK | 1BVL(LH) | Antibody Hulys11 Fv | 3LZT | 1E8L | Lysozyme | 50 | 0.38 | 2.1 | 0.86 | 6.4 |
| 1MLC | 1MLB(AB) | IgG1 D44.1 Fab fragment | 3LZT | 1E8L | Lysozyme | 50 | 0.73 | 2.2 | 1.6 | 4.0 |
| 1CSE | 1SCD | subtilisin Carlsberg | 1ACB(I) | 1EGL | Eglin-C | 25 | 1.0 | 2.5 | 1.3 | 4.4 |
| 1CHO | 5CHA(A) | a-Chymotrypsin | 2OVO | 1OMT | Ovomucoid third domain | 50 | 1.0 | 2.5 | 1.2 | 2.7 |
| 1B6C | 1D6O(A) | FKBP-binding protein | 1IAS | 1FKR | TGFB receptor I | 20 | 0.66 | 3.3 | 1.3 | 3.8 |
| 1ACB | 5CHA(A) | a-chymotrypsin | 1CSE(I) | 1EGL | Eglin-C | 25 | 1.0 | 3.9 | 1.9 | 4.8 |

Table 3.2 NMR docking targets. PDB codes include chain identifiers, E_size is the number of conformers in each ensemble, Ubxtal is the BB_rmsd of the unbound crystal structure to the bound crystal structure, UbNMR is the BB_rmsd of the first model in the NMR structure for NMR targets, and Ensmin and Ensmax are the minimum and maximum BB_rmsd of conformers in the ensemble. All RMSD measurements are listed in Ångstroms.

3.3 Methods

Structural Data

We assembled two target sets to test our docking methods (Table 3.1 and Table 3.2). The crystal structure target set consists of 21 protein complexes from Docking Benchmark 1.0 (ref. 83) with a BB_rmsd of less than 2.0 Å and no disordered residues in the interface region of the unbound crystal structures. The NMR structure target set consists of 15 complexes from both Docking Benchmark 1.0 and 2.0 (ref. 84) that meet the same criteria and also have an unbound NMR structure and an unbound crystal structure in the PDB.1

Docking Metrics

Several metrics are used to measure docking accuracy. They are sensitive to the ligand position relative to the receptor, the specificity of the interactions across the interface, and the backbone conformation of the ligand. L_rmsd is an overall measure of the ligand position and orientation with respect to the receptor and is the RMSD of Cα coordinates of the ligand protein between the decoy and the native structure after superposition of the receptor. Interface RMSD (I_rmsd) is the Cα RMSD of interface residues in the decoy compared to the native structure after superposition of the interface residues, where interface residues are defined as those with intermolecular distances of less than 4 Å in the native structure. Fraction of native residue-residue contacts (fnat) is a measure of the specificity of the interactions across an interface. For CAPRI, Mendez et al.85 outline three classifications for docking accuracy based on these metrics: high quality (fnat > 0.5 OR I_rmsd < 1.0 Å), medium quality (fnat > 0.3 AND L_rmsd < 5.0 Å OR I_rmsd < 2.0 Å) and acceptable quality (fnat > 0.1 AND L_rmsd < 10.0 Å OR I_rmsd < 4.0 Å). The backbone RMSD of the flexible ligand is measured as the Cα RMSD of interface residues, after superposition of the entire ligand, relative to the bound ligand conformation (BB_rmsd) and the unbound ligand conformation (UB_rmsd).
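As a rough illustration of the fnat metric described above, the following Python sketch computes the fraction of native residue-residue contacts recovered by a decoy. The function name `fraction_native_contacts` and the representation of a contact as a (receptor residue, ligand residue) pair are assumptions made for this example; they are not taken from RosettaDock or from the CAPRI evaluation code, and the example leaves the choice of contact cutoff to the caller.

```python
def fraction_native_contacts(native_contacts, decoy_contacts):
    """Return fnat: the share of native residue-residue contacts also present in the decoy.

    Each contact is a hypothetical (receptor_residue, ligand_residue) pair.
    """
    native = set(native_contacts)
    if not native:
        raise ValueError("native structure has no interface contacts")
    # Contacts in the decoy that are absent from the native do not count.
    return len(native & set(decoy_contacts)) / len(native)


# Toy data: four native contacts, two of which the decoy reproduces.
native = {(10, 3), (11, 3), (12, 5), (40, 8)}
decoy = {(10, 3), (12, 5), (41, 8)}
fnat = fraction_native_contacts(native, decoy)
```

In this toy case the decoy recovers two of the four native contacts, so `fnat` is 0.5.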

Docking Methods

Backbone flexibility is always limited to the ligand, defined as the smaller of the two proteins in the complex. All starting structures are prepared by replacing the side chains with rotamers from a standard rotamer library expanded to include rotamers from the unbound crystal structures.7; 20 Independent local docking runs generate 1000 decoys for docking crystal structures, or 5000 decoys for docking NMR structures with CS- or CS/IF-docking in order to accommodate the significantly larger ensemble sizes. KL-docking is local docking using standard RosettaDock as described by Gray et al.7 with rotamer torsion angle minimization as described by Wang et al.20 Antibody docking targets use an alignment file that biases low-resolution docking towards CDR residues as described previously.12; 17; 86

For IF methods, the gradient-based energy minimization is expanded beyond rigid-body orientation to include backbone torsion angles during the minimization step in the high-resolution MCM phase of RosettaDock, as described by Wang et al.5 The unbound ligand structure used in IF-docking was relaxed prior to docking to ensure that the structure was at a local energy minimum with the Rosetta energy function, as described by Wang et al.5

For CS-docking’s conformer selection moves, all n conformers in the ensemble are superposed along the interface residues of the current conformer (residues with less than 4 Å atomic distance to the receptor). Binding energy is calculated for each conformer to create a partition function, $Z = \sum_{i=1}^{n} \exp(-\Delta E_{\mathrm{binding},i} / kT)$, and a random conformer i is selected based on its Boltzmann probability, $P_i = \exp(-\Delta E_{\mathrm{binding},i} / kT) / Z$. As in IF-docking, each conformer in the NMR ensemble underwent high-resolution refinement prior to docking to ensure they were in a local energy minimum within the Rosetta energy function.
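The conformer selection move described above can be sketched in a few lines of Python. This is a minimal illustration, not the RosettaDock implementation: the function names, the value of `kT`, and the example energies are all assumptions, and the energies are shifted by their minimum purely for numerical stability (the shift cancels in the probabilities).

```python
import math
import random


def boltzmann_probabilities(binding_energies, kT=1.0):
    """Return P_i = exp(-dE_i / kT) / Z for each conformer (kT is an assumed value)."""
    # Subtracting the minimum energy avoids overflow and does not change P_i.
    e_min = min(binding_energies)
    weights = [math.exp(-(e - e_min) / kT) for e in binding_energies]
    z = sum(weights)  # partition function of the shifted energies
    return [w / z for w in weights]


def select_conformer(binding_energies, kT=1.0, rng=random):
    """Draw one conformer index at random, weighted by its Boltzmann probability."""
    probs = boltzmann_probabilities(binding_energies, kT)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical binding energies for a three-conformer ensemble.
energies = [-12.0, -10.5, -3.0]
probs = boltzmann_probabilities(energies)
chosen = select_conformer(energies, rng=random.Random(0))
```

Because selection is weighted rather than uniform, adding many high-energy conformers to the ensemble changes the outcome only slightly, which is one way to read the robustness to ensemble size and heterogeneity that the text attributes to the partition function.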

Scoring Methods

Scoring both during the docking simulation and in final decoy discrimination is based on binding energy, defined as the difference between the total energy of the final bound complex and the total energy of the initial unbound state, $\Delta E_{\mathrm{binding}} = E_{\mathrm{complex}} - E_{\mathrm{unbound}}$. The unbound energies for the receptor and each ligand conformer are calculated by taking the lowest energy from 10 independent side-chain packing runs. The total unbound energy is the sum of the unbound energies of the receptor and the ligand and is specific to the conformer used in the final complex, $E_{\mathrm{unbound}} = E_{\mathrm{receptor}} + E_{\mathrm{conformer}}$. All unbound reference energies are calculated prior to docking.

The Rosetta centroid-mode energy function used in docking is identical to that used in Gray et al.7 and consists of contact, bump, residue environment and residue-residue pair potentials. The Rosetta all-atom energy function used in docking consists of Van der Waals, hydrogen bonding, side-chain probability, solvation and electrostatic terms,7 and the weights used for each component are identical to those used by Wang et al.20 In high resolution, an additional secondary structure-dependent backbone (Ramachandran) torsional potential was also used37 and assigned a weight of 0.1.
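The binding energy bookkeeping described above can be written out as a short Python sketch. The function names and the numerical energies are hypothetical and chosen only to make the arithmetic easy to follow; in practice the energies would come from the Rosetta energy function after the side-chain packing runs.

```python
def unbound_energy(packing_run_energies):
    """Lowest total energy found over independent side-chain packing runs."""
    return min(packing_run_energies)


def binding_energy(e_complex, receptor_runs, conformer_runs):
    """dE_binding = E_complex - (E_receptor + E_conformer)."""
    e_unbound = unbound_energy(receptor_runs) + unbound_energy(conformer_runs)
    return e_complex - e_unbound


# Hypothetical energies (the text uses 10 packing runs per partner; fewer are shown here).
receptor_runs = [-250.0, -252.5, -251.0]
conformer_runs = [-80.0, -81.5]
dE = binding_energy(-350.0, receptor_runs, conformer_runs)
```

Here the unbound reference is -252.5 + (-81.5) = -334.0, so the binding energy is -350.0 - (-334.0) = -16.0. Because the reference is specific to the conformer, a conformer that is strained in isolation receives less credit for the same complex energy.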

Ensemble Generation

Ensembles are generated using RosettaRelax as described by Misura et al.,87 except that ‘wobble’ and ‘crank’ perturbations are omitted, primarily to reduce computational cost.

Algorithm Availability

Ensemble generation, NMR ensemble refinement, and all docking methods presented here are freely available for academic and non-profit use as part of the Rosetta structure prediction suite at www.rosettacommons.org. The distribution includes supporting scripts, documentation and full source code.

3.4 Results

We have implemented and tested flexible backbone docking approaches using RosettaDock, a multi-scale Monte Carlo-based algorithm that samples docking orientations by emulating the diffusional encounter of two proteins in solution and identifying low-energy complexes using an approximate energy function. Although RosettaDock is not a rigorous physical simulation of protein binding, its multi-scale approach, quasi-kinetic sampling method and previous success in the Critical Assessment of PRedicted Interactions (CAPRI)12; 13; 77; 78 make it well suited to apply the four kinetic models of binding to protein docking. The low-resolution phase of RosettaDock simulates the formation of an encounter complex between two proteins, while the high-resolution phase models the transition from the encounter complex to a fully complexed structure. Physically realistic conformational sampling strategies may more effectively locate the correct complex structure,88 and likewise, an efficient sampling technique may provide insight into the underlying theories that inspired its design.

The methods are tested on a set of docking targets using both crystal structures and NMR ensembles of the ligand, and backbone flexibility was limited to the ligand (Tables 3.1 and 3.2). Docking is restricted to local perturbations around the native complexed orientation (local docking), which resembles blind predictive docking in cases where biochemical information on the interaction is available. Global docking, where all of docking search space is sampled, while compatible with the presented methods, is outside the scope of this study due to the computational cost of evaluating a target benchmark of this size (~100 times the computational cost of local docking7).

3.4.1 Docking Algorithms

The general docking algorithm is illustrated in Figure 3.1. The ligand in the starting structure is first randomly translated and rotated to generate an initial starting position which approximates a collisional encounter between the two partners in which the appropriate interacting surfaces are proximal to each other.7 Typically fewer than 3.5% of initial starting positions are less than 10 Å from the native structure by Cα RMSD of ligand residues after superposition of the receptor (L_rmsd) with the native structure. The initial complex then undergoes 500 cycles of low-resolution docking, each consisting of a randomized rigid-body move of ~1 Å followed by a Metropolis acceptance criterion. During this phase, side chains are represented rigidly as centroid pseudoatoms.
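The low-resolution search loop described above follows the general Monte Carlo pattern of perturbation followed by a Metropolis test. The Python sketch below illustrates that pattern on a one-dimensional toy problem. It is a sketch under stated assumptions, not RosettaDock code: the function names, the value of `kT`, the toy energy function, and the choice to track the lowest-energy state seen are all made up for the example.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT=1.0, rng=random):
    """Accept downhill moves always; accept uphill moves with probability exp(-dE / kT)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


def low_resolution_search(start, energy_fn, perturb_fn, cycles=500, kT=1.0, rng=random):
    """Run perturb-then-Metropolis cycles and return the lowest-energy state sampled."""
    current, e_current = start, energy_fn(start)
    best, e_best = current, e_current
    for _ in range(cycles):
        trial = perturb_fn(current, rng)
        e_trial = energy_fn(trial)
        if metropolis_accept(e_current, e_trial, kT, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


# Toy stand-in for a rigid-body search: a 1-D coordinate with a quadratic energy.
rng = random.Random(0)
best, e_best = low_resolution_search(
    start=5.0,
    energy_fn=lambda x: x * x,
    perturb_fn=lambda x, r: x + r.uniform(-1.0, 1.0),  # step of at most 1 unit
    rng=rng,
)
```

In a real docking run the state would be a rigid-body orientation of the ligand and the energy would be the centroid-mode score; the acceptance logic is unchanged.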

Figure 3.1 The flexible docking algorithm. The low-resolution phase models the formation of an encounter complex and the high-resolution phase models its transition to a bound complex. CS and CS/IF-docking include the conformer selection step (green box). IF and CS/IF-docking include backbone minimization (orange box).

To capture conformer selection in binding, we added a step in the low-resolution stage (green box, Figure 3.1) of the algorithm immediately following the rigid-body move and preceding the Metropolis step. Following a rigid-body move, the entire ensemble of conformers is superposed along the interface residues of the current conformer (Figure 3.2C). The centroid-mode binding energy is calculated for each conformer and used to generate a partition function. A conformer is selected from the ensemble to replace the current conformer based on its Boltzmann-weighted probability within the partition function. Once a conformer is selected, the Metropolis criterion is applied on the combined rigid-body/conformer selection move. The use of a partition function in this manner allows for robustness of this method towards both ensemble size and heterogeneity, compared to a simple random selection.
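The conformer selection step can be illustrated with a short sketch that turns per-conformer binding energies into Boltzmann-weighted selection probabilities. This is a minimal illustration under stated assumptions: the function names and the value of `kT` are hypothetical, and the energies would in practice be the centroid-mode binding energies of the superposed conformers.

```python
import math
import random


def boltzmann_weights(energies, kT=1.0):
    """Return normalized Boltzmann probabilities p_i = exp(-E_i/kT) / Z.
    Energies are shifted by their minimum for numerical stability;
    the shift cancels in the ratio. kT is a hypothetical value."""
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(factors)                     # partition function (shifted)
    return [f / z for f in factors]


def select_conformer(energies, kT=1.0, rng=random):
    """Draw the index of one conformer according to its Boltzmann weight."""
    weights = boltzmann_weights(energies, kT)
    return rng.choices(range(len(energies)), weights=weights, k=1)[0]


if __name__ == "__main__":
    binding_energies = [-12.0, -11.5, -9.0, -11.8]   # made-up example values
    print(boltzmann_weights(binding_energies))
    print(select_conformer(binding_energies, rng=random.Random(0)))
```

Because the weights are normalized over whatever ensemble is supplied, adding conformers with unfavorable energies changes the selection probabilities of the favorable ones only slightly, which is one way to read the robustness to ensemble size and heterogeneity noted above.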

Conformer selection is restricted exclusively to the low-resolution phase to remain consistent with the theory that the conformers in the unbound ensemble represent distinct low-energy structures that are the result of thermal fluctuations at time scales much greater than that of a binding event.

The lowest-energy complex sampled during the low-resolution phase serves as a putative encounter complex and is converted to a high-resolution structure through a side-chain packing step. The complex then undergoes 50 steps of high-resolution refinement consisting of rigid-body moves of ~0.1 Å and periodic combinatorial side-chain packing, followed by energy gradient-based minimization of the rigid-body orientation (R, T) and side-chain torsion angles (χn). To capture induced fit in binding, we use a previously developed5 extended minimization scheme (orange box, Figure 3.1) that includes the backbone torsion angles (φn, ψn), allowing the backbone conformation of the ligand to respond to local energy gradients along the interface created by the docking process. The lowest-energy structure in the high-resolution stage is selected as the final complexed structure for that iteration of the docking algorithm (hereafter referred to as a ‘decoy’). The entire algorithm is repeated to generate 1000 decoys, and all decoys are re-ranked by all-atom binding energy. To test the four kinetic models (KL, CS, IF, and CS/IF), we simply include or skip the appropriate conformer selection or induced fit step in the algorithm.

The computational cost of a docking method not only restricts user accessibility but also the extent of conformational sampling, which is often a limiting factor in predictive docking. Using a 1.5 GHz processor, the standard RosettaDock algorithm requires 1.5 to 5 minutes to generate a single decoy, depending on the size of the protein partners. The CS method requires approximately 1.5 to 2 times longer per decoy for an ensemble of 10 conformers, representing a 5-fold increase in efficiency over exhaustive docking of each conformer in the ensemble to the receptor. The IF method requires approximately 4-6 times longer per decoy than standard RosettaDock, while the CS-IF method requires approximately 5-7 times longer. Therefore, some consideration must be placed on the computational costs of each method along with overall docking performance when evaluating the methods in this study.

Ensemble Generation

To dock crystal structures, we created computational ensembles using RosettaRelax,87; 89 a multi-scale Monte Carlo-based structural refinement algorithm which samples the local conformation space for alternate low-energy structures using small backbone torsion angle perturbations, side-chain packing and energy gradient-based minimization in torsion space. Each structure required 15-30 minutes to generate on a 1.5 GHz processor, and 10 structures were generated for each target. Figure 3.2A illustrates the ensemble generation technique for crystal structures, including idealization of bond lengths and angles, low-resolution relaxation and high-resolution structural refinement. Figure 3.2B illustrates the preparation of NMR ensembles, which consists of idealization followed by high-resolution refinement.

Figure 3.2 Ligand ensemble generation and conformer selection (A) Crystal structures are idealized and then relaxed in low and high-resolution to generate an ensemble of 10 structures. (B) Conformers in an NMR ensemble are idealized and refined in high-resolution only. (C) During low-resolution docking, conformers are superposed along the current conformer’s interface residues, and a conformer is selected from a partition function using Boltzmann-weighted energies.

For the Rosetta-generated ensemble, diversity is achieved primarily through the low-resolution relaxation step. Conformers are typically within a root mean square
deviation (RMSD) of Cα atoms (Cα RMSD) of 1.0 Å of each other, while NMR ensembles are generally much more diverse, as can be seen in comparing Figure 3.2A and 2B. To show the conformational range of each type of ensemble and their potential for containing a binding-competent conformer, the Cα RMSD of the interface residues towards the bound ligand conformation after superposition of the entire ligand (BB_rmsd) is shown in Table 1 for the closest and furthest conformer in each ensemble. In the Rosetta-generated ensembles, the closest conformer is often further from the bound structure than the unbound crystal structure, suggesting that within this target set, Rosetta has a limited ability to generate conformers that are closer to the bound state. In the NMR ensemble, the closest conformer is often significantly closer to the bound conformation than the first conformer (Model 1) in the NMR structure coordinate file, which is typically the model that satisfies the most experimental constraints.

3.4.2 Energy function and discrimination

Energetic discrimination of near-native decoys remains a persistent challenge in flexible docking across FFT-based,18 MD-based71; 74 and MC-based methods,5 due in part to the interplay between intramolecular and intermolecular energies. Previous studies on docking have used either total energy of a complex, intermolecular energies between two partners in a complex, or binding energy of a complex, which measures the difference in energy between the bound complex and the unbound state.5; 18; 71; 74; 75 Although rigid-body docking with standard RosettaDock uses total energy, the variation in the backbone conformation in flexible docking can lead to significant changes in intramolecular energies, which can alter both sampling and discrimination. We therefore compared the use of total energy and binding energy for conformer selection and final decoy discrimination.

Figure 3.3 Different energy models for docking (A) The frequency of selecting a particular conformer during the conformer selection step vs. the internal energy for each conformer when using total energy (red) and binding energy (blue) for 1BRC. (B) Total energy vs. L_rmsd and binding energy vs. L_rmsd for 1BRC. CAPRI criteria high-quality decoys are shown in brown, medium-quality in orange and acceptable-quality in tan; note that high-quality decoys sometimes meet the CAPRI I_rmsd criterion rather than the L_rmsd criterion.

Figure 3.3A shows the observed frequency that a particular conformer was selected in the low-resolution phase of CS-docking when using total energy or binding energy in generating the partition function in the conformer selection step, as a function of its unbound energy. Using total energy in the conformer selection step leads to a large bias towards low-energy conformers in the ensemble. While thermodynamically correct, it depends on the accuracy of the relative free energies of the different conformers and prevents adequate sampling of the entire ensemble. By contrast, the use of binding energy leads to a more even distribution between conformers. Unexpectedly, conformer
selection is slightly biased towards higher-energy conformers when using binding energy compared to total energy, possibly due to the non-specific burial of hydrophobic residues. To illustrate the effects on near-native discrimination, Figure 3.3B charts 1000 decoys in docking funnel plots (energy vs. L_rmsd to the native structure) created by the CS method for the same target. The degree of near-native discrimination in a docking funnel can be quantified by the difference in average energy of near-native decoys and the average energy of non-native decoys, normalized by the standard deviation of the energy of non-native decoys (hereafter referred to as Z-score).20 Z-scores of -1.2 and -2.6 for the total energy and binding energy, respectively, show that the use of binding energy yields significantly improved discrimination.

Since binding energy avoids accurate thermodynamic calculation of free energies of the conformers, the starting conformers must be physically realistic and of comparable energies.
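The Z-score used to quantify near-native discrimination can be written out as a short sketch. The function name is hypothetical, and the choice of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def funnel_z_score(energies, is_near_native):
    """Z-score of a docking funnel: (mean energy of near-native decoys
    minus mean energy of non-native decoys) divided by the standard
    deviation of the non-native energies. More negative values indicate
    better discrimination. Sample standard deviation is assumed here."""
    near = [e for e, flag in zip(energies, is_near_native) if flag]
    non = [e for e, flag in zip(energies, is_near_native) if not flag]
    if not near or len(non) < 2:
        raise ValueError("need at least 1 near-native and 2 non-native decoys")
    return (statistics.mean(near) - statistics.mean(non)) / statistics.stdev(non)


if __name__ == "__main__":
    # Made-up energies: two near-native decoys and three non-native decoys.
    energies = [-10.0, -9.0, 0.0, 2.0, 4.0]
    flags = [True, True, False, False, False]
    print(funnel_z_score(energies, flags))
```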

3.4.3 Case Study: Acetylcholinesterase-fasciculin II

To illustrate results of the four docking algorithms in detail, we analyze the representative case of the acetylcholinesterase (AChe) - fasciculin II (FAS2) complex (Protein Data Bank1 (PDB) code 1FSS) before describing the results of the entire target set. There is a change in conformation of interface residues of FAS2 from the unbound to bound state which impedes high-quality prediction using current docking methods. In global docking of 1FSS, neither Li et al.64 using ZDOCK+RDOCK nor Smith et al.18 using an FFT-based ensemble docking method were able to generate a prediction that was within 2.5 Å Cα RMSD of interface residues (I_rmsd) among their 10 top-ranked decoys. In local docking, Wang et al.5 used both rigid and flexible backbone docking in RosettaDock, but did not produce a decoy within 2.0 Å I_rmsd in the top three predictions.

FAS2 features a three-finger structural motif common to a number of toxins. A visual comparison of the bound crystal structure90 with the unbound crystal structure91 reveals a ~2 Å movement of Loop II (Figure 3.4A). Computer simulations have provided evidence for both a conformational gating92 and an induced-fit mechanism93 of binding to AChe. The ensemble representing the unbound state of FAS2 that was generated from the unbound crystal structure contained conformational variability in all three loops, but most noticeably in loop II (Figure 3.4B). A quantitative analysis of the heterogeneity in the ensemble, comparing the mean square fluctuation (MSF) of Cα position for the ensemble with the Cα MSF between the bound and unbound conformation and the Cα MSF calculated from the crystallographic B-factors94 of the unbound structure (Figure 3.4C), reveals good qualitative agreement. The most flexible regions are loop II (residues 27-34), loop I (residues 5-12), and the turn region between loops I and II (residues 15-24). Furthermore, NMR data of the solution state dynamics of the closely related toxin α95 also follow the diversity of the FAS2 ensemble generated using Rosetta. Broad agreement with these diverse measurements of protein flexibility suggests that conformational heterogeneity in the Rosetta ensembles qualitatively reflects inherent flexibility of the protein, validating their use as a representation of the unbound state. In fact, four conformers in the ensemble were closer to the bound state than the unbound state, with BB_rmsds of 0.90 Å, 0.94 Å, 1.06 Å and 1.10 Å compared to a BB_rmsd of 1.15 Å for the unbound crystal structure.

Figure 3.4 Backbone variability of FAS2 (A) Bound (red) and unbound (blue) FAS2 conformations. (B) Ensemble generated by Rosetta from the unbound FAS2 conformation. (C) Cα mean-squared fluctuation (MSF) for the Rosetta ensemble (black circles), the Cα MSF between the bound and unbound FAS2 conformation (red squares), and the Cα MSF calculated from the crystallographic B-factors94 from 1FSC (green triangles).

To capture the results using the four docking methods, Figure 3.5 presents the lowest energy structure and docking funnel plots for docking FAS2 to the unbound AChe structure as a function of both ligand RMSD (L_rmsd) and the fraction of native contacts recovered (fnat). The upper bound of accuracy is represented by KL-docking using the bound crystal structure of FAS2 (Figure 3.5A), which produces a lowest-energy structure with a CAPRI accuracy rating of high quality, with an L_rmsd of 1.1 Å and an fnat of 0.74. In contrast, KL-docking using the unbound crystal structure of FAS2 (Figure 3.5B) does not produce a docking funnel towards the bound complex and contains numerous false positives, or non-native structures with low energy, including the lowest-energy structure. CS-docking (Figure 3.5C) produces a lowest-energy structure of medium quality with an L_rmsd of 2.3 Å and an fnat of 0.45. IF-docking (Figure 3.5D) shows a more pronounced docking funnel than the KL method, but also contains a number of false positives, including the lowest-energy structure. Finally, like CS-docking, CS/IF-docking (Figure 3.5E) produces a lowest-energy structure of medium quality, with an L_rmsd of 2.5 Å and an fnat of 0.61.

The high level of accuracy of the interface in the lowest energy structure from the CS/IF method is illustrated in Figure 3.6, where both the position of FAS2 relative to AChe and side-chain orientations on both partners are recovered closely. Hydrogen-bond donor atoms from one partner are within 4 Å of hydrogen-bond acceptor atoms from the other partner for 9 of the 15 hydrogen bonds, the unusual hydrophobic stacking interaction between Met33 of FAS2 and Trp279 of AChe90 is recovered, and overall 61% of the native residue-residue contacts are satisfied. The only major interactions not recovered in the model compared to the bound structure are the polar and hydrophobic interactions between the C-terminal Tyr61 of FAS2 and Lys341, Pro76, and Phe75 of AChe.
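The fraction of native contacts (fnat) quoted throughout this section can be sketched as a simple set operation. This is an illustrative sketch only: the function name and the representation of a contact as a (receptor residue, ligand residue) pair are assumptions, and the distance cutoff used to define a contact is left to the caller because it is not specified in this section.

```python
def fraction_native_contacts(native_contacts, model_contacts):
    """Return fnat: the fraction of residue-residue contacts in the native
    complex that are also present in the model. Each contact is assumed to
    be a hashable (receptor_residue, ligand_residue) pair."""
    native = set(native_contacts)
    if not native:
        raise ValueError("native complex has no interface contacts")
    return len(native & set(model_contacts)) / len(native)


if __name__ == "__main__":
    # Made-up residue labels, for illustration only.
    native = {("R10", "L5"), ("R12", "L7"), ("R40", "L30"), ("R41", "L33")}
    model = {("R10", "L5"), ("R12", "L7"), ("R40", "L30"), ("R99", "L1")}
    print(fraction_native_contacts(native, model))
```

Non-native contacts in the model (the fourth pair in the example) do not lower fnat, which is why fnat is reported alongside L_rmsd and I_rmsd rather than on its own.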

Figure 3.5 Binding energy vs. L_rmsd for the docking methods Binding energy vs. L_rmsd, binding energy vs. fnat, and the top-ranked decoy (blue) superposed along the receptor (green) with the crystal structure of the bound ligand (red) for AChe binding to FAS2 (1FSS). (A) KL-docking using the bound FAS2 structure, (B-E) KL, CS, IF, and CS/IF-docking, respectively, using the unbound FAS2 structure. The unbound AChe structure (1FSC) is used in all cases. CAPRI criteria high quality decoys are shown in brown, medium quality in orange, and acceptable quality in tan.

The significant difference in docking accuracy achieved by the docking algorithms is a result of their respective treatments of backbone flexibility. Therefore it is useful to analyze how the different methods affect the backbone conformation of the ligand during docking. Figure 3.7A demonstrates the breadth and density of backbone conformational space being sampled by showing the BB_rmsd and UB_rmsd (Cα RMSD of interface residues to the unbound structure) for all 1000 decoys for the flexible docking methods. KL-docking samples the single unbound backbone conformation that is 1.15 Å from the bound structure (red line for comparison), while CS-docking samples 10 distinct backbone conformations, ranging from 0.90 Å to 1.64 Å from the bound structure. Since both IF and CS/IF docking include explicit backbone flexibility, each backbone conformation generated from docking is a unique result of the stochastic docking process. As a result, these methods sample a wide variety of backbone conformations. Although IF-docking samples nearly as wide a range of conformations as CS/IF-docking (0.75 Å to 2.0 Å compared to 0.63 Å to 1.93 Å), a large fraction of these conformations was closer to the unbound state, indicating it is unable to overcome conformational and energetic barriers to move towards the bound state. Overall, the FAS2 conformations from CS and CS/IF-docking were closer to the bound conformation than the unbound crystal structure in 40% and 42% of decoys respectively, compared to 22% in IF-docking.

Figure 3.6 Lowest energy structure for docking 1FSS with the CS/IF method. Detail of the lowest energy structure ligand (cyan) and receptor (green) for the CS/IF method for 1FSS superposed with the native structure (gray) along the receptor. This decoy has an L_rmsd of 2.5 Å, an I_rmsd of 0.94 Å, and an fnat of 0.61. Met33 of FAS2 and Trp279 of AChe are shown as spheres.

Ideally, to achieve appropriate energetic discrimination of near-native complexes in flexible docking, an energy funnel should exist not only in the rigid-body conformational space towards the bound docking orientation of the two partners (measured by binding energy vs. L_rmsd), but also in backbone conformation space towards the bound backbone conformation of both partners (measured by binding energy vs. BB_rmsd). Although in this study the receptor was kept fixed in the unbound conformation, an energy funnel towards the bound conformation of the ligand may still be observed.

Figure 3.7B shows the binding energy and the BB_rmsd for all decoys from CS, IF, and CS/IF-docking, with near-native decoys colored in tan, orange and brown for acceptable, medium and high quality predictions, respectively (compare to the energy funnel in rigid-body conformation space, Figure 3.5C-E). Docking decoys close to the native docked orientation should have lower energies for conformers closer to the bound conformation. Overall, a distinct energy funnel is not observed in the dimension of BB_rmsd for any of the three flexible docking methods, although in both CS and CS/IF-docking, the decoy with the lowest binding energy also had a relatively low BB_rmsd (1.0 Å and 1.2 Å, respectively). The lack of an energy funnel in backbone conformation space could be due either to sampling too few backbone conformations or to a deficiency in the energy function in discriminating bound-like conformers during docking. Alternately, differences between the bound and unbound receptor conformation (which is kept fixed in the unbound form) could force the ligand to adopt a binding-competent conformation slightly away from the bound conformation, obscuring or eliminating an energy funnel in this relatively narrow range of conformation space (BB_rmsd of 0.9-1.3 Å).

Figure 3.7 Backbone sampling and discrimination for 1FSS (A) The BB_rmsd and UB_rmsd in each decoy output by CS, IF and CS/IF-docking respectively. Red lines show the BB_rmsd and UB_rmsd of the unbound and bound conformations respectively. (B) Binding energy vs. BB_rmsd, for CS, IF, and CS/IF-docking, respectively. CAPRI criteria high quality decoys are shown in brown, medium quality in orange, and acceptable quality in tan.

3.4.4 Flexible docking outperforms rigid-body docking

We applied the four docking methods to a set of 21 target complexes using crystal structures for both the ligand and receptor. A docking method was said to produce a ‘hit’ if it passed two criteria: 1) the lowest-energy structure was of at least medium quality and 2) at least 5 of the 10 lowest-energy structures were of at least medium quality. This represents a relatively strict criterion in which a docking run is deemed successful only if its lowest-energy structure is of good quality and it has converged on that solution. We also performed KL-docking of the bound ligand with the unbound receptor to serve as a control. Table 3.3 shows the number of medium or high quality decoys within the 10 top-scoring structures (N10) and the L_rmsd, I_rmsd and fnat of the top-scoring decoy. Figure 3.8A summarizes these results as a histogram of hit quality, and funnel plots for each method for several selected targets are presented in Supplementary Figure 3.1. KL-docking with the bound ligand structure produced 15 total hits, with 12 of high quality and 3 of medium quality, and represents the upper bound of results achievable through modeling backbone flexibility in the ligand. KL-docking using the unbound crystal structure produced 8 total hits, with 3 of high quality and 5 of medium quality, demonstrating the potential for improvement in modeling flexibility in the ligand backbone conformation.
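The two-part hit criterion can be sketched as follows. The hit test follows the definition given above; the CAPRI thresholds in `capri_quality` are not stated in this section and are the commonly cited CAPRI cutoffs as assumed here, so they should be checked against the CAPRI assessment papers before reuse. All function names are hypothetical.

```python
def capri_quality(fnat, l_rmsd, i_rmsd):
    """Return 3 (high), 2 (medium), 1 (acceptable) or 0 (incorrect).
    Thresholds are the commonly cited CAPRI cutoffs, assumed here:
    a model qualifies on fnat plus either L_rmsd or I_rmsd."""
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return 3
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return 2
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return 1
    return 0


def is_hit(ranked_decoys):
    """ranked_decoys: (fnat, l_rmsd, i_rmsd) tuples sorted by binding energy,
    lowest energy first. A run is a hit if the lowest-energy decoy is at
    least medium quality and at least 5 of the 10 lowest-energy decoys are
    at least medium quality."""
    top10 = [capri_quality(*decoy) for decoy in ranked_decoys[:10]]
    if not top10:
        return False
    n10 = sum(q >= 2 for q in top10)
    return top10[0] >= 2 and n10 >= 5


if __name__ == "__main__":
    medium = (0.45, 2.3, 1.2)     # values quoted for CS-docking of 1FSS
    wrong = (0.0, 20.0, 9.0)      # made-up non-native decoy
    print(is_hit([medium] * 5 + [wrong] * 5))   # converged, good top decoy
    print(is_hit([wrong] + [medium] * 9))       # false positive ranked first
```

The second example shows why the criterion is strict: nine near-native decoys in the top ten still do not count as a hit when a false positive has the lowest energy.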

Figure 3.8 Histogram of hit quality for each docking method Histogram of hit quality (quality of the top-ranked decoy for all runs with at least 5 of the 10 top-scoring decoys of medium or high quality) for each docking method for (A) crystal structure targets and (B) NMR targets. CAPRI criteria high-quality is shown in brown, medium-quality in orange.

As seen in Figure 3.8A, CS/IF-docking came closest to the quality of KL-docking with the bound ligand conformation, producing 13 total hits with 7 high quality and 6 medium quality predictions. CS-docking performed comparably to CS/IF-docking, with 13 total hits, and 7 and 6 high and medium quality predictions, respectively. IF-docking produced 11 hits with 4 and 7 high and medium-quality predictions, respectively. At least one of the flexible docking methods produced a hit in 8 of the 13 cases in which KL-docking did not, and in the 8 cases where KL-docking did produce a hit, in 4 cases at least one of the flexible docking methods produced a higher quality hit. Overall, there was substantial overlap in the improvement in performance of the flexible docking methods compared to KL-docking, with 8 of 21 cases in which two or three of the flexible docking methods produced a top-ranked decoy of a higher CAPRI quality than KL-docking (e.g. 2JEL in Table 3.3; funnel plots in Supplementary Figure 3.2).

Assessing ensemble docking and explicit backbone minimization

In a comparison between the flexible docking methods within each target, the two ensemble docking methods (CS and CS/IF) combined to produce 14 top-ranked decoys that were highest in fnat, compared to 5 for top-ranked decoys by the non-ensemble docking methods (KL and IF). Furthermore, the CS and CS/IF methods had, overall, a greater number of hits and a higher accuracy of hits. The larger breadth of conformations sampled by the CS methods is critical to the success of the flexible docking methods, even in complexes with relatively small conformation changes.

In a number of cases, an improvement in docking performance was observed despite the fact that no conformers in the ensemble were present that were closer to the bound structure than the unbound structure. Although initially counterintuitive, this observation agrees with the findings of both Grünberg et al.41 and Smith et al.18 and suggests that there is inherent value in allowing backbone conformational variability in docking, irrespective of whether the bound conformation is being sampled. Even a small, relatively homogeneous ensemble of ten structures can make a significant difference in docking compared to a single structure.

The explicit backbone minimization methods (IF and CS/IF) produced greater sampling of higher fnat decoys, as seen in the example of 1FSS (Figure 3.5). Similarly, Krol et al.74 observed increased fnat when refining near-native decoys using MD, possibly as the result of the formation of a greater number of energetically favorable contacts along a putative docking interface when minimizing or relaxing. The increase in sampling is not clearly reflected in the overall results, however, because discrimination is more difficult. In the case of 1CSE, the top-ranked decoy is near-native in the CS method, but not the CS/IF method, despite the fact that both methods use an identical ensemble of ligand conformations. Explicit backbone minimization improves the binding energy across all decoys, and in this case, the binding energy of a non-native docking orientation is improved more than the binding energy of the near-native docked orientation, leading to a false positive. The loss of a near-native lowest energy structure is also observed in two cases (1BQL and 1WQ1) when comparing KL-docking with IF-docking. In summary, except in cases where explicit backbone minimization results in a false positive, it generally improves the quality of the top-ranked decoy. Indeed, where CS and CS/IF-docking both produced a hit, backbone minimization in CS/IF-docking leads to an increase in fnat in 8 of 12 cases.

3.4.5 Ensemble docking with NMR targets

We applied these docking methods to a set of 15 targets in which the unbound ligand structure is an NMR ensemble. A comparison of the BB_rmsd of both the unbound crystal structure and the closest conformer in the NMR ensemble to the bound crystal structure in Table 3.2 reveals a significant degree of variability in the NMR ensemble. Still, the closest NMR conformer to the bound conformation is often substantially closer than an arbitrary conformer in the ensemble, providing a means for overcoming the structural uncertainties.

Figure 3.9 NMR docking results for 1ACB using the CS method. (A) Binding energy vs. L_rmsd. (B) Binding energy vs. BB_rmsd. (C) Rosetta-refined NMR ensemble. (D) Lowest energy conformer from CS-docking (purple) superposed on the bound structure (red) and the first model in the NMR ensemble (blue). (E) Lowest energy structure from the CS method (receptor, green; ligand, purple) superposed on the native complex (gray).

Table 3.4 shows the docking accuracy for the four methods using the NMR structure of the ligand, with rigid-body docking results with the unbound and bound crystal structure provided for comparison. Figure 3.8B summarizes these results as a histogram of hit quality, and Supplementary Figure 3.2 provides docking funnels for each target across all the methods. For KL and IF-docking, the starting structure was the first conformer in the NMR structure coordinate file from the PDB (Model 1). For the CS and CS/IF methods the entire NMR ensemble was used.

KL-docking with the bound ligand conformation produced 9 hits, all of high quality. KL-docking with the unbound ligand conformation produced 7 hits, 4 of high quality and 3 of medium quality. In contrast, KL-docking with the NMR ensemble Model 1 produced 3 hits, all of medium quality, demonstrating the difficulties inherent in docking NMR structures compared to crystal structures. Overall, while none of the flexible docking methods were able to approach the accuracy of docking with the bound ligand conformation, CS-docking approached the results of KL-docking with the unbound crystal structure, producing 8 hits, 2 of high quality and 6 of medium quality. CS/IF-docking produced 6 hits, 3 of high quality and 3 of medium quality, while IF-docking produced 3 hits, 2 of high quality and 1 of medium quality.

Two trends can be observed from the data. First, in contrast to the crystal structure docking targets, with one exception, IF-docking only produces a hit if KL-docking produces a hit. When IF-docking does produce a hit, it generally improves the accuracy of the hit, for example in 2PTC, where IF-docking is able to produce a high-quality hit where KL-docking produces a medium-quality hit. A potential explanation is that the single-backbone methods (KL and IF) only produce hits in cases where the NMR structure is close to the bound conformation: for all three cases in which IF-docking produced a hit, the first conformer in the NMR ensemble (Model 1) has a BB_rmsd in the lower range among NMR targets, at 0.94 Å, 1.12 Å and 1.16 Å for 1BRS, 2KAI, and 2PTC, respectively. Second, the ensemble docking methods perform significantly better than the single-backbone methods, especially for targets with a BB_rmsd of the first conformer in the NMR ensemble of greater than 1.2 Å. As a representative example, we examine the docking of α-chymotrypsin with the NMR solution structure of eglin-C (1ACB). CS-docking produces a moderate docking funnel (Figure 3.9A) with a Z-score of -0.95 and a medium-quality hit. Figure 3.9B shows the binding energy plotted against the BB_rmsd for each decoy, which, in contrast to the example of 1FSS (Figure 3.7B), shows a pronounced energy funnel towards the bound backbone conformation.

Out of the considerable diversity in the entire 20-model NMR ensemble of eglin-C (Figure 3.9C), CS-docking successfully selects the conformer closest to the bound conformation (1.9 Å BB_rmsd, Figure 3.9D-E). In contrast, the first model in the NMR ensemble used in the (unsuccessful) KL and IF-docking is much further from the bound conformation (3.9 Å BB_rmsd, Figure 3.9D). Similar results are observed for other cases (e.g. 1EAW and 1CSE) and also for CS/IF-docking (funnel plots in Supplementary Figure 3.2).

The issues with near-native discrimination observed in docking crystal structures when using induced fit were present in the results for NMR targets as well. In 1CHO, CS-docking produced a medium-quality hit while CS/IF-docking failed to produce even a single medium-quality decoy among the 10 top-scoring decoys. Likewise, CS-docking produced a medium-quality hit for 2BTF while the top-scoring decoy from CS/IF-docking was of acceptable quality and only 2 of the top 10 decoys were of medium or better quality. In a surprising number of cases, the top-ranked decoy from CS/IF-docking was of acceptable quality. In 3 cases (1KTZ, 1CHO, 2PCC) CS/IF-docking appears to have converged on the solution, producing at least 5 of the top 10 decoys at acceptable quality. For 1KTZ and 2PCC this represents an improvement over the other docking methods; for 1CHO it is a decrease in accuracy compared to CS-docking.

Interestingly, in 4 of 15 cases CS/IF-docking produced a top-ranked decoy that was of higher accuracy than KL-docking with the unbound crystal structure, and in another 5 cases it performed equally well. In 2 of those 4 cases (2KAI and 1BRS), KL-docking using just a single conformer from the NMR ensemble outperformed KL-docking with the unbound crystal structure, suggesting that the improvement in docking is a result of differences between all NMR conformers and the crystal structure, and not the use of multiple NMR conformers. Nonetheless, excluding those 2 cases, CS/IF-docking performs as well as or better than rigid-backbone docking with the crystal structure in 8 of 13 cases. By using an ensemble docking with minimization method it is possible for the first time to locally dock NMR structures with comparable accuracy to rigid-body docking with crystal structures.

PDB    KL(B)                      KL                         CS
       N10  Fnat  Lrmsd  Irmsd    N10  Fnat  Lrmsd  Irmsd    N10  Fnat  Lrmsd  Irmsd
2SIC    10  0.85    3.9   0.51     10  0.72    1.8   0.38     10  0.74    3.6   0.59
1MAH    10  0.64   0.23   0.13      3  0.78    2.3   0.66      8  0.59    1.6   0.65
1CHO    10  0.75   0.76   0.31     10  0.53    1.8   0.78      5  0.81    1.6   0.47
2PTC    10  0.60    2.7   0.58     10  0.48    3.5   0.86     10  0.43    4.3    1.0
2BTF     6  0.86    1.0   0.38      8  0.44    2.6    1.2      7  0.45    3.0    1.9
1ACB    10  0.68   0.88   0.40      6  0.38    6.1    1.3     10  0.51    3.3   0.84
1BRC    10  0.36    9.2    2.0      9  0.43    4.2    1.5      9  0.58    3.2   0.81
2SNI    10  0.76   0.91   0.25      8  0.38    5.4    1.5     10  0.51    5.7    1.2
1BQL     7  0.41    3.1    1.1      9  0.39    5.1    1.6     10  0.74    5.2    1.3
1UGH    10  0.70   0.43   0.19      6  0.37    5.2    2.5      3  0.23    5.6    2.8
1WQ1     4  0.39    4.9    1.5      0  0.17    5.8    3.7      0  0.26    3.6    2.1
2KAI     6  0.72    2.5   0.47      0  0.04     10    4.5      4  0.03    8.4    3.2
2JEL     8  0.79   0.74   0.26      7  0.10    9.3    4.6      7  0.77    1.2   0.43
1BVK     0  0.11     20    5.9      0  0.25    9.7    4.7      0  0.12     30     10
1CSE    10  0.64    1.3   0.36      0  0.00     13    6.0      5  0.32    2.7    1.2
1DFJ     0  0.00     12    7.1      0  0.00     19    7.3      0  0.00     22     11
1FSS    10  0.74    1.1   0.43      1  0.05     12    7.9      5  0.45    2.3    1.2
1TGS    10  0.74    1.7   0.50      5  0.09     15    9.4     10  0.64    2.4   0.87
1CGI     3  0.69    1.2   0.42      0  0.06     17    9.5      0  0.08     13    5.0
1MLC     4  0.08     19    9.4      2  0.71     23     11      1  0.14     20    9.1
1BRS     4  0.09     18    9.5      1  0.03     19     11      8  0.00     18    9.5
Totals  17(15)                     9(8)                       13(13)
CAPRI   13 ***, 4 **               4 ***, 5 **, 4 *           7 ***, 6 **, 3 *

CAPRI row: number of targets by CAPRI rating of the lowest-energy structure (the per-target alignment of the rating column is not recoverable).

PDB

N10

Fnat

CS Lrmsd

Irmsd

CAPRI

N10

Fnat

IF Lrmsd

Irmsd

CAPRI

N10

Fnat

CS/IF Lrmsd

Irmsd

CAPRI

2SIC 1MAH 1CHO 2PTC 2BTF 1ACB 1BRC 2SNI 1BQL 1UGH 1WQ1 2KAI 2JEL 1BVK 1CSE 1DFJ 1FSS 1TGS 1CGI 1MLC 1BRS

10 8 5 10 7 10 9 10 10 3 0 4 7 0 5 0 5 10 0 1 8

0.74 0.59 0.81 0.43 0.45 0.51 0.58 0.51 0.74 0.23 0.26 0.03 0.77 0.12 0.32 0.00 0.45 0.64 0.08 0.14 0.00

3.6 1.6 1.6 4.3 3.0 3.3 3.2 5.7 5.2 5.6 3.6 8.4 1.2 30 2.7 22 2.3 2.4 13 20 18

0.59 0.65 0.47 1.0 1.9 0.84 0.81 1.2 1.3 2.8 2.1 3.2 0.43 10 1.2 11 1.2 0.87 5.0 9.1 9.5

*** *** *** ** ** *** *** ** ** * * * ***

8 6 6 10 4 10 9 7 6 9 0 6 10 0 2 0 6 7 0 3 9

0.67 0.64 0.39 0.55 0.41 0.57 0.57 0.40 0.07 0.30 0.09 0.50 0.68 0.08 0.13 0.00 0.00 0.12 0.16 0.48 0.43

3.4 2.4 6.3 3.0 2.2 2.9 2.8 6.1 22 3.4 6.9 5.9 1.3 19 18 25 13 16 13 2.3 3.0

0.52 1.1 1.5 0.68 1.5 1.1 0.72 1.6 11 1.9 4.1 1.2 0.46 9.2 7.4 12 8.1 9.3 5.8 1.1 1.5

*** ** ** *** ** ** *** **

10 6 5 10 9 10 4 10 6 5 0 10 9 0 2 0 7 7 0 2 5

0.74 0.69 0.54 0.62 0.69 0.60 0.36 0.77 0.33 0.31 0.07 0.58 0.56 0.24 0.03 0.00 0.55 0.55 0.28 0.14 0.06

3.9 3.7 7.7 2.3 1.9 3.8 8.2 2.3 3.6 5.0 6.5 2.9 1.6 10 13 18 3.2 4.9 19 20 19

0.52 1.4 1.7 0.75 0.67 0.82 1.1 0.82 1.4 2.5 4.2 0.69 0.55 4.3 5.5 8.5 1.4 2.4 10 9.2 11

*** ** ** *** *** *** ** *** ** **

Totals

** ** ***

13(13)

** ** ***

** ** 13(11)

*** ***

** **

14(13)

Table 3.3 Docking results for crystal structure targets. KL(B) is KL-docking using the bound ligand, and KL, CS, IF, CS/IF are the four docking methods using the unbound ligand. N10 is the number of structures among the 10 lowest-energy structures that are of at least medium quality. The fnat, L_rmsd, I_rmsd, and CAPRI rating are for the lowest-energy structure produced. CAPRI ratings are acceptable (*), medium quality (**), and high quality (***). Totals show the number of targets for which the lowest-energy structure was of at least medium quality and, in parentheses, the number of hits for each method (a hit is defined as having a lowest-energy structure of at least medium quality and N10 > 4).
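The Totals entries follow mechanically from the caption's definitions, which the short sketch below illustrates for the CS/IF column. The ratings and N10 values are transcribed from Table 3.3, with "-" standing for a structure below acceptable quality; the function name `totals_and_hits` and its parameters are illustrative choices, not part of the original analysis.

```python
# Rank order of CAPRI ratings; "-" marks a structure below acceptable quality.
RANK = {"-": 0, "*": 1, "**": 2, "***": 3}


def totals_and_hits(ratings, n10, min_rank=2, n10_cutoff=4):
    """Return (targets whose lowest-energy structure is at least medium, hits).

    A hit additionally requires N10 > n10_cutoff, as in the table caption.
    """
    total = hits = 0
    for rating, n in zip(ratings, n10):
        if RANK[rating] >= min_rank:
            total += 1
            if n > n10_cutoff:
                hits += 1
    return total, hits


# CS/IF column of Table 3.3, in the table's target order (2SIC ... 1BRS).
cs_if_ratings = "*** ** ** *** *** *** ** *** ** ** - *** *** - - - ** ** - - -".split()
cs_if_n10 = [10, 6, 5, 10, 9, 10, 4, 10, 6, 5, 0, 10, 9, 0, 2, 0, 7, 7, 0, 2, 5]

total, hits = totals_and_hits(cs_if_ratings, cs_if_n10)
print(f"{total}({hits})")  # 14(13), the CS/IF Totals entry
```

The single target that counts toward the total but not toward the hits is 1BRC, whose lowest-energy structure is of medium quality but whose N10 is only 4.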

      | KL(B)                      | KL(Ub)                     | KL
PDB   | N10 Fnat Lrmsd Irmsd CAPRI | N10 Fnat Lrmsd Irmsd CAPRI | N10 Fnat Lrmsd Irmsd CAPRI
1KTZ  |  10 0.80   3.0  0.50 ***   |   9 0.85   1.9  0.47 ***   |   2 0.03    25   9.3 -
1EAW  |   5 0.40   3.4   1.2 **    |   6 0.50   2.3  0.69 ***   |   2 0.25   8.1   2.6 *
1CHO  |  10 0.75  0.76  0.31 ***   |  10 0.53   1.8  0.78 ***   |   8 0.22    10   2.2 *
2PTC  |  10 0.60   2.7  0.58 ***   |  10 0.48   3.5  0.86 ***   |   6 0.43   5.4   1.4 **
1CSE  |  10 0.64   1.3  0.36 ***   |   8 0.57   3.6   1.0 **    |   0 0.07    15   7.3 -
2BTF* |   5 0.86   1.0  0.38 ***   |   8 0.44   2.6   1.2 **    |   5 0.03    21    13 -
1ACB  |  10 0.68  0.88  0.30 ***   |   6 0.38   6.1   1.3 **    |   0 0.18    10   2.9 *
1B6C  |   9 0.70   2.3  0.56 ***   |   7 0.43   8.6   1.4 **    |   0 0.07    22    13 -
1AK4  |  10 0.70   4.6  0.60 ***   |   1 0.27    19   2.5 *     |   0 0.00    24   9.3 -
2PCC  |   0 0.00    27    14 -     |   1 0.27   8.7   3.7 *     |   0 0.06    14   5.5 -
2KAI  |   6 0.72   2.5  0.47 ***   |   0 0.04    10   4.5 -     |   5 0.47   6.6   1.3 **
1BVK  |   0 0.11    20   5.9 -     |   0 0.25   9.7   4.7 *     |   1 0.24    16   7.6 -
1AY7  |   6 0.73   3.2  0.86 ***   |   0 0.10    16   7.7 -     |   0 0.00    20   9.4 -
1MLC  |   4 0.08    19   9.4 -     |   2 0.07    23    11 -     |   0 0.08    26   9.7 -
2BRS  |   4 0.09    18   9.5 -     |   1 0.03    19    11 -     |   7 0.63   2.1   1.0 **
Totals| 11(11)                     | 8(8)                       | 3(3)

      | CS                         | IF                         | CS/IF
PDB   | N10 Fnat Lrmsd Irmsd CAPRI | N10 Fnat Lrmsd Irmsd CAPRI | N10 Fnat Lrmsd Irmsd CAPRI
1KTZ  |   0 0.29   6.3   4.2 *     |   1 0.46   4.9   1.8 **    |   1 0.30   4.8   4.1 **
1EAW  |   6 0.38   4.8   1.8 **    |   7 0.04    10   5.0 -     |   8 0.35   3.7   1.5 **
1CHO  |   5 0.39   8.9   1.9 **    |   1 0.44   9.0   2.1 *     |   0 0.48   9.8   2.2 *
2PTC  |  10 0.54   5.5   1.1 **    |  10 0.35   3.3   1.0 ***   |   9 0.57   4.0  0.75 ***
1CSE  |   3 0.03    13   5.6 -     |   1 0.16    14   6.8 -     |   2 0.26   5.6   1.8 **
2BTF* |   8 0.48   2.6   1.7 **    |   0 0.10    15   8.8 -     |   2 0.30   6.4   3.1 *
1ACB  |  10 0.25   5.1   1.7 **    |   0 0.00    19   6.8 -     |   8 0.31   4.0   1.7 **
1B6C  |   0 0.13    13   5.2 -     |   0 0.07    16   7.7 -     |   0 0.07    14   4.9 -
1AK4  |   0 0.02    28   7.6 -     |   0 0.21    15   3.6 *     |   0 0.18    15   3.7 *
2PCC  |   0 0.00    20   9.3 -     |   0 0.38   7.7   3.1 *     |   0 0.25   8.5   3.4 *
2KAI  |   9 0.55   4.5  0.90 ***   |   9 0.55   4.3   0.9 ***   |   8 0.50   5.2   1.1 **
1BVK  |   0 0.06    19    10 -     |   4 0.00    21    12 -     |   0 0.24   9.5   4.2 *
1AY7  |   7 0.67   2.1  0.99 ***   |   6 0.06    20   8.8 -     |   5 0.72   2.2  0.91 ***
1MLC  |   0 0.05    19   9.4 -     |   0 0.03    17   9.2 -     |   0 0.25   8.3   3.4 *
2BRS  |   6 0.67   2.5   1.2 **    |   6 0.58   3.9  1.89 **    |  10 0.75   2.9  0.87 ***
Totals| 8(8)                       | 4(3)                       | 8(6)

(-) lowest-energy structure rated below acceptable quality.

Table 3.4 Docking results for NMR targets. KL(B) is rigid-body docking using the bound structure, KL(Ub) is rigid-body docking using the unbound crystal structure, and KL, CS, IF, CS/IF are the four docking methods using the ligand solution-state NMR structure. All column descriptions are the same as in Table 3.3. Targets are sorted by I_rmsd in the KL(Ub) case.
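The comparison between CS/IF-docking of NMR ensembles and KL-docking of the unbound crystal structure can be retraced target by target. The sketch below is only an illustration: the ratings are transcribed from the CAPRI columns of Table 3.4 ("-" standing for below acceptable quality), and the helper names are assumptions introduced here.

```python
# Rank order of CAPRI ratings; "-" marks a structure below acceptable quality.
RANK = {"-": 0, "*": 1, "**": 2, "***": 3}

targets = ("1KTZ 1EAW 1CHO 2PTC 1CSE 2BTF 1ACB 1B6C "
           "1AK4 2PCC 2KAI 1BVK 1AY7 1MLC 2BRS").split()
# CAPRI rating of the lowest-energy structure, per target, from Table 3.4.
kl_ub = "*** *** *** *** ** ** ** ** * * - * - - -".split()
cs_if = "** ** * *** ** * ** - * * ** * *** * ***".split()


def better_targets(new, ref):
    """Targets where the new method's top decoy outranks the reference's."""
    return [t for t, a, b in zip(targets, new, ref) if RANK[a] > RANK[b]]


def at_least_equal(new, ref, exclude=()):
    """Count targets (minus exclusions) where new is rated >= reference."""
    return sum(
        1
        for t, a, b in zip(targets, new, ref)
        if t not in exclude and RANK[a] >= RANK[b]
    )


improved = better_targets(cs_if, kl_ub)
print(improved)  # ['2KAI', '1AY7', '1MLC', '2BRS']
print(at_least_equal(cs_if, kl_ub, exclude=("2KAI", "2BRS")))  # 8 (of 13)
```

Comparing by CAPRI rating alone is a coarse measure; two structures with the same rating can still differ in fnat and rmsd.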


3.4.6 Ensemble docking with antibody homology models

Previously published in: Sivasubramanian, A., Sircar, A., Chaudhury, S., Gray, J.J., (2008) High-resolution homology modeling of antibody Fv regions using knowledge-based techniques, de novo loop modeling and docking. Proteins. 74:497-514.

Reprinted with the permission of the publisher, Elsevier, Inc. A portion of the Results section is reproduced below with minor revisions.

We sought to expand the use of ensemble docking towards docking homology models. In our lab, we developed an antibody homology modeling method called RosettaAntibody,96 which generates a large number of structural models for a given antibody sequence. In order to account for the structural inaccuracies present in using any single homology model for docking, we used ensemble docking on ten of the lowest-energy homology models generated by RosettaAntibody for a benchmark set of antibody-antigen complexes96 and compared its performance to two cases: rigid-body docking of the homology model, among the ten, that is closest to the crystal structure (low-rmsd), and rigid-body docking of the lowest-energy homology model among the ten (low-score), along with docking with the crystal structure of the antibody as a control. Ensemble docking using the set of the ten top-scoring antibody models produces two high-quality and five medium-quality predictions, comparable to the one high-quality and eight medium-quality predictions when docking with the single low-rmsd RosettaAntibody homology model for a benchmark set of 15 targets.

Figure 3.10 illustrates a typical result for docking using RosettaAntibody homology models. Docking with the unbound crystal structure of the antibody produces a deep energy funnel with high-quality predictions. Docking using the low-score homology model produces some medium hits, but no energy funnel is observed. Docking with the low-rmsd homology model produces better results, with a weak funnel and a lowest-energy structure that is of medium quality. Finally, ensemble docking with the ten lowest-energy homology models shows a very distinct funnel, and a high-quality prediction among the top 5 lowest-energy structures. In real-world antibody-docking tasks where the use of a homology model of the antibody is necessary, there is no a priori information about which homology model, among a set of candidate structures developed by a homology modeling algorithm, is closest to the crystal structure. Ensemble docking performed comparably to the ‘low-rmsd’ rigid-body docking scenario, that is, when knowledge of the antibody crystal structure was used to select the closest homology model, and significantly better than the ‘low-score’ rigid-body docking scenario, which reflects a blind docking run. These results demonstrate that by simultaneously docking a set of homology models, we can achieve significant improvements in docking performance over using a single homology model in blind predictive docking.
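How an ensemble run can favor the more binding-competent models without knowing in advance which one is closest to the crystal structure can be pictured with a small sketch. The passage above does not state how a model is chosen at each step, so the Boltzmann weighting used here, the function name `boltzmann_weights`, the temperature factor `kT`, and the energies are all assumptions for illustration and not the RosettaDock implementation.

```python
import math


def boltzmann_weights(energies, kT=1.0):
    """Return selection probabilities proportional to exp(-E / kT).

    Energies are shifted by their minimum before exponentiation so that
    large negative scores do not overflow.
    """
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / kT) for e in energies]
    partition = sum(factors)
    return [f / partition for f in factors]


# Hypothetical binding energies of four homology models at one docked
# orientation (arbitrary units, invented for this example).
model_energies = [-12.0, -15.5, -9.8, -15.0]
weights = boltzmann_weights(model_energies)

# The model with the most favorable energy receives the largest weight.
best = max(range(len(weights)), key=weights.__getitem__)
print(best)
print([round(w, 3) for w in weights])
```

Under this kind of weighting, models with poor binding energies at a given orientation are rarely selected, while models of similar energy remain interchangeable.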


Figure 3.10 Ensemble docking with antibody homology models. Ensemble docking using the ten lowest-energy homology models performs better than rigid-body docking using a single homology model and performs comparably to rigid-body docking using the lowest-RMSD homology model among the ten models used in the ensemble. High-quality predictions are in red, medium quality in green, and acceptable quality in blue.

3.4.7 Ensemble docking in CAPRI

We used the ensemble docking method on a number of CAPRI targets to test its performance in docking structures that contained flexible or disordered loops on the protein surface, and in cases where one of the input unbound structures was an NMR structure. Here we summarize our results for two representative cases, Target 29 and Target 41.

Target 29: Trm8-Trm82 complex

Trm8 and Trm82 are two yeast proteins that form a complex and are required for 7-methylguanosine modification of tRNA. We were given the bound structure of Trm82 (PDB code: 2VDU:B) and the unbound structure of Trm8 (PDB code: 2VDV). Inspection of a sequence alignment of Trm8 revealed a surface loop (residues 183-198) that was highly conserved across eukaryotes and could potentially play a role in forming a complex with Trm82.97 Unfortunately, this loop is largely disordered in the unbound Trm8 structure (Fig 3.11). We used Rosetta loop building to create a candidate complete loop conformation on the unbound Trm8 across residues 183-198, and then used that structure as an input for full-atom relax, as described in the Methods section. We used an ensemble of ten relaxed structures and docked them with the bound conformation of Trm82. We selected the ten lowest-energy clusters of solutions from the global docking run to serve as our final CAPRI predictions. Unfortunately, we did not make a single successful prediction. However, retrospective analysis showed that our loop-building and all-atom relax method was successful in recovering the disorder-to-order transition of the highly conserved surface loop, with remarkable structural similarity to the bound conformation (Fig 3.11).98 Although ultimately unsuccessful, these results demonstrate the potential of combining loop modeling and relaxation with ensemble docking to overcome the challenges of flexible and disordered regions in unbound structures.


Figure 3.11 Ensemble generation recovers loop motion in Target 29. The ensemble generated for Trm8 (purple) is superimposed on the bound (red) and unbound (blue) conformations. Trm82 (gray spheres) is shown in complex with Trm8. Ensemble generation recovers the transition of the loop from its disordered state in the unbound form to its ordered state in the bound form.

Target 41: E9DNase-Im2

The bacterial toxins known as colicin E DNases bind to cognate immunity (Im) proteins to prevent host toxicity. Target 41 was the non-cognate, but stable, complex of colicin E9 DNase and Im2. We were given an unbound NMR structure of Im2 (PDB code: 2NO8) containing fifty models, and the crystal structure of the unbound enzyme E9DNase (PDB code: 1FSJ). Previous studies of the homologous cognate complexes E9DNase-Im9 99 and E7DNase-Im7 revealed a shared binding motif.100 We relaxed all fifty models of Im2 as described in the Methods section to prepare them for use in ensemble docking. We superimposed E9DNase and Im2 on the E9-Im9 complex, and carried out local ensemble docking. We also carried out global ensemble docking of the two partners to evaluate the energetics of the local complex in the context of the global energy landscape. We submitted the ten lowest-energy structures as our final predictions. Among our ten predictions, eight were of at least acceptable quality (Lrmsd < 10.0 Å) and one was of medium quality, with an Lrmsd of 4.95 Å, Fnat of 0.59, and an Irmsd of 1.75 Å. A scatterplot of the binding energy and Lrmsd of each decoy in combined local and global runs shows good discrimination of near-native predictions (Fig 3.12). The best prediction, superimposed on the homologous cognate complex of E9-Im9, is shown as well.
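For readers unfamiliar with the quality categories, the sketch below classifies a prediction from its Fnat, Lrmsd, and Irmsd using the commonly cited CAPRI thresholds. It is an illustration only: the function name is an assumption introduced here, and the ratings reported in this chapter may have been assigned with a variant of these rules.

```python
def capri_rating(fnat, lrmsd, irmsd):
    """Classify a docked model with the commonly cited CAPRI thresholds.

    fnat  : fraction of native residue-residue contacts recovered
    lrmsd : ligand backbone rmsd in Angstroms
    irmsd : interface backbone rmsd in Angstroms
    Returns "***" (high), "**" (medium), "*" (acceptable), or "-" (incorrect).
    """
    if fnat >= 0.5 and (lrmsd <= 1.0 or irmsd <= 1.0):
        return "***"
    if fnat >= 0.3 and (lrmsd <= 5.0 or irmsd <= 2.0):
        return "**"
    if fnat >= 0.1 and (lrmsd <= 10.0 or irmsd <= 4.0):
        return "*"
    return "-"


# The best Target 41 prediction reported above: Fnat 0.59, Lrmsd 4.95, Irmsd 1.75.
print(capri_rating(0.59, 4.95, 1.75))  # **
```

The checks run from the strictest category downward, so a model that recovers many contacts but misses the high-quality rmsd cutoffs falls through to the medium category, as the Target 41 prediction does.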

Figure 3.12 Ensemble docking of NMR structures in Target 41. The binding energy vs. Lrmsd (Å) of each decoy relative to the homologous structure of E9-Im9 is shown on the left. The lowest-energy structure prediction is shown on the right, with E9 (green) and Im2 (cyan), superimposed on the homologous E9-Im9 complex structure.

3.5 Discussion

Conformer selection and induced fit are two fundamentally distinct kinetic mechanisms for protein binding, and here we attempt to model them as conformational search strategies within RosettaDock. RosettaDock’s MC algorithm samples different thermodynamic configurations of the system irrespective of time, limiting the conclusions we can draw about the physical validity of these two mechanisms. However, the success of these distinct conformational search strategies in recapitulating the binding models which inspired their design can be addressed. The success of CS-docking using both computationally generated ensembles and NMR ensembles suggests that binding-competent conformers do exist in the unbound ensemble, as demonstrated by Grunberg et al.,41 and that these conformers can be selected from the ensemble based on their favorable binding energies, both central tenets of the conformer selection binding model. Likewise, the success of IF-docking through energy gradient-based backbone minimization demonstrates that a binding-competent conformation can be reached through a low-energy pathway through conformational space from the unbound structure while in complex with the partner, a necessary component in the induced fit binding model. Finally, the success of CS/IF-docking, especially in the number of top-ranked structures of acceptable quality in the NMR targets, suggests that for larger conformational changes, both conformer selection over a broad range of conformational space and local induced fit may be necessary for binding, as proposed by both computational41 and experimental101 studies.

As observed in CAPRI, accommodating backbone flexibility is currently the biggest obstacle to accurate protein-protein docking. Although most of the crystal structure docking targets we tested were classified as ‘rigid-body’ by Mintseris et al.,84 modest backbone conformational changes between the bound and unbound state can still make high-resolution docking challenging.

In rigid-body global docking, Li et al.64 produced at least one medium-quality prediction among their 10 top-ranked predictions in 9 of 21 (43%) targets that overlapped with ours. In FFT-based global docking using MD-derived ensembles, Smith et al.18 produced at least one medium-quality prediction in their 10 top-ranked predictions for 4 of 12 (33%) targets that overlapped with ours. Using backbone minimization in local docking with Rosetta (very similar to our IF-docking), Wang et al.5 produced at least medium-quality models in their 3 top-ranked predictions in 9 of 17 (53%) targets that overlapped with ours. In our study, CS and CS/IF-docking produced a top-ranked medium-quality prediction in 13 and 14 of 21 crystal structure targets (62% and 67%), respectively. Although direct comparisons with other studies are difficult due both to differences in docking strategies (global vs. local docking) and to the different target sets used, the general improvement in performance observed in our study may be attributed to better discrimination between conformers due to a more physically realistic scoring function, and to the greater breadth of conformational sampling due to the use of an ensemble of unbound structures.

Although no flexible docking method clearly outperformed the others in cases where the single unbound conformation was relatively close to the bound conformation (BB_rmsd < 1.1 Å), in cases where the single conformation was relatively far from the bound conformation (BB_rmsd > 1.3 Å), particularly among the NMR targets, the ensemble docking methods (CS and CS/IF) significantly outperformed the backbone minimization method (IF). Furthermore, CS-docking, which was the most successful method in both target sets by some measures, is the least computationally expensive. However, the CS/IF method did show moderate improvements in some cases, particularly in increasing the number of native contacts. Therefore, one potential strategy for predictive docking would be to use CS-docking first for global searches where near-native discrimination and computational efficiency are paramount. Potential near-native orientations and their respective binding-competent conformers can then be used as starting structures for further local docking and refinement with IF-docking.

Our work included the first systematic study of docking using NMR solution-state structures. NMR ensembles contain conformational variation as a result of both methodological reasons, such as under-determination of protein structure, and biophysical reasons, such as solution-state protein dynamics.102 Our results suggest that while a single conformer may not be sufficient for docking, the use of the entire NMR ensemble can overcome structural uncertainties in the model to a degree that is comparable to rigid-body docking of unbound crystal structures. Conformational heterogeneity in NMR structural ensembles has demonstrated qualitative agreement with a number of quantitative measures thought to be related to protein dynamics, from experimental data such as S2 order parameters95 and crystallographic B-factors,94 to computational simulations using molecular dynamics,82; 94 essential dynamics103 or normal mode analysis.104 Although, to our knowledge, a direct relationship between this heterogeneity and binding-induced conformational changes has not been studied, our results show that the conformational variation found in NMR ensembles provides meaningful backbone conformational sampling in ensemble docking and may thus, to some degree, be relevant to the conformational changes associated with binding. Still, the difference in performance between docking with the NMR ensembles and the unbound crystal structures suggests that structure under-determination plays a predominant role in the conformational heterogeneity of NMR ensembles.

These flexible docking methods were inspired by biophysical models of protein binding, but they can be applied more broadly to address uncertainty in the bound backbone conformation in docking, whether that uncertainty is the result of true flexibility, i.e. binding-induced conformational changes, or of uncertainties in the initial unbound structures. Efficient methods for general sampling of conformational plasticity in proteins during docking could be applied to docking of homology models or low-resolution structures, where docking of an ensemble of multiple structural models instead of a single model may effectively increase the margin of error in structural modeling as applied to docking. We have recently demonstrated this in a separate study where docking of an ensemble of multiple antibody homology models to their respective antigens, using the CS-docking methods presented here, significantly outperforms the docking of a single homology model in recovering near-native docking solutions.105 Likewise, HADDOCK has shown several specific successes in docking ensembles from a variety of sources, from homology models to NMR structures.79; 80; 81 Together, these results suggest a fundamental robustness of ensemble docking for accommodating conformational plasticity.

There are a number of limitations of the methods for use in predictive docking efforts such as CAPRI. First, this study analyzed local docking only, but often global docking is necessary when no biochemical information exists to assist in prediction. A test of global docking for both target sets with all four docking methods was beyond the scope of this study, but successful local docking can improve global docking. Using RosettaDock, Gray et al.7 showed that global docking succeeded in 18 of 24 cases where local docking produced a medium or high-quality prediction and in none of the 6 cases where local docking produced, at best, an acceptable-quality prediction, suggesting that the increase in medium and high-quality models using local flexible docking in this study may directly translate to improvements in global docking. Second, effective energetic discrimination of near-native decoys with explicit backbone flexibility in docking remains a challenge in IF and CS/IF-docking. Although the use of binding energy in this study yielded improvements in discrimination compared to total energy, the existence of false positives still confounded accurate docking, perhaps as a result of inaccuracies in the energy function. Deficiencies in the energy function may include backbone torsion angle and other internal energy components, accurate modeling of electrostatics and side-chain protonation states, ordered water molecules at the interface, or the correct balance between scoring terms. Additionally, more sophisticated approaches to discriminating near-native funnels in rigid-body docking106 can be extended to include variables such as backbone conformation or conformer identity. Finally, large global conformational changes (e.g. domain hinge-motions) or large local conformational changes (e.g. loops) cannot yet be accurately modeled because neither the ensemble generation method nor the backbone minimization used in this study can capture conformational changes of that magnitude.

The docking methods presented here were developed to be largely independent of the method of ensemble generation, so the ensemble generation method can be tailored to accommodate specific types of conformational flexibility. Large global motions may be modeled using larger or more diverse ensembles generated from normal mode analysis,107 accelerated MD108 or essential dynamics.103; 109 Likewise, specific features of the protein structure can be varied between conformers in the ensemble, for example, the substrate binding-loop conformation of an enzyme, the conformation of a disordered region of an unbound crystal structure, or VH-VL orientation in antibody docking. Previous methods addressed specific types of conformational changes such as hinge motions or domain-domain motions with some predictive success.110; 111 Progress has been made towards the prediction of flexible regions of proteins that change conformation upon binding112 as well as generating ensembles of flexible regions that contain bound-like conformations,113 both of which can improve the quality of the ensembles generated for docking.

3.6 Conclusion

In this study we present three different methods for accommodating backbone flexibility in protein docking and show substantial improvements in overall docking accuracy, including sampling and discrimination, without using any a priori knowledge of the flexible regions of the ligand protein. These methods have been intentionally developed to be versatile and can be used in conjunction with ensembles derived from a wide variety of sources, including NMR data, MD simulations, loop ensembles, and homology modeling. Local docking using RosettaDock is featured in a number of successful CAPRI strategies,77; 78; 106; 114 and given the substantial improvements in performance over the standard RosettaDock algorithm, the new methods are a significant step forward in the state of the art of protein-protein docking towards general flexible docking. Incorporation of backbone conformational plasticity in docking might ultimately allow us to expand the list of “dockable” components beyond high-resolution crystal structures, to lower-resolution electron microscopy (EM) structures, NMR structures and homology models, which will be essential to applying predictive docking towards a structural understanding of protein interactions in conjunction with on-going genomic and proteomic efforts.


CHAPTER 4 MODELING ENZYME-SUBSTRATE SPECIFICITY USING FLEXIBLE DOCKING

Previously published as: Chaudhury, S. & Gray, J.J., (2009) Identification of specificity determinants in HIV-1 protease using computational peptide docking: implications for drug resistance. Structure. 17(12):1636-1648.

Reprinted with the permission of the publisher, Cell Press, Inc. with minor revisions.

4.1 Summary

In Chapter 3 we studied the use of flexible docking in predicting the structure of protein complexes. In this chapter we explore the use of flexible docking beyond structure prediction and the recovery of crystal structures, and into the identification of the structural and biochemical mechanisms that mediate protein interactions. For this study, we chose to study substrate recognition of the enzyme HIV-1 protease. Due to its biomedical relevance, there is an enormous wealth of structural and biochemical information on HIV-1 protease, making it a rich target for using flexible docking methods to probe the mechanisms that mediate substrate recognition. Finally, any insight we gain into the structure or function of the HIV-1 protease active site may be useful for downstream rational drug design applications.


Abstract

Drug-resistant mutations (DRMs) in HIV-1 protease are a major challenge to antiretroviral therapy. Protease-substrate interactions that are determined to be critical for native selectivity could serve as robust targets for drug design that are immune to DRMs. In order to identify the structural mechanisms of selectivity, we developed a peptide docking algorithm to predict the atomic structure of protease-substrate complexes and applied it to a large and diverse set of cleavable and non-cleavable peptides. Cleavable peptides showed significantly lower energies of interaction than non-cleavable peptides, with six protease active-site residues playing the most significant role in discrimination. Surprisingly, all six residues correspond to sequence positions associated with drug-resistant mutations, demonstrating that the very residues that are responsible for native substrate specificity in HIV-1 protease are altered during its evolution to drug resistance, and suggesting that drug resistance and substrate selectivity may share common mechanisms.

4.2 Introduction

Human immunodeficiency virus type 1 (HIV-1) protease is responsible for the processing of Gag and Pol polyproteins, making it critical for viral assembly and maturation and an important target for antiretroviral therapies. The ten Food and Drug Administration-approved protease inhibitors are substrate mimics that resulted from structure-based drug design efforts of the pharmaceutical industry. The efficacy of these drugs is significantly hampered by the emergence of drug-resistant mutations (DRMs) in HIV-1 protease that are thought to preferentially alter drug binding over substrate binding to the protease. Understanding the structural mechanisms of molecular recognition between HIV-1 protease and its substrates will be critical to the development of a second generation of protease inhibitors in the treatment of HIV-1 infection.

HIV-1 protease is an aspartyl protease dimer that processes the Gag and Pol polyproteins at 10 cleavage sites (hereafter referred to as endogenous substrates) by recognizing an 8-residue stretch surrounding the cleavage site. By convention, these substrate residues are denoted P4-P3-P2-P1-P1'-P2'-P3'-P4', where the scissile bond lies between the P1 and P1' residues. Substrate specificity of HIV-1 protease has been studied extensively; large databases of cleavable and non-cleavable peptides have been assembled 115; 116; 117; 118; 119; 120; 121 and the sequences for the ten endogenous substrates have been identified 122. Sequence-based methods including artificial neural networks and support vector machines have been developed to discriminate between cleavable and non-cleavable peptides 123; 124; 125. From these sequence-based methods in conjunction with experimental studies 126; 127; 128, a complex picture of HIV-1 protease selectivity has emerged. With the exception of two relatively common sequence motifs at P1-P1', aromatic-proline (Aro-P) and hydrophobic-hydrophobic (Hφ-Hφ) 129, there are few salient features in the sequences of cleavable peptides and a high degree of interdependence of various peptide residues 124. In an impressive study by Kontijevskis et al. 130, a statistical model developed from a large database of cleavable and non-cleavable peptides for nine different retroviral proteases identified a number of physico-chemical relationships between peptide and protease residues that accurately define and predict cleavability. Ultimately, purely sequence-based methods can, at best, implicate, but not explicitly model, the underlying structural and energetic mechanisms of substrate selectivity that are essential for drug design.

The structural details of protease-substrate interactions have been characterized through crystallization of HIV-1 protease in complex with various substrates 131; 132; 133. Prabu-Jeyabalan et al. crystallized six of the ten endogenous substrates in complex with a deactivated HIV-1 protease and proposed the substrate envelope hypothesis to explain HIV-1 protease selectivity 131; 132. They observed that all six substrate peptides conformed to a common volume within the protease active site despite significant diversity in their sequences and theorized that substrate selectivity is determined primarily by whether a given peptide sequence is able to adopt a low-energy conformation that fits within this volume, or substrate envelope. This hypothesis was evaluated in the context of HIV-1 protease inhibitors and it was found that the inhibitors also conform to the substrate envelope. More interestingly, the areas of the active site where the inhibitor protruded from the envelope, and consequently formed non-substrate-like interactions with the protease, were adjacent to DRM residue positions 134; 135. Subsequent design of small molecules that fit exclusively within the substrate envelope led to tight-binding inhibitors that showed low to moderate tolerance of drug-resistant mutations 136; 137; 138.

Despite the prevalence of sequence-based methods modeling substrate discrimination, and the apparent success of the substrate envelope hypothesis in inhibitor design, there is a dearth of structure-based methods for modeling HIV-1 protease selectivity. Kurt et al. used a coarse-grained sequence threading approach with an empirical potential function to successfully discriminate binders from non-binders in a small set of 16 peptides and identified peptide internal conformational energy as an important discriminating factor 139. Ozer et al. used a similar coarse-grained approach to test binding of a very large set of random sequences and demonstrated that some sequence motifs in endogenous substrates are near-optimal for binding 140. In both these cases, the lack of atomic resolution in both the structural model and the potential function limits the conclusions that can be drawn about the structural mechanisms of selectivity. Wang & Kollman used molecular dynamics methods to study the differences between substrate and inhibitor binding 141. In peptide design, Altman et al. successfully designed tighter-binding single and double mutants from the substrate peptide RT-RH using an atomic-resolution computational design algorithm but did not address the issue of selectivity 142. Finally, none of these previous studies, bioinformatic or structure-based, has systematically explored the role of protease active-site residues in selectivity, which is vital given that some of these residues are frequently mutated in drug-resistant viral strains.

The present study focuses on developing an atomic-resolution structural model of protease specificity through computational peptide docking and identifying the underlying mechanisms of substrate specificity by calculating the free energy contributions of each protease and peptide residue to the binding of cleavable and non-cleavable peptides. Active-site residue interactions that are determined to be essential for native substrate selectivity could serve as robust targets for drug design because of their central role in protease function. Finally, an atomic-resolution structural model will enable us to explicitly test the substrate envelope hypothesis in the context of substrate selectivity. Given the promising results of drug design methods implicitly based on this hypothesis, any additional insight into the substrate envelope hypothesis may yield new avenues for HIV drug research.


4.3 Methods

Collecting the peptide sequence data

We relied on several previous publications for peptide sequences that have been demonstrated to be cleaved by HIV-1 protease 116; 117; 118; 119; 120; 121. The ten endogenous substrate sequences were added to the set of cleavable sequences 132. The majority of non-cleavable sequences were collected using a sliding 8-residue window around cleavage sites of HIV-1 reverse transcriptase, as determined by Tomasselli et al. 119, from i-8 residues preceding to i+8 residues flanking the cleavage site i, in increments of 2 residues. Non-cleavable sequences also included five peptides predicted by Chou et al. 143 to have the lowest affinity for the protease. Finally, any peptide sequence that was identical in over four consecutive positions to any other peptide in the set was removed to ensure diversity. The final data set contained 69 cleavable peptides, including 10 endogenous substrates, and 43 non-cleavable peptide sequences.
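The window collection and redundancy filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 0-based indexing convention (`site` is the index of the P1' residue), the exclusion of the cleaved 8-mer itself, and the reading of "identical in over four consecutive positions" as a run of aligned identical residues are ours and are not taken from the original protocol.

```python
def sliding_windows(protein, site, width=8, reach=8, step=2):
    """Return 8-mers around a cleavage site, skipping the cleaved 8-mer itself.

    `site` is the 0-based index of the P1' residue (an assumed convention),
    so the true substrate P4..P4' starts at site - 4.
    """
    windows = []
    for start in range(site - reach, site + reach - width + 1, step):
        if start < 0 or start + width > len(protein):
            continue  # window would run off either end of the protein
        if start == site - width // 2:
            continue  # this window is the cleavable substrate, not a decoy
        windows.append(protein[start:start + width])
    return windows


def longest_shared_run(a, b):
    """Length of the longest run of aligned positions where a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def remove_redundant(peptides, max_run=4):
    """Keep a peptide only if it shares at most max_run consecutive
    identical positions with every peptide already kept."""
    kept = []
    for pep in peptides:
        if all(longest_shared_run(pep, k) <= max_run for k in kept):
            kept.append(pep)
    return kept
```

With these defaults each cleavage site contributes at most four candidate non-cleavable windows, which are then pooled and filtered together with the rest of the set.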

Computational peptide docking

The goal of peptide docking was to predict the structure of the protease-substrate complex at an atomic resolution for subsequent energy calculations 144; 145; 146. We developed our own flexible peptide docking algorithm to restrict the conformational search to a relatively local search because the end-goal is specificity determination, not structure prediction. The protease structure used for docking was obtained from the crystal structure of HIV-1 protease bound to the inhibitor Saquinavir 147. The substrate peptide starting conformation and rigid-body position was defined from the crystal structure of HIV-1 protease bound to the endogenous substrate p2-NC 132, after its bond lengths and angles were adjusted to ideal values 148. This starting substrate conformation ensured that the scissile peptide bond is in the correct conformation and orientation with respect to the catalytic active site residues for enzymatic catalysis and is largely constant across all crystallized peptides 132. A given substrate peptide sequence was then threaded through this ‘generic’ starting structure to create the starting structure for that particular peptide sequence for docking.

Figure 4.1 Peptide Docking Algorithm Flow chart of the Monte-Carlo minimization algorithm for peptide docking.

Peptide docking was carried out using a novel algorithm written within the RosettaDock protein docking software 5; 7; 149. The protease-peptide complex is represented entirely in atomic resolution, and the peptide backbone along with the peptide and protease side-chains are perturbed in a Monte Carlo plus minimization-based algorithm with simulated annealing. A flow chart representing the algorithm is illustrated in Figure 4.1. The temperature was decreased geometrically from a starting value of kT = 3.0 to a final value of 0.8. The starting structure underwent 96 cycles of docking to generate a single structural model, or decoy, and a total of 500 decoys were generated for each substrate peptide sequence.

Within each cycle, the peptide undergoes φ and ψ perturbations using a series of ‘small’ moves, which randomly perturb φ or ψ of residue i, and ‘shear’ moves, which randomly perturb φ of residue i and ψ of residue i-1 (reviewed in 150). The magnitude of the perturbation was chosen from a randomized Gaussian distribution that was scaled linearly with temperature from 20° at the beginning of the docking run to 5.3° at the end. Following a set of small and shear moves, the peptide backbone torsion angles, rigid-body position, and peptide and protease side-chain torsion angles were optimized using conjugate gradient-based minimization with the all-atom energy function. In every cycle, the peptide and protease side-chain conformations were optimized using the ‘rotamer trials’ method as described in 20, and in every eighth cycle they were further optimized through a combinatorial repacking algorithm, as described in Kuhlman et al. 151, using an expanded rotamer library 20; 152. The energy function used throughout the simulation is the standard docking energy function used in RosettaDock 7, consisting primarily of Van der Waals, hydrogen bonding, solvation, statistical pair-wise, and side-chain internal energy terms.

The lowest-energy decoy among the 500 generated for each peptide sequence was selected as the final structural model for that peptide. Each decoy took approximately 210 seconds to generate on a single 3.0 GHz Intel Xeon processor; the entire data set for 111 sequences was generated in approximately 3300 CPU hours.

We compared this flexible peptide docking algorithm with a fixed-backbone algorithm where the same starting structure was used, the peptide backbone and rigid-body position were held fixed, and the side-chain conformations were optimized using the same methods as described above.
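The annealing schedule described above can be made concrete with a short Python sketch. It reproduces only the numbers stated in the text (96 cycles, kT from 3.0 to 0.8 geometrically, perturbation width scaled linearly from 20°). The toy `anneal` loop, the one-torsion-per-cycle move, and the use of the standard Metropolis acceptance rule are illustrative assumptions. It omits the minimization, rotamer trials, and repacking steps and is not the RosettaDock implementation.

```python
import math
import random


def annealing_schedule(n_cycles=96, kt_start=3.0, kt_end=0.8):
    """Geometric temperature ladder: each cycle multiplies kT by a constant ratio."""
    ratio = (kt_end / kt_start) ** (1.0 / (n_cycles - 1))
    return [kt_start * ratio ** c for c in range(n_cycles)]


def perturbation_scale(kt, kt_start=3.0, max_degrees=20.0):
    """Width of the torsion perturbation, scaled linearly with temperature."""
    return max_degrees * kt / kt_start


def metropolis_accept(delta_e, kt, rng=random):
    """Standard Metropolis criterion (assumed here) for an energy change delta_e."""
    return delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kt)


def anneal(torsions, energy_fn, seed=0):
    """Toy annealing loop: one Gaussian torsion move per cycle, no minimization."""
    rng = random.Random(seed)
    current = list(torsions)
    e_current = energy_fn(current)
    for kt in annealing_schedule():
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.gauss(0.0, perturbation_scale(kt))
        e_trial = energy_fn(trial)
        if metropolis_accept(e_trial - e_current, kt, rng):
            current, e_current = trial, e_trial
    return current, e_current
```

At the final temperature the perturbation width is 20° × 0.8/3.0 ≈ 5.3°, matching the value quoted in the text.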

Energy calculations

The cleavability of a given substrate peptide is represented as the enzyme specificity constant towards that substrate, defined as $k_{cat}/K_m$, where $k_{cat}$ is the catalytic rate constant for product formation and $K_m$ is the Michaelis constant, treated here as a dissociation constant. Differences in free energy of enzyme-substrate interactions can be related to the specificity constant by $\Delta G = -RT \ln(k_{cat}/K_m)$ according to transition state theory. In this study, we approximate the free energy of interaction between the enzyme and the substrate as the sum of the individual residue energies ($E_i$) of each active-site and peptide residue $i$, for a given structure ($struct$):

$\Delta G = \sum_i E_i(struct)$

For each structure, the individual residue energies are calculated as the sum of five components of the standard RosettaDock energy function: Van der Waals energy ($E_i^{VdW}$), side-chain conformational energy ($E_i^{Dun}$), solvation energy ($E_i^{solv}$), hydrogen bonding energy ($E_i^{Hbond}$), and a residue reference energy ($E_i^{ref}$):

$E_i(struct) = E_i^{VdW}(struct) + E_i^{Dun}(struct) + E_i^{solv}(struct) + E_i^{Hbond}(struct) + E_i^{ref}(struct)$

The Van der Waals energy was represented using a modified Lennard-Jones 6-12 potential as described in 7, the side-chain conformational energy was represented by a statistical potential reflecting the Dunbrack rotamer probabilities 151; 152, the solvation energy was represented by the Lazaridis-Karplus Gaussian solvent-exclusion model 153, the hydrogen bonding energy was represented by a statistical orientation-dependent potential 26, and the residue reference energy was represented by a fitted energy function that describes the unfolded reference state, as described in 151. All energies are listed in arbitrary units of kT. The total energy was taken as the sum of the individual residue energies of active-site residues (residues 8, 23, 25, 27, 28, 29, 30, 32, 45, 47, 48, 49, 50, 76, 81, 82, and 84) on both chains in the protease and all eight peptide residues. This “free energy” approximation implicitly includes the entropy of the solvent but neglects peptide and protease conformational entropy.
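As a minimal sketch of the bookkeeping described above, the total energy can be accumulated from per-residue component energies as follows. The data layout (a dictionary keyed by chain and residue number), the chain labels "A" and "B" for the protease and "P" for the peptide, and the component key names are hypothetical. Only the residue list and the five-term sum come from the text.

```python
# Active-site residue numbers listed in the text (counted on both protease chains).
ACTIVE_SITE = (8, 23, 25, 27, 28, 29, 30, 32, 45, 47, 48, 49, 50, 76, 81, 82, 84)
# Hypothetical key names for the five energy components.
COMPONENTS = ("vdw", "dun", "solv", "hbond", "ref")
PROTEASE_CHAINS = ("A", "B")  # assumed chain labels; the peptide is any other chain


def residue_energy(terms):
    """E_i: sum of the five component energies for one residue."""
    return sum(terms[name] for name in COMPONENTS)


def total_energy(per_residue):
    """Sum E_i over active-site protease residues and all peptide residues.

    `per_residue` maps (chain, residue_number) -> dict of component energies.
    """
    total = 0.0
    for (chain, resnum), terms in per_residue.items():
        if chain in PROTEASE_CHAINS and resnum not in ACTIVE_SITE:
            continue  # protease residue outside the active site: not counted
        total += residue_energy(terms)
    return total
```

Keeping the per-residue terms separate, rather than storing only the total, is what later allows individual residues and individual energy components to be compared between cleavable and non-cleavable peptides.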

Identification of specificity determinants

We used a Welch’s t test on both the total free energy and on the individual residue energies between cleavable and non-cleavable peptide sequences to identify significant energetic differences. A threshold of 0.2 kT was set for the difference of mean values, below which the distributions of energies for cleavable and non-cleavable peptides were considered indistinguishable. The t test was conducted assuming unequal sample sizes, unequal variances, and an alternative hypothesis that cleavable peptides have a lower energy than non-cleavable peptides. We conducted all statistical analyses using the R statistical software package 154. A residue was considered a ‘specificity-determining residue’ if it showed a significant decrease in energy between cleavable and non-cleavable peptides with at least p < 0.05. We also conducted a jack-knife statistical test to determine the average number of specificity-determining residues identified by chance: we randomized the cleavability of each peptide sequence in our data set and carried out the statistical protocol 20 times. On average, 0.15 residues were found to be significant at p < 0.01 and 0.55 residues were found to be significant at p < 0.05.
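For readers unfamiliar with Welch’s test, the statistic and its degrees of freedom can be computed directly, as in the Python sketch below. The analyses in this study were run in R. This standalone version is only an illustration, the function names are ours, and it stops short of the p value, which would be obtained from the one-sided tail of a t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    A negative t means sample `a` (e.g. cleavable peptides) has the lower mean.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def passes_mean_threshold(cleavable, non_cleavable, threshold=0.2):
    """True if non-cleavable energies exceed cleavable ones by at least
    `threshold` (in kT) on average; smaller gaps are treated as indistinguishable."""
    return mean(non_cleavable) - mean(cleavable) >= threshold
```

Because the variances are estimated separately for each group, the test remains valid when the 69 cleavable and 43 non-cleavable peptides have different spreads.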


Validation of the method

We validated the structure prediction accuracy of the peptide docking algorithm by comparing the final structural models of six endogenous substrates (MA-CA, CA-p2, p2-NC, p1-p6, RT-RH, RH-IN) with their respective crystal structures 132. Structural accuracy was determined using two measurements: fnat, the fraction of native residue-residue contacts (defined as two residues in which a heavy atom from one residue is within 5 Å of a heavy atom from the other) between the protease and peptide that is recovered in the structure prediction, and Irms, the root mean squared distance (rmsd) of all Cα atoms within 5 Å of the protease-peptide interface following an optimal superposition along those atoms 155.

We validated our modeling of substrate specificity in two ways. First, we tested if there was a significant difference in free energy between cleavable and non-cleavable peptides complexed with the protease using a Welch’s t test, as described above. Second, we classified each peptide residue according to six types: small (G, C, N, S, T, D, A, V, P), hydrophobic (A, V, P, M, F, L, Y, I, W, C), charged (D, E, H, K, R), polar (T, C, S, N, D, Q, E, K, R, H, Y, W), aromatic (F, Y, W), and β-branched (V, I, T). We then sorted the peptides based on free energy of the peptide-protease complex and calculated the relative probability of finding each residue type at each subsite between the 30 lowest-energy peptides and the 30 highest-energy peptides. We compared these relative residue-type probabilities to those observed in the literature for cleavable peptides and the endogenous substrates.
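The fnat measure defined above can be sketched as follows. The input layout (dictionaries mapping a residue label to a list of heavy-atom coordinates) and the function names are assumptions for illustration. The 5 Å heavy-atom cutoff is the one given in the text, here treated as inclusive.

```python
from math import dist


def contacts(protease, peptide, cutoff=5.0):
    """Residue-residue contacts between protease and peptide.

    Each argument maps a residue label to a list of heavy-atom (x, y, z)
    coordinates; two residues are in contact if any pair of their heavy
    atoms lies within `cutoff` angstroms.
    """
    found = set()
    for res_a, atoms_a in protease.items():
        for res_b, atoms_b in peptide.items():
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                found.add((res_a, res_b))
    return found


def fnat(native_contacts, model_contacts):
    """Fraction of native contacts that are recovered in the model."""
    return len(native_contacts & model_contacts) / len(native_contacts)
```

Contacts that appear only in the model do not lower fnat, so the measure rewards recovering native interactions without penalizing extra ones.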


4.4 Results

4.4.1 Structure prediction of protease-substrate complexes

Accurate structure prediction of the protease-peptide complex from a given peptide sequence is critical to subsequent energy calculations that model specificity. A comparison of the six previously crystallized substrates shows that there are small, but significant, differences in their peptide backbone conformations 132. In order to accommodate these backbone conformation changes, we developed a novel flexible peptide docking algorithm within RosettaDock 7; 149 and compared it to the fixed-backbone/side-chain packing methods traditionally used to study peptide/protein design and specificity 142; 156; 157. We validated these methods by comparing the accuracy of the structure predictions on the six endogenous substrates for which crystal structures have been determined 132.

         Fixed-backbone rotamer packing     Flexible peptide docking
Name     fnat    Irms (Å)    ∆Ebinding      fnat    Irms (Å)    ∆Ebinding
CA-p2    0.82    0.53        -8.8           0.87    0.47        -12.5
MA-CA    0.66    0.78        -8.6           0.71    0.73        -11.8
p2-NC    0.67    0.55        -10.2          0.79    0.34        -13.6
p1-p6    0.59    1.1         -10.3          0.68    1.0         -13.3
RT-RH    0.75    0.65        -10.5          0.84    0.49        -13.8
RH-IN    0.66    1.3         -10.8          0.75    1.1         -15.1

Table 4.1 Accuracy using side-chain packing and peptide docking


Table 4.1 compares the accuracy of the peptide docking algorithm and the fixed-backbone/side-chain packing method in recovering the crystal structure of the protease-substrate complex, measured as the fraction of native residue-residue contacts recovered (fnat) and the rms of interface residues (Irms). The binding energy of the protease-peptide complex (∆Ebinding), as determined by RosettaDock, is also listed 5. Peptide docking predicted all six HIV-1 protease-substrate crystal structures to within 1.1 Å rmsd with at least 68% of native contacts recovered. The fixed-backbone method performed marginally, but consistently, less accurately than peptide docking in fnat and Irms. More importantly, the peptide docking structures showed significantly more favorable binding energies than fixed-backbone structures due to a greater number of hydrogen bonds and Van der Waals contacts. In a number of cases (CA-p2, p2-NC, p1-p6), key interactions noted by Prabu-Jeyabalan et al. 132, such as the side-chain hydrogen bonds between P2' and the protease D30 or an intra-molecular hydrogen bond between P2 and P1', are absent in the fixed-backbone structures but are recovered in peptide docking. These results demonstrate that small backbone movements are necessary to recover key protease-substrate interactions.

The structural models generated from peptide docking are compared to their respective crystal structures in Figure 4.2, for the representative cases of RT-RH, MA-CA, and RH-IN. Overall, the peptide conformation of residues P3-P3' is accurately recovered for both the backbone and side-chains. These residues are the most conformationally invariant among known protease-substrate structures 132, most likely because they are deeply buried within the substrate-binding site and thus the most highly restricted. P4 and P4' are less accurately recovered, which can be attributed to more conformational freedom in these positions, as has been observed experimentally. In a number of crystal structures, either the P4 or the P4' residue is partially disordered or has high thermal factors 132. The only readily observable feature in protease-substrate crystal structures that is absent in the predicted structures is the alternate conformation of the P3-P4 backbone in the substrates RH-IN (Fig. 4.2C) and p1-p6 132.

Figure 4.2 Protease peptide structure prediction The structure predictions for the peptides (A) RT-RH, (B) MA-CA, and (C) RH-IN. The crystal structure of each complex is shown in gray with the peptide in magenta. The predicted structure is colored with the HIV-1 protease dimer in green and cyan and the peptide substrate in orange. Active site and substrate residues are shown as sticks. P4 is the peptide residue on the bottom-left, P4’ is the peptide residue at the top-right.

4.4.2 Energetic discrimination of cleavable substrates

Following the accurate prediction of the protease-substrate complexes for the six known crystallized structures, the test set was expanded to 69 known cleavable peptides and 43 non-cleavable peptides, including 4 known endogenous substrates (NC-p1, TF-PR, PR-RT, AutoP) for which no crystal structures have been determined. We used the flexible peptide docking algorithm to generate structural models for each of the 112 peptides for subsequent energy calculations. For each structural model, the individual residue energies and total energy of the protease-peptide complex were calculated.

Figure 4.3A presents a histogram of the total energies for all 112 substrates and shows a separation between the energies of cleavable peptides and non-cleavable peptides, with some overlap. Over 40% of cleavable peptides have lower energy than all but ~3% of non-cleavable peptides, while over 35% of non-cleavable peptides have higher energy than all but ~5% of cleavable peptides. A Welch’s t test confirms that there is a significant difference in the distribution of the total energy between cleavable and non-cleavable peptides ($p < 10^{-6}$), suggesting that structural prediction with the peptide docking algorithm followed by energy calculations using the RosettaDock energy function captures, to some degree, HIV-1 protease selectivity. A similar trend is seen with the peptide energies alone (Figure 4.3B), though with weaker significance ($p < 10^{-4}$), indicating that the protease residue energies improve overall discrimination.


Figure 4.3 Energy distribution of cleavable and Non-cleavable Peptides Histograms of energies summed over different subsets of residues in the peptide-protease complex for cleavable (blue) and non-cleavable (red) peptides. (A) Energies from all active-site and peptide residues. (B) Energies from peptide residues alone. (C) Energies from active-site DRM-associated residues (protease residues 23, 30, 47, 48, 50, 76, 82, 84) (D) Energies from all active-site residues not associated with DRMs.

Identification of substrate specificity trends from the model

In order to further validate that our model was capturing the energetics of substrate specificity in HIV-1 protease, we sorted the protease-peptide complexes by energy, determined the relative probabilities of finding certain amino-acid types at each peptide position from P3-P3' between high-energy and low-energy peptides, and compared the results with previous experimental and bioinformatics studies. Amino acids were categorized into the following overlapping sets: small, hydrophobic, aromatic, β-branched, charged, and polar. Table 4.2 lists the log of the relative probabilities ($P_{low\_energy}/P_{high\_energy}$) for each subsite. A minimum threshold of ±0.25 was set, outside of which a trend was deemed significant.
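The log relative probability for one residue type at one subsite can be sketched as below. The residue-type sets are those listed in the Methods. Several choices are assumptions of this sketch rather than statements about the original analysis: the function names, the use of the natural logarithm (the text does not state the base), returning `None` when either probability is zero, and treating the ±0.25 threshold as inclusive.

```python
from math import log

# Overlapping residue-type sets as defined in the Methods.
RESIDUE_TYPES = {
    "small": set("GCNSTDAVP"),
    "hydrophobic": set("AVPMFLYIWC"),
    "charged": set("DEHKR"),
    "polar": set("TCSNDQEKRHYW"),
    "aromatic": set("FYW"),
    "beta_branched": set("VIT"),
}


def type_fraction(peptides, position, members):
    """Fraction of peptides whose residue at `position` belongs to `members`."""
    return sum(p[position] in members for p in peptides) / len(peptides)


def log_relative_probability(low_energy, high_energy, position, residue_type):
    """log(P_low_energy / P_high_energy) for one residue type at one subsite."""
    members = RESIDUE_TYPES[residue_type]
    p_low = type_fraction(low_energy, position, members)
    p_high = type_fraction(high_energy, position, members)
    if p_low == 0.0 or p_high == 0.0:
        return None  # ratio undefined; handling of this case is not described
    return log(p_low / p_high)  # natural log is an assumption


def is_significant(value, threshold=0.25):
    """A trend counts as significant when it reaches the threshold in either direction."""
    return value is not None and abs(value) >= threshold
```

Here `low_energy` and `high_energy` would be the 30 lowest-energy and 30 highest-energy peptide sequences, with `position` indexing the subsite within the 8-mer.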

Small

Hydrophobic

Aromatic

P3

-0.25

-0.27

+0.69

P2

+0.41

+0.79

P3') agrees with previous bioinformatics 124 and experimental studies 127; 128.

The goal of this study was to identify specificity determinants that were critical to native substrate selectivity to serve as potential targets for drug design, but, surprisingly, all six of the protease specificity-determining residues were in sequence positions that are clinically associated with drug resistance mutations 161. Figures 4.3C and 4.3D show histograms for the energy taken as the sum of active-site residues associated with DRMs (L23, D30, L32, G48, I47, I50, L76, V82, I84 for both chains in the protease) and compare it with the sum of non-DRM protease residues for cleavable and non-cleavable peptides. The DRM-associated residues are much stronger discriminators of cleavable peptides ($p < 10^{-3}$) than non-DRM residues ($p < 0.01$). These findings demonstrate that the very residues most responsible for substrate selectivity are the ones that are mutated in drug resistance, suggesting that HIV-1 alters its protease substrate specificity when evolving drug resistance.

Structural mechanisms of specificity-determining residues

The structural mechanism of substrate discrimination for these specificity-determining residues was primarily through steric interactions within the protease-substrate complex. Figure 4.5 shows a histogram of the individual residue energies accompanied by the structures of representative cleavable and non-cleavable peptides for four of the six specificity-determining residues. In the case of I47' (Figure 4.5A), the primary energetic components of discrimination were $E^{VdW}$ and $E^{Dun}$ (both at p < 0.01) due to steric interactions with the P2' side chain. Likewise, in L76 (not shown), the most significant discrimination was also in the $E^{VdW}$ and $E^{Dun}$ components (p < 0.05 and p < 0.01, respectively) due to interactions with the side chains from P2 and P4. In I84' (Figure 4.5B), discrimination is due primarily to $E^{Dun}$ (p < 0.05), as steric interactions from the side-chain atoms of P1 and P2' force the I84' side chain to adopt unfavorable conformations. In many cases these steric clashes were from the Cγ methyl group in β-branched side chains at P1. A more complex mechanism emerges for V82 (Figure 4.5C), where $E^{Dun}$ is the most responsible for discrimination (p < 0.01). Here, side-chain atoms of P1' and P2 force I84 into side-chain conformations that lead to steric clashes with the V82 side chain, forcing it to adopt unfavorable conformations. These latter two examples demonstrate a network of steric interactions between the P1-P2' and P1'-P2 subsites that spans more than 10 Å, from the Cα of P2/P2' to the Cβ of V82/V82'.

Figure 4.5 Structural and energetic mechanisms for specificity Left, histograms for the individual residue energies for specificity determinant residues (A) I47', (B) I84', (C) V82, and (D) D30' for cleavable (blue) and non-cleavable (red) peptides. Right, structures of representative cleavable and non-cleavable peptides for each specificity-determining residue. The specificity-determining residue is colored with respect to its residue energy (spectrum from blue to red for low energy to high energy). Important interacting residues as well as the specificity-determining residues are shown as sticks for the peptide (dark gray) and the protease (light gray). The cleavable peptides selected for illustration are the ten endogenous substrates; the non-cleavable peptides are the ten with the highest residue energy for each respective specificity-determining residue.

In two specificity-determining residues, D30' and G48', hydrogen bonding and solvation energy played a significant role in discrimination. In G48', the sum of the hydrogen bonding energy and solvation energy ($E^{Hbond} + E^{solv}$) was highly significant (p < 0.001). In cleavable peptides, the backbone oxygen of G48' forms a hydrogen bond with the backbone amide of P3'. By contrast, non-cleavable peptides are often characterized by the presence of an unsatisfied backbone hydrogen bond that is not exposed to solvent at this position. In D30' (Figure 4.5D), $E^{VdW}$, $E^{Dun}$, $E^{solv}$, and $E^{Hbond}$ all play a significant role in discrimination (p < 0.05). Discrimination at this position is a result of two factors: steric interactions between D30 and the P2' and P4' side chains, and hydrogen bonding of the D30' carboxyl group with either the solvent, the protease (through R8 and N88'), or the peptide (through P2' or P4').

Among the substrate residues, residue energies for P1, P1', P2', and P3' were found to be significantly increased for non-cleavable peptides over cleavable peptides. For P1, P1' and P2' this increase was primarily the result of increased $E^{VdW}$ and $E^{Dun}$, as non-cleavable peptides had residues that exhibited steric clashes and/or adopted unfavorable side-chain conformations. For P1, there was an additional effect of the $E^{Hbond} + E^{solv}$ energy terms (p < 0.001) that favored residues with low solvation energy, such as hydrophobic residues, as well as uncharged residues that can form side-chain or backbone hydrogen bonds with the carbonyl oxygen of G27. In our data set, Asn was the only P1 residue that could reliably form this hydrogen bond, perhaps explaining why it is the only polar residue at P1 among the endogenous substrates 132. At P3', there is an effect of $E^{Hbond}$ (p < 0.05) that is due to backbone-backbone hydrogen bond formation with G48'. Finally, at P2', among the 75 peptides with polar residues at this position, there was an additional effect of the $E^{Hbond} + E^{solv}$ energy terms (p < 0.01), due to hydrogen bond formation with either the side-chain amide of K45' (in the case of Glu) or the side-chain carboxyl and backbone amide of D30' (in the case of Gln). Hydrogen bonding between D30' and the P2' side chain has been observed in a number of protease-substrate crystal structures 133 and demonstrated to be important for specificity 162.

4.4.4 Evaluating the substrate envelope hypothesis

The substrate envelope hypothesis states that the protease selects cleavable substrates on the basis of their ability to fit within a given consensus volume 132. Although this hypothesis was evaluated in the context of protease inhibitors 134, it has never before been evaluated in the context of substrate peptides. In evaluating this hypothesis, we look at two factors: first, whether it is necessary for a cleavable peptide to fit within the substrate envelope and second, whether being able to fit inside this volume is sufficient to determine cleavability.

Figure 4.6 Evaluation of the substrate envelope hypothesis (A) Peptide residues P2-P2' are shown for cleavable (left) and non-cleavable (right) peptides. Peptide side-chain atoms are shown as sticks (magenta), atoms that protrude from the substrate envelope are shown as yellow spheres. Protease specificity-determining residues are shown as sticks (green and cyan). (B) Plot of the number of heavy atoms that protrude from the substrate envelope (Natoms) vs. the energy of the protease-peptide complex, for cleavable (red) and non-cleavable (blue) peptides.

We defined a substrate envelope consisting of the 70% consensus volume 135 from the structural models for the ten endogenous substrates, as the volume of all heavy atoms within the Van der Waals radius of another heavy atom for at least 7 other substrate peptides. Figure 4.6 illustrates the degree to which cleavable and non-cleavable peptides protruded from the substrate envelope at subsites P2-P2'. Overall, more heavy atoms (Natoms) protrude from the substrate envelope for non-cleavable than for cleavable peptides ($p < 10^{-5}$), but there were a substantial number of non-cleavable peptides that showed minimal protrusion. Approximately 90% of cleavable peptides had at most two protruding heavy atoms between P2-P2', compared to 40% of non-cleavable peptides. These results suggest that while staying within the substrate envelope is necessary for the cleavability of a peptide, it is not sufficient to determine cleavability.

Since the substrate envelope hypothesis is based on an active-site volume that is governed by steric and conformational energies, we tested whether the degree of protrusion from the substrate envelope of a given peptide was related to the energy of the protease-peptide complex. Figure 4.6B shows a correlation ($R^2 = 0.37$) between the total energy of the complex and the number of protruding heavy atoms, demonstrating that the substrate selectivity described by the substrate envelope hypothesis is largely related to the energies modeled in our study.

The spatial positions of the protease specificity-determining residues are immediately adjacent to regions of the substrate envelope with a large number of protrusions in non-cleavable peptides (Figure 4.6A), suggesting these residues play an important role in maintaining the appropriate shape of the substrate envelope for native substrate recognition. This spatial relationship bears a striking resemblance to the spatial position of DRM-associated residues in the substrate envelope with regard to inhibitor interactions 134; 135. In both cases, these residues are immediately adjacent to regions of substrate peptides and drug molecules that fall outside of the envelope. Additionally, every residue identified in the aforementioned studies as an active-site DRM residue is identified in the present study as a specificity-determining residue.
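Counting protruding heavy atoms against a consensus envelope can be sketched as follows. This is a simplified stand-in for the volume calculation described above, not the procedure used in the study: it tests only atom centers, uses a single uniform atomic radius of 2.0 Å (a hypothetical value), and takes the envelope to be any point covered by at least 7 of the 10 endogenous-substrate models. The function names are ours.

```python
from math import dist


def in_envelope(point, substrates, radius=2.0, min_count=7):
    """True if `point` lies within `radius` of a heavy atom in at least
    `min_count` of the substrate models (7 of 10 gives a 70% consensus).

    `substrates` is a list of models, each a list of (x, y, z) heavy-atom
    coordinates in a common reference frame.
    """
    hits = sum(
        any(dist(point, atom) <= radius for atom in atoms)
        for atoms in substrates
    )
    return hits >= min_count


def count_protruding(peptide_atoms, substrates, radius=2.0, min_count=7):
    """Number of peptide heavy atoms whose centers fall outside the envelope."""
    return sum(
        not in_envelope(atom, substrates, radius, min_count)
        for atom in peptide_atoms
    )
```

Applied to the P2-P2' heavy atoms of each docked peptide, the returned count plays the role of Natoms in Figure 4.6B.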

Figure 4.7 Substrate and binding envelope in the protease active-site (A) The substrate envelope (red surface) and binding envelope (blue mesh) volumes are illustrated for substrate residues P3-P3'. Drug resistance mutation residues 30, 47, 48, 50, 82, and 84 are shown as sticks. (B) The inhibitor volume, consisting of the combined volumes of inhibitors Nelfinavir, Saquinavir, Indinavir, Ritonavir, Amprenavir, Lopinavir, and Atazanavir, is shown in orange.

Finally, we attempted to capture the promiscuity of HIV-1 protease by defining a ‘binding envelope’, which is based on the 70% consensus volume of all cleavable peptide structures in the data set, and compared it to the substrate envelope. Figure 4.7 illustrates the binding envelope along with the substrate envelope and the inhibitor volume, defined as the Van der Waals volume of all heavy atoms in inhibitors Nelfinavir, Saquinavir, Indinavir, Ritonavir, Amprenavir, Lopinavir, and Atazanavir. A comparison of these three volumes reveals three significant characteristics. First, the binding envelope is substantially larger than the substrate envelope, indicating significantly greater structural diversity among all binders than among the ten endogenous substrates. Second, as previous studies have noted 134; 135, there are substantial portions of the inhibitor volume that fall outside of the substrate envelope, especially around the regions of DRM residues I50, V82, I84, and D30. Finally, the regions of the inhibitor volume that fall outside of the substrate envelope are largely contained within the binding envelope. This is especially clear around the aforementioned DRM residues.

4.4.5 Other determinants of specificity

Although binding energy models and the substrate envelope hypothesis are largely accurate in defining the cleavability of a given peptide sequence, they are, to some degree, insufficient, because the activity of HIV-1 protease towards a particular substrate is a function of both the catalytic rate (kcat) and the dissociation constant (Km). While differences in the energies of protease-substrate interactions may affect both kcat and Km, kcat is additionally affected by other features that are critical to enzymatic catalysis, such as active-site bond geometry, that may be essential to defining substrate specificity.

Figure 4.8 Distribution of active-site geometry in cleavable and non-cleavable peptides (A) Histogram of the active-site geometry, measured as the angle between the C-O bond vector of P1 and the Oδ2-Oδ2 vector between D25 and D25' (θactive-site) for cleavable (blue) and non-cleavable (red) peptides. (B) Representative cases of a peptide with θactive-site = 89° (left) for the non-cleavable peptide IAEI↓QKQG and a peptide with θactive-site = 141° (right) for the cleavable peptide ARVL↓AEAM. The carbonyl oxygen is shown in spheres; the peptide backbone, D25 and D25' are shown in sticks. The C-O vector of P1 and the Oδ2-Oδ2 vector between D25 and D25' are shown as arrows.

In the reaction mechanism of an aspartyl protease, one catalytic aspartate (D25) acts as a general base, deprotonating a water molecule that subsequently hydrates the carbonyl group about the scissile peptide bond to form a tetrahedral intermediate, and the second aspartate (D25') acts as a general acid, facilitating peptide-bond cleavage. We measured the geometry of the peptide carbonyl group with respect to the carboxyl groups of the catalytic aspartates as the angle between the C-O bond vector of P1 and the Oδ2-Oδ2 vector between D25 and D25' (θactive-site). This angle reflects the orientation of the bond geometry of the carbonyl carbon of P1 with respect to the catalytic aspartates that catalyze its hydration. In all ten endogenous substrates and approximately 90% of cleavable peptides, θactive-site ranged from 120° to 150°, compared to approximately 50% of non-cleavable peptides (Figure 4.8). We observed a significant sub-population of non-cleavable peptides (~30%) that had a θactive-site between 60° and 100°. Figure 4.8B illustrates the active-site geometry for two representative cases with θactive-site at 90° and 140°. The carbonyl oxygen at θactive-site of 90° occludes the proposed position of the catalytic water molecule 163, while the same oxygen at θactive-site of 140° provides ample space for the water as well as the appropriate geometry for hydration. Furthermore, among the two most prominent sequence motifs at the P1-P1' positions for cleavable peptides 129, 100% of peptides with the Aro-P motif (11/11) and ~90% of peptides with the Hφ-Hφ motif (46/52) were within the normal range of 120°-150°, compared to ~60% (31/49) for all other peptides. These results suggest that specificity, especially at P1 and P1', may be governed, in part, by the ability of a peptide to adopt the appropriate catalytic geometry within the active site, which is not reflected by binding energetics alone.
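The active-site angle defined above is a plain angle between two vectors and can be computed as in this Python sketch. The function names, the argument order, and the choice to point the Oδ2-Oδ2 vector from D25 to D25' are assumptions for illustration. The text does not specify the direction convention, and reversing it maps θ to 180° − θ.

```python
from math import acos, degrees, sqrt


def vector(a, b):
    """Vector pointing from point a to point b."""
    return tuple(q - p for p, q in zip(a, b))


def angle_between(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return degrees(acos(cosine))


def theta_active_site(p1_c, p1_o, d25_od2, d25_prime_od2):
    """Angle between the P1 C-O bond vector and the D25 -> D25' Od2-Od2 vector."""
    return angle_between(vector(p1_c, p1_o), vector(d25_od2, d25_prime_od2))


def in_catalytic_range(theta, low=120.0, high=150.0):
    """True if theta falls in the 120-150 degree range seen for endogenous substrates."""
    return low <= theta <= high
```

Each argument is an (x, y, z) coordinate taken from the docked model, so the measurement needs only four atoms per structure.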

4.4.6 Extending the method to other enzyme-substrate interactions

Given the success of discriminating between cleavable and non-cleavable substrates and identifying specificity determinants in HIV-1 protease, we sought to extend this method to other enzymes that catalyze reactions on peptide substrates. Histone modification enzymes have become an area of intense structural, biochemical, and genetic research because of their essential roles in the cell cycle and in cancers. The histone H4-Lys 20 methyltransferase, hSET8, in particular, has been studied for its role in transcriptional regulation and mitotic regulation. Couture et al. 164 have extensively studied substrate specificity in hSET8 and have both crystallized the enzyme in complex with a peptide substrate and identified several point mutants of the substrate peptide that preserve or disrupt methylation, relative to the wild-type substrate. We applied our algorithm, exactly as described for HIV-1 protease, using the hSET8 crystal structure. We modeled three groups of peptide substrates: non-binding H3 lysine peptides, non-binding H4-K20 point mutants, and binding H4-K20 point mutants.

Figure 4.9 Modeling substrate specificity for hSET8. (A) The starting structure for peptide docking to hSET8. (B) A comparison of binding energy between various substrate peptides shows a significant difference in energy between the non-binding H3 peptides and the binding H4-K20 peptides. A much smaller, but significant, difference between non-binding H4-K20 peptides and binding H4-K20 peptides is present as well.


It is known that hSET8 is highly specific towards a single substrate, the K20 of the H4 histone, and does not methylate any other histone tail lysines. As Figure 4.9 shows, there was a significant difference in binding energy between the binding H4-K20 mutants and the H3 lysine peptides. The difference in energy between the binding and non-binding H4-K20 point mutant peptides was much smaller, however, indicating that more subtle mechanisms of specificity may be in play between these point mutants, compared to the much more diverse set of non-binding H3 lysine peptides. Encouraged by the success in modeling a completely different enzyme-substrate system, we are currently working to expand the scope of our methods to include post-translational modifications in order to model the specificity of other chromatin remodeling enzymes.

4.5 Discussion

In this study we developed a novel peptide docking algorithm that was able to accurately predict the HIV-1 protease-substrate complex structure at atomic resolution and recover most of the important interactions for six previously determined crystal structures. We then predicted the structures for a large, diverse set of cleavable and non-cleavable peptides, calculated an approximate free energy of the resulting complex, and showed significantly more favorable energies among cleavable peptides than non-cleavable peptides. We sorted the peptides based purely on energy and recovered most major residue-type trends at P3-P3' identified in previous studies, validating our method as an accurate model of substrate specificity. Finally, through statistical analysis of the individual residue energies in each complex, we identified several specificity-determining residues that are predominantly responsible for discriminating cleavable peptides.
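The idea of ranking residues by how well their energies separate cleavable from non-cleavable peptides can be sketched in a few lines of Python. This is a minimal illustration, not the statistical procedure used in this chapter: Welch's t statistic is swapped in as a convenient separation measure, and the residue labels and energy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical per-residue energies (arbitrary units), one value per docked peptide.
residue_energies = {
    "res_A": {"cleavable": [-1.2, -1.0, -1.4, -1.1],
              "non_cleavable": [0.3, 0.8, 0.1, 0.6]},
    "res_B": {"cleavable": [-0.5, -0.4, -0.6, -0.5],
              "non_cleavable": [-0.5, -0.6, -0.4, -0.5]},
}


def rank_residues(energies):
    """Sort residues by |t|; a positive t means higher energy in non-cleavable peptides."""
    scores = {
        name: welch_t(groups["non_cleavable"], groups["cleavable"])
        for name, groups in energies.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_residues(residue_energies)
for name, t in ranking:
    print(f"{name}: t = {t:.2f}")
```

In this toy data set, res_A separates the two groups strongly while res_B does not, so res_A would be the candidate specificity-determining residue.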


We found that substrate specificity is largely determined by residues that cluster around the P2-P3' subsites. Steric interactions between side chains at P2-P2' in non-cleavable peptides led to significantly higher energies in the protease residues L76, V82, D30', I47' and I84'. Hydrogen bonding and solvation energies played a major role as well, particularly with D30' and G48. These results are in broad agreement with the bioinformatics-based identification of specificity-determining residues over a data set of nine retroviral proteases and corresponding cleavage data carried out by Kontijevskis et al. 130. Using multivariate analysis with physicochemical descriptors based on the protease sequences, they also found D30, V82, and I84 to play the major role in determining cleavage specificity. However, a number of residues identified in their study were not found to be specificity-determining in the present study (R8, T26, V32, I64, P81, N83, L90), and vice versa (I47, G48, L76). The reason for this discrepancy is unclear; additional specificity effects of protease residues observed in the present study may arise indirectly through the significant intra-peptide cross-dependencies observed in the Kontijevskis et al. study, or their use of a set of nine retroviral proteases may bias certain protease residues towards being specificity-determining for reasons of sequence conservation and cleavage data availability. Additionally, since Kontijevskis et al. analyzed experimental data directly, they may have captured effects such as protease dynamics or chemical activity that are beyond the scope of our static structural methods. Given that our structural approach is completely orthogonal to their bioinformatics approach, the agreement between the two studies is encouraging, while the discrepancy is of interest for further research.


In our model, specificity was a result of both local and non-local structural mechanisms. Local mechanisms conferred specificity at a given subsite through the energetics of the peptide and protease residues within that subsite, such as the disfavoring of β-branched amino acids at P1 through clashes with I84 or the burial of a charged group at P1' in the active site. Non-local mechanisms conferred selectivity through energetics of peptide and protease residues at distal subsites or multiple subsites. Examples of non-local mechanisms include P2 side-chains affecting the energies of the P1' subsite through steric interactions with I84 and V82, or various combinations of peptide side-chains leading to unfavorable peptide backbone conformations that disrupt conserved peptide-protease hydrogen bonds. That no single sequence motif appeared to dominate low-energy or high-energy binders in our study underscores the degree to which these indirect effects play a major role in determining cleavability, reflecting the observed interdependence between peptide sequence positions on cleavable substrates in bioinformatics studies 124; 140. Finally, we found that effects not captured by the protease-peptide energies, such as the active-site bond geometries, may play a major role in specificity as well, potentially explaining the prevalence of Aro-P and Hφ-Hφ motifs at P1-P1' among cleavable peptides.

Through this effort, we hoped to identify several specificity-determining protease residues that would be critical to native selectivity and thus serve as robust targets for drug design. This goal was premised on the prevailing theory that drug resistance in HIV-1 protease is the result of mutations that alter inhibitor binding without significantly affecting substrate recognition 135. The unexpected result of the study is that the residues determined to be critical for substrate recognition are also residues that clinical studies have revealed to be positions of high-frequency drug resistance mutations, which is demonstrated most clearly by the substantial difference in energies between cleavable and non-cleavable peptides when using only active-site DRM residue energies in our model. These findings indicate that active-site drug resistance mutations do alter substrate recognition and, more importantly, imply that drug resistance and substrate recognition may share common mechanisms.

The importance of DRM residues in substrate selectivity suggests that any mutations to these residues may alter the specificity of the protease, perhaps even towards its endogenous substrates. In fact, experimental evidence has long supported the view that native substrate binding can be altered in drug-resistant variants of HIV-1 protease. Both clinical and in vitro studies have revealed that many viral variants that display early-stage drug resistance show significantly decreased viral infectivity, as a result of decreased or defective protease activity 165; 166; 167. Lin et al. 168 showed at least a several-fold increase in the Km of HIV-1 protease binding to peptide substrate Ca-P2 for several drug-resistant mutants at positions V82 and D30. Additionally, a number of studies have observed polymorphisms in several Gag polyprotein cleavage sites, including in response to decreased activity of drug-resistant protease variants 158; 169; 170. The most frequently observed of these cleavage site mutants is the heavily studied valine to alanine substitution in P2 of NC-p1 found in patients with the V82A protease mutation, for which a crystal structure of the complex has been determined 160. Dauber et al. found altered specificity in a peptide-library cleavage assay for several DRM proteases, most prominently in the selectivity of β-branched residues at the P2 position in V82 mutants. Finally, a series of studies on the closely related feline immunodeficiency virus (FIV) protease found that several residues that correspond to DRM positions 30, 46, 47, 48, 50, 82, and 84 of HIV-1 protease are responsible for the differences in specificity between FIV protease and HIV-1 protease 162; 171; 172.

We evaluated the substrate envelope hypothesis in the context of substrate selectivity and found that while it is necessary for a peptide to have a minimal number of protrusions to be cleavable, this is not sufficient to define cleavability, as a significant number of non-cleavable peptides also had either a low number of protrusions or none. A residue-type analysis carried out on the 30 least-protruding and 30 most-protruding substrates (data not shown) revealed markedly different trends from the aforementioned analysis using low and high energy peptides that are not reflected in bioinformatics studies, most prominently, and unsurprisingly, that ‘small’ residues play a dominant role in minimizing protrusions (and consequently defining cleavability, according to the hypothesis).

These findings underscore the limitations of the substrate envelope hypothesis as a model for substrate specificity in HIV-1 protease. However, we found the theory underlying the substrate envelope hypothesis, the use of substrate consensus volumes to define binding behavior, useful in describing the specificity and promiscuity of HIV-1 protease. We generated a substrate envelope, consisting of the consensus volume of the 10 endogenous substrates, and a binding envelope, consisting of the consensus volume of all cleavable peptides, and found that the binding envelope was significantly larger than the substrate envelope, indicating substantially more structural diversity among the set of all cleavable peptides than in the 10 endogenous substrates. Furthermore, we observed that the regions of the active site with the greatest disparity between the binding and substrate envelopes, that is, the areas where there is the greatest diversity among all cleavable peptides compared to endogenous substrates (i.e. the greatest ‘promiscuity’), are often immediately adjacent to DRM residues. These results suggest that the HIV-1 protease active site is ‘primed’ to develop DRMs where this structural promiscuity is highest (subject to other constraints, such as protease stability). An examination of protease inhibitor volumes shows that heavy atoms in inhibitors often protrude from the envelope into these areas of high promiscuity, adjacent to the DRM residues that disrupt their binding. Whether this relationship between DRM and binding-site structural promiscuity is a general feature of drug resistance requires further study.

From a methodological perspective, this study represents a first attempt, to our knowledge, to systematically identify enzyme specificity determinants through a structure-based method. Through this approach we demonstrate that the specificity-determining residues in HIV-1 protease largely overlap with active-site residues associated with drug resistance, and we expand on the substrate envelope hypothesis to account for both the specificity and promiscuity of HIV-1 protease, in the hope that a more complete understanding of its molecular recognition can yield valuable insight for HIV drug research.
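The comparison between a substrate envelope and a binding envelope described above can be sketched on a coarse voxel grid. The Python below is a toy illustration under stated assumptions: the grid spacing, the `min_fraction` consensus threshold, the function names, and the coordinates are all hypothetical, and the definition of the consensus volume actually used in this work may differ.

```python
from collections import Counter


def voxelize(atoms, spacing=1.0):
    """Map atom coordinates (x, y, z) to the set of grid cells they fall in."""
    return {tuple(int(c // spacing) for c in xyz) for xyz in atoms}


def consensus_envelope(models, min_fraction=0.5, spacing=1.0):
    """Voxels occupied in at least min_fraction of the models (assumed definition)."""
    counts = Counter()
    for atoms in models:
        counts.update(voxelize(atoms, spacing))
    threshold = min_fraction * len(models)
    return {voxel for voxel, n in counts.items() if n >= threshold}


# Hypothetical peptide models: each is a list of heavy-atom coordinates.
endogenous = [
    [(0.2, 0.2, 0.2), (1.5, 0.1, 0.3)],
    [(0.4, 0.6, 0.1), (1.2, 0.7, 0.9)],
]
all_cleavable = endogenous + [
    [(0.5, 0.5, 0.5), (2.3, 0.2, 0.2)],
    [(0.1, 0.9, 0.4), (2.8, 0.5, 0.5)],
]

substrate_env = consensus_envelope(endogenous)
binding_env = consensus_envelope(all_cleavable)

# Voxels present only in the binding envelope mark regions of 'promiscuity'.
promiscuous = binding_env - substrate_env
print(len(substrate_env), len(binding_env), sorted(promiscuous))
```

In a real analysis the voxels in `promiscuous` would then be compared against the positions of nearby protease residues, which is the step that links envelope disparity to DRM positions.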

Finally, we intentionally designed the method to be generally applicable to any enzyme that interacts with peptide substrates for which a high-resolution structure is known and a dataset of reactive and non-reactive substrate peptides has been identified. We are currently exploring its uses in other systems. We have uploaded the set of 112 structure predictions in Protein Data Bank format, along with a database listing every peptide, its sequence, cleavability, and residue energies from the resulting structure prediction, to our website for public use (http://graylab.jhu.edu). We believe this study represents only a fraction of the analyses that can be performed on this data set, and it is our hope that other groups will be able to use this information in conjunction with their own molecular modeling or statistical techniques to glean additional information about substrate specificity in HIV-1 protease.

4.6 Conclusions

In this study we used flexible peptide docking to generate structural models for a large data set of HIV-1 protease-substrate complexes. We then used a relatively simple energy model derived from the Rosetta score function to discriminate between cleavable and non-cleavable peptides. Finally, we carried out statistical analysis of the individual residue energy contributions to the total energy to identify those active-site and peptide residues most responsible for discriminating between cleavable and non-cleavable substrates. We also demonstrated that this approach is generally applicable to any enzyme-substrate system that has a known crystal structure of the complex and a robust data set of known substrates. These results show that flexible docking can be used beyond structure prediction, towards identification of the structural and energetic mechanisms that underlie protein-protein interactions.


CHAPTER 5 FUTURE DIRECTIONS IN MODELING PROTEIN INTERACTIONS

5.1 Summary

In the previous chapters, we have detailed various methods for flexible docking and applied them to a number of cases, from blind structure prediction to homology modeling and substrate specificity. In virtually all of these cases, novel algorithms were designed and then implemented within the Rosetta software package. In this chapter we will outline the more general methodological contributions I have made to Rosetta molecular modeling, with a specific focus on its future in modeling protein interactions. The goals of our efforts here are two-fold: 1) to expand the range of biomolecular systems we can model in Rosetta and 2) to broaden the user-base of Rosetta beyond a core group of experienced Rosetta developers and users, and into the larger community of computational and structural biologists.

5.2 Introduction

Future directions in modeling protein interactions involve translating the molecular modeling methods that have been trained on benchmark sets into methods that are applicable to the often more complex biology of specific biochemical systems or pathways. There are a number of limitations in applying traditional protein docking methods to many real-world protein-protein interactions. In some cases, the interactions occur across interfaces that contain non-protein moieties, such as small molecules or other co-factors. In others, there are dramatic local or global rearrangements in protein structure upon binding. In still others, biochemically significant interactions occur between subunits of a multi-domain protein instead of two separate monomeric proteins. In many of these cases, the underlying methodology of protein docking is relevant, but its specific application to each of these areas is restricted by scientific challenges associated with modeling these systems, as well as technical challenges in how it is implemented in the software.

We have moved in two related directions in improving the Rosetta software package’s ability to model these types of systems. First, in collaboration with Rosetta developers throughout the RosettaCommons organization, we have contributed to the development of Rosetta v3 (A. Leaver-Fay et al. manuscript in preparation). This new version of Rosetta utilizes a more sophisticated object-oriented architecture to create a modular coding environment where basic functionalities across different applications can be easily combined to develop customized protocols. Specifically, I have been one of the lead developers of the protein docking module in Rosetta v3. Among its additional capabilities is the easy incorporation of small molecules and other co-factors into docking calculations, a major restriction in the previous RosettaDock v2 package.

Second, we have developed a new script-based implementation of Rosetta v3 called PyRosetta. Taking advantage of the object-oriented architecture of Rosetta v3, we built Python libraries of all its major classes and functions, allowing users to develop completely custom algorithms with the same level of functionality available to long-time Rosetta developers. We chose Python because of its ease of use, its sophistication as a scripting language, and its widespread popularity within the biology and bioinformatics community. The ultimate goal of the PyRosetta project is to make designing and implementing custom modeling algorithms tailored towards specific biomolecular modeling problems accessible to the broader community of structural and computational biologists.

5.3 Rosetta v3

The Rosetta v2 software package has been demonstrated to be widely successful in a diverse range of molecular modeling tasks, including ab initio folding, homology modeling, loop modeling, protein and small molecule docking, and protein design. At their core, these methods utilize similar approaches and use many of the same functions. The software code underlying Rosetta v2 was originally conceived in Fortran for protein folding, later ported to C++, and dramatically expanded to include the many methods it is used for today. As a result, the program was severely limited in its flexibility to handle modeling tasks outside of the immediate scope of a given protocol, and writing custom algorithms required a substantial amount of knowledge and experience in working with the Rosetta source code. In an effort to modernize the code, consolidate the many disparate functions used by various methods, and organize it in an intuitive manner for easy programming (Figure 5.1), the Rosetta v3 development project was launched two years ago. It is still in development, but most of the modeling protocols available in Rosetta v2 are now implemented, in some form, in Rosetta v3.


Figure 5.1 Basic organization of Rosetta v3. Rosetta v3 is organized into ‘core’ classes, which contain all the basic objects and functions used by virtually every method in Rosetta, and ‘protocol’ classes, which use other core and protocol objects to implement molecular modeling algorithms.

5.3.1 Introduction

Beyond general software development in Rosetta v3, our main contribution to the project was writing the protein docking module within it. This involved writing various rigid-body ‘movers’ that replicated the perturbations in protein docking in Rosetta v2, writing score function components that calculate the energy of docked complexes, and then combining all of these elements to make a protein docking protocol based on the RosettaDock algorithm. The modularity of the docking code allows developers to use various classes (for example, DockingHighRes, the high-resolution refinement phase of RosettaDock) in their own algorithms.

Our goal in developing the protein docking module in Rosetta v3.x was two-fold: first, to show similar or better performance in docking tasks compared to Rosetta v2.x, and second, to demonstrate the flexibility of Rosetta v3 by combining elements from other modeling protocols into docking with minimal effort. For the first goal, we have started docking simulations in both Rosetta v2 and Rosetta v3, using targets from the Docking Benchmark 3.0,173 a diverse, non-redundant set of docking targets of varying difficulties. For the second goal, we address the problem of small molecules and co-factors at protein interfaces. In standard docking methods, and in Rosetta v2, these molecules are often discarded because they cannot be properly modeled at the interface, for structural reasons as well as because protein docking algorithms are often heavily parameterized for protein-protein interactions. We aim to combine elements from the small-molecule docking methods in Rosetta v3 into protein docking, to see if explicitly modeling interface small molecules can improve docking performance.

5.3.2 Methods

We ran local docking perturbation runs, as described in Chapter 3, for a benchmark set of 119 protein complexes using Rosetta v2, and isolated 35 targets in which there was at least some sampling near the native. We then carried out local docking perturbations on these 35 targets using the protein docking protocol in Rosetta v3. The starting structure in each case was the unbound structures superimposed on the bound complex. We generated 1000 decoys for each target and sorted the decoys by total score, using the RosettaDock scoring function.7 As of the writing of this manuscript, we are still in the process of generating and analyzing the data for this benchmark set.

Inspection of Docking Benchmark 3.0 reveals that a substantial number of targets (~25%) had small molecules bound to at least one of the docking partners, and in approximately one-third of those cases, the small molecule was in or near the docking interface. For six targets in the benchmark set that have small molecules in or near the interface that are also present in the unbound structures, we explicitly modeled the small molecule while docking the unbound structures. We generated centroid-mode and full-atom parameters for the small molecules in the same manner that is used in RosettaLigand 174 and held the small-molecule rigid-body position and orientation fixed relative to the partner it was bound to in the unbound structure. We then carried out the docking protocol as described previously, sorted results by interaction energy, and compared results with those for the same docking target with the small molecule removed. The increase in computational cost for including the small molecule fixed to one of the partners is trivial because it does not increase the conformational search space and only marginally increases the number of atoms in the system.
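Holding the small molecule fixed relative to its partner means that every rigid-body perturbation applied to that partner is applied identically to the ligand, so no new degrees of freedom are introduced. The pure-Python sketch below illustrates this idea only; it does not use the Rosetta API, and the function names and coordinates are hypothetical.

```python
from math import cos, dist, radians, sin


def rotate_z(point, angle_deg):
    """Rotate a point about the z axis by angle_deg degrees."""
    a = radians(angle_deg)
    x, y, z = point
    return (x * cos(a) - y * sin(a), x * sin(a) + y * cos(a), z)


def rigid_move(points, angle_deg, shift):
    """Apply the same rotation and translation to every point in the list."""
    moved = []
    for p in points:
        x, y, z = rotate_z(p, angle_deg)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved


# Hypothetical coordinates: three partner atoms and one ligand atom.
partner = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
ligand = [(0.5, 0.5, 1.0)]

# The ligand is simply appended to the partner, so it follows every move.
moved = rigid_move(partner + ligand, angle_deg=30.0, shift=(2.0, -1.0, 0.5))
moved_partner, moved_ligand = moved[:len(partner)], moved[len(partner):]

# Partner-ligand distances before and after the move are identical.
for before, after in zip(partner, moved_partner):
    print(round(dist(before, ligand[0]), 3), round(dist(after, moved_ligand[0]), 3))
```

Because the ligand only adds a few atoms to one rigid body, the search space is unchanged, which is why the added computational cost is small.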

5.3.3 Results

Overall, among the six targets tested, explicitly modeling the small molecule at the interface substantially improved docking in two cases and had no effect, for better or worse, in the other four (Table 5.1). In all the targets except 1EWY, interface small molecules made up only a very small fraction of the total intermolecular contacts at the interface. The greatest improvements were observed in targets 1EWY and 1GRN, where local docking without the small molecules failed to produce even a single medium-quality prediction in the top 5 scoring decoys.

          Without ligand         With ligand
PDB       N5    Irmsd (Å)        N5    Irmsd (Å)
1EWY      0     4.92             3     1.07
1GRN      0     9.20             2     1.02
1WQ1      2     1.64             2     2.90
1RLB      2     1.62             2     1.99
1R8S      0     10.8             0     4.19
1E6E      0     4.74             0     6.87

Table 5.1 Results of protein docking incorporating small molecules in Rosetta v3. N5 is the number of predictions, among the top 5 scoring decoys, that had an Irmsd < 4.0 Å; Irmsd is the lowest Irmsd among the top 5 scoring decoys.
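The two metrics reported in Table 5.1 can be computed from a list of scored decoys as in the following sketch. The function name and the decoy values are hypothetical; the 4.0 Å cutoff and the top-5 window are taken from the table caption.

```python
def top_n_summary(decoys, n=5, cutoff=4.0):
    """Summarize the n best-scoring decoys.

    decoys is a list of (score, irmsd) pairs; a lower score is better.
    Returns (number of top-n decoys with irmsd < cutoff, lowest irmsd in top n).
    """
    top = sorted(decoys, key=lambda d: d[0])[:n]
    n_hits = sum(1 for _, irmsd in top if irmsd < cutoff)
    best_irmsd = min(irmsd for _, irmsd in top)
    return n_hits, best_irmsd


# Hypothetical decoys: (score, Irmsd in angstroms).
decoys = [
    (-35.2, 1.3),
    (-34.8, 6.3),
    (-34.1, 2.5),
    (-33.9, 3.8),
    (-33.0, 9.1),
    (-20.0, 0.8),  # near-native but poorly scored, so outside the top 5
]

n5, best = top_n_summary(decoys)
print(n5, best)
```

Note that the near-native decoy with the worst score does not count towards either metric, which is the intended behavior: both metrics reward sampling and scoring together.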

Target 1EWY is Ferredoxin-NADP+ reductase (FNR) bound to FAD, in complex with ferredoxin (Fd). Figure 5.2 shows the score vs. Irmsd for local docking with, and without explicitly modeling the FAD molecule, with the second lowest energy decoy illustrated at the bottom. The FAD molecule serves as a prosthetic group in FNR and is critical for electron transfer from the Fd to the NADP+ substrate through the formation of a ternary complex.175 Inspection of the crystal structure of the complex shows that almost half of the intermolecular contacts between FNR and Fd are mediated by the FAD molecule.176 In this context, it is unsurprising that explicitly modeling the FAD molecule at the interface is critical to accurate docking of the ternary complex.


Figure 5.2 Local docking of the FNR-Fd ternary complex. Scatterplots of score vs. Irmsd for local docking of the unbound structures in target 1EWY without the small molecule FAD bound to FNR (left), and with it (right), show that explicit modeling of the FAD molecule substantially improves docking. The third-lowest energy structure from docking using FAD is shown underneath with FNR (green), Fd (cyan), and the FAD molecule (magenta), superimposed on the crystal structure of the complex (gray).

Accurate molecular modeling of real-world biochemical systems often requires the incorporation of ‘non-standard’ features, such as co-factors, small molecules, metal ions, or post-translational modifications. The ease of parameterizing interfacial small molecules and incorporating them into docking, together with the significant improvements in results observed in some cases, underscores the utility and flexibility of Rosetta v3 in faithfully reproducing the complexities often present in the biochemical systems studied in biomedical research.

5.4 PyRosetta

Previously published as: Chaudhury, S., Lyskov, S., Gray, J.J. (2010) PyRosetta: a script-based interface for implementing custom molecular modeling algorithms using Rosetta. Bioinformatics (in press).

Reprinted with the permission of the publisher, Oxford Journals, Inc., with minor revisions.

PyRosetta is a stand-alone Python-based implementation of the Rosetta molecular modeling package that allows users to write custom structure prediction and design algorithms using the major Rosetta sampling and scoring functions. PyRosetta contains Python bindings to libraries that define Rosetta functions, including those for accessing and manipulating protein structure, calculating energies, and running Monte Carlo-based simulations. PyRosetta can be used in two ways: 1) interactively, using IPython, and 2) script-based, using Python scripting. Interactive-mode contains a number of help features and is ideal for beginners, while script-mode is best suited for algorithm development. PyRosetta has similar computational performance to Rosetta, can be easily scaled up for cluster applications, and has been implemented for algorithms demonstrating protein docking, protein folding, loop modeling, and design.

5.4.1 Introduction

Recent advances in molecular modeling have led to its increasing use in structural biology research for a wide range of applications. The Rosetta biomolecular modeling suite, in particular, has proved effective in many diverse tasks including ab initio structure prediction and homology modeling (Raman, et al., 2009), protein and small-molecule docking (Davis and Baker, 2009; Chaudhury and Gray, 2007), loop modeling (Mandell, et al., 2009), and design (Kuhlman, et al., 2003). To make these protocols more accessible, a number of web-based servers have been constructed, such as Robetta (Chivian, et al., 2004), RosettaDock (Lyskov and Gray, 2008), and RosettaAntibody (Sivasubramanian, et al., 2008). However, many modeling problems do not fit cleanly into one of the standard Rosetta protocols, and algorithms that combine elements from different methods within Rosetta are often required to adequately model a particular system. Developing such algorithms requires extensive experience in both C++ programming and Rosetta software development, severely limiting their accessibility.

To make custom molecular modeling using Rosetta accessible to a broader community of structural biologists, we developed PyRosetta, a Python-based implementation of the Rosetta molecular modeling suite. Our goal was to enable users to define a molecular modeling problem, design an algorithm to solve it, and implement that algorithm on the computer using pre-existing Rosetta objects and functions. PyRosetta takes advantage of the object-oriented architecture of the new Rosetta release v3.1 to provide users with easy access to all the major functions and objects used by Rosetta developers (A. Leaver-Fay, P. Bradley, D. Baker, et al., manuscript in preparation). PyRosetta can be run in two modes: interactive-mode, which contains tab-completion and help features that are ideal for beginners, and script-mode, which is better suited for algorithm development. We chose Python as the scripting language because it is a sophisticated programming language that enjoys widespread use in the biology community and allows PyRosetta to be compatible with other Python-based packages such as PyMOL (DeLano, 2002) and Biopython (Cock, et al., 2009). Our hope is that the extensive online communities of users of the many Python-based bioinformatics tools will help develop and share interfaces with PyRosetta. Since familiarity with Rosetta objects and functions is essential for new users, a tutorial, user’s manual, and sample scripts demonstrating usage are available on the website.

We used a number of tools to convert the classes and functions in the Rosetta C++ source code into a Python-accessible form. GCC-XML (Kitware Inc., 2007) parses the classes and functions of the Rosetta C++ code into an XML representation using the GCC compiler. The Py++ package (Language Binding Project, 2009) uses the GCC-XML objects and generates Python bindings using the Boost.Python library (Boost, 2009). To make this feasible for over two thousand Rosetta objects, the entire process is automated. The scripts are portable and tested on Mac OS X, Linux, and Windows platforms. The building process requires 4-6 hours depending on the platform, and pre-generated binary libraries are provided for download for all three platforms. A version of PyRosetta will be made available with each new release of Rosetta, along with intermediate versions that add features, fix bugs, improve accessibility, or expand documentation. In terms of computational cost, PyRosetta performs almost identically to the C++ build of Rosetta, with performance benchmarks indicating a less than 5% difference in speed.

5.4.2 PyRosetta Features

In PyRosetta, a biomolecular system is represented by an object called the Pose. The Pose contains all the structure and sequence information necessary to completely define the system in both Cartesian and internal coordinates. The internal coordinates used for proteins include the backbone torsion angles, φ, ψ, and ω; the side-chain torsion angles χ; and ‘jumps’, which define the relative position and orientation of multiple continuous polypeptide chains in the system. Nucleic acids and other molecules (ligands, post-translational modifications, etc.) are similarly built using internal coordinates; solvent is typically treated implicitly. The spatial and internal coordinates are synchronized within the Pose. A starting structure is read into a Pose from a Protein Data Bank file and its conformation is altered by perturbing its internal coordinates (Figure 5.3).


Figure 5.3 Simple Monte Carlo peptide folding algorithm in PyRosetta A sample script that demonstrates the usage of various PyRosetta functions.

The energy of a structure, or Pose, can be calculated using the ScoreFunction. The ScoreFunction represents an energy function that is the sum of weighted independent energy terms. Over twenty energy terms are available including Van der Waals attractive and repulsive components, hydrogen bonding, solvation, and electrostatics energies. The user can create a custom scoring function by setting the weights of the desired components to non-zero values (Fig 5.3). The energy of a structure is calculated by passing a pose into the scoring function, which returns the score. Sampling functions are written as Mover objects in Rosetta. As a general form, a Mover.apply(pose) function carries out that particular perturbation on that Pose. Examples of movers include those that perturb backbone torsion angles, such as SmallMover, or minimize the structure using a given energy function, such as a

143

MinMover. The conformational search space can be limited and controlled using the MoveMap and FoldTree objects. The MoveMap specifies which internal coordinates are to be held rigid during sampling. The FoldTree instructs the Pose on how to convert its internal coordinates into Cartesian coordinates (Bradley and Baker, 2006) and provides a framework for controlling how local perturbations propagate through the global structure (for example, in loop modeling).

The MonteCarlo object performs the bookkeeping for a Monte Carlo simulation, including storing the current structure and energy, calculating the change in energy caused by a perturbation, and applying the Metropolis criterion using the MonteCarlo.boltzmann(Pose) function (Figure 5.3). In addition to the basic movers, there are movers that execute standard Rosetta protocols. Examples include DockingLowRes() for the low-resolution phase of protein docking, LoopMover_RefineCCD for the high-resolution refinement phase of loop modeling, and the versatile PackRotamersMover, which carries out side-chain packing and design using a rotamer library.
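The Monte Carlo bookkeeping described above can be sketched as follows. This is a minimal plain-Python illustration, not PyRosetta's MonteCarlo class: `ToyMonteCarlo`, the `energy` function, the temperature, and the list-of-floats state are all assumptions. The method is named `boltzmann` only to echo the text. It accepts a trial state if the energy drops, or otherwise with probability exp(-ΔE/kT), and rolls the state back on rejection.

```python
import math
import random


class ToyMonteCarlo:
    """Keeps the last accepted and lowest-scoring states of a trajectory."""

    def __init__(self, state, score_fn, kT=1.0, seed=None):
        self.score_fn = score_fn
        self.kT = kT
        self.rng = random.Random(seed)
        self.last_state = list(state)
        self.last_score = score_fn(state)
        self.lowest_state = list(state)
        self.lowest_score = self.last_score

    def boltzmann(self, state):
        """Apply the Metropolis criterion to a trial state.

        Returns True if the trial is accepted. On rejection, `state` is
        reset in place to the last accepted state.
        """
        trial_score = self.score_fn(state)
        delta = trial_score - self.last_score
        if delta <= 0.0 or self.rng.random() < math.exp(-delta / self.kT):
            self.last_state = list(state)
            self.last_score = trial_score
            if trial_score < self.lowest_score:
                self.lowest_state = list(state)
                self.lowest_score = trial_score
            return True
        state[:] = self.last_state
        return False


def energy(state):
    # Invented one-term energy with its minimum at the origin.
    return sum(x * x for x in state)


state = [3.0, -2.0]
mc = ToyMonteCarlo(state, energy, kT=0.5, seed=7)
move_rng = random.Random(11)
for _ in range(500):
    i = move_rng.randrange(len(state))
    state[i] += move_rng.uniform(-0.5, 0.5)           # trial perturbation
    mc.boltzmann(state)                               # accept or roll back

print(mc.lowest_score <= energy([3.0, -2.0]))
```

Separating the perturbation from the accept-or-reject step lets any mover be combined with the same bookkeeping object.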

Finally, the PyJobDistributor object manages multiple simulation trajectories simultaneously while storing all the decoys and tabulating a score file.
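The job-distribution idea can be sketched in plain Python under simplifying assumptions. This is not the PyJobDistributor interface: the functions `run_trajectories`, `format_score_table`, and `random_walk`, the decoy naming scheme, and the tab-separated layout are invented here. The sketch also runs its trajectories one after another and keeps everything in memory rather than writing files. It shows only the bookkeeping pattern of collecting one decoy and one score per trajectory.

```python
import random


def run_trajectories(n_decoys, trajectory, seed=0):
    """Run `trajectory` n_decoys times, collecting structures and scores."""
    decoys = {}
    score_rows = []
    for i in range(1, n_decoys + 1):
        rng = random.Random(seed + i)                 # independent stream per decoy
        structure, score = trajectory(rng)
        name = f"decoy_{i:04d}"
        decoys[name] = structure
        score_rows.append((name, score))
    return decoys, score_rows


def format_score_table(score_rows):
    """Tabulate (name, score) pairs as tab-separated text."""
    lines = ["decoy\tscore"]
    lines += [f"{name}\t{score:.3f}" for name, score in score_rows]
    return "\n".join(lines)


def random_walk(rng, n_steps=100):
    """Invented trajectory: a 1-D walk scored by squared distance from zero."""
    x = 0.0
    for _ in range(n_steps):
        x += rng.uniform(-1.0, 1.0)
    return [x], x * x


decoys, rows = run_trajectories(5, random_walk)
print(format_score_table(rows))
best = min(rows, key=lambda row: row[1])[0]           # name of the lowest-scoring decoy
```

Tabulating scores alongside decoy names makes it straightforward to pick out the lowest-scoring structures after many independent runs.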

Scripts demonstrating molecular modeling protocols such as alanine-scanning (Kortemme and Baker, 2002), protein and small-molecule docking (Chaudhury and Gray, 2007; Davis and Baker, 2009), protein design (Kuhlman and Baker, 2000), and all-atom relaxation (Bradley et al., 2005) can be found on the PyRosetta website.

5.5 Conclusions

Throughout my research I have consistently pushed the development of novel algorithms and protocols towards new challenges in modeling protein interactions. In our early CAPRI efforts, we explored the use of different types of backbone flexibility and observed first-hand the challenges associated with it. I developed flexible docking algorithms, inspired by kinetic models of protein binding, as efficient backbone conformational sampling strategies during docking. The resulting ensemble docking method has been useful not only in modeling proteins with modest conformational changes, but also in docking NMR structures and homology models. Finally, moving beyond structure prediction of protein interfaces, I used flexible peptide docking to develop a structural and energetic model for substrate recognition in HIV-1 protease, and then extended the algorithm to other enzyme-substrate systems.

Molecular modeling of biochemical systems of pharmacological and biomedical interest regularly involves a large degree of algorithm customization. These systems are often too complex to be modeled with a one-size-fits-all approach. Towards that end, I have contributed to the development of Rosetta v3, which modernizes the Rosetta software code, and was central in the inception and development of PyRosetta, a script-based modeling tool that has the potential to make customized molecular modeling accessible to a broader audience.

The future of molecular modeling lies in two directions. The first is in expanding its applicability to biochemical systems of biomedical and technological significance, by addressing the technical and scientific challenges associated with modeling these complex systems. The second is in broadening its user base beyond a core of experienced users, into the larger community of structural biologists and biochemists, by improving the accessibility and usability of powerful modeling tools.

Just as protein structures revolutionized how we think of biochemical interactions, molecular modeling has the potential to revolutionize how we think of protein structures. It is my hope that the contributions made by my graduate research, both in pushing the limits of what can be accurately modeled, and in improving accessibility of the Rosetta modeling package, will serve as a solid foundation for future research and scientific achievement.

SUPPLEMENTARY MATERIALS

Supplementary Figure 3.1

Supplementary Figure 3.2

REFERENCES

1.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res 28, 235-42.

2.

Frigerio, F., Coda, A., Pugliese, L., Lionetti, C., Menegatti, E., Amiconi, G., Schnebli, H. P., Ascenzi, P. & Bolognesi, M. (1992). Crystal and molecular structure of the bovine alpha-chymotrypsin-eglin c complex at 2.0 A resolution. J Mol Biol 225, 107-23.

3.

Birtalan, S. C., Phillips, R. M. & Ghosh, P. (2002). Three-dimensional secretion signals in chaperone-effector complexes of bacterial pathogens. Mol Cell 9, 971-80.

4.

Engh, R. A. & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica A47, 392-400.

5.

Wang, C., Bradley, P. & Baker, D. (2007). Protein-protein docking with backbone flexibility. J Mol Biol 373, 503-19.

6.

Chen, R., Li, L. & Weng, Z. (2003). ZDOCK: an initial-stage protein-docking algorithm. Proteins 52, 80-7.

7.

Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A. & Baker, D. (2003). Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331, 281-99.

8.

Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2003). ICM-DISCO docking by global energy optimization with fully flexible side-chains. Proteins 52, 113-7.

9.

Dominguez, C., Boelens, R. & Bonvin, A. M. (2003). HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125, 1731-7.

10.

Gray, J. J. (2006). High-resolution protein-protein docking. Curr Opin Struct Biol 16, 183-93.

11.

Janin, J., Henrick, K., Moult, J., Eyck, L. T., Sternberg, M. J., Vajda, S., Vakser, I. & Wodak, S. J. (2003). CAPRI: a Critical Assessment of PRedicted Interactions. Proteins 52, 2-9.

12.

Gray, J. J., Moughon, S. E., Kortemme, T., Schueler-Furman, O., Misura, K. M., Morozov, A. V. & Baker, D. (2003). Protein-protein docking predictions for the CAPRI experiment. Proteins 52, 118-22.

13.

Daily, M. D., Masica, D., Sivasubramanian, A., Somarouthu, S. & Gray, J. J. (2005). CAPRI rounds 3-5 reveal promising successes and future challenges for RosettaDock. Proteins 60, 181-6.

14.

Zhang, C., Liu, S. & Zhou, Y. (2005). Docking prediction using biological information, ZDOCK sampling technique, and clustering guided by the DFIRE statistical energy function. Proteins 60, 314-8.

15.

Ma, X. H., Li, C. H., Shen, L. Z., Gong, X. Q., Chen, W. Z. & Wang, C. X. (2005). Biologically enhanced sampling geometric docking and backbone flexibility treatment with multiconformational superposition. Proteins 60, 319-23.

16.

van Dijk, A. D., de Vries, S. J., Dominguez, C., Chen, H., Zhou, H. X. & Bonvin, A. M. (2005). Data-driven docking: HADDOCK's adventures in CAPRI. Proteins 60, 232-8.

17.

Sivasubramanian, A., Chao, G., Pressler, H. M., Wittrup, K. D. & Gray, J. J. (2006). Structural model of the mAb 806-EGFR complex using computational docking followed by computational and experimental mutagenesis. Structure 14, 401-14.

18.

Smith, G. R., Sternberg, M. J. & Bates, P. A. (2005). The relationship between the flexibility of proteins and their conformational states on forming protein-protein complexes with an application to protein-protein docking. J Mol Biol 347, 1077-101.

19.

Méndez, R., Leplae, R., De Maria, L. & Wodak, S. J. (2003). Assessment of blind predictions of protein-protein interactions: Current status of docking methods. Proteins: Structure, Function, and Genetics 52, 51-67.

20.

Wang, C., Schueler-Furman, O. & Baker, D. (2005). Improved side-chain modeling for protein-protein docking. Protein Sci 14, 1328-39.

21.

Kim, D. E., Chivian, D. & Baker, D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-31.

22.

Hou, Z., Bernstein, D. A., Fox, C. A. & Keck, J. L. (2005). Structural basis of the Sir1-origin recognition complex interaction in transcriptional silencing. Proc Natl Acad Sci U S A 102, 8489-94.

23.

Bonsor, D. A., Grishkovskaya, I., Dodson, E. J. & Kleanthous, C. (2007). Molecular mimicry enables competitive recruitment by a natively disordered protein. J Am Chem Soc 129, 4800-7.

24.

Bose, M. E., McConnell, K. H., Gardner-Aukema, K. A., Muller, U., Weinreich, M., Keck, J. L. & Fox, C. A. (2004). The origin recognition complex and Sir4 protein recruit Sir1p to yeast silent chromatin through independent interactions requiring a common Sir1p domain. Mol Cell Biol 24, 774-86.

25.

Zhang, Z., Hayashi, M. K., Merkel, O., Stillman, B. & Xu, R. M. (2002). Structure and function of the BAH-containing domain of Orc1p in epigenetic silencing. Embo J 21, 4600-11.

26.

Kortemme, T. & Baker, D. (2002). A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci U S A 99, 14116-21.

27.

Kortemme, T., Kim, D. E. & Baker, D. (2004). Computational alanine scanning of protein-protein interfaces. Sci STKE 2004, pl2.

28.

Wang, C., Bradley, P. & Baker, D. (2007). CAPRI Issue (fill here). Proteins.

29.

Heurgue-Hamard, V., Champ, S., Engstrom, A., Ehrenberg, M. & Buckingham, R. H. (2002). The hemK gene in Escherichia coli encodes the N(5)-glutamine methyltransferase that modifies peptide release factors. Embo J 21, 769-78.

30.

Rohl, C. A., Strauss, C. E., Chivian, D. & Baker, D. (2004). Modeling structurally variable regions in homologous proteins with rosetta. Proteins 55, 656-77.

31.

Vestergaard, B., Van, L. B., Andersen, G. R., Nyborg, J., Buckingham, R. H. & Kjeldgaard, M. (2001). Bacterial polypeptide release factor RF2 is structurally distinct from eukaryotic eRF1. Mol Cell 8, 1375-82.

32.

Schubert, H. L., Phillips, J. D. & Hill, C. P. (2003). Structures along the catalytic pathway of PrmC/HemK, an N5-glutamine AdoMet-dependent methyltransferase. Biochemistry 42, 5592-9.

33.

Graille, M., Heurgue-Hamard, V., Champ, S., Mora, L., Scrima, N., Ulryck, N., van Tilbeurgh, H. & Buckingham, R. H. (2005). Molecular basis for bacterial class I release factor methylation by PrmC. Mol Cell 20, 917-27.

34.

Hyvonen, M., Macias, M. J., Nilges, M., Oschkinat, H., Saraste, M. & Wilmanns, M. (1995). Structure of the binding site for inositol phosphates in a PH domain. Embo J 14, 4676-85.

35.

Shiba, T., Kawasaki, M., Takatsu, H., Nogi, T., Matsugaki, N., Igarashi, N., Suzuki, M., Kato, R., Nakayama, K. & Wakatsuki, S. (2003). Molecular mechanism of membrane recruitment of GGA by ARF in lysosomal protein transport. Nat Struct Biol 10, 386-93.

36.

Bradley, P. & Baker, D. (2006). Improved beta-protein structure prediction by multilevel optimization of nonlocal strand pairings and local backbone conformation. Proteins 65, 922-9.

37.

Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268, 209-25.

38.

Canutescu, A. A. & Dunbrack, R. L., Jr. (2003). Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci 12, 963-72.

39.

Goldberg, J. (1998). Structural basis for activation of ARF GTPase: mechanisms of guanine nucleotide exchange and GTP-myristoyl switching. Cell 95, 237-48.

40.

Dubois, T., Paleotti, O., Mironov, A. A., Fraisier, V., Stradal, T. E., De Matteis, M. A., Franco, M. & Chavrier, P. (2005). Golgi-localized GAP for Cdc42 functions downstream of ARF1 to control Arp2/3 complex and F-actin dynamics. Nat Cell Biol 7, 353-64.

41.

Grunberg, R., Leckner, J. & Nilges, M. (2004). Complementarity of structure ensembles in protein-protein binding. Structure 12, 2125-36.

42.

Betts, M. J. & Sternberg, M. J. (1999). An analysis of conformational changes on protein-protein association: implications for predictive docking. Protein Eng 12, 271-83.

43.

Daily, M. D. & Gray, J. J. (2007). Local motions in a benchmark of allosteric proteins. Proteins 67, 385-99.

44.

Ray, M. C., Germon, P., Vianney, A., Portalier, R. & Lazzaroni, J. C. (2000). Identification by genetic suppression of Escherichia coli TolB residues important for TolB-Pal interaction. J Bacteriol 182, 821-4.

45.

Abergel, C., Bouveret, E., Claverie, J. M., Brown, K., Rigal, A., Lazdunski, C. & Benedetti, H. (1999). Structure of the Escherichia coli TolB protein determined by MAD methods at 1.95 A resolution. Structure 7, 1291-300.

46.

Abergel, C., Walburger, A., Bouveret, E. & Claverie, J. M. (2004). MAD structure of a periplasmic domain in E. coli Pal protein. to be published.

47.

Menetrey, J., Perderiset, M., Cicolari, J., Dubois, T., El Khatib, N., El Khadali, F., Franco, M., Chavrier, P. & Houdusse, A. (2007). Structural Basis of Arhgap21-Mediated Cross-Talk between Arf and Rho Signalling Pathways. to be published.

48.

Tong, H., Hateboer, G., Perrakis, A., Bernards, R. & Sixma, T. K. (1997). Crystal structure of murine/human Ubc9 provides insight into the variability of the ubiquitin-conjugating system. J Biol Chem 272, 21381-7.

49.

Choe, J., Avvakumov, G., Newman, E., Mackenzie, F., Kozieradzki, I., Bochkarev, A., Sundstrom, M., Arrowsmith, C., Edwards, A., Dhepaganon, S. & (SGC), S. G. C. (2005). Ubiquitin-conjugating enzyme E2-25kDa (Huntingtin interacting protein 2). to be published.

50.

Pichler, A., Knipscheer, P., Oberhofer, E., van Dijk, W. J., Korner, R., Olsen, J. V., Jentsch, S., Melchior, F. & Sixma, T. K. (2005). SUMO modification of the ubiquitin-conjugating enzyme E2-25K. Nat Struct Mol Biol 12, 264-9.

51.

Lin, D., Tatham, M. H., Yu, B., Kim, S., Hay, R. T. & Chen, Y. (2002). Identification of a substrate recognition site on Ubc9. J Biol Chem 277, 21740-8.

52.

Walker, J., Avvakumov, G., Xue, S., Newman, E., Mackenzie, F., Weigelt, J., Sundstrom, M., Arrowsmith, C., Edwards, A., Bochkarev, A., Dhepaganon, S. & SGC, S. G. C. (2007). A novel and unexpected complex between SUMO-1-conjugating enzyme ubc9 and the ubiquitin-conjugating enzyme E2-25kDa. to be published.

53.

Liang, S., Liu, S., Zhang, C. & Zhou, Y. (2007). A simple reference state makes a significant improvement in near-native selections from structurally refined docking decoys. Proteins In press.

54.

Verdecia, M. A., Joazeiro, C. A., Wells, N. J., Ferrer, J. L., Bowman, M. E., Hunter, T. & Noel, J. P. (2003). Conformational flexibility underlies ubiquitin ligation mediated by the WWP1 HECT domain E3 ligase. Mol Cell 11, 249-59.

55.

Ogunjimi, A. A., Briant, D. J., Pece-Barbara, N., Le Roy, C., Di Guglielmo, G. M., Kavsak, P., Rasmussen, R. K., Seet, B. T., Sicheri, F. & Wrana, J. L. (2005). Regulation of Smurf2 ubiquitin ligase activity by anchoring the E2 to the HECT domain. Mol Cell 19, 297-308.

56.

Schueler-Furman, O., Wang, C. & Baker, D. (2005). Progress in protein-protein docking: atomic resolution predictions in the CAPRI experiment using RosettaDock with an improved treatment of side-chain flexibility. Proteins 60, 187-94.

57.

Huang, L., Kinnucan, E., Wang, G., Beaudenon, S., Howley, P. M., Huibregtse, J. M. & Pavletich, N. P. (1999). Structure of an E6AP-UbcH7 complex: insights into ubiquitination by the E2-E3 enzyme cascade. Science 286, 1321-6.

58.

Walker, J., Avvakumov, G., Xue, S., Butler-Cole, C., Weigelt, J., Sundstrom, M., Arrowsmith, C., Edwards, A., Bochkarev, A. & Dhe-paganon, S. (2007). NEDD4-like E3 Ubiquitin-protein Ligase. to be published.

59.

Lo Conte, L., Chothia, C. & Janin, J. (1999). The atomic structure of protein-protein recognition sites. J Mol Biol 285, 2177-98.

60.

Bonvin, A. M. (2006). Flexible protein-protein docking. Curr Opin Struct Biol 16, 194-200.

61.

Fischer, E. (1894). Einfluss der Configuration auf die Wirkung der Enzyme. Ber. Dt. Chem. Ges. 27, 2985-2993.

62.

Wodak, S. J. & Janin, J. (1978). Computer analysis of protein-protein interaction. J Mol Biol 124, 323-42.

63.

Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C. & Vakser, I. A. (1992). Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci U S A 89, 2195-9.

64.

Li, L., Chen, R. & Weng, Z. (2003). RDOCK: refinement of rigid-body protein docking predictions. Proteins 53, 693-707.

65.

Comeau, S. R., Gatchell, D. W., Vajda, S. & Camacho, C. J. (2004). ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20, 45-50.

66.

Vajda, S. (2005). Classification of protein complexes based on docking difficulty. Proteins 60, 176-80.

67.

Goodford, P. J. (1985). A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J Med Chem 28, 849-57.

68.

Kuntz, I. D., Blaney, J. M., Oatley, S. J., Langridge, R. & Ferrin, T. E. (1982). A geometric approach to macromolecule-ligand interactions. J Mol Biol 161, 269-88.

69.

Monod, J., Wyman, J. & Changeux, J. P. (1965). On the Nature of Allosteric Transitions: A Plausible Model. J Mol Biol 12, 88-118.

70.

Kumar, S., Ma, B., Tsai, C. J., Sinha, N. & Nussinov, R. (2000). Folding and binding cascades: dynamic landscapes and population shifts. Protein Sci 9, 10-9.

71.

Krol, M., Chaleil, R. A., Tournier, A. L. & Bates, P. A. (2007). Implicit flexibility in protein docking: Cross-docking and local refinement. Proteins.

72.

Bastard, K., Prevost, C. & Zacharias, M. (2006). Accounting for loop flexibility during protein-protein docking. Proteins 62, 956-69.

73.

Koshland, D. E. (1958). Application of a Theory of Enzyme Specificity to Protein Synthesis. Proc Natl Acad Sci U S A 44, 98-104.

74.

Krol, M., Tournier, A. L. & Bates, P. A. (2007). Flexible relaxation of rigid-body docking solutions. Proteins 68, 159-69.

75.

Smith, G. R., Fitzjohn, P. W., Page, C. S. & Bates, P. A. (2005). Incorporation of flexibility into rigid-body docking: applications in rounds 3-5 of CAPRI. Proteins 60, 263-8.

76.

de Vries, S. J., van Dijk, A. D., Krzeminski, M., van Dijk, M., Thureau, A., Hsu, V., Wassenaar, T. & Bonvin, A. M. (2007). HADDOCK versus HADDOCK: new features and performance of HADDOCK2.0 on the CAPRI targets. Proteins 69, 726-33.

77.

Chaudhury, S., Sircar, A., Sivasubramanian, A., Berrondo, M. & Gray, J. J. (2007). Incorporating biochemical information and backbone flexibility in RosettaDock for CAPRI rounds 6-12. Proteins.

78.

Wang, C., Schueler-Furman, O., Andre, I., London, N., Fleishman, S. J., Bradley, P., Qian, B. & Baker, D. (2007). RosettaDock in CAPRI rounds 6-12. Proteins.

79.

Dominguez, C., Bonvin, A. M., Winkler, G. S., van Schaik, F. M. A., Timmers, H. T. M. & Boelens, R. (2004). Structural model of the UbcH5B/CNOT4 complex revealed by combining NMR, mutagenesis, and docking approaches. Structure 12, 633-644.

80.

Jensen, G. A., Andersen, O. M., Bonvin, A. M., Bjerrum-Bohr, I., Etzerodt, M., Thogersen, H. C., O'Shea, C., Poulsen, F. M. & Kragelund, B. B. (2006). Binding site structure of one LRP-RAP complex: implications for a common ligand-receptor binding motif. Journal of Molecular Biology 362, 700-716.

81.

Drogen-Petit, A., Zwahlen, C., Peter, M. & Bonvin, A. M. (2004). Insight into molecular interactions between two PB1 domains. Journal of Molecular Biology 336, 1195-1210.

82.

Philippopoulos, M. & Lim, C. (1999). Exploring the dynamic information content of a protein NMR structure: comparison of a molecular dynamics simulation with the NMR and X-ray structures of Escherichia coli ribonuclease HI. Proteins 36, 87-110.

83.

Chen, R., Mintseris, J., Janin, J. & Weng, Z. (2003). A protein-protein docking benchmark. Proteins 52, 88-91.

84.

Mintseris, J., Wiehe, K., Pierce, B., Anderson, R., Chen, R., Janin, J. & Weng, Z. (2005). Protein-Protein Docking Benchmark 2.0: an update. Proteins 60, 214-6.

85.

Mendez, R., Leplae, R., De Maria, L. & Wodak, S. J. (2003). Assessment of blind predictions of protein-protein interactions: current status of docking methods. Proteins 52, 51-67.

86.

Sivasubramanian, A., Maynard, J. A. & Gray, J. J. (2007). Modeling the structure of mAb 14B7 bound to the anthrax protective antigen. Proteins.

87.

Misura, K. M. & Baker, D. (2005). Progress and challenges in high-resolution refinement of protein structure models. Proteins 59, 15-29.

88.

Camacho, C. J. & Vajda, S. (2002). Protein-protein association kinetics and protein docking. Curr Opin Struct Biol 12, 36-40.

89.

Bradley, P., Misura, K. M. & Baker, D. (2005). Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868-71.

90.

Harel, M., Kleywegt, G. J., Ravelli, R. B., Silman, I. & Sussman, J. L. (1995). Crystal structure of an acetylcholinesterase-fasciculin complex: interaction of a three-fingered toxin from snake venom with its target. Structure 3, 1355-66.

91.

le Du, M. H., Housset, D., Marchot, P., Bougis, P. E., Navaza, J. & Fontecilla-Camps, J. C. (1996). Structure of fasciculin 2 from green mamba snake venom: evidence for unusual loop flexibility. Acta Crystallogr D Biol Crystallogr 52, 87-92.

92.

Gabdoulline, R. R. & Wade, R. C. (2001). Protein-protein association: investigation of factors influencing association rates by brownian dynamics simulations. J Mol Biol 306, 1139-55.

93.

Bui, J. M. & McCammon, J. A. (2006). Protein complex formation by acetylcholinesterase and the neurotoxin fasciculin-2 appears to involve an induced-fit mechanism. Proc Natl Acad Sci U S A 103, 15451-6.

94.

Brunne, R. M., Berndt, K. D., Guntert, P., Wuthrich, K. & van Gunsteren, W. F. (1995). Structure and internal dynamics of the bovine pancreatic trypsin inhibitor in aqueous solution from long-time molecular dynamics simulations. Proteins 23, 49-62.

95.

Guenneugues, M., Drevet, P., Pinkasfeld, S., Gilquin, B., Menez, A. & Zinn-Justin, S. (1997). Picosecond to hour time scale dynamics of a "three finger" toxin: correlation with its toxic and antigenic properties. Biochemistry 36, 16097-108.

96.

Sivasubramanian, A., Sircar, A., Chaudhury, S. & Gray, J. J. (2009). Toward high-resolution homology modeling of antibody Fv regions and application to antibody-antigen docking. Proteins 74, 497-514.

97.

Alexandrov, A., Martzen, M. R. & Phizicky, E. M. (2002). Two proteins that form a complex are required for 7-methylguanosine modification of yeast tRNA. Rna 8, 1253-66.

98.

Leulliot, N., Chaillet, M., Durand, D., Ulryck, N., Blondeau, K. & van Tilbeurgh, H. (2008). Structure of the yeast tRNA m7G methylation complex. Structure 16, 52-61.

99.

Osborne, M. J., Breeze, A. L., Lian, L. Y., Reilly, A., James, R., Kleanthous, C. & Moore, G. R. (1996). Three-dimensional solution structure and 13C nuclear magnetic resonance assignments of the colicin E9 immunity protein Im9. Biochemistry 35, 9505-12.

100.

Sui, M. J., Tsai, L. C., Hsia, K. C., Doudeva, L. G., Ku, W. Y., Han, G. W. & Yuan, H. S. (2002). Metal ions and phosphate binding in the H-N-H motif: crystal structures of the nuclease domain of ColE7/Im7 in complex with a phosphate ion and different divalent metal ions. Protein Sci 11, 2947-57.

101.

Tang, C., Schwieters, C. D. & Clore, G. M. (2007). Open-to-closed transition in apo maltose-binding protein observed by paramagnetic NMR. Nature 449, 1078-82.

102.

Andrec, M., Snyder, D. A., Zhou, Z., Young, J., Montelione, G. T. & Levy, R. M. (2007). A large data set comparison of protein structures determined by crystallography and NMR: statistical test for structural differences and the effect of crystal packing. Proteins 69, 449-65.

103.

de Groot, B. L., van Aalten, D. M., Scheek, R. M., Amadei, A., Vriend, G. & Berendsen, H. J. (1997). Prediction of protein conformational freedom from distance constraints. Proteins 29, 240-51.

104.

Yang, L. W., Eyal, E., Chennubhotla, C., Jee, J., Gronenborn, A. M. & Bahar, I. (2007). Insights into equilibrium dynamics of proteins from comparison of NMR and X-ray data with computational predictions. Structure 15, 741-9.

105.

Sivasubramanian, A., Sircar, A., Chaudhury, S. & Gray, J. J. (2008). High-resolution homology modeling of antibody Fv regions using knowledge-based techniques, de novo loop modeling and docking. submitted.

106.

London, N. & Schueler-Furman, O. (2007). Assessing the energy landscape of CAPRI targets by FunHunt. Proteins 69, 809-815.

107.

Cavasotto, C. N., Kovacs, J. A. & Abagyan, R. A. (2005). Representing receptor flexibility in ligand docking through relevant normal modes. J Am Chem Soc 127, 9632-40.

108.

Markwick, P. R., Bouvignies, G. & Blackledge, M. (2007). Exploring multiple timescale motions in protein GB3 using accelerated molecular dynamics and NMR spectroscopy. J Am Chem Soc 129, 4724-30.

109.

Seeliger, D., Haas, J. & de Groot, B. L. (2007). Geometry-based sampling of conformational transitions in proteins. Structure 15, 1482-92.

110.

Ben-Zeev, E., Kowalsman, N., Ben-Shimon, A., Segal, D., Atarot, T., Noivirt, O., Shay, T. & Eisenstein, M. (2005). Docking to single-domain and multiple-domain proteins: old and new challenges. Proteins 60, 195-201.

111.

Schneidman-Duhovny, D., Inbar, Y., Nussinov, R. & Wolfson, H. J. (2005). Geometry-based flexible and symmetric protein docking. Proteins 60, 224-31.

112.

Tobi, D. & Bahar, I. (2005). Structural changes involved in protein binding correlate with intrinsic motions of proteins in the unbound state. Proc Natl Acad Sci U S A 102, 18908-13.

113.

Noy, E., Tabakman, T. & Goldblum, A. (2007). Constructing ensembles of flexible fragments in native proteins by iterative stochastic elimination is relevant to protein-protein interfaces. Proteins 68, 702-11.

114.

Wiehe, K., Pierce, B., Tong, W. W., Hwang, H., Mintseris, J. & Weng, Z. (2007). The performance of ZDOCK and ZRANK in rounds 6-11 of CAPRI. Proteins 69, 719-725.

115.

Chattopadhyay, D., Evans, D. B., Deibel, M. R., Jr., Vosters, A. F., Eckenrode, F. M., Einspahr, H. M., Hui, J. O., Tomasselli, A. G., Zurcher-Neely, H. A., Heinrikson, R. L. & et al. (1992). Purification and characterization of heterodimeric human immunodeficiency virus type 1 (HIV-1) reverse transcriptase produced by in vitro processing of p66 with recombinant HIV-1 protease. J Biol Chem 267, 14227-32.

116.

Hellen, C. U., Krausslich, H. G. & Wimmer, E. (1989). Proteolytic processing of polyproteins in the replication of RNA viruses. Biochemistry 28, 9881-90.

117.

Oswald, M. & von der Helm, K. (1991). Fibronectin is a non-viral substrate for the HIV proteinase. FEBS Lett 292, 298-300.

118.

Riviere, Y., Blank, V., Kourilsky, P. & Israel, A. (1991). Processing of the precursor of NF-kappa B by the HIV-1 protease during acute infection. Nature 350, 625-6.

119.

Tomasselli, A. G., Sarcich, J. L., Barrett, L. J., Reardon, I. M., Howe, W. J., Evans, D. B., Sharma, S. K. & Heinrikson, R. L. (1993). Human immunodeficiency virus type-1 reverse transcriptase and ribonuclease H as substrates of the viral protease. Protein Sci 2, 2167-76.

120.

Tomaszek, T. A., Jr., Moore, M. L., Strickler, J. E., Sanchez, R. L., Dixon, J. S., Metcalf, B. W., Hassell, A., Dreyer, G. B., Brooks, I., Debouck, C. & et al. (1992). Proteolysis of an active site peptide of lactate dehydrogenase by human immunodeficiency virus type 1 protease. Biochemistry 31, 10153-68.

121.

Tozser, J., Blaha, I., Copeland, T. D., Wondrak, E. M. & Oroszlan, S. (1991). Comparison of the HIV-1 and HIV-2 proteinases using oligopeptide substrates representing cleavage sites in Gag and Gag-Pol polyproteins. FEBS Lett 281, 77-80.

122.

Debouck, C., Gorniak, J. G., Strickler, J. E., Meek, T. D., Metcalf, B. W. & Rosenberg, M. (1987). Human immunodeficiency virus protease expressed in Escherichia coli exhibits autoprocessing and specific maturation of the gag precursor. Proc Natl Acad Sci U S A 84, 8903-6.

123.

Kim, H., Zhang, Y., Heo, Y. S., Oh, H. B. & Chen, S. S. (2008). Specificity rule discovery in HIV-1 protease cleavage site analysis. Comput Biol Chem 32, 71-8.

124.

Kontijevskis, A., Wikberg, J. E. & Komorowski, J. (2007). Computational proteomics analysis of HIV-1 protease interactome. Proteins 68, 305-12.

125.

You, L., Garwicz, D. & Rognvaldsson, T. (2005). Comprehensive bioinformatic analysis of the specificity of human immunodeficiency virus type 1 protease. J Virol 79, 12477-86.

126.

Ridky, T. W., Cameron, C. E., Cameron, J., Leis, J., Copeland, T., Wlodawer, A., Weber, I. T. & Harrison, R. W. (1996). Human immunodeficiency virus, type 1 protease substrate specificity is limited by interactions between substrate amino acids bound in adjacent enzyme subsites. J Biol Chem 271, 4709-17.

127.

Bagossi, P., Sperka, T., Feher, A., Kadas, J., Zahuczky, G., Miklossy, G., Boross, P. & Tozser, J. (2005). Amino acid preferences for a critical substrate binding subsite of retroviral proteases in type 1 cleavage sites. J Virol 79, 4213-8.

128.

Eizert, H., Bander, P., Bagossi, P., Sperka, T., Miklossy, G., Boross, P., Weber, I. T. & Tozser, J. (2008). Amino acid preferences of retroviral proteases for amino-terminal positions in a type 1 cleavage site. J Virol 82, 10111-7.

129.

Beck, Z. Q., Morris, G. M. & Elder, J. H. (2002). Defining HIV-1 protease substrate selectivity. Curr Drug Targets Infect Disord 2, 37-50.

130.

Kontijevskis, A., Prusis, P., Petrovska, R., Yahorava, S., Mutulis, F., Mutule, I., Komorowski, J. & Wikberg, J. E. (2007). A look inside HIV resistance through retroviral protease interaction maps. PLoS Comput Biol 3, e48.

131.

Prabu-Jeyabalan, M., Nalivaika, E. & Schiffer, C. A. (2000). How does a symmetric dimer recognize an asymmetric substrate? A substrate complex of HIV-1 protease. J Mol Biol 301, 1207-20.

132.

Prabu-Jeyabalan, M., Nalivaika, E. & Schiffer, C. A. (2002). Substrate shape determines specificity of recognition for HIV-1 protease: analysis of crystal structures of six substrate complexes. Structure 10, 369-81.

133.

Tie, Y., Boross, P. I., Wang, Y. F., Gaddis, L., Liu, F., Chen, X., Tozser, J., Harrison, R. W. & Weber, I. T. (2005). Molecular basis for substrate recognition and drug resistance from 1.1 to 1.6 angstroms resolution crystal structures of HIV-1 protease mutants with substrate analogs. Febs J 272, 5265-77.

134.

Chellappan, S., Kairys, V., Fernandes, M. X., Schiffer, C. & Gilson, M. K. (2007). Evaluation of the substrate envelope hypothesis for inhibitors of HIV-1 protease. Proteins 68, 561-7.

135.

King, N. M., Prabu-Jeyabalan, M., Nalivaika, E. A. & Schiffer, C. A. (2004). Combating susceptibility to drug resistance: lessons from HIV-1 protease. Chem Biol 11, 1333-8.

136.

Chellappan, S., Kiran Kumar Reddy, G. S., Ali, A., Nalam, M. N., Anjum, S. G., Cao, H., Kairys, V., Fernandes, M. X., Altman, M. D., Tidor, B., Rana, T. M., Schiffer, C. A. & Gilson, M. K. (2007). Design of mutation-resistant HIV protease inhibitors with the substrate envelope hypothesis. Chem Biol Drug Des 69, 298-313.

137.

Surleraux, D. L., de Kock, H. A., Verschueren, W. G., Pille, G. M., Maes, L. J., Peeters, A., Vendeville, S., De Meyer, S., Azijn, H., Pauwels, R., de Bethune, M. P., King, N. M., Prabu-Jeyabalan, M., Schiffer, C. A. & Wigerinck, P. B. (2005). Design of HIV-1 protease inhibitors active on multidrug-resistant virus. J Med Chem 48, 1965-73.

138.

Altman, M. D., Ali, A., Reddy, G. S., Nalam, M. N., Anjum, S. G., Cao, H., Chellappan, S., Kairys, V., Fernandes, M. X., Gilson, M. K., Schiffer, C. A., Rana, T. M. & Tidor, B. (2008). HIV-1 protease inhibitors from inverse design in the substrate envelope exhibit subnanomolar binding to drug-resistant variants. J Am Chem Soc 130, 6099-113.

139.

Kurt, N., Haliloglu, T. & Schiffer, C. A. (2003). Structure-based prediction of potential binding and nonbinding peptides to HIV-1 protease. Biophys J 85, 853-63.

140.

Ozer, N., Haliloglu, T. & Schiffer, C. A. (2006). Substrate specificity in HIV-1 protease by a biased sequence search method. Proteins 64, 444-56.

141.

Wang, W. & Kollman, P. A. (2001). Computational study of protein specificity: the molecular basis of HIV-1 protease drug resistance. Proc Natl Acad Sci U S A 98, 14937-42.

142.

Altman, M. D., Nalivaika, E. A., Prabu-Jeyabalan, M., Schiffer, C. A. & Tidor, B. (2008). Computational design and experimental study of tighter binding peptides to an inactivated mutant of HIV-1 protease. Proteins 70, 678-94.

143.

Chou, K. C. (1996). Prediction of human immunodeficiency virus protease cleavage sites in proteins. Anal Biochem 233, 1-14.

144.

Hao, J., Serohijos, A. W., Newton, G., Tassone, G., Wang, Z., Sgroi, D. C., Dokholyan, N. V. & Basilion, J. P. (2008). Identification and rational redesign of peptide ligands to CRIP1, a novel biomarker for cancers. PLoS Comput Biol 4, e1000138.

145.

Prasad, P. A., Kanagasabai, V., Arunachalam, J. & Gautham, N. (2007). Exploring conformational space using a mean field technique with MOLS sampling. J Biosci 32, 909-20.

146.

Sood, V. D. & Baker, D. (2006). Recapitulation and design of protein binding peptide structures and sequences. J Mol Biol 357, 917-27.

147.

Krohn, A., Redshaw, S., Ritchie, J. C., Graves, B. J. & Hatada, M. H. (1991). Novel binding mode of highly potent HIV-proteinase inhibitors incorporating the (R)-hydroxyethylamine isostere. J Med Chem 34, 3340-2.

148.

Engh, R. A. & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica A47, 392-400.

149.

Chaudhury, S. & Gray, J. J. (2008). Conformer selection and induced fit in flexible backbone protein-protein docking using computational and NMR ensembles. J Mol Biol 381, 1068-87.

150.

Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. (2004). Protein structure prediction using Rosetta. Methods Enzymol 383, 66-93.

151.

Kuhlman, B. & Baker, D. (2000). Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A 97, 10383-8.

152.

Dunbrack, R. L., Jr. & Cohen, F. E. (1997). Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci 6, 1661-81.

153.

Lazaridis, T. & Karplus, M. (1999). Effective energy function for proteins in solution. Proteins 35, 133-52.

154.

R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

155.

Mendez, R., Leplae, R., Lensink, M. F. & Wodak, S. J. (2005). Assessment of CAPRI predictions in rounds 3-5 shows progress in docking procedures. Proteins 60, 150-69.

156.

Humphris, E. L. & Kortemme, T. (2007). Design of multi-specificity in protein interfaces. PLoS Comput Biol 3, e164.

157.

Sammond, D. W., Eletr, Z. M., Purbeck, C., Kimple, R. J., Siderovski, D. P. & Kuhlman, B. (2007). Structure-based protocol for identifying mutations that enhance protein-protein binding affinities. J Mol Biol 371, 1392-404.

158.

Ho, S. K., Coman, R. M., Bunger, J. C., Rose, S. L., O'Brien, P., Munoz, I., Dunn, B. M., Sleasman, J. W. & Goodenow, M. M. (2008). Drug-associated changes in amino acid residues in Gag p2, p7(NC), and p6(Gag)/p6(Pol) in human immunodeficiency virus type 1 (HIV-1) display a dominant effect on replicative fitness and drug response. Virology 378, 272-81.

159.

Dauber, D. S., Ziermann, R., Parkin, N., Maly, D. J., Mahrus, S., Harris, J. L., Ellman, J. A., Petropoulos, C. & Craik, C. S. (2002). Altered substrate specificity of drug-resistant human immunodeficiency virus type 1 protease. J Virol 76, 1359-68.

160.

Prabu-Jeyabalan, M., Nalivaika, E. A., King, N. M. & Schiffer, C. A. (2004). Structural basis for coevolution of a human immunodeficiency virus type 1 nucleocapsid-p1 cleavage site with a V82A drug-resistant mutation in viral protease. J Virol 78, 12446-54.

161.

Rhee, S. Y., Gonzales, M. J., Kantor, R., Betts, B. J., Ravela, J. & Shafer, R. W. (2003). Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31, 298-303.

162.

Lin, Y. C., Beck, Z., Morris, G. M., Olson, A. J. & Elder, J. H. (2003). Structural basis for distinctions between substrate and inhibitor specificities for feline immunodeficiency virus and human immunodeficiency virus proteases. J Virol 77, 6589-600.

163.

Silva, A. M., Cachau, R. E., Sham, H. L. & Erickson, J. W. (1996). Inhibition and catalytic mechanism of HIV-1 aspartic protease. J Mol Biol 255, 321-46.

164.

Couture, J. F., Collazo, E., Brunzelle, J. S. & Trievel, R. C. (2005). Structural and functional analysis of SET8, a histone H4 Lys-20 methyltransferase. Genes Dev 19, 1455-65.

165.

Zennou, V., Mammano, F., Paulous, S., Mathez, D. & Clavel, F. (1998). Loss of viral fitness associated with multiple Gag and Gag-Pol processing defects in human immunodeficiency virus type 1 variants selected for resistance to protease inhibitors in vivo. J Virol 72, 3300-6.

166.

Martinez-Picado, J., Savara, A. V., Shi, L., Sutton, L. & D'Aquila, R. T. (2000). Fitness of human immunodeficiency virus type 1 protease inhibitor-selected single mutants. Virology 275, 318-22.

167.

Mammano, F., Trouplin, V., Zennou, V. & Clavel, F. (2000). Retracing the evolutionary pathways of human immunodeficiency virus type 1 resistance to protease inhibitors: virus fitness in the absence and in the presence of drug. J Virol 74, 8524-31.

168.

Lin, Y., Lin, X., Hong, L., Foundling, S., Heinrikson, R. L., Thaisrivongs, S., Leelamanit, W., Raterman, D., Shah, M., Dunn, B. M. et al. (1995). Effect of point mutations on the kinetics and the inhibition of human immunodeficiency virus type 1 protease: relationship to drug resistance. Biochemistry 34, 1143-52.

169.

Doyon, L., Croteau, G., Thibeault, D., Poulin, F., Pilote, L. & Lamarre, D. (1996). Second locus involved in human immunodeficiency virus type 1 resistance to protease inhibitors. J Virol 70, 3763-9.

170.

Zhang, Y. M., Imamichi, H., Imamichi, T., Lane, H. C., Falloon, J., Vasudevachari, M. B. & Salzman, N. P. (1997). Drug resistance during indinavir therapy is caused by mutations in the protease gene and in its Gag substrate cleavage sites. J Virol 71, 6662-70.

171.

Lin, Y. C., Beck, Z., Lee, T., Le, V. D., Morris, G. M., Olson, A. J., Wong, C. H. & Elder, J. H. (2000). Alteration of substrate and inhibitor specificity of feline immunodeficiency virus protease. J Virol 74, 4710-20.

172.

Beck, Z. Q., Lin, Y. C. & Elder, J. H. (2001). Molecular basis for the relative substrate specificity of human immunodeficiency virus type 1 and feline immunodeficiency virus proteases. J Virol 75, 9458-69.

173.

Hwang, H., Pierce, B., Mintseris, J., Janin, J. & Weng, Z. (2008). Protein-protein docking benchmark version 3.0. Proteins 73, 705-9.

174.

Davis, I. W. & Baker, D. (2009). RosettaLigand docking with full ligand and receptor flexibility. J Mol Biol 385, 381-92.

175.

Hermoso, J. A., Mayoral, T., Faro, M., Gomez-Moreno, C., Sanz-Aparicio, J. & Medina, M. (2002). Mechanism of coenzyme recognition and binding revealed by crystal structure analysis of ferredoxin-NADP+ reductase complexed with NADP+. J Mol Biol 319, 1133-42.

176.

Morales, R., Kachalova, G., Vellieux, F., Charon, M. H. & Frey, M. (2000). Crystallographic studies of the interaction between the ferredoxin-NADP+ reductase and ferredoxin from the cyanobacterium Anabaena: looking for the elusive ferredoxin molecule. Acta Crystallogr D Biol Crystallogr 56, 1408-12.

CURRICULUM VITAE

PERSONAL

Born January 22nd, 1983, Calcutta, India

EDUCATION

Johns Hopkins University, Ph.D. in Computational and Molecular Biophysics, 2010

Johns Hopkins University, B.A. in Cellular and Molecular Neuroscience, 2005

RESEARCH EXPERIENCE

Graduate student, Johns Hopkins University
Advisor: Dr. Jeffrey J. Gray
Topic: Computational modeling of the structure and specificity of protein interactions

Research assistant, Massachusetts General Hospital
Advisor: Dr. Marcy E. MacDonald
Topic: Altered proteomic phosphorylation levels in Huntington’s Disease

Research assistant, Johns Hopkins University
Advisor: Dr. Kathleen A. Turano
Topic: Gender differences in spatial perception in virtual reality environments

PUBLICATIONS

Chaudhury, S., Lyskov, S., Gray, J.J. (2010) PyRosetta: a script-based interface for implementing custom molecular modeling algorithms using Rosetta. Bioinformatics (in press)

Chaudhury, S. & Gray, J.J., (2009) Identification of specificity determinants in HIV-1 protease using computational peptide docking: implications for drug resistance. Structure. 17(12):1636-1648.

Chaudhury, S. & Gray, J.J., (2008) Conformer selection and induced fit in flexible backbone protein-protein docking using computational and NMR ensembles. Journal of Molecular Biology. 381(4):1068-1087.

Pickin, K., Chaudhury, S., Dancy, B.C.R., Gray, J.J., Cole, P.A. (2008) Analysis of protein kinase autophosphorylation using expressed protein ligation and computational modeling. Journal of the American Chemical Society. 130:5667-5669.

Sivasubramanian, A., Sircar, A., Chaudhury, S., Gray, J.J., (2008) High-resolution homology modeling of antibody Fv regions using knowledge-based techniques, de novo loop modeling and docking. Proteins. 74:497-514.

Chaudhury, S., Sircar, A., Sivasubramanian, A., Berrondo, M., Gray, J.J., (2007) Incorporating biochemical information and backbone flexibility in RosettaDock for CAPRI rounds 6-12. Proteins. 69:793-800.

Fortenbaugh, F.C., Chaudhury, S., Hicks, J., Lei, H., Turano, K.A. (2006) Gender differences in cue preference during path integration in virtual environments. Transactions on Applied Perception. 4(1):1-16.

Chaudhury, S., Eisinger, J. M., Lei, H., Hicks, J., Turano, K.A. (2004) Visual illusion in far space alters women's walking in a virtual world. Experimental Brain Research. 159:360-369.

SELECTED HONORS AND AWARDS

Johns Hopkins University, Departmental Honors in Neuroscience, 2004

Johns Hopkins University, Dean’s List, 2002-2004

Johns Hopkins University, Woodrow Wilson Research Fellowship, 2001-2004

SCIENTIFIC MEETINGS AND TALKS

Chaudhury, S. & Gray, J.J. (2010) Identification of specificity determinants in HIV-1 protease using computational peptide docking: implications for drug resistance. Gordon Research Seminar – Biomolecular Interactions and Methods. Galveston, TX. [presentation]

Chaudhury, S. & Gray, J.J. (2009) Identification of specificity determinants in HIV-1 protease using computational peptide docking: implications for drug resistance. American Institute of Chemical Engineers Conference. Nashville, TN. [presentation]

Chaudhury, S. & Gray, J.J. (2009) “PyRosetta – Rosetta molecular modeling for the broader community.” Rosetta Conference, Seattle, WA. [presentation]

Chaudhury, S. & Gray, J.J. (2007) “Flexible backbone protein-protein docking with conformer selection and induced fit models.” Modeling of Protein Interactions Conference. Lawrence, KS. [poster]

Chaudhury, S. & Gray, J.J. (2007) “Ensemble docking in RosettaDock: Docking idealized, minimized, structures with multiple backbone conformations.” Rosetta Conference, Seattle, WA. [presentation]

Chaudhury, S. & Gray, J.J. (2007) “Towards ensemble docking in RosettaDock: Local docking of idealized, energy minimized structures with multiple backbone conformations,” CAPRI Evaluation Meeting, Toronto, Canada. [presentation]

Chaudhury, S. & Gray, J.J. (2007) “Coupling local changes in activation loop conformation with global conformation changes in insulin receptor kinase,” Biophysical Society 51st Annual Meeting, Baltimore, MD. [poster]

TEACHING EXPERIENCE

Teaching Assistant, Computational and Experimental Design of Biomolecules, Johns Hopkins University, Spring 2008

Teaching Assistant, Computational Protein Structure Prediction and Design, Johns Hopkins University, Spring 2009
