PROTEINS: STRUCTURE, FUNCTION, BIOINFORMATICS
FUNCTION: PREDICTIONS
Protein-binding site prediction based on three-dimensional protein modeling
Mina Oh, Keehyoung Joo, and Jooyoung Lee*
School of Computational Sciences, Korea Institute for Advanced Study, Seoul 130-722, Korea
ABSTRACT
Structural information on a protein can guide our understanding of its function, and ligand binding is one of the major biochemical functions of proteins. We have applied a two-stage template-based ligand binding site prediction method to CASP8 targets and achieved high-quality results, with accuracy/coverage = 70/80 (LEE). Templates are used first for protein structure modeling and then for binding site prediction by structural clustering of ligand-containing templates to the predicted protein model. Remarkably, the results are only a few percent worse than those obtained from native structures, which became available only after the prediction. Prediction was performed without knowing the identity of the ligands; consequently, in many cases the ligand molecules used for prediction differed from the actual ligands, and yet the prediction was quite successful. The current approach can easily be combined with experiments to investigate protein activities in a systematic way. Proteins 2009; 77(Suppl 9):152–156.
© 2009 Wiley-Liss, Inc.
Key words: function prediction; ligand binding site; protein structure modeling; global optimization; CASP.
Additional Supporting Information may be found in the online version of this article.
The authors state no conflict of interest.
Grant sponsor: Korea Science and Engineering Foundation (KOSEF); Grant sponsor: Korea government (MEST); Grant number: 2009-0063610.
*Correspondence to: Jooyoung Lee, School of Computational Sciences, Korea Institute for Advanced Study, Seoul 130-722, Korea. E-mail: [email protected]
Received 17 March 2009; Revised 14 July 2009; Accepted 16 July 2009
Published online 12 August 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.22572

INTRODUCTION
Living cells function through networks of interacting molecules, and proteins constitute the majority of these molecules. In virtually all cases, proteins play the most important roles in the networks. The average number of interacting partners of a protein is estimated to be in the range of 3–10 [1]. Practically all molecular interactions in biological systems can be explained by the binding of ligand molecules to receptor proteins, the binding of antibody proteins to antigens, protein–DNA interactions, or protein–protein interaction by docking [2]. Recently, much attention has been drawn to the atomistic description of protein–ligand interactions, which can provide practical tools for the discovery of new drugs.

CASP is a community-wide experiment, conducted every 2 years since 1994, whose primary purpose is to assess the effectiveness of protein structure modeling methods [3]. Participants predict three-dimensional (3D) structures of proteins in a blind fashion, either as server predictors (predictions are due in 3 days) or as human predictors (due in 3 weeks). In addition, there are several non-3D categories, including function prediction. In CASP6 and CASP7, function prediction was evaluated based on Gene Ontology terms in a free format. Starting with CASP8, the function prediction category took the form of binding site prediction, in which predictors are asked to identify the residues in contact with biologically relevant ligands.

Among the many aspects of protein functionality, we aim to identify the ligand binding residues of a given protein receptor. To date, many computational methods have been developed for protein–ligand binding site prediction, and they can be divided into sequence-based and structure-based methods. Sequence-based methods rely mainly on the sequence conservation of functionally and/or structurally important residues. Structure-based methods can be further divided into three groups: purely geometric methods, energetic methods, and miscellaneous methods that incorporate various information such as homology, surface accessibility, and chemical properties of the target protein [2,4–6].

In this work, we propose a two-stage template-based ligand binding site prediction method, in which templates are used first for protein 3D modeling and then for binding site prediction by structural clustering of ligand-containing templates to the predicted 3D model. Evidently, the effectiveness of this approach is limited by the accuracy (A) of the 3D models used, and the upper bound of the prediction A is set by the A obtained from the corresponding native structure. The prediction quality obtained using LEE 3D models (accuracy/coverage, A/C = 70/80) is only slightly worse than that obtained using native structures (A/C = 74/82). We provide a brief description of the structure-based binding site prediction method and its successful application to CASP8 targets. Successful examples as well as failures are also discussed.
METHODS
The method is divided into two stages: (i) protein 3D modeling and (ii) structural clustering of ligand-containing templates to the predicted 3D model.

For a given amino acid sequence of a target protein, suitable template structures are identified from the PDB library using the metaserver 3D-Jury [7] as well as an in-house fold recognition method called Foldfinder (Joo K. et al., manuscript in preparation). All templates are used for the 3D modeling of the target protein, but templates lacking biologically relevant ligand molecules are not used in the second stage. For the given templates, 3D protein models are generated by a template-based protein modeling method [8], which applies straightforward and thorough global optimization by conformational space annealing [9,10] to three sequential layers of modeling: multiple sequence alignment [11] between a target and templates, 3D chain building [12], and side-chain remodeling (data not shown). The quality of the 3D models generated by this procedure, for both main chains and side chains, has been shown to be excellent, especially for HA-TBM targets, where template selection is straightforward [13,14]. The LEE and LEE-SERVER (LEE-S) predictions of 3D protein models followed essentially the same procedure, although LEE explored more template combinations than LEE-S.

After 3D modeling, the top predicted models (often called model 1 in CASP) are used for binding site prediction. The LEE (FN407) and LEE-S (FN293) binding site predictions differ only in the 3D protein model used (FN407 used models from TS407, and FN293 used models from TS293); the identical procedure is applied to the 3D model. It should be noted that LEE modeled all CASP8 target proteins (except T0472, which was cancelled because of the early release of its native structure), whereas LEE-S modeled all server-only targets plus 11 earlier human targets (up to T0405).

All templates used in 3D modeling are screened so that only templates with biologically important ligands are retained. Heteroatoms belonging to modified amino acid residues, solvent molecules, and nonbiological ions are not considered as ligands [15]. Templates with proper ligands are superimposed onto the model. The template proteins are then removed, leaving only the protein model and the ligands from the templates. Ligands are grouped into clusters by first calculating the center of mass of each ligand and then calculating all pairwise distances between the centers. We varied the effective distance cutoff (1–8 Å) for each target by considering the distances between ligands. The final cutoff value was adjusted manually by human inspection in both LEE and LEE-S.

Each cluster of ligands is considered to correspond to a putative binding site. Clusters are ranked by cluster size (the number of ligands they contain). If two or more clusters have the same size, the structural similarity between the model and the ligand-containing templates is used as the second ranking criterion, on the assumption that templates with higher structural similarity to the model work better for function prediction. TM-score [16] is used to measure the structural similarity.

To identify binding residues, the distances between all ligand atoms in a given cluster and all protein atoms of the model are calculated. Two atoms are considered to be in contact if the distance between them is less than their van der Waals contact distance plus 0.5 Å. Protein residues in contact with any ligand atom are predicted as binding residues, which often led to over-prediction. To address this problem, we varied the contact distance cutoff by visual inspection.
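The contact criterion described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the "van der Waals contact distance" is taken here to be the sum of the two atomic radii, the radii in `VDW_RADIUS` are approximate illustrative values, and the names `in_contact` and `binding_residues` are hypothetical rather than taken from the authors' software.

```python
import math

# Approximate van der Waals radii in angstroms (illustrative values, an assumption).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "ZN": 1.39, "MG": 1.73}
DEFAULT_RADIUS = 1.70  # fallback for elements not listed above (assumption)


def in_contact(atom_a, atom_b, margin=0.5):
    """Each atom is (element, (x, y, z)).

    Two atoms are in contact when their distance is less than the sum of
    their radii plus `margin` (0.5 angstrom in the text).
    """
    (elem_a, pos_a), (elem_b, pos_b) = atom_a, atom_b
    cutoff = (VDW_RADIUS.get(elem_a, DEFAULT_RADIUS)
              + VDW_RADIUS.get(elem_b, DEFAULT_RADIUS)
              + margin)
    return math.dist(pos_a, pos_b) < cutoff


def binding_residues(protein_atoms, ligand_atoms, margin=0.5):
    """protein_atoms: list of (residue_id, element, xyz) for the model.
    ligand_atoms: list of (element, xyz) for all ligands in one cluster.
    Returns the sorted residue ids that touch at least one ligand atom.
    """
    hits = set()
    for res_id, elem, pos in protein_atoms:
        if any(in_contact((elem, pos), lig, margin) for lig in ligand_atoms):
            hits.add(res_id)
    return sorted(hits)
```

Lowering `margin` plays the role of the manually tightened contact cutoff mentioned in the text: fewer residues pass the test, which reduces over-prediction at the possible cost of coverage.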
Human intervention in identifying the residues in contact with ligand atoms often reduced the over-prediction rate, which arises mainly from incorrect side-chain conformations in the model. The identical method was applied to both LEE (FN407) and LEE-S (FN293).
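The ligand clustering and cluster ranking steps of the Methods can be sketched in Python as follows. The text does not state the linkage rule or how template similarity is aggregated within a cluster, so this sketch assumes single-linkage grouping (any two ligands whose centers lie within the cutoff share a cluster) and breaks ties by the best TM-score in the cluster. All function names are hypothetical.

```python
import math


def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)) tuples for one ligand."""
    total = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * xyz[i] for mass, xyz in atoms) / total
                 for i in range(3))


def cluster_ligands(centers, cutoff):
    """Group ligand indices by single linkage on center-to-center distance."""
    parent = list(range(len(centers)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(centers)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def rank_clusters(clusters, tm_scores):
    """Largest cluster first; ties are broken by the highest TM-score
    among the templates that contributed ligands to the cluster.
    tm_scores[i] is the TM-score of the template that supplied ligand i.
    """
    return sorted(clusters,
                  key=lambda c: (len(c), max(tm_scores[i] for i in c)),
                  reverse=True)
```

In this sketch the `cutoff` argument corresponds to the per-target distance cutoff (1–8 Å) that was tuned by inspection; a smaller value splits loosely superimposed ligands into more, smaller clusters.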
RESULTS AND DISCUSSION
Prediction results for LEE and LEE-S are summarized in Table I. Unlike the other CASP8 groups, LEE (FN407) submitted predictions for all 27 targets (all human and server-only targets), whereas LEE-S (FN293) submitted predictions for only 19 targets (all server-only targets plus human targets up to T0405); LEE-S (TS293) generated 3D models for these 19 function prediction targets. This was mainly because we mistakenly assumed that server targets exclude human targets.

The quality of prediction is assessed by two simple measures. Accuracy (A) is the percentage of correctly predicted residues (N) out of the total number of predicted residues (P): A = N/P × 100. Coverage (C) is the percentage of correctly predicted residues (N) out of the total number of annotated residues (T): C = N/T × 100 [17]. The average A/C for LEE and LEE-S is 70/80 and 68/81, respectively. It should be noted that the A of LEE was calculated excluding T0476, for which we predicted no ligand binding because no ligands were found in the templates used. In the next section, we describe examples of good prediction for a metal-binding target and a nonmetal-binding one. Reasons for failures are also discussed.

Table I
Prediction Summary for FN407 (LEE) and FN293 (LEE-S)

Target   LEE A(a)   LEE C(b)   LEE-S A(a)   LEE-S C(b)   Native A(a)   Native C(b)
T0391    90         100        100          78           82            100
T0394    53         100        71           62           50            100
T0396    78         91         83           65           NNS(c)        NNS(c)
T0406    100        100        100          100          100           100
T0407    62         56         DNP(d)       DNP(d)       100           78
T0410    12         100        38           100          19            100
T0422    71         80         65           87           76            87
T0425    100        100        DNP(d)       DNP(d)       75            100
T0426    100        100        60           100          67            67
T0430    59         54         68           71           76            79
T0431    62         95         DNP(d)       DNP(d)       75            95
T0440    100        44         DNP(d)       DNP(d)       100           33
T0444    100        100        80           100          80            100
T0450    67         100        73           100          73            100
T0453    0(e)       0(e)       0(e)         0(e)         100(f)        75(f)
T0457    67         100        DNP(d)       DNP(d)       80            100
T0461    100        100        100          100          75            100
T0470    100        100        100          100          100           100
T0476    -(g)       0          DNP(d)       DNP(d)       -(g)          0
T0477    45         100        56           100          50            100
T0478    0          0          0            0            0             0
T0480    100        100        DNP(d)       DNP(d)       80            100
T0483    80         100        88           94           73            100
T0485    71         89         71           89           79            100
T0487    67         50         DNP(d)       DNP(d)       NNS(c)        NNS(c)
T0490    65         97         65           97           65            100
T0508    70         100        70           100          70            100
AVG.     70         80         68           81           74            82

Ligand column, T0391–T0430: FES PO4 FAD NI ZN ZN ZN FE ADP ZN ZN AMP MG
Ligand column, T0431–T0450: HEM FE ZN FE FE FAD
Ligand column, T0453–T0490: CA CA CA MG ZN MG ZN ADP MG FE ZN ADP MG MG SAM MG FAD
Ligand column, T0508: SAM

Ligand binding residues column (all targets, in table order): 57 59-62 80 82-83 85 15-16 22 28 66 94 203-204 3-4 7-8 10-11 15 44 48-49 52 75 77-79 81-82 84-85 87 90 95 98 48 127 131 44 46 113 214 122 157 51 76 216 206 211 386 78-79 85 86-87 127-132 259 288-289 292 11 25 77 117 119 142 49 50-51 54 57 68 70 116 164-165 166-167 168 170 212-213 215 245-246 211 275 277 306 310 84 112 116 268-269 272-273 276 343 419-420 425 427-429 432-433 436 471 6 8 93 258 123 181 14 40 260 135 198 232 235 24 27-29 47-49 54-57 60-61 63 65 191 193 228-231 235 252 254 292 338-340 372-373 374-375 76 77 78 83 29 83 106 158 75 111 114 29 58 59 122 4 7 47 50 49 51 53 56 75-80 30 121 248 252 117 154 158 21 24 39 42 32-33 40 53 55 92 109-111 114 116 159-160 162 172-173 8 16 28 50-53 72-74 77 99-101 117-119 122-123 478 546 548 660 10 11 13-15 33-35 43-44 46-48 50 52 171-173 204-206 208 234 272 315-316 348-349 350-354 22 46-52 67-69 82-85 111-113 151

Target numbers, ligand codes, and residue numbers are according to the CASP8 data. Underlined residues represent the residues correctly predicted by LEE (those by LEE-S are in the Supporting Information Table S1).
(a) Accuracy (%).
(b) Coverage (%).
(c) NNS, no native structure is available. The native structure of T0396 is not available yet. For T0487, the ligand binding portion of the native structure is not available.
(d) DNP, did not predict.
(e) When selecting the top model, we did not follow the model selection criterion (number of ligands followed by structural similarity between ligand-containing templates and the 3D protein model).
(f) The score of A/C = 100/75 for the native structure comes from direct application of the criterion.
(g) A for LEE is calculated excluding T0476 (see Results and Discussion).

What went right
T0425 and T0457
Binding residues of T0425 are predicted by LEE with perfect A and C; a Zn ion is bound to H11, E25, and H77 [see Fig. 1(a)]. Among the many templates identified by 3D-Jury [7] and Foldfinder (Joo K. et al., manuscript in preparation), two templates (PDB codes: 1jwq and 1xov) with an average SeqID of 22% are used for 3D modeling of the top predicted model (model 1 in CASP). The backbone and side-chain accuracy of the model is represented by GDT-HA = 50.84, GDT-TS = 70.11, χ1 = 63.8, and χ1+2 = 45.7 using 30°. To predict binding sites, we collected a total of six ligand-containing templates. Using a distance cutoff of 3.0 Å, the ligands are clustered into three groups, and the biggest cluster contained four Zn ions and one Ca ion. The binding mode of the ligand to the predicted protein structure is highly native-like, as shown in Figure 1(a).

For another metal-binding target, T0457, LEE predicted that two Mn ions bind to the protein. However, the experimental structure of T0457 was found to contain a single ion, which is Mg rather than Mn, and this led to prediction results of A/C = 67/100.
T0483
Two Mg ions and an ADP are bound to T0483 [see Fig. 1(b)]. Nonmetal ligands such as ADP are larger and have more complicated molecular structures than simple metal ligands. For this reason, binding residues for nonmetal ligands are often over-predicted, and high-A prediction is more difficult. For T0483, after 3D modeling, a total of 15 ligand-containing templates were used for function prediction; the ligands were mostly ATP or ATP-like molecules. After superposition, the ligands were loosely clustered. To reduce false positives from possibly misaligned large ligands, a relatively short distance cutoff of 1.0 Å was used for clustering, which led to the 20 residues predicted by LEE. All 16 residues in contact with ligands are correctly predicted; thus A/C = 80/100.

Figure 2. The native (yellow) and model (cyan) structures of T0453 are shown. T0453 binds three Ca ions at D76, M77, D78, and D83, as shown by light purple spheres. The clusters for model 1 (red dots) and model 2 (yellow dots) are also shown. As described in the text, two errors created the model 1 cluster. Residues correctly predicted (green) and missed (blue) by model 2 are shown in stick representation.
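The A/C arithmetic quoted for T0483 (20 predicted residues, 16 annotated residues, all 16 recovered) follows directly from the definitions A = N/P × 100 and C = N/T × 100. A minimal Python sketch is given below; the function name `accuracy_coverage` and the placeholder residue numbers are hypothetical, and returning `None` for an empty prediction is an assumption that mirrors the exclusion of T0476 from the average A.

```python
def accuracy_coverage(predicted, annotated):
    """Return (A, C) in percent for sets of residue identifiers.

    A = N / P * 100 and C = N / T * 100, where N is the number of
    correctly predicted residues, P the number of predicted residues,
    and T the number of annotated residues. A measure whose denominator
    is zero is reported as None (undefined).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    accuracy = 100.0 * correct / len(predicted) if predicted else None
    coverage = 100.0 * correct / len(annotated) if annotated else None
    return accuracy, coverage


# T0483-like case with placeholder residue numbers: 20 predicted residues,
# 16 annotated residues, every annotated residue among the predictions.
print(accuracy_coverage(range(1, 21), range(1, 17)))  # (80.0, 100.0)
```

The four extra predicted residues lower A to 80 without affecting C, which is the over-prediction pattern discussed for large nonmetal ligands.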
Figure 1. Examples of successful binding site prediction are shown. (a) The predicted model structure (purple) of T0425 is superimposed onto the native (yellow) structure, where the Zn ligand (blue sphere) and binding residues (sticks) are shown. The binding residues H11, E25, and H77 are predicted with perfect scores of A/C = 100/100. (b) The native (yellow) and model (magenta) structures of T0483 are shown. Native ligand atoms (2Mg-ADP) are shown as spheres, and predicted binding residues are shown as sticks (cyan for correct predictions, blue for incorrect ones).

What went wrong
T0453
The failure for this target is interesting to review because the Sternberg group* (FN202) generated a good binding residue prediction (A/C = 67/100) using a LEE-S model. For LEE and LEE-S, one critical error and a subsequent error of judgment are responsible. First, heteroatoms from modified amino acids (N-dimethyl lysine) were unintentionally left in, which led to a wrongly identified cluster. After clustering, two clusters with more than one ligand were found: (E66, H90, H91), submitted as the top model (A/C = 0/0), and (V61, D76, M77, D78), submitted as the second model (A/C = 75/75). Second, we were misled by the charged residues of the first cluster, although the second cluster contained more ligands. The score of A/C = 100/75 obtained using the native structure (without removing the erroneous heteroatoms of the N-dimethyl lysines, for a fair comparison) represents the direct application of the model selection criterion described in the Methods section, without human intervention. The two predicted clusters are shown in Figure 2.
*Function evaluation paper by assessors in this issue.
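The first T0453 error, modified-residue heteroatoms being treated as ligands, is the kind of problem a simple screening step can catch. The Python sketch below is only an illustration under stated assumptions: it filters hetero groups by three-letter PDB residue code, the exclusion sets are short illustrative examples rather than the list actually used in this work, and the choice of which ions count as nonbiological is left to the caller. The name `select_ligands` is hypothetical.

```python
# Illustrative exclusion sets (assumptions, not the authors' list).
SOLVENT = {"HOH", "DOD"}            # water and heavy water
MODIFIED_RESIDUES = {"MSE", "MLY"}  # selenomethionine, N-dimethyl-lysine


def select_ligands(hetero_groups, extra_excluded=()):
    """Keep only hetero groups that may be biologically relevant ligands.

    hetero_groups: iterable of (residue_name, group_id) pairs taken from
    the HETATM records of a template.
    extra_excluded: additional residue names to drop, for example ions
    judged nonbiological for the target at hand.
    """
    excluded = SOLVENT | MODIFIED_RESIDUES | {n.upper() for n in extra_excluded}
    return [(name, gid) for name, gid in hetero_groups
            if name.strip().upper() not in excluded]


groups = [("ZN", "A301"), ("HOH", "A401"), ("MLY", "A66"), ("ADP", "A302")]
print(select_ligands(groups))  # [('ZN', 'A301'), ('ADP', 'A302')]
```

Running such a filter before superposition and clustering keeps modified residues such as N-dimethyl lysine from seeding a spurious cluster.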
T0478
The failure for T0478 was caused by the wrong identification of relevant templates, which led to poor 3D modeling (GDT-HA = 10.4 and GDT-TS = 18.5 for LEE). None of the templates used for 3D modeling and function prediction was relevant to the native structure of T0478.

CONCLUSIONS
A two-stage template-based ligand binding site prediction method is applied to CASP8 targets. Good binding site prediction is accomplished without prior knowledge of the identity of the ligands, which implies that the binding pockets of proteins can be quite robust to variation of the ligand molecules. In many cases, the ligand molecules used for binding site prediction were different from the actual ligands bound in the experimental structure, and yet the prediction was quite successful. One critical issue for this type of exercise is that one cannot exclude the possibility that a protein investigated here binds other ligands at a separate pocket not identified here, which should be addressed by future studies.

It is encouraging that the A/C gap between predictions based on 3D models (70/80) and predictions based on native structures (74/82) is small (see Table I). This indicates that state-of-the-art protein models can be successfully used to predict ligand binding sites when experimental protein structures are not available. Furthermore, the current approach can easily be combined with mutagenesis studies of proteins to control the binding affinity of proteins to ligands.

It should be noted that the methods for binding site prediction (LEE and LEE-S) were developed during the CASP8 season; consequently, the LEE-S prediction results were generated not by an existing server but by a series of automated procedures with a small amount of human intervention. In hindsight, LEE-S for binding site prediction could have been fully automated, and the prediction results obtained in that way are more or less similar (unpublished).
REFERENCES
1. Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM. Protein interaction networks from yeast to human. Curr Opin Struct Biol 2004;14:292–299.
2. Huang B, Schroeder M. LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol 2006;6:19.
3. Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005;15:285–289.
4. Laurie ATR, Jackson RM. Q-Site Finder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005;21:1908–1916.
5. Morita M, Nakamura S, Shimizu K. Highly accurate method for ligand-binding site prediction in unbound state (apo) protein structures. Proteins 2008;73:468–479.
6. Brylinski M, Skolnick J. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci USA 2008;105:129–134.
7. Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003;19:1015–1018.
8. Joo K, Lee J, Lee S, Seo J, Lee S, Lee J. High accuracy template based modeling by global optimization. Proteins 2007;69:83–89.
9. Lee J, Scheraga HA, Rackovsky S. New optimization method for conformational energy calculations on polypeptides: conformational space annealing. J Comput Chem 1997;18:1222–1232.
10. Lee J, Lee I, Lee J. Unbiased global optimization of Lennard Jones clusters for N ≤ 201 by conformational space annealing method. Phys Rev Lett 2003;91:080201.
11. Joo K, Lee J, Kim I, Lee S, Lee J. Multiple sequence alignment by conformational space annealing. Biophys J 2008;95:4813–4819.
12. Joo K, Lee J, Seo J, Lee K, Kim B, Lee J. All-atom chain-building by optimizing MODELLER energy function using conformational space annealing. Proteins 2009;75:1010–1023.
13. Read RJ, Chavali G. Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 2007;69:27–37.
14. Kopp J, Bordoli L, Battey JND, Kiefer F, Schwede T. Assessment of CASP7 predictions for template-based modeling targets. Proteins 2007;69:38–56.
15. Lopez G, Valencia A, Tress M. FireDB--a database of functionally important residues from proteins of known structure. Nucleic Acids Res 2007;35:D219–D223.
16. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57:702–710.
17. Lopez G, Rojas A, Tress M, Valencia A. Assessment of predictions submitted for the CASP7 function prediction category. Proteins 2007;69:165–174.