Critical Assessment of the Methods and the Features Used for Hot Spot Residue Prediction at Protein-Protein Interfaces
Selin Karagulle, Ozlem Keskin, Attila Gursoy
Center for Computational Biology and Bioinformatics (CCBB), Koç University, Istanbul, Turkey
{skaragulle, okeskin, agursoy}@ku.edu.tr
BACKGROUND
Proteins interact through interfaces. A few residues at the interface contribute significantly to the binding free energy; these residues are called hot spots.
METHODOLOGY
Datasets
Interface residues of complexes with no 3D structure, interface residues of single chains, and interface residues of DNA complexes are eliminated.
Interface residues whose mutation leads to a change in binding free energy greater than or equal to 2.0 kcal/mol, and the ones whose interaction strengths are labeled as ‘strong’, are considered hot spots.
In total, 1206 residues from 66 complexes were collected from 12 different sources.
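The labeling rule above can be sketched in a few lines. This is a minimal sketch, assuming each residue record carries either an observed change in binding free energy or an interaction strength label; the class `InterfaceResidue`, its field names, and the example identifiers are hypothetical and not taken from the data sources.

```python
from dataclasses import dataclass
from typing import Optional

DDG_CUTOFF = 2.0  # kcal/mol, the threshold stated in the text


@dataclass
class InterfaceResidue:
    # Hypothetical record layout; the source does not define one.
    complex_id: str
    residue_id: str
    ddg: Optional[float] = None     # observed change in binding free energy
    strength: Optional[str] = None  # interaction strength label, e.g. "strong"


def is_hot_spot(res: InterfaceResidue) -> bool:
    """Apply the energy criterion when a value exists, else the strength label."""
    if res.ddg is not None:
        return res.ddg >= DDG_CUTOFF
    return res.strength == "strong"


# Made-up example records, for illustration only.
residues = [
    InterfaceResidue("cplx1", "A:45", ddg=2.3),
    InterfaceResidue("cplx1", "A:46", ddg=0.4),
    InterfaceResidue("cplx2", "B:10", strength="strong"),
    InterfaceResidue("cplx2", "B:11", strength="weak"),
]
labels = [is_hot_spot(r) for r in residues]
```

A record with neither field set is treated as a non-hot spot here; that default is a choice of this sketch.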
Alanine scanning is used for experimental determination of hot spots, but it has a high cost and is not feasible at large scale.
Training set
Interface residues whose observed change in binding free energy is ≥ 2.0 kcal/mol are considered hot spots.
Testing set
Residues with ‘strong’ interaction strengths are labeled as hot spots, and all others are tagged as non-hot spots.
Training set vs. Testing set
OBJECTIVES
Recent studies have used many different features for hot spot prediction. These features were tested on different data sets with various machine learning methods, which makes it difficult to compare the reported prediction accuracies. We aim to provide a single data and feature set for hot spot prediction using machine learning methods, to generate a nonredundant data set, and to design a gold standard database, HOTBASE.
Features
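Number of hot spot (H) and non-hot spot (NH) residues in the training and test sets (values from the figure: 148, 268, 134, 656).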
Accessible Surface Area
Pair Potentials
Knowledge-based, solvent-mediated inter-residue potentials extracted from protein interfaces are used in this work.
Evolutionary Conservation Score
Residue conservation scores are computed by the Rate4Site (R4S) algorithm and are obtained from the ConSurf Server Database.
RESULTS
We implemented a random forest model on our nonredundant data set using the following features: residue category (six classes based on the dipoles and volumes of the side chains), secondary structure, atom contacts and atom contact areas, residue contacts, physicochemical features, ASA, and depth index. There are 57 features for every residue, as in the work of Wang et al. [2].
BeAtMuSiC
BeAtMuSiC [1] evaluates the change in binding affinity between proteins (or protein chains) caused by single-site mutations in their sequence. The predictions are based on the structure of the protein-protein complex.
Figure: confusion-matrix shares (TP, TN, FP, FN) and performance metrics (Accuracy, Recall, Precision, F-measure). Values from the figure: 6%, 17%, 6%, 71%; 0.771, 0.520, 0.358, 0.273.
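The four performance metrics reported in the figures follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from this work.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, recall, precision and F-measure from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # fraction of true hot spots found
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # fraction of predicted hot spots that are real
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=30, tn=50, fp=10, fn=10)
```

Because hot spots are the minority class, accuracy alone can look high even when recall and precision are low, which is why all four values are reported together.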
#    Features                     MeanDecreaseAccuracy
1    mass_target                  6.512
2    atom_contact_areas_target    6.136
3    polarity_target              5.764
4    isopoint_target              5.666
5    residue_contacts_target      5.246
6    polarizability_target        5.017
7    rel_dpx_target               4.949
8    ave_dpx_target               4.683
9    isopoint_mirror              4.674
10   rel_ASA_target               4.523
11   rel_s_ch_dpx_target          4.522
12   atom_contacts_target         4.297
13   rel_s_ch_ASA_target          4.262
14   rel_ASA_intra                4.126
15   polarity_mirror              3.975
16   hydrophilicity_target        3.945
17   s_ch_ave_dpx_target          3.916
18   mass_intra                   3.777
19   sd_s_ch_dpx_target           3.638
20   class_mirror                 3.209
21   hydrophobicity_intra         3.077
22   rel_s_ch_ASA_mirror          3.039
23   hydrophobicity_target        3.037
Figure: Relative ASA in complex state and Pair potentials, training set and testing set. Confusion-matrix shares (TP, TN, FP, FN) and metrics (Recall, Precision, F-measure). Values from the figure: 0.896, 0.632, 0.477, 0.530, 0.555, 0.514; 8%, 17%, 18%.
Physicochemical Features
The physicochemical characteristics of an amino acid are hydrophobicity, hydrophilicity, polarity, polarizability, propensities, isoelectric point, mass, and average accessible surface area.
Support Vector Machines
Support Vector Machines (SVMs) are a class of supervised learning algorithms that can learn a linear decision boundary to discriminate different classes with maximum margin.
Figure values: 0.680, 9%.
Residue Contacts
Two residues are in contact if at least one pair of atoms, one from each residue, is in contact.
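Residue contacts can be derived from a list of atom contacts as sketched below. The sketch assumes atom contacts are already available as pairs of (residue identifier, atom name) tuples; that input layout, the function name, and the example identifiers are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Atom = Tuple[str, str]  # (residue identifier, atom name); hypothetical layout


def residue_contacts(atom_contacts: Iterable[Tuple[Atom, Atom]]) -> Set[FrozenSet[str]]:
    """Return unordered residue pairs that share at least one atom contact."""
    contacts: Set[FrozenSet[str]] = set()
    for (res_a, _), (res_b, _) in atom_contacts:
        if res_a != res_b:  # skip contacts within the same residue
            contacts.add(frozenset((res_a, res_b)))
    return contacts


# Made-up atom contacts, for illustration only.
example = [
    (("A:45", "CB"), ("B:10", "OG")),
    (("A:45", "CG"), ("B:10", "CB")),  # same residue pair, counted once
    (("A:45", "N"), ("A:45", "CA")),   # intra-residue, ignored
    (("A:46", "O"), ("B:12", "N")),
]
pairs = residue_contacts(example)
```

Using a set of frozensets makes the result independent of the order in which atoms and residues are listed.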
Machine Learning Algorithms
Figure panel: Testing Set (value from the figure: 0.743).
Atom Contacts and Atom Contact Areas
The contact between two atoms (atom_contact) is defined by the CSU program.
Depth Index
The depth of an atom is its distance from the closest solvent-accessible atom.
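The depth definition can be illustrated with a brute-force computation. This is a minimal sketch, assuming atom coordinates and a per-atom solvent accessibility flag are already available; the function name and the toy coordinates are hypothetical, and how accessibility is decided is outside the sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def atom_depths(coords: Sequence[Point], accessible: Sequence[bool]) -> List[float]:
    """Distance from each atom to its closest solvent-accessible atom."""
    surface = [c for c, acc in zip(coords, accessible) if acc]
    if not surface:
        raise ValueError("at least one solvent-accessible atom is required")
    # An accessible atom is its own closest accessible atom, so its depth is 0.
    return [min(math.dist(c, s) for s in surface) for c in coords]


# Toy coordinates (in angstroms), for illustration only.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
accessible = [True, False, False]
depths = atom_depths(coords, accessible)
```

The all-pairs search is quadratic in the number of atoms; a spatial index would be the usual choice for whole complexes.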
We used the model constructed in [3], which labels a residue as “Hotspot” or “NonHotspot” from its relative ASA in the complex state and its pair potential, using a cutoff of 18.0.
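A rule of this kind can be sketched as a two-threshold test. The sketch assumes the rule combines an upper bound on the relative ASA in the complex state with a lower bound on the pair potential; the direction of both inequalities, the parameter names, and the `asa_max` value used in the example are assumptions, and only the value 18.0 appears in the text above.

```python
def classify_residue(rel_asa_complex: float, pair_potential: float,
                     asa_max: float, potential_min: float) -> str:
    """Label a residue with a two-threshold rule (thresholds supplied by the caller)."""
    if rel_asa_complex <= asa_max and pair_potential >= potential_min:
        return "Hotspot"
    return "NonHotspot"


# asa_max=20.0 is an assumed placeholder; 18.0 is read here as the pair-potential cutoff.
buried_strong = classify_residue(10.0, 25.0, asa_max=20.0, potential_min=18.0)
exposed_strong = classify_residue(35.0, 25.0, asa_max=20.0, potential_min=18.0)
buried_weak = classify_residue(10.0, 5.0, asa_max=20.0, potential_min=18.0)
```

Keeping the thresholds as explicit arguments avoids hard-coding values that the text does not fully specify.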
Category of Residues and Secondary Structure
Residues are grouped into six classes based on the dipoles and volumes of their side chains. There are three types of secondary structure: helix, strand, and loop.
Figure labels and values: Accuracy; 0.262, 0.166, 0.609.
MeanDecreaseAccuracy: The average decrease of classification accuracy when the values of a particular feature are randomly permuted on the out-of-bag samples.
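The permutation idea behind MeanDecreaseAccuracy can be sketched as follows. Note the simplification: this sketch permutes a feature on one fixed evaluation set and averages over repeats, rather than using the per-tree out-of-bag samples of a random forest. The function names and the toy data are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def accuracy(model: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Fraction of rows for which the model predicts the given label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)


def permutation_importance(model: Callable[[Sequence[float]], int],
                           X: Sequence[Sequence[float]], y: Sequence[int],
                           feature: int, n_repeats: int = 20, seed: int = 0) -> float:
    """Average drop in accuracy when one feature column is randomly permuted."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops: List[float] = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)                      # break the feature/label link
        X_perm = [list(row) for row in X]
        for row, value in zip(X_perm, column):
            row[feature] = value
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats


def toy_model(row: Sequence[float]) -> int:
    # Toy classifier that only looks at feature 0.
    return 1 if row[0] > 0.5 else 0


X = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.6, 0.9],
     [0.4, 0.2], [0.3, 0.8], [0.2, 0.4], [0.1, 0.6]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
imp_used = permutation_importance(toy_model, X, y, feature=0)
imp_unused = permutation_importance(toy_model, X, y, feature=1)
```

A feature the model never uses gets an importance of exactly zero, since permuting it cannot change any prediction.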
Sequence Profile
The sequence profile is obtained by a PSI-BLAST search against the NCBI non-redundant database. The BLOSUM62 substitution matrix and an E-value threshold of 0.001 are chosen as parameters.
Figure labels and values: Training Set; 0.840, 0.668.
Sequence Entropy
The sequence entropy value for each residue is obtained from the HSSP database.
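For intuition, the entropy of one alignment column can be computed as a Shannon entropy over residue frequencies. This is only an illustration of the quantity: the values used in this work are read from the HSSP database, and the gap handling and the use of the natural logarithm below are assumptions of the sketch.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (natural log) of the residues in one alignment column."""
    residues = [aa for aa in column if aa not in "-."]  # assumption: gaps are ignored
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


conserved = column_entropy("AAAA")  # a single residue type
mixed = column_entropy("AAGG")      # two residue types, equally frequent
```

A fully conserved position has zero entropy, and the value grows as more residue types appear with similar frequencies.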
Figure values: 0.533, 0.411, 0.335, 18%.
Random Forest Model
Random forest (RF) is an ensemble classification algorithm that employs a collection of decision trees to reduce the output variance of individual trees, and thus improves the stability and accuracy of classification.
Figure labels and values: Accuracy, Recall, Precision, F-measure (two panels); 15%, 65%, 50%.
CONCLUSION
Results of all features
We implemented a support vector machine algorithm on the dataset of Chen et al. [4] using hydrophobicity, hydrophilicity, polarity, polarizability, propensities, average accessible surface area, sequence profile, evolutionary conservation score, and sequence entropy.
We implemented a support vector machine algorithm using all features on our nonredundant dataset.
Figure: Ab+ and Ab-, 10-fold cross validation and evaluation on test set (Accuracy, Recall, Precision). Values from the figure: 0.573, 0.604, 0.575, 0.6, 0.593, 0.560, 0.424, 0.523, 0.537, 0.514, 0.576, 0.519, 0.45, 0.402, 0.309, 0.235, 0.152.
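The 10-fold cross validation protocol named in the figure can be sketched as an index split. This is a generic sketch, not the exact procedure used here: the function name, the sample count, and the seed are hypothetical, and the split is not stratified by class.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, test) index splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Hypothetical sample count, for illustration only.
splits = k_fold_indices(25, k=10)
```

Each sample appears in exactly one test fold, so every residue is predicted once by a model that never saw it during training. With few hot spots per fold, a stratified split may be preferable.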
Once the database is completed, it will be published at http://prism.ccbb.ku.edu.tr/hotbase/
Figure labels and values: Evaluation on test set; 0.674, 0.488, 0.481, 0.483, 0.848, 0.617, 0.569.
Our database includes values for many features that are used by machine learning methods and give highly accurate results.
We have a nonredundant dataset.
Figure: SVM results of all features, 10-fold cross validation (Accuracy, Recall, F-measure).
REFERENCES
1. Dehouck Y, Kwasigroch JM, Rooman M, Gilis D: BeAtMuSiC: Prediction of changes in protein-protein binding affinity upon mutations. Nucleic Acids Research 2013. doi: 10.1093/nar/gkt450
2. Wang L, Liu Z-P, Zhang X-S, Chen L: Prediction of hot spots in protein interfaces using a random forest model with hybrid features. Protein Eng. Des. Sel. 2012, 25:119–126.
3. Tuncbag N, Keskin O, Gursoy A: HotPoint: hot spot prediction server for protein interfaces. Nucleic Acids Research 2010, 38:W402–W406.
4. Chen R, Chen W, Yang S, Wu D, Wang Y, Tian Y, Shi Y: Rigorous assessment and integration of the sequence and structure based features to predict hot spots. BMC Bioinformatics 2011, 12:311.
Thanks to the Scientific and Technological Research Council of Turkey (TUBITAK) for their funding.