Machine-Learning Methods to Predict Protein Interaction Sites in Folded Proteins
Castrense Savojardo1,2, Piero Fariselli1,2, Damiano Piovesan1, Pier Luigi Martelli1, and Rita Casadio1
1 Biocomputing Group, University of Bologna, via Irnerio 42, 40126 Bologna, Italy
2 Department of Computer Science, University of Bologna, via Mura Anteo Zamboni 7, 40127 Bologna, Italy
{savojard,piero,piovesan,gigi,casadio}@biocomp.unibo.it
http://www.biocomp.unibo.it
Abstract. A reliable predictor of protein-protein interaction sites is necessary to investigate and model protein functional interaction networks. Hidden Markov Support Vector Machines (HM-SVM) have been shown to be among the best performing methods on this task. Furthermore, it has been noted that the performance of a predictor improves when its input takes advantage of the difference between observed and predicted residue solvent accessibility. In this paper, for the first time, we combine these elements and present ISPRED2, a new HM-SVM-based method that surpasses state-of-the-art performance (Q2=0.71 and correlation=0.43). ISPRED2 consists of a set of Python scripts that integrate different third-party software packages to obtain the final prediction.
Keywords: Hidden Markov Support Vector Machines, Prediction of Interaction Sites, Protein-Protein Interaction, Machine Learning, Evolutionary Information, Solvent Accessibility.
1 Introduction
Identifying protein-protein interaction sites on a protein surface is a fundamental step in investigating protein functions and interactions. In recent years, several computational tools have been introduced to address this problem, and they have been described in several reviews [33,9,11]. The different methods are based on Neural Networks [32,12,13,22,6,23,24], Support Vector Machines [17,4,3,26,7,31,10,29,19,21], Random Forests [5,27], Conditional Random Fields [18], Hidden Markov Support Vector Machines [20], and meta-predictors or ensemble methods [25,8]. A direct comparison of the different methods is difficult, since the various authors adopted different data sets, different performance measures, and different definitions of interaction sites. Nonetheless, previous efforts allow us to identify some relevant features that a predictor of protein interaction sites starting from the protein 3D structure should include [33,11]. First of all, the input has to incorporate evolutionary information, preferably in the form of a Position Specific Scoring Matrix (computed with PSI-BLAST). Moreover, solvent accessibility (or its variants, such as protrusion index or residue depth) plays a relevant role [11]. A further improvement was achieved by exploiting the information derived from the difference between the observed and predicted residue accessibility [24]. It was also noted that the relation between labels of neighboring residues is useful for the prediction of protein binding sites [20]. For this reason, HM-SVMs are the best performing method when the input consists of spatial neighbors (encoded as sequence profile vectors) and computed residue accessibility [20]. Finally, the prediction performance is increased by averaging over the predictions of spatial neighbors [13]. Here we implement ISPRED2, a new method that for the first time incorporates all the above-mentioned predictive strategies in order to achieve state-of-the-art performance.
E. Biganzoli et al. (Eds.): CIBB 2011, LNBI 7548, pp. 127–135, 2012. © Springer-Verlag Berlin Heidelberg 2012
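The averaging of predictions over spatial neighbors mentioned in the introduction can be illustrated with a minimal sketch. It is not the ISPRED2 code: the function name, the use of one representative coordinate per residue, and the choice to count the residue itself among its nearest neighbors are all assumptions made for illustration.

```python
import math


def smooth_by_spatial_neighbors(coords, scores, window=11):
    """Average each residue's score over its `window` spatially nearest residues.

    coords : list of (x, y, z) tuples, one representative atom per residue
             (hypothetical input; e.g. C-alpha coordinates)
    scores : list of raw per-residue prediction scores
    window : number of residues averaged, the residue itself included
             (an assumption; the paper only states a window of 11 neighbors)
    """
    n = len(scores)
    if len(coords) != n:
        raise ValueError("coords and scores must have the same length")
    k = min(window, n)  # small proteins: use every residue available
    smoothed = []
    for i in range(n):
        # Rank all residues by Euclidean distance to residue i.
        # sorted() is stable, so ties are broken by residue index.
        order = sorted(range(n), key=lambda j: math.dist(coords[i], coords[j]))
        nearest = order[:k]
        smoothed.append(sum(scores[j] for j in nearest) / k)
    return smoothed


# Toy example: four residues placed on a line, window of 2.
toy_coords = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (10, 0, 0)]
toy_scores = [1.0, 0.0, 1.0, 1.0]
print(smooth_by_spatial_neighbors(toy_coords, toy_scores, window=2))
```

The quadratic neighbor search is adequate for single chains; a spatial index would be a natural replacement for large complexes.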
2 Dataset
For training and testing our method we adopted the data set selected by Liu and co-workers [20], which has a low level of sequence identity. Starting from the protein structures included in the PDB, multi-chain, non-NMR structures and structures with resolution < 4 Å were retained. Protein chains of 40%.
7 The Effect of the Definition of Interaction Patches
In this paper we used a definition of interaction site based on the CA-CA distance (see Section 3). However, other authors adopted a different definition of interacting residues, based on the change in the accessible surface area (ASA) of exposed residues. In particular, Jones and Thornton (1997) [15] defined as interacting those residues whose side chains have an ASA that decreases by more than 1 Å² upon complexation; the ASA was calculated with the program ACCESS [34]. Similarly, Liu et al. (2009) [20] used the same measure to define interface residues, but they computed the ASA with the DSSP program [16]. In order to study the effect of the different definitions of interacting residues, we tested ISPRED2 in cross-validation using the three definitions described above (Table 3). Table 3 shows that, at least for the data set at hand, adopting any one of the three definitions makes no significant difference. This means that the three interaction site definitions capture almost the same kind of physical features, even though there are small variations in the number and type of residues involved.

Table 1. Cross-validation performance of the different methods

Method    Encoding        Q2(%)  Sp(%)  Sn(%)  MCC(%)
NN        (Profile+RSA)   64     55     77     31
NN        (PSSM+RSA)      66     62     84     35
NN        (PSSM+dSA)      69     65     82     39
NN        *               69     65     85     40
HM-SVM    (Profile+RSA)   68     70     65     36
HM-SVM    (PSSM+RSA)      70     72     66     40
HM-SVM    (PSSM+dSA)      71     73     67     42
ISPRED2   *               71     73     68     43

* = prediction averaged over a window of 11 neighbors. NN = neural network as described in [12]. HM-SVM = Hidden Markov Support Vector Machine. Q2 = overall accuracy. Sp = interaction site specificity. Sn = interaction site sensitivity. MCC = Matthews' correlation coefficient. Profile = sequence profile. PSSM = position specific scoring matrix as computed by BLAST.

Table 2. ISPRED2 performance as a function of the number of spatial neighbors in the input window

Window  Q2(%)  Sp(%)  Sn(%)  MCC(%)
5       69.0   71.5   63.8   38.2
7       69.8   72.1   65.0   39.7
11      70.9   73.0   67.0   42.0
13      70.7   72.7   66.9   41.5
15      71.3   73.2   67.8   42.7
17      71.4   73.2   68.1   43.0
19      71.2   73.3   67.9   43.0

Table 3. Comparison of PredPPI (HM-SVM) performance as a function of different definitions of interaction sites

Interaction site definition          Q2(%)  Sp(%)  Sn(%)  MCC(%)
Liu et al. 2009 [20]                 71     73     67     43
Jones and Thornton 1997 [15]         71     73     67     42
Fariselli et al. 2002 (this paper)   71     73     68     43
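The ASA-based definition of interacting residues discussed in this section can be sketched as follows. This is only an illustration: the function name is hypothetical, and the per-residue ASA values are assumed to have been computed beforehand by an external program (such as ACCESS or DSSP), once for the isolated chain and once for the chain within the complex.

```python
def label_interface_by_asa(asa_unbound, asa_bound, min_loss=1.0):
    """Flag residues whose accessible surface area drops upon complexation.

    asa_unbound : per-residue ASA (in square angstroms) of the isolated chain
    asa_bound   : per-residue ASA of the same chain inside the complex
    min_loss    : a residue is labeled as interacting when its ASA decreases
                  by more than this amount (1 square angstrom in the
                  definition quoted in the text)
    """
    if len(asa_unbound) != len(asa_bound):
        raise ValueError("both ASA lists must cover the same residues")
    return [free - bound > min_loss for free, bound in zip(asa_unbound, asa_bound)]


# Toy values (not real data): residues 1 and 4 lose more than 1 square angstrom.
unbound = [50.0, 20.0, 0.0, 35.5]
bound = [10.0, 19.5, 0.0, 34.0]
print(label_interface_by_asa(unbound, bound))  # [True, False, False, True]
```

Whether side-chain or whole-residue ASA is passed in is left to the caller; the definition by Jones and Thornton quoted above refers to side chains.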
8 Comparison with Other Methods
Although a performance comparison between our method and previously developed methods is rather difficult owing to the different data sets (with the exception of Liu et al., 2009 [20]), in Table 4 we report the accuracy of ISPRED2 with respect to other recently developed machine-learning approaches. Table 4 shows that ISPRED2 improves over the recently introduced predictors of interaction sites. This achievement is due to the fact that ISPRED2 exploits relevant input features (the difference between predicted and observed solvent accessibility, and evolutionary information), models the chain dependency among consecutive accessible residues (by means of the HM-SVM), and takes advantage of a smoothing approach (by averaging the predictions of residues that are in spatial proximity).

Table 4. Comparison with other recent methods

Method                          Q2(%)  Sp(%)  Sn(%)  MCC(%)  F1(%)
Wang et al. (2006) [31]         NA     65     69     28      67
Nguyen-Rajapakse (2006) [21]    NA     93     36     33      52
Deng et al. (2009) [8]          NA     63     77     35      69
Liu et al. (2009) [20] *        69     54     59     33      56
ISPRED2 *                       71     73     68     43      71

* = On the same data set. The reported performances are taken from the corresponding papers (with the exception of ISPRED2). NA = not available.
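The measures reported in the tables can be computed from the counts of a binary confusion matrix. The sketch below assumes the standard definitions; in particular, it assumes that Sp denotes TP/(TP+FP) and that F1 is the harmonic mean of Sp and Sn, since the formulas are not given in this text. Under that assumption, the Sp and Sn values of the first row of Table 4 reproduce the reported F1. Function names are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return Q2, Sp, Sn, MCC and F1 from confusion-matrix counts.

    Assumed definitions (not stated in the text):
      Q2  = (TP + TN) / N
      Sp  = TP / (TP + FP)
      Sn  = TP / (TP + FN)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
      F1  = 2 * Sp * Sn / (Sp + Sn)
    """
    total = tp + fp + tn + fn
    q2 = (tp + tn) / total
    sp = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = f1_from_sp_sn(sp, sn)
    return {"Q2": q2, "Sp": sp, "Sn": sn, "MCC": mcc, "F1": f1}


def f1_from_sp_sn(sp, sn):
    """Harmonic mean of Sp and Sn (assumed definition of F1)."""
    return 2 * sp * sn / (sp + sn) if sp + sn else 0.0


# Consistency check against Table 4, first row: Sp=65, Sn=69, reported F1=67.
print(round(f1_from_sp_sn(65, 69)))  # 67
```

Reporting MCC alongside Q2 is useful here because interface residues are typically a minority class, so overall accuracy alone can be misleading.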
9 Conclusions
In this paper, we implement ISPRED2, a new HM-SVM-based method that takes advantage of the difference between observed and predicted residue solvent accessibility. In cross-validation, ISPRED2 surpasses state-of-the-art performance, achieving an overall accuracy of 0.71 with a correlation coefficient of 0.43. ISPRED2 is therefore suitable for identifying putative interaction sites on a protein structure, and it can be applied in structural Systems Biology studies aiming at validating protein-protein interaction networks at the molecular level [2].
Acknowledgments. CS is the recipient of a MIUR (Ministero Istruzione Università Ricerca) fellowship supporting his PhD program. The project is partially funded by a MIUR-FIRB grant for the LIBI project awarded to RC.
References
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 213(3), 403–410 (1990)
2. Bartoli, L., Martelli, P.L., Rossi, I., Fariselli, P., Casadio, R.: The prediction of protein-protein interacting sites in genome-wide protein interaction networks: The Test Case of the Human Cell Cycle. Curr. Protein Pept. Sci. 11, 601–608 (2010)
3. Bordner, A.J., Abagyan, R.: Statistical analysis and prediction of protein-protein interfaces. Proteins 60(3), 353–366 (2005)
4. Bradford, J.R., Westhead, D.R.: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 21(8), 1487–1494 (2005)
5. Chen, X.W., Jeong, J.C.: Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5), 585–591 (2009)
6. Chen, H., Zhou, H.X.: Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins 61(1), 21–35 (2005)
7. Chung, J.L., Wang, W., Bourne, P.E.: Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins 62(3), 630–640 (2006)
8. Deng, L., Guan, J., Dong, Q., Zhou, S.: Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics 10, 426 (2009)
9. DeVries, S.J., Bonvin, A.M.J.J.: How Proteins Get in Touch: Interface Prediction in the Study of Biomolecular Complexes. Current Protein and Peptide Science, 394–406 (2008)
10. Dong, Q., Wang, X., Lin, L., Guan, Y.: Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics 8, 147 (2007)
11. Ezkurdia, I., Bartoli, L., Fariselli, P., Casadio, R., Valencia, A., Tress, M.L.: Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform. 10(3), 233–246 (2009)
12. Fariselli, P., Pazos, F., Valencia, A., Casadio, R.: Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem. 269(5), 1356–1361 (2002)
13. Fariselli, P., Zauli, A., Rossi, I., Finelli, M., Martelli, P.L., Casadio, R.: A neural network method to improve prediction of protein-protein interaction sites in heterocomplexes. In: IEEE Int. Workshop on Neural Network on Signal Processing 2003, Toulouse (France), pp. 33–41. IEEE Press (2003)
14. Henrick, K., Thornton, J.M.: PQS: A protein quaternary structure file server. Trends Biochem. Sci. 23(9), 302–305 (1998)
15. Jones, S., Thornton, J.M.: Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol. 272, 121–132 (1997)
16. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983)
17. Koike, A., Takagi, T.: Prediction of protein-protein interaction sites using support vector machines. Protein Eng. Des. Sel. 17(2), 165–173 (2004)
18. Li, M.H., Lin, L., Wang, X.L., Liu, T.: Protein-protein interaction site prediction based on conditional random fields. Bioinformatics 23(5), 597–604 (2007)
19. Li, N., Sun, Z., Jiang, F.: Prediction of protein-protein binding site by using core interface residue and support vector machine. BMC Bioinformatics 9, 553 (2008)
20. Liu, B., Wang, X., Lin, L., Tang, B., Dong, Q., Wang, X.: Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics 10, 381 (2009)
21. Nguyen, M.N., Rajapakse, J.C.: Protein-Protein Interface Residue Prediction with SVM Using Evolutionary Profiles and Accessible Surface Areas. In: CIBCB 2006, pp. 1–5 (2006)
22. Ofran, Y., Rost, B.: Predicted protein-protein interaction sites from local sequence information. FEBS Lett. 544(1-3), 236–239 (2003)
23. Ofran, Y., Rost, B.: ISIS: interaction sites identified from sequence. Bioinformatics 23(ECCB 2006), e13–e16 (2006)
24. Porollo, A., Meller, J.: Prediction-based fingerprints of protein-protein interactions. Proteins 66(3), 630–645 (2007)
25. Qin, S., Zhou, H.X.: meta-PPISP: a meta web server for protein-protein interaction site prediction. Bioinformatics 23(24), 3386–3387 (2007)
26. Res, I., Mihalek, I., Lichtarge, O.: An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics 21(10), 2496–2501 (2005)
27. Šikić, M., Tomić, S., Vlahoviček, K.: Prediction of Protein-Protein Interaction Sites in Sequences and 3D Structures by Random Forests. PLoS Comput. Biol. 5(1), e1000278 (2009)
28. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 6, 1453–1484 (2005)
29. Yan, C., Dobbs, D., Honavar, V.: A two-stage classifier for identification of protein-protein interface residues. Bioinformatics 20(suppl. 1), I371–I378 (2004)
30. Wagner, M., Adamczak, R., Porollo, A., Meller, J.: Linear regression models for solvent accessibility prediction in proteins. Journal of Computational Biology 12, 355–369 (2005)
31. Wang, B., Chen, P., Huang, D.S., Li, J.J., Lok, T.M., Lyu, M.R.: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. 580(2), 380–384 (2006)
32. Zhou, H.X., Shan, Y.: Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins 44(3), 336–343 (2001)
33. Zhou, H.X., Qin, S.: Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics 23(17), 2203–2220 (2007)
34. Hubbard, S.J.: ACCESS: A Computer Program Written in C. University College, London (1989)