Briefings in Bioinformatics, Vol. 15, No. 5, 839-855. Advance Access published on 29 March 2013
doi:10.1093/bib/bbt012
Data construction for phosphorylation site prediction
Haipeng Gong, Xiaoqing Liu, Jun Wu and Zengyou He*
Submitted: 29th October 2012; Received (in revised form): 9th February 2013
Abstract
Protein phosphorylation is one of the most pervasive post-translational modifications, regulating diverse cellular processes in various organisms. As mass spectrometry-based experimental approaches for identifying phosphorylation events are resource-intensive, many computational methods have been proposed, in which phosphorylation site prediction is formulated as a classification problem. They differ in several ways, and one crucial issue is the construction of training data and test data for unbiased performance evaluation. In this article, we categorize the existing data construction methods and try to answer three questions: (i) Is it equivalent to use different data construction methods in the assessment of phosphorylation site prediction algorithms? (ii) What kind of test data set is unbiased for assessing the prediction performance of a trained algorithm in different real world scenarios? (iii) Among the summarized training data construction methods, which one(s) has better generalization performance for most scenarios? To answer these questions, we conduct comprehensive experimental studies for both non-kinase-specific and kinase-specific prediction tasks. The experimental results show that: (i) different data construction methods can lead to significantly different prediction performance; (ii) different test data construction methods can be unbiased with respect to different real world scenarios; and (iii) different data construction methods have different generalization performance in different real world scenarios. Therefore, when developing new algorithms in future research, one should concentrate on what kind of scenario the algorithm will work for, what the corresponding unbiased test data are and which training data construction method can generate the best generalization performance.

Keywords: phosphorylation site prediction; classification; Friedman test; generalization performance; Wilcoxon signed-ranks test
INTRODUCTION
Post-translational modification is the chemical modification of a protein after its translation. Protein phosphorylation is a kind of post-translational modification that plays key roles in many cellular processes. A protein kinase phosphorylates its substrates by transferring phosphate from adenosine triphosphate or guanosine triphosphate to specific amino acids (serine, threonine and tyrosine).
Corresponding author. Zengyou He, School of Software, Dalian University of Technology, Dalian, Liaoning, 116621, China. Tel.: +86-411-87571630; Fax: +86-411-87571630; E-mail: [email protected]
Haipeng Gong received the BS degree in software engineering from Dalian University of Technology, China, in 2011. He is currently working towards the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and computational proteomics.
Xiaoqing Liu is currently working towards the BS degree in the School of Software at Dalian University of Technology. She is one of the students recommended for admission for the MS degree in the School of Software at Dalian University of Technology. Her research interests include bioinformatics and data mining.
Jun Wu is currently working towards the BS degree in the School of Software at Dalian University of Technology. He is one of the students recommended for admission for the MS degree in the School of Software at Dalian University of Technology. His research interests include bioinformatics and data mining.
Zengyou He received the BS, MS and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002 and 2006, respectively. He was a research associate in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology from February 2007 to February 2010. Since March 2010, he has been an associate professor in the School of Software at Dalian University of Technology. His research interests include data mining and computational mass spectrometry.
© The Author 2013. Published by Oxford University Press. For Permissions, please email: [email protected]
A biological function usually involves a series of phosphorylation processes [1], in which protein kinases recognize specific substrates and determine the exact time and place for phosphorylation to occur. In a eukaryotic cell, it is expected that 30-50% of the proteins can be phosphorylated [2]. However, most phosphorylation substrates still remain unrecognized. To identify phosphorylation sites, various experimental methods have been exploited. In particular, the advent of high-throughput mass spectrometry enables the large-scale investigation of phosphorylated peptides under different conditions. However, the experimental determination of phosphorylation sites is still labour-intensive and time-consuming. To reduce the number of candidate phosphorylation sites that have to be validated experimentally, the computational prediction of potential sites has become a useful strategy.

Most proposed computational methods formulate phosphorylation site prediction as a binary classification problem: each serine/threonine/tyrosine is classified as either a phosphorylation site or a non-phosphorylation site. Although some other amino acids, such as histidine and aspartate, can also be phosphorylated, serine/threonine/tyrosine are the residues considered in the majority of computational methods. To date, there are already many algorithms for phosphorylation site prediction; the reader may refer to [3-5] for a summary of these methods. Briefly, they differ in the machine-learning technique used, the amount of sequence/structural information adopted and whether the prediction models are kinase-specific.

Given that a large number of phosphorylation site prediction methods are already available, it is critical to properly assess and compare the performance of different prediction methods. An accurate performance evaluation protocol can reveal the deficiencies of existing methods and show whether a newly developed method really achieves a clear performance gain over previous ones. As in any classification problem, it is crucial to have a collection of training/test data with known positive and negative examples for model construction and performance evaluation. Here, positive data correspond to a set of phosphorylated peptides, and negative data correspond to a set of non-phosphorylated peptides. However, it is not straightforward to generate such data because most phosphorylation sites are still unknown. As a result, different strategies have been exploited for constructing the training/test data in different methods, making it possible to generate inconsistent performance evaluation results.

In this article, we first give a summary of both training and test data construction methods in the literature. Then, we concentrate on answering three questions. The first is whether different data construction methods can lead to significantly different performance. The second is what kind of test data are unbiased in different real world scenarios. The last is which training data construction methods have better generalization performance over most scenarios. To answer the first question, we apply two different feature extraction methods and use 16 frequently used classifiers from Weka [6] to conduct comprehensive experiments. We use the Friedman test together with post-hoc tests to evaluate whether different data construction methods can lead to significantly different performance on phosphorylation site prediction. To answer the second question, we summarize the existing scenarios and discuss the relationships between the scenarios and negative test data construction methods. To answer the third question, we generate all kinds of test data in the different scenarios and, for each of them, apply all the training data construction methods to train models for prediction performance comparison. We apply the Wilcoxon signed-ranks test to check which training data construction method achieves better generalization performance.
SUMMARY OF DATA CONSTRUCTION METHODS
We define a phosphorylated protein as a protein with at least one S/T/Y residue known to be phosphorylated. Each S/T/Y site in a protein is usually transformed into a short sequence of fixed length whose central residue is the target S/T/Y site. Such a sequence is called a phosphorylated peptide if its central residue can be phosphorylated; otherwise, it is an unphosphorylated peptide.
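For concreteness, the sketch below (our own illustrative helper, not code from the paper) extracts such fixed-length windows; the experiments later in the article use windows of length 13, and padding the protein ends with 'X' is our assumption.

```python
# Illustrative sketch: enumerate candidate S/T/Y sites and their peptide windows.
def extract_windows(sequence, length=13):
    half = length // 2
    padded = "X" * half + sequence + "X" * half  # end-padding is an assumption
    for i, residue in enumerate(sequence):
        if residue in "STY":
            yield i, padded[i:i + length]  # window centred on position i

# Example usage on a toy sequence.
peptides = list(extract_windows("MKSPTAYLLK"))
```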
Training data construction methods
For non-kinase-specific predictions, there is one method for constructing positive training data (GTrainP) and five methods for generating negative training data (GTrainN1-GTrainN5), as described in
Figure 1. Similarly, the data construction methods for kinase-specific predictions are summarized in Figure 2. The relationship of the six methods for generating kinase-specific negative training data is shown in Figure 3. For rigorous mathematical definitions and references in each category, please refer to Table 1.
Test data construction methods
Cross-validation and independent tests are used for generating the test data, as shown in Figure 4. Note that the positive data and negative data in each test data construction procedure are generated with the same strategies as described in section 'Training data construction methods' for the training data. We also provide mathematical definitions for the test data generated from each method in Table 2.
METHODS
To answer question 1, we apply two feature extraction methods and 16 frequently used classifiers to conduct the performance comparison. We use the Friedman test and post-hoc tests to check whether
different data construction methods lead to significant differences in the assessed prediction performance. To answer question 2, we explain what we mean by 'unbiased test data', summarize real world application scenarios and provide the mapping between populations and negative test data construction methods. To answer question 3, we use the data construction methods in section 'Summary of data construction methods' to generate various kinds of test data sets. Then, for each of them, we use all the training data constructed in section 'Experimental methods for answering question one', the first feature extraction method and the 16 classifiers to train models for performance evaluation.
Experimental methods for answering question one
There are three key issues in evaluating the performance of phosphorylation site prediction methods: data construction, feature extraction and classifier selection. In this article, we fix the feature extraction method and the classifier so as to evaluate whether different data construction methods alone can lead to significant differences in the prediction performance.
Downloaded from http://bib.oxfordjournals.org/ at Hong Kong Baptist University on September 23, 2014
Figure 1: Training data construction methods for non-kinase-specific phosphorylation site prediction approaches. P1 is the set of phosphorylated proteins and P2 is the set of unphosphorylated proteins. (A) GTrainP: the positive training data consist of annotated phosphorylated peptides from various data sources without considering the kinases information. (B) GTrainN1: the negative training data generated from this method consist of unphosphorylated peptides from phosphorylated proteins. (C) GTrainN2: it includes unphosphorylated peptides from both phosphorylated proteins and unphosphorylated proteins. (D) GTrainN3: it constructs negative training data using unphosphorylated peptides from the unphosphorylated proteins. (E) GTrainN4: it randomly generates a set of unphosphorylated peptides based on the frequency distribution of amino acids derived from GTrainN1. This method includes three steps: first, extract peptides from GTrainN1; second, calculate the amino acid frequency distribution; third, randomly generate new peptides with the calculated distribution. (F) GTrainN5: this method randomly generates a set of unphosphorylated peptides with equal frequency distribution of amino acids, i.e. the probability of each amino acid appearing in each non-central position of a peptide is the same.
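As an illustration of panels E and F, the sketch below implements the two random generators (the function name and the handling of the central S/T/Y residue are our assumptions):

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_negative_peptides(n, length=13, background=None, seed=0):
    """GTrainN4-style if `background` peptides are given (match their non-central
    amino acid frequencies); GTrainN5-style if background is None (uniform)."""
    rng = random.Random(seed)
    mid = length // 2
    if background is not None:
        # Step 1-2 of GTrainN4: count amino acids at non-central positions.
        counts = Counter(aa for pep in background
                         for i, aa in enumerate(pep) if i != mid)
        weights = [counts[aa] for aa in AMINO_ACIDS]
    else:
        weights = [1] * len(AMINO_ACIDS)   # equal frequencies (GTrainN5)
    out = []
    for _ in range(n):
        flanks = rng.choices(AMINO_ACIDS, weights=weights, k=length - 1)
        # Keep an S/T/Y residue at the centre so the peptide is a valid candidate site.
        out.append("".join(flanks[:mid]) + rng.choice("STY") + "".join(flanks[mid:]))
    return out
```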
Figure 3: The relationship of six kinase-specific negative training data construction methods.
Data
We use Phospho.ELM (version 9.0) [31] and Swiss-Prot (release 2011_11) [32] as the data sources to extract all proteins annotated as 'Homo sapiens'. We first follow Musite [15] and partition all the proteins into different groups using BLASTClust from the BLAST package (version 2.2.19) [33] with a
sequence identity threshold of 50%. A sequence identity threshold of 50% may not guarantee that there are no highly similar sequences between the training and test sets. Here, we did not use a larger threshold or perform a post-processing step to further remove such sequences, because some phosphorylation sites and non-phosphorylation sites have very similar surroundings, which leads to highly similar sequences between training and test data. How to distinguish these sites/sequences is the most challenging problem in phosphorylation site prediction, and keeping some of these hard-to-separate sequences helps us check which classifiers have better predictive performance. Then, within each group, the protein with the largest number of known phosphorylation sites is selected as one element of the non-redundant (NR) data set; if there are no phosphorylated proteins in a group, the longest protein is selected. To construct test data sets for the second independent test (ID2), we also
Figure 2: Training data construction methods for kinase-specific phosphorylation site prediction approaches. Pj is the set of proteins phosphorylated by kinase j, Po is the set of proteins phosphorylated by other kinases and Pu is the set of unphosphorylated proteins. (A) KTrainP: the positive training data are composed of known phosphorylated peptides from phosphorylated proteins of corresponding kinases. (B) KTrainN1: it constructs the negative training data by extracting unphosphorylated peptides from phosphorylated proteins of kinase j. (C) KTrainN2: it generates the negative training data using both unphosphorylated peptides from all phosphorylated proteins and those phosphorylated peptides from proteins that are not phosphorylated by kinase j. This construction method assumes that different kinases have different local patterns around the phosphorylation site. (D) KTrainN3: it constructs the negative training data using unphosphorylated peptides from both phosphorylated proteins and unphosphorylated ones. (E) KTrainN4: the negative training data produced by this method consist of unphosphorylated peptides from proteins that can be phosphorylated by any kinase. (F) KTrainN5: the negative training data from this method include both unphosphorylated peptides from all proteins and phosphorylated peptides from proteins phosphorylated by other kinases. (G) KTrainN6: the negative training set is composed of unphosphorylated peptides whose central S/T/Y residues are buried in the core of proteins phosphorylated by kinase j. This strategy relies on the assumption that buried residues would not be physically accessible to any kinase, thus reducing the number of so-called negatives that later turn out to be positives. A disadvantage of this method is that it requires knowledge of the protein’s tertiary structure, and only a small portion of proteins currently have solved structures.
Table 1: Training data construction methods

Non-kinase-specific, positive:
- GTrainP: $f(\bigcup_{j=1}^{m_i} K^+_{i,j})$. Used by almost all existing methods (1).

Non-kinase-specific, negative:
- GTrainN1: $f(\bigcup_{j=1}^{m_i} P^+_{i,j})$. Examples: NetPhos [7], rBBFNN [8], DISPHOS [9], NetPhosYeast [10], GANNPhos [11], AutoMotif [12], PPRED [13], PlantPhos [14].
- GTrainN2: $f((\bigcup_{j=1}^{m_i} P^+_{i,j}) \cup Q^+_i)$. Example: Musite [15].
- GTrainN3: $f(Q^+_i)$. Example: PHOSIDA [16].
- GTrainN4: $f(R_1(\bigcup_{j=1}^{m_i} P^+_{i,j}))$. Example: GPS 2.0 [17].
- GTrainN5: $f(R_2())$. Example: * [18].

Kinase-specific, positive:
- KTrainP: $K^+_{i,j}$. Used by almost all existing methods (1).

Kinase-specific, negative:
- KTrainN1: $f(P^+_{i,j})$. Examples: CRPhos [19], NetPhosK [20], PHOSITE [21], PostMod [22].
- KTrainN2: $f((\bigcup_{x \neq j} K^+_{i,x}) \cup (\bigcup_{x=1}^{m_i} P^+_{i,x}))$. Example: ALSVM [23].
- KTrainN3: $f((\bigcup_{x=1}^{m_i} P^+_{i,x}) \cup Q^+_i)$. Examples: AMS3.0 [24], * [25].
- KTrainN4: $f(\bigcup_{x=1}^{m_i} P^+_{i,x})$. Examples: PredPhospho [26], KinasePhos 1.0 [27], KinasePhos 2.0 [28].
- KTrainN5: $f((\bigcup_{x \neq j} K^+_{i,x}) \cup (\bigcup_{x=1}^{m_i} P^+_{i,x}) \cup Q^+_i)$. Example: N.A.
- KTrainN6: $f(P^+_{i,j} \cap B_i)$. Example: NetPhosK [29].

We define $S_i$ as the set of peptide sequences with S/T/Y as the central residue of species $i$ ($i = 1, 2, \ldots, n$); the set of such peptides from all species is $U = \bigcup_{i=1}^{n} S_i$. Suppose there are $m_i$ kinases for $S_i$; let $K_{i,1}, K_{i,2}, \ldots, K_{i,m_i}$ be the subsets of phosphorylated peptides and $P_{i,1}, P_{i,2}, \ldots, P_{i,m_i}$ the subsets of unphosphorylated peptides from proteins phosphorylated by the $m_i$ kinases, and let $Q_i$ be the set of peptide sequences with S/T/Y as the centre from unphosphorylated proteins of species $i$; then $S_i = (\bigcup_{j=1}^{m_i} K_{i,j}) \cup (\bigcup_{j=1}^{m_i} P_{i,j}) \cup Q_i$. Usually, data sets are partitioned into two parts for the independent test within the same species, one for extracting training data (marked '+') and one for extracting test data (marked '-'), i.e. $S_i = S_i^+ \cup S_i^-$. We define $B_i$ as the set of peptide sequences whose centre residues are buried sites in the core of proteins of species $i$, $f(y)$ as the function of sampling or bootstrapping a (sub)set from the set $y$, $R_1(y)$ as the function of artificially constructing a set of unphosphorylated peptides based on the frequency distribution of amino acids in set $y$, and $R_2()$ as the function of randomly constructing a set of unphosphorylated peptides in which every amino acid has the same frequency. Non-kinase-specific training data sets are built for species $i$, whereas kinase-specific data sets are built for kinase $j$ of species $i$. '(1)' indicates that almost all existing methods have applied this method; '*' means that there is no abbreviated name.
extract proteins annotated as 'Mus musculus' and use the same method to generate the NR data set.

Data construction methods
Because a serine/threonine-specific kinase can often phosphorylate both serine and threonine residues [34], phosphoserines and phosphothreonines are combined for model training. In our experiments, the length of each phosphorylated/unphosphorylated peptide is 13. We randomly split the Homo sapiens NR data set into two subsets, TrainSet and TestSet, which contain two-thirds and one-third of the proteins, respectively. For the non-kinase-specific prediction experiments, we generate five kinds of training data sets, denoted GM1, GM2, GM3, GM4 and GM5. In each GMi (i = 1, 2, 3, 4, 5), the positive and negative training data are constructed with GTrainP and GTrainNi, respectively. Furthermore, we randomly generate five training data sets for each GMi by sampling 5000 phosphorylated peptides and 5000 unphosphorylated peptides from TrainSet. Besides cross-validation, we also construct three groups of independent test data, GT1, GT2 and GT3, with ID1, ID2 and ID3, respectively.
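A minimal sketch of this split-and-sample procedure, assuming the peptide pools have already been built with GTrainP/GTrainNi (function names and seeding are ours):

```python
import random

def split_proteins(nr_proteins, seed=0):
    """Randomly split the NR protein set into TrainSet (2/3) and TestSet (1/3)."""
    rng = random.Random(seed)
    shuffled = list(nr_proteins)
    rng.shuffle(shuffled)
    cut = 2 * len(shuffled) // 3
    return shuffled[:cut], shuffled[cut:]

def sample_training_sets(pos_pool, neg_pool, n=5000, replicates=5, seed=0):
    """Five random training sets of n phosphorylated plus n unphosphorylated
    peptides; the negative pool comes from the chosen GTrainNi method."""
    rng = random.Random(seed)
    return [(rng.sample(pos_pool, n), rng.sample(neg_pool, n))
            for _ in range(replicates)]
```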
To match the training data, every group also has five test data sets, where each set is composed of 5000 phosphorylated peptides and 5000 unphosphorylated peptides. The difference is that the component peptides for GT1, GT2 and GT3 are sampled from TestSet, the Mus musculus NR data and the whole Homo sapiens NR data, respectively. In addition, we fix GTrainN2 as the negative data extraction method when generating the independent test data. We choose GTrainN2 because we assume that what we usually want to predict are proteins without known phosphorylation sites, so the population should correspond to both phosphorylated and unphosphorylated proteins. In this case, test data constructed with GTrainN2 are unbiased, so that the estimated prediction performance can be safely generalized to the population level. For the kinase-specific experiments, we choose three kinases: CDK, PKA and CK2. For each kinase, we construct five groups of training data (denoted CDKMi for kinase CDK, PKAMi for kinase PKA and CK2Mi for kinase CK2) using KTrainP and KTrainNi (i = 1, 2, 3, 4, 5), and three groups of independent test data with ID1, ID2 and ID3 (KTrainN5 is fixed as the negative data construction method; this choice is based on our motivation for introducing KTrainN5).
Figure 4: Test data construction methods for non-kinase-specific and kinase-specific phosphorylation site prediction approaches: cross-validation and independent test. (A) Cross-validation (CV) splits the original training data into different parts with approximately equal size. Each part is masked in turn as the test data, whereas the remaining parts are used as the training data. The classifier built on the training data is then applied to predict the class labels of test data. This process is repeated until all parts have been masked once, and then the errors made in all blinded tests are combined to give an overall performance estimate. (B) ID1: the first independent test method uses a subset of phosphorylated peptides and unphosphorylated peptides as the test data, which has no overlap with the training data from the same species. (C) ID2: this method constructs a test data set using peptides from other species. (D) ID3: the test data are sampled from the same set that is used for generating the training data. This means that there can be some overlap between the test data and training data.
Table 2: Test data construction methods

Non-kinase-specific:
- CV: GTestP1: $c(\bigcup_{j=1}^{m_i} K^+_{i,j})$; GTestN1: $c(e((\bigcup_{j=1}^{m_i} P^+_{i,j}) \cup Q^+_i))$.
- ID1: GTestP2: $f(\bigcup_{j=1}^{m_i} K^-_{i,j})$; GTestN2: $f(e((\bigcup_{j=1}^{m_i} P^-_{i,j}) \cup Q^-_i))$.
- ID2: GTestP3: $f(\bigcup_{j=1}^{m_l} K_{l,j})$; GTestN3: $f(e((\bigcup_{j=1}^{m_l} P_{l,j}) \cup Q_l))$, with $l \neq i$.
- ID3: GTestP4: $f(\bigcup_{j=1}^{m_i} K_{i,j})$; GTestN4: $f(e((\bigcup_{j=1}^{m_i} P_{i,j}) \cup Q_i))$.

Kinase-specific:
- CV: KTestP1: $c(K^+_{i,j})$; KTestN1: $c(g((\bigcup_{x \neq j} K^+_{i,x}) \cup (\bigcup_{x=1}^{m_i} P^+_{i,x}) \cup Q^+_i))$.
- ID1: KTestP2: $K^-_{i,j}$; KTestN2: $g((\bigcup_{x \neq j} K^-_{i,x}) \cup (\bigcup_{x=1}^{m_i} P^-_{i,x}) \cup Q^-_i)$.
- ID2: KTestP3: $K_{l,j}$; KTestN3: $g((\bigcup_{x \neq j} K_{l,x}) \cup (\bigcup_{x=1}^{m_l} P_{l,x}) \cup Q_l)$, with $l \neq i$.
- ID3: KTestP4: $K_{i,j}$; KTestN4: $g((\bigcup_{x \neq j} K_{i,x}) \cup (\bigcup_{x=1}^{m_i} P_{i,x}) \cup Q_i)$.

Cross-validation is used by almost all existing methods (1); examples of the independent tests include PlantPhos [14], rBBFNN [8], Musite [15], PPRED [13] and * [12] for non-kinase-specific prediction, and AMS3.0 [24], PhoScan [30] and Musite [15] for kinase-specific prediction.

Continuing Table 1, we define $e(y)$ as the operation of selecting one method from GTrainN1-GTrainN5 to build negative data based on set $y$, $g(y)$ as the operation of selecting one method from KTrainN1-KTrainN6 to build negative data based on set $y$, and $c(y)$ as the operation of building test data through cross-validation based on set $y$. For KTestP3 and KTestN3, the $j$th kinase of species $i$ and species $l$ is the same.
Each group has five randomly generated data sets of size k, where k is the number of kinase-specific phosphorylated sites in TrainSet for the training data, and the number of kinase-specific phosphorylated sites in TestSet, the Mus musculus NR data and the Homo sapiens NR data for the three groups of test data, respectively.

Feature extraction methods
In the experiments, we use two feature extraction methods to generate features for classifier training and testing. In the first method, we use features that have been exploited in Musite [15]: KNN scores, protein disorder features and the composition of amino acids surrounding S/T/Y sites. In the second method, we directly use the amino acid at each position as a categorical feature. For classifiers that cannot handle discrete data, we transform each categorical feature into 20 binary features, where each binary feature represents the presence/absence of a specific amino acid at that position using 1/0. A general description of these features is provided in the supplementary document.
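A minimal sketch of the binary encoding just described (the example peptide is arbitrary; any padding characters would simply yield all-zero indicators):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide):
    """Encode a peptide as len(peptide) x 20 binary presence/absence features."""
    return [1 if residue == aa else 0
            for residue in peptide
            for aa in AMINO_ACIDS]

# A 13-residue window yields 13 * 20 = 260 binary features.
assert len(one_hot("ARKRSLSDEENLV")) == 260
```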
Statistical testing methods
To evaluate whether different training/test data construction methods have a significant influence on the reported prediction performance, we apply the Friedman test and post-hoc tests (Nemenyi test, Bonferroni-Dunn test and Holm's step-down test) [35] to analyse the results. Basic principles and formulas of these statistical tests are described in the supplement.
Performance comparison
We choose 16 frequently used classifiers (see the supplement for details) from Weka [6] to conduct the experiment. For each kind of training set, one 5-fold cross-validation and three independent tests are performed. In cross-validation, all five training data sets in each group are used to perform the 5-fold cross-validation. In the independent tests, we randomly select one test data set from the corresponding group for both the non-kinase-specific and the kinase-specific experiments. We calculate the average of the area under the ROC curve (AUC) values as the statistic for the subsequent statistical analysis. Our comparison includes two major parts: the comparison of training data construction methods and the comparison of test data construction methods. In all our experiments, we set the statistical significance level to α = 0.05.
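As an illustration of one cell of this evaluation grid, the sketch below uses scikit-learn in place of Weka (the classifier and its parameters are illustrative, not the paper's exact setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mean_cv_auc(X, y, clf=None, folds=5):
    """Average AUC of one classifier over stratified k-fold cross-validation."""
    clf = clf or RandomForestClassifier(n_estimators=100, random_state=0)
    return np.mean(cross_val_score(clf, X, y, cv=folds, scoring="roc_auc"))
```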
Defining unbiased test data for answering question two
First of all, the construction of both training data and test data draws a sample (subset) from a population of interest (full set). However, their objectives are fundamentally different. Test data sampling requires unbiased subsets so that the predictive performance estimated on them can be safely assumed to generalize to the population level. Training data sampling aims at generating subsets such that the classification algorithms trained on them have the best predictive performance, even if these subsets are biased. Typically, an unbiased test data set is a representative of the population and should have the same data distribution as the population. Test subsets generated by simple uniform random sampling from the population yield an unbiased estimate of the true distribution. However, in the context of phosphorylation site prediction, there is more than one real world scenario and hence more than one population. This means that no single test subset is unbiased in every scenario. For example, if we use a model to determine the phosphorylation status of candidate sites from proteins without any known phosphorylation sites, prediction performance estimated on test data generated from only the sites of known phosphorylated proteins may be biased. Conversely, if the model is used to predict phosphorylation sites for a set of already known phosphorylated proteins, then the unbiased test subset should be sampled only from phosphorylated proteins. In a word, different application scenarios correspond to different populations, and different populations correspond to different unbiased test subset sampling methods.
Experimental methods for answering question three
The generalization performance of a learning method relates to its prediction capability on independent test data [36]. The assessment of such performance is extremely important in practice, as it guides the choice of learning method and measures the quality of the ultimately chosen model. In this article, we assess the generalization performance of training data construction methods. We perform experiments to obtain the average performance of different training data construction methods using independent test data sets that cover all the scenarios in section 'Results for answering the second question', and apply the Wilcoxon signed-ranks test to check which one is better.

Data construction and feature extraction methods
For the non-kinase-specific prediction experiments, we generate three kinds of independent test data sets with ID1, denoted GT11, GT12 and GT13. In each GT1i (i = 1, 2, 3), the positive and negative test data are constructed with GTrainP and GTrainNi, respectively. Furthermore, we randomly generate five test data sets for each GT1i by sampling 5000 phosphorylated peptides and 5000 unphosphorylated peptides from TestSet. For generating training data sets, we use GM1, GM2 and GM3 from section 'Experimental methods for answering question one'. For the kinase-specific experiments, we again choose CDK, PKA and CK2. For each kinase, we construct five groups of test data (denoted CDKT1i for kinase CDK, PKAT1i for kinase PKA and CK2T1i for kinase CK2) using ID1, KTrainP and KTrainNi (i = 1, 2, 3, 4, 5). Each group has five randomly generated data sets of size k, where k is the number of kinase-specific phosphorylated sites in TestSet. For constructing training data sets, we use CDKMi, PKAMi and CK2Mi (i = 1, 2, 3, 4, 5), respectively. In these experiments, we use the features that have been exploited in Musite: KNN scores, protein disorder features and the composition of amino acids surrounding S/T/Y sites.

Generalization performance comparison
We use the same 16 classifiers as in section 'Experimental methods for answering question one' to conduct the experiment. In the non-kinase-specific predictions, we use GMi as the training data and obtain its average AUC performance for each GT1i. Then, we calculate each GMi's average performance over all the test data sets (GT11, GT12 and GT13). Finally, we apply the Wilcoxon signed-ranks test [35] to conduct the pair-wise comparison. For the kinase-specific experiments, we use the same experimental process. In these experiments, we set the statistical significance level to α = 0.05.
RESULTS
Results for answering the first question
The comparison of training data construction methods
Figure 5 plots the distribution of AUC values when different training data construction methods are used for the non-kinase-specific prediction tasks. Here, we use the Musite feature for all 16 classifiers (the result of using the categorical feature is presented in Supplementary Figure S1). It is clearly visible that different training data construction methods can lead to different performance evaluation results. To quantitatively clarify this fact, we perform the Friedman test to check whether such differences are statistically significant under the null hypothesis that
the use of different training data construction methods has no effect on the reported prediction performance. The Friedman test is a non-parametric statistical test similar to ANOVA [35]. To apply the Friedman test, we rank the data construction methods for each classifier separately and then average the rankings of each data construction method over the 16 classifiers. Clearly, if all the data construction methods were equivalent in the assessment of phosphorylation site prediction, their average ranks would be expected to be the same. Under this null hypothesis, the Friedman statistic FF in supplementary equation (3) is distributed according to the F-distribution with (k - 1) and (k - 1)(N - 1) degrees of freedom, where k is the number of data construction methods and N is the number of classifiers. GTrainN1, GTrainN2 and GTrainN3 are frequently used methods in the literature; therefore, we first exclude GM4 and GM5 and apply the Friedman test and post-hoc tests to the results of GM1, GM2 and GM3. Here, we have three data construction methods and 16 classifiers, and the critical value for F(2, 30) at α = 0.05 is 3.3158. As shown in Table 3, all the FF values except that of ID1 are significantly larger than 3.3158. This means the null hypothesis that all training data construction methods are equivalent should be rejected in CV, ID2 and ID3. From Table 3, we can find that CV has the largest FF value in the Friedman test, much larger than the FF values of ID1, ID2 and ID3. Furthermore, the FF values of ID1, ID2 and ID3 are very similar. This is because the test data are the same for GM1, GM2 and GM3 in ID1, ID2 and ID3. That means the difference between the positive training data and negative test data is not very big among them, which leads to a much weaker difference in their predictive performance. Hence, the FF values are very similar and much smaller than that in CV.
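A minimal sketch of this rank-and-test computation, using the Iman-Davenport form of the Friedman statistic that matches the F-distributed form quoted above (variable names are ours; the AUC matrix layout, one row per classifier and one column per data construction method, is assumed):

```python
import numpy as np
from scipy.stats import rankdata, f

def friedman_ff(auc):
    """auc: N x k matrix of AUC values (N classifiers, k construction methods).
    Returns the average ranks, the F_F statistic and the 5% critical value."""
    N, k = auc.shape
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, auc)  # rank 1 = best AUC
    avg_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    # F_F diverges to +infinity when chi2 == N*(k-1), i.e. when every classifier
    # ranks the methods identically (as in the CV row of Table 3).
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    crit = f.ppf(0.95, k - 1, (k - 1) * (N - 1))  # e.g. 3.3158 for F(2, 30)
    return avg_ranks, ff, crit
```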
Figure 5: The distribution of AUC values for the non-kinase-specific training data construction methods with Musite feature. From left to right, the test data construction methods are CV, ID1, ID2 and ID3, respectively. In each sub-figure, the x-axis represents the training data set (GMi, i = 1, 2, 3, 4, 5), the y-axis represents the classifier used and the z-axis is the AUC value.
The results of the Friedman test with the categorical feature are shown in Supplementary Table S1. Apart from the Friedman test, we use the Nemenyi test, Bonferroni-Dunn test and Holm's step-down test to check the differences of training data construction methods in a pair-wise manner. The detailed results are provided in Supplementary Tables S2 and S3. These post-hoc statistical tests further reveal that the use of different training data construction methods can lead to different conclusions in the prediction performance evaluation. We also include GM4 and GM5 to conduct a comparison using the Friedman test and post-hoc tests; the detailed results are shown in supplementary section 7.

For kinase-specific predictions, we present the distribution of AUC values and statistical test results for CDK in Figure 6 (Supplementary Figure S2) and Table 4 (Supplementary Tables S4-S6), respectively. From Table 4, we can find that CV has the largest FF value, which is much larger than those of ID1, ID2 and ID3. In addition, the FF values of ID1, ID2 and ID3 are also very similar. This phenomenon is analogous to that in the non-kinase-specific prediction, and the reason is the same. The experimental results for PKA are summarized in Supplementary Figures S3 and S4 and Supplementary Tables S7-S9. The experimental results for CK2 are summarized in Supplementary Figures S5 and S6 and Supplementary Tables S10-S12. From these results, we can observe that the prediction performances reported with different training data construction methods for kinase-specific prediction are also statistically different.
Table 3: Average ranks and Friedman statistic for comparing three non-kinase-specific training data construction methods (GTrainN1, GTrainN2 and GTrainN3) with Musite feature

Test data construction method | GM1 | GM2 | GM3 | FF
CV  | 3     | 2     | 1     | +∞
ID1 | 2.219 | 1.531 | 2.25  | 2.969
ID2 | 1.625 | 1.781 | 2.594 | 5.568
ID3 | 2.156 | 1.375 | 2.469 | 6.977

Table 4: Average ranks and Friedman statistic for comparing CDK-specific training data construction methods with Musite feature

Test data construction method | CDKM1 | CDKM2 | CDKM3 | CDKM4 | CDKM5 | FF
CV  | 3.438 | 5     | 1.063 | 2.25  | 3.25  | 89.918
ID1 | 4     | 2.656 | 3.063 | 2.719 | 2.563 | 2.438
ID2 | 4.094 | 2.469 | 2.281 | 3.094 | 3.063 | 3.771
ID3 | 4     | 2.531 | 2.469 | 2.531 | 3.469 | 3.616
Figure 6: The distribution of AUC values for the CDK-specific training data construction methods with Musite feature. From left to right, the test data construction methods are CV, ID1, ID2 and ID3, respectively. In each sub-figure, the x-axis represents the training data set (CDKMi, i = 1, 2, 3, 4, 5), the y-axis represents the classifier used and the z-axis is the AUC value.
The comparison of test data construction methods
Figure 7 (Supplementary Figure S5) plots the distribution of AUC values generated from different non-kinase-specific test data construction methods when the Musite (categorical) feature is used for all 16 classifiers. To check whether their performance differences are statistically significant, Table 5 presents the average rank of each construction method and the Friedman statistic, as we have done in comparing the different training data construction methods. In this test, the critical value at α = 0.05 for F(3, 45) is 2.81. Clearly, the FF values in Table 5 provide strong statistical evidence of the differences among non-kinase-specific test data construction methods for evaluating the phosphorylation site prediction problem. We also use the Nemenyi test, Bonferroni-Dunn test and Holm's step-down test to check the differences of these test data construction methods and provide the results in Supplementary Tables S11 and S12. We can draw the same conclusion from these post-hoc statistical tests as well. For the kinase-specific experiments, we follow the same pipeline and present the results in Figure 8, Supplementary Figure S6, Table 6 and Supplementary Tables S13-S15 for CDK.
Figure 7: The distribution of AUC values for non-kinase-specific test data construction methods with Musite feature. From left to right, the training data sets are GM1, GM2, GM3, GM4 and GM5, respectively. In each sub-figure, x-axis represents the test data construction method (CV, ID1, ID2, ID3), y-axis represents the classifier used and z-axis is the AUC value.
Table 5: Average ranks and Friedman statistic for comparing non-kinase-specific test data construction methods with Musite feature

Training data set | CV | ID1 | ID2 | ID3 | FF
GM1 | 4     | 2.313 | 1.25  | 2.438 | 50.306
GM2 | 3.938 | 2.375 | 1.25  | 2.438 | 40.491
GM3 | 1.313 | 3.313 | 2.188 | 3.188 | 16.788
GM4 | 1     | 3.719 | 2.75  | 2.531 | 47.472
GM5 | 1     | 3.313 | 2.938 | 2.75  | 25.851
All detailed experimental results for PKA and CK2 are given in supplementary section 6. All these results indicate that the use of different test data construction methods in kinase-specific prediction tasks is not equivalent: different test data construction methods produce statistically different performance evaluation results. The main reason is that different test data construction methods use different data sources. Specifically, ID2 usually has the best predictive performance, whereas CV ranks the worst in both non-kinase-specific and kinase-specific prediction. This is because the training data and test data in ID2 belong to different species, which may lead to larger differences between the positive training data and negative test data; as a result, it is easier to achieve more accurate classification. In addition, the negative training data sets of GM4 and GM5 are constructed with random generation methods, so their positive and negative data are quite different; it is therefore very easy to distinguish them from each other, which gives CV the best predictive performance for GM4 and GM5.

Summary
Our experiments show that different data construction methods can lead to significant differences in performance assessment, which empirically reveals the potential risk of producing biased assessment results with different data construction methods. This finding suggests that we must use the same data construction method in a performance comparison if we want to claim the superiority of a newly developed prediction method. However, some previous studies have simply ignored this problem and compared their performance directly with the numbers given in the original articles, even though the data construction methods are significantly different. Furthermore, it would be advisable to use multiple data construction methods to obtain more confident evaluation results.
Figure 8: The distribution of AUC values for CDK-specific test data construction methods with Musite feature. From left to right, the training data sets are CDKM1, CDKM2, CDKM3, CDKM4 and CDKM5, respectively. In each sub-figure, x-axis represents the test data construction method (CV, ID1, ID2, ID3), y-axis represents the classifier used and z-axis is the AUC value.
Table 6: Average ranks and Friedman statistic for comparing CDK-specific test data construction methods with Musite feature

Training data set | CV | ID1 | ID2 | ID3 | FF
CDKM1 | 3.75  | 2.688 | 1.063 | 2.5   | 41.14
CDKM2 | 3.938 | 2.563 | 1     | 2.5   | 95.345
CDKM3 | 3.063 | 3.313 | 1.063 | 2.563 | 23.4
CDKM4 | 3.875 | 2.5   | 1.063 | 2.563 | 57.18
CDKM5 | 3.563 | 2.875 | 1.063 | 2.5   | 30.07
Results for answering the second question
It is easy to generate an unbiased positive test data set because the population of phosphorylated sites is not difficult to confirm. However, owing to the existence of different application scenarios and different negative data construction methods, obtaining unbiased negative test data is complicated. According to our definition in section 'Defining unbiased test data for answering question two', we summarize the populations of the existing real world scenarios and give the corresponding negative test data construction methods.

Table 7 summarizes two scenarios and their corresponding unbiased negative test data construction methods for non-kinase-specific phosphorylation site prediction. If our objective is to predict phosphorylation sites from phosphorylated proteins, then GTrainN1 is unbiased. If we focus on determining the phosphorylation status of sites from the mixture of phosphorylated and unphosphorylated proteins, then GTrainN2 is unbiased. For kinase-specific phosphorylation site prediction, we list three scenarios and their corresponding unbiased negative data construction methods in Table 8. Here, we use CDK as the example kinase for illustration. If we focus on predicting sites from CDK-specific proteins with at least one site phosphorylated by CDK, the unbiased test data construction method is KTrainN1. If we desire to predict sites from both CDK-specific proteins and other proteins that have no sites phosphorylated by CDK, the unbiased test data construction method is KTrainN2. If we concentrate on site prediction from the mixture of phosphorylated and unphosphorylated proteins, the unbiased test data construction method is KTrainN5. Overall, an unbiased test data set for assessing the predictive performance in a specific scenario consists of two parts: the positive test data constructed by GTrainP (or KTrainP) and the negative test data constructed by the corresponding unbiased method listed in Table 7 (or Table 8). To provide more information for end users of phosphorylation site prediction tools, we categorize some existing typical tools based on the relationship between scenarios and unbiased test data construction methods, as listed in the last columns of Tables 7 and 8.
Results for answering the third question
Experimental results
We use the comparison between GM1 and GM2 as an example to illustrate the procedure of the Wilcoxon signed-ranks test. Table 9 shows the average AUC for GM1 and GM2. We are trying to reject the null hypothesis that both training sets perform equally well.
Table 7: Real world application scenarios, corresponding negative test data construction methods and suggested prediction tools for non-kinase-specific predictions

Scenario | Population | Method | Tools
1 | Sites of phosphorylated proteins | GTrainN1 | NetPhos [7], rBBFNN [8], DISPHOS [9], NetPhosYeast [10], GANNPhos [11], AutoMotif [12], PPRED [13], PlantPhos [14]
2 | Sites of phosphorylated and unphosphorylated proteins | GTrainN2, GTrainN3 | Musite [15]

GTrainN1 and GTrainN2 are the unbiased negative test data construction methods for scenarios 1 and 2, respectively.
Table 8: Real world application scenarios, corresponding negative test data construction methods and suggested prediction tools for kinase-specific predictions

Scenario | Population | Method | Tools
1 | Sites of kinase-specific phosphorylated proteins | KTrainN1 | CRPhos [19], NetPhosK [20], PHOSITE [21], PostMod [22]
2 | Sites of phosphorylated proteins | KTrainN2, KTrainN4 | ALSVM [23]
3 | Sites of phosphorylated and unphosphorylated proteins | KTrainN3, KTrainN5 | AMS3.0 [24], * [25]

When determining the population of unphosphorylated sites, we assume that phosphorylation sites of kinase A should be negative examples for kinase B. KTrainN1, KTrainN2 and KTrainN5 are the unbiased negative test data construction methods for scenarios 1, 2 and 3, respectively ('*' means that there is no abbreviated name).
The ranks are assigned from the lowest to the highest absolute difference, and equal differences are assigned average ranks. The sum of ranks for the positive differences is R+ = 3 + 7 + 11 = 21, and the sum of ranks for the negative differences is R− = 3 + 3 + 7 + 7 + 7 + 7 + 11 + 11 + 13 + 14 + 15 + 16 = 114. Here, T = min(R+, R−) = 21. According to the table of exact critical values for the Wilcoxon test, for a significance level of α = 0.05 and N = 16 classifiers, the difference between the training data sets is significant if T is at most 30. We therefore reject the null hypothesis and conclude that the generalization performance of GM2 is better than that of GM1. Table 10 shows the final results of the Wilcoxon signed-ranks test for the non-kinase-specific experiments. From Table 10, GM2 is better than GM1.
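A minimal sketch of this computation (the helper is ours; it mirrors the text's treatment of the zero difference by ranking it but letting it fall out of both sums):

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_T(auc_a, auc_b):
    """T = min(R+, R-) over paired per-classifier AUCs; ties get average ranks."""
    d = np.asarray(auc_a) - np.asarray(auc_b)
    ranks = rankdata(np.abs(d))      # rank all differences, zero included
    r_plus = ranks[d > 0].sum()      # sum of ranks of positive differences
    r_minus = ranks[d < 0].sum()     # zero differences fall out of both sums
    return min(r_plus, r_minus)

# With the paper's (unrounded) AUC columns of Table 9 this yields T = 21;
# as 21 <= 30, the critical value for N = 16 at alpha = 0.05, GM2 wins.
```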
Table 9: The comparison of generalization performance between GM1 and GM2

Classifier | GM1 | GM2 | Difference | Rank
J48 | 0.691 | 0.702 | −0.011 | 16
J48graft | 0.694 | 0.704 | −0.010 | 15
RandomForest | 0.813 | 0.818 | −0.005 | 11
RandomTree | 0.671 | 0.677 | −0.007 | 14
NaiveBayes | 0.839 | 0.834 | +0.005 | 11
RBFNetwork | 0.830 | 0.828 | +0.001 | 3
BayesNet | 0.839 | 0.837 | +0.002 | 7
MultilayerPerceptron | 0.845 | 0.847 | −0.002 | 7
Logistic | 0.848 | 0.850 | −0.002 | 7
LibSVM | 0.846 | 0.848 | −0.002 | 7
SMO | 0.845 | 0.850 | −0.005 | 11
IBk | 0.846 | 0.847 | −0.001 | 3
AdaBoostM1 | 0.830 | 0.830 | 0.000 | 1
DTNB | 0.836 | 0.838 | −0.002 | 7
PART | 0.774 | 0.780 | −0.006 | 13
DecisionTable | 0.833 | 0.834 | −0.001 | 3
Table 10: Results of the Wilcoxon signed-ranks test for non-kinase-specific generalization performance comparison

T | GM2 | GM3
GM1 | R+ 21 | R− 65
GM2 | | R− 43

For each pair-wise comparison, we give the T value and mark whether T is the sum of ranks for the positive differences (R+) or for the negative differences (R−). For example, R+ 21 means T is the sum of ranks for positive differences and its value is 21; as 21 ≤ 30, we reject the null hypothesis.
Table 11: Results of the Wilcoxon signed-ranks test for CDK-specific generalization performance comparison

T | CDKM2 | CDKM3 | CDKM4 | CDKM5
CDKM1 | R+ 21 | R+ 21 | R+ 35.5 | R+ 20
CDKM2 | | R+ 60 | R+ 59 | R− 50
CDKM3 | | | R+ 57.5 | R− 45
CDKM4 | | | | R− 51.5

For each pair-wise comparison, we give the T value and mark whether T is the sum of ranks for the positive differences (R+) or for the negative differences (R−).
For the other pair-wise comparisons, the null hypothesis is accepted. A potential reason why the methods differ in generalization performance is that they include different information in the negative training data. First, the positive training data of GM1, GM2 and GM3 contain the same amount of information; therefore, the performance differences are mainly caused by the negative data. For GM2, the negative data include two parts: the first is extracted from the unphosphorylated sites of phosphorylated proteins, and the second from the unphosphorylated sites of unphosphorylated proteins. In contrast, the negative data constructed by GTrainN1 contain only the first part, whereas the negative data constructed by GTrainN3 contain only the second part. Training data sets with more diverse information generally produce classifiers with better generalization performance.

For the CDK-specific experimental results shown in Table 11, CDKM2, CDKM3 and CDKM5 have better generalization performance than CDKM1, whereas for the other pair-wise comparisons we accept the hypothesis that the two compared data sets have the same generalization performance. For kinase-specific predictions, the reason for the different generalization performance of the training data construction methods is similar. For a given kinase, the negative examples can include four parts: the unphosphorylated sites of the kinase's substrates (each protein has at least one site phosphorylated by the kinase), the phosphorylated and the unphosphorylated sites of the other phosphorylated proteins excluding the kinase's substrates, and the unphosphorylated sites of the unphosphorylated proteins. Different training data construction methods include different kinds of negative data and, finally, the trained classifiers have different prediction capabilities.
For both non-kinase-specific and kinase-specific predictions, we compute the rank of the training data with respect to their average AUC performance over the test data sets for each classifier, and plot the rank distributions in Figures 9 and 10. From the rank distributions, we can derive results consistent with the Wilcoxon signed-ranks test. Figure 9 shows that GM2 has a better rank over the 16 classifiers than GM1 and GM3. From Figure 10, we can see that the ranks are disordered, and it is hard to distinguish which method is clearly better. The experimental results for PKA and CK2 are shown in Supplementary Tables S29 and S30 and Supplementary Figures S13 and S14.
Summary
With the Musite feature, for non-kinase-specific predictions, GM2 has significantly better generalization performance than GM1; for the other comparisons, the generalization performance is statistically the same. For kinase-specific predictions, taking CDK as an example, the generalization performance of CDKM2, CDKM3 and CDKM5 is significantly better than that of CDKM1, whereas for the other comparisons the generalization performance is statistically the same. In this article, we examine the generalization performance of training data constructions with the Musite feature and the categorical feature. Obviously, when using different feature extraction methods, the generalization performance of the training data construction methods may change, and it is hard to examine all possible features. Furthermore, for different kinases, the generalization performance of different training data construction methods also differs. When the feature extraction method and kinase are fixed, it is possible to find a training data construction method that has better generalization performance for predicting unknown phosphorylation sites.
Figure 9: Rank distribution of training data’s average performance on GT11, GT12 and GT13 for each classifier in non-kinase-specific predictions. The x-axis represents the classifier, and y-axis is the rank of average AUC value.
Figure 10: Rank distribution of training data’s average performance on CDKT11, CDKT12, CDKT13, CDKT14 and CDKT15 for each classifier in CDK-specific predictions. The x-axis represents the classifier, and y-axis is the rank of average AUC value.
CONCLUSION
In this article, we first summarize the available data construction methods for both training data and test data in the context of phosphorylation site prediction. We then pose three questions and conduct experiments to answer them. First, our experimental results show that different data construction methods can produce significant differences in performance assessment. For the comparison of training data construction methods, a significant performance gap can be observed: GM2 (GM3) has the best (worst) performance in non-kinase-specific prediction, whereas KM2 (KM3) has the best (worst) average performance over three different kinases in kinase-specific prediction. For the comparison of test data construction methods, ID2 has the best performance and CV has the worst performance in both non-kinase-specific and kinase-specific predictions.
Second, we summarize the populations of the real world scenarios and their corresponding negative test data construction methods. For non-kinase-specific prediction, if the scenario is to predict phosphorylation sites from phosphorylated proteins, GTrainN1 is unbiased and the other methods are not. If the scenario is to determine the phosphorylation sites from the mixture of phosphorylated and unphosphorylated proteins, GTrainN2 is unbiased and the other methods are biased. For kinase-specific prediction, the unbiased negative test data construction method depends on the set of target proteins as well. If our objective is to predict phosphorylation sites from (i) kinase-specific proteins with at least one site phosphorylated by the target kinase, (ii) both kinase-specific proteins and other phosphorylated proteins or (iii) both phosphorylated and unphosphorylated proteins, then the corresponding unbiased negative data construction methods are (i) KTrainN1, (ii) KTrainN2 and (iii) KTrainN5, respectively. Finally, we empirically compare the generalization performance of different training data construction methods. This reveals that GM2 has the best generalization performance and GM1 the worst in
non-kinase-specific prediction, whereas KM2 has the best generalization performance and KM3 has the worst generalization performance in kinase-specific prediction.
SUPPLEMENTARY DATA Supplementary data are available online at http:// bib.oxfordjournals.org/.
Key Points
• Many computational methods formulate phosphorylation site prediction as a classification problem; however, their construction methods for training data and test data are different, which may lead to biased performance evaluation and comparison.
• For both non-kinase-specific and kinase-specific predictions, we categorize the construction methods of training data and test data in the literature and give rigorous mathematical definitions.
• We conduct statistical tests and find that different data construction methods can have significantly different prediction performance.
• The construction of unbiased test data is important for unbiased performance assessment. We summarize different real world scenarios for phosphorylation site prediction and their corresponding biased and unbiased negative test data construction methods.
• Generalization performance relates to the prediction capability of models and is an important measure for model selection and model assessment. We conduct experiments and find that different training data construction methods can have different generalization performance.
FUNDING
This work was supported in part by the Natural Science Foundation of China (grant nos. 61003176 and 61073051).
References
1. Ubersax JA, Ferrell JE Jr. Mechanisms of specificity in protein phosphorylation. Nat Rev Mol Cell Biol 2007;8:530-41.
2. Pinna LA, Ruzzene M. How do protein kinases recognize their substrates? Biochim Biophys Acta 1996;1314:191-225.
3. Miller ML, Blom N. Kinase-specific prediction of protein phosphorylation sites. Methods Mol Biol 2009;527:299-310.
4. Xue Y, Gao X, Cao J, et al. A summary of computational resources for protein phosphorylation. Curr Protein Pept Sci 2010;11:485-96.
5. Trost B, Kusalik A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 2011;27:2927-35.
6. Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update. SIGKDD Explor Newsl 2009;11:10-18.
7. Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 1999;294:1351-62.
8. Berry EA, Dalby AR, Yang ZR. Reduced bio basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms. Comput Biol Chem 2004;28:75-85.
9. Iakoucheva LM, Radivojac P, Brown CJ, et al. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res 2004;32:1037-49.
10. Ingrell CR, Miller ML, Jensen ON, et al. NetPhosYeast: prediction of protein phosphorylation sites in yeast. Bioinformatics 2007;23:895-7.
11. Tang YR, Chen YZ, Canchaya CA, et al. GANNPhos: a new phosphorylation site predictor based on a genetic algorithm integrated neural network. Protein Eng Des Sel 2007;20:405-12.
12. Gao J, Agrawal G, Thelen J, et al. A new machine learning approach for protein phosphorylation site prediction in plants. Lect Notes Comput Sci 2009;5462:18-29.
13. Biswas A, Noman N, Sikder A. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics 2010;11:273.
14. Lee TY, Bretana N, Lu CT. PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics 2011;12:261.
15. Gao J, Thelen JJ, Dunker AK, et al. Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol Cell Proteomics 2010;9:2586-600.
16. Gnad F, Ren S, Cox J, et al. PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol 2007;8:R250.
17. Xue Y, Ren J, Gao X, et al. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics 2008;7:1598-608.
18. Gao X, Jin C, Ren J, et al. Proteome-wide prediction of PKA phosphorylation sites in eukaryotic kingdom. Genomics 2008;92:457-63.
19. Dang TH, Van Leemput K, Verschoren A, et al. Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics 2008;24:2857-64.
20. Hjerrild M, Stensballe A, Rasmussen TE, et al. Identification of phosphorylation sites in protein kinase A substrates using artificial neural networks and mass spectrometry. J Proteome Res 2004;3:426-33.
21. Koenig M, Grabe N. Highly specific prediction of phosphorylation sites in proteins. Bioinformatics 2004;20:3620-7.
22. Jung I, Matsuyama A, Yoshida M, et al. PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship. BMC Bioinformatics 2010;11(Suppl 1):S10.
23. Jiang J, Ip H. Active learning for the prediction of phosphorylation sites. In: Liu D (ed). International Joint Conference on Neural Networks. Piscataway, NJ: IEEE, 2008, 3158-65.
24. Basu S, Plewczynski D. AMS 3.0: prediction of post-translational modifications. BMC Bioinformatics 2010;11:210.
25. Li T, Du P, Xu N. Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources. PLoS One 2010;5:e15411.
Downloaded from http://bib.oxfordjournals.org/ at Hong Kong Baptist University on September 23, 2014
Many computational methods formulate phosphorylation site prediction as a classification problem; however, their construction methods for training data and test data are different, which may lead to biased performance evaluation and comparison. For both non-kinase-specific and kinase-specific predictions, we categorize the construction methods of training data and test data in the literature and give rigorous mathematical definitions. We conduct statistical tests and find that different data construction methods can have significantly different prediction performance. The construction of unbiased test data is important for the unbiased performance assessment. We summarize different real world scenarios for phosphorylation site prediction and their corresponding biased and unbiased negative test data construction methods. Generalization performance relates to the prediction capability of models and is an important measure for model selection and model assessment.We conduct experiments and find that different training data construction methods can have different generalization performance.
8.
Data construction for phosphorylation site prediction
31. Dinkel H, Chica C, Via A, et al. Phospho.ELM: a database of phosphorylation site-supdate 2011. Nucleic Acids Res 2011;39(Suppl 1):D261–7. 32. Farriol-Mathis N, Garavelli JS, Boeckmann B, et al. Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics 2004;4:1537–50. 33. Altschul SF, Madden TL, Schffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25: 3389–402. 34. Shi Y. Serine/Threonine phosphatases: mechanism through structure. Cell 2009;139:468–84. 35. Demsˇar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006;7:1–30. 36. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, chap. 7. New York: Springer, 2008.
Downloaded from http://bib.oxfordjournals.org/ at Hong Kong Baptist University on September 23, 2014
26. Kim JH, Lee J, Oh B, et al. Prediction of phosphorylation sites using SVMs. Bioinformatics 2004;20:3179–84. 27. Huang HD, Lee TY, Tzeng SW, et al. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res 2005;33(Suppl 2):W226–9. 28. Wong YH, Lee TY, Liang HK, et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res 2007;35(Suppl 2):W588–94. 29. Blom N, Sicheritz-Pontn T, Gupta R, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004;4: 1633–49. 30. Li T, Li F, Zhang X. Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach. Proteins 2008;70:404–14.
855
Supplementary Document for “Data construction for phosphorylation site prediction”
Haipeng Gong, Xiaoqing Liu, Jun Wu, Zengyou He*
School of Software, Dalian University of Technology, China
[email protected], [email protected], [email protected], [email protected]
*Corresponding author
1 Musite Feature Description
A general description of the features used in Musite is given here; the concrete feature extraction processes can be found in [1]. A toy illustration of two of these features follows the list.

• KNN scores. A KNN score measures whether the local sequence surrounding a query site is more similar to the sequences containing phosphorylation sites in the positive set or to those containing non-phosphorylation sites in the negative set.

• Protein disorder. Known phosphorylation sites show a clear preference for disordered regions, which justifies the use of disorder scores as features for phosphorylation site prediction.

• Amino acid composition surrounding phosphorylation sites. The compositions of amino acids surrounding phosphorylation sites and non-phosphorylation sites are quite different, so amino acid frequencies are used as features for phosphorylation site prediction.
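The following Python sketch shows how the KNN score and the amino acid composition features could be computed. It is illustrative only: the identity-based similarity, the window length and the example peptides are our own simplifying assumptions, and the actual extraction procedure in Musite [1] differs in detail.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def similarity(pep_a, pep_b):
    """Fraction of identical residues between two equal-length peptide
    windows (a crude stand-in for the similarity measure used in Musite)."""
    return sum(a == b for a, b in zip(pep_a, pep_b)) / len(pep_a)

def knn_score(query, positives, negatives, k=3):
    """Among the k training windows most similar to the query window,
    return the fraction that surround a known phosphorylation site."""
    scored = [(similarity(query, p), 1) for p in positives] + \
             [(similarity(query, n), 0) for n in negatives]
    scored.sort(key=lambda t: t[0], reverse=True)
    return sum(label for _, label in scored[:k]) / k

def aa_composition(window):
    """Frequency of each of the 20 amino acids in the window
    surrounding a candidate site."""
    counts = Counter(window)
    return [counts.get(aa, 0) / len(window) for aa in AMINO_ACIDS]

# Hypothetical 9-residue windows centered on a candidate serine.
query = "AKRRSSVAE"
positives = ["AKRRSSLAE", "GKRKSSEEE"]  # windows around known phosphosites
negatives = ["PLLVSAGGA", "TTIISPAAA"]  # windows around non-sites

print(knn_score(query, positives, negatives, k=3))
print(aa_composition(query))
```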
2 Statistical Tests

• Friedman test. The Friedman test ranks the training (testing) data construction methods for each classifier separately: the best performing method gets rank 1, the second best rank 2, and so on, with average ranks assigned in case of ties. Let $r_i^j$ be the rank of the $j$-th training (testing) data construction method on the $i$-th classifier. The Friedman test compares the average ranks of the methods,

\[
R_j = \frac{1}{N} \sum_i r_i^j \qquad (1)
\]

Under the null hypothesis that all the training (testing) data construction methods are equivalent, all $R_j$ should be equal. The test statistic is

\[
\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (2)
\]

\[
F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2} \qquad (3)
\]

where $k$ is the number of data construction methods and $N$ is the number of classifiers. $F_F$ is distributed according to the F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. A numerical sketch of this test is given after this list.
• Nemenyi test. The Nemenyi test is used when all the training (testing) data construction methods are compared with each other. The performance of two data construction methods is significantly different if their average ranks differ by at least the critical difference

\[
CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \qquad (4)
\]

where the critical value $q_\alpha$ is based on the Studentized range statistic divided by $\sqrt{2}$. The critical values used in our experiments are: for $\alpha = 0.05$, $q_\alpha = 2.728$ when $k = 5$, $q_\alpha = 2.569$ when $k = 4$, and $q_\alpha = 2.343$ when $k = 3$.

• Bonferroni-Dunn test. When all training (testing) data construction methods are compared with a single control method, the Bonferroni-Dunn test can be used instead of the Nemenyi test. The CD is calculated with the same equation as for the Nemenyi test, but with different critical values $q_\alpha$: for $\alpha = 0.05$, $q_\alpha = 2.498$ when $k = 5$, $q_\alpha = 2.394$ when $k = 4$, and $q_\alpha = 2.241$ when $k = 3$.

• Holm's step-down test. Step-up and step-down procedures sequentially test the hypotheses ordered by their significance. For each pair of methods, the statistic

\[
z = (R_i - R_j) \bigg/ \sqrt{\frac{k(k+1)}{6N}} \qquad (5)
\]

is used to obtain the corresponding $p$ value from the table of the normal distribution, which is then compared with an appropriate $\alpha$. Denote the ordered $p$ values by $p_1 \le p_2 \le \dots \le p_{k-1}$. The test starts with the most significant $p$ value: if $p_1$ is below $\alpha/(k-1)$, the corresponding hypothesis is rejected and $p_2$ is compared with $\alpha/(k-2)$; if the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well.

• Wilcoxon signed-ranks test. This test is used to compare the performance of two data construction methods. It ranks the performance differences of the two methods over the classifiers and compares the ranks of the positive and the negative differences. Let $d_i$ be the difference between the performance scores of the two data construction methods on the $i$-th of $N$ classifiers. The differences are ranked according to their absolute values, with average ranks assigned in case of ties. Let $R^+$ be the sum of ranks for the classifiers on which the first data construction method outperforms the second, and $R^-$ the sum of ranks for the opposite case. Ranks for $d_i = 0$ are split evenly between the two sums; if there is an odd number of them, one is ignored:

\[
R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i) \qquad (6)
\]

\[
R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i) \qquad (7)
\]

A sketch of equations (4), (6) and (7) is also given after this list.
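To make equations (1)–(3) concrete, the following Python sketch computes the Friedman statistic from a matrix of performance scores (one row per classifier, one column per data construction method). The scores are hypothetical and are not taken from our experiments.

```python
import numpy as np
from scipy.stats import rankdata, f

# Hypothetical performance scores: N classifiers (rows) x k methods (columns).
scores = np.array([
    [0.81, 0.78, 0.84, 0.75],
    [0.79, 0.77, 0.82, 0.74],
    [0.83, 0.80, 0.85, 0.78],
    [0.76, 0.74, 0.80, 0.72],
    [0.80, 0.79, 0.83, 0.77],
])
N, k = scores.shape

# Rank methods within each classifier: rank 1 = best, ties get average ranks.
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
R = ranks.mean(axis=0)  # average ranks R_j, equation (1)

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4)  # eq. (2)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                        # eq. (3)

# F_F follows an F-distribution with (k-1, (k-1)(N-1)) degrees of freedom.
p_value = f.sf(F_F, k - 1, (k - 1) * (N - 1))
print(R, chi2_F, F_F, p_value)
```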
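Similarly, here is a minimal sketch of the Nemenyi critical difference (equation 4) and of the Wilcoxon rank sums (equations 6 and 7). The $q_\alpha$ value is one of those listed above; the performance differences are illustrative, and the adjustment for an odd number of zero differences is omitted for brevity.

```python
import math
from scipy.stats import rankdata

def critical_difference(k, N, q_alpha):
    """Nemenyi critical difference, equation (4)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

# k = 4 methods, N = 5 classifiers, q_alpha = 2.569 at alpha = 0.05.
print(critical_difference(4, 5, 2.569))

def wilcoxon_sums(d):
    """R+ and R- of the Wilcoxon signed-ranks test, equations (6) and (7).
    d holds the per-classifier performance differences between two methods."""
    ranks = rankdata([abs(x) for x in d])  # average ranks in case of ties
    r_zero = 0.5 * sum(r for x, r in zip(d, ranks) if x == 0)
    r_plus = sum(r for x, r in zip(d, ranks) if x > 0) + r_zero
    r_minus = sum(r for x, r in zip(d, ranks) if x < 0) + r_zero
    return r_plus, r_minus

# Hypothetical differences between two data construction methods.
print(wilcoxon_sums([0.02, -0.01, 0.03, 0.00, 0.01]))
```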