GlycoMine: a machine learning-based approach for predicting N-, C-, and O-linked glycosylation in the human proteome

Supplementary Material

SUPPLEMENTAL METHODS

1. Collection of datasets and preprocessing

The experimentally determined C-linked, N-linked, and O-linked glycosylation sites were extracted from four public databases: PhosphoSitePlus (Hornbeck et al., 2012), SysPTM (Li et al., 2009), O-GlycBase (version 6.0) (Gupta et al., 1999), and UniProt (Chan and Consortium, 2010). All of the experimentally verified glycosylation sites in human proteins were extracted from these databases. To ensure that all of the glycosylation sites collected had been determined experimentally, glycosylation sites that were labeled as ‘Probable,’ ‘Potential,’ or ‘By similarity’ in UniProt were removed when processing the entries. Among the protein entries collected from UniProt, the Mucin-2 protein (UniProt ID: Q02817) (Inoue et al., 2001) had 1,200 O-linked glycosylation sites that encompassed many repeated amino acid fragments; thus, only one fragment from this protein was retained in the final datasets. Sequence redundancy in the curated datasets was removed using the CD-HIT program (Huang et al., 2010) to ensure that the sequence identity between any two proteins was no greater than 30%. This step was essential for avoiding overestimates of the performance of machine learning-based classifiers.

The experimentally determined glycosylation sites were used as positive samples. However, it is difficult to prove that a protein is not glycosylated under any condition, and it was therefore difficult to collect proteins that could be considered non-glycosylated. A background dataset that contained all human proteins was retrieved from UniProt. Negative samples (no glycosylation sites) were selected from this background protein set, which excluded all experimentally verified glycosylated proteins. In particular, amino acid residues (N, S, T, and W) that were not experimentally verified as glycosylation sites were regarded as negative samples.

After removing sequence redundancy, the final datasets were divided into two random subsets, which are referred to as the benchmark dataset and the independent dataset (~20% of the size of the benchmark dataset). The performance of our method and other existing methods was compared using five-fold cross-validation tests on the benchmark dataset and further validated on the independent dataset. In addition, feature selection and model training were performed with the benchmark dataset. To extract the local sequence or structural features around potential glycosylation sites, we used a local sliding window (Trost and Kusalik, 2011) that comprised 15 residues, where the potential glycosylation site was located at the center with seven neighboring residues upstream and downstream of the central site.
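The 15-residue sliding window described above can be illustrated with a minimal Python sketch. The function name `extract_windows` and the use of a padding character `X` for sites near the sequence termini are assumptions made for illustration; the source does not state how terminal sites were handled.

```python
def extract_windows(sequence, targets="NSTW", flank=7, pad="X"):
    """Yield (1-based position, window) for every candidate residue.

    Each window has 2 * flank + 1 residues with the candidate site at
    the center. Padding with `pad` at the termini is an assumption.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Residue i of the sequence sits at index i + flank in `padded`,
            # so this slice is centered on it.
            yield i + 1, padded[i:i + 2 * flank + 1]


if __name__ == "__main__":
    for position, window in extract_windows("MKNASTW"):
        print(position, window)
```

Every N, S, T, or W is treated as a candidate site here; whether a window is a positive or negative sample would then depend on the experimental annotation.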

2. Feature extraction

In this section, we discuss the different types of features extracted and used for training GlycoMine models. These include sequence-derived features, predicted secondary structure, predicted solvent accessibility, protein functional features, and functional annotations from various public resources. A list of the collected features is shown in Table S1. To extract the local features around glycosylation sites, we used a local sliding window approach, with a window size of 15 amino acids (i.e., 7 upstream residues and 7 downstream residues around a central glycosylation site).

2.1 Sequence-derived features

Five types of sequence-derived features were extracted: AAindex, physicochemical properties of amino acids, position-specific scoring matrix (PSSM), evolutionary conservation score, and CKSAAP features.

(1) AAindex
AAindex (Kawashima et al., 2008) is a database of numerical indices representing multiple biochemical and physicochemical properties of amino acids. It is a flat file database that consists of three sections: AAindex1, AAindex2 and AAindex3. Each entry in AAindex1 provides 20 numerical values, one per amino acid. AAindex2 includes amino acid substitution matrices, while AAindex3 provides statistical protein contact potentials. In this study, only AAindex1 was used: the average of each AAindex1 index over the local window around the central residue was calculated, and the calculated values were encoded as input vectors for the machine learning models.

(2) Physicochemical properties of amino acids
In this study, we used 14 types of amino acid properties. These include pK1 (-COOH), pK2 (-NH3+), pKR (R group), pI, hydropathy index, occurrence in proteins (%), percentage of buried residues (%), average volume, accessible surface area, van der Waals volume, ranking of amino acid polarities, side chain polarity, conformational preferences of the amino acids (α-helix) and conformational preferences of the amino acids (β-strand).

(3) PSSM
The PSSM of a sequence alignment contains sequence evolutionary information, which has proven to be useful for improving prediction performance. Each element of the PSSM represents the probability of a residue at a specific position in the multiple sequence alignment. A number of previous studies have shown that multiple sequence alignments in the form of PSSMs are a key factor in improving prediction performance. In this study, the PSSM profile for each sequence was extracted by performing a PSI-BLAST search (Altschul et al., 1997) against the NCBI non-redundant database using the default E-value cutoff. We then encoded these features for the central residues using a sliding window approach.

(4) Evolutionary conservation score
Another important sequence-derived feature is the evolutionary conservation score. A higher conservation score of a residue generally indicates that this residue is more likely to be important for protein function. The conservation score is defined as follows:

$$\mathrm{Score}_i = -\sum_{j=1}^{20} p_{i,j} \log_2 p_{i,j},$$

where $p_{i,j}$ is the frequency of amino acid $j$ at position $i$. All parameters in this formula were extracted from the PSSM output generated by PSI-BLAST.

(5) CKSAAP (the composition of k-spaced amino acid pairs)
To extract CKSAAP features, we calculated the composition of each possible k-spaced amino acid pair $i$ (Chen et al., 2009; Chen et al., 2013) using the following equation:

$$\mathrm{CKSAAP}_i = \frac{N_i}{W - k - 1}, \quad i = 1, 2, \ldots, (k_{\max} + 1) \times 400,$$

where $N_i$ is the count of the k-spaced amino acid pair $i$ and $W$ is the window size. The maximum space taken into consideration ($k_{\max}$) was set to 5, resulting in a 2400-dimensional feature vector.
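The CKSAAP encoding can be sketched in a few lines of Python. This is an illustrative reading of the equation above rather than the authors' Perl script (Table S1). The function name `cksaap`, the ordering of the 400 pairs, and the decision to skip pairs containing non-standard characters are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino acid pairs, in a fixed (assumed) order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window, k_max=5):
    """Return the (k_max + 1) * 400 CKSAAP values for one window."""
    w = len(window)
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # A window of length W contains W - k - 1 pairs separated by k residues.
        for i in range(w - k - 1):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # assumption: ignore padding or non-standard residues
                counts[pair] += 1
        features.extend(counts[p] / (w - k - 1) for p in PAIRS)
    return features


if __name__ == "__main__":
    vector = cksaap("ACDEFGHIKLMNPQR")
    print(len(vector))
```

With a 15-residue window and k from 0 to 5, the result has 6 × 400 = 2400 entries, matching the dimensionality stated above.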

2.2 Predicted structural features

(1) Predicted secondary structure
Protein secondary structure is a useful type of feature. Because few proteins in our datasets have solved structures, we predicted protein secondary structure using SABLE (Wagner et al., 2005). SABLE assigns one of three secondary structure states to each residue based on the primary sequence information: H, E and C, which denote alpha-helix, beta-strand and coil, respectively. We encoded the secondary structure using a 3-bit binary encoding scheme (Song et al., 2006; Song et al., 2012).

(2) Predicted solvent accessibility
Protein solvent accessibility is another important type of feature for the prediction. SABLE was used to predict solvent accessibility from protein sequences. It provides a quantitative score ranging from 0 to 9 that represents the relative solvent accessibility of a residue, from fully buried to fully exposed.

(3) Disordered region
A protein disordered region is defined as a protein region that lacks well-defined tertiary structure, being either fully or partially unfolded. Many studies indicate that disordered regions represent an important feature for studying protein functions (Dunker and Obradovic, 2001). For example, it has been suggested that another type of PTM, phosphorylation, occurs preferentially in disordered regions rather than ordered regions (Dunker et al., 2008; Iakoucheva et al., 2004). Therefore, we also predicted protein disordered regions and encoded them as features for our models. The disordered regions were predicted by DISOPRED2 (Ward et al., 2004).

2.3 Functional annotations

Protein functional annotations were extracted from the ‘FT’ lines of UniProt entries (http://www.uniprot.org/) (UniProt Consortium, 2010). Eight types of functional annotations were extracted as features in this study: (1) functional domain, (2) nucleotide-binding site, (3) disulfide bond, (4) post-translationally modified residue (glycosylation sites were removed), (5) active site, (6) natural variant, (7) metal ion-binding site and (8) other binding sites for any chemical groups. Within a sliding window, an amino acid residue was encoded as ‘1’ if it had an annotation of a specific function, and as ‘0’ otherwise.
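As a small illustration of this binary encoding, the sketch below marks each of the 15 window positions as 1 or 0 for a single annotation type. The function name `encode_annotation` and the representation of annotations as a set of 1-based residue positions are assumptions, not part of the original pipeline.

```python
def encode_annotation(center, annotated_positions, flank=7):
    """Encode one annotation type over a window centered on `center`.

    `center` and `annotated_positions` use 1-based residue numbering.
    Positions outside the protein simply never appear in the set,
    so they are encoded as 0.
    """
    return [
        1 if position in annotated_positions else 0
        for position in range(center - flank, center + flank + 1)
    ]


if __name__ == "__main__":
    # Hypothetical example: disulfide-bonded residues at positions 3, 9 and 17.
    print(encode_annotation(10, {3, 9, 17}))
```

Repeating this for each of the eight annotation types would give eight such vectors per candidate site.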

2.4 Functional features

The functional features we used include: (1) Biological Process features from Gene Ontology (BP) (Ashburner et al., 2000); (2) Cellular Component features from Gene Ontology (CC) (Ashburner et al., 2000); (3) Molecular Function features from Gene Ontology (MF) (Ashburner et al., 2000); (4) Functional domain features from InterPro (this feature type was denoted as “InterPro” for brevity) (Hunter et al., 2012); (5) Pathway features from KEGG (Wixon and Kell, 2000); (6) Functional domain features from Pfam (Punta et al., 2012); and (7) Protein-protein interactions from STRING (Jensen et al., 2009). After extracting these functional features for each of the glycosylated proteins in our datasets, we performed statistical tests to identify over- and under-represented functional features, as described below.

2.5 Analysis of over- and under-represented features

As these functional features usually contain redundant and noisy information, it is time-consuming for machine learning classifiers to process all of them. Accordingly, we followed a method similar to that proposed by Li et al. (Li et al., 2010) to identify the over- and under-represented functional features before they were input to the machine learning classifiers. To distinguish over-represented terms from the background protein set, which includes both glycosylated and non-glycosylated proteins, we performed two-sided hypergeometric tests in R. Terms with p-values less than 0.01 were considered significant. The p-values were defined as follows:

$$p = F_{\mathrm{hypergeom}}(q, m, n, k),$$

where $q$ is the number of samples with the feature in the dataset, $m$ is the number of samples with the feature in the background set, $n$ is the number of samples without the feature in the background set, and $k$ is the number of samples in the study set. Finally, p-values were adjusted by Bonferroni correction for testing on multiple terms.

After identifying the over-represented features, we scored each protein in the dataset using the identified over-represented features, following Li et al.’s previous work (Li et al., 2010). For a binary feature index i (i = 1, 2, …, n, where n is the number of significant functional features for glycosylated proteins), its value xi was determined from the functional annotations of the protein and encoded as 1 if the functional annotation existed, or 0 if it did not. In this way, we calculated the overall protein score by a simple log-odds ratio approach, which can be summed up as the final score, using the Jaccard distance. The identified over-represented functional features were encoded as additional features for the machine learning models and underwent feature selection, with the aim of selecting the more important and relevant features that contributed to the prediction.
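The enrichment test can be sketched in pure Python using the same (q, m, n, k) parameterization. The source performed the tests in R; the code below is only an illustration. In particular, two-sided hypergeometric p-values can be defined in several ways, and the doubling rule used here (twice the smaller tail, capped at 1) is an assumption rather than the authors' stated choice. The function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, m, n, k):
    """P(X = x) when drawing k items from m 'with feature' and n 'without'."""
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)


def hypergeom_test(q, m, n, k):
    """Return (lower tail, upper tail, two-sided) p-values for observing q."""
    lo, hi = max(0, k - n), min(k, m)  # support of the distribution
    lower = sum(hypergeom_pmf(x, m, n, k) for x in range(lo, q + 1))
    upper = sum(hypergeom_pmf(x, m, n, k) for x in range(q, hi + 1))
    # Assumed convention: double the smaller tail and cap at 1.
    two_sided = min(1.0, 2 * min(lower, upper))
    return lower, upper, two_sided


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


if __name__ == "__main__":
    # Toy numbers, not taken from the paper.
    print(hypergeom_test(q=5, m=5, n=5, k=5))
    print(bonferroni([0.004, 0.5]))
```

A small upper-tail value points to over-representation in the study set and a small lower-tail value to under-representation.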

3. Feature selection

It is common practice to employ feature selection techniques to reduce the dimensionality of the feature vectors in the initial sets (Saeys et al., 2007) by eliminating features that do not contribute to the prediction. Thus, we developed a novel two-step feature selection strategy based on minimum redundancy maximum relevance (mRMR) and information gain (IG), both of which are well-established feature selection methods. They were applied initially to rank the importance and contribution of each feature type to glycosylation prediction. IG regards the features with the highest IG as important, whereas mRMR ranks features according to their relevance to the target class and the redundancy between the features. These two feature selection methods are briefly discussed below.

3.1 mRMR feature selection

The relevance and redundancy are quantified by mutual information (MI), which is defined as:

$$MI(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy,$$

where $x$ and $y$ are two vectors, $p(x, y)$ is the joint probabilistic density of the two vectors, and $p(x)$ and $p(y)$ are the marginal probabilistic densities. It is assumed that $\Omega$ represents the entire feature space that contains all of the features, $\Omega_s$ is the feature set that contains $m$ previously selected features, and $\Omega_t$ denotes the feature set that contains $n$ features that need to be selected. The relevance $D$ between a feature $f$ in the overall feature set $\Omega$ and the target class $c$ can be calculated by

$$D = MI(f, c).$$

The redundancy $R$ between a feature $f$ in $\Omega_t$ and all the features in $\Omega_s$ is calculated by

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} MI(f, f_i).$$

Thus, the mRMR criterion (Peng et al., 2005), which combines the two equations given above to obtain the feature $f_j$ in $\Omega_t$ with the maximum relevance and minimum redundancy, is defined as:

$$\max_{f_j \in \Omega_t} \left[ MI(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} MI(f_j, f_i) \right], \quad j = 1, 2, \cdots, n.$$

The mRMR feature selection completes $N$ rounds of feature evaluation, thereby obtaining a feature set $S$ with $N = m + n$ features:

$$S = \{ f'_1, f'_2, \cdots, f'_h, \cdots, f'_N \}.$$

The index $h$ of each feature indicates the round in which the feature was selected.
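The greedy selection implied by the mRMR criterion can be sketched as follows. This is a simplified illustration that assumes discrete feature values and estimates MI from observed counts, whereas the definition above is written for continuous densities. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical MI (in bits) between two equally long discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_a * count_b).
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def mrmr_rank(features, target, n_select):
    """Greedily pick features maximizing relevance minus mean redundancy.

    `features` maps a feature name to its list of discrete values.
    """
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected, remaining = [], list(features)

    def score(name):
        if not selected:  # first round: relevance only
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    toy = {
        "f1": [0, 0, 0, 0, 1, 1, 1, 1],
        "f2": [0, 0, 0, 0, 1, 1, 1, 1],  # exact copy of f1, fully redundant
        "f3": [0, 0, 1, 1, 0, 0, 1, 1],  # weakly relevant, independent of f1
    }
    print(mrmr_rank(toy, target=[0, 0, 0, 1, 1, 1, 1, 1], n_select=3))
```

In the toy data, the duplicated feature is ranked last even though it is highly relevant on its own, which is the behavior the redundancy term is meant to produce.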

3.2 IG feature selection

The entropy or information of a feature $X$ is defined as:

$$Info(X) = -\sum_{i} P(x_i) \log_2 P(x_i),$$

where $P(x_i)$ is the prior probability of $x_i$, and the $x_i$ are the values of feature $X$. Given another feature $Y$, the conditional entropy of $X$ is defined as:

$$Info(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j),$$

where $P(x_i \mid y_j)$ is the a posteriori probability of $x_i$ given the value $y_j$ in set $Y$. The amount by which the entropy of $X$ decreases reflects the additional information about $X$ provided by $Y$. Thus, the IG is defined as:

$$IG(X \mid Y) = Info(X) - Info(X \mid Y).$$
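The three IG definitions translate directly into code. The sketch below assumes discrete values and estimates all probabilities from observed frequencies; the function names are illustrative. When ranking features, one natural reading (an assumption here) is that X is the class label and Y is the candidate feature.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Info(X): Shannon entropy in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Info(X|Y): entropy of x within each value of y, weighted by P(y)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    n = len(x)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def information_gain(x, y):
    """IG(X|Y) = Info(X) - Info(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(information_gain(labels, [1, 1, 0, 0]))  # feature identical to labels
    print(information_gain(labels, [1, 0, 1, 0]))  # feature unrelated to labels
```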

4. Performance evaluation

We used five measures, i.e., sensitivity, specificity, precision, accuracy, and Matthews correlation coefficient (MCC), to evaluate the prediction performance in the present study.

Sensitivity, or true positive rate (TPR), is the percentage of correctly predicted glycosylation sites:

$$Sensitivity = \frac{TP}{TP + FN}.$$

Specificity is the percentage of correctly predicted non-glycosylation sites:

$$Specificity = \frac{TN}{TN + FP}.$$

Precision (PRE) is defined as

$$Precision = \frac{TP}{TP + FP}.$$

Accuracy is the percentage of correct predictions of both positives and negatives:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$$

MCC is a measure of the quality of binary classifications (Matthews, 1975), which is defined as

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively. MCC = 1 signifies a perfect classification, whereas MCC = 0 indicates a completely random classification.

SUPPLEMENTAL RESULTS

Characterization of protein functional features associated with each type of glycosylation

We investigated the significantly enriched functional features that were closely associated with protein glycosylation. Each protein (glycosylated proteins or proteins in the background protein set) is associated with a large number of features. Thus, to reduce the dimensionality of the feature space, we identified over-represented functional feature terms by performing two-sided hypergeometric tests based on comparisons with the background protein set (Supplemental Methods). The significantly enriched KEGG, GO, and STRING terms for each type of glycosylated protein are shown in Tables S2–S7 (Supplemental Material). For example, there are 499 significantly enriched protein interaction partners of O-linked glycosylation in terms of PPI features. For BP terms, O-linked glycosylated proteins are enriched in “transcription regulation,” “immune response,” “metabolic process,” and “apoptotic process” (Table S2). For MF terms, O-linked glycosylated proteins are enriched in functions related to binding or proteolysis, such as “heparin binding,” “eukaryotic cell surface binding,” and “endopeptidase activity”. For KEGG pathway features, O-linked glycosylated proteins are related to signaling pathways. There are 32 KEGG pathway features, ten of which are signaling pathway terms: “PI3K-Akt signaling pathway,” “MAPK signaling pathway,” “NF-kappa B signaling pathway,” “Toll-like receptor signaling pathway,” “Jak-STAT signaling pathway,” “HIF-1 signaling pathway,” “NOD-like receptor signaling pathway,” “Adipocytokine signaling pathway,” “TGF-beta signaling pathway,” and “T cell receptor signaling pathway”; “Complement and coagulation cascades” is also enriched. Moreover, there are three cancer-related pathways, i.e., “Pathways in cancer,” “Small cell lung cancer,” and “Transcriptional misregulation in cancer,” as well as two leukemia-related pathways, i.e., “Acute myeloid leukemia” and “Chronic myeloid leukemia.” Overall, these results show that O-linked glycosylation is a type of PTM involved in cellular processes related to signaling or cancer pathways. The three types of glycosylation share common enriched functional features, but they also have different enriched functional features (Tables S2–S7), thereby indicating the functional preference of each glycosylation type and highlighting the need to develop type-specific models to achieve better glycosylation prediction performance.

We were particularly interested in examining the possibility of discriminating glycosylated proteins from the background protein set by considering a cohort of significantly enriched functional features. Thus, we used a log-odds method (Li et al., 2010) to calculate the functional scores for each protein. A higher functional score indicated a greater likelihood that a protein would be glycosylated. The distributions of the functional scores of the three types of glycosylated proteins are shown in Fig. S2. The majority of the proteins in the background set had negative scores, whereas a large proportion of glycosylated proteins had higher positive scores. To examine the significance of the differences between the positive sets and the background sets with respect to the three types of glycosylation, we performed t-tests. The differences were statistically significant, with all P-values ≤1.45E-04. In summary, these results suggest that functional features could be used to distinguish glycosylated proteins from the background protein set and that they might be useful features for improving the prediction of glycosylation sites.

Fig. S1. Amino acid occurrences for the three types of glycosylation. Sequence logos were drawn using the pLogo software (O'Shea et al., 2013).

Fig. S2. Functional score distributions of the background proteins (red) and glycosylated proteins in the positive set (yellow).

Fig. S3. FFS curves based on the IG feature selection results for prediction of the three types of glycosylation. The values labeled in the graph correspond to the highest AUC scores obtained by FFS.

Fig. S4. FFS curves based on the mRMR feature selection results for prediction of the three types of glycosylation. The values labeled in the graph correspond to the highest AUC scores obtained by FFS.

Fig. S5. ROC curves of GlycoMine, EnsembleGly, GPP, NetNGlyc and NetOGlyc for glycosylation site prediction based on the benchmark datasets. Note that not all methods can be used to predict all three types of glycosylation.

Fig. S6. Screenshot of the interface of GlycoMine. (A) The online web interface of GlycoMine; (B) Window 1, the input interface of GlycoMine; (C) Window 2, the output interface of GlycoMine, which shows the predicted glycosylation sites within the amino acid sequences; and (D) Window 3, the output interface, which lists the predicted glycosylation site position, site pattern and probability score. In particular, the predicted glycosylation site positions are highlighted by different colors, which indicate different confidence values.

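Table S1. A list of features used in this study. The features used can be categorized into four major types: sequence-derived features, predicted structural features, protein functional annotation and functional features.

| Feature type | Annotation | Relevant tools or databases |
|---|---|---|
| Sequence-derived features | AAindex | AAindex |
| Sequence-derived features | Physicochemical properties | BioJava |
| Sequence-derived features | PSSM | PSI-BLAST |
| Sequence-derived features | Residue conservation score | PSI-BLAST |
| Sequence-derived features | CKSAAP features | Perl script |
| Predicted structural features | Predicted disordered region | DISOPRED2 |
| Predicted structural features | Predicted secondary structures | SABLE |
| Predicted structural features | Predicted dihedral angles | SpineX |
| Functional annotation | Domain | UniProt |
| Functional annotation | Nucleotide binding | UniProt |
| Functional annotation | Disulfide bond | UniProt |
| Functional annotation | Posttranslational modified residue | UniProt |
| Functional annotation | Glycosylation | UniProt |
| Functional annotation | Active site | UniProt |
| Functional annotation | Natural variant | UniProt |
| Functional annotation | Metal ion binding site | UniProt |
| Functional annotation | Binding site | UniProt |
| Functional features | Biological Process (BP) | Gene Ontology |
| Functional features | Molecular Function (MF) | Gene Ontology |
| Functional features | Cellular Component (CC) | Gene Ontology |
| Functional features | KEGG Pathway | KEGG |
| Functional features | Protein-protein interaction (PPI) | STRING |
| Functional features | Domain information | Pfam |
| Functional features | Domain information | InterPro |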

Table S2. Significantly enriched GO and KEGG terms for O-linked glycosylated proteins.

| Type | ID | Count in positive set | Count in human proteome | P-value | Term |
|---|---|---|---|---|---|
| BP | GO:0045944 | 19 | 1771 | 4.93E-08 | positive regulation of transcription from RNA polymerase II promoter |
| BP | GO:0045893 | 12 | 897 | 1.66E-06 | positive regulation of transcription, DNA-dependent |
| BP | GO:0043066 | 24 | 1230 | 3.69E-15 | negative regulation of apoptotic process |
| BP | GO:0006954 | 11 | 615 | 2.86E-07 | inflammatory response |
| BP | GO:0042744 | 4 | 50 | 6.37E-06 | hydrogen peroxide catabolic process |
| BP | GO:0006955 | 12 | 984 | 4.16E-06 | immune response |
| BP | GO:0007568 | 8 | 322 | 1.12E-06 | aging |
| BP | GO:0044281 | 18 | 1059 | 1.04E-10 | small molecule metabolic process |
| BP | GO:0016032 | 9 | 506 | 3.60E-06 | viral process |
| BP | GO:0000187 | 6 | 184 | 5.46E-06 | activation of MAPK activity |
| BP | GO:0000042 | 4 | 36 | 1.68E-06 | protein targeting to Golgi |
| BP | GO:0043154 | 6 | 145 | 1.40E-06 | negative regulation of cysteine-type endopeptidase activity involved in apoptotic process |
| BP | GO:0019221 | 11 | 252 | 3.12E-11 | cytokine-mediated signaling pathway |
| BP | GO:0030335 | 8 | 272 | 3.21E-07 | positive regulation of cell migration |
| BP | GO:0010827 | 6 | 60 | 7.34E-09 | regulation of glucose transport |
| BP | GO:0006959 | 5 | 100 | 4.33E-06 | humoral immune response |
| BP | GO:0002925 | 3 | 11 | 2.13E-06 | positive regulation of humoral immune response mediated by circulating immunoglobulin |
| BP | GO:0051897 | 9 | 170 | 3.66E-10 | positive regulation of protein kinase B signaling cascade |
| BP | GO:0015701 | 6 | 50 | 2.38E-09 | bicarbonate transport |
| BP | GO:0071456 | 6 | 114 | 3.45E-07 | cellular response to hypoxia |
| BP | GO:0050731 | 7 | 155 | 1.01E-07 | positive regulation of peptidyl-tyrosine phosphorylation |
| BP | GO:0007596 | 12 | 627 | 4.02E-08 | blood coagulation |
| BP | GO:0030168 | 15 | 436 | 2.35E-13 | platelet activation |
| BP | GO:0015758 | 7 | 125 | 2.33E-08 | glucose transport |
| BP | GO:0019048 | 18 | 803 | 1.20E-12 | modulation by virus of host morphology or physiology |
| BP | GO:0051918 | 4 | 14 | 3.01E-08 | negative regulation of fibrinolysis |
| BP | GO:0071803 | 3 | 9 | 1.09E-06 | positive regulation of podosome assembly |
| BP | GO:0020027 | 3 | 10 | 1.55E-06 | hemoglobin metabolic process |
| BP | GO:0002576 | 11 | 199 | 2.46E-12 | platelet degranulation |
| BP | GO:0017187 | 4 | 23 | 2.61E-07 | peptidyl-glutamic acid carboxylation |
| BP | GO:0042730 | 6 | 47 | 1.62E-09 | fibrinolysis |
| BP | GO:0051693 | 4 | 37 | 1.88E-06 | actin filament capping |
| BP | GO:0007597 | 8 | 23 | 4.16E-16 | blood coagulation, intrinsic pathway |
| BP | GO:0007598 | 4 | 8 | 2.14E-09 | blood coagulation, extrinsic pathway |
| BP | GO:0010888 | 3 | 11 | 2.13E-06 | negative regulation of lipid storage |
| BP | GO:0046832 | 2 | 2 | 5.60E-06 | negative regulation of RNA export from nucleus |
| CC | GO:0016021 | 6 | 9756 | 8.71E-06 | integral to membrane |
| CC | GO:0005737 | 42 | 8368 | 1.54E-06 | cytoplasm |
| CC | GO:0005576 | 30 | 2736 | 3.16E-12 | extracellular region |
| CC | GO:0005829 | 30 | 5430 | 9.32E-06 | cytosol |
| CC | GO:0005615 | 37 | 1608 | 2.38E-25 | extracellular space |
| CC | GO:0009986 | 12 | 797 | 4.98E-07 | cell surface |
| CC | GO:0005887 | 17 | 2246 | 2.22E-05 | integral to plasma membrane |
| CC | GO:0005654 | 21 | 1647 | 4.92E-10 | nucleoplasm |
| CC | GO:0015629 | 8 | 529 | 3.84E-05 | actin cytoskeleton |
| CC | GO:0009897 | 12 | 463 | 1.47E-09 | external side of plasma membrane |
| CC | GO:0030141 | 5 | 143 | 2.42E-05 | secretory granule |
| CC | GO:0030863 | 5 | 69 | 6.94E-07 | cortical cytoskeleton |
| CC | GO:0034399 | 4 | 42 | 3.16E-06 | nuclear periphery |
| CC | GO:0000790 | 6 | 256 | 3.43E-05 | nuclear chromatin |
| CC | GO:0042405 | 4 | 23 | 2.61E-07 | nuclear inclusion body |
| CC | GO:0034364 | 3 | 27 | 3.64E-05 | high-density lipoprotein particle |
| CC | GO:0005796 | 8 | 152 | 3.70E-09 | Golgi lumen |
| CC | GO:0042627 | 3 | 18 | 1.04E-05 | chylomicron |
| CC | GO:0031093 | 7 | 116 | 1.39E-08 | platelet alpha granule lumen |
| CC | GO:0008091 | 4 | 28 | 5.96E-07 | spectrin |
| CC | GO:0001652 | 2 | 4 | 3.35E-05 | granular component |
| CC | GO:0031838 | 2 | 3 | 1.68E-05 | haptoglobin-hemoglobin complex |
| CC | GO:0014731 | 6 | 45 | 1.24E-09 | spectrin-associated cytoskeleton |
| CC | GO:0034363 | 2 | 4 | 3.35E-05 | intermediate-density lipoprotein particle |
| CC | GO:0044615 | 4 | 9 | 3.84E-09 | nuclear pore nuclear basket |
| MF | GO:0004252 | 10 | 445 | 1.28E-07 | serine-type endopeptidase activity |
| MF | GO:0008201 | 10 | 362 | 1.92E-08 | heparin binding |
| MF | GO:0005200 | 10 | 208 | 9.74E-11 | structural constituent of cytoskeleton |
| MF | GO:0043499 | 4 | 38 | 2.10E-06 | eukaryotic cell surface binding |
| KEGG | path:hsa04610 | 13 | 127 | 7.62E-18 | Complement and coagulation cascades |
| KEGG | path:hsa04151 | 17 | 815 | 1.57E-11 | PI3K-Akt signaling pathway |
| KEGG | path:hsa05200 | 13 | 852 | 1.44E-07 | Pathways in cancer |
| KEGG | path:hsa05222 | 6 | 207 | 1.06E-05 | Small cell lung cancer |
| KEGG | path:hsa04380 | 7 | 302 | 8.23E-06 | Osteoclast differentiation |
| KEGG | path:hsa05164 | 8 | 357 | 2.39E-06 | Influenza A |
| KEGG | path:hsa04010 | 9 | 737 | 6.40E-05 | MAPK signaling pathway |
| KEGG | path:hsa05410 | 6 | 266 | 4.23E-05 | Hypertrophic cardiomyopathy (HCM) |
| KEGG | path:hsa05202 | 13 | 430 | 4.67E-11 | Transcriptional misregulation in cancer |
| KEGG | path:hsa04064 | 6 | 208 | 1.09E-05 | NF-kappa B signaling pathway |
| KEGG | path:hsa04620 | 7 | 217 | 9.63E-07 | Toll-like receptor signaling pathway |
| KEGG | path:hsa05160 | 6 | 263 | 3.98E-05 | Hepatitis C |
| KEGG | path:hsa05168 | 13 | 388 | 1.32E-11 | Herpes simplex infection |
| KEGG | path:hsa05169 | 11 | 408 | 4.78E-09 | Epstein-Barr virus infection |
| KEGG | path:hsa04060 | 12 | 532 | 6.80E-09 | Cytokine-cytokine receptor interaction |
| KEGG | path:hsa04630 | 10 | 306 | 3.97E-09 | Jak-STAT signaling pathway |
| KEGG | path:hsa03013 | 8 | 276 | 3.58E-07 | RNA transport |
| KEGG | path:hsa04066 | 8 | 240 | 1.25E-07 | HIF-1 signaling pathway |
| KEGG | path:hsa05310 | 4 | 40 | 2.59E-06 | Asthma |
| KEGG | path:hsa05166 | 15 | 573 | 1.11E-11 | HTLV-I infection |
| KEGG | path:hsa04621 | 5 | 148 | 2.84E-05 | NOD-like receptor signaling pathway |
| KEGG | path:hsa04920 | 7 | 149 | 7.74E-08 | Adipocytokine signaling pathway |
| KEGG | path:hsa05152 | 8 | 377 | 3.54E-06 | Tuberculosis |
| KEGG | path:hsa05161 | 9 | 353 | 1.94E-07 | Hepatitis B |
| KEGG | path:hsa04350 | 5 | 141 | 2.26E-05 | TGF-beta signaling pathway |
| KEGG | path:hsa05143 | 4 | 68 | 2.16E-05 | African trypanosomiasis |
| KEGG | path:hsa04660 | 6 | 265 | 4.15E-05 | T cell receptor signaling pathway |
| KEGG | path:hsa05162 | 7 | 307 | 9.14E-06 | Measles |
| KEGG | path:hsa04640 | 12 | 211 | 1.75E-13 | Hematopoietic cell lineage |
| KEGG | path:hsa05221 | 6 | 157 | 2.21E-06 | Acute myeloid leukemia |
| KEGG | path:hsa05144 | 9 | 86 | 7.73E-13 | Malaria |
| KEGG | path:hsa05220 | 6 | 169 | 3.37E-06 | Chronic myeloid leukemia |

Table S3. Significantly enriched GO and KEGG terms for N-linked glycosylated proteins.

| Type | ID | Count in positive set | Count in human proteome | P-value | Term |
|---|---|---|---|---|---|
| BP | GO:0006508 | 16 | 1643 | 2.70E-07 | proteolysis |
| BP | GO:0006958 | 13 | 195 | 3.06E-16 | complement activation, classical pathway |
| BP | GO:0045087 | 17 | 817 | 1.48E-12 | innate immune response |
| BP | GO:0007411 | 11 | 859 | 1.62E-06 | axon guidance |
| BP | GO:0043066 | 19 | 1230 | 1.14E-11 | negative regulation of apoptotic process |
| BP | GO:0006954 | 10 | 615 | 6.00E-07 | inflammatory response |
| BP | GO:0006955 | 16 | 984 | 2.45E-10 | immune response |
| BP | GO:0006956 | 5 | 31 | 5.42E-09 | complement activation |
| BP | GO:0010951 | 4 | 40 | 1.43E-06 | negative regulation of endopeptidase activity |
| BP | GO:0030449 | 8 | 40 | 1.86E-14 | regulation of complement activation |
| BP | GO:0008284 | 11 | 854 | 1.53E-06 | positive regulation of cell proliferation |
| BP | GO:0007155 | 19 | 1099 | 1.68E-12 | cell adhesion |
| BP | GO:0032496 | 8 | 274 | 1.09E-07 | response to lipopolysaccharide |
| BP | GO:0042102 | 5 | 86 | 9.95E-07 | positive regulation of T cell proliferation |
| BP | GO:0007229 | 9 | 252 | 3.07E-09 | integrin-mediated signaling pathway |
| BP | GO:0051897 | 7 | 170 | 6.86E-08 | positive regulation of protein kinase B signaling cascade |
| BP | GO:0048662 | 4 | 60 | 7.33E-06 | negative regulation of smooth muscle cell proliferation |
| BP | GO:0050731 | 8 | 155 | 1.33E-09 | positive regulation of peptidyl-tyrosine phosphorylation |
| BP | GO:0007596 | 21 | 627 | 2.47E-19 | blood coagulation |
| BP | GO:0030168 | 32 | 436 | 7.68E-40 | platelet activation |
| BP | GO:0010642 | 3 | 15 | 3.72E-06 | negative regulation of platelet-derived growth factor receptor signaling pathway |
| BP | GO:0050900 | 17 | 224 | 7.90E-22 | leukocyte migration |
| BP | GO:0030194 | 3 | 20 | 9.23E-06 | positive regulation of blood coagulation |
| BP | GO:0006953 | 10 | 89 | 4.26E-15 | acute-phase response |
| BP | GO:0050777 | 3 | 17 | 5.54E-06 | negative regulation of immune response |
| BP | GO:0051918 | 5 | 14 | 6.60E-11 | negative regulation of fibrinolysis |
| BP | GO:0002682 | 3 | 7 | 2.91E-07 | regulation of immune system process |
| BP | GO:0050776 | 6 | 187 | 2.55E-06 | regulation of immune response |
| BP | GO:0002576 | 24 | 199 | 2.29E-35 | platelet degranulation |
| BP | GO:0030195 | 4 | 16 | 2.99E-08 | negative regulation of blood coagulation |
| BP | GO:0017187 | 4 | 23 | 1.43E-07 | peptidyl-glutamic acid carboxylation |
| BP | GO:0042730 | 7 | 47 | 7.65E-12 | fibrinolysis |
| BP | GO:0007597 | 12 | 23 | 4.78E-27 | blood coagulation, intrinsic pathway |
| BP | GO:0007598 | 4 | 8 | 1.17E-09 | blood coagulation, extrinsic pathway |
| BP | GO:0006957 | 4 | 17 | 3.90E-08 | complement activation, alternative pathway |
| BP | GO:0051919 | 3 | 5 | 8.34E-08 | positive regulation of fibrinolysis |
| BP | GO:0002542 | 2 | 2 | 4.15E-06 | Factor XII activation |
| BP | GO:0001869 | 2 | 2 | 4.15E-06 | negative regulation of complement activation, lectin pathway |
| BP | GO:0010888 | 3 | 11 | 1.36E-06 | negative regulation of lipid storage |
| CC | GO:0005886 | 42 | 6272 | 4.43E-12 | plasma membrane |
| CC | GO:0005576 | 42 | 2736 | 9.32E-25 | extracellular region |
| CC | GO:0005615 | 54 | 1608 | 1.92E-49 | extracellular space |
| CC | GO:0009986 | 19 | 797 | 6.18E-15 | cell surface |
| CC | GO:0005887 | 44 | 2246 | 3.16E-30 | integral to plasma membrane |
| CC | GO:0010008 | 7 | 282 | 2.00E-06 | endosome membrane |
| CC | GO:0031362 | 3 | 35 | 5.14E-05 | anchored to external side of plasma membrane |
| CC | GO:0045121 | 11 | 352 | 2.15E-10 | membrane raft |
| CC | GO:0042470 | 10 | 195 | 1.18E-11 | melanosome |
| CC | GO:0009897 | 26 | 463 | 1.95E-29 | external side of plasma membrane |
| CC | GO:0070062 | 6 | 124 | 2.36E-07 | extracellular vesicular exosome |
| CC | GO:0008305 | 6 | 105 | 8.82E-08 | integrin complex |
| CC | GO:0031225 | 6 | 219 | 6.25E-06 | anchored to membrane |
| CC | GO:0005765 | 9 | 311 | 1.87E-08 | lysosomal membrane |
| CC | GO:0005577 | 4 | 29 | 3.80E-07 | fibrinogen complex |
| CC | GO:0005788 | 6 | 321 | 5.18E-05 | endoplasmic reticulum lumen |
| CC | GO:0043202 | 7 | 137 | 1.57E-08 | lysosomal lumen |
| CC | GO:0005796 | 5 | 152 | 1.59E-05 | Golgi lumen |
| CC | GO:0031093 | 16 | 116 | 6.62E-25 | platelet alpha granule lumen |
| CC | GO:0031088 | 3 | 17 | 5.54E-06 | platelet dense granule membrane |
| CC | GO:0071062 | 3 | 5 | 8.34E-08 | alphav-beta3 integrin-vitronectin complex |
| CC | GO:0033093 | 2 | 5 | 4.12E-05 | Weibel-Palade body |
| CC | GO:0031092 | 5 | 15 | 9.88E-11 | platelet alpha granule membrane |
| MF | GO:0003823 | 7 | 183 | 1.13E-07 | antigen binding |
| MF | GO:0004252 | 12 | 445 | 1.71E-10 | serine-type endopeptidase activity |
| MF | GO:0004866 | 5 | 66 | 2.66E-07 | endopeptidase inhibitor activity |
| MF | GO:0030246 | 8 | 442 | 3.74E-06 | carbohydrate binding |
| MF | GO:0004872 | 7 | 441 | 3.43E-05 | receptor activity |
| MF | GO:0004867 | 12 | 292 | 1.35E-12 | serine-type endopeptidase inhibitor activity |
| MF | GO:0008201 | 13 | 362 | 8.39E-13 | heparin binding |
| MF | GO:0070053 | 3 | 8 | 4.64E-07 | thrombospondin receptor activity |
| MF | GO:0043499 | 6 | 38 | 1.73E-10 | eukaryotic cell surface binding |
| MF | GO:0019865 | 2 | 2 | 4.15E-06 | immunoglobulin binding |
| MF | GO:0050431 | 4 | 19 | 6.32E-08 | transforming growth factor beta binding |
| KEGG | path:hsa04610 | 28 | 127 | 4.64E-49 | Complement and coagulation cascades |
| KEGG | path:hsa05133 | 9 | 144 | 2.20E-11 | Pertussis |
| KEGG | path:hsa05150 | 8 | 96 | 2.88E-11 | Staphylococcus aureus infection |
| KEGG | path:hsa05322 | 6 | 142 | 5.21E-07 | Systemic lupus erythematosus |
| KEGG | path:hsa04151 | 16 | 815 | 1.61E-11 | PI3K-Akt signaling pathway |
| KEGG | path:hsa04510 | 12 | 505 | 7.08E-10 | Focal adhesion |
| KEGG | path:hsa04512 | 15 | 226 | 1.61E-18 | ECM-receptor interaction |
| KEGG | path:hsa05146 | 7 | 220 | 3.89E-07 | Amoebiasis |
| KEGG | path:hsa05200 | 10 | 852 | 1.01E-05 | Pathways in cancer |
| KEGG | path:hsa05222 | 6 | 207 | 4.55E-06 | Small cell lung cancer |
| KEGG | path:hsa05410 | 10 | 266 | 2.45E-10 | Hypertrophic cardiomyopathy (HCM) |
| KEGG | path:hsa05412 | 8 | 248 | 5.11E-08 | Arrhythmogenic right ventricular cardiomyopathy (ARVC) |
| KEGG | path:hsa05414 | 9 | 282 | 8.09E-09 | Dilated cardiomyopathy |
| KEGG | path:hsa04145 | 11 | 299 | 3.85E-11 | Phagosome |
| KEGG | path:hsa04060 | 14 | 532 | 6.87E-12 | Cytokine-cytokine receptor interaction |
| KEGG | path:hsa04630 | 7 | 306 | 3.40E-06 | Jak-STAT signaling pathway |
| KEGG | path:hsa04514 | 13 | 331 | 2.72E-13 | Cell adhesion molecules (CAMs) |
| KEGG | path:hsa04810 | 10 | 510 | 1.11E-07 | Regulation of actin cytoskeleton |
| KEGG | path:hsa05310 | 3 | 40 | 7.68E-05 | Asthma |
| KEGG | path:hsa04672 | 5 | 96 | 1.71E-06 | Intestinal immune network for IgA production |
| KEGG | path:hsa05152 | 8 | 377 | 1.18E-06 | Tuberculosis |
| KEGG | path:hsa04142 | 12 | 244 | 1.64E-13 | Lysosome |
| KEGG | path:hsa05020 | 4 | 67 | 1.14E-05 | Prion diseases |
| KEGG | path:hsa04640 | 17 | 211 | 2.81E-22 | Hematopoietic cell lineage |
| KEGG | path:hsa05144 | 9 | 86 | 1.99E-13 | Malaria |

Table S4. Significantly enriched GO and KEGG terms for C-linked glycosylated proteins.

Type  ID             Term in the positive  Term in human proteome  P-value   Term
BP    GO:0006958     5                     188                     1.84E-10  complement activation, classical pathway
BP    GO:0045087     6                     742                     2.65E-09  innate immune response
BP    GO:0006917     3                     359                     3.62E-05  induction of apoptosis
BP    GO:0007155     5                     1078                    1.06E-06  cell adhesion
BP    GO:0006956     4                     32                      2.65E-11  complement activation
BP    GO:0002088     2                     60                      5.48E-05  lens development in camera-type eye
BP    GO:0007043     2                     35                      1.85E-05  cell-cell junction assembly
BP    GO:0019835     5                     24                      4.30E-15  cytolysis
BP    GO:0030101     2                     21                      6.54E-06  natural killer cell activation
BP    GO:0006957     5                     17                      6.26E-16  complement activation, alternative pathway
CC    GO:0005576     6                     2585                    3.89E-06  extracellular region
CC    GO:0005615     5                     1557                    6.31E-06  extracellular space
CC    GO:0031982     2                     48                      3.50E-05  vesicle
CC    GO:0005579     5                     9                       1.28E-17  membrane attack complex
CC    GO:0005796     2                     103                     1.61E-04  Golgi lumen
CC    GO:0043514     1                     2                       3.65E-04  interleukin-12 complex
CC    GO:0070743     1                     2                       3.65E-04  interleukin-23 complex
MF    GO:0005201     2                     157                     3.73E-04  extracellular matrix structural constituent
MF    GO:0005212     2                     32                      1.54E-05  structural constituent of eye lens
MF    GO:0001532     1                     1                       1.83E-04  interleukin-21 receptor activity
KEGG  path:hsa05146  4                     213                     5.99E-08  Amoebiasis
KEGG  path:hsa04610  5                     124                     2.25E-11  Complement and coagulation cascades
KEGG  path:hsa05322  5                     135                     3.46E-11  Systemic lupus erythematosus
KEGG  path:hsa05020  5                     66                      8.98E-13  Prion diseases
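The table does not state which statistical test produced the P-values. A one-sided hypergeometric test is a common choice for this kind of over-representation analysis, so the sketch below assumes it. The background size `N_PROTEOME` and the positive-set size `N_POSITIVE` are hypothetical placeholders (neither is given in the table), so the printed value is not expected to reproduce the tabulated P-value.

```python
from math import comb

def hypergeom_sf(k, K, n, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Hypothetical sizes: assumptions for illustration only.
N_PROTEOME = 20000  # assumed number of annotated human proteins
N_POSITIVE = 10     # assumed number of C-glycosylated proteins tested

# "membrane attack complex": 5 proteins in the positive set, 9 in the proteome.
p = hypergeom_sf(k=5, K=9, n=N_POSITIVE, N=N_PROTEOME)
print(f"{p:.2e}")
```

Exact integer binomials are used so that very small tail probabilities are not lost to floating-point underflow before the final division.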

Table S5. The over-represented protein interacting partners that interact with the corresponding C-glycosylated proteins in STRING.

ID                    Term in the positive  Term in human proteome  P-value   Protein name
9606.ENSP00000263341  4                     894                     1.73E-05  Interleukin-1 beta
9606.ENSP00000372815  4                     200                     4.66E-08  Complement C4-A
9606.ENSP00000301364  3                     224                     8.92E-06  Pre-rRNA-processing protein TSR1 homolog
9606.ENSP00000339613  2                     23                      7.88E-06  GDP-fucose protein O-fucosyltransferase 2
9606.ENSP00000364293  2                     27                      1.09E-05  Pre-rRNA-processing protein TSR2 homolog
9606.ENSP00000223642  5                     231                     5.17E-10  Complement C5
9606.ENSP00000340210  3                     211                     7.47E-06  CD59 glycoprotein
9606.ENSP00000232003  3                     144                     2.38E-06  Histidine-rich glycoprotein
9606.ENSP00000322061  3                     38                      4.19E-08  Complement component C7
9606.ENSP00000360281  3                     53                      1.16E-07  Complement component C8 beta chain
9606.ENSP00000343002  3                     58                      1.53E-07  Beta-1,3-glucosyltransferase
9606.ENSP00000224181  4                     47                      1.31E-10  Complement component C8 gamma chain
9606.ENSP00000354458  3                     27                      1.45E-08  Complement component C8 alpha chain

Table S6. The top 100 over-represented protein interacting partners that interact with the corresponding N-glycosylated proteins in STRING.

ID 9606.ENSP00000365858 9606.ENSP00000374992 9606.ENSP00000384675 9606.ENSP00000263321 9606.ENSP00000372224 9606.ENSP00000225844 9606.ENSP00000339428 9606.ENSP00000339621 9606.ENSP00000297494 9606.ENSP00000377055 9606.ENSP00000317780

Term in the positive 8 5 9 6 4 6 7 4 9 5 6

Term in human proteome 402 97 538 175 42 171 270 40 523 92 168

P-value 1.88E-06 1.80E-06 1.75E-06 1.75E-06 1.74E-06 1.53E-06 1.50E-06 1.43E-06 1.39E-06 1.39E-06 1.38E-06

9606.ENSP00000357255 9606.ENSP00000360248

8 9

383 519

1.32E-06 1.31E-06

9606.ENSP00000297450 9606.ENSP00000256010 9606.ENSP00000329418 9606.ENSP00000271651 9606.ENSP00000341141 9606.ENSP00000322061 9606.ENSP00000356825 9606.ENSP00000347140 9606.ENSP00000260302 9606.ENSP00000296280

6 7 8 6 4 4 7 5 7 5

166 262 379 164 38 38 259 88 254 86

1.29E-06 1.24E-06 1.22E-06 1.20E-06 1.16E-06 1.16E-06 1.15E-06 1.12E-06 1.01E-06 9.95E-07

9606.ENSP00000369820

5

86

9.95E-07

9606.ENSP00000362181

5

85

9.39E-07

9606.ENSP00000293275 9606.ENSP00000304414 9606.ENSP00000307713 9606.ENSP00000362608

7 7 8 7

250 250 363 248

9.07E-07 9.07E-07 8.91E-07 8.61E-07

9606.ENSP00000310998

5

83

8.35E-07

protein name Erythroid transcription factor Ig gamma-3 chain C region Son of sevenless homolog 1 Tyrosinase Hepatocyte growth factor activator C-C motif chemokine 13 Suppressor of cytokine signaling 2 Epididymal-specific lipocalin-6 Nitric oxide synthase, endothelial Lysosomal Pro-X carboxy- peptidase Cytochrome c oxidase subunit 5A, mitochondrial Osteocalcin Ectonucleoside triphosphate diphosphohydrolase 1 Angiopoietin-1 Neurotensin/neuromedin N Suppressor of cytokine signaling 1 Cathepsin K Sialoadhesin Complement component C7 Adenylate cyclase type 10 Asialoglycoprotein receptor 2 Collagenase 3 Mannan-binding lectin serine protease 1 Phosphatidylinositol N-acetylglucosaminyltransferase subunit A Natural cytotoxicity triggering receptor 2 C-C motif chemokine 16 C-X-C chemokine receptor type 6 B2 bradykinin receptor (B2R) Serine/threonine-protein kinase pim-1 N-acetylglucosamine-1-phosphodiest er alpha-N-acetylglucosaminidase

9606.ENSP00000368538

6

152

7.74E-07

9606.ENSP00000382166 9606.ENSP00000356037 9606.ENSP00000282030 9606.ENSP00000243611 9606.ENSP00000352561 9606.ENSP00000338082 9606.ENSP00000356969 9606.ENSP00000373539 9606.ENSP00000170630 9606.ENSP00000245564 9606.ENSP00000366525 9606.ENSP00000338358 9606.ENSP00000388516 9606.ENSP00000291890

6 5 5 4 7 7 6 6 7 3 7 5 6 5

152 81 81 34 242 242 150 150 241 9 240 79 147 77

7.74E-07 7.39E-07 7.39E-07 7.34E-07 7.32E-07 7.32E-07 7.17E-07 7.17E-07 7.12E-07 6.95E-07 6.93E-07 6.53E-07 6.37E-07 5.75E-07

9606.ENSP00000264896 9606.ENSP00000370034 9606.ENSP00000357470 9606.ENSP00000223642 9606.ENSP00000333994 9606.ENSP00000351430 9606.ENSP00000281141

5 7 7 7 6 5 6

77 233 233 231 142 75 141

5.75E-07 5.70E-07 5.70E-07 5.38E-07 5.21E-07 5.04E-07 5.00E-07

9606.ENSP00000239938 9606.ENSP00000327431 9606.ENSP00000378394 9606.ENSP00000346508

10 6 6 6

601 140 140 139

4.89E-07 4.80E-07 4.80E-07 4.60E-07

9606.ENSP00000269571

11

752

4.57E-07

9606.ENSP00000391592

10

596

4.54E-07

9606.ENSP00000312673 9606.ENSP00000284273

6 8

138 330

4.41E-07 4.40E-07

9606.ENSP00000264818

9

451

4.18E-07

9606.ENSP00000319464 9606.ENSP00000188790 9606.ENSP00000260010 9606.ENSP00000297350

5 6 9 7

72 136 447 219

4.11E-07 4.05E-07 3.89E-07 3.77E-07

Tumor necrosis factor receptor superfamily member 4 CX3C chemokine receptor 1 C4b-binding protein alpha chain SET-binding protein C4b-binding protein beta chain Calcitonin receptor Hemoglobin subunit gamma-2 Apolipoprotein A-II Stabilin-2 Interleukin-4 receptor subunit alpha Protein misato homolog 1 Ferritin light chain Serpin B6 Complement factor B Natural cytotoxicity triggering receptor 1 Lysosome membrane protein 2 Toll-like receptor 7 Interleukin-6 receptor subunit alpha Complement C5 Hemoglobin subunit beta Glycophorin-E Cell division cycle protein 123 homolog Early growth response protein 1 Hemoglobin subunit gamma-1 Proactivator polypeptide Platelet-derived growth factor subunit A Receptor tyrosine-protein kinase erbB-2 Tyrosine-protein phosphatase non-receptor type 6 Somatotropin Ubiquitin-associated and SH3 domain-containing protein B Non-receptor tyrosine-protein kinase TYK2 Carboxypeptidase N subunit 2 Seprase Toll-like receptor 2 Tumor necrosis factor receptor superfamily member 11B

9606.ENSP00000379110 9606.ENSP00000264005

8 6

321 133

3.58E-07 3.55E-07

9606.ENSP00000233946 9606.ENSP00000363571

8 8

318 318

3.34E-07 3.34E-07

9606.ENSP00000392423 9606.ENSP00000284440

8 8

317 315

3.26E-07 3.11E-07

9606.ENSP00000002165 9606.ENSP00000254801 9606.ENSP00000259206

5 5 7

68 68 212

3.09E-07 3.09E-07 3.04E-07

9606.ENSP00000300305 9606.ENSP00000233813

10 6

570 129

3.04E-07 2.97E-07

9606.ENSP00000354458

4

27

2.82E-07

9606.ENSP00000200307 9606.ENSP00000286758 9606.ENSP00000006053 9606.ENSP00000329623 9606.ENSP00000361818 9606.ENSP00000274520 9606.ENSP00000330658 9606.ENSP00000314299

6 8 6 14 7 7 5 5

127 307 125 1228 204 204 64 63

2.71E-07 2.57E-07 2.47E-07 2.46E-07 2.35E-07 2.35E-07 2.28E-07 2.10E-07

9606.ENSP00000310305 9606.ENSP00000346467

7 5

198 61

1.92E-07 1.79E-07

9606.ENSP00000231751 9606.ENSP00000351163 9606.ENSP00000359724

7 7 6

195 195 117

1.73E-07 1.73E-07 1.67E-07

9606.ENSP00000326119

7

193

1.62E-07

9606.ENSP00000387122 9606.ENSP00000361592

6 5

116 59

1.59E-07 1.51E-07

9606.ENSP00000403377 9606.ENSP00000215832 9606.ENSP00000253408 9606.ENSP00000302234 9606.ENSP00000263726

6 13 10 7 5

115 995 526 189 58

1.51E-07 1.47E-07 1.47E-07 1.40E-07 1.38E-07

Growth-regulated alpha protein Phosphatidylcholine-sterol acyltransferase Interleukin-1 receptor type 1 Muscle, skeletal receptor tyrosine- protein kinase Reelin Ubiquitin carboxyl-terminal hydrolase isozyme L1 Plasma alpha-L-fucosidase Immunoglobulin J chain Interleukin-1 receptor antagonist protein Runt-related transcription factor 1 Insulin-like growth factor-binding protein 5 Complement component C8 alpha chain C-C motif chemokine 7 C-X-C motif chemokine 13 Fractalkine Apoptosis regulator Bcl-2 Syndecan-4 Interleukin-9 Pappalysin-1 Complement factor H-related protein 1 P2Y purinoceptor 2 (P2Y2) Latent-transforming growth factor beta-binding protein 1 Lactotransferrin Collagen alpha-1(XI) chain Four and a half LIM domains protein 1 Phorbol-12-myristate-13-acetate-indu ced protein 1 Protein CLEC16A Erythroid membrane-associated protein Mitogen-activated protein kinase 1 Glial fibrillary acidic protein Eotaxin LIM/homeobox protein Lhx4

9606.ENSP00000357150 9606.ENSP00000303231 9606.ENSP00000349270 9606.ENSP00000339393 9606.ENSP00000261937

5 8 5 9 7

58 280 57 389 185

1.38E-07 1.28E-07 1.27E-07 1.23E-07 1.22E-07

9606.ENSP00000367992 9606.ENSP00000244869 9606.ENSP00000252951 9606.ENSP00000329357 9606.ENSP00000162749

7 6 6 15 11

184 110 110 1331 646

1.17E-07 1.16E-07 1.16E-07 1.05E-07 1.04E-07

9606.ENSP00000265016

6

107

9.86E-08

T-cell surface glycoprotein CD1b Interleukin-12 subunit alpha Hemoglobin subunit mu C-C chemokine receptor type 6 Vascular endothelial growth factor receptor 3 S-formylglutathione hydrolase Proepiregulin Hemoglobin subunit zeta Transcription factor Sp1 Tumor necrosis factor receptor superfamily member 1A ADP-ribosyl cyclase 2

Table S7. The top 100 over-represented protein interaction partners that interact with the corresponding substrates of O-linked glycosylation in STRING.

ID 9606.ENSP00000263923

Term in the positive 9

Term in human proteome 449

P-value 1.39E-06

9606.ENSP00000362637 9606.ENSP00000359446 9606.ENSP00000338358 9606.ENSP00000260010 9606.ENSP00000291232

7 5 5 9 7

229 79 79 447 227

1.37E-06 1.36E-06 1.36E-06 1.34E-06 1.30E-06

9606.ENSP00000272163 9606.ENSP00000262965 9606.ENSP00000333685 9606.ENSP00000348827 9606.ENSP00000276927 9606.ENSP00000264335 9606.ENSP00000000412

5 9 9 8 9 11 8

78 443 441 323 438 708 321

1.28E-06 1.24E-06 1.20E-06 1.15E-06 1.13E-06 1.10E-06 1.10E-06

9606.ENSP00000379110 9606.ENSP00000345206

8 7

321 221

1.10E-06 1.09E-06

9606.ENSP00000294339

8

320

1.07E-06

9606.ENSP00000276110

8

320

1.07E-06

9606.ENSP00000241651 9606.ENSP00000266427 9606.ENSP00000357753 9606.ENSP00000263373

9 8 6 6

433 318 137 137

1.03E-06 1.02E-06 1.01E-06 1.01E-06

9606.ENSP00000365380 9606.ENSP00000365663 9606.ENSP00000216492 9606.ENSP00000417281 9606.ENSP00000353863 9606.ENSP00000245932

8 7 7 11 6 8

317 218 218 699 136 315

1.00E-06 9.92E-07 9.92E-07 9.78E-07 9.66E-07 9.55E-07

9606.ENSP00000358022

9

428

9.40E-07

9606.ENSP00000353654

7

216

9.34E-07

protein name Vascular endothelial growth factor receptor 2 Bladder cancer-associated protein Carboxypeptidase N catalytic chain Serpin B6 Toll-like receptor 2 Tumor necrosis factor receptor superfamily member 13C Lamin-B receptor Transcription factor E2-alpha Mitogen-activated protein kinase 11 Thyroid hormone receptor beta Interferon alpha-1/13 14-3-3 protein epsilon Cation-dependent mannose-6-phosphate receptor Growth-regulated alpha protein Recombining binding protein suppressor of hairless T-cell acute lymphocytic leukemia protein 1 Cytokine receptor common subunit gamma Myogenin Transcription factor ETV6 Involucrin Spectrin beta chain, non-erythrocytic 4 Forkhead box protein P3 Natriuretic peptides A Chromogranin-A E3 ubiquitin-protein ligase Mdm2 Zinc finger protein 148 Vasodilator-stimulated phosphoprotein Induced myeloid leukemia cell differentiation protein Mcl-1 Collagen alpha-2(IV) chain

9606.ENSP00000263525 9606.ENSP00000264246

7 9

215 425

9.06E-07 8.88E-07

9606.ENSP00000243050

9

424

8.71E-07

9606.ENSP00000230354 9606.ENSP00000282957 9606.ENSP00000264010 9606.ENSP00000293272 9606.ENSP00000367830 9606.ENSP00000293379 9606.ENSP00000311489

12 5 7 10 10 9 7

841 72 213 549 547 420 210

8.64E-07 8.59E-07 8.51E-07 8.45E-07 8.18E-07 8.07E-07 7.75E-07

9606.ENSP00000264515 9606.ENSP00000259456 9606.ENSP00000317272 9606.ENSP00000252971

7 7 10 8

210 210 542 304

7.75E-07 7.75E-07 7.54E-07 7.35E-07

9606.ENSP00000344352

8

304

7.35E-07

9606.ENSP00000344818 9606.ENSP00000384675 9606.ENSP00000261023 9606.ENSP00000353646 9606.ENSP00000350283

16 10 10 5 13

1518 538 537 69 979

7.08E-07 7.06E-07 6.95E-07 6.94E-07 6.67E-07

9606.ENSP00000284240 9606.ENSP00000303276 9606.ENSP00000265970

8 5 8

300 68 298

6.66E-07 6.46E-07 6.34E-07

9606.ENSP00000357283 9606.ENSP00000216223 9606.ENSP00000362795 9606.ENSP00000293636

9 9 7 6

407 404 201 124

6.25E-07 5.88E-07 5.80E-07 5.65E-07

9606.ENSP00000372815 9606.ENSP00000381331 9606.ENSP00000307046 9606.ENSP00000296930 9606.ENSP00000337432 9606.ENSP00000317790

7 11 9 11 6 6

200 659 400 657 123 123

5.61E-07 5.57E-07 5.42E-07 5.41E-07 5.38E-07 5.38E-07

9606.ENSP00000263816

8

288

4.92E-07

Tenascin-R T-lymphocyte activation antigen CD80 Nuclear receptor subfamily 4 group A member 1 TATA-box-binding protein Carboxypeptidase B Transcriptional repressor CTCF C-C motif chemokine 5 Protein kinase C zeta type Integrin alpha-5 Spectrin beta chain, non-erythrocytic 2 Retinoblastoma-binding protein 5 Hemogen Hepatocyte growth factor receptor Motor neuron and pancreas homeobox protein 1 Cyclic AMP-dependent transcription factor ATF-3 Polyubiquitin-C Son of sevenless homolog 1 Integrin alpha-V Puratrophin-1 Breast cancer type 1 susceptibility protein Thy-1 membrane glycoprotein Non-secretory ribonuclease Phosphatidylinositol 4-phosphate 3-ki- nase C2 domain-containing subunit alpha Prelamin-A/C Interleukin-2 receptor subunit beta C-X-C chemokine receptor type 3 Chymotrypsin-like elastase family member 1 Complement C4-A Histone deacetylase 2 Syndecan-2 Nucleophosmin Interleukin-17F Spectrin beta chain, non-erythrocytic 5 Low-density lipoprotein receptor-

9606.ENSP00000393887 9606.ENSP00000324890

6 10

121 516

4.89E-07 4.87E-07

9606.ENSP00000301464

5

64

4.77E-07

9606.ENSP00000330658 9606.ENSP00000344192 9606.ENSP00000321334 9606.ENSP00000273951 9606.ENSP00000320709 9606.ENSP00000222381 9606.ENSP00000372326 9606.ENSP00000350877 9606.ENSP00000351671 9606.ENSP00000225792

5 9 6 7 8 6 6 10 9 7

64 393 120 194 285 119 119 509 389 192

4.77E-07 4.69E-07 4.66E-07 4.58E-07 4.55E-07 4.44E-07 4.44E-07 4.31E-07 4.31E-07 4.27E-07

9606.ENSP00000263946 9606.ENSP00000300811 9606.ENSP00000379866 9606.ENSP00000359424

6 5 7 9

118 62 188 380

4.22E-07 4.06E-07 3.71E-07 3.56E-07

9606.ENSP00000252244 9606.ENSP00000273063 9606.ENSP00000250092 9606.ENSP00000263121

6 5 9 8

114 60 377 271

3.45E-07 3.45E-07 3.33E-07 3.12E-07

9606.ENSP00000353243 9606.ENSP00000380227 9606.ENSP00000263726 9606.ENSP00000204637 9606.ENSP00000332369

7 10 5 5 7

183 487 58 58 181

3.10E-07 2.90E-07 2.90E-07 2.90E-07 2.88E-07

9606.ENSP00000284384 9606.ENSP00000387662 9606.ENSP00000238081 9606.ENSP00000354586 9606.ENSP00000309953

14 11 11 11 6

1067 615 614 614 110

2.87E-07 2.86E-07 2.82E-07 2.82E-07 2.80E-07

9606.ENSP00000242285 9606.ENSP00000346294 9606.ENSP00000351407

10 8 9

485 266 367

2.80E-07 2.72E-07 2.67E-07

related protein 2 Alpha-2-HS-glycoprotein T-cell-specific surface glycoprotein CD28 Insulin-like growth factor-binding protein 6 Pappalysin-1 Interleukin-17A Apolipoprotein(a) Vitamin D-binding protein Adiponectin Serum paraoxonase/arylesterase 1 Ferrochelatase, mitochondrial Serine/arginine-rich-splicing factor 2 C-C motif chemokine 20 Probable ATP-dependent RNA helicase DDX5 Plakophilin-1 Zinc finger protein 428 Collagen alpha-4(IV) chain Inhibitor of nuclear factor kappa-B kinase subunit alpha Keratin, type II cytoskeletal 1 Anion exchange protein 3 Macrosialin SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily B member 1 GDP-L-fucose synthase Integrin alpha-4 LIM/homeobox protein Lhx4 Fms-related tyrosine kinase 3 ligand 5-aminolevulinate synthase, erythroid- specific, mitochondrial Protein kinase C alpha type Glucagon 14-3-3 protein theta Zinc finger protein GLI2 EGF-containing fibulin-like extracellular matrix protein 2 Clathrin light chain A (Lca) Protein S100-A4 Aryl hydrocarbon receptor nuclear translocator

9606.ENSP00000349270 9606.ENSP00000355361

5 8

57 265

2.66E-07 2.64E-07

Hemoglobin subunit mu Leukocyte surface antigen CD47

Table S8. AUC scores of the prediction models after removal of a particular feature group from the classifiers. The shaded row in the table (All features) corresponds to the performance of the all-feature model. For the remaining rows, the AUC score denotes the performance of the resulting model after removal of the corresponding feature group.

Features                   C-linked            N-linked            O-linked
                           IG+IFS   mRMR+IFS   IG+IFS   mRMR+IFS   IG+IFS   mRMR+IFS
All features               1        1          0.983    0.986      0.982    0.961
AAindex                    0.9943   0.9987     0.9444   0.9739     0.9341   -
PhysicalChem               0.9999   -          -        -          0.9421   -
PSSM                       -        0.9993     -        0.9878     0.9406   -
ConservationScore          -        -          -        -          -        -
CKSAAP                     -        0.9993     -        0.9878     -        -
Disopred_Region            -        1          -        -          0.9363   0.8878
Secondary_Structure        -        0.9993     -        -          -        -
Predicted_Dihedral_Angles  -        -          -        0.9870     0.9491   -
Functional Annotation      -        0.9993     -        0.9877     -        -
Functional Features        0.9637   0.9722     0.8783   0.9598     0.8544   0.6960
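As a reading aid for Table S8, the following minimal sketch ranks feature groups by how much the AUC falls when each group is removed. The AUC values are copied from the O-linked, IG+IFS column; the function name `auc_drops` is introduced here for illustration only, and entries shown as "-" in the table are simply left out.

```python
# AUC values copied from Table S8 (O-linked, IG+IFS column); "-" entries are omitted.
ALL_FEATURES_AUC = 0.982
auc_without = {
    "AAindex": 0.9341,
    "PhysicalChem": 0.9421,
    "PSSM": 0.9406,
    "Disopred_Region": 0.9363,
    "Predicted_Dihedral_Angles": 0.9491,
    "Functional Features": 0.8544,
}

def auc_drops(baseline, ablated):
    """Return (group, drop) pairs sorted from the largest to the smallest AUC drop."""
    drops = {group: baseline - auc for group, auc in ablated.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

ranked = auc_drops(ALL_FEATURES_AUC, auc_without)
for group, drop in ranked:
    print(f"{group:28s} {drop:.4f}")
```

For this column, removal of the Functional Features group gives the largest drop (0.982 to 0.8544).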

Table S9. Statistics of the final selected optimal features for each type of glycosylation by IG + IFS.

Feature Type               C-linked  N-linked  O-linked
AAindex                    5         2         83
PhysicalChem               1         0         9
PSSM                       0         0         3
ConservationScore          0         0         0
CKSAAP                     0         0         0
Disopred_Region            0         0         13
Secondary_Structure        0         0         0
Predicted_Dihedral_Angles  0         0         28
Functional Annotation      0         0         0
Functional Features        2         2         4
Total                      8         4         140

Table S10. Statistics of the final selected optimal features for each type of glycosylation by mRMR + IFS.

Feature Type               C-linked  N-linked  O-linked
AAindex                    349       189       0
PhysicalChem               0         0         0
PSSM                       12        1         0
ConservationScore          0         0         0
CKSAAP                     2         1         0
Disopred_Region            0         0         1
Secondary_Structure        11        0         0
Predicted_Dihedral_Angles  33        10        0
Functional Annotation      16        10        0
Functional Features        6         7         3
Total                      429       218       4

Table S11. Performance results of 10 randomization tests (the performance of each randomization was evaluated using fivefold cross-validation tests) of GlycoMine by randomly selecting the negative samples. The performance was evaluated using ten performance measures including AUC, MCC, ACC, Sensitivity, Specificity, Precision, TP, FP, TN and FN.

C-linked, OFC is from IG+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.999   0.982   0.991   0.982        1            1          54   0    55   1
2     1       1       1       1            1            1          55   0    55   0
3     0.999   0.982   0.991   0.982        1            1          54   0    55   1
4     0.999   0.982   0.991   0.982        1            1          54   0    55   1
5     1       1       1       1            1            1          55   0    55   0
6     0.998   0.982   0.991   0.982        1            1          54   0    55   1
7     0.985   0.982   0.991   0.982        1            1          54   0    55   1
8     1       1       1       1            1            1          55   0    55   0
9     1       1       1       1            1            1          55   0    55   0
10    1       1       1       1            1            1          55   0    55   0
Avg.  0.998   0.991   0.996   0.991        1            1          -    -    -    -

C-linked, OFC is from mRMR+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.999   0.982   0.991   0.982        1            1          54   0    55   1
2     0.999   0.982   0.991   0.982        1            1          54   0    55   1
3     0.999   0.982   0.991   0.982        1            1          54   0    55   1
4     0.999   0.982   0.991   0.982        1            1          54   0    55   1
5     0.994   0.946   0.973   0.964        0.982        0.981      51   1    54   2
6     0.997   0.946   0.973   0.964        0.982        0.981      53   1    54   2
7     0.995   0.946   0.973   0.964        0.982        0.981      53   1    54   2
8     0.997   0.964   0.982   0.964        1            1          53   0    55   2
9     0.997   0.964   0.982   0.982        0.982        0.982      54   1    54   1
10    0.997   0.964   0.982   0.982        0.982        0.982      54   1    54   1
Avg.  0.997   0.966   0.983   0.975        0.991        0.991      -    -    -    -

N-linked, OFC is from IG+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.971   0.878   0.938   0.919        0.958        0.956      306  14   319  27
2     0.969   0.883   0.941   0.925        0.958        0.956      308  14   319  25
3     0.971   0.865   0.932   0.925        0.941        0.939      308  20   313  25
4     0.973   0.868   0.934   0.922        0.946        0.946      307  18   315  26
5     0.971   0.871   0.935   0.931        0.941        0.939      310  20   313  23
6     0.959   0.892   0.946   0.928        0.964        0.963      309  12   321  24
7     0.972   0.892   0.946   0.934        0.958        0.957      311  14   319  22
8     0.976   0.874   0.937   0.931        0.943        0.942      310  19   314  23
9     0.976   0.889   0.944   0.934        0.955        0.954      311  15   318  22
10    0.973   0.901   0.949   0.919        0.979        0.978      306  11   322  27
Avg.  0.971   0.881   0.94    0.927        0.954        0.953      -    -    -    -

N-linked, OFC is from mRMR+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.989   0.891   0.944   0.922        0.967        0.965      307  11   322  26
2     0.988   0.891   0.944   0.925        0.964        0.963      308  12   321  25
3     0.988   0.877   0.938   0.931        0.946        0.945      310  18   315  23
4     0.988   0.883   0.941   0.937        0.946        0.945      312  18   315  21
5     0.988   0.877   0.938   0.937        0.941        0.94       312  20   313  21
6     0.988   0.893   0.946   0.925        0.967        0.966      308  11   322  25
7     0.988   0.881   0.94    0.934        0.946        0.945      311  18   315  22
8     0.988   0.88    0.94    0.922        0.958        0.956      307  14   319  26
9     0.988   0.877   0.938   0.931        0.946        0.945      310  18   315  23
10    0.989   0.994   0.941   0.919        0.964        0.962      306  12   321  27
Avg.  0.988   0.894   0.941   0.928        0.955        0.953      -    -    -    -

O-linked, OFC is from IG+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.929   0.708   0.854   0.864        0.844        0.847      499  81   439  71
2     0.929   0.702   0.851   0.865        0.837        0.841      450  85   435  70
3     0.931   0.705   0.852   0.877        0.827        0.835      456  90   430  64
4     0.929   0.697   0.848   0.865        0.831        0.836      450  88   432  70
5     0.928   0.703   0.851   0.867        0.835        0.841      451  86   434  69
6     0.932   0.712   0.856   0.873        0.838        0.844      454  84   436  66
7     0.933   0.716   0.858   0.873        0.842        0.847      454  82   438  66
8     0.933   0.723   0.862   0.865        0.858        0.859      450  74   446  70
9     0.933   0.719   0.86    0.873        0.846        0.85       454  80   440  66
10    0.933   0.717   0.859   0.852        0.865        0.864      443  70   450  77
Avg.  0.931   0.71    0.855   0.867        0.842        0.846      -    -    -    -

O-linked, OFC is from mRMR+IFS
No.   AUC     MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
1     0.928   0.703   0.851   0.867        0.835        0.841      451  86   434  69
2     0.904   0.695   0.842   0.927        0.758        0.793      482  126  394  38
3     0.928   0.703   0.851   0.867        0.835        0.841      451  86   434  69
4     0.904   0.695   0.842   0.927        0.758        0.793      482  126  394  38
5     0.929   0.697   0.848   0.865        0.831        0.836      450  88   432  70
6     0.9     0.679   0.837   0.9          0.773        0.799      468  118  402  52
7     0.891   0.681   0.838   0.902        0.773        0.799      469  118  402  51
8     0.904   0.689   0.844   0.867        0.821        0.829      451  93   427  69
9     0.895   0.643   0.821   0.838        0.804        0.81       436  102  418  84
10    0.884   0.627   0.812   0.756        0.867        0.851      393  69   451  127
Avg.  0.9067  0.6812  0.8386  0.8716       0.8055       0.8192     -    -    -    -
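The threshold-dependent measures in Table S11 can be recomputed from the four confusion-matrix counts. The sketch below assumes the standard definitions of accuracy, sensitivity, specificity, precision and the Matthews correlation coefficient; the helper names `_ratio` and `confusion_metrics` are illustrative and do not come from GlycoMine.

```python
from math import sqrt

def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures computed from confusion counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "Sensitivity": _ratio(tp, tp + fn),
        "Specificity": _ratio(tn, tn + fp),
        "Precision": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }

# Counts from run 1 of the C-linked, IG+IFS randomization test.
m = confusion_metrics(tp=54, fp=0, tn=55, fn=1)
print({name: round(value, 3) for name, value in m.items()})
```

Rounded to three decimals, these counts give ACC 0.991, Sensitivity 0.982, Specificity 1, Precision 1 and MCC 0.982, which agrees with the tabulated row. AUC is not included because it depends on the ranked prediction scores rather than on a single set of counts.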

Table S12. The prediction performance of three types of glycosylation sites based on 5-fold cross-validation tests using the benchmark datasets and independent tests using the independent test datasets. The performance was evaluated using ten measures including AUC, MCC, ACC, Sensitivity, Specificity, Precision, TP, FP, TN and FN.

N-linked:
Method       Features  Dataset      AUC    MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
Our work     IG+IFS    Independent  0.983  0.8904  0.9478  0.9880       0.9277       0.8723     82   12   154  1
Our work     mRMR+IFS  Independent  0.986  0.9018  0.9558  0.9518       0.9578       0.9186     79   7    159  4
Our work     IG+IFS    Benchmark    0.986  0.8647  0.9390  0.9309       0.9430       0.8908     310  38   629  23
Our work     mRMR+IFS  Benchmark    0.987  0.8880  0.9500  0.9339       0.9580       0.9174     311  28   639  22
EnsembleGly  -         Independent  0.939  0.8140  0.9113  0.9512       0.8916       0.8125     78   18   148  4
EnsembleGly  -         Benchmark    0.954  0.8053  0.9065  0.9641       0.8801       0.7867     295  80   587  11
GPP          -         Independent  0.786  0.5439  0.7254  0.9625       0.6098       0.5461     77   64   100  3
GPP          -         Benchmark    0.797  0.5631  0.7310  0.9838       0.6108       0.5458     304  253  397  5
NetNGlyc     -         Independent  0.551  0.1307  0.5206  0.7073       0.4277       0.3791     58   95   71   24
NetNGlyc     -         Benchmark    0.508  0.0418  0.5365  0.4789       0.5652       0.3541     159  290  377  173

O-linked:
Method       Features  Dataset      AUC    MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
Our work     IG+IFS    Independent  0.982  0.8719  0.9406  0.9612       0.9302       0.8732     124  18   240  5
Our work     mRMR+IFS  Independent  0.961  0.7907  0.9070  0.8605       0.9302       0.8605     111  18   240  18
Our work     IG+IFS    Benchmark    0.941  0.7228  0.8667  0.9000       0.8497       0.7536     468  153  865  52
Our work     mRMR+IFS  Benchmark    0.932  0.7078  0.8661  0.8423       0.8782       0.7794     438  124  894  82
EnsembleGly  -         Independent  0.819  0.5564  0.8063  0.6935       0.8605       0.7049     86   36   222  38
EnsembleGly  -         Benchmark    0.845  0.5637  0.7970  0.7776       0.8065       0.6632     388  197  821  111
GPP          -         Independent  0.697  0.3688  0.6596  0.8017       0.5930       0.4802     97   105  153  24
GPP          -         Benchmark    0.669  0.3207  0.6266  0.7936       0.5444       0.4615     396  462  552  103
NetOGlyc     -         Independent  0.850  0.533   0.786   0.756        0.799        0.637      93   53   211  30
NetOGlyc     -         Benchmark    0.861  0.544   0.779   0.807        0.765        0.629      409  241  785  98

C-linked:
Method       Features  Dataset      AUC    MCC     ACC     Sensitivity  Specificity  Precision  TP   FP   TN   FN
Our work     IG+IFS    Independent  1      1       1       1            1            1          13   0    28   0
Our work     mRMR+IFS  Independent  1      1       1       1            1            1          13   0    28   0
Our work     IG+IFS    Benchmark    0.999  0.9864  0.9939  1            0.9907       0.9821     55   1    107  0
Our work     mRMR+IFS  Benchmark    1      1       1       1            1            1          55   0    108  0
EnsembleGly  -         Independent  0.919  0.8289  0.9268  0.8462       0.9643       0.9167     11   1    27   2
EnsembleGly  -         Benchmark    0.917  0.7555  0.8896  0.8545       0.9074       0.8246     47   10   98   8

Table S13. Statistics of proteome-wide C-, N- and O-linked glycosylation sites predicted at the 99% specificity level. The prediction was performed for the whole human proteome with a total of 84,843 proteins. The prediction results are available for download at http://www.structbioinfor.org/Lab/GlycoMine/.

Glycosylation type                         C-linked  N-linked  O-linked
Number of predicted glycosylated proteins  6876      5846      6358
Number of predicted glycosylation sites    18926     24174     97042
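Predicting "at the 99% specificity level" means fixing the score cutoff so that about 99% of known negative sites fall at or below it. The following is a minimal sketch of one way to pick such a cutoff, assuming a classifier that outputs a numeric score and a rule that calls a site positive when its score is strictly above the cutoff. The function name `threshold_at_specificity` and the toy scores are hypothetical and are not taken from GlycoMine.

```python
import math

def threshold_at_specificity(negative_scores, specificity=0.99):
    """Return a cutoff such that at least `specificity` of negatives score at or below it."""
    ordered = sorted(negative_scores)
    # Number of negatives that must fall at or below the cutoff.
    k = max(1, math.ceil(specificity * len(ordered)))
    return ordered[k - 1]

# Toy negative scores (hypothetical): 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
cutoff = threshold_at_specificity(scores, specificity=0.99)

# Specificity achieved on the same negatives with the "score > cutoff" rule.
achieved = sum(s <= cutoff for s in scores) / len(scores)
print(cutoff, achieved)
```

A stricter cutoff lowers the number of predicted sites, which is why counts such as those in Table S13 depend on the chosen specificity level.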

REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389-3402.
Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet., 25, 25-29.
Chan,W.M. and Consortium,U. (2010) The UniProt Knowledgebase (UniProtKB): a freely accessible, comprehensive and expertly curated protein sequence database, Genet. Res., 92, 78-79.
Chen,K. et al. (2009) Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs, J. Comput. Chem., 30, 163-172.
Chen,X. et al. (2013) Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites, Bioinformatics, 29, 1614-1622.
Dunker,A.K. and Obradovic,Z. (2001) The protein trinity - linking function and disorder, Nat. Biotechnol., 19, 805-806.
Dunker,A.K. et al. (2008) The unfoldomics decade: an update on intrinsically disordered proteins, BMC Genomics, 9.
Gupta,R. et al. (1999) O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins, Nucleic Acids Res., 27, 370-372.
Hornbeck,P.V. et al. (2012) PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse, Nucleic Acids Res., 40, D261-270.
Huang,Y. et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics, 26, 680-682.
Hunter,S. et al. (2012) InterPro in 2011: new developments in the family and domain prediction database, Nucleic Acids Res., 40, D306-312.
Iakoucheva,L.M. et al. (2004) The importance of intrinsic disorder for protein phosphorylation, Nucleic Acids Res., 32, 1037-1049.
Inoue,M. et al. (2001) High density O-glycosylation of the MUC2 tandem repeat unit by N-acetylgalactosaminyltransferase-3 in colonic adenocarcinoma extracts, Cancer Res., 61, 950-956.
Jensen,L.J. et al. (2009) STRING 8--a global view on proteins and their functional interactions in 630 organisms, Nucleic Acids Res., 37, D412-416.
Kawashima,S. et al. (2008) AAindex: amino acid index database, progress report 2008, Nucleic Acids Res., 36, D202-205.
Li,H. et al. (2009) SysPTM: a systematic resource for proteomic research on post-translational modifications, Mol. Cell Proteomics, 8, 1839-1849.
Li,T. et al. (2010) Identifying Human Kinase-Specific Protein Phosphorylation Sites by Integrating Heterogeneous Information from Various Sources, PLoS One, 5, e15411.
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta, 405, 442-451.
O'Shea,J.P. et al. (2013) pLogo: a probabilistic approach to visualizing sequence motifs, Nat. Methods, 10, 1211-1212.
Peng,H.C. et al. (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell., 27, 1226-1238.
Punta,M. et al. (2012) The Pfam protein families database, Nucleic Acids Res., 40, D290-301.
Saeys,Y. et al. (2007) A review of feature selection techniques in bioinformatics, Bioinformatics, 23, 2507-2517.
Song,J. et al. (2006) Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information, BMC Bioinformatics, 7, 124.
Song,J. et al. (2012) PROSPER: an integrated feature-based tool for predicting protease substrate cleavage sites, PLoS One, 7, e50300.
Trost,B. and Kusalik,A. (2011) Computational prediction of eukaryotic phosphorylation sites, Bioinformatics, 27, 2927-2935.
Wagner,M. et al. (2005) Linear regression models for solvent accessibility prediction in proteins, J. Comput. Biol., 12, 355-369.
Ward,J.J. et al. (2004) The DISOPRED server for the prediction of protein disorder, Bioinformatics, 20, 2138-2139.
Wixon,J. and Kell,D. (2000) The Kyoto encyclopedia of genes and genomes--KEGG, Yeast, 17, 48-55.