Detecting Temporal Protein Complexes Based on ...

1 downloads 0 Views 205KB Size Report
Closeness) and apply it to the time course protein interaction networks (TCPINs) to ... circle, while those outside it are in lilac dotted circle, which are represented ...
2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Detecting Temporal Protein Complexes Based on Neighbor Closeness and Time Course Protein Interaction Networks Xianjun Shen, Li Yi, Xingpeng Jiang, Yanli Zhao, Tingting He, Jincai Yang* School of Computer, Central China Normal University, Wuhan, China [email protected] Lichtenberg et al. analyzed the dynamics of protein complexes by integrating gene expression data into PPI network [8]. Tang et al. constructed time course protein interaction networks (TCPINs), which make the efficiency of protein complexes identification get significantly improved [9]. To better understand the molecular mechanism in cellular system, it is very important to devise algorithms of mining temporal protein complexes from dynamic PPI networks with more effectiveness. It has become a hot research topic to identify protein complexes based on local strategy [10]. Many researches on the properties of a single node in complex network focus on investigating its neighbor nodes instead of the node itself. An overwhelming majority of nodes within complex network have been found to interact intimately with their surrounding nodes rather than exist in isolation. From this perspective, we present a novel clustering method CNC (Clustering based on Neighbor Closeness) based on neighbor closeness to detect temporal protein complexes from time course protein interaction networks. CNC algorithm is proved to be reasonable in our experiments and it has better performance than the following five state-of-the-art algorithms in terms of matching degree and accuracy metric: Hunter [11], MCODE [12], CFinder [13], SPICI [14], and ClusterONE [15] algorithms. CNC algorithm obtains many protein complexes with strong biological significance from the time course protein interaction networks.

Abstract—The detection of temporal protein complexes would be a great aid in furthering our knowledge of the dynamic features and molecular mechanism in cell life activities. Inspired by the idea of that the tighter a protein’s neighbors inside a module connect, the greater the possibility that the protein belongs to the module, we propose a novel clustering algorithm CNC (Clustering based on Neighbor Closeness) and apply it to the time course protein interaction networks (TCPINs) to detect temporal protein complexes. Our novel algorithm has better performance on identifying protein complexes than five state-of-the-art algorithms—Hunter, MCODE, CFinder, SPICI, and ClusterONE—in terms of matching degree and accuracy metric, meanwhile it obtains many protein complexes with strong biological significance. Keywords—protein complexes; time course protein interaction networks; clustering coefficient; neighbor closeness

I.

INTRODUCTION

The emergence of large-scale protein-protein interaction (PPI) data has raised a hot wave of research on PPI networks in the post-genomic era. PPI network has the feature of modular structure—they have dense connections between the nodes within modules but sparse ones between the nodes in different modules [1]. A protein complex is a fundamental unit formed with highly connected proteins and often possesses specific biological functions [2]. Accumulated evidences suggest that protein complexes are involved in many disease mechanisms [3]. Tracking the protein complexes could reveal important insights into modular mechanisms and improve our understanding on the disease pathways [4]. So far, a great number of algorithms to identify protein complexes have been devised, such as CPM (Clique Percolation Method) and MCODE (Molecular Complex Detection) algorithms. In addition, other biology information such as gene expression data and GO semantic information have been integrated into many of the other protein complex detection methods [5, 6]. Although many of the state-of-the-art clustering methods perform well on identifying protein complexes, the inherent dynamics in cell life activities are often overlooked, which is mainly due to the fact that the PPI data derived from high throughput processing techniques could not enable us to discern any temporal signals. However, cellular systems are highly dynamic and responsive to the stimulus from external environment. Thus it has important implications in making a transition from the analyzing of static PPI networks to dynamic networks. Han JD et al. has proved the dynamically organized modularity in yeast PPI network [7]. De

978-1-4673-6799-8/15/$31.00 ©2015 IEEE

II. MATERIALS AND METHODS A. Definition of Neighbor Closeness As is shown in Fig.1, suppose that the nodes inside the black circle constitute an initial cluster C, and the red node is one of cluster C’s neighbor node, which is denoted by v. The neighbor nodes of v inside the cluster are in aqua dotted circle, while those outside it are in lilac dotted circle, which are represented by {NI v − c } and {NOv − c } respectively. For a node v, its neighbor closeness inside a cluster C is defined as

Figure 1. A schematic drawing of neighbor closeness.

109

expression profiles of 9335 probes under 36 different time points. The gene products involved in the gene expression data cover 98% of the proteins in the static PPI network. The reference dataset of known yeast protein complexes is derived from CYC2008, which is considered as the gold standard dataset. This dataset contains 236 protein complexes excluding the ones containing only 2 proteins.

(1), and that outside the cluster C is defined as (2). NCI v − c

∑ =

NCOv − c =

w∈{ NI v −c



deg( w) }

| NI v − c | w∈{ NOv −c }

deg( w)

| NOv − c |

(1)

(2)

where | NI v − c | and | NOv − c | are the number of nodes in {NI v − c } and {NOv − c } respectively; deg( w) denotes the number of node w’s neighbors inside and outside the cluster C in (1) and (2) respectively. If NCI v − c > NCOv − c then node v is believed to be more intimate with the node group of {NCI v − c } than {NCOv − c } and vice versa.

B. Comparison with the Known Protein Complexes Overlapping Score (OS) [16] (3) is typically used to assess the match degree between a predicted protein complex pc and a known protein complex kc:

B. CNC Algorithm Below explains CNC algorithm in detail. 1) Formation of initial cluster. From a topological view, proteins in a complex core often have many interacting partners and the cores often correspond to small and dense subgraphs in a PPI network, explaining they have higher clustering coefficient which quantifies how close the neighbors of a vertex are to being a clique. We first calculate the clustering coefficient for each protein node in PPI network, and then every node with high clustering coefficient is consolidated with its neighbors to form an initial cluster, namely the core of a protein complex. 2) Extension of initial cluster. For each neighbor node v of an initial cluster C, we calculate its neighbor closeness inside and outside the cluster. If NCI v − c > NCOv − c then node v is regarded as a candidate node to be merged into the cluster C. Among all the candidate nodes, we give priority to the one that with the maximum difference value between NCI v − c and NCOv − c to merge into the cluster C, which is mainly because greater difference value means the node is more inclined to participate into the cluster with higher neighbor closeness. After the extension, each cluster is deemed to be a final protein complex. 3) Remove redundant protein complexes. For the protein complexes that completely overlap with each other, only one is retained while the others are removed as redundancy. Time course protein interaction networks we used include 36 temporal protein interaction networks, each of which consists of the active proteins at a time point and their interactions in the original static PPI network [9]. CNC algorithm is performed individually on these temporal protein interaction networks, thereby generating 36 predicted protein complex sets as our final result.

where | x | represents the number of the proteins involved in complex x. Two protein complexes are considered to be matching if their overlapping score is greater than or equal to a given threshold, which is set to 0.2, the same as many other researches [16]. Particularly, OS ( pc, kc) = 1 indicates that the two complexes pc and kc match perfectly. The recognition capability of various algorithms on 36 protein interaction networks is illustrated in Fig.2, from which two conclusions could be drawn: First, though each protein interaction network is different, the numbers of known protein complexes recognized by various algorithms separately fluctuate in a narrow range, explaining each algorithm has a feature of stability on different protein interaction networks. Second, CNC algorithm has stronger recognition capability than the other classic algorithms—it recognizes averagely 97.3 protein complexes while ClusterONE, SPICI, CFinder, MCODE and Hunter algorithms recognize 94.2, 88.0, 48.7, 24.5 and 21.8 ones, respectively. Considering the stability of each algorithm on different temporal protein interaction networks, further performance comparison of various algorithms on a single protein interaction network can be implemented reasonably. We find

Number of matched known complexes

OS( pc,kc ) =

III. RESULTS AND ANALYSIS A. Experimental Datasets We use yeast PPI network derived from DIP, which includes 21788 interactions among 4950 distinct proteins after removing the self-interactions and repeated ones. Gene expression data over three successive metabolic cycles are available from GEO (Gene Expression Omnibus) with accession number GSE3431. This dataset includes the

110 100 90 80 70 60 50 40 30 20 10 0

CNC SPICI MCODE

| pc ∩ kc |2 | pc | × | kc |

(3)

ClusterONE CFinder Hunter

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 TCPINs Figure 2. Recognition ability of different algorithms on time course protein interaction networks.

110

that our CNC algorithm obtains the greatest number (namely 497) of predicted protein complexes from the temporal protein interaction network at time point 9. Thus we take an example of this time point for our further analysis. As is shown in Fig.3, the number of matched known protein complexes of CNC algorithm is dramatically higher than that of the other algorithms when OS threshold ranges from 0.2 to 0.6. Particularly, set OS threshold to 0.2, CNC algorithm obtains 97 as its number of matched known protein complexes, which is 10%, 23%, 126%, 259% and 439% greater than that achieved by SPICI, ClusterONE, CFinder, MCODE, and Hunter algorithms, respectively. To conclude, CNC algorithm is capable to detect protein complexes more effectively than the other five classic algorithms.

Number of matched known complexes

100

n

m

Acc = Sn × PPV

m

n

40 20

1

Figure 3. Amounts of known protein complexes recognized by different algorithms on the temporal protein interaction network at time point 9 under varying OS threshold.

0.55

CNC SPICI Hunter

ClusterONE CFinder MCODE

0.5

(4)

PPV = ∑ j =1 max in=1 t (i, j )/ ∑ i =1 ∑ i =1 t (i, j )

60

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 overlapping score threshold

(5) (6)

As is shown in Fig.4, the average accuracies of CNC, ClusterONE and SPICI algorithms are 0.492, 0.477 and 0.473, respectively, while the others are 0.377, 0.334 and 0.282. In a word, the accuracy of our novel algorithm is dramatically higher than that of the other classic algorithms. Afterwards, as was stated in the previous section, further analysis of accuracy metric can be implemented reasonably on the result of time point 9. As is shown in table 1, #complexes represents the total number of identified protein complexes, #matched denotes the amount of the identified protein complexes which match with at least one known protein complex, and #PM is the number of perfectly matched protein complexes. CNC algorithm obtains the most protein complexes 497, nearly half of which match with known protein complexes, while the fractions of matched identified protein complexes of ClusterONE, SPICI and CFinder algorithms are far less than that of CNC algorithm. Although MCODE and Hunter algorithms obtain relatively higher fractions of matched identified protein complexes, they achieve only a few protein complexes in total. It can be seen that MCODE algorithm has a relatively bad performance compared with other algorithms in terms

Accuracy

n

80

0

C. Accuracy metric The harmonic mean of Sn (Sensitivity) and PPV (Positive Predictive Value), also known as Acc (Accuracy metric), is typically used to assess the overall performances of various algorithms [17]. Sn and PPV are calculated based on a matching matrix T of which the number of rows and columns (separately denoted as n and m) represent the number of known protein complexes and predicted protein complexes respectively while the element t( i , j ) denotes the number of proteins involved in both the i-th known protein complex and the j-th predicted protein complex. Let n( i ) denote the number of proteins involved in the i-th known protein complex, then Sn, PPV and Acc can be defined as (4), (5) and (6), respectively. Sn = ∑ i =1 max mj=1 t (i, j )/ ∑ i =1 n(i)

CNC ClusterONE SPICI CFinder MCODE Hunter

0.45 0.4 0.35 0.3 0.25 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 TCPINs

Figure 4. Predictive accuracy comparison of different algorithms on 36 transient protein interaction networks.

of Sn, PPV and Acc. In addition, we find a common feature shared by Hunter, CFinder, SPICI and ClusterONE algorithms—they have either higher Sn values but lower PPV values or higher PPV values but lower Sn values, which finally causes lower Acc values. However, in our novel algorithm there exists a balance between Sn and PPV values. Although it does not obtain the highest Sn or PPV, the greatest Acc 0.489 is achieved, which further suggests our CNC algorithm outperforms the other five classic algorithms.

D. Analysis of Function Enrichment To evaluate the statistical significance of the identified protein complexes, many researchers annotate their main biological functions by using p-value formulated as (7). Given a predicted protein complex containing C proteins, p-value calculates the probability of observing k or more

111

ACKNOWLEDGMENTS

TABLE 1. ACCURACY COMPARISON OF VARIOUS ALGORITHMS Algorithms #complexes #matched #PM

Sn

PPV

Acc

CNC

497

234

3

0.424

0.563

0.489

ClusterONE

240

84

2

0.327

0.651

0.461

SPICI

291

70

2

0.356

0.600

0.464

CFinder

137

41

5

0.556

0.246

0.370

MCODE

52

21

0

0.224

0.365

0.286

Hunter

22

17

2

0.189

0.552

0.323

This research is supported by the National Natural Science Foundation of China (No. 61532008) and the International Cooperation Project of Hubei Province (No. 2014BHE0017), the Self-determined Research Funds of CCNU from the Colleges’ Basic Research and Operation of MOE (No. CCNU14A02008, CCNU15ZD003, CCNU14A05016). REFERENCES [1]

proteins from the complex by chance in a biological function shared by F proteins from a total genome size of N proteins [18]: k −1 ⎧ F ⎪⎛ ⎞⎛ N − F ⎞ ⎛ N ⎞ ⎫⎪ p − value = 1 − ∑ ⎨⎜ ⎟⎜ ⎟ / ⎜ ⎟⎬ ⎪⎝ i ⎠⎝ C − i ⎠ ⎝ C ⎠ ⎭⎪ i =0 ⎩

[2] [3]

(7) [4]

The lower the p-value is, the stronger biological significance the complex possesses. Experimental results show that only 1.9% of protein complexes without biological significance (namely with p-value greater than 0.01), while 88.5% and 54.8% of protein complexes with p-value less than E-05 and E-10 respectively, indicating they have strong biological significance. Table 2 provides 5 protein complexes with strong biological significance identified by CNC algorithm; moreover, we find many predicted protein complexes with cluster frequency of 100 percent, such predicted protein complexes are probably real protein complexes, which provide meaningful references to the relate researchers.

[5]

[6]

[7] [8] [9]

IV. CONCLUSION In this paper, we develop a novel clustering algorithm CNC and apply it to the time course protein interaction networks to detect temporal protein complexes. In CNC algorithm, whether or not a protein is assigned into a cluster is determined by the relationship between the neighbor closeness inside the cluster and that outside the cluster. Experimental results show that many protein complexes with strong biological significance are detected by our novel algorithm. Moreover, in terms of matching degree and accuracy metric, CNC algorithm has a better performance than other five state-of-the-art algorithms.

[10]

[11]

[12] [13] [14]

TABLE 2. FIVE PREDICTED PROTEIN COMPLEXES WITH SMALL P-VALUES Index

P-value

162

9.25E-36

282

3E-33

79

5.03E-32

159

7.06E-32

296

1.62E-31

Cluster frequency 15 out of 22 genes, 68.2% 20 out of 22 genes, 90.9% 14 out of 21 genes, 66.7% 21 out of 27 genes, 77.8% 28 out of 39 genes, 71.8%

[15] Gene Ontology term tRNA transcription from RNA polymerase III promoter | AmiGO mRNA splicing, via spliceosome | AmiGO

[16] [17]

mRNA polyadenylation| AmiGO mRNA splicing, via spliceosome | AmiGO proteolysis involved in cellular protein catabolic process |AmiGO

[18]

112

Rives AW and Galitski T. “Modular organization of cellular networks,” Proc Natl Acad Sci USA, 2003, 100:1128-1133. Lin CY, Lee TL, Chiu, YY, et al. “Module organization and variance in protein-protein interaction networks,” Sci Rep, 2015, 5: 9386, DOI: 10.1038. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. “Associating genes and protein complexes with disease via network propagation,” PLoS Computational Biology. 2010, 6: e1000641. Yu H, Lin CC, Li YY, Zhao Z. “Dynamic protein interaction modules in human hepatocellular carcinoma progression,” BMC Systems Biology 2013, 7(5):1–13. Feng J, Jiang R, Jiang T. “A max-flow-based approach to the identification of protein complexes using protein interaction and microarray data,” Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 2011, 8(3): 621-634. Lubovac Z, Gamalielsson J, Olsson B. “Combining functional and topological properties to identify core modules in protein interaction networks,” Proteins: Structure, Function, and Bioinformatics, 2006, 64(4): 948-959. Han J D J, Bertin N, Hao T, et al. “Evidence for dynamically organized modularity in the yeast protein–protein interaction network,” Nature, 2004, 430(6995): 88-93. De Lichtenberg U, Jensen L J, Brunak S, et al. “Dynamic complex formation during the yeast cell cycle,” science, 2005, 307(5710): 724-727. Tang X, Wang J, Liu B, Li M, Chen G Pan Y. “A comparison of the functional modules identified from time course and static PPI network data,” BMC bioinformatics 2011, 12(1):339. He T, Yang J, Li Y, et al. “An Integrated Approach to Identify Protein Complex Based on Best Neighbor and Modularity Increment,” 2013 IEEE International Conference on Bioinformatics and Biomedicine. IEEE, 2013: 57-60. Chin C H, Chen S H, Ho C W, et al. “A hub-attachment based method to detect functional modules from confidence-scored protein interactions and expression profiles,” BMC bioinformatics, 2010, 11(Suppl 1): S25. Bader G D, Hogue C W V. “An automated method for finding molecular complexes in large protein interaction networks,” BMC bioinformatics, 2003, 4(1): 2. Adamcsek B, Palla G, Farkas I J, et al. “CFinder: locating cliques and overlapping modules in biological networks,” Bioinformatics, 2006, 22(8): 1021-1023. Jiang P, Singh M. “SPICI: a fast clustering algorithm for large biological networks,” Bioinformatics, 2010, 26(8): 1105-1111. Nepusz T, Yu H, Paccanaro A. “Detecting overlapping protein complexes in protein-protein interaction networks,” Nature methods, 2012, 9(5): 471-472. Li M, Wu X, Wang J, Pan Y. “Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data,” BMC Bioinformatics 2012, 13:109. Brohee S, Van Helden J. “Evaluation of clustering algorithms for protein-protein interaction networks,” BMC bioinformatics, 2006, 7(1): 488. Li X, Foo C, Ng S. “Discovering protein complexes in dense reliable neighborhoods of protein interaction networks,” IEEE Computer Society Bioinformatics Conference - CSB, 2007.

Suggest Documents