Detecting Essential Proteins Based on Network ... - IEEE Xplore

7 downloads 0 Views 9MB Size Report
Gene Ontology Information. Wei Zhang, Jia Xu, Yuanyuan Li and Xiufen Zou. Abstract—The identification of essential proteins in protein-protein interaction (PPI) ...
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

1

Detecting Essential Proteins Based on Network Topology, Gene Expression Data and Gene Ontology Information Wei Zhang, Jia Xu, Yuanyuan Li and Xiufen Zou Abstract—The identification of essential proteins in protein-protein interaction (PPI) networks is of great significance for understanding cellular processes. With the increasing availability of large-scale PPI data, numerous centrality measures based on network topology have been proposed to detect essential proteins from PPI networks. However, most of the current approaches focus mainly on the topological structure of PPI networks, and largely ignore the gene ontology annotation information. In this paper, we propose a novel centrality measure, called TEO, for identifying essential proteins by combining network topology, gene expression profiles and GO information. To evaluate the performance of the TEO method, we compare it with five other methods (degree, betweenness, NC, Pec, CowEWC) in detecting essential proteins from two different yeast PPI datasets. The simulation results show that adding GO information can effectively improve the predicted precision and that our method outperforms the others in predicting essential proteins. Index Terms—Protein-Protein Interaction Network, Essential proteins, Gene Ontology, Gene Expression Profile.

F

1

I NTRODUCTION

E

SSENTIAL

proteins play a vital role in maintaining cellular life. Identification of essential proteins, on the one hand, can help us understand the basic requirements to sustain a life form because proteins are indispensable to life [1]; on the other hand, doing so can provide insights for the identification of drug targets because the essential proteins (or genes) in pathogenic organisms may be potential targets of new antibiotics [2]. Identifying essential genes is crucial to gaining insights into biological processes. Traditionally, the identification of essential genes is performed via biological experiments, such as single gene knockouts [3], RNA interference [4], and transposon mutagenesis [5]. However, identifying essential proteins by using these experimental methods is time consuming and expensive. The rapid development of modern high-throughput technologies has accumulated large quantities of omics data, which provide new opportunities for predicting essential proteins (genes), protein complexes, functional module and gene function [6] from large molecular networks. A series of computational approaches have been developed to predict essential • W. Zhang is with the School of Science at East China Jiaotong University in Nanchang, China. E-mail: wzhang [email protected]. • J. Xu is with the School of Mechatronic Engineering, at East China Jiaotong University in Nanchang, China. • Y. Li is with the School of Mathematics and Statistics at Wuhan University in Wuhan, China and with the School of Science at Wuhan Institute of Technology in Wuhan,China. • X. Zou is with the School of Mathematics and Statistics at Wuhan University in Wuhan, China.

proteins based on the properties of PPI networks [7], [8], [9], [10], [11], [12]. For example, the widely used classic methods, such as the degree centrality method (DC) [7] and the betweenness centrality method (BC) [8] are based on network topology. By considering the modular nature of proteins essentiality through calculating the edge clustering coefficient, a new centrality method called the NC method has been proposed to detect essential proteins [9]. The simulation results show that the NC method outperforms other classic centrality measures. Most approaches focus on the topological features of the PPI network; however, experimental data are inherently noisy and often contain a large number of spurious interactions, even when the data originated from well-known organisms, such as Saccharomyces cerevisiae [13], [14], [15]. For example, the falsepositive rate of data obtained through Yeast Two Hybrid(Y2H) analysis could be as high as 64%, and the false-negative rate can vary from 43% to 71% [16]. Additionally, Sprinzak et al. [17] showed that the reliability of high-throughput Y2H assays was around 50%. Therefore, most network-based methods are very sensitive to the network being used. In recent years, several approaches have been proposed that integrated Protein Protein Interaction (PPI) topology with other biological knowledge to minimize the undesirable effects of noisy data, and these multi-information fusion measures were effective at predicting essential proteins [18], [19], [20], [21], [22]. Kim et al. [23] proposed a novel algorithm (MCGO) that used network motifs for centrality measurements and Gene Ontology(GO) to prune uninformative edges. Network motifs are considered the ba-

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

sic building blocks of biological networks and have proven useful in predicting protein complexes in PPI networks [24]. The experimental results of MCGO analysis of yeast PPI data from the DIP showed that this method performed significantly better than DC and sum of edge clustering coefficient (SoECC) [25]. Li et al. [26] developed a new method, called Pec, by integrating edge clustering coefficient and the correlation coefficient of gene expression data. The simulation results obtained using a PPI network from Saccharomyces cerevisiae showed that Pec was robust and exceeded fifteen previously proposed methods at predicting essential proteins. Recently, Zhang et al. [27] designed a new method called Co-Expression Weighted by Clustering coefficient(CoEWC), which captures the co-clustering relationship between the proteins as well as the coexpression of interacting proteins. These two methods (Pec and CoEWC) are based on a combination of network topology and gene-expression data, and the simulation results obtained using gold-standard network data revealed that both methods outperformed network topology-based centrality methods, including DC, BC, SoECC and local average centrality (LAC) [29]. Because GO annotations provide valuable information and a convenient way to study gene-functional similarity, it has been shown that adding the semantic similarity of GO terms could improve the precision of predicted protein complexes [28], [30] and disease gene prioritization [32]. Although these methods improved the accuracy of predicting essential proteins, they focused mainly on only one (topology structure) or two types of data (network topology and gene-expression data or GO-annotation data), few methods have combined these three types of data [31]. In this study, we combine these three types of data by considering of ”essential properties” and propose a new method for predicting essential proteins, called TEO. To evaluate the performance of this method, we apply it to two benchmark yeast PPI datasets and compare the results with those of other methods. Our results show that adding GO functional-annotation data improves performance. Further, comparative simulation results show that the TEO achieves comparable or higher precision in predicting of essential proteins compared to state-of-the-art methods. This paper is structured as follows: in section 2, we present the new method for predicting essential proteins, two Saccharomyces cerevisiae test protein interaction networks, gene-expression profiles and reference datasets of essential proteins retrieved from previously published work; in section 3, extensive numerical simulations and comparisons are described to evaluate the performance of the proposed method; and conclusions are presented in section 4.

2

2

M ETHODS

2.1 The new method based on combining topological information, gene expression profiles and GO data. TEO is proposed based on the following considerations: (1). Essential proteins always form densely connected modules. (2). Essential proteins in the same cluster are more likely to be co-expressed. (3). Essential proteins in the same cluster have a stronger functional similarity. To describe the new proposed method clearly, the following definitions and descriptions are presented: Given a PPI network with N proteins, we represent the PPI network with an undirected graph G = (V, E), where the set of vertices V represents protein, and the set of edge E denotes the sets of interactions between proteins pairs. To assess the density of two connected nodes, u and v, in a subgraph, we applied the edge clustering coefficient [33], which is defined by the following formula Ecc(u, v) =

3 Nu,v min(du − 1, dv − 1)

(1)

where du and dv are the degrees of nodes u and 3 v, respectively, and Nu,v represents the number of triangles constituted by edge (u, v) in the network. Because co-clustered proteins tend to also be coexpressed, we used the Pearson correlation coefficient (PCC) to measure co-expressed protein pairs, PCC is a widely used measure of the strength of correlation between two variables of linear dependence. The PCC of a pair of genes (X and Y) is defined as: P CC(X, Y ) =

n 1 ∑ Xi − mean(X) Yi − mean(Y ) ( )( ) n − 1 i=1 std(X) std(Y )

(2)

where n is the number of samples of geneexpression data and Xi is the expression level of gene X in sample i. The PCC of a pair of proteins (u and v) is defined the same as the PCC of their corresponding gene pairs. The PCC values ranges from -1 to 1, with the larger of the PCC values between two considered proteins suggesting that they are more likely to be both in the same cluster and functionally similar. To evaluate the functional similarity between two considered proteins, we adapt GO similarity to qualify the similarity between two proteins. GO aims to standardize the annotation of genes and gene products across species and provides rich information for describing biological properties of gene products. GO provides valuable information and a convenient way to study the functional similarities of gene which could improve the precision of constructed networks. GO consists of three sub-ontologies: Biological Process (BP), Cellular Component (CC), and Molecular

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

function (MF) [34], [35]. The three GO categories are widely used in predicting gene functional associations, and semantic similarity is used as an indicator for the plausibility of an interaction. To compute the semantic similarity between GO terms annotated to proteins in an interaction network, we adapt the measure introduced by Wang et al. [36]. We computed three types of integrations, i.e., CC+PPI, MF+PPI, BP+PPI, and define the GO similarity between two connected proteins as ∑ (Su (t) + Sv (t)) t∈Tu ∩Tv ∑ GO sim(u, v) = ∑ (3) Su (t) + Sv (t) t∈Tu

t∈Tv

where Su (t) is the S-value of GO term t related to term u and Sv (t) is the S-value of GO term t related to term v. For more details, please refer to [36]. Based on the definitions of Ecc, the Gene Ontology similarity (GO sim) and PCC, a new centrality method named TEO is proposed. The essential nature of a protein is determined by the probability that the protein co-cluster, co-expresses, and shares similar functions with its neighbors. For a protein u, the essentiality T EO(u) is defined as the probability between the Ecc and combined GO information with PCC gene expression profiles: T EO(u) =



Ecc(u, v) × (GO sim(u, v) + P CC(u, v))

v∈N (u)

(4)

where N (u) denotes the set of all neighbors of node u. 2.2

Data sources

To evaluate the performance of the new method, we focus our analysis on well-known Saccharomyces cerevisiae PPI data and protein-essentiality data. We use two sets of PPI data. The first was obtained from the DIP database [37], containing 5,093 proteins and 24,743 interactions after filtering the duplicate interactions and self-interactions. The second set of PPI data was described in previously published work [38]. After removing redundant interactions and isolated proteins, we obtained a combined PPI network with 4,928 proteins and 17,201 interactions. The gold standard essential-proteins dataset contains 1285 essential proteins and was obtained from the following databases: MIPS [39], SGD [40], DEG [41], and SGDP [42]. The gene expression data of Saccharomyces cerevisiae were obtained from previously published [43], and the gene expression profile contains 6,777 gene products and data from 36 time points. The Gene ontology annotation data for Saccharomyces cerevisiae gene products were obtained from the GO Consortium [35] and were released on March 4, 2015. First, we matched the

3

GO identification of the proteins in the above PPI datasets and converted the protein interactions into their corresponding GO interaction dataset with the three sub-ontologies (BP, CC, and MF). We then, adapted the GO similarity measure [36] to assess the similarities between interactions of proteins in the PPI dataset against the three sub-ontologies. For proteins that had no corresponding GO identification information, we set the similarity of the interactions containing these proteins to zero. 2.3 Evaluation measures To evaluate the efficiency of the proposed method, we compared it with five other centrality measures for prediction accuracy. The first measure collected was the number of true essential proteins predicted in the top N. Accuracy was defined as the fraction of the number of true essential proteins.

3

RESULTS

In this section, we test new method on two yeast PPI network datasets and systematically evaluate its performance against five other centrality measures: DC, BC, NC, Pec, and CoEWC. 3.1 Comparison of detection of essential protein using five methods To investigate the benefit of adding GO functional annotations to our method, we compute the simulation results according to the three different sub-ontologies (BP, CC, and MF) and compare these results to those of five other methods. The GO similarity values for BP, MF, and CC against the two datasets are provided in Supplementary Table S1 and S2. We rank the proteins in descending order according to the values associated with these centralities and collect the number of true essential proteins among the top 200, 400 and 600 predicted candidate proteins. Figure 1 shows the comparative results of the number of true essential proteins from the DIP network dataset under the top 200, 400 and 600 predicted ranks using these methods. Compared the five state-of-theart methods, the proposed TEO method performs better than each of the other methods, except Pec, when using the three kinds of sub-ontologies. The TEO method using the CC sub-ontology identified slightly fewer essential proteins compared to the Pec method, however, TEO provided increased improvement over the Pec method when combining BP and MF sunontologies. We apply all six methods to the combined PPI network dataset to compare their performance in detecting the number of essential proteins. As shown in Figure 2, TEO dominated the other methods and achieved a higher accuracy in predicting essential proteins.

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

4

Fig. 1. Comparison of the number of essential proteins predicted by TEO and five other methods under the three sub-ontologies and using the DIP PPI data.

Fig. 2. Comparison of the number of essential proteins predicted by TEO and five other methods under three sub-ontologies and using the combined PPI data.

3.2 Comparison of the six methods under jackknife methodology

3.3 Analysis of the differences between TEO and the compared methods

To compare the proposed method with the other centrality measures (DC, BC, NC, Pec, and CoEWC), we adapt the jackknife methodology [45] and plot the number of top-ranked proteins against the cumulative number of essential proteins. Additionally, several random assortments are plotted for comparison. The panels in Figure 3 show the comparison results based on the DIP network and combined networks. As shown in Figure 3, the TEO sorting curve associated with fewer than three sub-ontologies is higher than that observed for the five other measures against both network when the number of top ranked protein is greater than 400. Moreover, TEO combined with the BP annotation consistently achieves the best results, suggesting that the BP ontology function contributes more important information to the prediction of essential proteins and provides a strong argument in favor of using TEO with the BP sub-ontology.

To further demonstrate that TEO (under the BP subontologies) performs well in the prediction of essential proteins, we analyze the difference between TEO and the five other methods by attempting to predict essential proteins from a smaller dataset (the top 200 proteins). The number of overlaps in the top 200 proteins identified by TEO and other centrality methods, Mi, is denoted as |T EO ∩ Mi |. |Mi − T EO| denotes the number of proteins that are identified by Mi but not TEO. Similarily |T EO − M i| denotes the number of proteins that are identified by TEO but not Mi . The number of essential and non-essential proteins predicted in the intersection and set difference is identified by TEO and the five centrality methods from the top 200 proteins are shown in Table 1. The number of common proteins identified by TEO, DC, and BC is relatively small when using the two PPI datasets,

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

5

TABLE 1 The number of essential and non-essential proteins predicted to be in the intersection and set difference as identified by TEO and five other methods using the top 200 protein dataset. Dataset

DIP

Combined

Methods (Mi )

|T EO ∩ Mi |

|Mi − T EO|

DC BC NC Pec CoEWC DC BC NC Pec CoEWC

60 54 108 130 74 73 54 7 127 106

140 146 92 70 126 127 146 193 73 94

Number of Essential proteins |Mi − T EO| |T EO − Mi | 44 115 44 120 47 69 40 55 42 90 38 103 45 118 33 151 41 55 49 67

Number of Non-essential proteins |Mi − T EO| |T EO − Mi | 96 25 102 26 45 23 30 15 84 36 89 24 101 28 160 42 32 18 45 27

Fig. 3. The performances of TEO and the six other centrality measures on the DIP and combined PPI using a jackknife methodology. (A)Comparing results on the DIP PPI network; (B)Comparing results on the combined PPI network whereas the number of common proteins identified by TEO, Pec, and CoEWC is larger. Additionally, the number of common proteins identified by TEO and NC is larger when using the DIP dataset than when using the combined dataset. These results indicate that TEO is different from the classic topology based methods. Further, we can observe that the number of essential proteins identified by TEO but not the other methods (more than 71% in total), which is greater than those identified by other methods but not TEO (less than 57% in total). Similarly, we compare the number of non-essential proteins in the sets of T EO − Mi and Mi − T EO, and find that the number of non-essential proteins identified by TEO but not the other methods (less than 27% in total), which is less than identified by the other methods but not TEO (more than 47% in total). These results suggested that TEO is different to state-of-the-art methods and is more effective at predicting essential proteins. Additionally, although the TEO and Pec predictions may contain large overlaps because these considered methods have a common background, the number of

essential proteins identified by TEO and not Pec is larger than the number of essential proteins identified by Pec and not TEO. Furthermore, the number of non-essential proteins identified by TEO and not Pec is smaller than the number of non-essential proteins identified by Pec and not TEO when using the two test networks. These results suggest that the GO terms contribute to the improved predictive results observed for the TEO method. In Table 2, we present the proteins that are predicted by TEO but are not included in the other five centrality measurements of the top 200 proteins in the DIP PPI dataset. The rank in TEO, protein name, degree, and value of NC, CoEWC, Pec and essentiality are all shown in Table 2. There are 42 proteins in Table 2, 85.7% of which are essential. Similarly, in Table 3, we show the proteins that are predicted by TEO but are not included in the other five centrality measurements of the top 200 proteins in the combined PPI dataset. The rank in TEO, protein name, degree, and value of NC, CoEWC, Pec and essentiality are all shown in Table 3. There are 38 proteins in the Table 3, 78.9% of which are essential.

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

TABLE 2 List of 42 proteins predicted as essential by TEO but not by five other centrality measures using the top 200 proteins from the DIP PPI dataset. Rank protein ess Degree 54 YDL087C yes 24 59 YDR195W no 18 62 YOL051W no 24 63 YGR013W yes 19 66 YDR301W yes 18 68 YKR002W yes 20 71 YLR277C yes 19 78 YMR213W yes 24 80 YKL059C yes 17 83 YNL317W yes 27 86 YLL036C yes 15 87 YBL093C no 26 89 YLR298C yes 15 95 YPR107C yes 15 98 YLR115W yes 20 99 YGL226C-A no 8 103 YGR156W yes 15 112 YDL005C no 27 113 YDL232W yes 19 122 YKL173W yes 28 123 YIL021W yes 27 125 YOR085W no 11 127 YGL022W yes 9 130 YJR093C yes 19 134 YMR061W yes 23 137 YDR473C yes 13 152 YLR275W yes 18 153 YMR314W yes 17 156 YPR178W yes 38 158 YDR378C yes 22 160 YDR240C yes 11 164 YOR103C yes 9 177 YJL002C yes 21 179 YPR182W yes 36 180 YBR060C yes 9 182 YDR416W yes 17 188 YOL149W yes 21 191 YBR152W yes 20 192 YIL004C yes 20 195 YNL172W yes 12 197 YER148W yes 38 199 YER029C yes 24 Abbreviations: ess=essentiality.

4

NC 12.672 12.597 15.132 13.058 12.911 14.172 15.296 15.567 12.768 11.807 10.893 12.623 11.662 11.595 15.078 8.857 13.143 13.089 10.008 10.409 13.319 8.968 7.500 12.750 10.512 8.750 8.926 8.583 12.148 14.868 9.100 10.018 9.679 14.273 8.018 9.018 10.393 8.994 9.430 10.566 14.205 9.089

CoEWC 2.502 1.968 0.963 1.382 1.044 1.832 1.711 1.909 0.699 1.375 2.126 1.105 2.642 0.664 -0.365 2.451 1.164 0.926 1.918 1.275 1.114 2.717 2.451 0.824 0.724 1.631 1.288 2.674 0.384 1.069 2.174 0.980 1.214 0.230 2.780 1.106 1.760 1.122 1.615 1.714 0.982 1.888

Pec 2.275 0.928 0.541 0.792 -0.019 1.253 1.091 2.145 -0.038 0.612 1.740 1.570 1.402 -0.004 -1.125 1.108 0.617 1.067 2.013 0.915 0.979 1.737 1.964 0.102 0.117 0.977 0.724 2.279 -0.437 0.794 1.009 0.054 0.852 -0.603 1.902 1.281 1.412 0.675 1.541 0.976 1.171 1.631

CONCLUSION

The topology of molecular networks can reveal essential principles that are associated with many cellular processes and biological functions [46], [47], [48]. Given the large amounts of PPI data obtained from high-throughput techniques, identifying essential proteins at the PPI network level is becoming increasing important. A lot of centrality methods have been proposed for predicting essential proteins in biological networks. Due to the unreliability of the available PPI networks and given that most of the existing methods depend on a network’s topological properties, the accuracy of these methods in predicting essential proteins is relatively low and is sensitive to the accuracy of the network. The three different types of available data (network topological, gene expression profiles and functional properties) may reflect the different types of biological properties and are meaningful for predicting essential proteins. In this paper, we propose a novel method (TEO)

6

TABLE 3 List of 38 proteins predicted as essential by TEO, but not by five other centrality measures using the top 200 proteins from the combined PPI dataset. Rank protein ess Degree 42 YDR195W no 18 46 YDR301W yes 18 61 YNL317W yes 19 62 YKL059C yes 17 67 YLL036C yes 15 68 YLR298C yes 15 69 YPR107C yes 15 75 YGL226C-A no 8 79 YGR156W yes 13 87 YDL232W yes 19 88 YBL093C no 26 95 YMR061W yes 23 98 YOR085W no 11 100 YGL022W yes 9 102 YJR093C yes 19 121 YLR275W yes 18 124 YPR178W yes 23 129 YDR240C yes 11 133 YOR103C yes 9 137 YKL173W yes 26 141 YBR060C yes 9 146 YDR404C yes 15 147 YOR151C yes 17 155 YJL002C yes 21 156 YDR473C yes 12 158 YIL004C yes 20 159 YMR314W yes 15 161 YDR416W yes 17 164 YGR104C no 13 167 YPR162C yes 7 170 YNL172W yes 11 174 Q0085 no 8 175 YPL010W yes 7 178 YNL288W no 14 182 YLL004W yes 8 185 YOL149W yes 19 186 YNL147W yes 23 194 YPL178W no 22 Abbreviations: ess=essentiality.

NC 12.680 13.013 11.482 12.833 10.893 11.723 11.821 8.857 12.550 10.008 12.795 10.830 8.968 7.500 12.908 8.926 11.592 9.100 10.018 9.017 8.310 8.598 8.105 9.892 7.818 9.799 7.686 8.990 9.250 8.167 10.311 8.310 7.833 8.457 8.310 9.715 12.324 9.817

CoEWC 1.928 1.034 1.474 0.692 2.126 2.642 0.656 2.451 1.275 1.918 0.977 0.619 2.717 2.451 0.814 1.288 0.370 2.174 0.980 1.197 2.780 2.217 2.807 1.214 1.394 1.615 2.692 1.085 1.288 3.172 1.743 1.924 2.249 2.299 2.743 1.556 0.098 1.102

Pec 0.946 0.027 0.453 -0.031 1.740 1.419 0.075 1.108 0.562 2.013 1.496 0.158 1.737 1.964 0.070 0.724 -0.559 1.009 0.054 0.859 2.058 1.980 2.111 0.843 0.759 1.628 2.071 1.233 0.496 1.718 0.918 0.865 0.943 1.497 1.588 1.016 -0.250 0.779

that improves the accuracy of essential protein prediction, by combining PPI data, gene-expression profiles and GO annotation data. TEO is based on the consideration of three types of ”essential properties”: (1) characterizing the co-module by using the edge clustering coefficient of the network’s topological structure; (2) evaluating co-expression patterns by calculating the Pearson correlation coefficient between two considered gene-expression profiles; and (3) measuring the functional similarity of two interacting proteins by calculating the GO similarity based on GO annotation terms. To evaluate the performance of the new method, we conduct simulation experiments using two Saccharomyces cerevisiae PPI networks and compare TEO with five other state-of-the-art methods. The simulation results show that the performance of TEO is comparable to or better than that of the five other methods at predicting essential proteins from both PPI networks. Therefore, the precision in identifying essential proteins based on the integration of additional biological knowledge is greater than based on the origin of the PPI network. Compared with the integration of CC and MF annotation data, the

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

integration of BP annotation data can more effectively aid in the prediction of essential proteins. Although TEO performs well at essential protein detection, there remains room for improving its prediction accuracy. The reasonable integration of highthroughput data and continued design of refined methods will lead to promising results in detecting essential proteins, and core protein complexes for the study of biological networks.

5

S UPPORTING I NFORMATION

Table S1 The GO similarly value for BP, MF and CC under the DIP PPI dataset. Table S2 The GO similarly value for BP, MF and CC under the combined PPI dataset.

ACKNOWLEDGMENTS This work was supported by the Natural Science Foundation of Jiangxi Province (No.20161BAB211022), the Major Research Plan of the National Natural Science Foundation of China (No. 91530320), the Chinese National Natural Science Foundation (No. 61672388) and the Research Foundation of Hubei Province Department of Education (No. Q20151505).

R EFERENCES [1] M. Itaya, ”An estimation of minimal genome size required for life,” FEBS Letters, vol. 362, no. 3, pp. 257-260, Apr 1995, doi:10.1016/0014-5793(95)00233-Y. [2] A.E. Clatworthy, E. Pierson, D.T. Hung, ”Targeting virulence: a new paradigm for antimicrobial therapy”, Nat. Chem. Biol., vol.3, no. 9, pp.541-548, Aug 2007. [3] G. Giaever, A.M. Chu, L. Ni, C. Connelly, L. Riles, V´eronneau S, Dow S, A. Lucau-Danila, K. Anderson, B. Andr´e, A.P. Arkin, A. Astromoff, M. El-Bakkoury, R. Bangham, R. Benito, S. Brachat, S. Campanaro, M. Curtiss, K. Davis, A. Deutschbauer, K.D. Entian, P. Flaherty, F. Foury, D.J. Garfinkel, M. Gerstein, D. ¨ Gotte, U. Guldener, J.H. Hegemann, S. Hempel, Z. Herman, ¨ D.F. Jaramillo, D.E. Kelly, S.L. Kelly, P. Kotter, D. LaBonte, D.C. Lamb, N. Lan, H. Liang , H. Liao, L. Liu, C. Luo, M. Lussier, R. Mao, P. Menard, S.L. Ooi, J.L. Revuelta, C.J. Roberts, M. Rose, P. Ross-Macdonald, B. Scherens, G. Schimmack, B. Shafer, D.D. Shoemaker, S. Sookhai-Mahadeo,R.K. Storms, J.N. Strathern, G. Valle, M. Voet, G. Volckaert, W. Ching-yun, T.R. Ward, J. Wilhelmy, E.A. Winzeler, Y. Yang, G. Yen, E. Youngman, K. Yu, H. Bussey, J.D. Boeke, M. Snyder, P. Philippsen, R.W. Davis, M. Johnston,” Functional profiling of the Saccharomyces cerevisiae genome,” Nature, vol. 418, no. 6896, pp. 387-91, July 2002, doi:10.1038/nature00935. [4] R. S. Kamath, A. G.Fraser, Y. Dong, G. Poulin, R. Durbin, M. Gotta, A. Kanapin, N. LeBot, S. Moreno, M. Sohrmann, D.P. Welchman, P. Zipperlen and J. Ahringer, ”Systematic functional analysis of the Caenorhabditis elegansgenome using RNAi,” Nature, vol.421, pp.231-237, Jan 2003, doi:10.1038/nature01278. [5] L.A. Gallagher, E. Ramage, M.A. Jacobs, R. Kaul, M. Brittnacher and C. Manoil, ”A comprehensive transposon mutant library of Francisella novicida, a bioweapon surrogate,” Proc. Natl. Acad. Sci., vol. 104, no.3 , pp. 1009-1014, Jan 2007. [6] D. Warde-Farley, S.L. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz, C. Grouios, F. Kazi, C.Tannus Lopes, A. Maitland, S. Mostafavi, J. Montojo, Q. Shao, G. Wright, G.D. Bader and Q. Morris, ”The Gene MANIA prediction server: biological network integration for gene prioritization and predicting gene function,” Nucleic acids research, vol.38, no.suppl 2, pp. W214-W220, May 2010.

7

[7] H. Jeong, S.P. Mason, A.L. Barab´asi, Z.N. Oltvai, ”Lethality and centrality in protein networks,” Nature, vol. 411, pp. 41-42, May 2001, doi:10.1038/35075138. [8] M.P. Joy, A. Brock, D.E. Ingber, S. Huang, ”High-betweenness proteins in the yeast protein interaction network,” J Biomed Biotechnol, vol. 2005, no. 2, pp. 96-103, Jun 2005, doi:10.1155/JBB.2005.96. [9] J. Wang, M. Li, H. Wang, Y. Pan, ”Identification of essential proteins based on edge clustering coefficient,” IEEE/ACM Trans Comput Biol Bioinform., vol. 9, pp. 1070-1080, Aug 2012, doi:10.1109/TCBB.2011.147. [10] P. Wang, X. Yu, and J. Lu, ”Identification and evolution of structurally dominant nodes in protein-protein interaction networks,” IEEE Trans.Biomed. Circuits Syst., vol. 8, no. 1, pp. 87-97, Feb 2014. [11] Y. Wang, H. Sun, W. Du, E. Blanzieri, G. Viero, Y. Xu, Y. Liang, ”Identification of Essential Proteins Based on Ranking EdgeWeights in Protein-Protein Interaction Networks”, PLoS ONE, vol.9, no.9, pp. e108716, Sep 2014. [12] X. Tang, J. Wang, J. Zhong, Y. Pan, ”Predicting Essential Proteins Based on Weighted Degree Centrality,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 2, pp. 407-18, Mar 2014. doi: 10.1109/tcbb.2013.2295318. [13] C.von Mering, R. Krause, B. Snel, M. Cornell, S.G. Oliver, S. Fields, P. Bork, ”Comparative assessment of large-scale data sets of protein-protein interactions,” Nature, vol. 417, no. 6887, pp. 399-403, May 2002. [14] G.D. Bader, C.W.V. Hogue, ”Analyzing yeast protein-protein interaction data obtained from different sources,” Nature Biotechnology, vol. 20, no. 10, pp. 991-997, Oct 2002. [15] A. M. Edwards, B. Kus, R. Jansen, D. Greenbaum, J. Greenblatt, and M. Gerstein, ”Bridging structural biology and genomics: Assessing protein interaction data with known complexes,” Trends Genetics, vol. 18, pp. 529-536, Oct 2002. [16] O. Kuchaiev, M. Raˇsajski, DJ. Higham, N. Prˇzulj, ”Geometric De-noising of Protein-Protein Interaction Networks,” Plos Computational Biology, vol. 5, no. 8, p. e1000454, Aug 2009, doi:10.1371/journal.pcbi.1000454. [17] E. Sprinzak, S. Sattath, H. Margalit, ”How reliable are experimental protein-protein interaction data?,” Journal of Molecular Biology, vol. 327, no. 5, pp. 919-923, Apr 2003. [18] W. Peng, J. Wang, W. Wang, Q. Liu, F.X. Wu, Y. Pan, ”Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks,” BMC systems biology, vol. 6, pp. 87, July 2012. [19] Q. Xiao, J. Wang, X. Peng, F.X. Wu, Y. Pan, ”Identifying essential proteins from active PPI networks constructed with dynamic gene expression,” BMC genomics, vol. 16, no. Suppl 3, pp. S1, Jan 2015. [20] J. Luo, Y. Qi, ”Identification of Essential Proteins Based on a New Combination of Local Interaction Density and Protein Complexes,” PLoS ONE, vol. 10, no. 6, pp. e0131418, Jun 2015. [21] Y. Jiang, Y. Wang, W. Pang, L. Chen, H. Sun, Y. Liang, E. Blanzieri, ”Essential protein identification based on essential protein-protein interaction prediction by Integrated Edge Weights,” Methods, vol. 83, pp. 51-62, July 2015. [22] B. Zhao, J. Wang, M. Li, F.X. Wu, Y. Pan, ”Prediction of essential proteins based on overlapping essential modules,” IEEE Trans Nanobioscience, vol. 13, no. 4, pp. 415-24, Aug 2014. [23] W. Kim, M. Li, J. Wang, Y. Pan, ”Essential protein discovery based on network motif and gene ontology,” IEEE Int Conf Bioinformatics Biomedicine, Atlanta, pp. 470-475, Nov 2011, doi:10.1109/BIBM.2011.46. [24] W. Zhang, X. Zou, ”A new method for detecting protein complexes based on the three node cliques,” IEEE/ACM Trans Comput Biol Bioinform., vol. 12 no. 4, pp. 879-886, Aug 2015. doi:10.1109/TCBB.2014.2386314. [25] H. Wang, M. Li, J. Wang, Y. Pan, ”A New Method for Identifying Essential Proteins Based on Edge Clustering Coefficient,” International Symposium on Bioinformatics Research and Applications. Springer Berlin Heidelberg, pp. 87-98, 2011. [26] M. Li, H. Zhang, J. Wang, Y. Pan, ”A new essential protein discovery method based on the integration of protein protein interaction and gene expression data,” BMC Syst. Biol., vol. 6, pp. 15, Mar 2012, doi:10.1186/1752-0509-6-15.

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2615931, IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

[27] X. Zhang , J. Xu, W. Xiao, ”A new method for the discovery of essential proteins,” PLoS ONE, vol. 8, pp. e58763, Mar 2013, doi:10.1371/journal.pone.0058763. [28] X.F. Zhang, D.Q. Dai, L. Ou-Yang, and H. Yan, ”Detecting overlapping protein complexes based on a generative model with functional and topological properties,” BMC Bioinformatics, vol. 15, p. 186, Jun 2014, doi:10.1186/1471-2105-15-186. [29] M. Li, J. Wang, X. Chen, H. Wang, Y. Pan, ”A local average connectivity-based method for identifying essential proteins from the network level, ” Comput Biol Chem. vol.35, no.3 , pp. 143-150, Jun 2011. [30] Y. Zhang, H. Lin, Z. Yang, J. Wang, ”Construction of Ontology Augmented Networks for Protein Complex Prediction,” PLoS ONE, vol. 8, no. 5, p. e62077, May 2013, doi:10.1371/journal.pone.0062077. [31] Y. Jiang, Y. Wang, W. Pang, L. Chen, H. Sun, Y. Liang, & E. Blanzieri, ”Essential protein identification based on essential protein-protein interaction prediction by Integrated Edge Weights,” Methods, vol. 83, pp.51-62, July 2015. [32] A. Schlicker, T. Lengauer, M. Albrecht,”Improving disease gene prioritization using the semantic similarity of Gene Ontology terms,” Bioinformatics, vol. 26, no.18, pp. i561-i567, 2010. [33] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, ”Defining and identifying communities in networks,” Proc Natl Acad Sci USA, vol. 101, no. 9, pp. 2658-2663, Jan 2004, doi:10.1073/pnas.0400054101. [34] M. Ashburner, C.A. Ball, J.A. Blake, D. Bostein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin and G. Sherlock, ”Gene ontology: tool for the unification of biology,” Nature Genetics, Vol. 25, no. 1, pp. 25-9, 2000. [35] The Gene Ontology Consortium, ”Gene Ontology Consortium: going forward,” Nucleic Acids Research, Vol.43, No. D1, pp. D1049-D1056, Jan 2015, doi: 10.1093/nar/gku1179. [36] J.Z. Wang, Z. Du, R. Payattakool, P.S. Yu, C.F. Chen, ”A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274-1281, May 2007, doi:10.1093/bioinformatics/btm087. [37] I. Xenarios, D.W. Rice, L. Salwinski, M.K. Baron, E.M. Marcotte, D. Eisenberg, ”DIP: the database of interacting proteins:a research tool for studying cellular networks of protein interactions”, Nucleic Acids Research, vol. 28, no. 1, pp. 289-291, Jan 2000, doi:10.1093/nar/30.1.303. [38] G. Liu, L. Wong, H.N. Chua, ”Complex discovery from weighted PPI networks,” Bioinformatics, vol. 25, no. 15, pp. 1891-1897, Aug 2009, doi:10.1093/bioinformatics/btp311. ¨ ¨ [39] H.W. Mewes, D. Frishman, K.F.X. Mayer, M. Munsterk otter1, O. Noubibou, P. Pagel, T. Rattei, M. Oesterheld, A. Ruepp, and ¨ V.Stumpflen, ”MIPS: Analysis and Annotation of Proteins from Whole Genomes in 2005,” Nucleic Acids Research, vol. 34, no. S1, pp. D169-D172, Jan 2006, doi:10.1093/nar/gkj148. [40] J.M. Cherry, C. Adler, C. Ball, S.A. Chervitz, S.S. Dwight, E.T. Hester, Y. Jia, G. Juvik, T. Roe, M. Schroeder, S. Weng, and D.Botstein, ”SGD: Saccharomyces Genome Database,” Nucleic Acids Research, vol. 26, no. 1, pp. 73-79, 1998. [41] R. Zhang and Y. Lin, ”DEG 5.0, A Database of Essential genes in both Prokaryotes and Eukaryotes,” Nucleic Acids Research, vol. 37, pp. D455-D458, Jan 2009, doi:10.1093/nar/gkn858. [42] Saccharomyces Genome Deletion Project. http://wwwsequence.stanford.edu/group/yeast deletion project, 2011. [43] B.P. Tu, A. Kudlicki, M. Rowicka, S.L. McKnight, ”Logic of the yeast metabolic cycle: temporal Compartmentalization of cellular processes,” Science, vol. 310, no. 5751, pp. 1152-1158, Nov 2005, doi:10.1126/science.1120499. [44] http://geneontology.org/page/download-annotations. [45] A.G. Holman, P.J. Davis, J.M. Foster, C.K.S. Carlow, S. Kumar, ”Computational prediction of essential genes in an unculturable endosymbiotic bacterium, Wolbachia of Brugia malayi,” BMC Microbiology, vol. 9, no. 243, Nov 2009, doi: 10.1186/1471-21809-243. [46] W. Zhang, T. Tian, X. Zou, ”Negative feedback contributes to the stochastic expression of the interferon-β gene in virustriggered type I interferon signaling pathways,” Math Biosci., vol. 265, pp. 12-27, July 2015, doi: 10.1016/j.mbs.2015.04.003. [47] Y. Li, S. Jin, L. Lei, Z. Pan, & X. Zou, ”Deciphering deterioration mechanisms of complex diseases based on the construction

8

of dynamic networks and systems analysis,” Scientific Reports, vol. 5, no. 9283, Mar 2015, doi:10.1038/srep09283. [48] S. Jin, Y. Li , R. Pan, X. Zou, ”Characterizing and controlling the inflammatory network during influenza A virus infection,” Scientific Reports, vol.4, pp. 3799, Jan 2014.

1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Suggest Documents