Molecular BioSystems


Cite this: Mol. BioSyst., 2012, 8, 3036–3048

PAPER

www.rsc.org/molecularbiosystems

Detecting protein complexes in a PPI network: a gene ontology based multi-objective evolutionary approach†

Anirban Mukhopadhyay, Sumanta Ray* and Moumita De

Received 27th March 2012, Accepted 14th August 2012
DOI: 10.1039/c2mb25302j

Protein complexes play an important role in cellular mechanism. Identification of protein complexes in protein–protein interaction (PPI) networks is the first step in understanding the organization and dynamics of cell function. Several high-throughput experimental techniques produce a large amount of protein interactions, which can be used to predict protein complexes in a PPI network. We have developed an algorithm PROCOMOSS (Protein Complex Detection using Multi-objective Evolutionary Approach based on Semantic Similarity) for partitioning the whole PPI network into clusters, which serve as predicted protein complexes. We consider both graphical properties of a PPI network as well as biological properties based on GO semantic similarity measure as objective functions. Here three different semantic similarity measures are used for grouping functionally similar proteins in the same clusters. We have applied the PROCOMOSS algorithm on two different datasets of Saccharomyces cerevisiae to find and predict protein complexes. A real-life application of the PROCOMOSS is also shown here by applying it in the human PPI network consisting of differentially expressed genes affected by gastric cancer. Gene ontology and pathway based analyses are also performed to investigate the biological importance of the extracted gene modules.

Department of Computer Science and Engineering, University of Kalyani, Kalyani, India. E-mail: [email protected], sumanta_ray86@rediffmail.com, [email protected]

† Electronic supplementary information (ESI) available: The code and other related materials are available at http://kucse.in/procomoss/. See DOI: 10.1039/c2mb25302j

This journal is © The Royal Society of Chemistry 2012

1 Introduction

A PPI network can be described as a complex system of proteins linked by interactions. The simplest representation takes the form of an undirected graph consisting of nodes and edges,1 where proteins are represented as nodes and an interaction between two proteins is represented as an edge connecting the corresponding nodes. The protein complexes in a PPI network are assemblages of proteins that interact with each other at a given time and place, forming dense regions in the network. Several techniques based on graph clustering, finding dense regions, or clique finding have been proposed to discover protein complexes in PPI networks.2–5

Molecular Complex Detection (MCODE), proposed by Bader et al.,6 detects densely connected regions in the PPI network by assigning each vertex a weight corresponding to its local neighborhood density. Then, starting with the top weighted vertex (the seed vertex), it recursively includes in the cluster the vertices whose weight is above a given threshold. The Markov Cluster algorithm (MCL) proposed in ref. 7 converges toward a partitioning of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. In ref. 8, Restricted Neighborhood Search Clustering (RNSC) is proposed, a cost-based local search algorithm that explores the solution space to minimize a cost function calculated from the number of intra-cluster and inter-cluster edges. Starting from an initial random solution, RNSC iteratively moves a vertex from one cluster to another if this move reduces the overall cost.

Recently, in ref. 9, clustering with overlapping neighborhood expansion (ClusterONE) was introduced for detecting potentially overlapping protein complexes from protein–protein interaction data. This algorithm consists of three major steps. First, starting from a single seed vertex, a greedy procedure adds or removes vertices to find groups with high cohesiveness. Second, pairs of groups whose overlap score is above a specified threshold are merged. Third, a postprocessing step discards complex candidates that contain fewer than three proteins or whose density is below a given threshold.

In general it has been observed that the proteins constituting a complex are functionally similar and carry out some common biological activity. Motivated by this observation, in this article a multi-objective algorithm PROCOMOSS (Protein Complex Detection using Multi-objective Evolutionary Approach based on Semantic Similarity) is developed. PROCOMOSS simultaneously optimizes a graph based density metric and a GO semantic similarity based metric to find dense protein complexes containing functionally similar proteins. NSGA-II,10,11 a popular multi-objective GA,12 is used as the underlying optimization tool. The results are obtained by applying PROCOMOSS to the protein–protein interaction (PPI) data downloaded from two different high-throughput datasets, DIP and MIPS. The performance of PROCOMOSS is compared with that of other methods, namely MCODE,6 RNSC,8 MCL,7 ClusterONE9 and Affinity propagation.13 The semantic similarity measures used in PROCOMOSS have also been used with Affinity propagation for grouping proteins into modules. We also perform a Gene Ontology and pathway based analysis of the modules identified by PROCOMOSS among the differentially expressed genes extracted from the gastric cancer dataset downloaded from www.biolab.si/supp/bi-cancer/projections/info/GSE2685.htm.

2 Multi-objective optimization using GA

The multi-objective optimization problem can be stated as follows:12,14–16 find the vector $\bar{x} = [x_1, x_2, \ldots, x_n]^T$ of decision variables satisfying the $m$ inequality constraints $g_i(\bar{x}) \ge 0$, $i = 1, 2, \ldots, m$, and the $p$ equality constraints $h_i(\bar{x}) = 0$, $i = 1, 2, \ldots, p$, that optimizes the vector function $\bar{f}(\bar{x}) = [f_1(\bar{x}), f_2(\bar{x}), \ldots, f_k(\bar{x})]^T$. The constraints define the feasible region F containing all the admissible solutions. The vector $\bar{x}^*$ denotes an optimal solution in F.

The concept of Pareto optimality12,16 is useful in the domain of multi-objective optimization. A formal definition of Pareto optimality from the viewpoint of a minimization problem may be given as follows: a decision vector $\bar{x}^*$ is called Pareto optimal if and only if there is no $\bar{x}$ that dominates $\bar{x}^*$, i.e., there is no $\bar{x}$ such that $\forall i \in \{1, 2, \ldots, k\}$, $f_i(\bar{x}) \le f_i(\bar{x}^*)$ and $\exists i \in \{1, 2, \ldots, k\}$, $f_i(\bar{x}) < f_i(\bar{x}^*)$. In words, $\bar{x}^*$ is Pareto optimal if there exists no feasible vector $\bar{x}$ which causes a reduction in some criterion without a simultaneous increase in at least another. The Pareto optimum usually admits a set of solutions called non-dominated solutions. Here we use NSGA-II10,11 as the underlying multi-objective algorithm.

3 Gene ontology based semantic similarity

The Gene Ontology (GO) project17 is a collaborative effort to provide a consistent description of genes and gene products. GO provides a collection of well-defined and well-structured biological terms, called GO terms, that are shared across different organisms. They comprise three categories as the most general concepts: biological processes, molecular functions and cellular components. The measurement of semantic similarity between two concepts can be easily extended to measure the degree of similarity between terms in the GO structures.18 The GO terms are structured by their relationships to each other, such as is-a, which represents a specific-to-general relationship between terms, and part-of, which represents a part-to-whole relationship. Two approaches to gene similarity computation are graph structure-based (GS) measures, which use the hierarchical structure of GO in computing gene similarity, and information content-based (IC) measures, which additionally consider the a priori probabilities, or information contents, of GO terms in a reference gene set.19,20 For a detailed description see the ESI† website.

Among the various approaches for computing semantic similarity we use three here: Lin,21 Jiang and Conrath22 and Kappa's measure23 (the equations and a brief description can be found on the ESI† website) to form the objective functions for computing our clustering results. Besides the semantic similarity between GO terms annotating a protein pair, we also use some graphical properties of the protein interaction network as objectives.

4 Proposed method

Here we describe the PROCOMOSS algorithm for clustering in a PPI network in detail.

4.1 Chromosome representation

A protein complex is a subgraph of the whole PPI graph. Here a protein complex is encoded as a chromosome. In the resulting population, a chromosome of the form {n1 n2 . . . np}, where each ni, i = 1,. . .,p, is an integer denoting the index of a protein in the unique protein set, represents a protein complex consisting of p nodes (proteins). The nodes in a chromosome are not necessarily all connected.
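The encoding of Section 4.1 and the dominance relation of Section 2 can be illustrated together. The following is a minimal sketch, not the authors' implementation: the names `Chromosome`, `dominates` and `non_dominated` are hypothetical, the objective values are made up, and objectives that PROCOMOSS maximizes are assumed to be negated so that every component is minimized, as in the definition of Section 2.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Hypothetical encoding: a chromosome is the set of protein indices n_1..n_p.
Chromosome = FrozenSet[int]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(vectors: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the vectors that no other vector in the list dominates."""
    return [
        v for i, v in enumerate(vectors)
        if not any(dominates(w, v) for j, w in enumerate(vectors) if j != i)
    ]


# Illustrative values only: (negated density, number of outward nodes).
population: List[Chromosome] = [
    frozenset({0, 1, 2}), frozenset({0, 1, 3}), frozenset({2, 3, 4}),
]
objectives = [(-1.0, 1), (-0.33, 2), (-0.67, 1)]
front = non_dominated(objectives)
```

NSGA-II additionally ranks the population into successive fronts and uses a crowding distance; only the basic dominance test is shown here.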

4.2 Population initialization

Initially the whole network is broken into several biclusters.24–26 Biclustering is done by applying K-means clustering along both dimensions of the PPI matrix and taking intersections of the clusters formed in these two dimensions. Each bicluster represents a densely connected region in the network. We sort these biclusters by density, pick the first 50 and encode them as the initial population. The subsequent populations are created using the genetic operators of NSGA-II.

4.3 Representation of objective functions

Here we use two types of objective functions: one depends entirely on the graphical properties of the protein interaction network and the other is based on Gene Ontological annotations of proteins.

4.3.1 Graph based objective. All graph theoretic approaches for finding protein complexes seek to identify dense subgraphs by maximizing the density of each subgraph on the basis of local network topology. The density of a graph is defined as the ratio of the number of edges present in the graph to the number of edges in a complete graph of the same size. As there are a large number of interactions (edges) between the proteins (nodes) in a protein complex (subgraph), the density of each complex is generally very high. So using density as an objective function and maximizing it for individual subgraphs will yield much denser complexes.

For the next objective we count the nodes that are connected to a chromosome but are not present in the current chromosome/cluster. For example, in Fig. 1 the chromosome is shown as black nodes and its outward interconnecting nodes (which are not present in the current chromosome) are shown as yellow nodes. This may be written as:

$$N(C) = \left| \bigcup_{i \in C} n_i \right|, \qquad (1)$$

where C is any cluster in G and $n_i$ is the set of nodes which are connected with node i in C and are not present in C. Minimizing this quantity yields compact clusters with fewer outward interaction partners.

Fig. 1 Example of outward interconnecting nodes of a chromosome: black nodes represent the chromosome and yellow nodes are the outward interconnecting nodes of this chromosome.

4.3.2 Semantic similarity based objective. The semantic similarity measure between two GO terms can be directly converted to a measure of the similarity between two proteins. Since a protein is annotated to multiple GO terms, the similarity between two proteins can be represented as the average similarity of the GO term cross pairs associated with the two interacting proteins.18 The package csbl.go (http://csbi.ltdk.helsinki.fi/csbl.go/) is used for calculating the similarity between protein pairs. We calculate the similarity between all pairs of proteins and tabulate it in matrix form. For calculating the fitness of a chromosome, the average similarity over the pairs of proteins comprising the chromosome is computed. For example, to compute the fitness of the chromosome {n1 n2 . . . np} we extract from the similarity matrix S the submatrix s whose rows and columns correspond to these nodes. The average value of the matrix s serves as the fitness of this chromosome. This may be written as:

$$\mathrm{sim}(s) = \frac{\sum_{i \in p} \sum_{j \in p} s(i, j)}{p}. \qquad (2)$$

Fig. 2 Illustration of the mutation process: black nodes represent the chromosome. In each iteration one node of the chromosome is selected at random and, with equal probability, either that node is deleted or its direct neighbors are added. (a) Represents the parent chromosome. After one iteration a child chromosome (shown in (b) or (c)) is produced.
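The three objectives described in Sections 4.3.1 and 4.3.2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published code: the graph is assumed to be an adjacency dictionary without self-loops, the function names are hypothetical, and the similarity objective is taken as the mean over all entries of the submatrix (following the sentence "the average value of the matrix s"), so the normalization by p squared is an assumption.

```python
from typing import Dict, Sequence, Set

Graph = Dict[int, Set[int]]  # hypothetical adjacency representation


def density(cluster: Set[int], adj: Graph) -> float:
    """Edges inside the cluster divided by the edges of a complete graph."""
    p = len(cluster)
    if p < 2:
        return 0.0
    # Each internal edge is seen from both of its endpoints, hence the / 2.
    internal_edges = sum(len(adj[u] & cluster) for u in cluster) / 2
    return internal_edges / (p * (p - 1) / 2)


def outward_nodes(cluster: Set[int], adj: Graph) -> int:
    """N(C): number of distinct neighbours of the cluster lying outside it."""
    outside: Set[int] = set()
    for u in cluster:
        outside |= adj[u] - cluster
    return len(outside)


def mean_similarity(cluster: Set[int], S: Sequence[Sequence[float]]) -> float:
    """Average of the submatrix of S restricted to the cluster's proteins."""
    nodes = sorted(cluster)
    p = len(nodes)
    if p == 0:
        return 0.0
    total = sum(S[i][j] for i in nodes for j in nodes)
    return total / (p * p)  # assumption: mean over all p * p entries


# Toy network: a triangle 0-1-2 with a tail 2-3-4.
adj: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
S = [[1.0, 0.5], [0.5, 1.0]]  # made-up similarities for proteins 0 and 1
```

On the toy network the triangle {0, 1, 2} has density 1 and a single outward node (node 3), which is the kind of dense, compact cluster the two graph objectives favor.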

Table 1 Summary of the PPI network data sets used here

Data set   # Proteins   # Interactions   Avg. degree   Max. degree   Density
DIP        4669         21 621           9.2305        241           0.0020
MIPS       3950         11 119           5.5792        233           0.0014

Table 2 Comparisons of results with respect to sensitivity, Positive Predictive Value (PPV) and accuracy

                      General sensitivity   General PPV         Accuracy
Method                DIP      MIPS         DIP      MIPS       DIP      MIPS
MCODE                 0.1168   0.0742       0.4922   0.4709     0.2397   0.1869
ClusterONE            0.2135   0.0999       0.4078   0.3890     0.2951   0.1971
MCL                   0.2605   0.1588       0.4464   0.4135     0.3486   0.2563
RNSC                  0.2909   0.1922       0.6608   0.6048     0.4384   0.3409
PROCOMOSS_Lin_Mf      0.2490   0.1150       0.8186   0.7782     0.4515   0.2992
PROCOMOSS_Lin_Bp      0.2726   0.1051       0.8141   0.9425     0.4711   0.3147
PROCOMOSS_Lin_Cc      0.2321   0.1187       0.6891   0.7144     0.4000   0.2912
Affinity_Lin_Mf       0.1624   0.1391       0.3473   0.3513     0.2375   0.2211
Affinity_Lin_Bp       0.1443   0.1311       0.3524   0.3492     0.2255   0.2140
Affinity_Lin_Cc       0.1565   0.1331       0.3761   0.3523     0.2426   0.2166
PROCOMOSS_Jiang_Mf    0.2215   0.1185       0.7379   0.7643     0.4043   0.3010
PROCOMOSS_Jiang_Bp    0.1972   0.1178       0.8596   0.7976     0.4117   0.3066
PROCOMOSS_Jiang_Cc    0.2119   0.1095       0.8526   0.8952     0.4250   0.3131
Affinity_Jiang_Mf     0.1535   0.1507       0.3668   0.3818     0.2373   0.2399
Affinity_Jiang_Bp     0.1436   0.1610       0.3754   0.3861     0.2322   0.2493
Affinity_Jiang_Cc     0.1512   0.1571       0.3650   0.3830     0.2349   0.2453
PROCOMOSS_Kappa_Mf    0.1716   0.1242       0.6390   0.7763     0.3312   0.3105
PROCOMOSS_Kappa_Bp    0.2398   0.1285       0.8261   0.7366     0.4450   0.3076
PROCOMOSS_Kappa_Cc    0.1833   0.1141       0.5904   0.8113     0.3290   0.3043
Affinity_Kappa_Mf     0.1390   0.1391       0.3494   0.3513     0.2204   0.2242
Affinity_Kappa_Bp     0.1420   0.1611       0.3546   0.3887     0.2244   0.2503
Affinity_Kappa_Cc     0.1487   0.1610       0.3552   0.3868     0.2298   0.2495


Here we use the semantic similarity measures proposed by Lin, Jiang and Conrath, and Kappa to compute the similarity matrices. By maximizing this average similarity we can group functionally similar proteins.

4.4 Selection and mutation

The popularly used genetic operations are selection, crossover, and mutation. A general crossover operation between two chromosomes results in many disconnected subgraphs and produces a large number of isolated nodes. So crossover is not performed here; instead, mutation is performed with high probability (mutation probability = 0.9). The selection operation used here is the crowded binary tournament selection of NSGA-II. If a chromosome is selected for mutation, nodes are added to or deleted from it in the following way: a random node ni of the chromosome is selected and one of two tasks is performed with equal probability: either that node is deleted, or the direct neighbors of ni that are not included in the parent chromosome are added. Fig. 2 illustrates the process. Either of the child chromosomes shown in Fig. 2(b) and (c) is produced from the parent chromosome shown in Fig. 2(a). The whole operation is performed five times to create a new, diversified chromosome from the parent chromosome.
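The mutation operator described above can be sketched in a few lines. This is a hedged illustration, not the published code: the function name `mutate`, the adjacency-dictionary graph and the `rng` parameter are assumptions, neighbors are compared against the current (partially mutated) chromosome rather than the original parent, and the loop stops early if the chromosome becomes empty, a case the text does not discuss.

```python
import random
from typing import Dict, Optional, Set


def mutate(chromosome: Set[int], adj: Dict[int, Set[int]],
           rounds: int = 5, rng: Optional[random.Random] = None) -> Set[int]:
    """Return a mutated copy of the chromosome; the parent is left untouched."""
    rng = rng or random.Random()
    child = set(chromosome)
    for _ in range(rounds):
        if not child:  # assumption: nothing left to select from
            break
        node = rng.choice(sorted(child))  # pick a random node n_i
        if rng.random() < 0.5:
            child.discard(node)           # delete the selected node
        else:
            child |= adj[node] - child    # add its neighbours not yet present
    return child


# Toy network: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
parent = {0, 1}
child = mutate(parent, adj, rng=random.Random(0))
```

Because only deletions and neighbor additions are used, a child stays close to the parent's region of the network, which is the stated reason for preferring mutation over crossover.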

5 Experimental results

We ran the proposed algorithm PROCOMOSS on the PPI networks of Saccharomyces cerevisiae (yeast) downloaded from DIP27 and MIPS.28 On a Core 2 Duo 2.26 GHz PC with 2 GB of internal memory running Windows 7, PROCOMOSS takes 2290.93 seconds and 17 776.90 seconds for population initialization and 492.32 seconds and 463.12 seconds to run, for the DIP and the MIPS dataset respectively. For DIP we used 4669 out of 5000 S. cerevisiae proteins and for MIPS we used 3990 proteins out of 6190, according to the availability of their annotation data. We then reduced the interaction datasets so that they contain the annotated proteins only. The resulting dataset contains 11 119 interactions for MIPS and 21 621 interactions for DIP. Table 1 summarizes the PPI networks for DIP and MIPS. We match our clustering results against a set of 491 known protein complexes downloaded from http://yeast-complexes.russelllab.org/. The interaction datasets and the benchmark complexes can be found on the ESI† website.

Fig. 3 Proportion of clusters attaining a specified p-value in DIP dataset: (a) MCODE, (b) MCL, (c) cluster ONE, (d) RNSC, (e) Affinity_Lin_mf, (f) Affinity_Lin_bp, (g) Affinity_Lin_cc, (h) PROCOMOSS_Lin_mf, (i) PROCOMOSS_Lin_bp, (j) PROCOMOSS_Lin_cc. Here log of p-value is given in x-axes and proportion of clusters is represented as y-axes.

5.1 Performance comparisons with the existing methods

To compare the PROCOMOSS clustering results with those of some other existing algorithms we employ matching statistics including sensitivity, positive predictive value (PPV) and accuracy. We built a contingency table with rows as protein complexes and columns as resulting clusters. So, the contingency table T is an $n \times m$ matrix for n complexes and m resulting clusters, where row i corresponds to the i-th annotated complex and column j to the j-th cluster. The value of a cell $T_{i,j}$ indicates the number of proteins found in common between complex i and cluster j. Some proteins belong to several complexes, and some proteins may be assigned to multiple clusters or not assigned to any cluster.

5.1.1 Sensitivity. Sensitivity is the fraction of proteins of complex i found in predicted cluster j: $Sn_{i,j} = T_{i,j} / N_i$, where $N_i$ is the number of proteins belonging to complex i. A complex-wise sensitivity $Sn_{co_i}$ may be defined as $Sn_{co_i} = \max_{j=1}^{m} Sn_{i,j}$. The general sensitivity (Sn) is the weighted average of $Sn_{co_i}$ over all complexes and is defined as:

$$Sn = \frac{\sum_{i=1}^{n} N_i \, Sn_{co_i}}{\sum_{i=1}^{n} N_i}. \qquad (3)$$

5.1.2 Positive predictive value. The positive predictive value is the proportion of members of predicted cluster j which belong to complex i, relative to the total number of members of this cluster assigned to all complexes: $PPV_{i,j} = T_{i,j} / \sum_{i=1}^{n} T_{i,j} = T_{i,j} / T_j$, where $T_j$ is the marginal sum of column j. The cluster-wise positive predictive value $PPV_{cl_j}$ represents the maximal fraction of proteins of cluster j found in the same complex: $PPV_{cl_j} = \max_{i=1}^{n} PPV_{i,j}$. The general PPV (PPV) of a clustering result is the weighted average of the cluster-wise PPV ($PPV_{cl_j}$) over all predicted clusters:

$$PPV = \frac{\sum_{j=1}^{m} T_j \, PPV_{cl_j}}{\sum_{j=1}^{m} T_j}. \qquad (4)$$
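The matching statistics can be computed directly from the contingency table. The sketch below is a minimal illustration with made-up numbers: it evaluates the general sensitivity and the general PPV as defined above, together with their geometric mean (the accuracy of Section 5.1.3). The function name `matching_statistics` and the toy table are assumptions, not part of the original work.

```python
import math
from typing import List, Sequence, Tuple


def matching_statistics(T: Sequence[Sequence[int]],
                        complex_sizes: Sequence[int]) -> Tuple[float, float, float]:
    """Return (Sn, PPV, Acc) for an n x m complex-by-cluster contingency table."""
    n, m = len(T), len(T[0])

    # Complex-wise sensitivity: best-matching cluster for each complex.
    sn_co = [max(T[i][j] for j in range(m)) / complex_sizes[i] for i in range(n)]
    sn = sum(N * s for N, s in zip(complex_sizes, sn_co)) / sum(complex_sizes)

    # Cluster-wise PPV: best-matching complex for each cluster.
    col_sums = [sum(T[i][j] for i in range(n)) for j in range(m)]
    ppv_cl = [
        max(T[i][j] for i in range(n)) / col_sums[j] if col_sums[j] else 0.0
        for j in range(m)
    ]
    ppv = sum(t * p for t, p in zip(col_sums, ppv_cl)) / sum(col_sums)

    acc = math.sqrt(sn * ppv)  # geometric mean of Sn and PPV
    return sn, ppv, acc


# Toy table: 2 complexes (sizes 4 and 3) against 2 predicted clusters.
T = [[3, 0],
     [1, 2]]
sn, ppv, acc = matching_statistics(T, complex_sizes=[4, 3])
```

For this toy table the best matches cover 5 of the 7 complex members and 5 of the 6 assigned cluster members, so Sn = 5/7 and PPV = 5/6.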

5.1.3 Accuracy. The geometric accuracy (Acc) represents a tradeoff between sensitivity and the positive predictive value and is defined as:

$$Acc = \sqrt{Sn \times PPV}. \qquad (5)$$

It is the geometrical mean of the Sn and the PPV. The advantage of taking the geometric mean is that it yields a low score when either the Sn or the PPV metric is low. High accuracy values thus require a high performance for both criteria. In Table 2 we list PROCOMOSS clustering results obtained by using Lin, Jiang and Conrath, and Kappa's similarity measures based on each of the three orthogonal taxonomies

Fig. 4 Proportion of clusters attaining a specified p-value in MIPS dataset: (a) MCODE, (b) MCL, (c) cluster ONE, (d) RNSC, (e) Affinity_Kappa_mf, (f) Affinity_Kappa_bp, (g) Affinity_Kappa_cc, (h) PROCOMOSS_Kappa_mf, (i) PROCOMOSS_Kappa_bp, (j) PROCOMOSS_Kappa_cc. Here log of p-value is given in x-axes and proportion of clusters is represented as y-axes.


Table 3 Protein complexes predicted by PROCOMOSS in DIP dataset using Lin measure (biological process), their p-values, most significant GO-terms, and GO-id Sl no. Real protein complex Predicted protein complex 1 2

3

mRNA guanylyl transferase complex 19/22S regulator

4 5 6 7

Ppz1 protein phosphate complex Complex 87 Act1–Sac6 complex Ku complex Complex 187

8

Complex 221

9

Small subunit processome

10

Complex 250

11

Complex 266

12 13 14 15 16

Complex Complex Complex Complex Complex

17

Complex 384

270 279 307 348 354

18

Complex 409

19

Complex 435

20

Complex 440

21

Complex 473

22

Complex 477

19

61.53 Ceg1 Cet1 Ckb2 Fun12 3.46  10 Nip1 Spt16 Ssa3 Top2 Ecm29 Pre1 Pre5 Rad50 5.75  1016 61.29 Rpn1 Rpn10 Rpn11 Rpn12 Rpn2 Rpn5 Rpn6 Rpn7 Rpn8 Rpt1 Rpt2 Rpt3 Rpt5 Rpt6 Ubp6 Cka2 Ckb2 Ssa3 Ykl088w 4.09  1016 66.67


Proteolysis involved in GO:0051603 cellular protein catabolic process Proteolysis involved in GO:0051603 cellular protein catabolic process

GO:0044238

Primary metabolic process Primary metabolic process Macromolecule glycosylation RNA splicing, via transesterification reactions rRNA processing

GO:0044238 GO:0044238 GO:0043413 GO:0000375

65.51

rRNA processing

GO:0006364

3.74  1059 66.67

rRNA processing

GO:0006364

3.74  1059 62.5

rRNA processing

GO:0006364

Primary metabolic process rRNA processing Primary metabolic process Primary metabolic process rRNA processing

GO:0044238 GO:0006364 GO:0044238 GO:0044238 GO:0006364

   

1022 1022 1008 1051

Cdc55 Iml1 Yfr006w Act1 Sac6 Yku70 Yku80 Brr1 Hta1

1.77 1.77 4.89 3.87

Imd2 Imd4 Kar2 Tdh1 Tdh2 Dip2 Mpp10 Nan1 Nop1 Nop58 Pwp2 Rok1 Rrp9 Sof1 Utp13 Utp15 Utp18 Utp21 Utp22 Utp4 Utp6 Utp7 Utp8 Utp9 Bms1 Ecm16 Utp7 Ygr054w Dip2 Mpp10 Nop1 Nop58 Pwp2 Rok1 Utp22 Utp30 Utp6 Utp7 Clu1 Ura7 Gdb1 Nup192 Nop13 Rlp7 Iml1 Psd2 Dbp8 Dip2 Enp1 Ero1 Imp3 Kre33 Krr1 Mpp10 Nan1 Nop1 Pre9 Rok1 Rrp9 Utp22 Utp30 Utp6 Utp7 Utp8 Idh2 Lys12 Rvb2

3.64  1019 62.5 59

3.74  10

4.09 3.26 4.09 1.77 3.74

    

1016 1014 1016 1022 1059

100 100 100 100

100 100 100 66.67 62.06

2.71  1030 75 37

66.67 Arx1 Fpr4 Mrt4 Nip7 Nog2 9.78  10 Nsa2 Sda1 Tif6 Chd1 Ckb2 Spt16 Ykl088w 3.76  1013 66.67 26

Brx1 Drs1 Mak21 Nip7 3.40  10 64.28 Noc2 Noc3 Nop4 Spb4 Ytm1 Ecm29 Rpn1 Rpn11 Rpn12 8.20  1025 71.43 Rpn3 Rpn6 Rpn8 Rpt2 Sro7 Ura7 Cct8 Tdh2 Tef1 Ufd4 3.26  1014 66.67


GO-id

Primary metabolic process

or aspects that hold terms describing the molecular function (mf), biological process (bp) and cellular component (cc) for a gene product. We also compute results by integrating the same similarity measures into the Affinity propagation algorithm on the same networks for both the DIP and MIPS datasets. As Affinity propagation groups data points based on the similarity between each pair of data points, using a semantic similarity matrix in Affinity propagation is likely to produce protein modules which consist of functionally homogeneous proteins. We see in Table 2 that PROCOMOSS shows comparatively better results than Affinity propagation for each of the similarity measures. PROCOMOSS also performs comparatively well against the other algorithms with respect to sensitivity, PPV, and accuracy. We also perform a GO-based study for comparing the predicted clusters obtained by PROCOMOSS and other algorithms. We use the org.Sc.sgd.db and GOstats packages from R for computing

% of protein covered Most significant GO-term

p-Value

GO:0006364

Proteolysis involved in GO:0051603 cellular protein catabolic process rRNA processing GO:0006364 Proteasome regulatory particle assembly rRNA processing

GO:0070682 GO:0006364

Ubiquitin-dependent protein catabolic process

GO:0006511

rRNA processing

GO:0006364

the p-values of predicted clusters. In Tables 3 and 4 we list the most significant GO-terms and the corresponding GO-ids and p-values under the biological process category for our clustering results. PROCOMOSS gives comparatively better results in the DIP dataset when the Lin measure is used as the objective function, whereas in the MIPS dataset Kappa's measure produces better accuracy values. So we built six tables, three for the Lin measure in the DIP dataset and three for the Kappa measure in the MIPS dataset, describing the most significant GO-terms, GO-ids and p-values of the resulting clusters. These are given on the ESI† website. We plotted bar diagrams, shown in Fig. 3 and 4, of the proportion of clusters attaining a specified p-value for the DIP and the MIPS dataset respectively. The figures show that a large proportion of the clusters produced by other algorithms have higher p-values in comparison with PROCOMOSS, in which a significantly larger

Table 4 Protein complexes predicted by PROCOMOSS in MIPS dataset using Kappa measure (molecular function), their p-values, most significant GO-terms, and GO-id Sl no. Real protein complex 1 2 3

mRNA decapping complex Ctf19 protein complex

Predicted protein complex

p-Value 18

Dcp1 Dcp2 Edc3 Kem1 2.17  10

% of protein covered Most significant GO-term

GO-id

66.67

mRNA metabolic process

GO:0016071

Mitotic cell cycle

GO:0000278

RNA polymerase II transcriptional preinitiation complex assembly Covalent chromatin modification Histone acetylation Histone acetylation Vesicle-mediated transport

GO:0051123

37

66.67 Chl4 Ctf19 Ctf3 Mcm21 2.17  10 Mcm22 Nkp1 Taf1 Taf10 Taf2 Taf5 7.69  1024 75 Taf6 Taf8

4

RNA polymerase II general transcription factor TFIID Tid3 complex

5 6 7

Complex 155 Complex 166 TRAPPII complex

8

11 12 13

SBF, SWI4–SWI6dependent cell cycle box binding factor complex Clathrin-associated Apl5 Apl6 Apm3 protein AP-3 complex Complex 346 Gcn5 Ngg1 Pdr1 Sgf29 Spt15 Spt20 Srb2 Taf1 Taf12 Retromer subcomplex Vps29 Vps35 Vps5 Ric1–Rgp1 complex Rgp1 Ric1 Gim complexes Gim3 Pac10 Yke2

2.55  1015 100 1.14  1022 66.67 3.68  1019 100

14

Complex 479

2.17  1037 66.67

9 10

Nuf2 Spc24 Spc25 Tid3 1.14  1022 100 Atg11 Atg17 Kap104 Mtr10 Nmd5 Bet3 Bet5 Gsg1 Kre11 Trs120 Trs130 Trs20 Trs23 Trs31 Trs33 Swi4 Swi6

Msn5 Num1

1.18  1020 66.67 2.55  1014 60 3.58  1011 83.33

GO:0016573 GO:0016573 GO:0016192

1.14  1022 66.67

Covalent chromatin modification

GO:0016569

2.05  1016 75

Protein localization

GO:0008104

Histone acetylation

GO:0016573

Protein acylation Covalent chromatin modification RNA polymerase II transcriptional preinitiation complex assembly Mitotic cell cycle

GO:0043543 GO:0016569 GO:0051123

21

1.30  10

number of clusters tend to have smaller p-values (i.e. larger −log(p)). This indicates that PROCOMOSS predicts more functionally homogeneous complexes than the other algorithms.

GO:0016569

5.2 Predicted complexes

We found that some real complexes are recognized by our PROCOMOSS algorithm. We filtered out those complexes which have less than sixty percent of their proteins in common with any of our predicted clusters. Here we give two tables; the others can be found on the ESI† website. In Tables 3 and 4 we list the protein complexes predicted by our PROCOMOSS algorithm using the Lin measure (biological process annotation) and Kappa's measure (molecular function annotation) as objective functions. Table 3 is built by applying PROCOMOSS to the DIP data whereas Table 4 gives the details of complexes found in the MIPS data. The third column lists the proteins that are members of the real protein complex shown in the second column and are found in some predicted cluster. The fourth column gives the p-values of the corresponding predicted clusters, which share above 60 percent of the proteins of the real protein complexes. The p-value of a cluster is defined as the lowest of the p-values of all the functional groups constituting the cluster.

In row 21 of Table 3 we see that PROCOMOSS_Lin_bp predicts 10 proteins, of which Rpn1, Rpn11, Rpn12, Rpn3, Rpn6, Rpn8, and Rpt2 are found in complex 473. These proteins act as regulatory subunits of the 26S proteasome, which is involved in the ATP-dependent degradation of ubiquitinated proteins. The 26S proteasome is a multisubunit enzyme composed of a cylindrical catalytic core (20S) and a regulatory particle (19S) that together perform the essential degradation of cellular proteins tagged by ubiquitin.29 In row 9, 19 of the 29 proteins that make up the small subunit processome complex are predicted. The small subunit (SSU) processome is a ribosome biogenesis intermediate that assembles from its subcomplexes onto the pre-18S rRNA with yet unknown order and structure.30 The UtpB subcomplex of the SSU processome consisting of Utp13, Utp15, Utp18, Utp21, Utp22, Utp4, Utp6, Utp7, Utp8 and Utp9, which are involved in nucleolar processing of pre-18S ribosomal RNA and ribosome assembly, is also predicted.

In row 3 of Table 4, we see that using Kappa's semantic similarity measure with molecular function annotation PROCOMOSS predicts the proteins Taf1, Taf10, Taf2, Taf5, Taf6 and Taf8, which function as components of the DNA-binding general transcription factor complex TFIID. TFIID plays a key role in the regulation of gene expression by RNA polymerase II through different activities. In row 4 the four predicted proteins Nuf2, Spc24, Spc25 and Tid3 of the Tid3 complex act as components of the essential kinetochore-associated NDC80 complex, which is involved in chromosome segregation and spindle checkpoint activity. Out of 12 proteins in the TRAPP II complex, 10 proteins are identified: Bet3, Bet5, Gsg1, Kre11, Trs120, Trs130, Trs20, Trs23, Trs31 and Trs33. TRAPP II seems to play a role in intra-Golgi transport.

Fig. 5 Figure of a cluster predicted by PROCOMOSS using Kappa's semantic similarity measure with molecular function annotation. Yellow colored nodes are components of the TFIID complex and violet colored edges signify semantic similarity between two proteins connected with that edge, which is higher than 0.5.

Fig. 6 Figure of some portion of a cluster predicted by PROCOMOSS using Lin semantic similarity measure with molecular function annotation. Yellow colored nodes are components of anaphase-promoting complex/cyclosome.

60

GO:0000278

Fig. 5 shows a cluster predicted by PROCOMOSS using Kappa's semantic similarity measure with MF annotation on the MIPS dataset. Here the edges are colored according to the similarity between the proteins they connect.

Fig. 7 Venn diagrams of predicted complexes that have greater than 60 percent of common proteins in some of the resulting clusters. Complexes are identified by PROCOMOSS using (a) Lin, (b) Jiang and Conrath, and (c) Kappa measures in DIP dataset. In (d)–(f) complexes are identified in MIPS dataset using Lin, Jiang and Conrath and Kappa’s measure respectively.


Fig. 8 Venn diagrams of predicted complexes which have greater than 60 percent of common proteins in some of the resulting clusters. Complexes are identified by PROCOMOSS using Lin, Jiang and Conrath, and Kappa measures in DIP and MIPS datasets. (a) Represents Venn diagram of the complexes predicted by Lin, Jiang and Conrath and Kappa measures using biological process annotation for DIP dataset. (b) and (c) describe the same using cellular component and molecular function respectively. Similarly (d)–(f) represent the same for MIPS dataset.

Proteins connected by light blue edges have similarity less than 0.5, whereas proteins with similarity greater than 0.5 are connected by violet edges. Yellow nodes represent proteins that are components of the TFIID complex. A portion of a cluster predicted by PROCOMOSS using the Lin semantic similarity measure with molecular function annotation is shown in Fig. 6. Proteins Apc1, Apc2, Apc4, Apc5, Apc9, Cdc16, Cdc23 and Cdc27, which are components of the anaphase promoting complex/cyclosome (APC/C), are shown as yellow nodes. This cluster also captures part of the TFIID complex and the GIM protein complex consisting of prefoldin subunits 2, 3, 4 and 5 (GIM2, GIM3, GIM4 and PAC10). The figure of the whole cluster is given in the ESI.† We drew Venn diagrams to show the overlaps of complexes predicted by PROCOMOSS using the three semantic similarity measures. Fig. 7(a)–(c) show the overlap of predicted complexes identified by PROCOMOSS in the DIP dataset, whereas (d)–(f) show the same in the MIPS dataset, using the Lin, Jiang and Conrath, and Kappa semantic similarity measures respectively. We retain those complexes that have more than 60 percent of proteins in common with some of the resulting clusters. In the DIP dataset using the Lin measure, PROCOMOSS predicts 41 complexes in total across the three GO taxonomies: biological process (bp), molecular function (mf) and cellular component (cc), which yield 22, 18 and 17 complexes respectively, with 9 overlaps between bp and mf, 6 between cc and mf and 3 between cc and bp. We also drew Venn diagrams to detect overlaps between complexes predicted by PROCOMOSS using the Lin, Jiang and Conrath, and Kappa measures for each of the taxonomies describing the molecular function (mf), biological process (bp) and cellular component (cc) of a gene product. These are shown in Fig. 8.
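The DIP/Lin counts quoted in this paragraph can be cross-checked with the inclusion–exclusion principle. The sketch below assumes that each pairwise overlap (9, 6 and 3) also counts any complex shared by all three taxonomies; the text does not state this explicitly. Under that assumption, the reported total of 41 implies a three-way overlap of 2. The function name is illustrative only.

```python
def implied_triple_overlap(total, sizes, pairwise):
    """Solve |A∪B∪C| = Σ|A| − Σ|A∩B| + |A∩B∩C| for the three-way term.

    Assumes each pairwise overlap count includes the three-way overlap.
    """
    return total - sum(sizes) + sum(pairwise)


# Counts reported for the DIP dataset with the Lin measure.
bp, mf, cc = 22, 18, 17
pairwise_overlaps = (9, 6, 3)  # bp∩mf, cc∩mf, cc∩bp

triple = implied_triple_overlap(41, (bp, mf, cc), pairwise_overlaps)
print(triple)  # 2
```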

6

Application to the detection of modules in a human PPI network affected by a specific disease

Here we present a real-life application of PROCOMOSS to a dataset of differentially expressed genes in gastric cancer. We extract the differentially expressed genes from this dataset and trace their interactions with other genes in the whole human PPI network. We downloaded the dataset from www.biolab.si/supp/bi-cancer/projections/info/GSE2685.htm/. It contains 8 samples (26.7%) of normal gastric tissue (Normal) and 22 samples (73.3%) of advanced gastric cancer tissue (Tumor), giving 30 samples and 4522 genes in total. We performed a t-test on this dataset to extract 1076 differentially expressed genes at a chosen significance threshold. We compiled a network that consists of these genes together with the other genes that interact with them. We take only the first neighbors of the differentially expressed genes in the whole interaction dataset, so our network is composed of 3079 proteins and 6049 interactions. Using PROCOMOSS we find 20 statistically and biologically significant clusters in this network with each of the Lin and Kappa semantic similarity measures. It takes 492.32 seconds

Table 5 GO-terms and pathways predicted by PROCOMOSS in the gastric cancer dataset using the Kappa measure (molecular function), with their p-values

Clusters

GO-terms (BP)

GO-terms (CC)

GO-terms (MF)

KEGG pathways

Cluster 1 (63 proteins)

Regulation of transcription, DNA-dependent (5.9 × 10^-35)

Nucleoplasm (5.4 × 10^-18)

Cluster 2 (43 proteins)

Regulation of neurological system process (4.5 × 10^-6)

Cell junction (1.4 × 10^-7)

Steroid hormone receptor activity (4.2 × 10^-45)
Glutamate receptor activity (2.1 × 10^-6)

Cluster 3 (89 proteins)

Regulation of apoptosis (2.1 × 10^-13)

Intracellular organelle lumen (2.7 × 10^-7)

Unfolded protein binding (2.4 × 10^-8)

Cluster 4 (58 proteins)

Regulation of transcription, DNA-dependent (6.5 × 10^-34)

Nucleoplasm part (5.2 × 10^-16)

Steroid hormone receptor activity (3.1 × 10^-46)

Cluster 5 (56 proteins)

Positive regulation of transcription, DNA-dependent (1.2 × 10^-23)

Nuclear lumen (1.7 × 10^-18)

Transcription regulator activity (2.0 × 10^-28)

Cluster 6 (56 proteins)

Regulation of apoptosis (3.4 × 10^-36)

Organelle envelope (9.3 × 10^-15)

Cluster 7 (83 proteins)

Regulation of transcription from RNA polymerase II promoter (5.3 × 10^-31)

Nuclear lumen (2.1 × 10^-15)

Protein heterodimerization activity (9.4 × 10^-9)
Steroid hormone receptor activity (1.7 × 10^-47)

Cluster 8 (53 proteins)

Regulation of transcription from RNA polymerase II promoter (2.5 × 10^-30)

Nuclear lumen (7.8 × 10^-14)

Steroid hormone receptor activity (8.8 × 10^-48)

Cluster 9 (51 proteins)

Apoptosis (10^-40)

Organelle envelope (1.2 × 10^-15)

Protein heterodimerization activity (1.7 × 10^-8)

Cluster 10 (73 proteins)
Cluster 11 (3 proteins)

Response to protein stimulus (6 × 10^-5)
Positive regulation of gene expression (7.3 × 10^-36)

Ribonucleoprotein complex (7.9 × 10^-2)
Nuclear lumen (1.6 × 10^-25)

Unfolded protein binding (1.8 × 10^-2)
Transcription regulator activity (1.1 × 10^-28)

Cluster 12 (57 proteins)
Cluster 13 (64 proteins)
Cluster 14 (52 proteins)

Protein folding (4.1 × 10^-9)

Pathways in cancer (4.4 × 10^-9), non-small cell lung cancer (9.6 × 10^-7), prostate cancer (1.8 × 10^-5), thyroid cancer (3.1 × 10^-5)
Long-term potentiation (1.1 × 10^-6), neurotrophin signaling pathway (3.6 × 10^-5), tight junction (6.0 × 10^-4)
Neurotrophin signaling pathway (2.6 × 10^-6), amyotrophic lateral sclerosis (ALS) (3.9 × 10^-5), MAPK signaling pathway (6.4 × 10^-4)
Pathways in cancer (7.0 × 10^-9), non-small cell lung cancer (4.3 × 10^-7), PPAR signaling pathway (6.8 × 10^-4), small cell lung cancer (9.5 × 10^-5), prostate cancer (1.3 × 10^-4), thyroid cancer (5.5 × 10^-4), pancreatic cancer (7.6 × 10^-3)
Pathways in cancer (8.2 × 10^-7), prostate cancer (1.3 × 10^-4), thyroid cancer (5.5 × 10^-4), non-small cell lung cancer (3.4 × 10^-3)
Amyotrophic lateral sclerosis (ALS) (4.7 × 10^-11), pathways in cancer (7.0 × 10^-11), colorectal cancer (5.3 × 10^-9)
Pathways in cancer (3.5 × 10^-7), non-small cell lung cancer (1.7 × 10^-4), thyroid cancer (4.5 × 10^-4), small cell lung cancer (9.1 × 10^-4)
Pathways in cancer (2.3 × 10^-6), non-small cell lung cancer (1.4 × 10^-4), PPAR signaling pathway (2.1 × 10^-5), thyroid cancer (4.0 × 10^-4)
Amyotrophic lateral sclerosis (ALS) (3.1 × 10^-13), pathways in cancer (0.6 × 10^-12), endometrial cancer (9.2 × 10^-9), colorectal cancer (6.3 × 10^-11)
Antigen processing and presentation (3.2 × 10^-2)
Pathways in cancer (1.9 × 10^-11), prostate cancer (3.8 × 10^-10), chronic myeloid leukemia (1.8 × 10^-9), acute myeloid leukemia (5.2 × 10^-9)
Neurotrophin signaling pathway (8.9 × 10^-3), prostate cancer (2.1 × 10^-2)
Long-term potentiation (2.1 × 10^-6), neurotrophin signaling pathway (5.5 × 10^-6)
Long-term potentiation (3.2 × 10^-6), tight junction (1.5 × 10^-4)

Cluster 15 (52 proteins)
Cluster 16 (53 proteins)

Regulation of neurological system process (1.0 × 10^-6)

Cytosol (2.3 × 10^-10)
Enzyme binding (1.8 × 10^-8)
Plasma membrane (7.9 × 10^-8)
Enzyme binding (3.3 × 10^-7)
Cytoplasmic membrane-bound vesicle (2.3 × 10^-8)
Protein kinase C binding (6.1 × 10^-5)
Cytosol (6.1 × 10^-8)
Enzyme binding (7.4 × 10^-6)
Cell junction (1.8 × 10^-7)
Enzyme binding (1.5 × 10^-8)

Cluster 17 (60 proteins)

Regulation of neurological system process (1.5 × 10^-7)

Plasma membrane (1.9 × 10^-7)

Cluster 18 (61 proteins)

Regulation of apoptosis (5.2 × 10^-11)

Cytosol (2.1 × 10^-5)
Unfolded protein binding (3.0 × 10^-7)

Cluster 19 (62 proteins)

Regulation of transcription from RNA polymerase II promoter (2.0 × 10^-30)
Regulation of apoptosis (2.2 × 10^-10)

Nucleoplasm (4.4 × 10^-14)

Cluster 20 (60 proteins)

Regulation of phosphate metabolic process (2.6 × 10^-6)
Glutamate signaling pathway (1.5 × 10^-6)
Protein folding (9.9 × 10^-9)

Steroid hormone receptor activity (8.2 × 10^-43)
Cytosol (2.1 × 10^-5)
Unfolded protein binding (2.2 × 10^-7)

Enzyme binding (7.6 × 10^-7)

Huntington’s disease (1.1 × 10^-2), spliceosome (3.2 × 10^-2)
Long-term potentiation (3.2 × 10^-6), neurotrophin signaling pathway (1.0 × 10^-4), tight junction (1.5 × 10^-5)
Endocytosis (4.0 × 10^-5), tight junction (4.0 × 10^-5), chemokine signaling pathway (4.4 × 10^-5)
Neurotrophin signaling pathway (3.6 × 10^-8), amyotrophic lateral sclerosis (ALS) (3.0 × 10^-6)
Pathways in cancer (3.0 × 10^-8), prostate cancer (9.4 × 10^-7), non-small cell lung cancer (1.8 × 10^-5), thyroid cancer (7.4 × 10^-4)
Neurotrophin signaling pathway (3.6 × 10^-8), amyotrophic lateral sclerosis (ALS) (3.0 × 10^-6)

to initialize the population and 49.06 seconds to run 5 generations with a population size of 50 on a Core 2 Duo 2.26 GHz PC with 2 GB of internal memory running Windows 7. Tables 5 and 6 show the predicted clusters found by using


Table 6 GO-terms and pathways predicted by PROCOMOSS in the gastric cancer dataset using the Lin measure (biological function), with their p-values

Cluster

GO-terms (BP)

GO-terms (CC)

GO-terms (MF)

KEGG pathway

Cluster 1 (47 proteins)

Regulation of transcription from RNA polymerase II promoter (2.4 × 10^-26)
Cytoskeleton organization (4.2 × 10^-20)

Chromatin remodeling complex (9.8 × 10^-12)
Intracellular non-membrane-bound organelle (1.8 × 10^-20)
Nuclear lumen (7.1 × 10^-17)

Ligand-dependent nuclear receptor activity (1.2 × 10^-37)
Actin binding (3.5 × 10^-29)

Pathways in cancer (5.7 × 10^-9), notch signaling pathway (5.9 × 10^-5)

Transcription regulator activity (8.1 × 10^-22)
Actin binding (5.1 × 10^-27)

Prostate cancer (8.3 × 10^-6), non-small cell lung cancer (0.2 × 10^-4), glioma (4.0 × 10^-4)

Transcription factor binding (2.1 × 10^-15)

Pathways in cancer (1.5 × 10^-10), renal cell carcinoma (1.5 × 10^-9), prostate cancer (8.7 × 10^-5)

Nucleoplasm (2.7 × 10^-20)

Transcription regulator activity (2.7 × 10^-38)

Chromatin remodeling complex (5.2 × 10^-17)
Nuclear lumen (1.8 × 10^-15)

Transcription factor binding (7.2 × 10^-17)

Pathways in cancer (1.4 × 10^-10), non-small cell lung cancer (6.9 × 10^-9), thyroid cancer (7.8 × 10^-8), small cell lung cancer (3.7 × 10^-6), prostate cancer (5.5 × 10^-6)
Cell cycle (2.4 × 10^-5), prostate cancer (5.9 × 10^-5)

Cluster 2 (153 proteins)
Cluster 3 (60 proteins)
Cluster 4 (160 proteins)
Cluster 5 (46 proteins)
Cluster 6 (98 proteins)

Regulation of transcription, DNA-dependent (1.8 × 10^-21)
Cytoskeleton organization (1.9 × 10^-17)
Positive regulation of transcription from RNA polymerase II promoter (4.1 × 10^-16)
Regulation of RNA metabolic process (6.8 × 10^-34)

Cluster 7 (49 proteins)

Regulation of transcription (6.8 × 10^-18)

Cluster 8 (97 proteins)

Cluster 15 (59 proteins)

Regulation of transcription from RNA polymerase II promoter (4.3 × 10^-27)
Regulation of RNA metabolic process (1.4 × 10^-41)
Regulation of transcription, DNA-dependent (4.5 × 10^-21)
Positive regulation of transcription from RNA polymerase II promoter (5.6 × 10^-18)
Regulation of transcription from RNA polymerase II promoter (2.3 × 10^-22)
Regulation of transcription from RNA polymerase II promoter (3.7 × 10^-24)
Regulation of transcription from RNA polymerase II promoter (5.3 × 10^-15)
Regulation of apoptosis (1.3 × 10^-9)

Cluster 16 (92 proteins)
Cluster 17 (65 proteins)

Cluster 9 (103 proteins)
Cluster 10 (44 proteins)
Cluster 11 (44 proteins)
Cluster 12 (78 proteins)
Cluster 13 (45 proteins)
Cluster 14 (43 proteins)

Cluster 18 (71 proteins)
Cluster 19 (55 proteins)

Cluster 20 (8 proteins)

Intracellular non-membrane-bound organelle (7.6 × 10^-11)
Intracellular organelle lumen (4.5 × 10^-14)

Nuclear lumen (1.7 × 10^-24)
Chromatin remodeling complex (2.6 × 10^-20)
Intracellular organelle lumen (7.3 × 10^-13)
Nuclear lumen (5.5 × 10^-20)
Chromatin remodeling complex (1.0 × 10^-8)
Nucleoplasm (4.9 × 10^-11)

Transcription regulator activity (7.7 × 10^-29)
Transcription regulator activity (2.0 × 10^-43)
Transcription regulator activity (1.2 × 10^-16)
Transcription factor binding (5.0 × 10^-17)
Ligand-dependent nuclear receptor activity (8.2 × 10^-30)
Ligand-dependent nuclear receptor activity (9.0 × 10^-41)
Transcription factor binding (1.0 × 10^-21)

Cytosol (9.3 × 10^-5)

Unfolded protein binding (4.3 × 10^-6)

Positive regulation of gene expression (1.9 × 10^-49)

Nuclear lumen (1.4 × 10^-27)

Transcription regulator activity (4.8 × 10^-49)

Positive regulation of transcription, DNA-dependent (1.4 × 10^-15)
Positive regulation of transcription, DNA-dependent (4.1 × 10^-31)
Positive regulation of nitrogen compound metabolic process (4.1 × 10^-35)

Nuclear lumen (6.6 × 10^-16)

Transcription regulator activity (3.98 × 10^-17)
Transcription factor binding (4.2 × 10^-40)

Nuclear lumen (1.1 × 10^-26)

Transcription regulator activity (2.0 × 10^-28)

Regulation of transcription, DNA-dependent (3.5 × 10^-33)

Chromatin remodeling complex (8.9 × 10^-24)

Transcription regulator activity (2.0 × 10^-36)

Nucleoplasm (3.1 × 10^-24)

Fc gamma R-mediated phagocytosis (3.5 × 10^-12), regulation of actin cytoskeleton (7.9 × 10^-9)

Fc gamma R-mediated phagocytosis (4.2 × 10^-10), regulation of actin cytoskeleton (3.0 × 10^-8)

Pathways in cancer (1.1 × 10^-14), prostate cancer (8.1 × 10^-8), renal cell carcinoma (1.8 × 10^-7), pancreatic cancer (4.4 × 10^-5)
Pathways in cancer (2.5 × 10^-11), prostate cancer (5.8 × 10^-11), chronic myeloid leukemia (3.1 × 10^-9)
Cell cycle (9.0 × 10^-3), chronic myeloid leukemia (2.8 × 10^-3)
Pathways in cancer (7.8 × 10^-11), renal cell carcinoma (1.0 × 10^-6), prostate cancer (7.2 × 10^-5)
Pathways in cancer (5.4 × 10^-9), non-homologous end-joining (5.5 × 10^-6), chronic myeloid leukemia (6.9 × 10^-6)
Pathways in cancer (8.0 × 10^-8), PPAR signaling pathway (1.4 × 10^-5), notch signaling pathway (5.9 × 10^-5)
Thyroid cancer (6.1 × 10^-8), pathways in cancer (1.1 × 10^-7), non-small cell lung cancer (1.1 × 10^-7)
Neurotrophin signaling pathway (3.6 × 10^-6), amyotrophic lateral sclerosis (ALS) (3.0 × 10^-6), MAPK signaling pathway (1.3 × 10^-5)
Pathways in cancer (3.9 × 10^-19), chronic myeloid leukemia (7.8 × 10^-13), acute myeloid leukemia (4.8 × 10^-10), prostate cancer (2.4 × 10^-9)
Prostate cancer (1.8 × 10^-6), pathways in cancer (2.7 × 10^-4)
Pathways in cancer (2.7 × 10^-11), cell cycle (8.8 × 10^-7), thyroid cancer (8.9 × 10^-7), non-small cell lung cancer (2.1 × 10^-5)
Pathways in cancer (1.4 × 10^-10), prostate cancer (0.8 × 10^-10), chronic myeloid leukemia (1.4 × 10^-9), acute myeloid leukemia (4.1 × 10^-9), pancreatic cancer (5.1 × 10^-7)
Pathways in cancer (1.7 × 10^-7), chronic myeloid leukemia (3.4 × 10^-4), cell cycle (5.0 × 10^-4), small cell lung cancer (5.7 × 10^-4)


Kappa’s and Lin semantic similarity measures respectively. Here we have listed the most significant GO-terms, GO-id and the corresponding p-value for the three broad GO categories: biological process, molecular function and cellular component. We also find the significant KEGG pathways for the human proteins participating in each cluster. In Table 5, the first cluster consists of 63 proteins which are involved in several cancer pathways, including lung cancer, prostate cancer and thyroid cancer. In cluster 2, the 43 proteins belong to the long-term potentiation (1.1 × 10^-6) and neurotrophin signaling (3.6 × 10^-5) pathways. Hippocampal long-term potentiation (LTP), a long-lasting increase in synaptic efficacy, is the molecular basis for learning and memory. Neurotrophins are a family of trophic factors involved in differentiation and survival of neural cells. The neurotrophin family consists of nerve growth factor (NGF), brain derived neurotrophic factor (BDNF), neurotrophin 3 (NT-3), and neurotrophin 4 (NT-4). Neurotrophin/Trk signaling is regulated by connecting a variety of intracellular signaling cascades, which include the MAPK pathway, PI-3 kinase pathway, and PLC pathway, transmitting positive signals like enhanced survival and growth (http://www.genome.jp/kegg/pathway/hsa/hsa04722.html). On the other hand, p75NTR transmits both positive and negative signals. These signals play an important role in neural development and additional higher-order activities such as learning and memory. We see that the proteins in a significant number of predicted clusters are involved in different types of cancer pathways, viz. small cell and non-small cell lung cancer, prostate cancer, thyroid cancer, pancreatic cancer, colorectal cancer, endometrial cancer, chronic myeloid leukemia and acute myeloid leukemia. In Table 6 we see that the 47 proteins in cluster 1 belong to cancer pathways and the notch signaling pathway.
The notch signaling pathway is an evolutionarily conserved intercellular signaling mechanism essential for proper embryonic development in all metazoan organisms. Proteins in cluster 2 belong to the Fc gamma R-mediated phagocytosis and regulation of actin cytoskeleton pathways. Phagocytosis plays an essential role in host-defense mechanisms through the uptake and destruction of infectious pathogens. Specialized cell types including macrophages, neutrophils, and monocytes take part in this process in higher organisms. After opsonization with antibodies (IgG), foreign extracellular materials are recognized by Fc gamma receptors. Cross-linking of Fc gamma receptors initiates a variety of signals mediated by tyrosine phosphorylation of multiple proteins, which lead through the actin cytoskeleton rearrangements and membrane remodeling to the formation of phagosomes (http://www.genome.jp/kegg/pathway/hsa/hsa04666.html). Besides these pathways, the proteins of the predicted clusters are here also involved in a significant number of cancer-related pathways. This implies that the protein complexes identified by PROCOMOSS are highly involved in cancer progression and thus are possible candidates for further validation. In Tables 5 and 6, the low p-values of the GO-terms signify that the clusters are statistically significant and that the occurrence of proteins in those clusters is not merely by chance. Hence we conclude that PROCOMOSS provides statistically and biologically significant clusters from a human PPI network consisting of proteins that are affected by a specific disease.
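To make concrete what a low p-value for a GO term or KEGG pathway means, the sketch below computes a one-sided hypergeometric (over-representation) tail probability with the standard library only. This is a common way of scoring annotation enrichment, shown here as an illustration; it is not necessarily the exact statistic behind Tables 5 and 6. The function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, cluster_size).

    population   -- number of proteins in the background network
    annotated    -- background proteins carrying the term of interest
    cluster_size -- number of proteins in the cluster
    hits         -- cluster proteins carrying the term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 of the 63 proteins in a cluster carry a term that
# annotates 100 of the 3079 proteins in the network.
p = enrichment_p_value(3079, 100, 63, 20)
print(f"{p:.3e}")
```

The smaller this tail probability, the less likely it is that the cluster gathered that many annotated proteins by chance.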

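The network construction described at the start of this section (select differentially expressed genes, then keep their first neighbors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. It ranks genes by Welch's t statistic with an arbitrary cutoff instead of a p-value threshold, it keeps every interaction that touches at least one selected gene, and it drops selected genes that have no interactions. All names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(normal, tumor):
    """Welch's t statistic for one gene.

    Assumes at least two values per group and a nonzero pooled standard error.
    """
    se = sqrt(variance(normal) / len(normal) + variance(tumor) / len(tumor))
    return (mean(tumor) - mean(normal)) / se


def first_neighbor_network(seeds, interactions):
    """Keep interactions with at least one endpoint in `seeds` (assumption)."""
    seeds = set(seeds)
    edges = {
        frozenset((a, b))
        for a, b in interactions
        if a != b and (a in seeds or b in seeds)
    }
    # Nodes are taken from the kept edges, so isolated seed genes are dropped.
    nodes = set().union(*edges)
    return nodes, edges


# Toy expression data: gene -> (normal samples, tumor samples).
expression = {
    "G1": ([1.0, 1.2, 0.9], [3.1, 2.9, 3.3]),
    "G2": ([2.0, 2.1, 1.9], [2.0, 2.2, 1.8]),
}
T_CUTOFF = 4.0  # arbitrary stand-in for a significance threshold
seeds = [g for g, (n, t) in expression.items() if abs(welch_t(n, t)) > T_CUTOFF]

interactions = [("G1", "P1"), ("P1", "P2"), ("G2", "P3"), ("G1", "G2")]
nodes, edges = first_neighbor_network(seeds, interactions)
print(sorted(nodes))  # ['G1', 'G2', 'P1']
```

In this toy run only G1 passes the cutoff, so the interaction between P1 and P2 and the one between G2 and P3 are discarded, while G2 enters the network as a first neighbor of G1.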

7

Conclusion

In this article we present a multiobjective Gene Ontology based genetic algorithm for finding protein complexes in a protein–protein interaction network. We were able to group functionally similar proteins in a cluster by using the semantic similarity of GO terms between protein pairs as one objective function, whereas the density of the cluster is controlled by a graph based objective function. We use the similarity measures proposed by Lin, by Jiang and Conrath, and the Kappa measure to compute the similarity matrix. PROCOMOSS shows better performance on the DIP dataset when the Lin measure is used to build the similarity matrix, compared to the other measures. For the MIPS dataset, PROCOMOSS performs comparatively well with the Kappa and the Jiang and Conrath measures. Irrespective of the similarity measure used, PROCOMOSS provides a greater number of functionally homogeneous clusters (lower p-values). Moreover, the PPV and accuracy values are consistently better for the clusters provided by PROCOMOSS than for those of the other algorithms. We observed that the density of the protein interaction network built from the MIPS dataset is much lower than that of the DIP dataset. The number of predicted complexes in MIPS is also much lower than that in the DIP reference dataset. All the algorithms, including PROCOMOSS, have higher sensitivity and accuracy on the DIP dataset than on the MIPS dataset. From this we conclude that it is more difficult to discover complexes in a low density network. We have also applied PROCOMOSS to a human PPI network consisting of differentially expressed genes in gastric cancer and have been able to extract statistically and biologically significant gene modules. Gene ontology based study and pathway analysis of these modules reveal the wider applicability of the PROCOMOSS algorithm. As future work we plan to use other semantic similarity measures as objective functions for predicting protein complexes.
PROCOMOSS can also be applied to the protein interaction networks of other species to predict protein complexes.
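As a reminder of how two of the information-content based measures named above are usually defined, the sketch below implements the standard Lin similarity and the Jiang and Conrath distance for a pair of GO terms. Here `ic_mica` is the information content of their most informative common ancestor. Turning the Jiang and Conrath distance into a similarity via 1/(1 + d) is one common convention and is an assumption here, as are the function names and the example counts. The Kappa measure, which compares annotation profiles rather than information content, is not covered.

```python
from math import log


def information_content(term_count, total_count):
    """IC(t) = -ln p(t), with p(t) estimated from annotation counts."""
    return -log(term_count / total_count)


def lin_similarity(ic_a, ic_b, ic_mica):
    """Lin: 2 * IC(MICA) / (IC(a) + IC(b))."""
    denom = ic_a + ic_b
    # Convention chosen here: two zero-information terms score 0.
    return 2.0 * ic_mica / denom if denom > 0 else 0.0


def jiang_conrath_distance(ic_a, ic_b, ic_mica):
    """Jiang and Conrath: IC(a) + IC(b) - 2 * IC(MICA)."""
    return ic_a + ic_b - 2.0 * ic_mica


def jiang_conrath_similarity(ic_a, ic_b, ic_mica):
    """Distance mapped to (0, 1] via 1 / (1 + d); an assumed convention."""
    return 1.0 / (1.0 + jiang_conrath_distance(ic_a, ic_b, ic_mica))


# Hypothetical counts: terms a and b annotate 20 and 50 of 1000 proteins,
# and their most informative common ancestor annotates 200.
ic_a = information_content(20, 1000)
ic_b = information_content(50, 1000)
ic_mica = information_content(200, 1000)
print(lin_similarity(ic_a, ic_b, ic_mica))
print(jiang_conrath_similarity(ic_a, ic_b, ic_mica))
```

Applying such a function to the annotations of every protein pair would give a similarity matrix of the kind used as the functional objective.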

References

1 A. Wagner, Proc. R. Soc. London, Ser. B, 2004, 457–466.
2 L. Mirny and V. Spirin, Proc. Natl. Acad. Sci. U. S. A., 2003, 100(21), 12123–12128.
3 M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa and S. Kanaya, BMC Bioinf., 2006, 7, 207.
4 S. Brohée and J. van Helden, BMC Bioinf., 2006, 7, 471–488.
5 J. Pereira-Leal, A. Enright and C. Ouzounis, Proteins, 2004, 54, 49–57.
6 G. Bader and C. Hogue, BMC Bioinf., 2003, 4, 1471–2105.
7 S. Van Dongen, ‘A new cluster algorithm for graphs,’ Center for Mathematics and Computer Science (CWI), Amsterdam technical report, 2000.
8 N. Przulj and D. Wigle, Bioinformatics, 2003, 20, 340–348.
9 T. Nepusz, H. Yu and A. Paccanaro, Nat. Methods, 2012, 9, 471–472.
10 K. Deb, A. Pratap, S. Agrawal and T. Meyarivan, IEEE Trans. Evol. Comput., 2002, 6, 182–197.
11 K. Deb, Multi-objective Optimization Using Evolutionary Algorithms, John Wiley and Sons, Ltd., England, 2001.
12 S. Bandyopadhyay, A. Mukhopadhyay and U. Maulik, Multiobjective Genetic Algorithms for Clustering, Springer-Verlag, Berlin, Heidelberg, 2011.
13 B. J. Frey and D. Dueck, Science, 2007, 315, 972–976.
14 C. A. Coello, Knowl. Inf. Syst., 1999, 1, 129–156.
15 C. Coello Coello, D. V. Veldhuizen and G. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems, Kluwer Academic Publishers, 2002.
16 C. Coello Coello, IEEE Comput. Intell. Mag., 2006, 1, 28–36.
17 M. Ashburner, C. Ball, J. Blake, D. Botstein, H. Butler, J. Cherry, A. Davis, K. Dolinski and S. Dwight et al., Nat. Genet., 2000, 25, 25–29.
18 H. Wang, F. Azuaje, O. Bodenreider and J. Dopazo, Proc. IEEE Symp. Comput. Intell. Bioinf. Comput. Biol., 2004, 25–31.
19 F. M. Couto, M. J. Silva and P. M. Coutinho, Data Knowl. Eng., 2007, 61, 137–152.
20 P. Lord, R. Stevens, A. Brass and C. Goble, Bioinformatics, 2003, 19, 1275–1283.
21 D. Lin, Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 296–304.
22 J. J. Jiang and D. W. Conrath, Proceedings of the International Conference Research on Computational Linguistics, 1997.
23 D. Huang, B. Sherman, Q. Tan, J. Collins, W. Alvord, J. Roayaei, R. Stephens, M. Baseler, H. Lane and R. Lempicki, Genome Biol., 2007, 8, R183.
24 U. Maulik, A. Mukhopadhyay and S. Bandyopadhyay, IEEE Trans. Inf. Technol. Biomed., 2009, 13, 969–975.
25 A. Mukhopadhyay, U. Maulik and S. Bandyopadhyay, PLoS One, 2012, 7, e32289.
26 U. Maulik, M. Bhattacharyya, A. Mukhopadhyay and S. Bandyopadhyay, Mol. BioSyst., 2011, 7, 1842–1851.
27 I. Xenarios, L. Salwinski, X. Duan, P. Higney, S. Kim and D. Eisenberg, Nucleic Acids Res., 2002, 30, 303–305.
28 U. Güldener, M. Münsterkötter, M. Oesterheld, P. Pagel, A. Ruepp, H. Mewes and V. Stümpflen, Nucleic Acids Res., 2006, 34, 436–441.
29 R. Rosenzweig, P. A. Osmulski, M. Gaczynska and M. H. Glickman, Nat. Struct. Mol. Biol., 2008, 15, 573–580.
30 E. Champion, B. Lane, M. Jackrel, L. Regan and S. Baserga, Mol. Cell. Biol., 2008, 21, 6547–6556.
