using networks to measure similarity between genes ... - Walhout Lab

perspective

Using networks to measure similarity between genes: association index selection

npg

© 2013 Nature America, Inc. All rights reserved.

Juan I Fuxman Bass1,2, Alos Diallo1,2, Justin Nelson3, Juan M Soto1,2, Chad L Myers3 & Albertha J M Walhout1,2 Biological networks can be used to functionally annotate genes on the basis of interactionprofile similarities. Metrics known as association indices can be used to quantify interaction-profile similarity. We provide an overview of commonly used association indices, including the Jaccard index and the Pearson correlation coefficient, and compare their performance in different types of analyses of biological networks. We introduce the Guide for Association Index for Networks (GAIN), a web tool for calculating and comparing interaction-profile similarities and defining modules of genes with similar profiles.

Biological processes are orchestrated through complex interaction networks. Networks are modeled as graphs that depict interactions (‘edges’) between biological entities such as genes, tissues, proteins and metabolites (‘nodes’; see Box 1). If only one type of node is involved, as in protein-protein1,2 or genetic interaction networks3, the graph is defined as monopartite. Bipartite graphs, by contrast, describe interactions between two different types of nodes (X-type and Y-type), with edges connecting only nodes of different types (Fig. 1a). Bipartite graphs include proteinDNA interaction networks4–6, metabolic networks7,8, phenotypic networks9 and expression networks10–14. Networks are powerful tools for gene function annotation. For instance, the ‘guilt-by-association’ principle postulates that if a node with an unknown function has an interaction profile similar to that of a node with a known function, its function may be similar as well2,15. Additionally, network analysis can identify modules—neighborhoods comprising nodes with similar interaction profiles that can point to functional relationships between larger sets of genes16,17.

Although seemingly intuitive, it is not trivial to know how to best capture interaction-profile similarity between nodes, as numerous metrics, or association indices, can be used, and because each index can provide different values and rank similarity between pairs of nodes in a different order. Here, we provide an overview of commonly used association indices. We discuss the differences and similarities between association indices and provide a set of guidelines and a web tool for their selection for different applications. Types of association indices We focus here on bipartite networks that connect X-type nodes to Y-type nodes (Fig. 1a). In these networks, association indices can be used to measure shared Y-type nodes between two X-type nodes, or vice versa. An association index can measure interactionprofile similarity between X-type nodes A and B by calculating the shared partners (|N(A) ∩ N(B)|), in relation to their total number of interactions (‘node degree’), defined as |N(A)| and |N(B)|, and the total number of Y-type nodes in the network (ny) (Fig. 1a). There are three main types of indices, each of which uses the variables mentioned above in a different way (see Box 2). Similarity indices reflect the proportion of overlap and consider only the number of shared interactions between two X-type nodes and the individual degrees of these nodes, but they do not take the total number of Y-type nodes in the network into account. There are many similarity indices, most of which scale interaction-profile similarity between 0 and 1 (ref. 18) (Supplementary Table 1). We will focus on four that are commonly used in genomics and systems biology (see Box 2). The Jaccard index calculates the proportion

1Program in Systems Biology, University of Massachusetts Medical School, Worcester, Massachusetts, USA. 2Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, USA. 3Department of Computer Science and Engineering, University of Minnesota–Twin Cities, Minneapolis, Minnesota, USA. Correspondence should be addressed to J.I.F.B. ([email protected]) or A.J.M.W. ([email protected]).

Received 14 January; accepted 22 July; published online 26 november 2013; CORRECTED AFTER PRINT 27 JANUARY 2014; doi:10.1038/nmeth.2728

nature methods | VOL.10 NO.12 | DECEMBER 2013 | 1169

perspective Box 1 GLOSSARY OF TERMS


A graph is a pair G = (N, E) comprising a set N of nodes connected by a set E of edges. The degree of a node A (|N(A)|) is defined as the number of nodes with which it interacts. Hubs are nodes with a disproportionately high degree. A module is a set of highly interconnected nodes. A monopartite graph contains only one type of node. A bipartite graph contains two types of nodes (X-type and Y-type nodes), and connections occur only between nodes of a different type. An association index is a measure that quantifies interaction-profile similarity. An association network is a network in which two nodes of the same type (for example, only X-type nodes) are connected by an edge if their similarity exceeds a selected threshold.

of Y-type nodes shared between two X-type nodes relative to the total number of Y-type nodes connected to either X-type node. The Simpson index (equal to the meet/min index19 and similar to the topological overlap coefficient16) considers the number of shared Y-type nodes relative to the smallest degree of either X-type node. The geometric index calculates the square of the number of shared interactions between two X-type nodes, divided by the product of their individual degrees. Finally, the cosine index corresponds to the square root of the geometric index.

a

b

Bipartite network

Jaccard = 0.333 Simpson = 1 Geometric = 0.333 Cosine = 0.577 Hypergeometric = 0.146 PCC = 0.258

A

X-type nodes Example 1

B

Y-type nodes

Jaccard = 0.333 Simpson = 0.5 Geometric = 0.25 Cosine = 0.5 Hypergeometric = 0.053 PCC = –0.167

A

Example 2 B

Association index A

B

npg

Unlike similarity indices, matching indices, such as the simple matching coefficient and the Hamann index (Supplementary Table 1), consider the proportion of shared Y-type nodes as well as Y-type nodes that are not connected to either of the two X-type nodes. Because biological networks are sparse, shared nonpartners can contribute more to the similarity between two nodes than shared partners. Therefore, matching indices are not appropriate for the analysis of most biological networks and will not be discussed further.

Bipartite network B

C

D

E

Example 3

Interaction profile similarity

c

A

Jaccard = 0.667 Simpson = 1 Geometric = 0.667 Cosine = 0.816 Hypergeometric = 0.845 PCC = 0.730

A

F

B

PCC association network

CSI calculation

B –0.17 –0.35 –0.26 –0.35 C D 0.47 E F 0.09 –0.17 –0.75 0.47 A

B

C –0.17 –0.35 0.65 0.47 F E B D 0.47 0.09 –0.17 –0.75 0.47 A

C

D 0.47

E

F

CSIAB = 3/6

0.65 0.47 D 0.47 E

F

CSIAC = 1/6

0.47 A

C

B

0.47 A

Figure 1 | Measuring interaction-profile similarity between two nodes using association indices. (a) Bipartite graphs connect two types of nodes: X-type (purple) and Y-type (yellow). The interaction-profile similarity between a pair of X-type nodes (A and B) is determined on the basis of the number of shared Y-type nodes, the total number of Y-type nodes connected to A and B, and the total number of Y-type nodes in the network. (b) Association index comparison. For each pair of X-type nodes, the Jaccard, Simpson, geometric, cosine and hypergeometric indices and PCC were calculated on the basis of their interactions with Y-type nodes. (c) CSI calculation between nodes A and B for a bipartite network involving six X-type nodes (purple) and seven Y-type nodes (yellow). For each pair of X-type nodes the PCC was calculated (blue, positive values; red, negative values). In the PCC association network all the edges connected to A or B are highlighted. CSIAB represents the fraction of X-type nodes connected to both A and B, with PCC < PCCAB − 0.05. The CSI was also calculated between A and C. 1170 | VOL.10 NO.12 | DECEMBER 2013 | nature methods

perspective Box 2 DEFINITIONS OF ASSOCIATION INDICES The Jaccard index is the proportion of shared nodes between A and B relative to the total number of nodes connected to A or B. JAB =

| N(A) ∩ N(B) | | N(A) ∪ N(B) |

The Simpson index is the proportion of shared nodes relative to the degree of the least-connected node. SAB =

| N(A) ∩ N(B) | min(| N(A) |,| N(B) |)

The geometric index corresponds to the product of the proportion of shared nodes between A and B.

npg


GAB =

| N(A) ∩ N(B) |2 | N(A) | . | N(B) |

The cosine index is the geometric mean of the proportions of shared nodes between A and B. CAB =

| N(A) ∩ N(B) | | N(A) | . | N(B) |

The Pearson correlation coefficient is the correlation between the interaction profiles of A and B. PCCAB =

| N(A) ∩ N(B) | . ny − | N(A) | . | N(B) | | N(A) | . | N(B) | . (ny − | N(A) |).(ny − | N(B) |)

The hypergeometric index is the log-transformed probability of having an equal or greater interaction overlap than the one observed between A and B.  | N(A) |  ny − | N(A) |   .   i   | N(B) | − i  HAB = − log ∑  ny  i = |N(A) ∩ N(B)|    | N(B) | min(|N(A), N(B)|) 

The connection specificity index (CSI) is defined as the fraction of X-type nodes that have an interaction profile similarity with A and B that is lower than the interaction profile similarity between A and B itself. # nodes connected to A or B with PCC ≥ PCCAB − 0.05 # of X-type nodes in the network # nodes connected to A and B with PCC < PCCAB − 0.05 = # of X-type nodes in the network

CSIAB = 1 −

Statistic-based indices employ probability distributions (such as chi-square and Fisher’s exact test) to determine the likelihood of observing a certain overlap between the interaction profiles of two X-type nodes given their degree and the total number of Y-type nodes in the network18 (Supplementary Table 1). We will discuss two of the most commonly used statistic-based indices. The Pearson correlation coefficient (PCC) was originally developed to measure the linear relationship between two continuous variables, such as protein and mRNA levels. This metric can also be applied to bipartite networks where interactions are either present or absent. The PCC provides a value between −1 and 1 that describes how well the interactions overlap. A PCC of 1 indicates a perfect overlap, 0 corresponds to the number of shared interactions expected by chance and −1 depicts perfect anticorrelation. The hypergeometric index calculates the log-transformed

probability of observing an equal or greater number of shared nodes by chance and, therefore, measures the significance rather than the magnitude of the overlap. Comparing association indices Different association indices can provide different values of interaction-profile similarity. We illustrate this using three small example networks in which two X-type nodes, A and B, share different numbers of Y-type nodes, out of a total of seven (Fig. 1b). In each example, different indices can provide different values, ranging from perfect similarity (SimpsonAB = 1 in example 1) to low similarity (hypergeometricAB = 0.146 in example 1). Further, different indices can rank the interaction-profile similarity between a pair of nodes in different orders. For instance, according to most indices the profiles of A and B are most similar nature methods | VOL.10 NO.12 | DECEMBER 2013 | 1171

perspective b

a FIND MODULES

COMPARE SIMILARITY

GUIDE

HELP CONTACT LOGIN/OUT

GUIDE for ASSOCIATION INDEX for NETWORKS

PASSWORD List of interactions

Interaction matrix

Login Register New User G1 G2 G3 G4 G5 G6 G7 G8 G9

c1 c2 c3 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0

c4 c5 c6 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0

A C1 C1 C1 C2 C2 C2 C2 C3 C3 C3

or

B G1 G5 G6 G1 G2 G3 G5 G1 G2 G4

Find modules

Select index

Assemble networks

Histogram

Visuallze interactions

Compare similarity between selected pairs

Heatmap

1

Heatmap

USERNAME

K12H4.4 B0336.2 T10H9.3 E03H4.8 H19M22.3a K02010.5 F12F6.6 T22D1.4 T01B11.3 F57B10.1 F18C12.2a D2045.1 ZK180.4 Y57G11C.15 F57H12.1 K02B12.3 ZK637.8a Y51A2A.4 H15N14.2 C47G2.3 F59E10.3 T14G10.5 C24H11.7 Y71F9AL.17 T11F8.3 C09F5.2 R06C1.2 T17E9.2a Y25C1A.5 D1014.3 C55B7.5 Y51H7C.6a Y70C5C.6 F09E5.2 C36A4.4 C29F9.7 Y37D8A.10 ZK1058.2 F54C1.7 T01E8.3 C39F7.4 C39E9.3 T01C3.11 K03E6.7 F36H1.4a ZK792.6 ZK616.6 F31B12.1a H02I12.1 B0280.1 F57B9.2 F56A3.2

0

VISUALIZE DATA

D10 D21 D29 D32 D33 D40 D41 D42 D45 D53 D54 D57 D66 D71 D92 D09 D27 D43 D60 D23 D37 D47 D48 D82 D84 D51 D75 D59 D64 D88 D81 D35 D45 D04 D11 D13 D31 D94 D01 D25 D73 D91 D13 D03 D22 D50 D20 D36 D63 D05 D26 D61 D62 D83 D65 D15 D58 D87 D68 D85 D69 D28 D06 D80 D02 D49 D93 D95

GAIN

MANAGE PROJECT

c

Select pairs Select index

npg

d 0

e

f 4 All pairs of nodes Selected pairs of nodes

3 Density

Figure 2 | GAIN web tool for the calculation and clustering of association indices. (a) Screenshot of GAIN’s main window. (b,c) Visualization of a bipartite network as an interaction heat map (b) or as a graph (c). (d,e) Clustered association index displayed as heat map (d) and association network (e). (f) Density plot comparing the distribution of the association index values between a selected set of node pairs (blue) and all possible pairs of nodes (red).

1

Y57G11C.15 Y71F9AL.17 T14G10.5 F57H12.1 K02B12.3 T11F8.3 C24H11.7 Y51A2A.4 C47G2.3 F59E10.3 H15N14.2 ZK037.8a C09F5.2 F36H1.4a K03E6.7 F56A3.2 B0280.1 F57B9.2 H02I12.1 F31B12.1a ZK616.6 ZK792.6 T01E8.3 R06C1.2 Y25C1A.5 Y70C5C.6 D1014.3 ZK1058.2 Y37D8A.10 T01C3.11 C29F9.7 Y51H7C.6a F09E5.2 C36A4.4 C39F7.4 C55B7.5 F54C1.7 T17E9.2a C39E9.3 K02D10.5 K12H4.4 T22D1.4 T10H9.3 F57B10.1 F18C12.2a ZK180.4 E03H4.8 F12F6.6 B0336.2 D2045.1 T01B11.3 H19H22.3a

2

1 Y57G11C.15 Y71F9AL.17 T14G10.5 F57H17.1 K02B12.3 T11F3.3 C24H11.7 Y51A2A.4 C47G2.3 F59E10.3 H15N14.2 ZK637.8a C03F5.2 F36H1.4a K03E6.7 F56A3.2 B0280.1 F57B9.2 H02112.1 F31B12.1a ZK616.6 ZK792.6 T01E8.3 K06C1.2 Y25C1A.5 Y70C5C.6 D1014.3 ZK1058.2 Y37D8A.10 T01C3.11 C29F9.7 Y51H7C.6a Y03E5.2 C36A4.4 C39F7.4 C55B7.5 F54C1.7 T17E9.2a C39E9.3 K02D10.5 K12H4.4 T22D1.4 T10H9.3 F57B10.1 F16C12.2a ZK180.4 E03H4.8 F12F6.6 B0336.2 D2045.1 T01B11.3 H19H22.3a


Network

in example 3, but the Simpson index ranks examples 1 and 3 as equally similar. Finally, even for pairs of nodes that have different overlap and/or node degree, an index may output identical values as it condenses four variables (overlap, degree of A, degree of B and ny) into a single number. For instance, the Jaccard index cannot discriminate between examples 1 and 2 (Jaccard AB = 0.333), in which the total number of edges is differently distributed between A and B, whereas the other indices can. Nonspecific interactions can drive similarity The indices mentioned above consider the similarity in interacting partners between two X-type nodes but not the interaction specificity. Two issues need to be considered. First, Y-type hubs may confer artificially high levels of interaction-profile similarity: if half of all X-type nodes bind a Y-type hub, this overlap is not very informative. Second, not all Y-type nodes are independent, which may also confer exaggerated levels of interaction-profile similarity. For instance, neurons can be classified into different categories on the basis of the tissues in which they are located. Different types of neurons express common genes. Thus, in a gene-to-tissue network where genes are connected to the tissues in which they are expressed, neuronal genes may be connected to many classes of neurons, artificially increasing their similarity.

1172 | VOL.10 NO.12 | DECEMBER 2013 | nature methods

0 0

0.2

0.4 0.6 Index value

0.8

1.0

The connection specificity index (CSI) provides a contextdependent measure that mitigates the effect of nonspecific interactions by ranking the significance of similarity between two X-type nodes according to the specificity of their shared inter action partners9. The CSI between two nodes A and B is defined as the fraction of X-type nodes that have an interaction-profile similarity with A and B that is lower than the interaction-profile similarity between A and B themselves (see Box 2). As originally defined, the CSI employs the PCC as a first-level association index to rank the similarity between nodes, and then uses a constant of 0.05 to define the lower boundary of interaction-profile similarity9. When the constant is increased, the CSI provides a more stringent measure. Other association indices may also be used for a first-level ranking of interaction-profile similarity. Figure 1c illustrates an example in which the CSI reduces the influence of hubs. In this network, A and B interact with three and one Y-type nodes, respectively, and share one Y-type node, resulting in a PCCAB = 0.47 (Fig. 1c). A and C also share one interaction partner and therefore PCCAC = 0.47 as well. However, many other X-type nodes interact with the Y-type node connected to both A and C, hence this shared interaction is less specific. Applying CSI to these networks alleviates this problem: when a constant of 0.05 is used, CSIAB = 0.5, whereas CSIAC = 0.17 (Fig. 1c).

perspective a

Jaccard

Geometric

Simpson

Cosine

Hypergeometric

PCC

b

CSI

Jaccard

Simpson

Geometric or cosine

+ –

0.5 0.4 0.3 0.2 0.1 0

AUC = 0.895

0.2 0.1 0

0.5 0.4 0.3 0.2 0.1 0

Hypergeometric

AUC = 0.941

PCC

CSI

c

Figure 3 | Using association indices to identify modules in a gene-to-phenotype network. (a) Clustered association Jaccard 62 4 4 11 index heat maps for a C. elegans gene-to-phenotype network (top). The association index was calculated for 59 59 61 Simpson 62 each pair of genes according to shared phenotypic features and then clustered using hierarchical clustering. 0 12 Geometric 4 59 The distribution of index values for gene pairs that belong to the same module (intramodule, red) is 12 Cosine 4 59 0 plotted (bottom) against the values of gene pairs that belong to different modules (intermodule, black). Hyperg. 11 61 12 12 The area under the receiver operating characteristic curve (AUC) measures the separation between the two PCC 9 56 5 5 16 distributions. (b) The association networks shown were assembled by linking genes that have a top 10% CSI 20 61 17 17 26 phenotypic profile similarity value using the indicated indices. A force-directed layout was generated by Cytoscape 2.8.1. The colors of the nodes represent a manual partitioning of the genes in modules 9. (c) Pairwise comparison of the percentage of differences in the edges included in each of the association networks. Hyperg., hypergeometric.

15 15

d

0 0. 0.2 0.4 6 0. 8 1. 0

1.0 0.8 0.6 0.4 0.2 0 Geometric

Simpson

1.0 0.8 0.6 0.4 0.2

Hypergeometric

plc-1

perm-3

0

5

1.

0

C47G2.3

0.

.5 –0

15

5 10

1.0 0.8 0.6 0.4 0.2 0

0

Rank (%)

0.33 1 0.33 0.58 3.81 0.55 0.62

27 2 18 18 23 13 18

Index value

Rank (%)

Jaccard Simpson Geometric Cosine Hyperg. PCC CSI

0.42 1 0.42 0.65 4.86 0.62 0.62

15 2 11 11 14 9 18


0.44 0.89 0.42 0.65 5.79 0.60 0.96

11 7 11 11 10 10 1


0 0. 2 0. 0.4 6 0. 8 1. 0

0

5

1.

0

0.

.5 –0

15

5 10

Jaccard

Cosine

5 17 16 26

Index value

CSI

0 0. 2 0. 0.4 6 0. 8 1. 0

CSI

1.0 0.8 0.6 0.4 0.2 0

let-60

PCC

1.0 0.8 0.6 0.4 0.2 0

0.0 2 0. 4 0. 6 0. 8 1. 0

0.8 0.6 0.4 0.2 0

0 0. 0.2 4 0. 6 0. 8 1. 0

0 0. 0.2 0.4 6 0. 8 1. 0

Simpson

0.0 0.2 4 0. 6 0. 8 1. 0 Hypergeometric

plc-1

Cosine 1.0 0.8 0.6 0.4 0.2 0

1.0 0.8 0.6 0.4 0.2

b 1.0

L3 module S1 module M1 module Phenotype Phenotype hub

Geometric

1.0 0.8 0.6 0.4 0.2 0

0

Simpson

Jaccard

CSI

Figure 4 | Comparing association indices in the C. elegans gene-to-phenotype network. (a,b) Pairwise association index values for all genes according to shared phenotypic features for the Simpson index (a) and CSI (b), plotted against the values determined with the other indices. (c,d) The interaction-profile similarity between plc-1 and let-60 (belonging to different modules) (c), plc-1 and perm-3 (belonging to different modules) and between C47G2.3 and F57B10.1 (belonging to the same module) (d) were determined for all the association indices. The ranking of interaction-profile similarity across the entire network (in the top x% values from most to least similar) is indicated (right). Yellow nodes indicate phenotypes. Phenotype hubs (connected to more than 40% of the genes) are indicated with a blue outline. Hyperg., hypergeometric.

5 17

c

1.0 0.8 0.6 0.4 0.2 0

0.4 0.2 0

0.4 0.2 0

9 20 56 61

Finding network modules Network modules are groups of nodes with relatively high interaction-profile similarity and can point to shared biological function between nodes. To compare how different association indices perform in the identification of network modules, we used two bipartite networks. The first is a subset of a Caenorhabditis elegans gene-to-phenotype network that connects 52 essential genes to 94 phenotypic features9. We used genes that belong to four modules manually determined by the authors of the original paper to benchmark the performance of the different indices. Association indices were calculated for each pair of genes according to their shared phenotypic features and then clustered into heat maps (Fig. 3a). Visual inspection shows that the Simpson index is least suitable for the identification of the four modules, and CSI performs the best. Consistent

GAIN: a web tool for association indices and clustering We developed GAIN (http://csbio.cs.umn.edu/similarity_index/ login.php) (Fig. 2a and Supplementary Methods), which allows a user to upload an interaction data set and perform several tasks. First, interactions can be visualized as a heat map (Fig. 2b) or graph (Fig. 2c). Second, GAIN allows the user to find modules by calculating all pairwise values with a user-selected association index followed by hierarchical clustering and by displaying a heat map (Fig. 2d) or association network (Fig. 2e). Association networks contain one node type connected by an edge only when their interaction-profile similarity exceeds a user-selected threshold. Finally, GAIN can display a density plot to determine whether an association index can discriminate the interaction-profile similarity of a particular set of node pairs selected by the user from all node pairs (Fig. 2f). For instance, in a gene regulatory network this can be used to determine whether pairs of highly homologous transcription factors have more similar interaction profiles than all a 1.0 1.0 0.8 0.8 possible pairs of transcription factors. 0.6 0.6

0 0. 2 0. 0.4 6 0. 8 1. 0


Index value

npg

AUC = 0.903

0.3

Ja c Si car m d G pso e n C om os et H ine ric yp PC erg C . C SI

AUC = 0.909

L3 module L4 module S1 module M1 module

0 0. 2 0. 4 0. 6 0. 8 1. 0

0

0.4 0.3 0.2 0.1 0

.2 0 0. 0 .2 4 0. 6 0. 8

0.1

AUC = 0.909

–0

0.2

0.4 0.3 0.2 0.1 0

0 3 6 9 12 15

AUC = 0.799

0 0. 2 0. 4 0. 6 0. 8 1. 0

0.3

0 0. 2 0. 4 0. 6 0. 8

AUC = 0.917

0 0. 2 0. 4 0. 6 0. 8 1. 0

0.4 0.3 0.2 0.1 0

0 0. 2 0. 4 0. 6 0. 8

Fraction of gene pairs

Intramodule Intermodule

PCC

F57B10.1

nature methods | VOL.10 NO.12 | DECEMBER 2013 | 1173

perspective

Top k a.i. values (knn score)

Ranked a.i. values

... Function n?

Fraction of genes

1.00

F1 vs. F1

Gene-to-phenotype network

c 0.83

0.97 Median AUC

X vs. F2 genes

X vs. F1

knn score

Non-F2 vs. F2

Fraction of genes

Function 2? X

Top k a.i. values (knn score)

Ranked a.i. values

Function 1?

X vs. F2

0.94 0.91 0.88

F2 vs. F2

Jaccard Simpson

0.81

Geometric Cosine

0.79

Hypergeometric 0.77

PCC CSI

0.75

0.85 1 2 3 4 5 6 Number of nearest neighbors

knn score

Protein-DNA interaction network

1 2 3 4 5 6 Number of nearest neighbors

Figure 5 | Predicting gene function. (a) A k-nearest-neighbor (knn) algorithm was used to evaluate how well each index is able to assign genes to functional classes (F). To determine whether an uncharacterized gene X can be assigned a particular function, a knn score was determined as the average of the top k association index (a.i.) values between X and genes with that function. The knn values were then calculated for genes that have that function (blue and green) and for genes that do not (black curves). To assign a function to gene X, these values should be well separated (determined by calculating the AUC). a illustrates a case in which gene X can be assigned function 1 (F1, blue) but not function 2 (F2, green). The median AUC determined for all the functional classes was used as a measure of performance of the different association indices to predict gene function. (b) The median AUC calculated for the four functional classes in the C. elegans gene-to-phenotype network was determined for k values of 1 to 6 (number of nearest neighbors). (c) The median AUC calculated for the biological process GO slim terms in the yeast protein-DNA interaction network was determined for k values from 1 to 6.

b

A

E

D

F G

H

2

1

k

N

et

w

or

k or

of ir

et 1 1 0 0 ... 1 0 ...

1 1 1 0 ... 0 0 ...

Y-type nodes Compare interaction lists using association indices

Network 2

Y-type nodes

0.8 0.6 0.4 0.2 0

Network 2 X-type nodes

0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 Similarity according to Z

N Inte on r in ac te tin ra g ct pa in ir g s pa irs

C

w

no

H

A-B A-C A-D A-E ... B-C B-D ...

Network 1 X-type nodes

X-type nodes

Index value

C Network 2

B

F G

B

N

D

E

Pa

A

de

Network 1

c

Network 1

s

a

assigned by manual classification (Fig. 3b). Generally, association networks obtained with different indices exhibit a large degree of overlap in the edges included, except for those obtained with the Simpson index and CSI (Fig. 3c). Indeed, by determining all pairwise association index values for each pair of indices, i.e., by not limiting to the top 10%, comparisons involving Simpson or CSI were least correlated (Fig. 4a,b and Supplementary Fig. 1). This analysis of a real network further substantiates the notion that different indices can result in different values and ranking of interaction-profile similarity. Neither the Simpson index nor CSI is well correlated with any of the other indices, but the consequences in each case are quite different: for Simpson the poor correlation results in reduced module demarcation, whereas for CSI it results in precisely the opposite. The denominator in

Similarity according to Y

with this observation, we found that CSI is best able to discriminate between the interaction-profile similarity between nodes that belong to the same module and that of nodes belonging to different modules (Fig. 3a). Next, we asked which index performs best to delineate association networks for this gene-to-phenotype network. These networks connect nodes that have an interaction-profile similarity above a certain threshold. Therefore, they serve not only to delineate modules but also to identify nodes related to more than one module and nodes that are not related to any module. We used the top 10% of the values obtained with each index (Fig. 3b). CSI outperforms the other indices, as it (i) better demarcates the modules, (ii) leaves only two genes not assigned to any module and (iii) places only one gene into a module different from that

npg


Ja c Si car G mp d eo so m n H yp e er C tric ge os om ine et ric PC C C SI

Ja c Si car G mp d eo so m n H yp e er C tric ge os om ine et ric PC C C SI

Index value in gene-to expression network

Z-type nodes Figure 6 | Application of association indices to network integration. (a) Using association indices, edges in one monopartite network can be Interacting bHLH pairs Highly coexpressed d e gene pairs compared to those in another, focusing either on a particular node Noninteracting bHLH Other gene pairs pairs (blue box) or on the entire network. (b) To determine whether interacting 6 7 nodes in network 2 have a similar interaction profile in network 1, 4 4 2 1 interaction-profile similarity values in network 1 can be compared between interacting 1.0 1.5 0.8 and noninteracting nodes in network 2. (c) The interaction-profile similarity between 0.6 1.0 X-type nodes in two networks can be compared to determine whether node pairs 0.4 that have a similar interaction profile in network 1 also have a similar interaction 0.2 0.5 0 profile in network 2. Edge width in the association network is proportional to the –0.2 0 association index value. (d) The association index values of C. elegans bHLH transcription factors was determined according to the tissues in which they are expressed and was partitioned between proteins that physically interact (red) and those that do not (gray). Each box spans from the first to the third quartile, the horizontal lines inside the boxes indicate median value and the whiskers indicate minimum and maximum values (b,d). (e) Association index values were determined for pairs of promoters in the yeast protein-DNA interaction network. The values for pairs of highly coexpressed genes (top 1%; red) and other gene pairs (bottom 99%; gray) are plotted.

Index value in protein-DNA interaction network


b

Non-F1 vs. F1

X vs. F1 genes

Median AUC

a

perspective Table 1 | Association index performance for different applications Identifying network modules Predicting gene function Comparing two sets of node pairsa Determining significance of overlap

Jaccard

Simpson

Geometric

Cosine

Hypergeometric

PCC

CSI

** ** ** No

* * *** No

** ** ** No

** ** ** No

** ** * Yes

** ** *** No

*** *** * No

Asterisks indicate qualitative strengths, with a greater number indicating greater utility.

npg


aAssessment

depends on biological question or objective.

the Simpson index uses only the lower of the two overall node degrees, which can lead to artificially high levels of interactionprofile similarity, even for genes belonging to different modules (Fig. 4c). In the case of CSI, ranking similarity according to interaction specificity results in a higher value for gene pairs with shared specific phenotypes (C47G2.3 and F57B10.1), and a lower value for gene pairs with shared common phenotypes (plc-1 and perm-3) (Fig. 4d). The second example network contains protein-DNA inter actions between 102 yeast transcription factors and 542 promoters4. Association indices were calculated for each pair of promoters according to their shared transcription factors, and values were clustered into heat maps (Supplementary Fig. 2a). The heat maps are visually quite similar, and numerous modules can be detected. There were no previously benchmarked modules available. However, because genes with similar functions are frequently bound by the same transcription factor(s), we assessed the performance of the different indices by analyzing the biological process Gene Ontology (GO) enrichment in three different modules (Supplementary Fig. 2b). Two modules were detected equally well by all association indices (Supplementary Fig. 2c); however, for the third module significant enrichment (P < 0.001) for genes involved in the oxidation-reduction process was detected only by CSI and the Jaccard, geometric and hypergeometric indices (Supplementary Fig. 2c). Thus, association indices can perform differently in different types of networks and even within a network. Predicting function of individual genes Biologists frequently identify single genes of unknown functions, for instance in a genetic screen. So far, we have discussed network modules as a starting point for functional annotation. However, for analysis of only a single gene, there is no need to first comprehensively identify network modules. Moreover, modules are not always suitable for annotation of the function of every gene, as a gene may not belong to a clearly defined module and may have more than one function. An intuitive way to annotate gene function is to use the guilt-by-association principle, which postulates that two genes with similar functions have similar interaction profiles. One can assign functions to genes using a variety of different algorithms. Here, we use a k-nearest-neighbor algorithm that tests associations between genes and functions (Fig. 5a). An unknown gene can be assigned to each function F depending on (i) the top k association index values between that gene and the gene(s) that are known to have that function, and (ii) the specificity of the distribution of those values. In the example shown in Figure 5a, the highest score for the unknown gene (X) with genes with either known function 1 (F1, blue) or function 2 (F2, green) is similar (red lines). However, for function 1, the two distributions are largely separate, whereas for function 2 the two

distributions overlap greatly. Thus, function 1 can be assigned to gene X with greater confidence than function 2. We assessed which association index best predicts function using the two networks described above. Again, CSI was best able to assign genes to functional classes, and the Simpson index performed the worst (Fig. 5b,c). This result is consistent with the ability of CSI to consider interaction specificity. Integrated networks The integration of different types of networks enables the comparison of pairs of nodes across networks13,20. Questions that can be answered include (i) whether directly interacting pairs of nodes in one network also tend to interact in another (Fig. 6a; note that this involves two monopartite networks), (ii) whether interacting nodes in one monopartite network have similar interaction profiles in a bipartite network (Fig. 6b) and (iii) whether pairs of nodes with similar interaction profiles in one bipartite network are also similar in another bipartite network (Fig. 6c). An example of the first type of question is whether the genes that encode physically interacting proteins also interact genetically. An example of the second type of question is whether proteins that physically interact tend to share phenotypes. Finally, an example of the third question is whether transcription factors that regulate a shared set of target genes are expressed in the same tissues and/or under the same conditions. To determine whether interacting nodes in one monopartite network also interact in another network, the overlap between both sets of interactions can be determined, using association indices, on the basis of the number of shared edges between both networks, the number of edges in each network and the total number of node pairs (Fig. 6a). To determine the magnitude of similarity, Jaccard and Simpson are most suitable, and the hypergeometric index can be used to determine significance. The same approach can be used to compare interacting node pairs between modules in one network to those in another. Such ‘cross-network module preservation’ has been evaluated elsewhere21. To integrate a monopartite and a bipartite (Fig. 6b), or two bipartite, networks (Fig. 6c), the biological question should inform index selection. We illustrate this with two data sets. The first is a multiparameter, integrated C. elegans basic helix-loophelix (bHLH) network comprising protein-protein interactions and gene-to-tissue expression patterns13. Each index revealed that interacting bHLH proteins are more often coexpressed than noninteracting ones (Fig. 6d). However, Simpson outperformed the other indices (Fig. 6d and Supplementary Fig. 3a). This is because a few bHLH proteins that bind many partners are broadly expressed, but each protein’s partners are expressed in only a subset of tissues. This is best captured by the Simpson index, as it uses the minimum node degree in the denominator. The second data set is the yeast protein-DNA interaction network, nature methods | VOL.10 NO.12 | DECEMBER 2013 | 1175

perspective

npg


integrated with a microarray coexpression network22. All indices revealed that highly coexpressed genes have higher protein-DNA interaction-profile similarity than other gene pairs (Fig. 6e). The PCC best separated the two categories, and the Simpson index was least efficient (Supplementary Fig. 3b). The relatively poor performance of the Simpson index is because it considers the degree of only the least connected promoter. As a consequence, a promoter bound by many transcription factors may be regarded as similar to a promoter bound by few, some of which are shared. However, differences between transcription factors bound to promoters are also highly meaningful, as these may contribute to distinct gene-expression profiles. Conclusions Different association indices can be used to compare interactionprofile similarity within and across networks, and different indices have strengths and weaknesses for different applications (Table 1). CSI is most suitable for predicting gene function and identifying modules. However, CSI levels the similarities between modules, which is a disadvantage in comparing modules. When the main goal is to compare the similarity between node pairs, the biological question should drive index selection. For instance, the Simpson index may be used to avoid penalizing large differences in node degree. If one wants, conversely, to capture this difference, other indices are more appropriate. The hypergeometric index should be used with caution to determine the magnitude of similarity between interaction profiles, as it does not scale linearly with the proportion of overlap. However, only this index is able to calculate the statistical significance of interactionprofile overlap. Note: Any Supplementary Information and Source Data files are available in the online version of the paper. Acknowledgments We thank members of A.J.M.W.’s lab, R. McCord and B. Lajoie for discussions and critical reading of the manuscript. We thank J.C. Bare (Institute of Systems Biology) for helpful advice in the development of GAIN. This work was supported by the US National Institutes of Health grants DK068429 and GM082971 to A.J.M.W. J.I.F.B. is partially supported by a postdoctoral fellowship from the Pew Latin American Fellows Program. J.N. and C.L.M. are partially supported by grant DBI-0953881 from the US National Science Foundation. AUTHOR CONTRIBUTIONS J.I.F.B. and A.J.M.W. conceived the project; J.I.F.B. performed the data analysis with the assistance of A.D., J.N. and C.L.M.; A.D. and J.I.F.B. developed the GAIN web tool in collaboration with J.N., C.L.M. and J.M.S.; J.I.F.B. and A.J.M.W. wrote the paper. COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.


Reprints and permissions information is available online at http://www.nature. com/reprints/index.html. 1. Walhout, A.J.M. et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116–122 (2000). 2. Schwikowski, B., Uetz, P. & Fields, S. A network of protein-protein interactions in yeast. Nat. Biotechnol. 18, 1257–1261 (2000). 3. Costanzo, M. et al. The genetic landscape of a cell. Science 327, 425–431 (2010). 4. Harbison, C.T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99–104 (2004). 5. Walhout, A.J.M. Unraveling transcription regulatory networks by protein-DNA and protein-protein interaction mapping. Genome Res. 16, 1445–1454 (2006). 6. Reece-Hoyes, J.S. et al. Extensive rewiring and complex evolutionary dynamics in a C. elegans multiparameter transcription factor network. Mol. Cell 51, 116–127 (2013). 7. Bordbar, A. & Palsson, B.O. Using the reconstructed genome-scale human metabolic network to study physiology and pathology. J. Intern. Med. 271, 131–141 (2012). 8. Watson, E., MacNeil, L.T., Arda, H.E., Zhu, L.J. & Walhout, A.J.M. Integration of metabolic and gene regulatory networks modulates the C. elegans dietary response. Cell 153, 253–266 (2013). 9. Green, R.A. et al. A high-resolution C. elegans essential gene network based on phenotypic profiling of a complex tissue. Cell 145, 470–482 (2011). 10. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004). 11. Fowlkes, C.C. et al. A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell 133, 364–374 (2008). 12. Martinez, N.J., Ow, M.C., Reece-Hoyes, J., Ambros, V. & Walhout, A.J. Genome-scale spatiotemporal analysis of Caenorhabditis elegans microRNA promoter activity. Genome Res. 18, 2005–2015 (2008). 13. Grove, C.A. et al. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327 (2009). 14. Ritter, A.D. et al. Complex expression dynamics and robustness in C. elegans insulin networks. Genome Res. 23, 954–965 (2013). 15. Lee, I., Blom, U.M., Wang, P.I., Shim, J.E. & Marcotte, E.M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121 (2011). 16. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N. & Barabasi, A.L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002). 17. Spirin, V. & Mirny, L.A. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA 100, 12123–12128 (2003). 18. Hayek, L.-A.C. in Measuring and Monitoring Biological Diversity: Standard Methods for Amphibians. (ed. Heyer, W.R.) Ch. 9, 207–269 (Smithsonian Institution, Washington, DC, 1994). 19. Goldberg, D.S. & Roth, F.P. Assessing experimentally derived interactions in a small world. Proc. Natl. Acad. Sci. USA 100, 4372–4376 (2003). 20. Gunsalus, K.C. et al. Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature 436, 861–865 (2005). 21. Langfelder, P., Luo, R., Oldham, M.C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011). 22. Huttenhower, C., Hibbs, M., Myers, C. & Troyanskaya, O.G. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22, 2890–2897 (2006).

corrigenda

Corrigendum: Using networks to measure similarity between genes: association index selection Juan I Fuxman Bass, Alos Diallo, Justin Nelson, Juan M Soto, Chad L Myers & Albertha J M Walhout Nat. Methods 10, 1169–1176 (2013); published online 26 November 2013; corrected after print 27 January 2014

npg


In the version of this article initially published, the formula describing the connection specificity index (CSI) in Box 2 was incorrect. The denominator in the fraction of the CSI equation originally read “ny”; the correct denominator is “# of X-type nodes in the network.” The error has been corrected in the HTML and PDF versions of the article.

nature methods

using networks to measure similarity between genes ... - Walhout Lab

using networks to measure similarity between genes ... - Walhout Lab

Suggest Documents

using networks to measure similarity between genes ... - Walhout Lab

NpgRJ_Nmeth_1063 659..664 - Walhout Lab

computing semantic similarity measure between words using ... - aircc

Computing Semantic Similarity Measure Between Words ... - AIRCC

A SEMANTIC SIMILARITY MEASURE BETWEEN SENTENCES

Networks of plants: how to measure similarity in vegetable ... - Core

A measure of similarity between graph vertices

Texture Similarity Measure Using Kullback-Leibler Divergence ...

Using Web-Search Results to Measure Word-Group Similarity

Using an Entropy Similarity Measure to Enhance the ... - CiteSeerX

similarity measure, medical image

Using Time Perception to Measure Fitness for Duty - Eagleman Lab

A Distance Measure for Determining Similarity between ... - liacs

A Similarity Measure Between Vector Sequences with ... - CiteSeerX

A likelihood ratio distance measure for the similarity between ... - UEA

A MEASURE OF SIMILARITY BETWEEN GRAPH ... - Semantic Scholar

A Label-Free Similarity Measure between ... - Semantic Scholar

An Efficient Measure of Similarity between Gene Expression Profiles ...

Bridging the Gap: a Semantic Similarity Measure between Queries ...

A Distance Measure for Determining Similarity between Criminal ...

Unsupervised Semantic Similarity Computation Between Terms Using ...

Choosing the Right Similarity Measure

Sequence similarity between allelic Glu-B3 genes ... - Springer Link

Partial similarity of shapes using a statistical significance measure