JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
asi_21353_Rev3_EV.tex

5/5/2010

18: 21

Page 1

Topological Centrality and its e-Science Applications

Hai Zhuge and Junsheng Zhang Knowledge Grid Research Group, Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China. E-mail: [email protected]

Network structure analysis plays an important role in characterizing complex systems. Unlike previous network centrality measures, the topological centrality measure proposed in this article reflects the topological positions of nodes and edges as well as the influence between nodes and edges in a general network. Experiments on different networks show the distinguishing features of topological centrality in comparison with degree centrality, closeness centrality, betweenness centrality, information centrality, and PageRank. The topological centrality measure is then applied to discover communities and to construct the backbone network. Its characteristics and significance are further shown in e-Science applications.

Received December 2, 2009; revised March 6, 2010; accepted March 8, 2010. © 2010 ASIS&T • Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.21353

Introduction

The “rich get richer” phenomenon exists in many complex networks such as the World Wide Web. There are two usual ways for a node to become richer: connecting to more nodes and connecting to more important nodes. We have observed that a node may gain more by connecting to one important node than by connecting to many less important nodes, and that both nodes and edges are important in forming network centrality. Existing centrality measures concern either nodes or edges (Anthonisse, 1971; Bonacich, 1972; Freeman, 1977, 1979; Kleinberg, 1999; Latora & Marchiori, 2004; Nieminen, 1974; Sabidussi, 1966; Wasserman & Faust, 1972). They cannot reflect the topological characteristic of centrality because influences exist between nodes, between edges, and between node and edge. This article aims to explore a new network centrality called Topological Centrality (TC).

The definitions of various centrality measures are based on a graph G = (V, E), where V and E are the node set and the edge set, respectively; |V| = n and |E| = m represent the numbers of nodes and edges, respectively.

The authority and hub reflect the indegree and outdegree characteristics, respectively, of nodes in a network (Kleinberg, 1999). The idea of Hyperlink-Induced Topic Search (HITS) is that a good hub links to many good authorities while a good authority is linked to by many good hubs (Kleinberg, 1999). Nodes with the highest authority/hub are centers. The authority and hub of a node are calculated by the following formula:

$$
\begin{cases}
A(v_i) = \sum_{(v_j, v_i) \in E} H(v_j) \\
H(v_j) = \sum_{(v_j, v_i) \in E} A(v_i)
\end{cases},
$$

where A(v_i) and H(v_j) are the authority of node v_i and the hub of node v_j, respectively.

The degree centrality describes the degree information of each node (Freeman, 1979; Nieminen, 1974) according to the idea that more important nodes are more active and therefore should have more connections. Degree centrality can be used to find the core nodes, but it only considers the hub characteristic and ignores the authority characteristic. The degree centrality for a node v is calculated as follows: C_D(v) = degree(v)/(n − 1). Calculating the degree centrality for all nodes in a dense graph has a time complexity of O(n^2), which becomes O(m) in a sparse graph. Similar to the degree centrality, an approach was proposed to improve the efficiency of information propagation in peer-to-peer networks based on the in- and outdegrees of nodes (Zhuge & Li, 2007).

The betweenness centrality describes how frequently nodes occur in the shortest paths between indirectly connected nodes (Anthonisse, 1971; Freeman, 1977, 1979). It is based on the idea that if more nodes are connected via a node, then the node is more important. It can be used to find the edges between two communities in a complex network. The betweenness centrality for node v can be calculated by the following formula,

$$
C_B(v) = \frac{\sum_{s \neq v \neq t \in V,\ s \neq t} \sigma_{st}(v)/\sigma_{st}}{(n-1)(n-2)},
$$

where σ_st is the number of shortest (geodesic) paths from s to t, and σ_st(v) is the number of shortest paths from s to t that pass through node v.
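The betweenness formula can be illustrated with a small brute-force sketch. It assumes an undirected, unweighted graph with at least three nodes, stored as a dictionary of adjacency lists; the function names `bfs_counts` and `betweenness` are illustrative and do not come from the article. Shortest-path counts are obtained by breadth-first search rather than by the Floyd–Warshall algorithm.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): shortest-path lengths and path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Normalized betweenness over ordered pairs (s, t) with s, t, v distinct."""
    n = len(adj)
    info = {s: bfs_counts(adj, s) for s in adj}
    result = {}
    for v in adj:
        dist_v, sigma_v = info[v]
        total = 0.0
        for s in adj:
            dist_s, sigma_s = info[s]
            for t in adj:
                if len({s, t, v}) < 3:
                    continue
                if v not in dist_s or t not in dist_s or t not in dist_v:
                    continue             # unreachable pair contributes nothing
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    # paths through v = (paths s..v) * (paths v..t)
                    total += sigma_s[v] * sigma_v[t] / sigma_s[t]
        result[v] = total / ((n - 1) * (n - 2))
    return result

# Three-node path 0 - 1 - 2: every shortest path between the ends passes Node 1.
path = {0: [1], 1: [0, 2], 2: [1]}
scores = betweenness(path)
```

On this path, the middle node receives the maximum normalized value and the two end nodes receive zero.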


The shortest paths between each pair of nodes in a graph can be found by the Floyd–Warshall algorithm with time complexity O(n^3) (Warshall, 1962), so the time complexity of the betweenness centrality is also O(n^3). The betweenness centrality has been used to study the community structure of social and biological networks (Girvan & Newman, 2002).

The closeness centrality describes the efficiency of information propagation from one node to the others (Freeman, 1979; Sabidussi, 1966; Wasserman & Faust, 1972). Its idea is that a node is central if it can quickly reach others. The closeness centrality can be regarded as a measure of the time to spread information from a node to other reachable nodes in the network. It is defined as the reciprocal of the mean geodesic distance (i.e., the shortest-path length) between a node v and all of the nodes reachable from v as follows, where n ≥ 2 is the size of the connected component reachable from v and d_G(v, t) is the geodesic distance between v and t:

$$
C_C(v) = \frac{n-1}{\sum_{t \in V \setminus v} d_G(v, t)}.
$$

Calculating the closeness centrality for each node in the graph has time complexity O(n^3).
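As a hedged illustration of the closeness formula, the sketch below computes C_C(v) for one node of an unweighted, undirected graph by breadth-first search. The adjacency-list representation and the name `closeness` are assumptions made for this example; returning 0 for an isolated node is also a convention chosen here, since the definition requires n ≥ 2.

```python
from collections import deque

def closeness(adj, v):
    """(n - 1) / sum of geodesic distances from v, over the component of v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    reachable = len(dist)        # n: size of the component containing v
    if reachable < 2:
        return 0.0               # isolated node (convention of this sketch)
    return (reachable - 1) / sum(dist.values())

# Four-node path 0 - 1 - 2 - 3
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
inner = closeness(path4, 1)      # distances 1, 1, 2 -> 3 / 4
outer = closeness(path4, 0)      # distances 1, 2, 3 -> 3 / 6
```

The inner node of the path is closer on average to the others, so it gets the higher value.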

The eigenvector centrality measures the importance of nodes according to the adjacency matrix of a connected graph (Bonacich, 1972; Perra & Fortunato, 2008). It assigns relative scores to all nodes in the network based on the principle that connections to high-scored nodes contribute more to the score of a node than do connections to low-scored nodes. PageRank is a variant of the eigenvector centrality measure (Page, Brin, Motwani, & Winograd, 1998).

The information centrality describes nodes’ influence on the network efficiency of information propagation (Latora & Marchiori, 2004). The network efficiency is defined by the following formula:

$$
E[G] = \frac{\sum_{v_i \neq v_j \in G} \varepsilon_{ij}}{n(n-1)} = \frac{1}{n(n-1)} \sum_{v_i \neq v_j \in G} \frac{1}{d(v_i, v_j)},
$$

where the efficiency εij in the communication between nodes vi and vj is equal to the inverse of the length of the shortest path d(vi , vj ).

The information centrality of a node v is the relative drop in the network efficiency caused by the removal of the edges incident with v from G, defined by the following formula: C_I(v) = ΔE/E = (E[G] − E[G_v])/E[G], where G_v indicates the network resulting from removing the edges incident with node v from G. The information centrality has been used to study the structures of communities in complex networks (Fortunato, Latora, & Marchiori, 2004).
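The efficiency and information-centrality definitions can be sketched as follows for an unweighted, undirected graph with at least one edge. The names `efficiency` and `information_centrality` are illustrative. One assumption is made explicit: a pair of nodes with no connecting path contributes 0 to the efficiency (i.e., 1/∞ is taken as 0), and the node v itself stays in G_v so that n is unchanged.

```python
from collections import deque

def efficiency(adj):
    """Average of 1/d(vi, vj) over ordered pairs; unreachable pairs add 0."""
    n = len(adj)
    total = 0.0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))

def information_centrality(adj, v):
    """Relative drop in efficiency after removing the edges incident with v."""
    without_v = {
        u: ([] if u == v else [w for w in nbrs if w != v])
        for u, nbrs in adj.items()
    }
    e_full = efficiency(adj)
    return (e_full - efficiency(without_v)) / e_full

# Three-node path 0 - 1 - 2
path = {0: [1], 1: [0, 2], 2: [1]}
c_middle = information_centrality(path, 1)   # all efficiency is lost
c_end = information_centrality(path, 0)      # (5/6 - 1/3) / (5/6)
```

Removing the middle node's edges disconnects every pair, so its information centrality is the largest possible.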

Topological Centrality

In a connected network, the weights of nodes and edges influence each other. When the order of the nodes’ weights stays stable after a certain number of rounds of influence, the network reaches a stable state, and the nodes with the highest weights are the topological centers of the network. Topological centrality (TC) is a network centrality measure that reflects the relative centrality of nodes and edges as well as the influence between nodes, between edges, and between node and edge.

TC is measured as follows: When the network is in a stable state, the TC of a node is the ratio of its weight to the largest node weight (so the topological centers have the largest TC, 1), and the TC of an edge is the ratio of its weight to the largest edge weight. The TC of a node or an edge reflects the geodesic distance (i.e., the length of the shortest path between nodes) from it to its nearest topological center. The higher the TC of a node/edge, the closer it is to the nearest topological center.

An undirected graph has one or more topological centers; the number is determined by the network structure. An undirected network has one of the following structures:
• A network with circular structure has n (n ≥ 3) topological centers, as shown in Figure 1a.
• A network with symmetric structure has two topological centers, as shown in Figure 1b.
• Otherwise, the network has a unique topological center, as shown in Figure 1c.

FIG. 1. Three types of topological structures. The darker the node, the higher its TC. The black nodes are the topological centers.


The following hypotheses are the basis of calculating TC:

H1. The TC of a node is positively correlated with the TC degrees of its neighbor nodes.

The following characteristics explain H1:
Characteristic 1. A node connecting to nodes with higher TC degrees gets a higher TC degree.
Characteristic 2. A node connecting to more nodes gets a higher TC degree.

H2. If the two end nodes of an edge have higher TC degrees, then the edge has a higher TC; and if an edge has a higher TC, then its two ends also have higher TC degrees.

The following characteristics explain H2:
Characteristic 3. A node closer to the topological center has a higher TC degree.
Characteristic 4. An edge closer to the topological center has a higher TC degree.

Calculating TC

H1 and H2 can be represented by the following formula, where v_i are the neighbors of node v; w[e(v, v_i)] is the weight of the edge (or link) between v and v_i; v_{s(e)} and v_{t(e)} are the source and target nodes, respectively, of edge e; f and g are two functions; and ↑ indicates the positive correlative relation:

$$
\begin{cases}
w(v)\uparrow \;=\; w(v) + \sum_{v_i} g\big(w(e(v, v_i))\uparrow,\ w(v_i)\uparrow\big) \\
w(e)\uparrow \;=\; f\big(w(v_{s(e)})\uparrow,\ w(v_{t(e)})\uparrow\big)
\end{cases}
$$

In the process of calculating TC degrees, the weights of nodes and edges increase after each iteration that simulates influence, but the descending order of the node weights converges to a stable state. The weights of the nodes can be normalized by dividing by the largest node weight. If the normalized weights of the nodes converge, the descending order of the nodes’ weights stays stable, and the edges’ weights also converge. The converged node weights and edge weights are the TC degrees of nodes and edges, respectively. Normalization of the node weights satisfies the following characteristics:
• If the normalized weights of nodes converge, the descending order of the node weights also converges. The normalization process does not change the order of the node weights; it only maps them into the interval [0,1].
• If the normalized weights of nodes converge, the weights of edges also converge. According to the definition of the TC of an edge, the weight of an edge is the sum of the weights of its two ends. Since the normalized weights of nodes converge, the weights of incident edges also converge.
• If the normalized weights of nodes converge, the TC degrees of edges converge because the normalization of edge weights just maps them onto [0,1] and keeps their order.

To calculate the TC in a connected network, suppose a connected graph G = (V, E) has node set V = {v_1, v_2, …, v_n} and edge set E = {e_1, e_2, …, e_m}, m ≥ n − 1. The corresponding adjacency matrix has the following elements:

$$
\gamma_{ij} =
\begin{cases}
1, & \{v_i, v_j\} \in E \\
0, & \{v_i, v_j\} \notin E
\end{cases}
$$

The following formula iteratively calculates the TC of nodes and edges, where temp_w_{v_i} and w_{v_i} are the weights of v_i before and after normalization, temp_w_{e(v_i,v_j)} and w_{e(v_i,v_j)} are the weights of edge e(v_i, v_j) before and after normalization, and t ≥ 0 is the iteration number:

$$
\begin{cases}
\mathit{temp\_w}_{v_i}^{t+1} = w_{v_i}^{t} + \sum_{j=1}^{n} \gamma_{ij}\, w_{e(v_i,v_j)}^{t}\, w_{v_j}^{t} \\
\mathit{temp\_w}_{e(v_i,v_j)}^{t+1} = \mathit{temp\_w}_{v_i}^{t+1} + \mathit{temp\_w}_{v_j}^{t+1}
\end{cases}
$$

The following formula normalizes the TC degrees of nodes and edges:

$$
\begin{cases}
w_{v_i}^{t+1} = \mathit{temp\_w}_{v_i}^{t+1} \,/\, \max_{1 \le k \le n} \mathit{temp\_w}_{v_k}^{t+1} \\
w_{e(v_i,v_j)}^{t+1} = \mathit{temp\_w}_{e(v_i,v_j)}^{t+1} \,/\, \max_{1 \le k \le m} \mathit{temp\_w}_{e_k}^{t+1}
\end{cases}
$$

The iterative calculation terminates if the following conditions are satisfied, where e_j is an edge:

$$
\begin{cases}
\sum_{i \in [1,n]} \big(w_{v_i}^{t+1} - w_{v_i}^{t}\big)^2 < \varepsilon_N \\
\sum_{j \in [1,m]} \big(w_{e_j}^{t+1} - w_{e_j}^{t}\big)^2 < \varepsilon_M
\end{cases}
$$

Algorithm 1 iteratively calculates the weights of nodes and edges, where MAX (the maximum number of iterations), εN (the threshold on the sum of squared differences of node weights), and εM (the threshold on the sum of squared differences of edge weights) control the number of iterations. The time complexity of Algorithm 1 is O[MAX(n + m)]. At the initialization stage, all node weights are set to 1. If the weights of edges are not given, all edge weights are also set to 1. In each iteration, the new weight of a node is the sum of its own weight and the weights of its neighbor nodes (each multiplied by the weight of the connecting edge), and the weight of an edge is then the sum of the weights of its two ends. The weights of nodes therefore become larger, so in each iteration the weights of nodes and edges are normalized by dividing by the maximum node weight and the maximum edge weight, respectively.

Algorithm 1: Calculating TC degrees of nodes and edges
Input: the number of nodes n, the number of edges m, edges represented as (edgeNum, startNode, endNode, weight), the maximum number of iterations MAX, the limit εN on the sum of squared differences of node weights, and the limit εM on the sum of squared differences of edge weights.
nodeWeight[1…n] ← 1, count ← 0, nodeSum ← n, edgeSum ← m
while (count < MAX) and (nodeSum > εN or edgeSum > εM) do
    oldNodeWeight[1…n] ← nodeWeight[1…n]
    oldEdgeWeight[1…m] ← edgeWeight[1…m]
    nodeWeight[1…n] ← (nodeWeight[1…n] + Σ_{incidentEdge} edgeWeight ∗ nodeWeight) / max(nodeWeight)
    edgeWeight[1…m] ← Σ_{incidentNode} nodeWeight / max(edgeWeight)
    nodeSum ← Σ_{i=1}^{n} (nodeWeight[i] − oldNodeWeight[i])^2
    edgeSum ← Σ_{i=1}^{m} (edgeWeight[i] − oldEdgeWeight[i])^2
    count ← count + 1
end while
return nodeWeight[1…n] and edgeWeight[1…m].

After Algorithm 1 stops, the nodes with weights of 1 are the topological centers. The weight of each node is its TC, and the larger the weight of a node, the closer the node is to the nearest topological center. Table 1 compares TC with other centrality measures.

TABLE 1. Comparison of different centrality measures.

| Centrality measure | Time complexity | Concern |
|---|---|---|
| Degree centrality | O(n^2) | node |
| Betweenness centrality | O(n^3) | node or edge |
| Closeness centrality | O(n^3) | node |
| Eigenvector centrality | Many approaches | node |
| Information centrality | O(n^3) | node |
| Topological centrality | O[K(n + m)] | node and edge |
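The iteration of Algorithm 1 can be sketched in Python as below. This is a minimal reading of the update, normalization, and termination formulas, not the authors' implementation. It assumes a connected undirected graph with at least one edge, nodes numbered from 0, and unit initial edge weights; the function name `topological_centrality` and its parameter names are illustrative.

```python
def topological_centrality(n, edges, max_iter=100, eps_n=1e-3, eps_m=1e-3):
    """Return (node weights, edge weights, iterations used).

    edges is a list of (u, v) pairs with 0-based node ids.
    """
    node_w = [1.0] * n
    edge_w = [1.0] * len(edges)
    node_sum, edge_sum, count = float(n), float(len(edges)), 0
    while count < max_iter and (node_sum > eps_n or edge_sum > eps_m):
        old_node, old_edge = node_w[:], edge_w[:]
        # Node update: own weight plus edge-weighted neighbor weights.
        temp_node = old_node[:]
        for k, (u, v) in enumerate(edges):
            temp_node[u] += old_edge[k] * old_node[v]
            temp_node[v] += old_edge[k] * old_node[u]
        # Edge update: sum of the unnormalized weights of the two ends.
        temp_edge = [temp_node[u] + temp_node[v] for u, v in edges]
        # Normalize by the largest node weight and the largest edge weight.
        max_node, max_edge = max(temp_node), max(temp_edge)
        node_w = [w / max_node for w in temp_node]
        edge_w = [w / max_edge for w in temp_edge]
        # Sums of squared differences used by the stopping test.
        node_sum = sum((a - b) ** 2 for a, b in zip(node_w, old_node))
        edge_sum = sum((a - b) ** 2 for a, b in zip(edge_w, old_edge))
        count += 1
    return node_w, edge_w, count

# Five-node path 0 - 1 - 2 - 3 - 4: the middle node is the unique center.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
node_w, edge_w, iterations = topological_centrality(5, path_edges)
```

On this path the middle node ends with weight 1, and the weights decrease toward the two ends, which matches the statement that a larger weight means a shorter distance to the topological center.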

Experiments

Experiments are carried out on several types of networks to verify the convergence of the proposed algorithm. Figure 2 shows the experiment results of the iterative TC calculation for nodes and edges in differently structured networks of different scales: (a) the Watts–Strogatz small-world network with node number n = 1,000 and edge number m = 5,000; (b) the ring network with n = 1,000 and m = 1,000; (c) the lattice network with n = 100 and m = 180; (d) the full network with n = 30 and m = 435; and (e) the Erdős–Rényi random graph with n = 1,000, m = 10,045, and p = 0.02. The results show that the TC degrees of nodes and edges converge. The number of iterations is related to n, m, εN, and εM.

Some centrality measures such as degree centrality, betweenness centrality, closeness centrality, and information centrality were compared in Latora and Marchiori (2004) and in Perra and Fortunato (2008). Here, we add PageRank and TC to the comparison based on the graph shown in Figure 3. The table in Figure 3 shows the different centrality degrees of nodes. Their characteristics are:
• The degree centrality is a local centrality, and it only records the degrees of nodes without any global information. Nodes 1, 2, and 3 have degree 5; Nodes 7 and 12 have degree 2; and the other nodes have degree 1. The degree centrality here is normalized by the edge number 15.
• The only difference between the closeness centrality and the information centrality is that the orders of nodes {1, 3} and {7, 12} are different. The information centrality degrees of Nodes 1 and 3 are larger than those of Nodes 7 and 12 because the information centrality concentrates on the network efficiency. The influence on network efficiency of removing Nodes 1 and 3 is larger than that of removing Nodes 7 and 12. The closeness centrality measures the mean absolute distance from a node to other nodes in a connected graph.
• The result of PageRank differs markedly from those of the other measures. Nodes 1 and 3 are the two centers in PageRank, and Node 2 has a lower PageRank than do Nodes 1 and 3 because the authority of Nodes 7 and 12 is divided into two parts while Nodes 1 and 3 each have four neighbors that contribute all of their authority values to Nodes 1 and 3, respectively. Nodes 7 and 12 have higher rank values than do Nodes 9, 10, and 11 because they have more neighbors.
• The betweenness centrality reflects how frequently nodes occur in the shortest paths between indirectly connected node pairs. However, the betweenness centrality has the worst resolution of nodes. Node 2 has the highest betweenness centrality; Nodes 1, 3, 7, and 12 follow; and the others have the same betweenness centrality, 0.
• TC combines the degree information and neighbor information. Node 2 is the topological center of the graph. Nodes 7 and 12 have higher TC degrees than do Nodes 9, 10, and 11 because they have extra neighbors. Nodes 1 and 3 follow Nodes 9, 10, and 11, and then come the rest of the nodes. For nodes sharing the same topological center, the node with the higher TC is closer to the topological center.

If the topological center also has the highest closeness centrality (e.g., Node 2 in Figure 3), the order of the closeness centrality of nodes may be similar to that of the TCs of nodes. But the topological centers are not always the same as the nodes having the highest closeness centrality. Figure 4 shows the differences between TC and the closeness centrality as follows:
• Adopting the TC, only Node 4 is the topological center, and Nodes 1, 2, and 3 have larger TC degrees than does Node 5. Node 5 has two neighbors, and Node 6 has three neighbors, but Node 5 has a higher TC than does Node 6 because Nodes 4 and 6 contribute more to Node 5 than Nodes 5, 7, and 8 do to Node 6.
• Adopting the closeness centrality, Nodes 4 and 5 have the highest closeness centrality, and the closeness centrality degrees of Nodes 1, 2, and 3 are less than that of Node 5.

Figure 5 shows the differences between TC and the closeness centrality in an Erdős–Rényi random graph. Table 2 shows the TC and the closeness centrality of the nodes in Figure 6, a preferential attachment random graph p(i) ∼ (c k[i]^α + a)/(d l[i]^β + a), where α = 1 and β = 0. Unlike the closeness centrality, the degree centrality, and the information centrality, which are calculated on static networks without an iteration process, TC and PageRank involve a dynamic iteration process on static networks. Each iteration reflects the TC or authority of nodes in the network.

FIG. 2. Topological centrality convergence experiments (MAX = 100, εN = εM = 0.001): the left column lists networks of several structures, the middle column lists the node convergence records (the x-axis is the number of iterations, and the y-axis is the normalized weights of nodes), and the right column lists the edge convergence records (the x-axis is the number of iterations, and the y-axis is the normalized weights of edges). (a) The Watts–Strogatz small-world network with n = 1,000 and m = 5,000, converging in 14 iterations; (b) the ring network with n = 1,000 and m = 1,000, in 2 iterations; (c) the lattice network with n = 100 and m = 180, in 17 iterations; (d) the full network with n = 30 and m = 435, in 2 iterations; and (e) the Erdős–Rényi random graph with n = 1,000, m = 10,045, and p = 0.02, in 17 iterations.

FIG. 3. An examination case for comparing the information centrality (C_I), the degree centrality (C_D), the closeness centrality (C_C), the PageRank (PR), and the log() of topological centrality (C_T).

FIG. 4. An example that differentiates topological centrality (C_T) from the closeness centrality (C_C).

FIG. 5. Difference between topological centrality and the closeness centrality (C_C) in the Erdős–Rényi random graph with 20 nodes and 60 edges. C_T is the log() of topological centrality.

To study the structure of heterogeneous academic research networks, metadata of some papers in the Digital Bibliography & Library Project (DBLP) (http://dblp.uni-trier.de/db/index.html) are used as experimental data. The number of papers is 664,188, and the number of citation relations is 79,128. Node types are papers, researchers, and conferences. The semantic links are authorOf between researcher and paper, coauthor between researchers, publishedIn between paper and conference/journal, and cite between papers. The research network contains 1,084,198 nodes and 2,153,385 semantic links. The limits on the iteration are MAX = 40 and εM = εN = 200. The distribution of TC degrees of nodes is shown in Figure 7. It shows that nodes with a lower TC degree contain more resources than do those with a higher TC degree.

Application: Discovering Research Communities

Research communities are formed through interactions between researchers, equipment, papers, and projects. They are different from graph-based communities in the following aspects:
• Research communities are dynamically formed through research activities such as applying for funding and positions, cooperating with colleagues, publishing, and citing papers; however, graph-based communities are viewed only from graph connections.
• Research communities contain multiple types of nodes (researchers and papers can play different roles in research activities, as discussed in Zhuge, 2006) and relations (e.g., coauthor relation and citation relation); however, there are no differences between nodes or between edges in graph-based communities.

Among existing centrality measures, only PageRank considers the influences between neighbor nodes, and the authority of a node is divided among its neighbors; however, PageRank does not reflect the different influences of different types of edges in real applications. TC can distinguish the roles of different nodes in a research network:
• Nodes elect the core nodes by a voting-like mechanism: A node connected to more nodes is more likely to be a local core node. After a certain number of iterations, the local core nodes and the global topological centers are elected. The topological centers are the nodes connected to the most core nodes with higher TC degrees.
• Edges may play different roles in the influence between the TC degrees of nodes. This confirms the phenomena of research communities: A researcher cooperating with authority researchers will be closer to the centers of a research community, and a paper citing or cited by authority papers is possibly closer to the core papers on a research topic.

Nodes can play different roles according to their topological positions in communities: A core node is usually the hub or authority in a community; a margin node belongs to one community and has few connections to other nodes in the community; a bridge node has an equal number of connections with two communities; and the rest are mediate nodes. TC can be used to distinguish the roles of nodes. For example, Figure 8 contains three communities: C1 = {1,4,5,6,7,8}, C2 = {2,7,9,10,11,12}, and C3 = {3,12,13,14,15,16}. Nodes 1, 2, and 3 are the core nodes of C1, C2, and C3, respectively. Nodes 7 and 12 are bridge nodes. Nodes 4, 5, 6, and 8 are margin nodes of C1. Nodes 9, 10, and 11 are margin nodes of C2. Nodes 13, 14, 15, and 16 are margin nodes of C3.

TABLE 2. Comparison of topological centrality and the closeness centrality in a preferential attachment random graph.

| v | Cc(v) | C_T(v) | v | Cc(v) | C_T(v) | v | Cc(v) | C_T(v) | v | Cc(v) | C_T(v) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.406 | 0.0 | 72 | 0.300 | −4.678 | 71 | 0.226 | −4.863 | 7 | 0.236 | −10.62 |
| 1 | 0.427 | −1.523 | 62 | 0.300 | −4.678 | 67 | 0.226 | −4.863 | 69 | 0.235 | −10.62 |
| 4 | 0.307 | −1.561 | 54 | 0.300 | −4.678 | 93 | 0.273 | −10.15 | 56 | 0.236 | −10.62 |
| 9 | 0.295 | −1.563 | 53 | 0.300 | −4.678 | 88 | 0.273 | −10.15 | 38 | 0.232 | −10.62 |
| 34 | 0.293 | −1.565 | 51 | 0.309 | −4.678 | 83 | 0.273 | −10.15 | 79 | 0.196 | −11.47 |
| 64 | 0.291 | −1.566 | 41 | 0.300 | −4.678 | 8 | 0.277 | −10.15 | 70 | 0.196 | −11.47 |
| 61 | 0.291 | −1.566 | 36 | 0.300 | −4.678 | 75 | 0.273 | −10.15 | 20 | 0.196 | −11.47 |
| 35 | 0.291 | −1.566 | 3 | 0.305 | −4.678 | 74 | 0.273 | −10.15 | 12 | 0.196 | −11.47 |
| 33 | 0.291 | −1.566 | 19 | 0.300 | −4.678 | 65 | 0.273 | −10.15 | 18 | 0.223 | −15.18 |
| 31 | 0.291 | −1.566 | 17 | 0.302 | −4.678 | 6 | 0.277 | −10.15 | 78 | 0.222 | −15.19 |
| 25 | 0.291 | −1.566 | 15 | 0.300 | −4.678 | 59 | 0.273 | −10.15 | 39 | 0.221 | −15.20 |
| 98 | 0.289 | −1.568 | 14 | 0.300 | −4.678 | 55 | 0.273 | −10.15 | 87 | 0.217 | −15.23 |
| 94 | 0.289 | −1.568 | 68 | 0.236 | −4.853 | 45 | 0.273 | −10.15 | 84 | 0.217 | −15.23 |
| 63 | 0.289 | −1.568 | 58 | 0.236 | −4.853 | 44 | 0.273 | −10.15 | 46 | 0.218 | −15.23 |
| 60 | 0.289 | −1.568 | 40 | 0.236 | −4.853 | 43 | 0.273 | −10.15 | 91 | 0.216 | −15.24 |
| 57 | 0.289 | −1.568 | 11 | 0.243 | −4.853 | 32 | 0.275 | −10.15 | 85 | 0.216 | −15.24 |
| 52 | 0.289 | −1.568 | 92 | 0.228 | −4.856 | 30 | 0.273 | −10.15 | 66 | 0.216 | −15.24 |
| 42 | 0.289 | −1.568 | 49 | 0.228 | −4.856 | 24 | 0.273 | −10.15 | 86 | 0.193 | −19.43 |
| 29 | 0.289 | −1.568 | 47 | 0.228 | −4.856 | 22 | 0.275 | −10.15 | 76 | 0.191 | −19.47 |
| 28 | 0.289 | −1.568 | 81 | 0.227 | −4.860 | 21 | 0.273 | −10.15 | 89 | 0.183 | −19.58 |
| 27 | 0.289 | −1.568 | 50 | 0.227 | −4.860 | 13 | 0.283 | −10.15 | 37 | 0.183 | −19.58 |
| 26 | 0.289 | −1.568 | 95 | 0.226 | −4.863 | 10 | 0.275 | −10.15 | 82 | 0.182 | −19.72 |
| 23 | 0.289 | −1.568 | 90 | 0.226 | −4.863 | 100 | 0.237 | −10.60 | 96 | 0.179 | −19.78 |
| 16 | 0.289 | −1.568 | 80 | 0.226 | −4.863 | 97 | 0.237 | −10.62 | 99 | 0.164 | −19.84 |
| 5 | 0.375 | −4.668 | 73 | 0.226 | −4.863 | 77 | 0.238 | −10.62 | 48 | 0.164 | −19.84 |

Cc = the closeness centrality; C_T = the log() of topological centrality.

FIG. 6. An examination case that differentiates topological centrality from the closeness centrality in a preferential attachment graph p(i) ∼ (c k[i]^α + a)/(d l[i]^β + a), where α = 1 and β = 0.

FIG. 7. Topological centrality distribution.

FIG. 8. Distinguishing roles of nodes with topological centrality degrees.

Nodes can be classified by TC degrees as follows:
• If the TC degree of a node is larger than that of most of its neighbors, the node is a core node.
• If the TC degree of a node is not larger than the TC degrees of all of its neighbors, the node is a margin node.
• If the number of neighbors with lower TC degrees equals the number of neighbors with higher TC degrees, the node is a bridge node.
• Otherwise, the node is a mediate node.
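The four rules above can be sketched as a small function that checks them in the order listed. The graph, the TC values, the function name `classify_roles`, and the default threshold of 0.6 (standing in for "most of its neighbors") are all hypothetical choices made for this example, not values from the article.

```python
def classify_roles(adj, tc, core_threshold=0.6):
    """Assign 'core', 'margin', 'bridge', or 'mediate' to every node.

    adj maps a node to its neighbor list; tc maps a node to its TC degree.
    """
    roles = {}
    for v, nbrs in adj.items():
        lower = sum(1 for u in nbrs if tc[u] < tc[v])    # neighbors below v
        higher = sum(1 for u in nbrs if tc[u] > tc[v])   # neighbors above v
        if nbrs and lower / len(nbrs) > core_threshold:
            roles[v] = "core"
        elif lower == 0:
            roles[v] = "margin"
        elif lower == higher:
            roles[v] = "bridge"
        else:
            roles[v] = "mediate"
    return roles

# Hypothetical example: two small stars (centers a and b) joined through x.
adj = {
    "a": ["l1", "l2", "x"], "b": ["r1", "r2", "x"], "x": ["a", "b"],
    "l1": ["a"], "l2": ["a"], "r1": ["b"], "r2": ["b"],
}
tc = {"a": 1.0, "x": 0.8, "b": 0.7, "l1": 0.5, "l2": 0.5, "r1": 0.3, "r2": 0.3}
roles = classify_roles(adj, tc)
```

With these made-up values, the two star centers come out as core nodes, the connecting node as a bridge node, and the leaves as margin nodes.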


Let λ = L(v)/N(v) and µ = H(v)/N(v), where v is a node, L(v) is the number of neighbor nodes of v with TC degrees lower than that of v, H(v) is the number of neighbor nodes of v with TC degrees higher than that of v, and N(v) is the number of neighbors of v. The role of v is then distinguished by the following formula, where threshold(core) ∈ (0.5, 1] controls the number of core nodes:

$$
\mathrm{role}(v) =
\begin{cases}
\text{core node}, & \lambda > \mathrm{threshold(core)} \\
\text{margin node}, & \lambda = 0 \\
\text{bridge node}, & \lambda = \mu \\
\text{mediate node}, & \text{otherwise}
\end{cases}
$$

A core node is decided by whether it has a larger TC degree than do its neighbors; however, topological centers of the connected network may be exceptions. In Figure 8, Node 2 is both the topological center and a core node, but the ellipse node in Figure 9 is the topological center, and it is not a core node but a bridge node, although it has a higher TC degree than do all of its neighbors; thus, it is significant to distinguish the roles of topological centers. If the neighbors of a topological center are all core nodes, then the topological center is a bridge node or a core node.

FIG. 9. The ellipse node is a topological center; it is not a core node but a bridge node.

Researchers and papers may play such roles as source, authority, bee, hub, and novice (Zhuge, 2006). The source, authority, and hub nodes may be core nodes; the bee nodes are often bridge nodes; and the novice nodes may be margin or bridge nodes.

In the research network, a group leader usually has more publications and cooperators. Correspondingly, group leaders have more collaboration with other researchers in the coauthor network. If each research group is regarded as a community, the research group’s leaders are core nodes. Fresh students (beginners in research) have few publications and cooperators, so they are margin nodes in the coauthor network. Visiting researchers and newly employed researchers are bridge nodes because they have cooperators in different research communities. After the core nodes, margin nodes, and bridge nodes are distinguished, the remaining nodes are mediate nodes. Usually, a mediate node only belongs to one community.

In the citation network, core nodes are the authority or hub papers having more citations than others; the margin nodes are the novice papers or newly published papers; and the bridge nodes connect two or more paper clusters. Each paper cluster may belong to a specific research topic or discipline.

Funding decisions and research promotion are based on the evaluation of the impact of researchers and their publications. TC can help distinguish the roles of researchers and papers, and the roles can be used to evaluate researchers and papers. TC degrees in the coauthor network help evaluate researchers while TC degrees in the citation network help evaluate papers.

In the research network, the roles of nodes change year by year. In the coauthor network, a novice researcher may become an authority, a hub, or even a bridge. With more significant papers published, the TC degree of a node in the coauthor network becomes higher than those of its neighbors, and then the researcher becomes an authority or a hub. By cooperating with researchers in different research groups or even different communities, a researcher becomes a bridge.

If Figure 3 represents the coauthor network or the citation network, general community discovery algorithms such as the GN algorithm (Clauset, Newman, & Moore, 2004; Newman, 2004) cannot discover their communities because the betweenness of each edge is regarded as the same, and it is hard to choose the proper edge for deletion. However, nodes in coauthor networks and citation networks play different roles, and communities can be discovered according to the roles of nodes. One way is to find the core nodes, and then assign noncore nodes to the proper core nodes to form communities. Algorithm 2 discovers communities by finding the core nodes for each noncore node.

FIG. 10. An example of finding community. The circle nodes are core nodes, and the square nodes are noncore nodes.

Algorithm 2: Finding k communities by core nodes
Input: a network C;
Calculate the TC degrees of nodes and edges;
Distinguish roles of nodes and add the core nodes to CoreSet;
for node v ∈ CoreSet do
    nodes(v) ← {v}
end for
for each noncore node v do
    Choose the nearest core nodes as the candidate nodes denoted as CandidateSet;
    for node v′ ∈ CandidateSet do
        nodes(v′) ← nodes(v′) ∪ {v};
    end for
end for
while |CoreSet| > k do
    Merge two most tightly connected communities;
end while
return k communities.

The time complexity of Algorithm 2 is O[n(n + m)].
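The assignment step of Algorithm 2 can be sketched as follows, assuming the core nodes have already been identified. "Nearest" is read here as smallest hop count found by breadth-first search, which is an assumption of this sketch; the final merging loop is left out. The graph and the name `assign_to_cores` are hypothetical.

```python
from collections import deque

def assign_to_cores(adj, core_set):
    """Map each core node to its community: itself plus the noncore nodes
    for which it is one of the nearest core nodes."""
    communities = {c: {c} for c in core_set}
    for v in adj:
        if v in core_set:
            continue
        # Breadth-first search from v, stopping after the first distance
        # level that contains at least one core node.
        dist = {v: 0}
        queue = deque([v])
        candidates, best = [], None
        while queue:
            u = queue.popleft()
            if best is not None and dist[u] > best:
                break
            if u in core_set:
                candidates.append(u)
                best = dist[u]
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for c in candidates:
            communities[c].add(v)
    return communities

# Hypothetical example: cores a and b, with x adjacent to both.
adj = {
    "a": ["l1", "l2", "x"], "b": ["r1", "r2", "x"], "x": ["a", "b"],
    "l1": ["a"], "l2": ["a"], "r1": ["b"], "r2": ["b"],
}
communities = assign_to_cores(adj, {"a", "b"})
```

Node x is equally near to both cores, so it is placed in both communities, in the same way that bridge nodes can belong to several communities at once.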
The number of core nodes can be controlled by setting the threshold L(v)/N(v). If a noncore node has more than one candidate core node, it is classified into each of the corresponding communities; bridge nodes are therefore often classified into several communities at the same time. This approach discovers communities globally in the network. If the number of communities is large, closely connected communities can be merged into larger communities. Closely connected communities either share many nodes and edges or have many external connections

TABLE 3. Finding a local community of the core node B.

Step  Node  nodeQueue   nodeSet                     Expanded
0     B     B           B                           C, D, E
1     C     D, E        B, C                        -
2     D     E           B, C, D                     F, G, H
3     E     F, G, H     B, C, D, E                  I, J
4     F     G, H, I, J  B, C, D, E, F               -
5     G     H, I, J     B, C, D, E, F, G            -
6     H     I, J        B, C, D, E, F, G, H         -
7     I     J           B, C, D, E, F, G, H, I      -
8     J     -           B, C, D, E, F, G, H, I, J   -

TABLE 4. Finding the local community of the noncore node F.

Step  Node  nodeQueue  nodeSet      Expanded
0     D     D          D, F         G, H
1     G     H          D, F, G      -
2     H     -          D, F, G, H   -

FIG. 11. The subgraph containing Nodes D, I, and J as well as the possible core nodes D, E, B, and A.

between them. Algorithm 3 merges existing communities into k communities.

Algorithm 3: Merging communities
Input: the number of communities k.
Step 1. If the number of communities is not more than k, then go to Step 4.
Step 2. Calculate the Jaccard similarity of the node sets of each community pair. Suppose A and B are two communities; the Jaccard similarity of A and B is calculated by Jaccard(A, B) = |A ∩ B|/|A ∪ B|. If the Jaccard similarities of all community pairs equal 0, then go to Step 3; if not, find the community pairs that have the largest Jaccard similarity and merge each pair into a larger community. Go to Step 1.
Step 3. Count the external links between community pairs. An external link has its two ends in different communities. If the numbers of external links of all community pairs equal 0, then go to Step 4; if not, find the community pairs with the maximum number of external links between them and merge each pair into a larger community. Go to Step 1.
Step 4. Stop merging communities.
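The merging steps can be sketched in Python as follows. This is a simplified reading, not a definitive implementation: it merges one best pair per pass instead of all tied pairs at once, and the names `jaccard`, `external_links`, and `merge_communities` as well as the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard(A, B) = |A ∩ B| / |A ∪ B| for two node sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def external_links(a, b, edges):
    """Count edges with one end in community a and the other in community b."""
    return sum(
        1 for u, v in edges
        if (u in a and v in b) or (u in b and v in a)
    )


def merge_communities(communities, edges, k):
    """Merge communities until k remain or no pair is worth merging."""
    comms = [set(c) for c in communities]
    while len(comms) > max(k, 1):                      # Step 1
        pairs = list(combinations(range(len(comms)), 2))
        # Step 2: prefer the pair with the largest Jaccard similarity.
        i, j = max(pairs, key=lambda p: jaccard(comms[p[0]], comms[p[1]]))
        if jaccard(comms[i], comms[j]) == 0:
            # Step 3: fall back to the pair with the most external links.
            i, j = max(
                pairs,
                key=lambda p: external_links(comms[p[0]], comms[p[1]], edges),
            )
            if external_links(comms[i], comms[j], edges) == 0:
                break                                  # Step 4
        merged = comms[i] | comms[j]
        comms = [c for idx, c in enumerate(comms) if idx not in (i, j)]
        comms.append(merged)
    return comms


# Hypothetical input: two overlapping communities, one linked community,
# and one isolated community.
toy_comms = [{1, 2, 3}, {3, 4}, {5, 6}, {7}]
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
merged_to_two = merge_communities(toy_comms, toy_edges, 2)
merged_to_one = merge_communities(toy_comms, toy_edges, 1)
```

In the toy input, the overlapping pair is merged first (Step 2), the pair joined by an external link second (Step 3), and the isolated community {7} is never merged, so asking for a single community still returns two.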

Another way is to find the core nodes first and then expand from a node to form local communities. According to the roles of nodes, community expansion needs to consider the following cases.

Case 1: Forming the local community according to a core node. Algorithm 4 discovers local communities from a core node. A community may have more than one core node. If two communities share many common nodes and edges, they can be merged into a larger community. This approach can find research groups in the coauthor network and specific topic-related paper clusters in the citation network.

Case 2: Forming the local community according to a noncore node. It is necessary to find the core nodes connected to the node. To do this, all core nodes in the network should first be found; the local communities are then expanded from the nearest core nodes connected to the noncore node, respectively.

Case 3: Finding the local community of a set of nodes. Given a set of nodes, the local community can be found by the following three steps:
1. For each node, find the core nodes connected to it until the topological center is found; all of these core nodes are added to the CoreSet.
2. Build the subgraph containing these nodes and the nodes in the CoreSet.
3. Expand the local community from the nodes in the CoreSet.

Figure 10 shows a segment of the network with TC degrees of nodes. We can find a local community from a core node, a noncore node, and a set of nodes as follows:
• Find a local community of the core node B. The process is shown in Table 3.
• Find a local community of the noncore node F. This amounts to finding the nearest core node D and then finding the local community from D. The expansion process is shown in Table 4.
• Find a local community of a node set {D, I, J}.


FIG. 12. Coauthor networks of the International Semantic Web Conference dataset: (a) the global view, containing 147 connected components, 935 researchers, and 2,286 coauthor relations; (b) the largest connected component, with 370 researchers and 1,227 coauthor relations; and (c) the log() of the topological centrality of the nodes in 12b.


Algorithm 4: Expanding community from a core node
Input: A core node c and a connected network G;
nodeQueue ← {c}, nodeSet ← {c}, edgeSet ← {};
while nodeQueue ≠ {} do
    Fetch a node v from nodeQueue;
    for each neighbor node v′ of v do
        Distinguish the role of v′;
        if (v′ not in nodeSet) and (v′ is not a core node) and [nodeWeight(v′) < nodeWeight(v)] then
            nodeQueue ← nodeQueue ∪ {v′};
            nodeSet ← nodeSet ∪ {v′};
            edgeSet ← edgeSet ∪ {e(v, v′)};
        end if
    end for
end while
return edgeSet.
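A direct Python transcription of Algorithm 4 may help make the expansion rule concrete. The sketch below assumes that node weights (TC degrees) and the set of core nodes are supplied as inputs; the function name `expand_from_core` and the toy data are hypothetical. Unlike the pseudocode, it returns the node set together with the edge set for convenience.

```python
from collections import deque


def expand_from_core(core, adj, node_weight, core_set):
    """Expand a local community from a core node (after Algorithm 4).

    core:        the starting core node c.
    adj:         dict mapping node -> iterable of neighbors.
    node_weight: dict mapping node -> weight (TC degree, assumed precomputed).
    core_set:    set of all core nodes (roles assumed already distinguished).
    """
    node_queue = deque([core])
    node_set = {core}
    edge_set = set()
    while node_queue:
        v = node_queue.popleft()
        for u in adj[v]:
            # Only absorb unvisited noncore neighbors with a lower weight,
            # so the community grows "downhill" from the core node.
            if (
                u not in node_set
                and u not in core_set
                and node_weight[u] < node_weight[v]
            ):
                node_queue.append(u)
                node_set.add(u)
                edge_set.add((v, u))
    return node_set, edge_set


# Hypothetical toy network: c and k are core nodes; e outweighs b.
toy_adj = {
    "c": ["a", "b", "k"],
    "a": ["c", "d"],
    "b": ["c", "e"],
    "d": ["a"],
    "e": ["b"],
    "k": ["c"],
}
toy_weight = {"c": 1.0, "k": 0.9, "a": 0.6, "b": 0.5, "d": 0.3, "e": 0.7}
toy_nodes, toy_edges = expand_from_core("c", toy_adj, toy_weight, {"c", "k"})
```

In the toy network the expansion stops at k because it is another core node and at e because its weight exceeds that of b, which shows how both conditions bound the community.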

D is a core node, while I and J are two noncore nodes. If D were the core node of the community containing I and J, then {D, I, J} would form a local community. However, D is not the core node of the community containing I and J. The possible core nodes of the community containing D are {D, B, A}, whereas I and J share the same possible core nodes, {E, B, A}. We can then construct the subgraph containing Nodes D, I, and J as well as their possible core nodes D, E, B, and A (Figure 11). From the subgraph, we know that B is the nearest core node of the community containing D, I, and J. We can therefore expand from B to find the local community containing Nodes D, I, and J, as described in Case 1.

This approach can find the team of a researcher in the coauthor network and the papers related to the topic of a given paper in the citation network. Given a set of papers, the coauthor relations form the coauthor network, and the citation relations form the citation network. After TC degrees are calculated, the research groups are discovered, and the papers are clustered by citation relations. Researchers in the same community may share similar research interests, while papers in the same cluster are topic-related. Topic-related papers can be recommended to researchers with similar research interests. The global communities show the research groups and research topics in the paper set, while the local community helps recommend papers to appropriate readers.

The TC-based approach distinguishes the roles of nodes and then discovers both global and local communities by those roles. It therefore concerns roles rather than only connections. Although TC degrees of nodes and edges are calculated considering connections between nodes, the TC degrees of neighbor nodes influence each other.
The role-based community discovery approach is suitable for research networks, and it can discover communities in treelike networks, which are hard for general community discovery approaches to handle.

Application: Discovering Backbone in a Research Network

Given a set of papers, research networks such as coauthor networks and citation networks can be constructed according to the metadata of papers in online digital libraries. The coauthors of a paper form a motif of the research network (Milo et al., 2002). Relevant research concerns the structure of science collaboration networks (Barabási et al., 2002; Batagelj & Mrvar, 2000; Newman, 2001a, b). Our first dataset collects papers of the International Semantic Web Conference (ISWC) from 2002 to 2007, as shown in Figure 12a. The number of citation relations is 236, as shown in Figure 13.

Figure 12c shows the node TC degrees of the largest connected component of the coauthor networks shown in Figure 12b. The central nodes have higher TC degrees, and the topological centers have the highest TC degree, 1. From a topological center to the margins, the TC degrees reduce to 0 step by step. If the number of nodes is very large, the TC degrees are very small; the function log() maps the TC degrees from the interval [0, 1] into [−21, 0] while keeping the order of the node TC degrees stable.

In a network, after the roles of nodes are distinguished by the node TC degrees, the core nodes and the edges among them form a subgraph called the backbone network. It can play the following roles in scientific research:
• It helps display a research network at different levels. When a core node is focused, the other nodes of its local community can be displayed.
• It shows important researchers in the coauthor network. When a research community or research group is mentioned, its leaders will take priority to emerge. Figure 14 shows the backbone network of the largest connected component of the coauthor networks of the experimental dataset. The threshold of the core nodes is 0.5, and the threshold of the margin nodes is 0. The backbone network contains all of the core nodes and the coauthor relations among them. Most of the core nodes are connected, and this verifies the “rich club” phenomenon (Colizza, Flammini, Serrano, & Vespignani, 2006): The richer nodes tend to connect to other richer nodes. Some core nodes form connected components alone because the bridge nodes between them are noncore nodes.
• The backbone network of the coauthor network can be used to propagate information. The coauthor network is a kind of social network. The core nodes are important during information propagation because they have more impact on their communities. Suppose that invitations to PC members need to be sent; researchers in the backbone network should take priority.
• The paper publication venue network contains conferences and journals. Other research resources such as researchers, papers, and publishers connect conferences and journals into a connected network. To find citations among conferences and journals, the subnetwork containing conferences, journals, and papers can be built. If a super node represents the conference or journal containing papers, then the citation relations within the super nodes and between different super nodes can be counted. The number of external citations reflects the relevance of conferences and journals.

FIG. 13. The citation network of the International Semantic Web Conference dataset contains 36 connected components, and the largest connected component contains 142 papers and 165 citation relations.
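The super-node counting described for the publication venue network can be sketched as follows. The sketch assumes that citations are available as (citing paper, cited paper) pairs and that each paper maps to exactly one venue; the function name `venue_citation_counts`, the paper identifiers, and the venue labels V1 and V2 are hypothetical.

```python
from collections import Counter


def venue_citation_counts(citations, venue_of):
    """Count citations inside each venue and between different venues.

    citations: iterable of (citing_paper, cited_paper) pairs.
    venue_of:  dict mapping paper -> venue (the super node containing it).
    Returns (internal, external) where internal[venue] is the number of
    citations within a venue and external[(src, dst)] is the number of
    citations from papers in src to papers in dst.
    """
    internal = Counter()
    external = Counter()
    for citing, cited in citations:
        src, dst = venue_of[citing], venue_of[cited]
        if src == dst:
            internal[src] += 1
        else:
            external[(src, dst)] += 1
    return internal, external


# Hypothetical papers p1..p4 published in two hypothetical venues.
toy_venue_of = {"p1": "V1", "p2": "V1", "p3": "V2", "p4": "V2"}
toy_citations = [("p1", "p2"), ("p1", "p3"), ("p4", "p3"), ("p3", "p2")]
toy_internal, toy_external = venue_citation_counts(toy_citations, toy_venue_of)
```

Keeping the external counts directed preserves which venue cites which; summing the two directions gives an undirected measure of relevance between a pair of venues.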

The backbone networks can be used to study the development of scientific research. Sorted by year, they reflect the evolution of research networks. Figure 15 shows the evolution of the coauthor network of the ISWC from 2002 to 2008. The coauthor network of year n (2002 ≤ n ≤ 2008) contains the coauthor relations from 2002 to year n. This shows that more and more researchers have taken part in the conference; therefore, the nodes and edges in the backbone networks also change. The following characteristics can be observed:
• New researchers in the coauthor network often cooperate with researchers who have published papers in the ISWC Conference Proceedings because the scales of the connected components in the coauthor networks become larger year by year.
• Researchers tend to cooperate with other researchers. The evolution graph shows that the isolated nodes enter the connected components step by step.
• The core researchers tend to cooperate with each other. The number of researchers in the largest connected component of the backbone networks becomes larger and larger.
• The core researchers are active locally, and they have more cooperators than do their neighbors.

The roles of researchers in the coauthor network also keep changing: A new researcher may become a core researcher, while a core researcher may become a mediate node or a margin node. The topological centers of the largest connected component keep changing. The topological centers emerge through a voting-like mechanism. Table 5 shows the topological centers.
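The cumulative construction of the yearly coauthor networks can be sketched in Python. The sketch assumes coauthor relations are grouped by publication year and tracks only the size of the largest connected component; the function name `largest_component_sizes` and the toy data are hypothetical and are not taken from the ISWC dataset.

```python
def largest_component_sizes(edges_by_year):
    """Size of the largest connected component of each cumulative network.

    edges_by_year: dict mapping year -> list of (author, author) pairs.
    The network of year n contains all relations up to and including year n.
    """
    adj = {}
    sizes = {}
    for year in sorted(edges_by_year):
        # Accumulate this year's coauthor relations into the running network.
        for u, v in edges_by_year[year]:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        # Depth-first search over the cumulative network.
        seen = set()
        best = 0
        for start in adj:
            if start in seen:
                continue
            seen.add(start)
            stack = [start]
            size = 0
            while stack:
                x = stack.pop()
                size += 1
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            best = max(best, size)
        sizes[year] = best
    return sizes


# Hypothetical data: two separate pairs that are later joined.
toy_years = {
    2002: [("a", "b")],
    2003: [("c", "d")],
    2004: [("b", "c"), ("e", "f")],
}
toy_sizes = largest_component_sizes(toy_years)
```

In the toy data the two pairs of 2002 and 2003 are joined by a single new relation in 2004, the kind of step-by-step absorption into connected components that the evolution graph shows.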

The backbone network of heterogeneous research networks connects the important resources such as researchers, papers, conferences, journals, institutions, and publishers on a research topic. This helps find and recommend information.


FIG. 14. The backbone network of the largest connected component of the coauthor network of the International Semantic Web Conference dataset from 2002 to 2007.
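The construction of a backbone network, that is, the core nodes together with the edges among them, can be sketched as an induced subgraph. The sketch assumes that the core nodes have already been selected by the role thresholds; the function name `backbone` and the toy data are hypothetical.

```python
def backbone(edges, core_set):
    """Return the subgraph induced by the core nodes.

    edges:    iterable of (u, v) pairs of the full network.
    core_set: nodes classified as core (selection by threshold is assumed
              to have been done beforehand).
    Returns (nodes, kept_edges); core nodes without core neighbors remain
    in the node set as isolated components.
    """
    nodes = set(core_set)
    kept_edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return nodes, kept_edges


# Hypothetical network in which c is noncore and z has no core neighbor.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("z", "c")]
toy_nodes, toy_kept = backbone(toy_edges, {"a", "b", "d", "z"})
```

Node z stays in the backbone as a component of its own because its only neighbor, c, is a noncore node; this is the situation in which core nodes form connected components alone.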

The PageRank algorithm also can find the local core nodes, but it does not provide a way to connect most of the core nodes into a backbone network because it is hard to choose the connecting nodes between the core nodes by the PageRank values. TC can choose the appropriate core nodes and form a backbone network that is likely connected because the core nodes include the community central nodes and the nodes connecting different communities.

Discussion

In semantic link networks, nodes influence their neighbors through different relations. Therefore, edges should be weighted when participating in iterative calculation, as follows:

\[
\begin{cases}
\mathit{temp\_w}^{t+1}_{v_i} = w^{t}_{v_i} + \sum_{j=1}^{n} \gamma_{ij}\, w_r\, w^{t}_{e(v_i,v_j)}\, w^{t}_{v_j},\\[4pt]
\mathit{temp\_w}^{t+1}_{e(v_i,v_j)} = \mathit{temp\_w}^{t+1}_{v_i} + \mathit{temp\_w}^{t+1}_{v_j},
\end{cases}
\]

where r is the relation of edge e(v_i, v_j), and w_r is the weight of r that affects the calculation of TC in each iteration.
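One pass of the relation-weighted update can be sketched in Python. The sketch rests on several assumptions that are not stated in this section: γ_ij is taken to be 1 when an edge joins v_i and v_j and 0 otherwise, the network is undirected and has no self-loops, and any normalization performed by the full TC algorithm is omitted. The function name `weighted_tc_step`, the relation labels, and the numeric weights are hypothetical.

```python
def weighted_tc_step(node_w, edge_w, relation_of, relation_w):
    """One un-normalized update of the temporary node and edge weights.

    node_w:      dict node -> current node weight w^t_v.
    edge_w:      dict frozenset({u, v}) -> current edge weight w^t_e.
    relation_of: dict frozenset({u, v}) -> relation label r of the edge.
    relation_w:  dict relation label -> relation weight w_r.
    """
    temp_node = dict(node_w)
    for e, w_e in edge_w.items():
        u, v = tuple(e)
        w_r = relation_w[relation_of[e]]
        # Each end gains w_r * w_e * (weight of the neighbor at the other end);
        # iterating over edges is the assumed meaning of gamma_ij = 1.
        temp_node[u] += w_r * w_e * node_w[v]
        temp_node[v] += w_r * w_e * node_w[u]
    # The temporary edge weight is the sum of its two updated end weights.
    temp_edge = {e: sum(temp_node[x] for x in e) for e in edge_w}
    return temp_node, temp_edge


# Hypothetical three-node network with two relation types.
ab = frozenset({"a", "b"})
bc = frozenset({"b", "c"})
toy_temp_node, toy_temp_edge = weighted_tc_step(
    node_w={"a": 1.0, "b": 1.0, "c": 1.0},
    edge_w={ab: 1.0, bc: 1.0},
    relation_of={ab: "coauthor", bc: "cite"},
    relation_w={"coauthor": 0.5, "cite": 2.0},
)
```

With all current weights equal, the node incident to the more heavily weighted relation gains more, which illustrates how w_r shifts influence between neighbors in a single iteration.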

An important characteristic is that the original topological centers may change when two networks are merged into one by certain edges and the topological centers are recalculated in the new network. For example, if we merge the coauthor network with the citation network by the authorOf links, the topological centers of the new network may not simply be the sum of the topological centers in the coauthor network and those in the citation network. Recalculating the topological centers synthesizes more relations and therefore evaluates nodes more accurately. For example, authors can be evaluated by more factors (e.g., number of publications, number of coauthors, number of citations, etc.) in the new network than in the old networks. If applications require keeping the old topological centers in the new network to avoid recalculation, the following strategy can be adopted: Find the relations (e.g., authorOf) between the old topological centers, and then compose the corresponding old topological centers to form new topological centers. Such integrated topological centers can provide semantically relevant information services (e.g., the authority author and his or her high-impact papers can be obtained at the same time) for applications in a large network.

A semantic link network concerns relational reasoning. New semantic links could be derived from existing semantic links, and therefore, TC in the network may change. On the other hand, semantic communities emerge with operations on the network (Zhuge, 2009), so measuring the centrality in dynamic networks is a challenge (Lee, Yook, & Kim, 2009). TC also can play a role in realizing the semantic zooming lens (Zhuge, 2010).

FIG. 15. The evolution of the coauthor network of International Semantic Web Conference from 2002 to 2008: Each row shows the coauthor network and its backbone network; the left column shows the coauthor network, and the right column shows the backbone network.

Conclusion

This article first proposes the notion of TC and the algorithm to calculate the topological positions of nodes and edges in a network, and then studies its applications in discovering communities and constructing the backbone network in scientific research networks. Experiments on simulation networks and real research networks show the feasibility and effectiveness of our approaches. Two applications demonstrate the proposed approach: (a) discovering communities according to the roles of nodes distinguished by TC degrees and (b) constructing the backbone network. TC is a new measure of network characteristics.

Acknowledgments

This research was supported by the National Basic Research Program of China (Project No. 2003CB317000), the International Cooperation Project of Ministry of Science and Technology of China (2006DFA11970), the National High Technology Research and Development Program of China (2007AA12Z220), and the National Science Foundation of China (60773057 and 60703018).

References

Anthonisse, J. (1971). The rush in a graph. Amsterdam: University of Amsterdam Mathematical Centre.
Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and Its Applications, 311(3–4), 590–614.
Batagelj, V., & Mrvar, A. (2000). Some analyses of Erdös collaboration graph. Social Networks, 22(2), 173–186.
Bonacich, P. (1972). Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1), 113–120.
Clauset, A., Newman, M., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 66111.
Colizza, V., Flammini, A., Serrano, M., & Vespignani, A. (2006). Detecting rich-club ordering in complex networks. Retrieved March 30, 2010, from http://arxiv.org/PS_cache/physics/pdf/0602/0602134v1.pdf
Fortunato, S., Latora, V., & Marchiori, M. (2004). Method to find community structures based on information centrality. Physical Review E, 70(5), 56104.
Freeman, L. (1977). A set of measures of centrality based on betweenness. Sociometry, 40(1), 35–41.
Freeman, L. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1(3), 215–239.
Girvan, M., & Newman, M. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, USA, 99(12), 7821.
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Latora, V., & Marchiori, M. (2004). A measure of centrality based on the network efficiency. Retrieved March 30, 2010, from http://arxiv.org/PS_cache/cond-mat/pdf/0402/0402050v1.pdf
Lee, S., Yook, S., & Kim, Y. (2009). Centrality measure of complex networks using biased random walks. European Physical Journal B, 68(2), 277–281.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., & Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science, 298(5594), 824–827.
Newman, M. (2001a). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences (p. 21544898).


Newman, M. (2001b). Scientific collaboration networks: II. Shortest paths, weighted networks, and centrality. Physical Review E, 64(1), 16132.
Newman, M. (2004). Analysis of weighted networks. Physical Review E, 70(056131). Retrieved April 20, 2010, from http://arxiv.org/PS_cache/cond-mat/pdf/0407/0407503v1.pdf
Nieminen, J. (1974). On the centrality in a graph. Scandinavian Journal of Psychology, 15(1), 332–336.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). PageRank citation ranking: Bringing order to the Web (Tech. Rep). Stanford, CA. Retrieved April 20, 2010, from http://ilpubs.stanford.edu:8090/422/
Perra, N., & Fortunato, S. (2008). Spectral centrality measures in complex networks. Physical Review E, 78(3), 036107.
Sabidussi, G. (1966). The centrality index of a graph. Psychometrika, 31(4), 581–603.
Warshall, S. (1962). A theorem on boolean matrices. Journal of the ACM, 9(1), 11–12.
Wasserman, S., & Faust, K. (1972). Social network analysis: Methods and applications. Cambridge, United Kingdom: Cambridge University Press.
Zhuge, H. (2006). Discovery of knowledge flow in science. Communications of the ACM, 49(5), 101–107.
Zhuge, H. (2009). Communities and emerging semantics in semantic link network: Discovery and learning. IEEE Transactions on Knowledge and Data Engineering, 21(6), 785–799.
Zhuge, H. (2010). Interactive semantics. Artificial Intelligence, 174, 190–204.
Zhuge, H., & Li, X. (2007). Peer-to-peer in metric space and semantic space. IEEE Transactions on Knowledge and Data Engineering, 6(19), 759–771.
