Month 200X, Vol.21, No.X, pp.XX–XX
J. Comput. Sci. & Technol.
Community Detection in Complex Networks
Nan Du, Bin Wu, and Bai Wang
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, China
E-mail: {dunan, wubin, wangbai}@bupt.edu.cn
Received May 9, 2007.
Abstract
With rapidly growing evidence that various systems in nature and society can be modeled as complex networks, community detection in networks has become a hot research topic in physics, sociology, computer science, and other fields. Although the investigation of community structures has motivated many diverse algorithms, most of them are unsuitable for large networks due to their computational cost. In this paper, we present a faster algorithm, ComTector, which is more efficient for community detection in large complex networks and is based on the nature of overlapping cliques. The algorithm does not require any a priori knowledge about the number or the original division of the communities. With respect to practical applications, ComTector is challenged with six different types of networks: the classic Zachary Karate Club, the Scientific Collaboration Network, the South Florida Free Word Association Network, the Urban Traffic Network, the North America Power Grid and the Telecommunication Call Network. Experimental results show that our algorithm can discover meaningful communities that agree with both objective criteria and our intuitions.
Keywords: complex networks, community detection, social network analysis
∗ This work is supported by the National Science Foundation of China under grant number 60402011, and the National Science and Technology Support Program of China under Grant No.2006BAH03B05.
1 Introduction
Recent research indicates that a large body of diverse systems in many different domains can be represented as complex networks[1,2,3,4]. Examples include the Internet, the World Wide Web, social networks, and citation networks. In each case, the system is modeled as a large, intricate web of connections among the entities it is made of, such as the physical connections between routers, hyperlinks between web pages, friendships between people, and references among papers. Most of these networks are globally sparse yet locally dense. Their vertices form groups such that vertices within a group have a higher density of edges, while vertices in different groups have a lower density of edges[5,6]. This kind of structure is called a community; it is an important network property and can reveal many hidden features of the given network. For instance, the communities in the World Wide Web correspond to topics of interest. In social networks, individuals belonging to the same community tend to have properties in common. Community detection is also being considered for improving search engines and for detecting terrorist organizations on the World Wide Web. Hence, community identification is a fundamental step not only for discovering what makes entities come together, but also for understanding the overall structural and functional properties of a large network[7].
A popular quantitative definition called Network Modularity $Q$, proposed by Girvan and Newman[8,9], is widely used as a quality metric for assessing the partitioning of a network into communities. The search for the largest modularity value is an NP-hard problem because the space of all possible partitions grows faster than any power of the system size[10]. For this reason, many recent algorithms adopt various heuristic strategies to optimize this metric. However, as mentioned in [11], most actual networks are made of highly overlapping cohesive groups of nodes (cliques or k-cliques). As a result, when the network has highly overlapping cliques, most of the existing algorithms are in general inefficient due to their heuristic optimization strategies. Therefore, in this paper, we design an algorithm which is efficient for community detection in large complex networks by exploiting this overlapping nature of the cliques in real-world scenarios. Given a large sparse graph, the running time of our algorithm is $O(C \times Tri^2)$ in the worst case, where $C$ is the number of detected communities and $Tri$ is the number of triangles in the given network. The experiments on six real datasets (Zachary Karate Club, Scientific Collaboration Network, South Florida Free Word Association Network, Urban Traffic Network, North America Power Grid and the Telecommunication Call Network) show that the algorithm is able to generate communities of practical significance.
The rest of this paper is structured as follows: Section 2 reviews related work. Section 3 describes the community detection algorithm in detail. The experimental results and analysis are presented in Section 4, and we conclude the paper in Section 5.
2 Related Work
There exist many algorithms for identifying communities in the literature. The spectral bisection methods[12,13] and the Kernighan-Lin[14] algorithm are early solutions to this problem in computer science. However, the major disadvantage of the spectral approach is that it only bisects the graph iteratively, which is unsuitable for general networks. The Kernighan-Lin algorithm requires a priori knowledge of the sizes of the initial divisions[15]. In social network analysis (SNA), a group of algorithms focuses on the discovery of so-called cohesive sub-structures[5,6], including cliques[16,17], n-cliques, n-clans, n-plexes[18], and quasi-cliques[19,20,21]. These dense sub-structures often impose extra restrictions on the community definitions. For instance, the definition of an n-clique requires that the distance between any pair of vertices be no more than n, while in a quasi-clique the ratio of the number of each vertex's neighbors to the number of all the vertices in the sub-structure is no less than a threshold value. Meanwhile, the average size of these sub-structures is usually small, so one may obtain a great number of them, which hides the global organization of the network. Another widely used technique in SNA is hierarchical clustering[22], which groups similar vertices into larger communities. Donetti and Munoz[23] have adopted such a hierarchical clustering method, using the eigenvectors of the Laplacian matrix of the graph to measure the similarities among vertices. The complexity is determined by the computation of all the eigenvectors, which takes $O(n^3)$ time for sparse matrices. While it does not require the size or number of the communities to be specified beforehand, this method does not know when to stop the agglomerative process for the best division of the network.
In recent years, Girvan and Newman have introduced a divisive method (GN)[9,24] that iteratively cuts the edge with the greatest betweenness value; it generates an optimized division of the network with $O(m^3)$ time complexity. Radicchi has proposed a methodology similar to GN[25] that uses the edge-clustering coefficient as a new metric, with a smaller time complexity of $O(m^2)$. To further improve the efficiency, Clauset, Newman and Moore have proposed a fast clustering algorithm[26] with $O(n \log^2 n)$ time complexity on sparse graphs, which iteratively merges the pair of nodes that yields the maximal $\Delta Q$ until $\Delta Q$ becomes negative. Pascal Pons and Matthieu Latapy[27] have proposed another clustering algorithm that uses random walks to evaluate the similarity among vertices. It also uses Network Modularity to determine when to stop the agglomerative process and has $O(n^2 \log n)$ time complexity. Other interesting algorithms include Jordi Duch and Alex Arenas's extremal optimization method[10] with $O(n^2 \log n)$ time complexity, Aaron Clauset's method for finding local community structures[28], the force-based incremental algorithm of Bo Yang and Da-You Liu[29], which focuses on mining the community structure of a dynamic network, and the agent-based algorithm proposed by Ismail Gunes and Haluk Bingol[30].
All these algorithms are successful approaches to community discovery from different perspectives. However, actual complex networks are usually large sparse graphs with regions consisting of overlapping cliques[11]. As a consequence, the betweenness-based divisive algorithms have very low computational efficiency, while the fast agglomerative method[26] in general cannot give a satisfactory division due to its local optimization strategy. Therefore, we follow a different track by presenting an algorithm which generates a higher network modularity than the fast algorithm while performing more efficiently than the GN algorithm.

3 Community Detection Algorithm
The basic idea of ComTector is to build up communities around overlapping cliques. We regard overlapping maximal cliques as the clustering kernels and carry out an agglomerative process to associate the remaining vertices with their closest kernels based on a proposed distance measure. Finally, the obtained fractional communities are adjusted so as to prevent the network from being divided into pieces that are too small.

3.1 Problem Formulation
In this paper, we consider simple graphs only, i.e., graphs without self-loops or multi-edges. Given a graph $G$, $V(G)$ and $E(G)$ denote the sets of its vertices and edges respectively.
Definition 1. Given a set $S \subseteq V(G)$, if $(u, v) \in E(G)$ for all $u, v \in S$ with $u \neq v$, then $S$ is a clique in $G$. If, for any other clique $S'$, $S' \supseteq S$ iff $S' = S$, then $S$ is a maximal clique of $G$.
Definition 2. For a given vertex $v$, $N(v) = \{u \mid (v, u) \in E(G)\}$ is the set of all neighbors of $v$.
Definition 3. Given a set $S \subseteq V(G)$, $N|_S = \bigcup_{v_i \in S} N(v_i) - S$ is the set of all neighbors of $S$.
Definition 4. Let $Com(G)$ be the set of all components in $G$. The giant component is denoted by $C_G$, and $M(C_G)$ is the set of all the maximal cliques in $C_G$. We use $V_M \subseteq V(G)$ to represent the set of all vertices covered by $M(C_G)$.
Definition 5. Given a vertex $v_i \in V_M$, $C_i$ is the set of all maximal cliques that contain $v_i$, and $C = \{C_i \mid C_i \subseteq M(C_G)\}$. For all $C_i, C_j \in C$, if $\frac{|C_i \cap C_j|}{|C_j|} \geq f$, where $f$ is a threshold describing the extent to which $C_i$ and $C_j$ overlap, we say that $C_j$ is contained by $C_i$, denoted by $C_j < C_i$. If $C_i$ is not contained by any other element in $C$, $C_i$ is called a kernel of $G$ and $v_i$ is the center of $C_i$.
Definition 6. Let $K$ be the set of all kernels in $G$. $V_K = \{v_i \mid v_i \in k_j, k_j \in K\}$ is the set of all vertices covered by $K$, and $I_K = \bigcup (k_i \cap k_j)$, $k_i, k_j \in K$, $i \neq j$, is the union of all the vertices that any pair of elements in $K$ has in common.
Definition 7. For any given vertex $v_i$, the Freeman Relative Centrality[5] is defined as $C_{RD}(v_i) = |N(v_i)|/(n - 1)$, where $n$ is the number of vertices of the graph under consideration.
Definition 8. Given a graph $G$, the centralization[5] of $G$ is defined as $G_C = \sum_i (C_{RDmax} - C_{RD}(v_i))/(n - 2)$.

3.2 Algorithm
Since most complex networks have a giant component, we first use an efficient algorithm, Peamc[17], to enumerate all maximal cliques in this giant component. Because a maximal clique is a complete sub-graph, it represents the closest relationship among different entities and thus is the densest community in the given network. For any $v_i \in V(G)$, $C_i$ is the set of all maximal cliques containing $v_i$. For all $v_i, v_j \in V_M$, if $\frac{|C_i \cap C_j|}{|C_j|} \geq f$ ($f$ is an empirical value), which means that all or most of $v_j$'s relationships are covered by those of $v_i$, we say that $v_j$ depends on $v_i$ and $C_j$ is contained by $C_i$.
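As an illustration of the containment test, the following is a minimal Python sketch, not the authors' implementation. It assumes each $C_i$ is stored as a set of maximal cliques (frozensets of vertex labels); the names `is_contained` and `uncontained_centers` and the toy cliques are hypothetical. The second function applies Definition 5 literally and ignores the size ordering and clique deletion performed by Algorithm 1.

```python
from typing import Dict, FrozenSet, List, Set

Clique = FrozenSet[str]


def is_contained(c_i: Set[Clique], c_j: Set[Clique], f: float) -> bool:
    """True if c_j is contained by c_i, i.e. |c_i & c_j| / |c_j| >= f."""
    if not c_j:
        return False  # assumption: an empty clique set is never "contained"
    return len(c_i & c_j) / len(c_j) >= f


def uncontained_centers(cliques_of: Dict[str, Set[Clique]], f: float) -> List[str]:
    """Centers v_i whose C_i is not contained by any other C_j (Definition 5)."""
    return [
        v_i
        for v_i, c_i in cliques_of.items()
        if not any(
            is_contained(c_j, c_i, f)
            for v_j, c_j in cliques_of.items()
            if v_j != v_i
        )
    ]


# Hypothetical toy input: the maximal cliques containing vertices "a" and "b".
cliques_of = {
    "a": {frozenset("abc"), frozenset("abd"), frozenset("ace")},
    "b": {frozenset("abc"), frozenset("abd")},
}
print(is_contained(cliques_of["a"], cliques_of["b"], f=0.8))  # overlap ratio 2/2
print(is_contained(cliques_of["b"], cliques_of["a"], f=0.8))  # overlap ratio 2/3
print(uncontained_centers(cliques_of, f=0.8))
```

Because the ratio is normalized by $|C_j|$, the relation is asymmetric: a vertex with few cliques can be contained by a vertex with many, but not the other way round unless $f$ is small.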
Figure 1: Overlapping Cliques
Conversely, if $C_j$ is not contained by any other element of set $C$, then $C_j$ is called a kernel. From the above discussion, we can conclude that the larger $C_i$ is, the more likely it is to become a kernel. Therefore, we rearrange all the elements of set $C$ in descending order of their sizes and delete those elements whose sizes are smaller than 2, which means that if $C_i$ is going to be a kernel, $v_i$ must participate in at least two closest relationships. Here, let $C_{i_0}$ be the element of $C$ with the largest size, $C_{i_1}$ be the element of $C$ whose size ranks second, ..., $C_{i_n}$ be the element of $C$ whose size ranks $n$, and so on. $K$ is the set of all kernels. We first pick $C_{i_0}$ and remove the elements it contains from $C$. In the next step, we delete from $C_{i_0}$ each maximal clique that contains the centers of the remaining elements in $C$. If $C_{i_0}$ is not empty, it is put in set $K$. Again, we pick the element with the largest size from the remaining elements of $C$, say $C_{i_n}$, remove it from $C$, remove all the elements contained by $C_{i_n}$, and delete from $C_{i_n}$ each maximal clique that includes the centers of the remaining elements in $C$. If there is any maximal clique that contains the centers of the elements in set $K$, it also needs to be deleted from $C_{i_n}$ to get rid of unnecessary duplications. If $C_{i_n}$ is not empty, it is put in set $K$. The process continues iteratively until $C$ becomes empty.
To make things more concrete, an example is given on the network shown in Figure 1. Here, $v_0$ is contained in four maximal cliques, with $C_0 = \{\{v_0, v_1, v_4, v_5\}, \{v_0, v_1, v_3, v_4\}, \{v_0, v_2, v_3, v_4\}, \{v_0, v_4, v_5, v_6\}\}$. Vertex $v_1$ is involved in two maximal cliques, with $C_1 = \{\{v_0, v_1, v_4, v_5\}, \{v_0, v_1, v_3, v_4\}\}$. Since $C_1$ is contained by $C_0$, $C_1$ cannot be a kernel. Similarly, $C_2$, $C_3$, $C_4$, $C_5$ are also contained by $C_0$, and $C_8$, $C_9$, $C_{10}$, $C_{11}$ are contained by $C_7$. Therefore, $C_0$ and $C_7$ are two different kernels with $v_0$ and $v_7$ as their respective centers. The overall process is depicted in Algorithm 1.
Starting from set $K$, we see that each element of $K$ corresponds to the kernel of a possible community in $G$. In fact, the purpose of generating set $K$ is similar to that of finding the clustering centers in the classic k-means algorithm. Thus, one may argue that another very intuitive method to search for the kernels might depend on the degree of each vertex. It is possible for us to sort all the vertices
according to the descending order of their degrees and treat each vertex together with its neighbors as an element of set $C$ to generate the kernels. Even though this method seems simple and straightforward, it does not yield a good network modularity value.

Algorithm 1 FilterOutKernels(C, f)
K ⇐ ∅
sort C by the descending order of |C_i|
{core stores the centers of the generated kernels}
core ⇐ ∅
for C_i ∈ C do
    contained ⇐ {C_j | j ≠ i, C_j < C_i}
    independent ⇐ {k | k ≠ i, C_k ≮ C_i}
    delete C_i from C
    C ⇐ C − contained
    for s ∈ C_i do
        if s ∩ (independent ∪ core) ≠ ∅ then
            delete s from C_i
        end if
    end for
    if C_i ≠ ∅ then
        K ⇐ K ∪ {C_i}
    end if
    core ⇐ core ∪ {v_i}
end for
return K

The reason is that the vertices contained in communities do not necessarily have large degrees. A large degree only indicates that the vertex itself, as a single entity, has many connections with others, yet it does not mean that it is involved in a large community. In our experiments, we have found that approximately 40 percent of the top 10 elements in set $C$ have centers whose degrees are also ranked in the top 10. Most vertices in the communities of average size do not have large degrees. Let $v_k$ be the center of the element in $C$ with the smallest size and $v_d$ be the vertex with the maximum degree. We have found that the proportion of vertices $v$ whose degree satisfies $|N(v_k)| \leq |N(v)| \leq |N(v_d)|$ among all the vertices is 75% on average, which is far more than $|C|$ and thus leads to a low efficiency for kernel generation. Therefore, whether a vertex is included in a community actually depends on how closely its neighboring vertices are connected with each other, which is another important motivation for us to use the overlapping maximal cliques to find the possible kernels.
The discovered communities form a complete partition of $C_G$, which requires that no pair of elements in $K$ have any vertex in common. As a result, pair-wise intersection among the elements of $K$ is performed and all the common vertices are put in set $I_K$. For each vertex $v_i \in I_K$, we use a distance measure to identify which kernel in $K$ is closest to $v_i$. Based on $C_{RD}$ and $G_C$, given a vertex $v_i$ and a sub-graph $S_G$, we add $v_i$ to the closest $S_G$, and our distance measure is defined as $D_{v_i} = a_0 \cdot C_{RD}(v_i) + a_1 \cdot (G_C + (C_{RD}(v_i) - C_{RDmax}))$, where $a_0 + a_1 = 1$ and $a_0, a_1 \in (0, 1)$. This metric represents the distance between the given vertex and its closest kernel by taking full account of the following factors: $C_{RD}(v_i)$ directly reflects the relative significance of vertex $v_i$. The larger $C_{RD}(v_i)$ is, the more important vertex
$v_i$ becomes. $G_C$ and $(C_{RD}(v_i) - C_{RDmax})$ are fine-tuning factors. If a sub-graph has a high central tendency while the gap between the relative degree of vertex $v_i$ and the maximum one is small, $v_i$ can hold a more significant position. Every vertex in $I_K$ is assigned to its closest kernel in $K$, as shown in Algorithm 2. In Figure 1, $C_0$ and $C_7$ share $v_5$. We regard $C_{RD}$ as the dominant factor, such that $a_0 = 0.8$ and $a_1 = 0.2$. Since the values of the measure for $v_5$ with respect to $C_0$ and $C_7$ are 0.543 and 0.271 respectively, $v_5$ is assigned to $C_0$ and removed from $C_7$.

Algorithm 2 DeDuplication(K)
I_K ⇐ ∅
for k_i ∈ K do
    for k_j ∈ K, i < j do
        I_K ⇐ I_K ∪ (k_i ∩ k_j)
    end for
end for
for v ∈ I_K do
    remove v from all the kernels except for the one having the maximum distance
end for

Each kernel $k_i \in K$ is regarded as a clustering center, and every vertex in $V(C_G) - V_K$ is assigned to its closest kernel based on the corresponding value of the distance measure. This procedure is done by gradually expanding the kernels in $K$. We adopt a marking strategy to differentiate new vertices from old ones. In the first step, all vertices in $V_K$ are marked as old. In the second step, every new vertex in the set $\bigcup_{k_i \in K} N(k_i) - V_K$ is added to a tentative set $V_E$. In the third step, all vertices in $V_E$ are assigned to their closest kernels and are marked as old. As a result, every kernel is now expanded. Again, each new vertex in set $N|_{V_E}$ is added to a tentative set $V_E'$. Next, the vertices in $V_E'$ are also assigned to their closest kernels and are marked as old. This process is repeated iteratively until the kernels cannot be expanded any more. Algorithm 3 describes the whole procedure.

Algorithm 3 AssignVertex(K)
for v_i ∈ V_K do
    v_i is marked as old
end for
V_E ⇐ vertices not marked as old in ∪ N(k_i) − V_K
while V_E ≠ ∅ do
    for v_i ∈ V_E do
        assign v_i to its closest kernel k_i
        v_i is marked as old
    end for
    V_E' ⇐ ∅, V_E' ⇐ vertices not marked as old in N|V_E
    V_E ⇐ ∅, V_E ⇐ V_E'
end while

3.3 Modularity Optimization
When the clustering process is finished, all the obtained sub-structures constitute the original division of $C_G$. We then adopt the Network Modularity $Q$ to evaluate this original division. Based on our observations, this division contains some extremely small communities, which are derived from kernels that are tiny compared with the others. The actual cause of such fractional kernels is
that for a specific $C_i$ containing vertex $v_i$, $v_i$ may not be the true center of $C_i$. In other words, although each maximal clique in set $C_i$ contains $v_i$, it may also include many centers of other kernels, so $v_i$ is not the expected core figure and is just a normal entity which participates in the social circles of other core figures. Consequently, while $C_i$ is able to become a kernel, it cannot contain enough cliques of other elements in set $C$. However, every maximal clique which contains the centers of other kernels is deleted from $C_i$. As a result, $C_i$ is reduced to a rather small kernel. The communities derived from these small kernels may partition the network into pieces that are too small.
To address this problem, we propose two methods to adjust the original division. In the first one, we adopt the basic idea of Newman's fast algorithm to perform a local greedy optimization. Given the $p \times p$ symmetric matrix $e$ whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in community $i$ to vertices in community $j$, the row sums $a_i = \sum_j e_{ij}$ represent the fraction of edges that connect to vertices in community $i$. $Q$ is thus defined as $Q = \sum_i (e_{ii} - a_i^2)$. We iteratively search for the changes $\Delta Q$ resulting from the amalgamation of each pair of communities, choose the largest one, and perform the corresponding amalgamation until $\Delta Q$ becomes negative. The modularity value of the original division is $Q_0$. Suppose we first merge community $i$ with $j$ and the new community is denoted as $(ij)$. We then have

$$\Delta Q = \begin{cases} e_{ij} + e_{ji} - a_{(ij)}^2 + a_i^2 + a_j^2 & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$$

where $a_{(ij)} = a_i + a_j$ is the row sum of the merged community. Once the initial values of $\Delta Q$ and $a_i$ are obtained, we can use the amalgamation process of the fast algorithm to increment $Q_0$ by the largest $\Delta Q$ until it becomes negative. Because this optimization method is just like that of the fast algorithm, it suffers from the resolution limit[31] problem as well. In the second method, the fractional communities of the original division whose sizes are below the average level are merged with the rest. In our experiments, we have found that the final modularity value obtained by this straightforward method is often close to that of the former, at even lower computational cost.

3.4 Performance Analysis
From the previous discussion, the enumeration of all maximal cliques in $G$ using Peamc[17] costs $O(\Delta \times M_C \times Tri^2)$ in the worst case on a single processor, where $\Delta$ is the maximal degree of $G$, $M_C$ is the size of the maximum clique and $Tri$ is the number of all triangles in $G$. Most complex networks are large sparse graphs where $|V(G)| \approx |E(G)|$. In these networks, the size of the cliques has a power-law distribution in which triangles are the most numerous. By taking the high clustering coefficient property (which directly corresponds to the existence of triangles) of complex networks into consideration, Peamc[17] can perform very efficiently (especially when $|V(G)|/|E(G)| \leq 3$). To find the kernel set $K$, we need to traverse all the elements of $C$ whose size is larger than 2, which costs $O(M_C \times |C|^2)$. The parameter $f$, which determines whether one element of $C$ is contained by another, influences the number of kernels. Based on our experiments, we suggest that it should be larger than 0.3, although the modularity is not very sensitive to it. As for $a_0$ and $a_1$, the relative degree is regarded as the dominant factor, so these two coefficients are set to 0.8 and 0.2 respectively. Since $V_M - V_K \approx V(G)$, assigning the remaining vertices in $V_M - V_K$ costs $O(|K| \times |V(G)| \times I)$, where $I$ is the average number of times the process repeats until $K$ is empty. In sparse graphs, we have $|V(G)| \approx |E(G)|$, $|C| < |V(G)|$, $|K| \approx |C|$, $|V(G)| < Tri^2 \ll |V(G)|^2$, and $\Delta \times M_C < |K|$. Let $C$ denote the number of communities in the original division. The adjustment phase using modularity optimization costs $O(C \times \log C)$. Because $C \approx |K|$ and $I$ has an average value of 6 according to the small-world property, the overall cost is $O(C \times Tri^2)$ in the worst case.

4 Experimental Results
In this section, we present a number of applications of ComTector. The algorithm is first tested on the well-known Zachary Karate Club[4][5], and is then challenged with the Scientific Collaboration Network, the South Florida Free Word Association Network[32], the Urban Traffic Network (U.T.), and the North America Power Grid[1]. Based on the experimental results, we discuss the optimization of parameter $f$ in detail. Finally, our algorithm is further tested on the large Telecommunication Call networks (T.C.) to illustrate the global structural properties of complex networks. Table 1 gives a general description of our datasets. The construction and organization of each network are presented in the following sections. All experiments are run on a single PC (3.0GHz processor with 2 GB of main memory, running Linux AS3).

Table 1: Datasets used in our experiments
Network                    |V(G)|     |E(G)|
Zachary Karate Club        34         78
Scientific Collaboration   1667       4487
Word Association           10225      81330
U.T. of Beijing            4235       13846
U.T. of Shanghai           1967       4593
U.T. of Shenyang           954        2772
Power Grid                 4941       6594
T.C. 1                     512024     1021861
T.C. 2                     845750     1544834
T.C. 3                     2423807    5317183

Figure 2: Zachary Karate Club

4.1 Zachary Karate Club
The Zachary Karate Club is one of the classic studies in social network analysis. Over the course of two years in the early 1970s, Wayne Zachary observed social interactions between the members of a karate club at an American university. He built a network of connections with 34 vertices and 78 edges among the members of the club based on their social interactions.
By chance, a dispute arose during the course of his study between the club's administrator and the karate teacher. As a result, the club split into two smaller communities centered on the administrator and the teacher respectively. Figure 2 shows the two communities detected by our algorithm.

4.2 Scientific Collaboration Network
The data of the collaboration network is obtained from the 1990 papers published from 1998 to 2005 and indexed by SCI, EI and ISTP at Beijing University of Posts and Telecommunications. Each author corresponds to a vertex of the network, and there is an edge between two vertices if the two authors have collaborated on a paper. A great deal of work has gone into the disambiguation of similar names, so the co-authorship relationships are relatively free of name resolution problems. The top portion of Figure 3 gives the map of all 81 discovered communities listed in Table 2. Each community in the Periphery area is an independent small component of the network, and the Core area corresponds to the giant component, with each vertex representing the corresponding community. By zooming in on the vertices of the Core area, we come to the detailed description of each specific community. In this magnified picture, the color of each community is the same as that of the corresponding vertex in the Core area. The vertices in every community are the central persons who act as representatives of the research group. The solid lines among these vertices show that the central persons of the given communities have collaborated, while the dashed lines mean that persons other than the central ones have collaborated with each other. More specifically, the communities of "XiaoMin Ren" and "YongQing Huang" are further enlarged to show their detailed internal structure.

Table 2: Results on the Scientific Collaboration Network
Algorithm      Communities   Q      Time
GN             79            0.85   403s
Newman Fast    85            0.43   2.4s
ComTector      81            0.83   1s

Figure 3: All Discovered Communities in the Collaboration Network

4.3 South Florida Free Word Association Network
The purpose of the South Florida Free Word Association Network by Douglas L. Nelson and Cathy L. McEvoy[32] is to make the largest database of free word association ever collected in the United States available to interested researchers and scholars. More than 6,000 participants produced nearly three-quarters of a million responses to 5,019 stimulus words. Participants were asked to write the first word that came to mind that was meaningfully related or strongly associated to the presented word on the blank shown next to each item. For example, if given BOOK[ ], they might write READ on the blank next to it. This procedure is called a discrete association task because each participant is asked to produce only a single associate to each word. The network consists of 10225 vertices and 81330 edges. Each community corresponds to a group of semantically related words. Figure 4 presents the community around "comfort". Table 3 gives the results of the three algorithms.

Figure 4: comfort community

Table 3: Results on South Florida Free Word Association Network
Algorithm      Communities   Q      Time
GN             n/a           n/a    > 2h
Newman Fast    68            0.16   356s
ComTector      75            0.25   99s

4.4 Urban Traffic Network
We build the urban traffic networks of three famous cities in China: Beijing, Shanghai and Shenyang. In each network, a vertex is a bus stop and there is an edge between two vertices if they are neighboring stops on the same bus line. Experimental results on the three networks are given in Table 4.

Table 4: Results on the Urban Traffic Network in Beijing, Shanghai, and Shenyang
City       Algorithm      Communities   Q      Time
Beijing    GN             54            0.83   >2h
Beijing    Newman Fast    71            0.40   19s
Beijing    ComTector      66            0.82   7s
Shanghai   GN             25            0.78   1560s
Shanghai   Newman Fast    28            0.36   6s
Shanghai   ComTector      14            0.74   1s
Shenyang   GN             26            0.82   28s
Shenyang   Newman Fast    31            0.4    1s
Shenyang   ComTector      14            0.80   1s

Table 5: Results on North America Power Grid Network
Algorithm      Communities   Q      Time
GN             36            0.92   3122s
Newman Fast    63            0.45   143s
ComTector      35            0.90   4s

Figure 5: Backbone of the Urban Traffic Network in Beijing

If we zoom out and regard each community as a node, we obtain a community graph. In this graph, each vertex represents a particular community, and there is an edge between two vertices if the two communities have some vertices in common within a distance of one hop. By treating the reciprocal of the number of common vertices between the two ends of an edge as its weight, we calculate a minimal spanning tree of the community graph. The obtained spanning tree thus represents the backbone of the given network. In our experiment, the 66 detected communities are numbered from 0 to 65, and the backbone of the urban traffic network in Beijing is shown in Figure 5. We see that the 66 communities in Beijing cover most of the traffic hubs in the city, which agrees with the common-sense expectation that a growing big city develops a robust transport system.
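The backbone construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the number of common vertices for each pair of communities is assumed to be precomputed and passed in as the hypothetical mapping `shared`, the function name `backbone` is illustrative, and Kruskal's algorithm with a union-find structure is one standard way to obtain a minimal spanning tree (the paper does not say which method was used).

```python
from typing import Dict, List, Tuple


def backbone(shared: Dict[Tuple[int, int], int]) -> List[Tuple[int, int, float]]:
    """Minimal spanning tree of the community graph (Kruskal's algorithm).

    shared[(a, b)] is the number of vertices communities a and b have in
    common; the edge weight is its reciprocal, so strongly overlapping
    communities are linked first.
    """
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Lightest edges (largest overlaps) first; pairs without overlap are skipped.
    edges = sorted((1.0 / n, a, b) for (a, b), n in shared.items() if n > 0)
    tree: List[Tuple[int, int, float]] = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two different trees
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical overlap counts between four communities numbered 0..3.
shared = {(0, 1): 4, (1, 2): 2, (0, 2): 1, (2, 3): 5}
for a, b, weight in backbone(shared):
    print(a, b, weight)
```

In this toy input the weakest overlap, between communities 0 and 2, is left out of the tree because both are already connected through community 1.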
4.5 North America Power Grid
We have also run ComTector on the raw data representing the topology of the Western States Power Grid of the United States, which was originally used by D. Watts and S. Strogatz to describe the collective dynamics of 'small-world' networks. The whole network is an unweighted, undirected graph containing 4941 vertices and 6594 edges. The number of discovered communities is shown in Table 5.
4.6 Parameter Optimization
In the algorithm, the value of the parameter f affects the final partition. We adopt Newman's modularity Q to evaluate the strength of the detected community structures. The parameter f determines the number of kernels in the given network, which in turn influences the value of Q. Figure 6 shows this relation for the Scientific Collaboration, Word Association, Urban Traffic and Power Grid networks respectively. If f is too large, it will cut the network into smaller pieces. For each community i, $e_{ii}$ then tends to be small and $a_i$ is relatively large, which causes Q to decrease. As a result, in Figure 6 we see that Q usually reaches its maximum value when f ∈ (0.3, 0.5), although Q is not very sensitive to f.

Figure 6: Network Modularity Q
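To make the role of $e_{ii}$ and $a_i$ concrete, the sketch below computes Q for an unweighted, undirected graph, assuming Newman's standard definition $Q = \sum_i (e_{ii} - a_i^2)$, where $e_{ii}$ is the fraction of edges inside community i and $a_i$ is the fraction of edge ends attached to community i. The function name and the toy graph are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman's modularity Q = sum_i (e_ii - a_i^2) for a disjoint partition."""
    m = len(edges)
    if m == 0:
        raise ValueError("the graph has no edges")
    inside = defaultdict(int)  # number of edges with both ends in community i
    ends = defaultdict(int)    # number of edge ends attached to community i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Toy graph (hypothetical): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_triangles = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
singletons = {v: v for v in range(6)}

q_good = modularity(toy_edges, two_triangles)
q_fragmented = modularity(toy_edges, singletons)
print(q_good, q_fragmented)
```

On this toy graph the partition into two triangles scores higher than the partition into single vertices, which has no internal edges at all. This mirrors the observation that over-fragmenting the network lowers Q.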
4.7 Telecommunication Call Networks
The Telecommunication Call Networks are built from one month of call records for a city and for a province, obtained from a telecom operator in China. We regard each subscriber as a vertex, and two vertices share an edge if the subscribers have contacted each other at least once by mobile phone. We have detected 28033 and 2171 communities in telecommunication call networks 1 and 2, with modularity Q of 0.60 and 0.64 respectively, within 4200s, while neither GN nor Newman Fast can generate satisfactory results within an acceptable time. Looking at the large communities in the networks, we have found that they often consist of people who have similar spending power, are of similar ages or live in the same areas. To some extent, these community structures and the corresponding common factors are useful clues for the telecom operator in designing its customer marketing policies.

In addition, by building the community graph as we do for the urban traffic network, we can obtain the backbone of the large telecommunication call network, which is shown in Figure 7. The left part of Figure 7 is the core of the original network with 845750 vertices and 1544834 edges. The massive numbers of vertices and edges are intertwined, which makes the network hard to read and analyze. By contrast, the backbone of this large network, presented in the right part, gives a direct sense of the global organization of the whole network.

Figure 7: Backbone of the telecommunication call network 2

One important property of complex networks is that the distribution of the community size k appears to have a power-law form $P(k) \sim k^{-\alpha}$ with some constant α. In our experiment, we run our algorithm on a very large telecommunication call network 3, which consists of 2423807 vertices and 5317183 edges, and discover 139244 communities. The experimental result shows that this network also exhibits this property with an exponent α = 3.28, as shown in Figure 8. We conjecture that this power-law distribution of community sizes results from the formation and evolution of the complex network itself. Investigating this connection will guide our future research.

Figure 8: Power-law distribution of the community size (log-log plot titled "The Distribution of Community Size"; x-axis: Community Size, y-axis: Community Number)
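The paper does not state how the exponent α was estimated. As one possible approach, the sketch below uses the standard continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(k_i / k_{\min})$. The function name, the choice of $k_{\min}$, and the synthetic sample are assumptions for illustration. Real community sizes are integers, so the continuous estimator is only an approximation for them.

```python
import math


def estimate_power_law_exponent(sizes, k_min=1.0):
    """Continuous maximum-likelihood estimate of alpha in P(k) ~ k^(-alpha).

    Only sizes k >= k_min are used. k_min is a hypothetical cutoff chosen
    by the caller, not a value taken from the paper.
    """
    tail = [k for k in sizes if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum <= 0:
        raise ValueError("need at least one size strictly above k_min")
    return 1.0 + len(tail) / log_sum


# Synthetic, deterministic sample (not the paper's data): inverse-transform
# sampling of a power law with alpha = 3.28 on an evenly spaced grid of
# quantiles, so the example needs no random numbers.
true_alpha = 3.28
n = 10000
synthetic_sizes = [
    (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for i in range(n)
]

alpha_hat = estimate_power_law_exponent(synthetic_sizes, k_min=1.0)
print(round(alpha_hat, 2))
```

A least-squares fit on the log-log histogram of Figure 8 is another common choice, but it is more sensitive to the binning and to the sparse tail of large communities.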
5 Conclusion
In this paper, we have followed a different track by proposing a new method, ComTector, for community detection in complex networks. Based on the overlapping nature of cliques in the real world, this algorithm can be applied to many large sparse graphs. It is simple and intuitive, and it extracts satisfactory results on networks whose community structures are known in advance. The method consists of two critical steps. In the first step, we adopt an efficient algorithm to enumerate all maximal cliques in the giant component of the given network. The clusters of these maximal cliques form the kernels of the potential communities. In the second step, we use an agglomerative technique which iteratively adds the remaining vertices in the giant component to their closest kernels. The clustering results are then adjusted by merging fragmentary communities to achieve a better network modularity, and the resulting community structures, together with the other components, constitute the final partition of the network. We have demonstrated the efficiency and utility of the algorithm with a number of practical examples. Experimental results on real-world networks show that the algorithm can extract meaningful communities that agree with both objective facts and our intuitions. In addition, we use ComTector to analyze networks whose structure is otherwise difficult to understand. These networks include the Scientific Collaboration, Word Association, Urban Traffic, Power Grid and Telecommunication Call networks. Although these networks are obtained from diverse systems in different fields, they have some structural properties in common.

In future work, we will focus on the evolution and prediction of the community structures, as well as of the backbone of the complex network, using time series analysis to gain a deeper understanding of the network dynamics from both micro and macro perspectives. Moreover, we will extend our algorithm to find communities in bipartite networks, which could further improve existing collaborative recommendation based on community wisdom.

References
[1] Watts, D.J. and Strogatz, S.H. Collective Dynamics of 'Small-World' Networks. Nature, Vol-393: 440–442.
[2] Watts, D.J. Small Worlds: The Dynamics of Networks between Order and Randomness. Levin, S.A., Strogatz, S.H. (eds.), Princeton: Princeton University Press, 1999.
[3] Boccaletti, S., Latora, V. and Moreno, Y. Complex Networks: Structure and Dynamics. Physics Reports, Vol-424(Issue 4-5): 175–308.
[4] Newman, M.E.J. The Structure and Function of Complex Networks. SIAM Review, Vol-45: 167–256.
[5] Wasserman, S. and Faust, K. Social Network Analysis. Cambridge: Cambridge University Press, 1994.
[6] Scott, J. Social Network Analysis: A Handbook. London: Sage Publications, 2002.
[7] Milo, R., Itzkovitz, S., et al. Network Motifs: Simple Building Blocks of Complex Networks. Science, Vol-298: 824–827.
[8] Newman, M.E.J. Modularity and community structure in networks. PNAS, Vol-103: 8577.
[9] Girvan, M. and Newman, M.E.J. Community structure in social and biological networks. PNAS, Vol-99: 7821–7826.
[10] Duch, J. and Arenas, A. Community detection in complex networks using extremal optimization. Physical Review E, Vol-72: 027104.
[11] Palla, G., Derényi, I., and Farkas, I. Uncovering the Overlapping Community Structure of Complex Network in Nature and Society. Nature, Vol-435: 814–818.
[12] Fiedler, M. Algebraic connectivity of graphs. Czech Math J, Vol-23: 298–305.
[13] Pothen, A., Simon, H., and Liou, K-P. Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal App., Vol-11: 430–452.
[14] Kernighan, B.W., and Lin, S. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, Vol-49: 291–307.
[15] Newman, M.E.J. Detecting community structure in networks. Eur. Phys. J. B, Vol-38: 321–330.
[16] Bron, C., and Kerbosch, J. Finding all cliques of an undirected graph. Communications of the ACM, Vol-16: 575–577.
[17] Du, N., Wu, B., and Wang, B. et al. A Parallel Algorithm for Enumerating All Maximal Cliques in Complex Networks. In Proc. The 6th ICDM2006 Workshop, Hong Kong, 2006, pp.320–324.
[18] Wu, B. and Pei, X. et al. A Parallel Algorithm for Enumerating all the Maximal k-plexes. In Proc. PAKDD07 Workshops, Nan Jing, 2007, pp.476–483.
[19] Abello, J., Resende, M.G.C., and Sudarsky, S. et al. Massive Quasi-Clique Detection. In Proc. the 5th Latin American Symposium on Theoretical Informatics, Mexico, 2002, pp.598–612.
[20] Pei, J., Jiang, D.X., and Zhang, A.D. et al. On mining cross-graph quasi-cliques. In Proc. The 12th ACM SIGKDD, Philadelphia, 2006, pp.228–237.
[21] Zeng, Z., Wang, J., and Karypis, G. et al. Coherent Closed Quasi-Clique Discovery from Large Dense Graph Databases. In Proc. The 12th ACM SIGKDD, Philadelphia, 2006, pp.797–802.
[22] Han, J.W. and Kamber, M. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann Publishers, 2006.
[23] Donetti, L. and Munoz, M.A. Detecting Network Communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics, P100102.
[24] Girvan, M. and Newman, M.E.J. Finding and evaluating community structure in networks. Physical Review E, Vol-69: 026113.
[25] Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., and Parisi, D. Defining and identifying communities in networks. PNAS, Vol-101: 2658.
[26] Clauset, A., Newman, M.E.J. and Moore, C. Finding community structure in very large networks. Physical Review E, Vol-70: 066111.
[27] Pons, P. and Latapy, M. Computing Communities in Large Networks Using Random Walks. In Proc. ISCIS2005, Istanbul, 2005, pp.284–293.
[28] Clauset, A. Finding local community structure in networks. Physical Review E, Vol-72: 026132.
[29] Yang, B., and Liu, D.Y. Force-Based Incremental Algorithm for Mining Community Structure in Dynamic Network. Journal of Computer Science and Technology, Vol-21: 393–400.
[30] Gunes, I. and Bingol, H. Community Detection in Complex Networks Using Agents. CoRR, Vol: abs/cs/0610129.
[31] Fortunato, S. and Barthelemy, M. Resolution limit in community detection. PNAS, Vol-104: 36–41, 2007.
[32] http://w3.usf.edu/FreeAssociation/