2011 Seventh International Conference on Natural Computation

Ant Colony Optimization for Community Detection in Large-Scale Complex Networks

Dongxiao He, Jie Liu, Dayou Liu, Di Jin
College of Computer Science and Technology, Jilin University, Changchun, China

Zhengxue Jia
Manager Service Department, FAW-Volkswagen Automotive Company Limited, Changchun, China

Abstract: In this paper we present a new ant colony optimization algorithm for community detection in large networks, which takes modularity Q as its objective function. What distinguishes our algorithm from earlier ant algorithms is the manner in which the ants are used. Unlike existing methods, in which each ant searches for a complete candidate solution, each ant in our algorithm only decides whether its current vertex should join the community of its previous vertex, guided by a simulated annealing rule whose purpose is to locally optimize Q. In each iteration, the ants work collectively to uncover the community structure of the network. Moreover, we introduce the idea of “layer and rule” into this method to further improve its performance. Our algorithm does not employ pheromone, which reduces its running time and makes it well suited to large-scale networks. At the same time, in terms of clustering quality it still performs very well on both computer-generated benchmarks and some widely used real-world networks when compared with a set of competing algorithms.

Keywords: complex network; community detection; ant colony optimization; simulated annealing; modularity Q

I. INTRODUCTION

Many complex systems in the real world exist in the form of networks, such as social networks, biological networks, and Web networks, which are collectively referred to as complex networks. A property that distinguishes complex networks from random networks is the ubiquity of community structures, which are intuitively densely intra-connected subnetworks with relatively sparse connections to the rest of the network [1]. The goal of the community detection problem (CDP) is to detect and interpret community structures in various kinds of complex network data. Research on CDP in complex networks is of fundamental importance both theoretically and in practical applications.

Many community detection algorithms have been developed recently. In terms of their basic strategies, they fall into two main categories: heuristic methods and optimization based methods. Heuristic methods solve CDP based on some intuitive assumptions, such as [1-5]. In contrast, optimization based methods solve CDP by transforming it into an optimization problem and trying to find an optimal solution for a predefined objective function, such as the modularity Q [6] employed in some algorithms [7-10]. In particular, some ant colony optimization (ACO) algorithms for CDP have also been proposed recently [11-13]. However, as far as we know, all of these algorithms are heuristic methods, and they are not able to deal with large-scale networks.

Aiming to overcome the drawbacks of the existing ACO methods for CDP, we propose an effective and efficient ant colony optimization algorithm, called MACO. In this method, each ant wishes to propagate the label of its current position (vertex) to some others, but under a constraint that locally optimizes modularity Q according to a simulated annealing idea. The local propagation that each ant performs in each cycle is affected by the actions of the earlier ants, which can be regarded as the underlying interactive mechanism of the ant colony. Moreover, MACO also adopts the idea of “layer and rule” to further improve its performance. To the best of our knowledge, MACO is the first ACO method that takes modularity Q as its fitness function. Additionally, we do not use the (time-consuming) pheromone mechanism, which makes the method more efficient and well suited to large networks.

978-1-4244-9953-3/11/$26.00 ©2011 IEEE

II. ALGORITHM

A. Problem Definition

In order to measure the quality of a detected community structure, Newman et al. [6] proposed a quantitative measure called modularity Q, which has been widely accepted by the scientific community. The basic idea of Q is that the more within-community edges there are (compared with a random connection pattern), the better the community division is.

Let N = (V, E) represent a weighted network, where V is the set of vertices (nodes) and E is the set of edges (links) connecting pairs of vertices. Assume that $A = (A_{ij})_{n \times n}$ is the adjacency matrix of N, where $A_{ij}$ is the weight of the link from vertex i to vertex j; $k_i = \sum_j A_{ij}$ is the degree of vertex i; $m = \frac{1}{2} \sum_{ij} A_{ij}$ is the total number of edges (the total edge weight in the weighted case); the function $s(u, v)$ equals 1 if u = v and 0 otherwise; and $c_i$ denotes the community to which vertex i is assigned. Then the modularity Q can be defined as (1).

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s(c_i, c_j) \qquad (1)$$

If we rewrite (1) as (2), Q can be expressed as the sum of a function f over all vertices.

$$Q = \frac{1}{2m} \sum_i f(i), \qquad f(i) = \sum_{j \in c_i} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \qquad (2)$$

From the viewpoint of each vertex, f is the actual number of edges between the vertex and its own community, minus the expected value of the same quantity if edges were placed at random without regard to the community structure. Thus f can also be regarded as a quality metric for communities with the same meaning as Q, but from the local view of each vertex.

Furthermore, from the analysis in [10], for all i ∈ V the global function Q is monotonically increasing in the local function f of each vertex i. This means that if changing the label of one vertex increases its own f while the labels of all other vertices stay unchanged, then Q of the entire network also increases. Based on this, we propose a new method that optimizes the global function Q by making each vertex optimize its local function f.
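The relation between Q in (1) and the local function f in (2) can be checked numerically on a toy graph. The following is a minimal sketch, not the authors' implementation: the function name `modularity_terms` and the dict-of-dicts adjacency format are assumptions made for illustration only.

```python
def modularity_terms(adj, label):
    """Return (Q, f) for a symmetric weighted adjacency dict and a labelling.

    adj[i][j] is the weight A_ij (hypothetical storage format);
    label[i] is the community c_i of vertex i.
    """
    degree = {v: sum(adj[v].values()) for v in adj}   # k_i = sum_j A_ij
    two_m = sum(degree.values())                       # 2m = sum_ij A_ij

    def f(i):
        # f(i) = sum over j in c_i of (A_ij - k_i k_j / 2m); j = i is included
        return sum(adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
                   for j in adj if label[j] == label[i])

    f_values = {i: f(i) for i in adj}
    q = sum(f_values.values()) / two_m                 # Q = (1/2m) sum_i f(i)
    return q, f_values


# Toy example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, {})[v] = 1.0
    adj.setdefault(v, {})[u] = 1.0

label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q, f_values = modularity_terms(adj, label)
```

For this graph m = 7, each triangle has total degree 7, and the two-triangle partition gives Q = 5/14, with f = 1 for the degree-2 vertices and f = 0.5 for the two bridge endpoints.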

B. The Main Idea

The basic idea of our ant colony optimization is as follows. At the beginning, the method initializes each vertex as its own community and randomly distributes some ants on the network. It then proceeds in a number of cycles. In each cycle, each ant crawls freely from one vertex to another and tries to propagate the label of its current position to some others. The propagation is directed by a simulated annealing strategy whose purpose is to locally optimize modularity Q. Finally, the algorithm stops when no vertex in the network changes its label.

We call the method described above SACO, a single-layer ACO. To further improve the performance of SACO, we propose a multi-layer ACO, called MACO, which is based on the tactic of “layer and rule”. In MACO, we first execute SACO on the original network as the first level, so as to obtain a (high-resolution) community structure of this network. Then we construct a higher-level network based on this partition, in which each detected community becomes a vertex and the sum of the weights of the edges between any two communities becomes the weight of the edge between them. Thereafter, we execute SACO again on the newly generated network at this higher level. This process is repeated until no further increase of modularity Q can be obtained. Finally, we select from the hierarchical community structures the partition with the maximum Q-value as the best one.
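The construction of the higher-level network can be illustrated with a short sketch. The function name `build_higher_level` and the dict-of-dicts storage are hypothetical. The paper does not say how within-community weight is stored; the sketch assumes one common convention, in which the self-loop entry of a merged vertex holds the sum of A_ij over all ordered pairs inside the community, so that vertex degrees and 2m are preserved.

```python
from collections import defaultdict


def build_higher_level(adj, label):
    """Collapse each community into one vertex (illustrative sketch).

    adj[i][j] is the symmetric weight A_ij; label[i] is the community of i.
    The weight between two communities is the sum of the weights of the
    edges between them. Assumed convention: coarse[a][a] sums A_ij over
    ordered pairs inside community a, which keeps every degree unchanged.
    """
    coarse = defaultdict(lambda: defaultdict(float))
    for i, neighbours in adj.items():
        row = coarse[label[i]]          # also registers isolated vertices
        for j, weight in neighbours.items():
            row[label[j]] += weight
    return {a: dict(row) for a, row in coarse.items()}


# Two triangles joined by the edge (2, 3), merged into communities 0 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, {})[v] = 1.0
    adj.setdefault(v, {})[u] = 1.0

coarse = build_higher_level(adj, {0: 0, 1: 0, 2: 0, 3: 3, 4: 3, 5: 3})
```

Here each merged vertex gets a self-loop entry of 6 (three internal edges, each counted in both directions) and the two merged vertices are joined by an edge of weight 1, so each keeps a degree of 7.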

C. Single-layer Ant Colony Optimization

Based on the above discussion, we describe the single-layer ant colony optimization, called SACO, in Fig. 1. SACO first initializes each vertex with a unique label in step 1, and then randomly distributes some ants on the network in step 2. At each iteration, each ant wishes to propagate the label of its current vertex to one of its neighbors. Thus, if possible, each ant randomly selects one of its neighbors with a different label in step 9 and moves there in step 10. It then decides whether to propagate the label from its previous position to the newly selected one in steps 14 and 16, based on a simulated annealing strategy whose purpose is to optimize the local function f. SACO stops when the iteration number limit is reached.

SACO differs from the existing ant algorithms in a distinct way. Unlike most of those methods, in which ants communicate with each other through pheromone, our ants communicate through an underlying interactive mechanism in which the actions of the current ant are affected by those of all previous ants. In other words, all our ants work collectively on the same community division, which realizes indirect communication within the ant colony. This interactive mechanism significantly reduces the time complexity of our algorithm.

D. Multi-layer Ant Colony Optimization

SACO is inherently a local optimization method: it detects communities only by moving single vertices among communities. Although it easily attains a community structure with a high resolution, this result may not correspond to a partition with the maximum Q-value. Therefore, we present the multi-layer MACO, which further improves the single-layer SACO by merging communities; merging is implemented through vertex moves on networks at a higher level. MACO is described in Fig. 2.

The network reconstruction in step 7 produces a new weighted network at a higher level, which also has self-loop edges. Fortunately, SACO is able to deal with this type of network. Function Q is also suitable for weighted networks with self-loop edges, and the Q-value of a community division of a higher-level network is equal to the Q-value of the division obtained by mapping this result back to the original network. In fact, running SACO on networks at different levels can always be regarded as optimizing the modularity of the original network.

III. EXPERIMENT

To evaluate the performance of MACO, we tested it on computer-generated networks as well as on some widely used large-scale real networks.


Procedure SACO
Input: N, L, p, T, c; // N denotes the network, L is the iteration number limit, p is the proportion of ant colony, T is the initial temperature, and c is the annealing coefficient
Output: C; // the community structure
Begin
1   For ∀ v ∈ V, Cv(0) ← v; // initialize the labels of all vertices
2   Randomly distribute n′ ants on network N; // n′ = p*n, n is the number of vertices
3   For i = 1: L
4     For j = 1: n′
5       If the vertex of ant j and all its neighbors have the same label
6         Ant j randomly selects one of its neighbors and moves there;
7       Else
8         previous_vertex ← The vertex where ant j is situated;
9         current_vertex ← Randomly selects one of its neighbors with a different label;
10        Ant j moves to current_vertex;
11        fcur ← Compute the f-value of current_vertex with its own label;
12        f′cur ← Compute the f-value of current_vertex with the previous_vertex’s label;
13        If f′cur > fcur
14          Ccurrent_vertex(i) ← Cprevious_vertex(i) with probability 1;
15        Else
16          Ccurrent_vertex(i) ← Cprevious_vertex(i) with the annealing probability p; // p = exp(−(fcur − f′cur) / T), a type of simulated annealing strategy
17        End
18      End
19    End
20    T ← T*c; // cooling
21  End
End

Figure 1. The algorithm flow of SACO

Procedure MACO
Input: N; // the original network
Output: best_Partition; // the community division corresponding to maximum Q
Begin
1   i ← 0;
2   N(i) ← N; // N(i) denotes the i-level network
3   Do
4     i ← i+1;
5     H(i) ← Run SACO on N(i); // H(i) denotes the partition of N(i)
6     Q(i) ← Compute the Q-value of H(i); // Q(i) denotes the Q-value of H(i)
7     N(i) ← Build a higher level network based on H(i); // take each community as a vertex, and the sum of the weight of edges between any two communities as the weight between them
8   Until Q(i) ≤ Q(i-1)
9   best_Partition ← H(i-1); // the partition with maximum Q-value
End

Figure 2. The algorithm flow of MACO
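The SACO loop of Fig. 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `saco`, the dict-of-dicts adjacency format, the `seed` argument, the skipping of isolated vertices, and the lower bound on the temperature (added only to avoid a division by zero after many cooling steps) are all choices made here. The default parameter values are those reported in the paper.

```python
import math
import random


def saco(adj, L=50, p=0.6, T=500.0, c=0.1, seed=0):
    """Single-layer sketch: adj[i][j] is the symmetric weight A_ij."""
    rng = random.Random(seed)
    nodes = list(adj)
    degree = {v: sum(adj[v].values()) for v in nodes}     # k_i
    two_m = sum(degree.values())                          # 2m
    label = {v: v for v in nodes}                         # step 1
    community_degree = dict(degree)                       # total degree per label

    def f_value(v, lab):
        # f of vertex v if it carried label `lab`, following (2)
        inside = sum(w for u, w in adj[v].items() if u == v or label[u] == lab)
        total = community_degree[lab] + (0.0 if label[v] == lab else degree[v])
        return inside - degree[v] * total / two_m

    n_ants = max(1, round(p * len(nodes)))                # step 2
    ants = [rng.choice(nodes) for _ in range(n_ants)]

    for _ in range(L):                                    # step 3
        for j in range(n_ants):                           # step 4
            previous = ants[j]
            neighbours = [u for u in adj[previous] if u != previous]
            if not neighbours:
                continue                                  # assumption: isolated ant stays put
            different = [u for u in neighbours if label[u] != label[previous]]
            if not different:                             # steps 5-6
                ants[j] = rng.choice(neighbours)
                continue
            current = rng.choice(different)               # step 9
            ants[j] = current                             # step 10
            f_cur = f_value(current, label[current])      # step 11
            f_new = f_value(current, label[previous])     # step 12
            if f_new > f_cur:                             # steps 13-14
                accept = True
            else:                                         # steps 15-16
                accept = rng.random() < math.exp(-(f_cur - f_new) / T)
            if accept:
                community_degree[label[current]] -= degree[current]
                label[current] = label[previous]
                community_degree[label[current]] += degree[current]
        T = max(T * c, 1e-12)                             # step 20, with assumed floor
    return label


# Usage on two 4-cliques joined by one edge.
edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
edges += [(a, b) for a in range(4, 8) for b in range(a + 1, 8)]
edges.append((3, 4))
adj = {}
for u, v in edges:
    adj.setdefault(u, {})[v] = 1.0
    adj.setdefault(v, {})[u] = 1.0
labels = saco(adj, seed=1)
```

Because every ant reads and writes the same `label` dictionary, each move already sees the changes made by earlier ants, which is the pheromone-free interaction described in Section II-C. With T = 500 and c = 0.1 the temperature drops below 1 after three cooling steps, so nearly all moves are accepted at first and the search soon becomes almost greedy.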

In the experiments, MACO is compared with four representative and efficient community detection algorithms: FN [7] and FUA [9] are optimization based methods, while FEC [3] and LPA [4] are heuristic methods. MACO has four parameters: T, c, L and p. T and c are the simulated annealing parameters: T denotes the initial temperature and c the annealing coefficient. L and p are the ACO parameters: L denotes the iteration number limit, and p denotes the ratio of the ant colony size to the number of nodes in the network. Based on our experience and some experimental results, we set T = 500, c = 0.1, L = 50 and p = 0.6 in this paper.

A. Computer-generated Networks

We adopt randomly generated synthetic networks from the Newman model [1], which have a known community structure, to evaluate the performance of the algorithms. To measure accuracy we employ the widely used Normalized Mutual Information (NMI) [14]. In this benchmark, each graph consists of n = 128 vertices divided into 4 groups of 32 nodes. Each vertex has on average zin edges connecting it to members of the same group and zout edges to members of other groups, with zin and zout chosen such that the total expected degree is zin + zout = 16. As zout is increased from small initial values, the resulting graphs pose greater and greater challenges to the community detection algorithms. Fig. 3 shows the NMI accuracy attained by each algorithm as a function of zout from 1 to 12. MACO is competitive with FUA and outperforms the other three methods in terms of NMI accuracy on this benchmark.

B. Real-world Networks

As real networks may have topological properties that differ from those of synthetic ones, we adopt several widely used large-scale real networks to further evaluate the algorithms. These networks are listed in Table 1. Their sizes range from thousands of nodes to nearly a million nodes. Because the inherent community structure of real networks is usually unknown, we use the most commonly used measure, modularity Q [6], to evaluate the performance of the algorithms. Table 2 shows the average results (over 50 runs) comparing MACO with FN, FEC, LPA and FUA in terms of Q on the real-world networks described in Table 1. The clustering quality of MACO is slightly worse than that of FUA and better than that of the other three algorithms. This shows that MACO is effective on large-scale real networks.
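The synthetic benchmark of Section III-A can be generated with a few lines of Python. The sketch below is one common reading of the benchmark in [1], not the authors' generator: it assumes that every pair of vertices is linked independently, with probability zin/31 inside a group and zout/96 between groups, which gives the stated expected degree of 16. The function name `newman_benchmark` and its arguments are hypothetical.

```python
import random


def newman_benchmark(z_out, n_groups=4, group_size=32, total_degree=16, seed=0):
    """Return (adj, group) for one random graph with planted groups."""
    rng = random.Random(seed)
    n = n_groups * group_size
    z_in = total_degree - z_out
    p_in = z_in / (group_size - 1)        # 31 possible partners inside a group
    p_out = z_out / (n - group_size)      # 96 possible partners outside it
    group = {v: v // group_size for v in range(n)}
    adj = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            probability = p_in if group[u] == group[v] else p_out
            if rng.random() < probability:
                adj[u][v] = 1.0           # store the edge in both directions
                adj[v][u] = 1.0
    return adj, group


adj, group = newman_benchmark(z_out=6, seed=2)
average_degree = sum(len(adj[v]) for v in adj) / len(adj)
```

The returned `group` dictionary is the known community structure against which a detected partition would be scored with NMI.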

Figure 3. Comparison of MACO with FN, FEC, LPA and FUA on Newman benchmarks in terms of NMI accuracy. Each point is an average over 50 graph realizations. (Plot: x-axis, number of inter-community edges per vertex zout, from 0 to 12; y-axis, normalized mutual information, from 0 to 1; one curve each for FN, FEC, LPA, FUA and MACO.)

TABLE I. REAL-WORLD NETWORKS USED HERE

Networks    |V|       |E|         Descriptions
word        7,207     31,784      Semantic network [2]
internet    22,963    48,436      A snapshot of the Internet by Mark Newman [15]
arxiv       56,276    315,921     Scientific collaboration networks [16]
www         325,729   1,090,108   Linked WWW pages in the nd.edu domain [17]
amazon      473,315   3,505,519   Amazon products from 2003 all [18]
webgoogle   855,802   4,291,352   Web graph released by Google in 2002 [18]

TABLE II. COMPARE MACO WITH FN, FEC, LPA AND FUA IN TERMS OF FUNCTION Q ON LARGE-SCALE REAL NETWORKS

Q-value     FN       FEC      LPA      FUA      MACO
word        0.4665   0.4609   0.3340   0.5246   0.5150
internet    0.6378   0.6104   0.4978   0.6613   0.6487
arxiv       0.7153   0.7276   0.6399   0.7801   0.7724
www         -        0.7962   0.8422   0.9455   0.9315
amazon      -        0.8088   0.6733   0.8478   0.8473
webgoogle   -        0.9409   0.8024   0.9771   0.9719

IV. CONCLUSION

In this paper, a multi-layer ant colony optimization (MACO) for community detection has been proposed. From the viewpoint of each vertex, MACO can be considered a local optimization algorithm: each ant tries to optimize the local function f with the aid of a simulated annealing strategy, so as to optimize the global modularity function Q. From the viewpoint of the entire network, MACO can be regarded as a type of label diffusion algorithm, in which each ant tries to propagate the label of its current position (vertex) to some others. Moreover, MACO employs the idea of “layer and rule” to further improve its performance. Our experimental results show that MACO is highly effective and efficient for discovering communities.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant Nos. 60873149 and 60973088, and by the National High-Tech Research and Development Plan of China under Grant No. 2006AA10Z245.

REFERENCES

[1] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proc. Natl. Acad. Sci., vol. 99, pp. 7821-7826, June 2002.
[2] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structures of complex networks in nature and society,” Nature, vol. 435, pp. 814-818, June 2005.
[3] B. Yang, W. K. Cheung, and J. Liu, “Community mining from signed social networks,” IEEE Trans. Knowl. Data Eng., vol. 19, pp. 1333-1348, September 2007.
[4] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear-time algorithm to detect community structures in large-scale networks,” Phys. Rev. E, vol. 76, pp. 036106, September 2007.
[5] D. Jin, B. Yang, C. Baquero, D. Liu, D. He, and J. Liu, “Markov random walk under constraint for discovering overlapping communities in complex networks,” J. Stat. Mech., vol. 2011, pp. P05031, May 2011.
[6] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, pp. 026113, February 2004.
[7] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, pp. 066133, June 2004.
[8] R. Guimera and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, pp. 895-900, February 2005.
[9] V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Stat. Mech., vol. 2008, pp. P10008, July 2008.
[10] D. Jin, D. He, D. Liu, and C. Baquero, “Genetic algorithm with local search for community mining in complex networks,” in Proceedings of the 22nd International Conference on Tools with Artificial Intelligence (ICTAI’10), 2010, pp. 105-112.
[11] Y. Liu, J. Luo, H. Yang, and L. Liu, “Finding closely communicating community based on ant colony clustering model,” in Proceedings of the 2010 International Conference on Artificial Intelligence and Computational Intelligence (AICI’10), 2010, pp. 127-131.
[12] S. Sadi, S. G. Oguducu, and A. S. Uyar, “An efficient community detection method using parallel clique-finding ants,” in Proceedings of IEEE Congress on Computational Intelligence (CEC’10), 2010, pp. 1-7.
[13] D. Jin, D. Liu, B. Yang, J. Liu, C. Baquero, and D. He, “Ant colony optimization with markov random walk for clustering in complex networks,” in Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’11), 2011, pp. 123-134.
[14] L. Danon, J. Duch, A. D. Guilera, and A. Arenas, “Comparing community structure identification,” J. Stat. Mech., vol. 2005, pp. P09008, September 2005.
[15] Network data from Mark Newman’s home page, http://www-personal.umich.edu/~mejn/netdata/, 2006.
[16] M. E. J. Newman, “The structure of scientific collaboration networks,” Proc. Natl. Acad. Sci., vol. 98, pp. 404-409, January 2001.
[17] Center for Complex Network Research, http://www.nd.edu/~networks/resources/, 2007.
[18] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Statistical properties of community structure in large social and information networks,” in Proceedings of the 17th International Conference on World Wide Web (WWW’08), 2008, pp. 695-704.