Wu ZH, Lin YF, Gregory S et al. Balanced multi-label propagation for overlapping community detection in social networks. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(3): 468–479 May 2012. DOI 10.1007/s11390-012-1236-x
Balanced Multi-Label Propagation for Overlapping Community Detection in Social Networks Zhi-Hao Wu1 (武志昊), You-Fang Lin1 (林友芳), Steve Gregory2 , Huai-Yu Wan1 (万怀宇), Student Member, CCF, and Sheng-Feng Tian1 (田盛丰) 1
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
2
Department of Computer Science, University of Bristol, Bristol BS8 1UB, U.K.
E-mail: {zhihaowu, yflin}@bjtu.edu.cn;
[email protected]; {huaiyuwan, sftian}@bjtu.edu.cn Received August 31, 2011; revised January 6, 2012. Abstract In this paper, we propose a balanced multi-label propagation algorithm (BMLPA) for overlapping community detection in social networks. As well as its fast speed, another important advantage of our method is good stability, which other multi-label propagation algorithms, such as COPRA, lack. In BMLPA, we propose a new update strategy, which requires that community identifiers of one vertex should have balanced belonging coefficients. The advantage of this strategy is that it allows vertices to belong to any number of communities without a global limit on the largest number of community memberships, which is needed for COPRA. Also, we propose a fast method to generate “rough cores”, which can be used to initialize labels for multi-label propagation algorithms, and are able to improve the quality and stability of results. Experimental results on synthetic and real social networks show that BMLPA is very efficient and effective for uncovering overlapping communities. Keywords
overlapping community detection, multi-label propagation, social network
1 Introduction
Community structure, which indicates groups of vertices such that vertices within a group are much more connected to each other than to the rest of the network, is one of the key properties of social networks[1] . In many cases, communities are always meaningful. For example, a community may represent an organization, a department of a corporation or a group of people with similar interests. Community detection has been used in social network analysis to answer a wide range of questions regarding the behavior and interaction patterns of people. Usually, communities in social networks can overlap with each other. For example, a researcher may belong to multiple research groups, and one person often joins in different hobby groups in the real world or on the Internet. Thus detecting overlapping communities is very important in understanding and analyzing the structures of social networks. At the same time, community structures also supply a proper perspective for studying the dynamics of social networks. Since the amount of social network data has proliferated in recent years, such as huge blog or SNS (Social
Network Service) datasets, detecting communities, especially overlapping communities, in very large social networks has been an active field of research for years. For large networks, the speed of algorithms is very important. There have been some algorithms that can detect non-overlapping communities in large networks efficiently[2-5]. One of the fastest algorithms proposed to date is the label propagation algorithm (LPA) of Raghavan et al.[2]. The algorithm starts by giving each vertex a unique label. In each iteration over all vertices, each vertex is given the label that most of its neighbors have. As well as its near-linear time complexity, it is very simple and has no parameters. Label propagation has been an active research direction, and many research articles regarding LPA have been published in recent years, such as [6-10]. However, there are very few algorithms that can uncover overlapping communities. COPRA[9] is, to our knowledge, the first algorithm proposed to detect overlapping communities by using label propagation. It is very fast and allows each vertex to belong to at most v communities, where v is a parameter of COPRA. However, the parameter v is vertex-independent. If there are some vertices with a small number of community memberships and some others with a large number of community memberships, it will be hard for COPRA to choose a suitable v to satisfy both kinds of vertices at the same time. A new method proposed in this paper will solve this problem.
There are some other methods to detect overlapping communities designed in recent years[11-15]. OSLOM[14] is a recently proposed method based on the local optimization of a fitness function expressing the statistical significance of clustering with respect to random fluctuations. The authors state that OSLOM is able to detect overlapping communities on very large networks if a hard partition generated by a fast non-overlapping community detection algorithm (e.g., BGLL[3]) is given, and then OSLOM can refine the hard partition to overlapping communities. Another kind of method to find overlapping communities is to cluster links, such as [15]. The first step of this method is to construct a line graph of the original network. Then a non-overlapping community detection algorithm can be used to find link communities. Since a vertex can belong to multiple links, communities can overlap with each other. The disadvantage of these methods is their high memory complexity, because a vertex with degree k will become a k-clique in the corresponding line graph. Thus for a large network with millions of vertices or billions of edges, it may require hundreds of GB of memory.
In this paper, we propose a new multi-label propagation algorithm to detect overlapping communities in social networks. Our algorithm extends the LPA algorithm, allowing each vertex to have multiple labels that have balanced belonging coefficients, without COPRA’s global limit on the number of community memberships.
The paper is structured as follows. In Section 2, we present our new method in detail. Experimental results on synthetic and real social networks are shown in Section 3. Conclusions appear in Section 4.
Regular Paper. This work was partially supported by the Fundamental Research Funds for the Central Universities of China, the National Natural Science Foundation of China under Grant No. 60905029, the Natural Science Foundation of Beijing of China under Grant No. 4112046. ©2012 Springer Science + Business Media, LLC & Science Press, China
Zhi-Hao Wu et al.: Balanced Multi-Label Propagation
2 Multi-Label Propagation for Overlapping Community Detection
2.1 COPRA Algorithm
As an extension of the LPA algorithm, the COPRA algorithm[9] allows a vertex label to contain more than one community identifier. The main procedure of COPRA is as follows. 1) To initialize, every vertex is given a unique label with belonging coefficient setting to 1. 2) Then, repeatedly, each vertex x updates its labels by summing and normalizing the belonging coefficients of vertices in the neighbor set of x. To avoid every
vertex owning all community identifiers at the end, COPRA algorithm uses the parameter v to limit the maximum number of communities to which any vertex can belong. After several iterations, if the stop criteria proposed by Gregory[9] is satisfied, the propagation procedure stops. 3) Remove communities that are totally contained by others. 4) Split discontinuous communities. COPRA adopts synchronous updating strategy in the propagation procedure, because it seems to give better results than asynchronous updating. In the second step, to retain more than one community identifier in each label without keeping all of them, COPRA calculates a belonging coefficient for each label of each vertex. Then the labels whose belonging coefficient is less than some threshold are deleted. Here 1/v is used as the global threshold, where v is the parameter of COPRA. Because the belonging coefficients in each label sum to 1, v represents the maximum number of communities to which any vertex can belong. Since v is a vertex-independent parameter, it is possible that all belonging coefficients of labels of one vertex are less than the threshold. If so, COPRA retains only the label that has the greatest belonging coefficient, and deletes all others. If more than one label has the same maximum belonging coefficient, below the threshold, COPRA retains a randomly selected one of them. This random selection reduces the stability of the algorithm. COPRA keeps the good computational performance of LPA, and is able to give good results in many cases. However, since it adopts a global vertex-independent parameter, if there are some vertices with various numbers of community memberships in a network, it will be hard to choose a suitable value of v. 
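The update rule described above can be sketched for a single vertex as follows. This is only an illustrative sketch, not Gregory's implementation: the function name `copra_update`, the dictionary-based label layout and the `rng` argument are assumptions made here, and synchronous updating, the stop criterion and post-processing are left out.

```python
import random
from collections import defaultdict


def copra_update(neighbor_labels, v, rng=random):
    """One COPRA-style label update for a single vertex (illustrative).

    neighbor_labels: list of dicts {community_id: belonging_coefficient},
                     one dict per neighbor (assumed non-empty).
    v:               global limit on the number of communities per vertex.
    """
    # Sum the belonging coefficients over all neighbors, then normalize.
    summed = defaultdict(float)
    for label in neighbor_labels:
        for c, b in label.items():
            summed[c] += b
    total = sum(summed.values())
    coeffs = {c: b / total for c, b in summed.items()}

    # Delete identifiers whose coefficient is below the threshold 1/v.
    kept = {c: b for c, b in coeffs.items() if b >= 1.0 / v}
    if not kept:
        # Everything is below the threshold: retain one identifier with
        # the greatest coefficient, choosing at random among ties.
        best = max(coeffs.values())
        ties = [c for c, b in coeffs.items() if b == best]
        kept = {rng.choice(ties): best}

    # Renormalize so the retained coefficients sum to 1.
    norm = sum(kept.values())
    return {c: b / norm for c, b in kept.items()}
```

The random tie-break in the fallback branch is the step the text identifies as a source of instability: two runs on the same input can retain different identifiers.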
For example, if most vertices are non-overlapping and a small fraction of vertices overlap with a large number of community memberships, a small value of v will make it hard to identify the overlapping vertices, while a large value of v will cause some non-overlapping vertices to be identified as overlapping. In the next subsection, we will propose a new update strategy with a vertex-dependent parameter.
2.2 Balanced Belonging Coefficients Update Strategy
In the COPRA algorithm, each vertex has at most v labels, while the balanced belonging coefficients (BBC) update strategy proposed in this paper does not limit the number of communities that a vertex may belong to. The BBC update strategy requires that labels of a vertex have balanced belonging coefficients.
J. Comput. Sci. & Technol., May 2012, Vol.27, No.3
As with COPRA, we label each vertex x with a set of pairs (c, b), where c is a community identifier and b is a belonging coefficient, indicating the strength of x’s membership of community c, such that all belonging coefficients for x sum to 1. Each propagation step would set x’s label to the union of its neighbors’ labels and sum the belonging coefficients of the communities over all neighbors. Then we find the label cmax with maximum belonging coefficient bmax. Whether a community identifier should be retained is judged by (1).
$$\frac{b}{b_{\max}} \geq p, \tag{1}$$
where p is the threshold parameter, and p ∈ (0, 1]. If the belonging coefficient of a community identifier is balanced enough with the maximum belonging coefficient, i.e., (1) is satisfied, it is retained. At last, the belonging coefficients of the retained labels are normalized.
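As a rough illustration of this update for one vertex, the following Python sketch sums the neighbors' coefficients, keeps every identifier whose ratio to the maximum is at least p, and normalizes the result. The function name `bbc_update`, the dictionary layout and the example data are assumptions made for illustration; the example values are not those of Fig.1.

```python
from collections import defaultdict


def bbc_update(neighbor_labels, p):
    """Balanced-belonging-coefficient update for one vertex (illustrative).

    neighbor_labels: list of dicts {community_id: belonging_coefficient},
                     one dict per neighbor.
    p:               balance threshold, with 0 < p <= 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")

    # Union of the neighbors' labels, summing coefficients per community.
    summed = defaultdict(float)
    for label in neighbor_labels:
        for c, b in label.items():
            summed[c] += b
    if not summed:
        return {}

    # Retain identifiers that are balanced enough with the maximum.
    b_max = max(summed.values())
    kept = {c: b for c, b in summed.items() if b / b_max >= p}

    # Normalize the retained coefficients so they sum to 1.
    total = sum(kept.values())
    return {c: b / total for c, b in kept.items()}


# Hypothetical vertex with five neighbors (made-up coefficients).
neighbors = [
    {"a": 1.0},
    {"a": 0.5, "b": 0.5},
    {"b": 1.0},
    {"b": 0.5, "c": 0.5},
    {"c": 1.0},
]
print(bbc_update(neighbors, 0.75))  # ratios 0.75, 1, 0.75: all three kept
print(bbc_update(neighbors, 0.80))  # only "b" reaches the threshold
```

Unlike a fixed cap on the number of labels, the number of identifiers retained here depends on how evenly the summed coefficients are spread for the vertex in question.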
Fig.1. BBC Propagation of multiple labels for the central vertex, p = 0.75.
Fig.1 shows an example of the update procedure. The central vertex has five neighbors. To calculate the labels of this vertex, first we sum the belonging coefficients for each label of all neighbors. After finding the maximum value 9/6, each value is divided by the maximum value to get a ratio. Finally, the labels with ratio larger than the threshold p are kept as the final labels of the central vertex in this iteration. In each iteration, all vertices’ labels are recalculated using this method, in a random order. If we initialize labels of all vertices as other label propagation algorithms do, i.e., each vertex has a unique label, the new update strategy clearly will not work well. This is because no matter what value of p is chosen, each vertex will own all community identifiers of its neighbors. Therefore we propose a “rough core” extraction algorithm to give initial labels before label propagation.
2.3 Rough Core
Maximal cliques can be recognized as the most common cores of communities naturally, but the computational complexity of finding all maximal cliques is too high for large networks. Some non-overlapping community detection algorithms, such as BGLL algorithm[3], have relatively low complexity, but they can only generate disjoint communities. For networks with overlapping communities, some labels may be totally discarded before the label propagation procedure. Although the first stage of GCE algorithm[13] is able to find overlapping seeds, it is still not efficient enough for large social networks. In this subsection, we propose an algorithm, called RC (Rough Cores), to extract overlapping rough cores efficiently.
Table 1. RC Algorithm
RC():
  give initial unique label for each vertex;
  sort all vertices in the order of their degrees from large to small in vSet;
  foreach vertex i in sorted vSet:
    if ki > 3 and i.free = true:
      find vertex j which has the largest degree in N(i) and j.free = true;
      if j ≠ null:
        add vertices i and j to a new core;
        commNeiber ← N(i) ∩ N(j);
        sort commNeiber in the order of vertex degree from small to large;
        while commNeiber ≠ null:
          foreach vertex h in sorted commNeiber:
            add h to the core;
            delete vertices not in N(h) from commNeiber;
            delete vertex h from commNeiber;
        if sizeof(core) > 3:
          add core to cores;
  return cores;
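The procedure of Table 1 can be sketched in Python roughly as follows. Several details are assumptions rather than statements of the paper: the graph is given as a dictionary of neighbor sets, the strict comparisons against 3 follow the table as printed here and are exposed as the hypothetical parameters `min_degree` and `min_size`, and vertices are marked as no longer "free" only when their core is accepted, a step the table does not show explicitly.

```python
def rough_cores(adj, min_degree=3, min_size=3):
    """Extract overlapping rough cores (illustrative sketch of Table 1).

    adj: dict mapping each vertex to the set of its neighbors
         (undirected graph, assumed to have no self-loops).
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    free = {u: True for u in adj}
    cores = []

    # Visit vertices from the largest degree to the smallest.
    for i in sorted(adj, key=lambda u: degree[u], reverse=True):
        if not (degree[i] > min_degree and free[i]):
            continue

        # Second seed: the free neighbor of i with the largest degree.
        candidates = [u for u in adj[i] if free[u]]
        if not candidates:
            continue
        j = max(candidates, key=lambda u: degree[u])

        core = [i, j]
        common = adj[i] & adj[j]
        # Grow the clique by repeatedly adding the common neighbor with
        # the smallest degree and shrinking the candidate set to the
        # vertices adjacent to everything in the core so far.
        while common:
            h = min(common, key=lambda u: degree[u])
            core.append(h)
            common = (common & adj[h]) - {h}

        if len(core) > min_size:
            # Assumption: only members of accepted cores stop being free.
            for u in core:
                free[u] = False
            cores.append(core)
    return cores
```

Because the common neighbors added to a core are not required to be free, two cores produced this way may share vertices, which is what allows the initial labels to overlap.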
In this algorithm, we do not find all maximal cliques, because many maximal cliques that highly overlap only indicate the same core. A problem here is, for a core indicated by more than one clique, which clique should be extracted to initialize the labels? Through experimental observation, we conclude that a small clique will be better for the final results. Because the label propagation algorithm has strong infectiousness, larger cliques will cause “monster communities” more easily than smaller ones among the final overlapping communities. Thus, the “rough core” here indicates the smallest maximal clique starting from two first chosen vertices. To find a new rough core, first we find a “free” vertex u with the largest degree, and then a second “free” vertex v with the largest degree will be found in N(u). Here a “free” vertex is one not yet belonging to any cores and N(u) is the neighbor set of vertex u. Based on the two chosen vertices, the smallest maximal clique is found as a rough core, by adding the vertex with the smallest degree from the set of common neighbors of vertices of the current core iteratively. In this algorithm, ki represents the degree of vertex i. We give each vertex a “free” flag. If vertex v already belongs to some cores, v.free = false, otherwise v.free = true.
2.4 BMLPA Algorithm
The complete BMLPA algorithm is shown in Tables 2∼4. For consistency, apart from the initialization and update strategy function we keep the remaining parts of COPRA in BMLPA, including the synchronous updating, stop criteria and post processing procedure. This way, it is convenient to estimate each of the contributions of this paper. This algorithm keeps two vectors of vertex labels: old and new; old.x (resp. new.x) denotes the previous (resp. updated) label for vertex x. Each vertex label is a set of pairs (c, b), where c is a community identifier and b is the belonging coefficient. N(x) is the set of neighbors of vertex x.
Table 2. BMLPA Algorithm
1: Initialize old using cores generated by RC();
2: foreach vertex x:
     Propagate_bbc(x, old, new);
3: if id(old) = id(new):
     min ← mc(count(old), count(new));
   else:
     min ← count(new);
4: if min ≠ oldmin:
     old ← new;
     oldmin ← min;
     Repeat from step 2.
5: foreach vertex x:
     ids ← id(old.x);
     foreach c in ids:
       if for some g, (c, g) is in coms, (c, i) in sub:
         coms ← coms − {(c, g)} ∪ {(c, g ∪ {x})};
         sub ← sub − {(c, i)} ∪ {(c, i ∩ ids)};
       else:
         coms ← coms ∪ {(c, {x})};
         sub ← sub ∪ {(c, ids)};
6: foreach (c, i) in sub:
     if i ≠ {}, coms ← coms − (c, g);
7: Split discontinuous communities in coms;
Table 3. Propagation Function in BMLPA
Propagate_bbc(x, source, dest):
  dest.x ← {};
  foreach y in N(x):
    foreach (c, by) in source.y:
      b ← by;
      if for some bx, (c, bx) is in dest.x:
        dest.x ← dest.x − {(c, bx)} ∪ {(c, bx + b)};
      else:
        dest.x ← dest.x ∪ {(c, b)};
  bmax ← 0;
  foreach (c, b) in dest.x:
    if b > bmax:
      bmax ← b;
  foreach (c, b) in dest.x:
    if b/bmax < p:
      dest.x ← dest.x − {(c, b)};
  Normalize(dest.x);
Table 4. Other Functions in BMLPA
Normalize(l):
  sum ← 0;
  foreach (c, b) in l:
    sum ← sum + b;
  foreach (c, b) in l:
    l ← l − {(c, b)} ∪ {c, b/sum};
id(l):
  ids ← {};
  foreach l.x in l:
    ids ← ids ∪ id(l.x);
  Return ids;
id(x):
  ids ← {};
  foreach (c, b) in x:
    ids ← ids ∪ {c};
  Return ids;
count(l):
  counts ← {};
  foreach l.x in l:
    foreach (c, b) in l.x:
      if for some n, (c, n) is in counts:
        counts ← counts − {(c, n)} ∪ {(c, n + 1)};
      else:
        counts ← counts ∪ {(c, 1)};
  Return counts;
mc(cs1, cs2):
  cs ← {};
  foreach (c, n1) in cs1, (c, n2) in cs2:
    cs ← cs ∪ {(c, min(n1, n2))};
  Return cs;
2.5 Complexity Analysis
The time complexity of each step of BMLPA is estimated below, where n is the number of vertices, m is the number of edges, $v_{avg}$ is the average number of labels to which each vertex belongs and $d_{avg}$ is the average degree. 1) To calculate rough cores, first the sort procedure according to vertex degree takes $O(n \log n)$. Then, since vertices can belong to multiple cores and the search only starts from “free” vertices, it takes $O(v_{avg} d_{avg} n)$ to find all cores. The total time for initialization is therefore $O(n(\log n + v_{avg} d_{avg}))$. 2) The only difference between BMLPA and COPRA in the complexity of the propagation phase is that COPRA limits the maximum number of memberships of each vertex to v, while BMLPA does not limit it. The total time for the whole phase of BMLPA is therefore $O(v_{avg} m \log(v_{avg} m/n))$. 3) In the remaining steps, BMLPA has the same complexity as COPRA except for replacing v with $v_{avg}$. Steps 2∼4 are repeated, so the time per iteration is $O(v_{avg} m \log(v_{avg} m/n))$. The initial and final steps take time $O(v_{avg}(m + n) + (v_{avg}^{3} + \log n)n)$. Assuming that $v_{avg}$ is always a small integer, for a sparse network, the time complexity is therefore $O(n \log n)$ for each iteration.
3 Experiments
3.1 Methodology
In this subsection, we evaluate BMLPA algorithm on both synthetic networks and real social networks. There are several kinds of synthetic networks. One of these is the well-known Girvan-Newman benchmark[16] , but this kind of synthetic network contains few important features of real-world networks. We adopt the LFR benchmark[17] , a class of graphs with planted community structure and heterogeneous distributions of vertex degree and community size. Since the LFR benchmark supports planted communities, we can use a variant of the Normalized Mutual Information (NMI) measure, which has been extended to handle overlapping communities, to evaluate the results. The detailed definition of the NMI measure used in this paper can be found in [12]. NMI = 1 means that the found communities perfectly match the real communities. Smaller values of NMI indicate worse detection results. For real social networks, since it is hard to know the real communities, usually a quality function is employed to measure the quality of results. Modularity Q[18] is a famous quality function for non-overlapping communities. A high value of modularity always indicates more intra community edges than would be expected by chance. Recently some researchers also propose some overlap modularity measures[19-21] . Nicosia et al. have extended modularity Q to overlap modularity, called Qov [19] . The value of overlap modularity depends on the number of communities to which each
vertex belongs and the strength of its membership to each community. We assume that each vertex belongs equally to all of the communities of which it is a member. For Qov, a function f is used, and we define it as suggested in [19]:
$$f(x) = 60x - 30. \tag{2}$$
Although modularity has some problems, such as the resolution limit[22], it is a widely used index to evaluate the results of community detection. We will use the widely used Qov measure for experiments on real social networks but shall not draw strong conclusions about the accuracy of the community detection, for the above reasons. In the following experiments, we mainly compare BMLPA with COPRA and RC-COPRA (COPRA with RC initialization) to verify the effectiveness of the initialization procedure with RC and the BBC update strategy. BMLPA is also compared with a recently proposed algorithm, OSLOM[14], which is fed with RC as the seeds to optimize in our experiments.
3.2 Comparison with COPRA
In this subsection we compare three multi-label propagation algorithms, i.e., COPRA, RC-COPRA and BMLPA. RC-COPRA stands for the version of COPRA with initialization using RC. By comparing COPRA and RC-COPRA, we can evaluate the contribution of RC to COPRA. Based on RC, we can compare RCCOPRA and BMLPA to estimate the improvement of the new update strategy of this paper. For COPRA and RC-COPRA, the best value of the parameter v is searched from 1 to 15 for each network. For BMLPA, the best value of p is searched from 0.1 to 1 with an increment of 0.05 each time. We run each algorithm 100 times on each network for each value of the parameter in this experiment. Table 5 shows the comparison results on a real social network PGP[27] and six LFR synthetic networks. The
Table 5. Comparison of Best Results of COPRA, RC-COPRA and BMLPA on a Social Network (PGP) and Six LFR Synthetic Networks
Networks    COPRA                    RC-COPRA                 BMLPA
            Para.  Avg.    Std.      Para.  Avg.    Std.      Para.  Avg.    Std.
PGP         11     0.7802  0.0167    11     0.8199  0.0023    0.15   0.8255  3.35E−16
LS          4      0.9842  0.0110    4      0.9937  0.0014    0.65   0.9929  1.00E−15
HD          5      0.9899  0.0139    4      1.0000  0.0000    0.75   1.0000  0.00E+00
Lmu         3      0.5248  0.1297    4      0.7957  0.0702    0.55   0.8844  2.23E−15
LC          4      0.9899  0.0109    4      0.9930  0.0046    0.55   0.9986  5.58E−16
LOn         6      0.5240  0.2465    4      0.6731  0.1080    0.55   0.7668  0.00E+00
LOm         5      0.5508  0.1768    4      0.6176  0.0234    0.60   0.8439  1.12E−16
Note: LS: network with large scale, HD: network with high density, Lmu: network with large mu, LC: network with large communities, LOn: network with large On and LOm: network with large Om.
column “Para.” shows the best parameter for each algorithm on each network. The column “Avg.” gives the average value of the evaluation measure of the best parameter. For PGP network this stands for average value of Qov and for LFR synthetic networks, it stands for average value of NMI. “Std.” stands for the standard deviation. First, we test the three algorithms on PGP network, which is a social network with 10 680 vertices and 24 316 edges. From Table 5 we can see that both COPRA and RC-COPRA find their best results with parameter v equal to 11, but RC-COPRA gives a larger value of Qov with a lower standard deviation. It means that the initial rough core makes improvements on both quality and stability. We can also see that, when RC is used to initialize labels for both COPRA and BMLPA, the BBC update strategy proposed in this paper can further improve the results on both quality and stability. Then we also test the three algorithms on six LFR synthetic networks with various properties. The standard configuration of LFR network used in this experiment is: N = 1 000, ⟨k⟩ = 10, kmax = 30, mu = 0.1, cmin = 10, cmax = 50, On = N/10 and Om = 2. Based on the standard configuration, each kind of network only changes the given value of parameters. For the network with large scale (LS), N = 5 000. For the network with high density (HD), ⟨k⟩ = 30, kmax = 90. For the network with large mu (Lmu), mu = 0.4. For
the network with large On (LOn), On = N/2. For the network with large Om (LOm), Om = 7. From Table 5 we can see on networks with large scale, high density and large communities, both COPRA and RC-COPRA can find good results, and RC can make the results more stable. But on the other three networks, RC is not only able to make the results more stable, but also improves the quality of results to some extent. As expected, BMLPA can further improve the results for Lmu, LOn and LOm networks. The above results show that both the RC initialization phase and the new update strategy can improve the performance of COPRA, especially for networks with large mu and high overlap (high Om , On ). In addition, the BMLPA algorithm can give very stable results, which is seldom achieved by other label propagation algorithms. To understand the improvement of the BBC update strategy clearly, as shown in Fig.2, we construct a small LFR network, which contains 10 overlapping vertices with Om equal to 6. For consistency, we only compare the results of BMLPA and RC-COPRA in this experiment. RC-COPRA finds the correct number of communities with v equal to 3 and 4. When v is equal to 3, RCCOPRA only finds disjoint communities. Fig.3 shows the overlapping communities detected by RC-COPRA with v equal to 4. As expected, since most vertices in
Fig.2. Small LFR synthetic network containing 10 overlapping vertices with Om = 6. Other parameters of the network are: N = 120, ⟨k⟩ = 20, kmax = 35, mu = 0.15, cmin = 12, cmax = 30.
Fig.3. Best overlapping communities detected by RC-COPRA (v = 4) for the small LFR synthetic network of Fig.2.
Fig.4. Best overlapping communities detected by BMLPA (p = 0.8) for the small LFR synthetic network of Fig.2.
the small network are non-overlapping ones, to ensure that the quality of the overall communities are not too bad, RC-COPRA can only find overlapping vertices with small Om . However, these found overlapping
vertices do not exist between the planted overlapping communities in Fig.2. Fig.4 shows the result of BMLPA, which discovers all planted overlapping vertices successfully. Although
the memberships of each overlapping vertex are not totally identified, the result is much better than that of RC-COPRA. To investigate the shifting trend of the improvements with increasing Om , we make another experiment (Fig.5).
Fig.5. Upward trend of the enhancement on quality (NMI) of the new update strategy on networks with increasing Om (from 1 to 8). Other used parameters: N = 2 000, ⟨k⟩ = 21 ∼ 28, kmax = 120, cmin = 60, cmax = 100, t1 = 2, t2 = 2, mu = 0.2 and On = 200.
In the experiment of Fig.5, tuning Om from 1 to 8, we show the enhancement percent of NMI by comparing the results of BMLPA and RC-COPRA. As expected, this shows an upward trend when increasing Om from small to large. This is because a larger Om makes it harder for RC-COPRA to choose a suitable parameter v, while our balanced belonging coefficients update strategy is able to decide how many community identifiers to retain depending on different kinds of vertices. The above experiments prove that both RC initialization and BBC update strategy are able to improve the quality and stability of communities detected by COPRA. Compared with the update strategy of COPRA, the balanced belonging coefficients update strategy proposed in this paper can handle networks with various numbers of community memberships successfully.
3.3 Tests on Benchmark Networks
In this subsection, we test the three multi-label propagation algorithms and a recently proposed overlapping community detection algorithm, OSLOM, on LFR benchmark networks. For consistency, we use RC to generate seeds for OSLOM algorithm in the following experiments. We run each algorithm 10 times on each one of 10 independent realizations. The first set of benchmarks contains only disjoint communities. The network size is either 1 000 or 5 000, community sizes are in the range 10∼50 or 20∼100, the mixing parameter mu varies from 0.1 to 0.8, and other parameters are fixed. Fig.6 shows the results from four algorithms. We use the best parameter for each algorithm. The parameter v is searched from 1 to 15 and p is searched from 0.1 to 1 with an increment of 0.05 each time. For the benchmark networks with disjoint communities, the best values of parameter v found for COPRA and RC-COPRA are 4 and 2, respectively, while the best value of parameter p of BMLPA is 0.7 for this experiment.
Fig.6. NMI of four overlapping community detection algorithms on synthetic networks with disjoint communities. (a) LL: large network with large communities. (b) LS: large network with small communities. (c) SL: small network with large communities. (d) SS: small network with small communities.
The results show that, when mu is smaller than 0.4, all three multi-label propagation algorithms can give very good results, which are better than those of OSLOM. The effectiveness of the initialization phase of RC begins to emerge when mu is larger than 0.4. It can be observed that the advantage of the new update strategy is notable on networks with large communities when mu is tuned to a relatively large value, such as 0.6 for LL network and 0.5 for SL network. The possible reason is as follows. Since we fix the average degree of the benchmark networks, communities in networks with small communities may have high clustering coefficients, which makes them easier to detect. In contrast, for networks with large communities and a relatively large value of mu, the inner community structure may be more decentralized and vertices connecting different communities may be hard to identify correctly. At this time, the update strategy of COPRA will randomly choose a label if more than one label has the same largest belonging coefficient, which is less than the threshold parameter 1/v. This random operation brings the risk of misclassification and will also mislead the next iteration. In this case, the BBC update strategy will temporarily retain the labels with the same largest belonging coefficient, and let the correct community identifiers gradually emerge in the following iterations. Therefore, the results from BMLPA are better on both quality and stability. Then we test the algorithms on benchmark networks with overlapping communities. The difference from the last benchmark is that we set 100 overlapping vertices with two community memberships in each network of the overlapping benchmark. Fig.7 shows the results from algorithms with their best parameters on the overlapping benchmark networks. In this experiment, COPRA and RC-COPRA both perform best with v equal to 5, and BMLPA gives its best result with p equal to 0.75.
Fig.7. NMI of four overlapping community detection algorithms on synthetic networks with overlapping communities. (a) LL: large network with large communities. (b) LS: large network with small communities. (c) SL: small network with large communities. (d) SS: small network with small communities.
From Fig.7, we can find that RC-COPRA performs better than COPRA when mu is larger than 0.4, and at the same time the results generated by BMLPA are also a bit better than those of RC-COPRA. OSLOM has the advantage of finding parts of communities when mu ranges from 0.6 to 0.7. Remarkably, OSLOM forms a strange curve on LL network. This may be partly because the quality of seeds generated by RC is not good enough for OSLOM in this case. In the literature, OSLOM is always fed with good quality partitions, such as disjoint communities detected by Infomap[4] or BGLL[3]. But the problem of using a partition generated by some disjoint community detection algorithm is that some overlapping communities may be totally discarded by the disjoint community detection algorithm used. That is one reason why we design the RC algorithm to generate small overlapping rough cores. The above results on LFR benchmark networks show that when mu is larger than 0.4 on both disjoint and overlapping benchmark networks, the BMLPA algorithm is always able to give the best results among the three tested multi-label propagation algorithms. In most cases, BMLPA also performs better than OSLOM, except some cases when mu is too large.
3.4 Experiments on Computational Efficiency
To test the computational efficiency, we test BMLPA and other algorithms on networks with different scales and densities. Fig.8 shows the total time cost of four overlapping community detection algorithms. For RC-COPRA and BMLPA, the results in Fig.8 include the time cost of initialization phase of RC. We also show the time cost of RC in this figure. As the number of vertices grows, the minimum and maximum sizes of communities are increased in the given ranges on average at the same time.
Fig.8. Experiment on networks with different scales. N = 1 000 ∼ 500 000, ⟨k⟩ = 6, cmin = 10 ∼ 100, cmax = 100 ∼ 1 000, mu = 0.15.
In this experiment, parameter v is set to 5 for both COPRA and RC-COPRA and p is set to 0.7 for BMLPA. From Fig.8, we can see that, for the given parameters, BMLPA runs the fastest of the four overlapping community detection algorithms. Owing to the RC initialization, RC-COPRA takes second place and costs less total time than COPRA. OSLOM can only process networks with about 100 000 vertices in one thousand seconds. In the experiment of Fig.9, the average degrees of the synthetic networks are varied from 20 to 200. The time cost of OSLOM rises rapidly as the average vertex degree increases. The results show that the three multi-label propagation algorithms in Fig.9 and the RC algorithm are not greatly affected by network density.
Fig.9. Experiment on networks with different densities. ⟨k⟩ = 20 ∼ 200, N = 5 000, cmin = ⟨k⟩, cmax = 500, mu = 0.2.

3.5 Experiments on Social Networks

Finally, we have also tested the three multi-label propagation algorithms on eight real social networks listed in Table 6. Table 7 lists the best average modularity result of each algorithm using the best parameter for each network. BMLPA gives the best average modularity for every network tested except for network BL1. At the same time, BMLPA shows excellent stability on these tested networks when the best parameter is given. Table 7 also shows the total execution time for each algorithm with its best parameter for each network. BMLPA runs the fastest on networks EML, BL1 and PGP and runs the slowest on networks BL2 and CM2. Besides the complexity of each iteration, the total execution time also depends on the number of iterations, so it is hard to say which algorithm is the fastest overall.

Table 6. Social Networks Used

Code  Name           Ref.  Vertices  Edges
KAR   Karate         [23]  34        78
DOL   Dolphins       [24]  62        159
FOT   Football       [16]  115       613
EML   Email          [25]  1 133     5 451
BL1   Blogs          [26]  3 982     6 803
PGP   PGP            [27]  10 680    24 316
BL2   Blogs2         [26]  30 557    82 301
CM2   Cond-mat-2003  [28]  27 519    116 181

4 Conclusions
In this paper, we present a new multi-label propagation algorithm, BMLPA, to uncover overlapping communities in social networks. In BMLPA, a balanced belonging coefficients (BBC) update strategy is proposed to detect overlapping communities in networks whose vertices have various numbers of community memberships, a case that the COPRA algorithm does not handle well. A rough core (RC) extraction method is also designed to initialize labels for multi-label propagation algorithms. Three multi-label propagation algorithms, COPRA, RC-COPRA (COPRA with RC initialization) and BMLPA, are tested on both synthetic and real social networks. Experimental results show that both RC initialization and the BBC update strategy bring improvements in quality and stability, especially for networks that contain vertices with various numbers of community memberships. Experiments on efficiency show that BMLPA keeps the speed advantage of LPA. Like other multi-label propagation algorithms, BMLPA can also handle networks with large scales and high densities, which real social networks often have and which remain a problem for other kinds of overlapping community detection algorithms, such as OSLOM.
Table 7. Test Results of the Three Multi-Label Propagation Algorithms on Eight Social Networks

      BMLPA                     RC-COPRA                COPRA
Code  p     Qov   Std.   Time   v   Qov   Std.   Time   v   Qov   Std.   Time
KAR   0.70  0.74  0.000  0.00   2   0.71  0.016  0.00   3   0.44  0.180  0.00
DOL   0.75  0.77  0.000  0.00   2   0.76  0.219  0.00   4   0.70  0.040  0.00
FOT   0.60  0.69  0.000  0.01   4   0.67  0.003  0.01   2   0.69  0.030  0.01
EML   0.85  0.73  0.000  0.18   2   0.60  0.180  0.25   2   0.51  0.237  0.31
BL1   0.50  0.76  0.000  0.69   10  0.78  0.012  2.45   9   0.75  0.007  2.62
PGP   0.15  0.83  0.000  11.6   11  0.82  0.002  15.5   11  0.79  0.017  14.7
BL2   0.90  0.70  0.000  54.8   2   0.65  0.014  23.4   2   0.60  0.032  38.4
CM2   0.50  0.69  0.000  30.5   1   0.64  0.031  19.0   1   0.68  0.057  21.7

Note: the values in bold indicate the best results for the networks.
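The statements made in Section 3.5 about Table 7 can be cross-checked with a short Python sketch. The Qov and Time values below are transcribed from Table 7. The dictionary layout and variable names are assumptions made for this illustration only, and ties are treated as "not strictly fastest" or "not strictly slowest".

```python
# (Qov, Time) per algorithm, transcribed from Table 7.
table7 = {
    "KAR": {"BMLPA": (0.74, 0.00), "RC-COPRA": (0.71, 0.00), "COPRA": (0.44, 0.00)},
    "DOL": {"BMLPA": (0.77, 0.00), "RC-COPRA": (0.76, 0.00), "COPRA": (0.70, 0.00)},
    "FOT": {"BMLPA": (0.69, 0.01), "RC-COPRA": (0.67, 0.01), "COPRA": (0.69, 0.01)},
    "EML": {"BMLPA": (0.73, 0.18), "RC-COPRA": (0.60, 0.25), "COPRA": (0.51, 0.31)},
    "BL1": {"BMLPA": (0.76, 0.69), "RC-COPRA": (0.78, 2.45), "COPRA": (0.75, 2.62)},
    "PGP": {"BMLPA": (0.83, 11.6), "RC-COPRA": (0.82, 15.5), "COPRA": (0.79, 14.7)},
    "BL2": {"BMLPA": (0.70, 54.8), "RC-COPRA": (0.65, 23.4), "COPRA": (0.60, 38.4)},
    "CM2": {"BMLPA": (0.69, 30.5), "RC-COPRA": (0.64, 19.0), "COPRA": (0.68, 21.7)},
}

not_best_qov = []      # networks where BMLPA does not reach the top Qov
strictly_fastest = []  # networks where BMLPA is faster than both others
strictly_slowest = []  # networks where BMLPA is slower than both others

for code, row in table7.items():
    bmlpa_q, bmlpa_t = row["BMLPA"]
    other_times = [t for name, (_, t) in row.items() if name != "BMLPA"]
    if bmlpa_q < max(q for q, _ in row.values()):
        not_best_qov.append(code)
    if all(bmlpa_t < t for t in other_times):
        strictly_fastest.append(code)
    if all(bmlpa_t > t for t in other_times):
        strictly_slowest.append(code)

print(not_best_qov)      # ['BL1']
print(strictly_fastest)  # ['EML', 'BL1', 'PGP']
print(strictly_slowest)  # ['BL2', 'CM2']
```

On FOT, BMLPA and COPRA share the same Qov of 0.69, so BMLPA counts as reaching the top value there. On the three smallest networks the reported times are identical, so no algorithm is strictly fastest.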
References

[1] Fortunato S. Community detection in graphs. Physics Reports, 2010, 486: 75-174.
[2] Raghavan U, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 2007, 76(3): 036106.
[3] Blondel V, Guillaume J, Lambiotte R et al. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008, 2008(10): P10008.
[4] Rosvall M, Bergstrom C. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 2008, 105(4): 1118-1123.
[5] Du N, Wang B, Wu B. Community detection in complex networks. J. Comput. Sci. & Technol., 2008, 23(4): 672-683.
[6] Leung I X Y, Hui P, Lio P, Crowcroft J. Towards real-time community detection in large networks. Physical Review E, 2009, 79(6): 066107.
[7] Barber M J, Clark J W. Detecting network communities by propagating labels under constraints. Physical Review E, 2009, 80(2): 026129.
[8] Šubelj L, Bajec M. Unfolding communities in large complex networks: Combining defensive and offensive label propagation for core extraction. Physical Review E, 2011, 83(3): 036103.
[9] Gregory S. Finding overlapping communities in networks by label propagation. New Journal of Physics, 2010, 12(10): 103018.
[10] Xie J, Szymanski B K, Liu X. SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In Proc. IEEE ICDM Workshop on DMCCI 2011, Vancouver, Canada, Dec. 2011, pp.344-349.
[11] Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005, 435(7043): 814-818.
[12] Lancichinetti A, Fortunato S, Kertesz J. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 2009, 11: 033015.
[13] Lee C, Reid F, McDaid A, Hurley N. Detecting highly overlapping community structure by greedy clique expansion. In Proc. the 4th SNA-KDD Workshop, Washington, DC, USA, July 25-28, 2010.
[14] Lancichinetti A, Radicchi F, Ramasco J J, Fortunato S. Finding statistically significant communities in networks. PLoS One, 2011, 6(4): e18961.
[15] Ahn Y Y, Bagrow J P, Lehmann S. Link communities reveal multiscale complexity in networks. Nature, 2010, 466(7307): 761-764.
[16] Girvan M, Newman M E J. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 2002, 99(12): 7821-7826.
[17] Lancichinetti A, Fortunato S. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 2009, 80(1): 016118.
[18] Newman M E J, Girvan M. Finding and evaluating community structure in networks. Physical Review E, 2004, 69(2): 026113.
[19] Nicosia V, Mangioni G, Carchiolo V et al. Extending the definition of modularity to directed graphs with overlapping communities. Journal of Statistical Mechanics: Theory and Experiment, 2009, P03024.
[20] Shen H, Cheng X, Cai K et al. Detect overlapping and hierarchical community structure in networks. Physica A: Statistical Mechanics and Its Applications, 2009, 388(8): 1706-1712.
[21] Shen H, Cheng X, Guo J. Quantifying and identifying the overlapping community structure in networks. Journal of Statistical Mechanics: Theory and Experiment, 2009, P07042.
[22] Fortunato S, Barthelemy M. Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 2007, 104(1): 36-41.
[23] Zachary W. An information flow model for conflict and fission in small groups. J. Anthropological Research, 1977, 33(4): 452-473.
[24] Lusseau D, Schneider K, Boisseau O J et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations: Can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology, 2003, 54: 396-405.
[25] Guimerà R, Danon L, Díaz-Guilera A, Giralt F, Arenas A. Self-similar community structure in a network of human interactions. Physical Review E, 2003, 68: 065103.
[26] Gregory S. An algorithm to find overlapping community structure in networks. In Proc. the 11th PKDD, Sept. 2007, pp.91-102.
[27] Boguñá M, Pastor-Satorras R, Díaz-Guilera A, Arenas A. Models of social networks based on social distance attachment. Physical Review E, 2004, 70: 056122.
[28] Newman M E J. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 2001, 98: 404-409.
Zhi-Hao Wu received the B.Sc. degree in computer science and technology from Beijing Jiaotong University (BJTU) in 2007. He is now a Ph.D. candidate in computer science and technology at BJTU. His research interests include complex networks and social network analysis, data mining and machine learning. His current research focuses on community detection and evolution in networks.

You-Fang Lin received the Ph.D. degree in computer science and technology from BJTU in 2003. He is currently an associate professor and vice dean of the School of Computer and Information Technology, BJTU. His research interests cover data warehousing and data mining, intelligent systems, complex networks, and the related practical applications in the fields of civil aviation and telecommunications.

Steve Gregory was a cofounder of the research field of concurrent logic programming, which was at the core of the 1980s Fifth Generation Computer Systems project. Since 1990 he has been at the University of Bristol, where he has worked on various topics. His current research interests are in network analysis. He specializes in exotic types of community structure, and has designed two algorithms, CONGA and COPRA, for detecting overlapping communities.

Huai-Yu Wan is currently a Ph.D. candidate in computer science and technology at Beijing Jiaotong University. He received his B.S. and M.S. degrees in computer science from Beijing Jiaotong University in 2004 and 2007 respectively. His research interests focus on data mining and social network analysis.

Sheng-Feng Tian was born in 1944. He is a professor of the School of Computer and Information Technology, Beijing Jiaotong University in China. Currently, his research interests include support vector machines and network intrusion detection.