Consensus Clustering Approach for Discovering Overlapping Nodes in Social Networks

D. Shiva Shankar
School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
[email protected]

S. Durga Bhavani
School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
[email protected]

ABSTRACT

Community discovery is an important problem that has been addressed in social networks from multiple perspectives. Most of the existing algorithms discover disjoint communities and yield widely varying results with regard to both the number of communities and community membership. We use this variation positively by interpreting the results as the opinions of different algorithms regarding the membership of a node in a community. A novel approach to discovering overlapping nodes is proposed based on consensus clustering, and we design two algorithms, namely core-consensus and periphery-consensus. The algorithms are implemented on LFR networks, which are synthetic benchmark data sets created for community discovery, and their comparative performance is presented. It is shown that overlapping nodes are detected with a high Recall of above 96%, with an average F-measure of nearly 75% for dense networks and 65% for sparse networks, which is on par with high-performing algorithms in the literature.

1. INTRODUCTION

All the world is a stage, and all the men and women are not merely players but take on many roles within a lifetime. They don different identities, and hence community discovery becomes an interesting problem, as the same person can belong to different communities. For example, in collaboration networks, if research areas are treated as communities, nodes which are ‘multi-disciplinary’ become interesting nodes to discover in a real-world scenario. Many algorithms have been proposed based on the notion of expanding communities, starting from a seed and using a benefit function to decide on the quality of the cluster [5], [8], EAGLE [7], [1], and several hierarchical and fuzzy approaches. Recent algorithms like BigClam [10], based on non-negative matrix factorization, claim high speeds and can manage massive data to identify overlapping communities. Fortunato’s group has developed a whole suite, called LFR networks, for creating synthetic benchmark data sets for (overlapping) community discovery [6]. We do not survey the work extensively and refer the reader to the recent survey of Xie et al. [9], who carried out an in-depth evaluation of several overlapping community discovery algorithms on LFR networks for different topological variations of networks.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CODS ’16, March 13-16, 2016, Pune, India. Copyright 2016 ACM 978-1-4503-4217-9/16/03$15.00 http://dx.doi.org/10.1145/2888451.2888471

2. APPROACH BASED ON CONSENSUS CLUSTERING

Many clustering algorithms exist in the literature [3]. In order to choose a single clustering that agrees most with the other clusterings, the idea of consensus clustering was proposed by [2]. Lancichinetti et al. [4] use this idea effectively by computing the consensus of the results obtained by different algorithms, and propose consensus clustering as a disjoint community discovery algorithm. We extend this idea and propose two algorithms, Core-consensus and Periphery-consensus, to detect overlapping nodes. A social network is modeled as a graph G = (V, E) with |V| = n, |E| = m. We consider k known algorithms which discover disjoint communities underlying the network G.
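To make the notion of "opinions of different algorithms" concrete, the following is a minimal Python sketch of a pairwise agreement (co-association) table: for each pair of nodes, the fraction of the k algorithms that place both nodes in the same community. The function name `co_association`, the dictionary-based input format and the toy labels are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of algorithms that put each node pair in the same community.

    labelings: list of dicts, one per algorithm, mapping node -> label
               (hypothetical input format chosen for this sketch).
    Returns a dict keyed by (i, j) with i < j.
    """
    nodes = sorted(labelings[0])
    k = len(labelings)
    return {
        (i, j): sum(lab[i] == lab[j] for lab in labelings) / k
        for i, j in combinations(nodes, 2)
    }


# Toy labels from k = 3 hypothetical algorithms on a 4-node graph.
labelings = [
    {1: "a", 2: "a", 3: "b", 4: "b"},
    {1: "x", 2: "x", 3: "x", 4: "y"},
    {1: "p", 2: "p", 3: "q", 4: "q"},
]
agreement = co_association(labelings)
```

Pairs on which every algorithm agrees receive the value 1, while pairs that are split by some algorithms receive intermediate values.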

2.1 Core of a community

Let c(i, A) be the community label assigned by algorithm A to node i. We define a relation i ∼ j: nodes i and j belong to the ‘core’ of a community if c(i, A_l) = c(j, A_l) for all algorithms A_l, 1 ≤ l ≤ k. Clearly ∼ is an equivalence relation.

Algorithm 1 Overlapping nodes discovery using Core-consensus
Input: Community labels obtained from k community discovery algorithms on a graph G = (V, E), |V| = n, |E| = m
Output: Overlapping nodes.
Step 1: Initialize clusters G_i = {i}, for i = 1, 2, . . . , n.
Step 2: For every pair (i, j) ∈ V × V,
    if i ∼ j then G_i = G_i ∪ {j} end if
Return nodes i for which G_i = {i} and degree(i) > 1.
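The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `core_consensus_overlaps`, the adjacency-dictionary graph representation and the toy data are hypothetical. Because i ∼ j holds exactly when the two nodes receive the same tuple of k labels, the sketch groups nodes by that tuple rather than looping over all pairs in V × V; the resulting classes are the same.

```python
from collections import defaultdict


def core_consensus_overlaps(labelings, adjacency):
    """Nodes whose equivalence class under ~ is a singleton and whose degree > 1.

    labelings: list of dicts (one per algorithm), node -> community label.
    adjacency: dict, node -> set of neighbouring nodes.
    Both input formats are assumptions made for this sketch.
    """
    # Group nodes by their full tuple of k labels; each group is one
    # equivalence class of the relation ~ (all algorithms agree).
    classes = defaultdict(set)
    for node in adjacency:
        signature = tuple(lab[node] for lab in labelings)
        classes[signature].add(node)

    # Keep nodes that are alone in their class (G_i = {i}) with degree > 1.
    return sorted(
        node
        for members in classes.values()
        if len(members) == 1
        for node in members
        if len(adjacency[node]) > 1
    )


# Toy graph: two triangles {1, 2, 3} and {3, 4, 5} sharing node 3.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labelings = [
    {1: 0, 2: 0, 3: 0, 4: 1, 5: 1},  # hypothetical algorithm A: node 3 on the left
    {1: 0, 2: 0, 3: 1, 4: 1, 5: 1},  # hypothetical algorithm B: node 3 on the right
]
candidates = core_consensus_overlaps(labelings, adjacency)
```

In this toy example the two algorithms disagree only about node 3, so it is the only node left outside every core.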

3. ALGORITHM BASED ON PERIPHERY-CONSENSUS

We define a node i to be ‘overlapping’ if a majority of the algorithms decide that a majority of the neighbours j of i do not belong to the same community as i. The procedure that determines whether a node a is an overlapping node can be written in two steps.

Algorithm 2 Overlapping nodes discovery using Periphery-consensus
Step 1: For a fixed algorithm, find the nodes a for which at least 50% of the neighbours belong to different communities.
Step 2: If a majority of the algorithms agree that the neighbours of node a belong to different communities, then return a.
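The two-step procedure above can be sketched in Python as follows. The sketch rests on two interpretive assumptions that the text does not spell out: "belong to different communities" is read as "carry a label different from that of node a", and "majority of algorithms" is read as strictly more than k/2. The function name `periphery_consensus_overlaps`, the `neighbour_threshold` parameter and the toy data are hypothetical.

```python
def periphery_consensus_overlaps(labelings, adjacency, neighbour_threshold=0.5):
    """Nodes flagged as overlapping by a majority of the algorithms.

    labelings: list of dicts (one per algorithm), node -> community label.
    adjacency: dict, node -> set of neighbouring nodes.
    neighbour_threshold: fraction of neighbours that must lie outside the
        node's own community (0.5 mirrors the ">= 50%" of Step 1).
    """
    k = len(labelings)
    overlapping = []
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue  # isolated nodes have no periphery to inspect
        # Step 1: count the algorithms under which enough neighbours of
        # this node carry a different community label.
        votes = 0
        for lab in labelings:
            outside = sum(lab[nb] != lab[node] for nb in neighbours)
            if outside / len(neighbours) >= neighbour_threshold:
                votes += 1
        # Step 2: assumed strict majority vote over the k algorithms.
        if votes > k / 2:
            overlapping.append(node)
    return sorted(overlapping)


# Toy graph: two 4-cliques {1, 2, 3, 4} and {4, 5, 6, 7} sharing node 4.
adjacency = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4},
    4: {1, 2, 3, 5, 6, 7},
    5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
left = {1: "L", 2: "L", 3: "L", 4: "L", 5: "R", 6: "R", 7: "R"}
right = {1: "L", 2: "L", 3: "L", 4: "R", 5: "R", 6: "R", 7: "R"}
labelings = [left, right, left]  # three hypothetical algorithms
flagged = periphery_consensus_overlaps(labelings, adjacency)
```

Whichever side a disjoint algorithm assigns node 4 to, half of its neighbours lie on the other side, so every algorithm votes for it; the remaining nodes have at most one of three neighbours outside their community and receive no votes.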

4. IMPLEMENTATION AND RESULTS ON BENCHMARK DATA SETS

Lancichinetti et al. [6] generate a large class of benchmark graphs, called LFR networks, setting a standard for testing community discovery algorithms. We adopt the experimental setting chosen by Xie et al. [9] for comparison purposes. The authors compare 14 algorithms that discover overlapping communities. In particular, they measure the performance of these algorithms in detecting overlapping nodes using three measures: Precision, Recall and F-score (F-measure). We compare our results against the best scores obtained by Xie et al. in each of the experimental scenarios.
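For reference, the three measures can be computed from a predicted set of overlapping nodes and the ground-truth set using the standard definitions, as in the short Python sketch below. The function name `overlap_detection_scores` and the toy sets are assumptions made for illustration; scores averaged over several runs need not coincide with scores computed from averaged counts.

```python
def overlap_detection_scores(predicted, actual):
    """Precision, recall and F-measure for overlapping-node detection.

    predicted: iterable of nodes reported as overlapping.
    actual: iterable of nodes that are overlapping in the ground truth.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # true positives
    fp = len(predicted - actual)   # false positives
    fn = len(actual - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy example: 3 of 4 reported nodes are correct, 2 true nodes are missed.
precision, recall, f_measure = overlap_detection_scores({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```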

4.1 Results and Discussion

Each graph is given as input to six community discovery algorithms available in the software package RStudio: Edge Betweenness (EBC), FG, INF, ML, LEV and WT. The choice of these algorithms is arbitrary and is based on their speed rather than their accuracy. Initially we give the results on networks constructed with 1000 nodes having 10% overlapping nodes, with an overlapping membership of 3, µ = 0.1, etc., which are the default settings in the LFR data set. The Core-Consensus algorithm has not given satisfactory results. The Periphery-Consensus algorithm is run on larger graphs with the number of nodes N varying from 1000 to 5000, with 10% of the nodes treated as overlapping nodes. Each experiment is run 5 times on randomly generated graphs on N nodes, and the average results are presented in Table 1.

Total nodes   TP       FP      Precision   Recall   F-measure
1000          95.2     10.8    0.898       0.952    0.923
2000          187.75   25      0.883       0.938    0.909
3000          279.75   32.75   0.895       0.932    0.913
4000          373.25   37.75   0.908       0.933    0.92
5000          467.5    54      0.896       0.935    0.915

Table 1: The results obtained on average over 5 runs for different networks for the periphery-consensus clustering algorithm with 10% of nodes as overlapping in the ground truth. TP = True Positives and FP = False Positives.

4.1.1 Comparative analysis

Xie et al. [9] design two kinds of networks: those with a sparse overlap, having 10% of the nodes overlapping, and those with a dense overlap of 50%. Each of these types of networks is constructed with different degrees of overlap (Om), the overlapping nodes belonging to exactly x communities, x = 2, 3, . . . , 8. We implement our proposed overlapping community discovery algorithm on all these networks and present the results for |V| = N = 1000, 2000, 3000, 4000, 5000. Xie et al. present results for N = 5000, hence we compare our results on the networks of size 5000. The results obtained by Periphery-Consensus are compared against the best performance of the algorithms presented by Xie et al. Figure 1 shows that the F-measure is on par with SLPA in the case of sparse networks and in fact slightly better than the LINK algorithm for dense networks. It can be seen that the Periphery-Consensus algorithm yields an F-measure on par with the best results in the literature.

Figure 1: (a), (b) Plots comparing the Periphery-Consensus algorithm with the best performance given by Xie et al.

5. REFERENCES

[1] S. Arora, R. Ge, S. Sachdeva, and G. Schoenebeck. Finding overlapping communities in social networks: Toward a rigorous approach. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC ’12, pages 37–54, New York, NY, USA, 2012. ACM.
[2] A. Goder and V. Filkov. Consensus clustering algorithms: Comparison and refinement. SIAM, 2008.
[3] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012.
[4] A. Lancichinetti and S. Fortunato. Consensus clustering in complex networks. Scientific Reports, 2:1–7, 2012.
[5] A. Lancichinetti, S. Fortunato, and J. Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys, 11:2–17, 2009.
[6] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
[7] H. Shen, X. Cheng, K. Cai, and M.-B. Hu. Detect overlapping and hierarchical community structure. Physica A, 388:1706, 2009.
[8] J. J. Whang, D. F. Gleich, and I. S. Dhillon. Overlapping community detection using seed set expansion. In Proceedings of the 22nd ACM, pages 2099–2108. ACM, 2013.
[9] J. Xie, S. Kelley, and B. Szymanski. Overlapping community detection in networks: the state of the art and comparative study. ACM Computing Surveys, 45, 2013.
[10] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of ICDM, 2012.
