Greedy Local Algorithm for Overlapping Community Detection in Online Social Networks

Ashish Kumar Singh

Dr. Sapna Gambhir

Research Scholar, Department of Computer Engineering YMCA University of Science and Technology Faridabad, India [email protected]

Associate Professor, Department of Computer Engineering YMCA University of Science and Technology Faridabad, India [email protected]

Abstract— Due to the rapid growth of Social Networking Sites, a significant number of people interact with each other without any barriers of geographical distance or language. This communication is selective in the sense that people communicate only with likeminded people and people of similar interests. This selective communication creates a typical structure in networks known as a community. Community structure is a significant property of social networks, and hence its detection is of prime importance in social network analysis and research. This paper presents the Greedy Local Algorithm (GLA) for detecting overlaps between communities in social networks, which can detect dense overlaps between communities and results in fewer outliers.

Keywords— Community Detection, Overlapping community detection, social network analysis, Social networks, algorithms

I. INTRODUCTION

Online social networks have become a center of attraction for a wide variety of audiences. Nearly every person has a profile on Facebook, Google Plus, Orkut, Twitter, etc., which are collectively termed Social Networking Sites (SNS). People communicate with others via these SNS and share their feelings, emotions, achievements, sorrows, and nearly everything about their lives. A social network is a graphical representation of the communication among people, where people are represented as nodes and an edge between a pair of nodes represents some kind of communication between them [1]. It is a natural tendency of human beings to socialize and communicate with likeminded people. This selective nature of communication forms a typical structure in networks known as a community: inside a community people communicate more often with its members, whereas they tend to talk less to members outside the community. The problem of detecting communities has gained huge interest from researchers nowadays. When a social network is represented as an undirected graph G(V, E), where V represents the individuals and E represents the links among them, a community C is such that the number of links going from the vertices in C to the outside is far less than the number of links with both endpoints inside C. Communities are an important feature of complex networks [2].

978-1-4799-4236-7/14/$31.00 © 2014 IEEE

Community detection algorithms answer a wide range of questions regarding the behaviour and interaction patterns of people. Community detection is the process of finding dense groups of nodes with high internal edge density. Detecting such communities is not trivial and is quite challenging, as it is completely different from two similar and well studied problems in computer science, namely clustering and graph partitioning. The foremost challenge in the domain of community detection is that there is no generally accepted definition of a community; still, a large number of community detection algorithms are available which produce effective results. Recent advances in SNSs provide quality data to researchers, allowing them to experiment with and test their algorithms on real world data. This makes the testing quite rigorous, and hence the algorithms produced are of a high standard. Finding communities has been a focus of researchers because it helps them understand communication patterns as a function of the structural properties of networks.

This paper presents a systematic and organized study of overlapping community detection techniques and proposes a new overlapping community detection algorithm named the Greedy Local Algorithm. The strengths and weaknesses of each technique are also a matter of focus in this paper. The major contributions of this paper are to serve as a base for those starting research in this direction, to provide them with the existing state-of-the-art algorithms for the research problem, and to make researchers aware of the challenges in the direction and the solutions proposed so far. A hybrid of local expansion and optimization and the clique percolation method is also proposed. The comparison of the results of the algorithm on both real world and synthetic networks with existing algorithms in the domain is shown as proof of the validity of the algorithm.
The paper is organized into sections as follows: Problem Formulation, which states the problem of overlapping community detection; Related Work, which explains the techniques used so far to solve the problem and their corresponding strengths and weaknesses; Proposed Work, which describes the Greedy Local Algorithm (GLA);


Experiments, which presents the experiments conducted with the algorithm on both real world and synthetic networks; and Discussion, which compares the results of the algorithm with other techniques and evaluates the results obtained from the experiments. Finally, concluding thoughts are presented in the Conclusion section.

II. PROBLEM FORMULATION

A node is identified as overlapping if its links are present in more than one cluster. In [1], links are partitioned by hierarchical clustering on the basis of edge similarity. If the only available information is the network topology, the most fundamental characteristic of a node is its set of neighbors; since a link consists of two nodes, it is natural to use the neighbor information of the two nodes when defining a similarity between two links.

Given an undirected graph G(V, E), where V is the set of vertices and E is the set of edges, with |V| = n and |E| = m, assign to each vertex v a cover C, where C is the set of communities of which v is a member.

III. RELATED WORK

Numerous algorithms have been proposed for community detection, too many to review one by one, so a classification is presented on the basis of their basic operating mechanism. Link Partitioning Algorithms work by breaking the links and creating dendrograms, whereas Clique Based Algorithms exploit the properties of cliques to detect communities. Agent Based Algorithms are based on the propagation of labels between vertices of the graph, and Fuzzy Algorithms are based on mixed membership models.

Fig 1: Classification of Community Detection Algorithms (Link Partitioning, Clique Based, Fuzzy, Local Expansion and Optimization, Agent Based and Dynamic)

Local Expansion and Optimization algorithms are based on some fitness measure that is optimized locally. Fig. 1 shows the classification of community detection algorithms. The details of each category follow.

A. Link Partitioning Algorithms

The basic idea of link partitioning algorithms is to partition links, rather than nodes, to discover communities. Every link partitioning algorithm has two steps:
Step 1: Construct the dendrogram (a representation that stores the hierarchical structure of the network); an illustration of a dendrogram is shown in Fig 2.
Step 2: Partition the dendrogram at some threshold.
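To make the two steps concrete, here is a pure-Python sketch (the function name and union-find helper are illustrative, not the paper's implementation). It relies on the fact that cutting a single-linkage dendrogram at a threshold is equivalent to taking connected components over pairs of links whose Jaccard similarity, as defined later in Eq. (1), reaches that threshold.

```python
from itertools import combinations

def link_communities(edges, threshold):
    """Sketch of the two link-partitioning steps: group links whose
    Jaccard edge similarity (Eq. 1) meets the threshold, which is
    equivalent to cutting a single-linkage dendrogram at that level."""
    # Neighborhood of each vertex, including the vertex itself.
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)

    # Union-find over links: merging two links whenever their similarity
    # clears the threshold reproduces the single-linkage cut.
    parent = {e: e for e in edges}
    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if not shared:
            continue  # only links sharing a vertex are compared
        # For links (i, k) and (j, k), compare the neighborhoods of i and j.
        k = shared.pop()
        i = e1[0] if e1[1] == k else e1[1]
        j = e2[0] if e2[1] == k else e2[1]
        sim = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
        if sim >= threshold:
            parent[find(e1)] = find(e2)

    groups = {}
    for e in edges:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())
```

On two triangles sharing a single vertex, a threshold of 0.5 recovers the two triangles as separate link communities.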


Fig 2: Illustration of a Dendrogram

Given a pair of links e_jk and e_ik sharing node k, the edge similarity between them is calculated by the Jaccard index as:

S(e_jk, e_ik) = |N_i ∩ N_j| / |N_i ∪ N_j|    Eq. (1)

where N_i is the set of vertices in the neighborhood of vertex i, including vertex i itself. For example, if N_i = {i, k, a} and N_j = {j, k, a}, then S = |{k, a}| / |{i, j, k, a}| = 0.5. After the edge similarities are calculated, single-linkage hierarchical clustering is applied to find hierarchical community structures; single linkage is generally chosen for its simplicity and efficiency, which make it applicable to large networks.

B. Clique Percolation Methods

A clique is a complete subgraph in which all nodes are adjacent to each other. The input to clique based algorithms is a network graph G and an integer k, and they have the following steps in general:
a. Find all cliques of size k in the given network.
b. Construct a clique graph, in which any two k-cliques are adjacent if they share k-1 nodes.
c. Each connected component of the clique graph forms a community.
CPM (Clique Percolation Method) is based on the assumption that a network is composed of cliques which overlap with each other. CPM finds overlapping communities by searching for adjacent cliques; as a vertex can be a member of more than one clique, overlap between communities is possible. The parameter k is of utmost importance in finding communities via CPM, and empirically small values of k have shown effective results [3] [4]. An efficient implementation of CPM is CFinder [5]. CPM is suitable for dense graphs where cliques are present; when only a few cliques exist, CPM fails to produce meaningful covers.
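Steps (a)-(c) can be sketched in pure Python; the brute-force k-clique enumeration below is only for illustration and would not scale to the dense graphs CPM targets.

```python
from itertools import combinations

def clique_percolation(edges, k):
    """Sketch of CPM steps (a)-(c): enumerate k-cliques, link cliques
    sharing k-1 nodes, and read communities off the connected
    components of the clique graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # (a) brute-force all k-cliques: every k-subset whose pairs are all adjacent.
    nodes = sorted(adj)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]

    # (b) two k-cliques are adjacent when they share k-1 nodes;
    # (c) each connected component of the clique graph is one community.
    communities, seen = [], set()
    for c in cliques:
        if c in seen:
            continue
        stack, members = [c], set()
        seen.add(c)
        while stack:
            cur = stack.pop()
            members |= cur
            for other in cliques:
                if other not in seen and len(cur & other) == k - 1:
                    seen.add(other)
                    stack.append(other)
        communities.append(members)
    return communities
```

On two triangles sharing one node, k = 3 yields two communities that overlap in the shared node, exactly the kind of cover CPM is designed to produce.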

2014 5th International Conference- Confluence The Next Generation Information Technology Summit (Confluence)

C. Agent Based and Dynamic Algorithms

Three famous algorithms in this category are the Speaker-Listener Label Propagation Algorithm (SLPA) [6], the Community Overlap Propagation Algorithm (COPRA) [7], and the Label Propagation Algorithm (LPA) [8]. In SLPA a node is called a speaker when it is spreading information and a listener when it is consuming information, and labels are spread according to pairwise interaction rules. A node can hold many labels depending on the information it has learned from the network. The average complexity of SLPA is O(tm), where t is the number of iterations and m is the number of edges. The best part of SLPA is that it does not require any prior information about the number of communities in the network. In the other two algorithms a node forgets the information it learned in previous iterations, but in SLPA each node has a memory in which it stores all the information it has learned about the network in the form of labels; the more often a node observes a label in its surroundings, the more likely it is to spread that label to other nodes. The Label Propagation Algorithm described in [8] is extended to the overlapping case by allowing multiple labels per node. Initially every node has its own unique label; labels are then updated over iterations, each node adopting the label held by the maximum number of its neighbors, and nodes with the same label form a community. LPA was modified by Gregory [7], who introduced COPRA. In COPRA each label consists of a community identifier and a belonging coefficient, and the sum of the belonging coefficients of communities over all neighbors is normalized; a node updates its belonging coefficients synchronously by averaging the belonging coefficients of all its neighbors at each step. The parameter v controls the number of communities of which a node can be a member. The time complexity of COPRA is O(vm log(vm/n)).
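A minimal sketch of the plain LPA scheme that SLPA and COPRA build on; deterministic tie-breaking is used here so the sketch is reproducible, whereas real implementations typically break ties randomly.

```python
from collections import Counter

def label_propagation(edges, max_iter=100):
    """Sketch of plain LPA [8]: every node starts with a unique label
    and repeatedly adopts the label held by most of its neighbors;
    nodes sharing a final label form a community."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    labels = {n: n for n in adj}            # unique initial labels
    for _ in range(max_iter):
        changed = False
        for n in sorted(adj):               # fixed order keeps it deterministic
            counts = Counter(labels[m] for m in adj[n])
            top = max(counts.values())
            if counts.get(labels[n], 0) == top:
                continue                    # keep the current label on ties
            labels[n] = max(l for l, c in counts.items() if c == top)
            changed = True
        if not changed:
            break

    groups = {}
    for n, l in labels.items():
        groups.setdefault(l, set()).add(n)
    return list(groups.values())
```

On two triangles joined by a single bridge edge, the labels converge to one community per triangle.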
According to the benchmarks in [9], COPRA provides the best results for overlapping communities. An important optimization in LPA would be to avoid unnecessary label updates, which reduces the execution time.

D. Fuzzy Algorithms

The overlap between communities can be of two types. In crisp (non-fuzzy) overlap, each node either belongs to a community or does not: the belonging factor is 1 for every community of which the node is a member. In fuzzy overlap, a node may also belong to more than one community, but with a belonging factor in the range 0 to 1. The membership strength of node n in community c is denoted b_nc, and summing the belonging coefficients of a vertex over all the communities of which it is a member gives 1 [10]. The belonging coefficient thus describes how a given vertex is distributed between communities. The major drawback of fuzzy overlapping community detection methods is the need to determine the dimensionality k of the membership vector; this value is generally passed as a parameter to the algorithms, while some algorithms calculate it from the data [10]. Only a few fuzzy methods have shown good results. Zhang, Wang, and Zhang [11] proposed a method combining spectral mapping, fuzzy c-means clustering, and the optimization of a quality function. It converts the network to a (k-1)-dimensional Euclidean space and uses the fuzzy c-means algorithm to detect up to k communities; both detection accuracy and computational efficiency rely heavily on the user-specified value k, an upper bound on the number of communities. Nepusz et al. [12] model overlapping community detection as a nonlinearly constrained optimization problem which can be solved by simulated annealing. Each vertex of the graph is allowed to belong to multiple communities at the same time, and the algorithm determines the optimal membership degrees with respect to a given goal function; to determine k, it is increased repeatedly until the value of the community structure, as measured by a modified fuzzy modularity function, no longer improves.
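The constraint that a node's belonging coefficients sum to 1 [10] amounts to a simple normalization; a sketch with hypothetical raw membership strengths:

```python
def normalize_memberships(raw):
    """Sketch: turn raw membership strengths into fuzzy belonging
    coefficients b_nc that sum to 1 over each node's communities [10].
    `raw` maps node -> {community: strength}; names are illustrative."""
    out = {}
    for node, strengths in raw.items():
        total = sum(strengths.values())
        out[node] = {c: s / total for c, s in strengths.items()}
    return out
```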
E. Local Optimization Algorithms

These algorithms make use of a community quality score which is optimized in order to reveal the community structure of a given network. Baumes [13] iteratively divided the network into a number of seed communities and added vertices to those seeds until the density of each community reached its maximum. The LFM method of Lancichinetti and Fortunato [15] finds both the hierarchical and the overlapping community structure of the network. Connected Iterative Scan (CIS), proposed by Goldberg et al. [16], finds communities by utilizing optimality principles and the connectedness properties of the network. The Order Statistics Local Optimization Method (OSLOM) [17] is the first method which takes into account edge directions and edge weights. iLCD (intrinsic Longitudinal Community Detection) [18] detects both temporal and static communities.

IV. PROPOSED WORK

It is observed that a lot of techniques have been proposed for the detection of communities in social networks, but none of them uses cliques as seeds. Most of the attention has been on


detecting disjoint communities, whereas community overlap is a general phenomenon in social networks. Only a few algorithms exist which can detect overlapping communities, and these fail when the community overlaps are denser. Another issue is that they are not able to assign community memberships to a large number of nodes; such nodes are referred to as outliers. The running time of these algorithms is also a significant problem: as the input networks are of large sizes, these algorithms fail to terminate. So there is a need for a new community detection technique which overcomes these major issues.

Fig 3: Overall GLA (Greedy Local Algorithm)

This paper presents a new community detection technique based on greedy local optimization which overcomes the issues of dense overlaps and outliers. The Greedy Local Algorithm (GLA) first finds the maximal cliques and then expands these seeds based on a local optimization function: it keeps adding to each seed the node for which the maximum increase in the local quality score is obtained, growing the seed into a community. The method works by first finding the seeds, then cleaning the seeds, and finally expanding them continuously by adding nodes until the addition of any node would decrease the community quality score.

Finding seeds: With the idea that every community is formed around a core of highly connected people, this approach takes cliques as seeds. To identify the cliques in a given network the Bron-Kerbosch algorithm [19] is used, which returns all the maximal cliques present in the input network; a parallel implementation is provided by [20]. The cost of finding the maximal cliques is found to be small compared to the cost of expanding the seeds into communities. Seeds play an important role in the results, as the number of communities will equal the number of seeds that undergo expansion. To generate the maximal cliques a variant of the Bron-Kerbosch algorithm [21] is used, which is the most efficient and well known algorithm for generating maximal cliques in a given graph.
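A minimal Python sketch of the basic Bron-Kerbosch recursion, without the pivoting of the cited variant; function names are illustrative.

```python
def bron_kerbosch(R, P, X, adj, out):
    """Basic Bron-Kerbosch recursion: report maximal cliques that
    contain all of R, some of P, and none of X."""
    if not P and not X:
        out.append(set(R))      # R can no longer be extended: maximal
        return
    for v in list(P):
        # Restrict candidates and exclusions to neighbors of v.
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)             # v fully explored; exclude it from now on
        X.add(v)

def maximal_cliques(edges):
    """Enumerate all maximal cliques of an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return out
```

On two triangles sharing a vertex, the recursion reports exactly the two triangles as maximal cliques.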
This is a recursive backtracking algorithm that searches for all maximal cliques in the given graph. It works as follows: given three sets R, P, and X, it finds the maximal cliques that include all of the vertices in R, some of the vertices in P, and none of the vertices in X. P and X are disjoint sets whose union consists of those vertices that form a clique when added to R. The initialization sets R and X to be empty and P to be the vertex set of the graph.

Expand the seeds to get communities: When the seeds are enumerated, the algorithm tries to add nodes to those seeds by


checking the positive difference the addition of the node makes to the community quality score. The community quality score is defined as the number of edges inside the community divided by the alpha power of the number of edges inside and outside the community. During the experiments, good results are seen for alpha in the range 0.9-1.5: for larger communities alpha is tuned high, and for smaller communities alpha is tuned to low values. This variation of alpha only adapts the algorithm to the given network; by default alpha is kept at 1 for synthetic networks. The algorithm keeps adding the node which produces the largest increase in the community quality score, and stops whenever no node is available whose addition gives a positive increase. All the neighbors of the nodes in the seed are eligible for addition: the algorithm maintains a list of such eligible nodes, calculates the difference each makes to the community quality score, and chooses the one with the maximum positive change.

Removing duplicate communities: Due to the unsupervised expansion of the seeds, the algorithm can give duplicate communities as its result; there is a high chance of seeds expanding into duplicate communities, so there must also be a technique to remove duplicates. The similarity can be computed by the Jaccard coefficient or inverse logarithmic measures; a simpler method is used here, in which the similarity is the fraction of the nodes of the smaller community which are also members of the larger community. Since calculating the similarity requires an intersection between two communities, it constitutes a large part of the execution time of the algorithm, so such calculations are performed only when they are utmost required. At the end of the algorithm, there can be some duplicates present in the result.
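The expansion step and the duplicate check can be sketched together in Python under the definitions above: the quality score is internal edges over the alpha power of internal-plus-boundary edges, and the similarity is the Eq. (2) overlap fraction with the paper's 0.6 threshold. Function names and tie-breaking order are illustrative, not the paper's implementation.

```python
def build_adj(edges):
    """Adjacency sets for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def quality(comm, adj, alpha=1.0):
    """GLA quality score: internal edges over the alpha power of
    internal-plus-boundary edges."""
    e_in = sum(1 for u in comm for v in adj[u] if v in comm) // 2
    e_out = sum(1 for u in comm for v in adj[u] if v not in comm)
    return e_in / (e_in + e_out) ** alpha if (e_in + e_out) else 0.0

def expand_seed(seed, adj, alpha=1.0):
    """Greedily add the neighboring node with the largest positive
    gain in the quality score; stop when no addition helps."""
    comm = set(seed)
    while True:
        frontier = {v for u in comm for v in adj[u]} - comm
        base = quality(comm, adj, alpha)
        best, gain = None, 0.0
        for v in sorted(frontier):
            g = quality(comm | {v}, adj, alpha) - base
            if g > gain:
                best, gain = v, g
        if best is None:
            return comm
        comm.add(best)

def remove_duplicates(communities, threshold=0.6):
    """Pairwise check with the similarity of Eq. (2): overlap size
    divided by the size of the smaller community."""
    kept = []
    for c in sorted(communities, key=len, reverse=True):
        if all(len(c & k) / min(len(c), len(k)) < threshold for k in kept):
            kept.append(c)
    return kept
```

On two triangles joined by a bridge, each triangle seed stops expanding immediately, since absorbing the bridge node lowers the quality score.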
To get rid of such duplicate communities, communities are compared pairwise and a similarity score is calculated; if the similarity score is beyond some threshold (0.6), the community is marked as duplicate and removed from the result. The similarity score is defined as the proportion of vertices common to the two communities under question to the size of the smaller community:

S = |C1 ∩ C2| / min(|C1|, |C2|)    Eq. (2)

At the end the algorithm returns the list of all communities without any duplicates. Calculating similarities between communities is time consuming for large networks, where the communities themselves are larger, so the similarity score must be calculated only when there is an ultimate need for it. To achieve this, the implementation increases a suspicion score for a community whenever it has more members who are already members of other communities, and at the end the similarity score is calculated only for such suspicious communities. This saves a lot of time during execution. Although this heuristic is not so advanced, it needs to be refined further to reduce the total time of execution


for removing duplicate communities. The time complexity of the procedure is quadratic, but on applying the suspicious-community heuristic this complexity is reduced, and experiments have shown significant time improvements with this method.

V. EXPERIMENTS

The proposed GL Algorithm is tested on both synthetic and real world networks. The implementation is done in Python 3.0 using the python-igraph library. An Intel i5 laptop with 6 GB RAM is used for the execution of the algorithm on both types of networks. Gephi, an open-source network visualization tool, is used for setting the graph layouts and coloring the graphs generated by the algorithm. Fedora 20 (Linux) is the operating system used for development and testing because of its good Python support and the packages it provides for writing and testing programs at an early stage of development. The implementation of the proposed Greedy Local Algorithm (GLA) is tested on the following real world networks: Les Miserables and Zachary's karate club. Both networks are chosen because their ground truth is known, i.e., the actual number of communities and the members of each community are known in reality, so the comparison between the ground truth and the results of the proposed technique is easy to carry out. Moreover, other researchers have also published their results for these networks, which makes them preferable to other networks for testing purposes. The algorithm is run with the values of the parameters shown in Table 1.

Fig. 4 shows that no appreciable information can be obtained from the network in this state: the network is uncolored and there is no classification of nodes into communities. This network is fed to the algorithm as input in the form of an edge list.

Fig 4: Les Miserables dataset network

The algorithm results in 5 communities, each identified by a color. Three overlapping nodes, shown in red, are identified by the algorithm; these nodes are members of multiple communities. An interesting observation in this output network is that the 3 overlapping nodes form a clique among themselves, which signifies that these 3 characters have co-appeared with the majority of other characters.

Table 1: Parameter Values for the Algorithm

Parameter                    Value
k (size of clique)           3
sigma (similarity measure)   0.2
alpha (community size)       1

The values of these parameters can be fine-tuned if the properties of the input network vary widely, which gives the algorithm the flexibility to be applied to various types of networks; by varying these parameters one can obtain different community sizes. During the tests one important characteristic emerged: there are specific constraints on the values of these parameters, and if values below or beyond those constraints are supplied, the results will not be accurate. The constraint for alpha is 0.9-1.5, the constraint for k is 3-5, and the constraint for sigma is 0.1-0.3. The algorithm is found to provide good results on nearly all networks when the parameter values stay within these constraints. The first dataset shows the co-appearance of characters in the novel Les Miserables [22]. The input network is a simple graph G(V, E) in which there is no distinction between any nodes; all nodes are alike.

Fig 5: Output of Les Miserables dataset network

Fig 6: Zachary Karate Club dataset network


The next dataset used for testing the algorithm is the social network of the members of a karate club at a US university. The conflict and fission of this network were studied by Zachary, who showed that the group split because of disputes. When the network depicted in Fig 6 is provided as input, the algorithm returns the network shown in Fig 7 as its output.

Fig 7: Output of Zachary Karate Club dataset network

It is evident from Fig. 7 that 4 communities are detected by the algorithm, each represented by a different color. There is one ambiguous node, node 10, for which the algorithm fails to determine the community membership.

There is a nearly uniform increase in the execution time as the number of nodes increases, which is a good sign that the algorithm can scale to large networks. The graph in Fig. 9 plots the number of communities against the number of nodes; it is evident that the number of communities increases with the number of nodes for randomly generated graphs.

VI. DISCUSSION

The proposed GL Algorithm is tested on both synthetic and real world networks. For synthetic networks a random graph generator is used, which generates random graphs on the basis of the distance between nodes; for real world networks, Zachary's karate club network is used as input to the algorithm. For synthetic networks every algorithm, including GLA, is executed multiple times to calculate the data points. The proposed algorithm is compared with the spinglass community detection technique [24] and the classical edge betweenness technique [25] of Girvan and Newman. Fig. 10 clearly shows that the proposed technique outperforms both techniques in execution time for randomly generated graphs; GLA provided better results than both algorithms used for comparison. Timing is not GLA's only advantage: it also performs quite well on overlapping communities, which the other algorithms cannot detect completely.

Fig 8: Timing diagram for GLA

Fig 10: Comparison of other algorithms with GLA

Fig 9: Number of Nodes v/s Number of Communities

The implementation is tested on synthetic networks generated using the random graph generator API in igraph named GRG. The execution time versus number of nodes plot is shown in Fig 8. The rate of increase of the execution time is not high, and it can be seen from the graph that the growth is nearly uniform.


One just needs to reset the parameters and the results change accordingly; the algorithm adapts to the network. It can be clearly seen from the table that the proposed technique has a lower execution time for all input networks: where the execution time of the other techniques grows abnormally, the execution time of the proposed technique grows uniformly with respect to the number of nodes. The results of the proposed GL Algorithm are also compared with the seed set expansion technique of Whang et al. [26]; their result for the karate club dataset is shown in Fig 11.


Fig 11: Results of the seed set expansion technique of Whang [26]

The problem with the result of the seed set expansion technique is that it merges communities: the technique reports only 2 communities, whereas there exist 4 communities in the network, as described by the results of the proposed Greedy Local Algorithm (GLA). The results of the proposed technique are similar to those of the spectral technique proposed in [27]; the results of the spectral technique are shown in the following figure. The main problem with the spectral technique is its execution time, which is higher than that of the proposed greedy local technique. One important observation is that the spectral technique has no outliers in its results while the proposed technique has 1 outlier, but this is quite insignificant for large networks, where execution time matters most. From the previous section it is evident that the GL Algorithm performs well on both real world and synthetic networks. For synthetic networks significant time savings are reported during the experiments as compared to other existing techniques. GLA results in fewer outliers than the Clique Percolation Method (CPM) and others, and performs well even in the case of dense overlaps. The significant insights from the results of the implementation of GLA are as follows:
a.) The number of outliers is reduced by approximately 10-15% as compared to other algorithms.
b.) Overlapping nodes can be identified even in the case of dense overlaps.
c.) There is a significant improvement in execution time as compared to existing algorithms.
d.) The algorithm works well on both synthetic and real world networks.

VII. CONCLUSION AND FUTURE WORK

Community detection approaches have attracted a lot of attention from researchers in recent years, and there is a considerable increase in the number of algorithms published for solving the problem, as it has applications in domains such as microbiology, social science, and physics. Analyzing community structure in social networks has emerged as a topic of growing interest because it reveals the interplay between the structure of the network and its functioning. The paper tries its best to review all popular algorithms, but the study is by no means complete, as there are newer

algorithms discovered at a fast rate because of the growing interest of researchers in this domain. This paper describes nearly all the algorithms which exist for community detection , and also reviews their strengths and weaknesses. The basic concepts required for understanding the problem of community detection are described in great detail. The main goal was to come up with a technique which is better than the current state of art solutions. The proposed technique of Greedy Local Optimization for overlapping community detection is described. The technique follows the greedy local optimization method. Extensive tests are also performed and the results are also shown for the sake of validity of the proposed technique. The proposed technique performs well as compared to the classical algorithms and the current state of art algorithms. Tests on real world networks have proved that the community structure identified by the proposed technique is better than classical solutions proposed by other researchers. The paper showed that the problem of outliers and overlap detection can be removed by the proposed greedy local technique. In future the efforts will be concentrated on further optimizing the technique , so that the algorithm can scale to larger number of nodes and hence can be applied to large social networks. Another area of future work will be to come up with a parallel version of the algorithm , so that large networks can be processed in parallel , to reduce the execution time. Another interesting area of future research is the mapreduce version of the algorithm, which can be applied in distributed manner for further scaling the algorithm to larger networks. Due to the local nature of expansion in this algorithm , it is easy to reason that the complexity can't becharacterized purely in terms of |V| or |E| as the complexity depends on the subtler local characteristics of the graph used as input. 
Some graphs have dense local neighbourhoods while others are sparse, so these characteristics of the input graph cannot be specified rigorously in advance. The algorithm can be broken down into three main steps: seeding, expansion and duplicate removal. Optimizations that can be performed in each of these steps in future are:
a.) Optimization in the seeding step. The greedy local technique chooses maximal cliques as seeds, and finding maximal cliques in a graph takes a significant amount of time. As an alternative, random nodes or hub nodes can be chosen as seeds; finding these requires far less time than enumerating maximal cliques.
b.) Optimization in the expansion step. During the expansion phase the algorithm simply adds to a community the node that yields the maximum increase in the community quality score. Adding nodes in this way eventually produces duplicate communities, so a lot of time is wasted checking for duplicates. To reduce this cost, a suspicion score can be associated with every node, and once that score exceeds a specified threshold the node is no longer added to communities. This certainly means more computation and more memory consumption, as there will be an extra double

2014 5th International Conference- Confluence The Next Generation Information Technology Summit (Confluence)


variable associated with every node, whose value must be updated during each iteration. Although this takes significant time, the optimization can still lead to better results, because this time is smaller than the time spent detecting and removing duplicates.
c.) Optimization in the duplicate removal step. Duplicates are currently removed at the end, after all the communities have been enumerated. Instead, communities could be expanded in an intelligent manner that avoids duplicates altogether; this requires some way of telling whether the community currently being expanded will result in a duplicate or a unique community.
Future work will also concentrate on finding better local heuristic functions to reduce the execution time of the algorithm. Finally, better benchmarking facilities are needed so that algorithms can be compared faster. On one hand, a utility is needed that uses synthetic graphs to explore systematically under what topological conditions the performance of a community detection algorithm breaks down; on the other hand, better empirical networks with ground-truth communities are needed to ensure that an algorithm does not merely perform well on synthetic data but also works well on real-world data.

REFERENCES
[1] Girvan, M., and Newman, M. E. J., "Community structure in social and biological networks", Proceedings of the National Academy of Sciences USA 99(12), 7821-7826, (2002).
[2] Porter, M. A., Onnela, J.-P., and Mucha, P. J., "Communities in networks", Notices of the AMS 56(9), (2009).
[3] Palla, G., Derényi, I., Farkas, I., and Vicsek, T., "Uncovering the overlapping community structure of complex networks in nature and society", Nature 435, 814-818, (2005).
[4] Lancichinetti, A., Fortunato, S., and Kertész, J., "Detecting the overlapping and hierarchical community structure of complex networks", New Journal of Physics 11, 033015, (2009).
[5] CFinder (http://www.cfinder.org): implementation of the clique percolation method of Palla et al. (2005).
[6] Xie, J., Szymanski, B. K., and Liu, X., "SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process", in Proc. of ICDM Workshop, 344-349, (2011).
[7] Gregory, S., "Finding overlapping communities in networks by label propagation", New Journal of Physics 12, 103018, (2010).
[8] Raghavan, U. N., Albert, R., and Kumara, S., "Near linear time algorithm to detect community structures in large-scale networks", Phys. Rev. E 76, 036106, (2007).
[9] Freeman, L. C., "Centrality in social networks I: Conceptual clarification", Social Networks 1, 215-239, (1979).
[10] Ding, F., Luo, Z., Shi, J., and Fang, X., "Overlapping community detection by kernel-based fuzzy affinity propagation", in Proc. of ISA Workshop, 1-4, (2010).


[11] Zhang, S., Wang, R. S., and Zhang, X. S., "Identification of overlapping community structure in complex networks using fuzzy c-means clustering", Physica A 374, 483-490, (2007).
[12] Nepusz, T., Petróczi, A., Négyessy, L., and Bazsó, F., "Fuzzy communities and the concept of bridgeness in complex networks", Phys. Rev. E 77, 016107, (2008).
[13] Baumes, J., Goldberg, M., Krishnamoorthy, M., Magdon-Ismail, M., and Preston, N., "Finding communities by clustering a graph into overlapping subgraphs", in Proc. of IADIS, 97-104, (2005).
[14] Newman, M. E. J., "The structure and function of complex networks", SIAM Review 45, 167-256, (2003).
[15] Fortunato, S., "Community detection in graphs", Physics Reports 486, 75-174, (2010).
[16] Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K., and Wallace, A., "Finding overlapping communities in social networks", in Proc. of SocialCom, (2010).
[17] Lancichinetti, A., and Fortunato, S., "Finding statistically significant communities in networks", PLoS ONE 6(4), (2011).
[18] Cazabet, R., and Amblard, F., "Detection of overlapping communities in dynamical social networks", in Proc. of the 2nd IEEE International Conference on Social Computing (SocialCom '10), 309-314, (2010).
[19] Bron, C., and Kerbosch, J., "Algorithm 457: finding all cliques of an undirected graph", Commun. ACM 16(9), 575-577, (1973).
[20] Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H., "A scalable, parallel algorithm for maximal clique enumeration", Journal of Parallel and Distributed Computing 69(4), 417-428, (2009).
[21] De Nooy, W., Batagelj, V., and Mrvar, A., "Exploratory Social Network Analysis with Pajek", Cambridge University Press, (2005).
[22] Knuth, D. E., "The Stanford GraphBase: A Platform for Combinatorial Computing", Addison-Wesley, Reading, MA, (1993).
[23] Zachary, W. W., "An information flow model for conflict and fission in small groups", Journal of Anthropological Research 33, 452-473, (1977).
[24] Traag, V. A., and Bruggeman, J., "Community detection in networks with positive and negative links", Phys. Rev. E 80, 036115, http://arxiv.org/abs/0811.2329, (2009).
[25] Girvan, M., and Newman, M. E. J., "Community structure in social and biological networks", Proceedings of the National Academy of Sciences 99(12), 7821-7826, (2002).
[26] Whang, J. J., Gleich, D. F., and Dhillon, I. S., "Overlapping community detection using seed set expansion", in Proc. of the 22nd ACM International Conference on Information & Knowledge Management (CIKM), (2013).
[27] Shen, H.-W., and Cheng, X.-Q., "Spectral methods for the detection of network community structure: a comparative analysis", J. Stat. Mech., P10020, (2010).
