2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Finding Communities in Weighted Signed Social Networks
Tushar Sharma
Lucknow, India
[email protected] Abstract- In this paper I propose a novel algorithm, AGMA (Automatic Graph Mining Algorithm). AGMA automatically partitions a weighted social network graph into an appropriate number of clusters without requiring user involvement. It uses the link pattern and the link weights as the clustering criteria on which the classification of nodes is based. The algorithm is also able to find communities in disconnected graphs. The final section of the paper demonstrates the applicability of AGMA by identifying social communities in artificial networks as well as in real-world examples such as the Gahuku-Gama subtribes network and the 9/11 terrorist network. Signed social networks also lie within the applicability domain of the algorithm.
Keywords- Social network; Weighted social network; Cluster; community mining; signed social network.
I. INTRODUCTION The term social networking was first coined in 1954 by J.A. Barnes [2]. A social network is formed when people, organizations, or other social entities interact with one another in some way. The kind of relation among the entities does not matter, but its intensity does. In mathematical terms such a network is best represented as a graph G = (V, E), where V = {v1, v2, v3, …, vn} is a finite set of vertices and E = {e1, e2, e3, …, em} is a finite set of edges connecting these vertices. Typically these graphs are very large, and both the edges and the vertices have attributes. The intensity mentioned above is captured by the concept of weight. For example, in a social network of a family we can define a link as a "blood relation"; the edges between brothers and sisters can then carry more weight than the edges joining them to their cousins. Zachary's Karate Club network [14] is an example of an unweighted social network, while the social network [15] of Victor Hugo's novel Les Miserables is an example of a weighted social network. For a weighted social network the mathematical definition becomes G = (V, E, W), where wi is the weight of the ith edge. Community mining is a technique for identifying groups of entities that exhibit a similar link pattern in a social network graph. Mining such structures is an important task, because breaking the complete graph down into simpler parts lets us further characterize, analyze or discriminate each part. When we find communities in a graph, we claim that the nodes of a community are more closely related to the other nodes of that community than to the remaining nodes of the graph. A signed social network in its simplest form can be viewed as a weighted undirected graph having two types of weights
978-0-7695-4799-2/12 $26.00 © 2012 IEEE DOI 10.1109/ASONAM.2012.242
i.e. positive and negative weights. An example is a network of nations in which a positive relation denotes a political alliance and a negative relation denotes opposition. Positive weights are represented as +1 and negative weights as -1 [20]. Examples of signed social networks include the Gahuku-Gama [6] sub-tribes network. A. Related Studies on Community Mining Community structure is an important topological property of social networks. In positive social networks, communities are defined as groups of nodes within which the links are dense but between which the links are less dense [5]. Community mining is a technique for classifying nodes into groups or communities within which a chosen attribute of the vertices is maximized [20]. The sign of the link between two vertices is positive if they share some similar attributes and negative otherwise. The strength of the relation between two nodes is defined by the weight of the edge between them. Nodes are classified on the basis of both the weight and the sign of the edge. This task becomes challenging when there are some negative links within groups and, at the same time, some positive links between groups. In the literature, many algorithms have been developed to detect network communities by sub-graph clustering, but only in positive and mostly unweighted networks. They may be categorized into three groups as follows [1]: (1) graph theoretic methods, such as random walk methods, physics-based methods and spectral methods; (2) divisive algorithms, such as the 'Betweenness' algorithm of Girvan and Newman [5], the Tyler algorithm [6] and the Radicchi algorithm [7], which divide the network into smaller subsections; (3) agglomerative algorithms, such as modularity-based algorithms [8], which form communities by joining nodes together. The 'Betweenness' measure of Girvan and Newman [5] iteratively removes the edges with the highest "stress" to eventually find disjoint communities.
Clauset [8] suggested a faster algorithm, but the number of clusters must still be specified by the user. Flake et al. [9] used the max-flow min-cut formulation to find communities around a seed node; however, the selection of seed nodes is not fully automatic. Kelsic [10] proposed an agglomerative algorithm for constructing overlapping communities using local shells, and implemented methods for visualizing the overlap between communities. Pons and Latapy [11] reported a community finding method based on random walks. It starts with single-node communities and repeatedly merges the pair of adjacent communities that minimizes the mean of the squared distances between each node and its community. Qian [16] presented a link mining algorithm to identify communities of practice, based on the idea that linked nodes belonging to the same community should have a larger number of 'common friends'.
M.P.S. Bhatia [12] introduced BFC (a breadth-first algorithm), in which communities are formed by clustering groups of nodes that are closely connected to each other. The algorithm uses breadth-first traversal, as discussed in Cormen et al. [13], as its propagation method. In breadth-first traversal all the neighbor nodes within a community are traversed first, followed by the nodes of other communities. With every node traversed, the visit counter of each of its neighbors is incremented. Reaching a node whose visit counter is 2 or more signifies that a cluster may exist, as shown in fig. 1. Small, tightly coupled components are detected first; nearby vertices are then merged incrementally into bigger clusters on the basis of majority participation. If the neighbors of a vertex belong to more than one class, the vertex is assigned to the class that contains the largest number of its neighbors. Yang et al. [1] introduced the FEC algorithm, which works on both parameters, i.e. the link density and the sign of the link. The main idea behind the algorithm is an agent-based random walk model: the FC (find community) phase finds the sink community that includes a specified node with linear time complexity, after which the EC phase extracts the sink community from the entire network using robust graph cut criteria. In the FC phase a sink node is placed by the agent, and the l-step transfer probability distribution is calculated for each node. The l-step transfer probabilities are then sorted to find the nodes with the lowest probabilities. These nodes lie outside the community and are removed to reveal the community structure. II.
strong, friendship, hate) in mind we can draw an artificial network which will contain all of them.
Figure 1. An Artificial network of college students
Fig. 1 above shows a network of 11 college mates. The red edges denote a negative relation between two pairs of nodes, 5-1 and 5-9. The negative relation between 5 and 9 is less intense than that between 5 and 1. The weights of these relations are shown in red. The black edges denote a friendly relation, and their weights give its intensity. If we attempt to form clusters in this network we can clearly identify 3 communities: community A contains nodes 1, 2, 3 and 4; community B contains 5, 6, 7 and 8; and community C contains the last three, 9, 10 and 11. B. Working principle AGMA works on the principle of edge weights. In the simplest terms it says "close friends must lie in the same community, avoiding foes as far as possible". To understand the algorithm completely we need to understand the basic properties of every node and edge. E is the set of m edges e1, e2, e3, …, em. Every edge has a weight as its property; if no weight is defined, every edge e has an equal weight of +1. The weight of an edge is accessed with the '.' operator, e.g. the weight of edge e is written e.weight. Every vertex n has four properties: (1) n.tempwtcounter // a counter that is incremented whenever an edge adjacent to vertex 'n' is traversed. (2) n.wt // algorithmic weight of vertex 'n'. (3) n.visit // Boolean; TRUE marks the vertex as visited by the algorithm. (4) n.c // number of the cluster into which node n is classified after the execution of the algorithm. First we calculate the weight 'wt' of each vertex. This weight tells us how friendly the node is to the other nodes: the higher the weight of a node, the higher its friendliness. The node weight is the sum of the weights of all edges joining the node to its adjacent nodes in the graph. E.g.
in Fig. 1 the weight of node 8 is the sum of its edge weights (2.0 with node 5, 2.0 with node 6 and 1.0 with node 7), which gives node 8 a weight of 5.0. Similarly, the weight of node 5 is (-2 + 1 + 1 + (-1)) = -1.0.
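The node weight computation described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `node_weights` and the edge-list layout `(u, v, weight)` are assumptions, and the two edge lists contain just the edge weights quoted in the text (the neighbor labels "p" and "q" for the two +1 edges of node 5 are hypothetical placeholders, since the text does not name them).

```python
from collections import defaultdict

def node_weights(edges):
    """Return {vertex: sum of the weights of its incident edges}.

    `edges` is a list of (u, v, weight) tuples of an undirected graph;
    negative weights denote negative (foe) relations.
    """
    wt = defaultdict(float)
    for u, v, w in edges:
        wt[u] += w
        wt[v] += w
    return dict(wt)

# Edges incident to node 8, with the weights quoted in the text.
edges_at_8 = [(8, 5, 2.0), (8, 6, 2.0), (8, 7, 1.0)]
# Edge weights quoted for node 5; "p" and "q" are placeholder neighbors.
edges_at_5 = [(5, 1, -2.0), (5, 9, -1.0), (5, "p", 1.0), (5, "q", 1.0)]

print(node_weights(edges_at_8)[8])  # 5.0
print(node_weights(edges_at_5)[5])  # -1.0
```

A negative node weight, as for node 5, simply means that the negative relations of the node outweigh its positive ones.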
AGMA
A. Problem Definition AGMA was built with the following problems in mind. I had to: (1) find communities in a weighted signed social network; (2) find communities in a disjoint graph; (3) devise an algorithm in which the user does not need to pass any parameter defining the appropriate number of communities to be formed. A social network is all about relations. The relation on which the network is built may be weak between some nodes and strong between others. It may be positive, like friendship, or negative, like hate. Security and law enforcement agencies keep track of terrorist activities all around the world. When they intercept a message between two suspects they can give the connection a weight, which transforms their network graph into a weighted signed social network. The intercepted messages may reveal whether the two suspects have a positive relation or are foes. Such a network, once formed, gives a complete view of a signed weighted social network. A disjoint graph is formed when security agencies track down the suspects of an activity. Their search is not always the result of following a chain; sometimes they find a suspect, or a group of suspects, with no identified link to the previously suspected persons. Keeping all four of the above parameters (weak,
After calculating the weights of the individual nodes, AGMA calculates the average node weight avgwt of the graph, i.e. the total of all node weights divided by the number of nodes. The average weight of the network in Figure 1 works out to 4.90. Now we need to traverse each edge of the graph. AGMA does this in breadth-first search fashion [13]: starting from an arbitrary node, it traverses the graph from that node's neighbors to their further neighbors. When the algorithm visits a node it sets the value of the Boolean visit to 1. While traversing the graph, whenever the algorithm moves from one node to another via an edge, the destination node's tempwtcounter (written wtcounter below) is increased by the weight of that edge. To keep track of the neighbors, every node V has three additional parameters:
1. NAV: number of vertices adjacent to V.
2. UCV: number of unclustered vertices adjacent to V, where the neighbor vertex vn has vn.wtcounter > (avgwt + vn.wt)/4 and vn has more than one neighbor.
3. CV: number of clustered vertices adjacent to V, where the neighbor vertex vn has vn.c ≠ 0 and vn.wtcounter >= avgwt/2.
So every time we traverse an adjacent node, NAV is incremented by 1 and, if the conditions hold, the neighbor is counted in UCV or CV accordingly. If at some point during the traversal the condition (vi.wtcounter >= vi.wt/2 and ucvs > nav/2 and vi.c = 0), where ucvs = ucv + cv, holds for a vertex vi, we call it the cluster indicating condition. This condition says that the node vi is ready to be clustered. At this time we need to check one more condition: (1) if (ucv > cv), (2) else if (cv > ucv). If case 1 is true, the number of neighboring unclustered vertices is greater than the number of clustered ones. In this case AGMA forms a new cluster from those unclustered vertices together with this node. If case 2 is true, enough clustered vertices have a strong relation with node vi, and the node can be merged into an already formed cluster.
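The neighbor bookkeeping and the cluster indicating condition described above can be sketched in Python as below. This is a hedged illustration, not the author's implementation: the function names, the dictionary keys (`wt`, `wtcounter`, `c`, `degree`) and the numbers in the example are all assumptions made for the sketch. It follows the prose definition in which UCV counts only unclustered neighbors, and it assumes the caller passes only the neighbors reached over positive edges.

```python
def classify_neighbors(neighbors, avgwt):
    """Return (nav, ucv, cv) for one vertex.

    Each neighbor is a dict with the hypothetical keys:
      wt        - node weight
      wtcounter - weight accumulated so far during the traversal
      c         - cluster number (0 means unclustered)
      degree    - number of neighbors of that vertex
    """
    nav = ucv = cv = 0
    for vn in neighbors:
        nav += 1
        if vn["c"] != 0 and vn["wtcounter"] >= avgwt / 2:
            cv += 1    # strongly related, already clustered neighbor
        elif (vn["c"] == 0
              and vn["wtcounter"] > (avgwt + vn["wt"]) / 4
              and vn["degree"] > 1):
            ucv += 1   # strongly related, still unclustered neighbor
    return nav, ucv, cv


def cluster_action(vi, nav, ucv, cv):
    """Apply the cluster indicating condition to vertex vi."""
    ready = (vi["c"] == 0
             and vi["wtcounter"] >= vi["wt"] / 2
             and (ucv + cv) > nav / 2)
    if not ready:
        return "wait"
    if ucv > cv:
        return "new_cluster"   # case 1: form a cluster with the unclustered neighbors
    if cv > ucv:
        return "merge"         # case 2: join an existing cluster (via the MFC)
    return "wait"              # tie: left for the final MFC pass


# Made-up example values, chosen only to exercise case 1.
vi = {"wt": 5.0, "wtcounter": 3.0, "c": 0}
neighbors = [
    {"wt": 4.0, "wtcounter": 3.0, "c": 0, "degree": 3},
    {"wt": 6.0, "wtcounter": 3.0, "c": 0, "degree": 2},
    {"wt": 3.0, "wtcounter": 2.5, "c": 1, "degree": 2},
]
nav, ucv, cv = classify_neighbors(neighbors, avgwt=4.0)
print(nav, ucv, cv)                      # 3 2 1
print(cluster_action(vi, nav, ucv, cv))  # new_cluster
```

How a tie (ucv equal to cv) should be handled is not stated in the text; the sketch simply defers such a node.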
If a node has neighbors that have been classified into different communities, then vi has to decide with which of them it is to be clustered. When there is more than one choice of cluster, AGMA finds the MFC (Most Friendly Cluster) and merges the node with that cluster. The method of finding the MFC is simple. For every cluster or class Cj that contains adjacent nodes of vi, we calculate the FC (Friend Coefficient): FC(vi, Cj) = total weight of the edges connecting vi with the nodes in cluster Cj. The higher the value of FC for a cluster, the more likely the node is to be a part of it. Finally, every node that was not clustered when it was first visited is classified by AGMA into its MFC.
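The Friend Coefficient and the choice of the Most Friendly Cluster can be sketched as follows. The function names, the adjacency-dictionary layout and the example graph are assumptions made for illustration. The sketch also assumes that negative edge weights are included in the total, so that foes inside a cluster lower its FC; the text does not state this explicitly.

```python
def friend_coefficients(vi, adjacency, cluster_of):
    """Return {cluster id: FC(vi, cluster)}.

    adjacency  - {vertex: {neighbor: edge weight}}
    cluster_of - {vertex: cluster id}, where 0 or a missing key means unclustered
    """
    fc = {}
    for vn, w in adjacency[vi].items():
        c = cluster_of.get(vn, 0)
        if c != 0:
            fc[c] = fc.get(c, 0.0) + w   # total edge weight towards cluster c
    return fc


def most_friendly_cluster(vi, adjacency, cluster_of):
    """Return the cluster with the largest FC, or None if no neighbor is clustered."""
    fc = friend_coefficients(vi, adjacency, cluster_of)
    return max(fc, key=fc.get) if fc else None


# Hypothetical example: node "x" has neighbors in clusters 1 and 2.
adjacency = {"x": {"a": 2.0, "b": 1.0, "c": 2.5, "d": -1.0, "e": 1.0}}
cluster_of = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 0}

print(friend_coefficients("x", adjacency, cluster_of))    # {1: 3.0, 2: 1.5}
print(most_friendly_cluster("x", adjacency, cluster_of))  # 1
```

In this example the single strongest edge of "x" leads to cluster 2, but the negative edge to "d" lowers FC for that cluster, so cluster 1 is chosen. If two clusters have the same FC, `max` returns the one encountered first; the text does not specify a tie-breaking rule.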
C. The Algorithm
AGMA() {
  For every vertex v in nodes {
    v.wt = sum of the weights of all edges connecting v to its adjacent vertices
  }
  totalwt = ∑ (i = 1 to N) vi.wt
  avgwt = totalwt/N
  For x = 1 to N {
    If (vx.visit == 0) {   // start a traversal from every vertex not yet visited
      Append(Q, vx)
      vx.visit = 1
      While Q: {
        vi = pop(Q)
        Av = neighbor vertices of vi
        nav = 0   // number of adjacent vertices
        cv = 0, ucv = 0, lcv = empty, lucv = empty   // per-vertex counters and lists, reset for each vi
        For every neighbor vertex vn in Av {
          If (vn ≠ vi and e(vn,vi).weight > 0) {   // e(vn,vi).weight denotes the weight of the edge joining node vn and node vi
            nav = nav + 1
            vn.wtcounter = vn.wtcounter + e(vn,vi).weight
            If (vn.visit == 0) {
              vn.visit = 1
              Append(Q, vn)
            }
            If (vn.c ≠ 0 and vn.wtcounter >= avgwt/2) {
              Insert(lcv, vn)   // lcv: list of clustered vertices
              cv = cv + 1
            }
            Elseif (vn.wtcounter > (avgwt + vn.wt)/4 and vn has more than one neighbor) {
              Insert(lucv, vn)   // lucv: list of unclustered vertices
              ucv = ucv + 1
            }
          }
        }   // for neighbor ended
        If (vi.wtcounter >= vi.wt/2 and vi.c == 0) {
          ucvs = cv + ucv
          If (ucvs > nav/2) {
            If (ucv > cv) {
              Append(lucv, vi)
              k = k + 1
              lucv.c = k   // every vertex in lucv gets the new cluster number k
            }
            Elseif (cv > ucv) {
              Find MFC
              vi.c = MFC
            }
mrvar.fdv.uni-lj.si/sola/info4/andrej/prpart5.htm). It describes the political alliances and oppositions among 16 Gahuku-Gama subtribes, which were distributed in a particular area and were engaged in warfare with one another in 1954. The AGMA output for the network is shown in fig. 2. The positive ties are shown as white edges and the negative ones in red. AGMA detects the same three communities as reported by Doreian and Mrvar [3]. B. 9/11 Terrorist Attack The most condemnable terrorist act of modern times took place on 11 September 2001 in America. According to the data gathered by Krebs [17], the whole group of 63 terrorists found guilty in those attacks formed internal groups, each of which was given the task of attacking a prominent building in America with an airplane. The multi-weighted version of the network data with 34 terrorists can be downloaded together with the network analysis software "Network Workbench (NWB)" from http://nwb.slis.indiana.edu/download.html.
          }
        }
      }   // while Q ended
      For every node vi in nodes having vi.c = 0 {
        Find MFC
        vi.c = MFC
      }
    }
  }
}
D. Applications Security and law enforcement agencies use Social Network Analysis (SNA) software to plan the destabilization of terrorist networks. When the network they are pursuing is small, i.e. the number of terrorists or suspects is small, they do not need the cluster information of the graph. They identify the most prominent nodes according to several criteria and build a strategic plan from that information. But when the network is large, the whole task of taking the network down has to be divided into parts of manageable size, and partitioning the network becomes essential. Also, the major players of a network are sometimes concentrated in one part of it. In such situations the whole network cannot be disrupted by taking down the major nodes; doing so only has a local effect with respect to the whole graph. So we need to find out which part of the network will be affected by taking down which particular node, and such information can only be revealed when the cluster information is available. Other applications that involve cluster analysis include biological analysis, market analysis of products, educational research, and many other social issues in which people need to be classified into groups.
Figure 2. AGMA output of Gahuku-Gama Network
AGMA tries to find the community structure in the network behind that terrorist attack and succeeds to a good extent (Figure 3). According to [17], 4 communities were formed in the whole network, involving 19 terrorists out of all the accused. AGMA identifies a total of five communities in the network of 34 terrorists and places together four of the five terrorists who were reportedly responsible for the crash of flight AA#77 into the Pentagon. In another community AGMA identifies four of the five terrorists who were responsible for the crash of flight AA#11, which struck WTC North. Similarly, another community contains three of the four terrorists responsible for the crash of flight UA#93 in Pennsylvania, and another contains three of the five who crashed flight UA#175 into WTC South.
III. RESULTS AND EXPERIMENTS I have tested my proposed community mining algorithm AGMA with a number of benchmark networks and randomly created networks in order to evaluate the effectiveness, correctness and efficiency of the algorithm.
C. Randomly generated network This is an unweighted network of 36 nodes, inspired by [18]. Bo Yang and Jiming Liu [19] reported seven communities in this network; AGMA finds the same partition (Figure 4).
A. Gahuku-Gama Subtribes Network This network was created based on Read’s study on the cultures of highland New Guinea [4] (obtained from http://
Figure 3. AGMA output of 9/11 terrorist graph
Figure 4. AGMA output of unweighted random network
IV. CONCLUSION AND FUTURE WORK In this paper I proposed AGMA, a simple, fast and robust algorithm that can find communities in multi-weighted, disjoint and signed social networks. AGMA considers both the link density and the link signs when mining communities in signed networks and is automatic, i.e. it does not require any external parameter, as is the case with the FEC algorithm [1]. I have shown a few results of AGMA on well-defined social network data and on a randomly generated graph, covering community mining in a signed social network, a multi-weighted social network and a positive-only unweighted social network. The results were found to be accurate in terms of partition. I have found experimentally that the running time of the algorithm is linear, which can be denoted as O(v+e). In future I will try to extend the applicability domain of AGMA so that it can also find communities in dynamic graphs.
REFERENCES
[1] Bo Yang, W.K. Cheung, and Jiming Liu, "Community Mining from Signed Social Networks," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 10, pp. 1333-1348, 2007.
[2] J.A. Barnes, "Class and Committees in a Norwegian Island Parish," Human Relations, 1954.
[3] P. Doreian and A. Mrvar, "A Partitioning Approach to Structural Balance," Social Networks, vol. 18, no. 2, pp. 149-168, 1996.
[4] K.E. Read, "Cultures of the Central Highlands, New Guinea," Southwestern J. Anthropology, vol. 10, no. 1, pp. 1-43, 1954.
[5] M. Girvan and M.E.J. Newman, "Community Structure in Social and Biological Networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821-7826, 2002.
[6] J.R. Tyler, D.M. Wilkinson, and B.A. Huberman, "Email as Spectroscopy: Automated Discovery of Community Structure within Organizations," C&T '03, pp. 81-96, 2003.
[7] F. Radicchi, C. Castellano, "Defining and Identifying Communities in Networks," Proceedings of the National Academy of Sciences, vol. 101, no. 9, pp. 2658-2663, 2004.
[8] A. Clauset, M.E.J. Newman, and C. Moore, "Finding community structure in very large networks," Phys. Rev. E, vol. 70, no. 6, 066111, 2004.
[9] G.W. Flake, S. Lawrence, C. Lee Giles, "Efficient Identification of Web Communities," ACM SIGKDD, Boston, Massachusetts, United States, pp. 150-160, 2000.
[10] Eric D. Kelsic, "Understanding complex networks with community finding algorithms," SURF 2005.
[11] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," Proc. 20th Int'l Symp. Computer and Information Sciences (ISCIS '05), pp. 284-293, 2005.
[12] M.P.S. Bhatia and Pankaj Gaur, "Statistical approach for community mining in social networks," IEEE/SOLI, 2008.
[13] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, The MIT Press and McGraw-Hill Book Company, 2001.
[14] W.W. Zachary, Journal of Anthropological Research, vol. 33, no. 4, pp. 452-473, 1977.
[15] D.E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA, 1993.
[16] Rong Qian, Wei Zhang, Bingru Yang, "Detect community structure from the Enron Email Corpus Based on Link Mining," Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA), vol. 2, pp. 850-855, 2006.
[17] Valdis E. Krebs, "Uncloaking Terrorist Networks," First Monday, vol. 7, no. 4, 1 April 2002.
[18] V. Batagelj and A. Mrvar, "Analysis and visualization of large networks," in Graph Drawing Software, Springer, Berlin, pp. 77-103, 2004.
[19] Bo Yang and Jiming Liu, "An Autonomy Oriented Computing (AOC) Approach to Distributed Network Community Mining," SASO 2007, 0-7695-2906-2/07.
[20] Tushar Sharma, Ankit Charls, PramodKumar Singh, "Community Mining in Signed Social Networks - An Automated Approach," ICCEA09, Manila, pp. 163-168, 2009.