A Multi-Objective Genetic Algorithm for Community Detection in Multidimensional Social Network

Moustafa Mahmoud Ahmed2,3, Ahmed Ibrahem Hafez2,3, Mohamed M. Elwakil1, Aboul Ella Hassanien1,3, Ehab Hassanien1

1 Faculty of Computers and Information, Cairo University, Egypt
[email protected], [email protected], [email protected]
2 Faculty of Computer and Information, Minia University, Egypt
moustafa [email protected], [email protected]
3 Scientific Research Group in Egypt (SRGE), http://www.egyptscience.net
Abstract. Multidimensionality has emerged as an important issue in social networks because most social media sites, such as Facebook, Twitter, and YouTube, allow people to interact with each other through different social activities. Community detection in such multidimensional social networks has attracted considerable attention in recent years. In these networks, community detection becomes the discovery of a group structure shared across all network dimensions, such that members of the same group interact with each other more frequently than with those outside the group. Most studies on community detection assume that there is only one kind of relation in the network. In this paper, we propose a multi-objective approach, named MOGA-MDNet, that discovers communities in multidimensional networks by applying genetic algorithms. The method aims to find a community structure that simultaneously maximizes modularity, as an objective function, in all network dimensions. The method does not need any prior knowledge about the number of communities. Experiments on synthetic and real-life networks show the capability of the proposed algorithm to successfully detect the structure hidden within these networks.
Keywords: social network, community detection, complex network, multidimensional network, multi-objective genetic algorithms
1 Introduction
Networks or graphs are a suitable model for representing interactions that take place between entities in the real world systems. The web, telecommunication network, biological networks, and social networks are examples of systems modeled naturally as networks, where nodes in the network represent entities and edges represent relationships between pairs of entities. A network is said to have community structure, if it can be divided into groups of densely connected nodes,
Authors Suppressed Due to Excessive Length
with the nodes belonging to different communities being only sparsely connected. Communities are therefore groups of entities that probably share some common properties and/or play similar roles within the system being represented. The identification of communities hidden within networks is an important task in a wide set of fields such as sociology, physics, and biology [1]. Community detection in social networks has been widely studied in the literature over the last few years [2, 3] (recent surveys can be found in [4, 5]). However, most of these studies disregard the fact that real-world social networks are often multidimensional, i.e., many different types of connections may exist between a pair of entities at the same time, reflecting different types of interactions between them. For example, a Facebook user can be connected to friends through a friendship relationship, can follow other users, and can like, share, or comment on social content such as photos and videos posted by other users. Each of these interaction types can be represented by a single dimension. This poses a great challenge to conventional graph clustering algorithms.
In this paper we propose a multi-objective approach, named MOGA-MDNet, to discover communities in multidimensional networks by applying genetic algorithms. The method aims to find a community structure that simultaneously maximizes modularity in all network dimensions, so that maximizing modularity in each single dimension is treated as one objective to be optimized. MOGA-MDNet exploits these objectives and discovers the communities present in the network structure by selectively exploring the search space, without needing to know the exact number of communities in advance. This number is determined automatically by the optimal adjustment of the objective values.
An interesting aspect of the multi-objective approach is that it seeks the optimal community structure in the presence of trade-offs between conflicting objectives, so it returns not a single partitioning of the network but a set of solutions called Pareto-optimal solutions. Each of these solutions satisfies a different trade-off between the objectives. Experiments on synthetic and real-life networks show that the multi-objective genetic approach correctly detects communities, with results comparable to state-of-the-art approaches.
The rest of the paper is organized as follows. In Section 2, we review related work. In Section 3, the concept of a multidimensional community is defined and the community detection problem is formalized as a multi-objective optimization problem. In Section 4, we describe the genetic representation adopted and the variation operators used. In Section 5, the results of the method on synthetic and real-life networks are presented. In Section 6, we give some concluding remarks.
2 Related work
There are many studies on community detection, in different fields of research: computer science, physics, sociology, and others. Most of them define community
MOGA Community Detection in Multidimensional Social Network
detection as dividing the network nodes into a set of groups such that nodes in the same group interact with each other more frequently than with those outside the group. The papers working with this definition rely on the concept of modularity, a function proposed by Newman [2] that measures clustering quality by considering the ratio between the numbers of intra- and inter-community edges. One of the first community detection algorithms based on modularity is the Girvan-Newman algorithm, a divisive method that iteratively removes the edge with the greatest betweenness value [2]. Many optimization techniques have been applied to optimize modularity as an objective, such as greedy optimization [3], extremal optimization [6], simulated annealing [7], and genetic algorithms (GA) [8]. However, one aspect of such networks has largely been ignored: real networks are often multidimensional, i.e., two nodes may be connected through many different types of connections, reflecting different types of relationships or interactions between entities.
Recently, multidimensionality has been taken into account in a number of works. In [9, 10] the authors extended and defined a set of analytical measures that take into account the structure of multidimensional networks, opening the way for new techniques to analyze and extract nontrivial knowledge from this type of network. In [11, 13] the authors presented integration strategies that map a multidimensional network onto a mono-dimensional network so that existing solutions can be applied; however, most of these strategies do not consider the degree to which each node participates in each dimension. In reality, social actors often participate with varied intensity in different dimensions of the network; thus, within the same group, the interaction can be very sparse in one dimension but relatively more observable in another.
Consequently, with these strategies, dimensions with intensive interaction would force the other dimensions to follow the same structure. In [14] the authors introduce a simple two-phase strategy called PMM to extend modularity maximization from one-dimensional networks to multidimensional networks; however, this method ignores the fact that different relations might have different importance with respect to a certain query.
3 Community Detection in M-D Networks problem
A network can be defined as a graph G = (V, E), in which V is a set of vertices or nodes and E is a set of ties that connect nodes. In the field of social networks, nodes represent persons or actors within the network, and ties represent the relationships or interactions between those persons. To account for the presence of more than one type of relationship, a multidimensional network is represented by multiple graphs, one for each type of relationship (dimension): a d-dimensional network is represented as G = {G1, G2, ..., Gd}, where Gi represents the interactions between social actors in the i-th dimension. A community in a multidimensional network is a set of nodes that are densely connected when the interactions in all network dimensions are considered. The problem can be formulated as dividing the network's nodes into k communities, where the number k is unknown,
such that modularity is maximized for each dimension. We therefore treat this problem as a multi-objective optimization problem: given modularity as a quality measure, we want to find a community structure S that simultaneously maximizes modularity in each dimension. A multi-objective optimization problem (Ω; f1, f2, ..., fd) is defined as

$$\max_{S \in \Omega} f_i(S), \quad i = 1, 2, \ldots, d \tag{1}$$

where Ω = {S1, S2, ..., Sr} is the set of feasible community structures in a network, and each fi ∈ {f1, f2, ..., fd} is the modularity of S in network dimension i, with d the number of dimensions in the network. Equivalently, in minimization form, one minimizes −fi(S) for each i.
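The notion of Pareto optimality behind Eq. (1) can be made concrete with a short sketch. The function names `dominates` and `pareto_front` and the toy modularity vectors are illustrative assumptions, not taken from the paper; each candidate structure S is represented only by its tuple of per-dimension modularity values (f1(S), ..., fd(S)), all of which are to be maximized.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical modularity vectors of four candidate structures in d = 2 dimensions.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]
front = pareto_front(candidates)
```

Here the candidate (0.5, 0.5) is dropped because (0.6, 0.6) is better in both dimensions, while the remaining three represent different trade-offs and none can be preferred without extra information.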
4 Algorithm Description
In this section we describe the multi-objective algorithm MOGA-MDNet: the genetic representation of the community structure, the objective functions selected, and the genetic operators used by the algorithm.

Genetic representation: Our algorithm adopts the locus-based adjacency representation proposed in [15]. In this representation, an individual (chromosome) c in the population consists of n genes, where n is the number of nodes in the network, and each gene corresponds to a node in the network. Each gene gi of c can be assigned an integer value j in the range {1, 2, ..., n}. A value j assigned to gene gi is interpreted as a link between nodes i and j, which means that nodes i and j are placed in the same community. To obtain the community structure, a decoding step is necessary to identify the connected components, which can be done in linear time [16]. A main advantage of this representation is that the number of communities need not be known in advance, since it is determined automatically in the decoding process.

Genetic initialization: Our algorithm adopts the initialization process proposed in [17] (safe initialization), which takes into account the actual connections of the nodes in the social network. Random generation of individuals could produce components that are disconnected in the original graph: a gene gi could be assigned a value j although no connection between nodes i and j exists in the original graph, which means that grouping nodes i and j together is a wrong choice. Safe initialization avoids this case by substituting the value j with one of the neighbors of i.

Crossover operator: Uniform crossover is used. Given two parents, a random binary vector is created. Uniform crossover then takes the genes from the first parent where the vector is 1 and from the second parent where the vector is 0, and combines them to form the new child.

Mutation: A mutation operator that randomly changes the value of a randomly chosen gene causes a useless exploration of the search space. Therefore, as in the initialization step, we randomly select a percentage of the genes, and for each selected gene i we randomly change its value to j such that nodes i and j are neighbors.
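The representation and operators described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are numbered from 0 rather than 1, the neighbor lists are assumed to be the union of neighbors over all dimensions (the paper does not say which dimension supplies them), mutation uses a per-gene probability `rate` in place of the unspecified percentage, and decoding uses union-find, which is near-linear rather than strictly linear. All function names are hypothetical.

```python
import random
from typing import Dict, List

Neighbors = Dict[int, List[int]]  # node -> neighbor list (assumption: union over all dimensions)


def safe_individual(neighbors: Neighbors, rng: random.Random) -> List[int]:
    """Safe initialization: gene i always points to an actual neighbor of node i."""
    return [rng.choice(neighbors[i]) for i in range(len(neighbors))]


def decode(genes: List[int]) -> List[int]:
    """Community label per node: connected components of the links i - genes[i]."""
    parent = list(range(len(genes)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(genes):
        parent[find(i)] = find(j)
    roots: Dict[int, int] = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(genes))]


def uniform_crossover(p1: List[int], p2: List[int], rng: random.Random) -> List[int]:
    """Take gene i from p1 where the random binary mask is 1, otherwise from p2."""
    mask = [rng.randint(0, 1) for _ in p1]
    return [a if m == 1 else b for a, b, m in zip(p1, p2, mask)]


def neighbor_mutation(genes: List[int], neighbors: Neighbors, rate: float,
                      rng: random.Random) -> List[int]:
    """Re-point each selected gene i to a randomly chosen neighbor of node i."""
    child = list(genes)
    for i in range(len(child)):
        if rng.random() < rate:
            child[i] = rng.choice(neighbors[i])
    return child


# Toy network: two disjoint triangles, nodes 0-2 and 3-5.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
rng = random.Random(0)
parent_a = safe_individual(neighbors, rng)
parent_b = safe_individual(neighbors, rng)
child = neighbor_mutation(uniform_crossover(parent_a, parent_b, rng), neighbors, 0.2, rng)
communities = decode(child)
```

Because every gene always points to a real neighbor, initialization, crossover, and mutation can never link the two triangles in this toy network, so `communities` contains exactly two groups whatever the random seed.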
Fitness function: In genetic algorithms, the fitness function plays an important role in the evolution process. For the community detection problem, there is a wide variety of measures of the quality of a discovered community structure [8], each of which could potentially be used as a fitness function. We focus on the modularity quality measure as the objective for determining the optimal community structure. Modularity [18] was designed specifically to measure the strength of a community structure in real-world networks. Networks with high modularity have dense connections between the nodes within communities but sparse connections between nodes in different communities. Studies in the literature have shown that modularity is effective in various kinds of complex networks [19, 20].
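As an illustration of how one objective per dimension could be evaluated, the sketch below computes modularity for each dimension of a small network. The paper does not spell out the formula, so the code assumes the standard Newman-Girvan definition for an undirected, unweighted graph, Q = Σ_c [ l_c / m − (d_c / 2m)² ], where m is the number of edges, l_c the number of edges inside community c, and d_c the total degree of its nodes. The names `modularity` and `objectives` and the toy edge lists are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence, Tuple

Edge = Tuple[int, int]


def modularity(edges: Iterable[Edge], labels: Sequence[int]) -> float:
    """Newman-Girvan modularity of a partition (labels[node] = community id)."""
    intra = defaultdict(int)   # l_c: edges with both endpoints in community c
    degree = defaultdict(int)  # d_c: sum of node degrees in community c
    m = 0
    for u, v in edges:
        m += 1
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    if m == 0:
        return 0.0
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def objectives(dimensions: List[List[Edge]], labels: Sequence[int]) -> Tuple[float, ...]:
    """One objective value per network dimension: (f_1(S), ..., f_d(S))."""
    return tuple(modularity(edges, labels) for edges in dimensions)


# Toy 2-dimensional network on six nodes (illustrative data only).
dims = [
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)],  # two triangles plus a bridge
    [(0, 1), (3, 4)],                                          # a much sparser dimension
]
labels = [0, 0, 0, 1, 1, 1]
q = objectives(dims, labels)
```

For this partition the first dimension gives Q = 2 × (3/7 − 1/4) = 5/14 and the second gives Q = 2 × (1/2 − 1/4) = 0.5, so the same structure is scored independently in each dimension.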
5 Experimental Results
We tested our algorithm on a synthetic data set and on real-world social networks. For evaluation, we compared the results obtained by our algorithm with those obtained by the Principal Modularity Maximization (PMM) method proposed in [14]. Normalized mutual information (NMI) [21], which measures the similarity between the discovered community structures and the true ones, is adopted to measure clustering performance in the controlled experiments.

For each network we began by applying a single-objective genetic algorithm to detect communities in each dimension separately, as baselines. We employed standard parameters for the GA: a crossover rate of 0.8, a mutation rate of 0.4, elite reproduction of 10% of the population size, a population size of 200, and 100 iterations. The algorithm was implemented in Python using DEAP (Distributed Evolutionary Algorithms in Python) [12].

In the multi-objective case we used the Nondominated Sorting Genetic Algorithm (NSGA-II) proposed in [22], as implemented in the MATLAB Genetic Algorithm and Direct Search Toolbox. We employed standard parameters for the GA: a crossover rate of 0.8, a mutation rate of 0.2, and an elite reproduction rate of 10% of the population size. We also employed a binary tournament selection function. The population size was 200, and the number of generations was 200.

5.1 Synthetic Network Dataset
To conduct controlled experiments, we used the benchmark proposed in [14]. The network consists of 350 nodes divided into three communities of 50, 100, and 200 nodes, respectively. There are four different types of relationships (dimensions) among these nodes. For each dimension, group members connect with each other following a randomly generated within-group interaction probability. To discover communities in this network we applied our algorithm ten times. For each run, we examined the set of Pareto-optimal solutions returned by the algorithm: for each solution we calculated the sum of modularity over all dimensions, selected the community structure with the maximum sum, and compared it with the known optimal structure of the network in terms of the NMI similarity measure. The average NMI over the ten runs is reported in Table 1. Table 1 also reports the average performance of the PMM method in terms of NMI, as stated in [14].

Table 1: Average NMI results obtained by MOGA-MDNet and PMM algorithms for synthetic network

Strategy                        Dimension/Method    Average NMI
Single-Dimensional using SGA    D1                  0.6587
                                D2                  0.5176
                                D3                  0.6544
                                D4                  0.6617
Multi-Dimensional               PMM                 0.935
                                MOGA-MDNet          0.946

5.2 Real World Social Network Data
We use three real-world data sets to test our algorithm: the YouTube network [14], the students cooperation network [23], and the medical innovation network [24]. Table 2 shows the density of all dimensions of each network.

– Students Cooperation Social Network: The students cooperation social network [23] was constructed from data collected during a “Computer and Network Security” course, a mandatory course at Ben-Gurion University. The network contains 185 students, 360 links, and three different types of connections.
– Medical Innovation Network: This data set was prepared by Ron Burt [24], who recovered the 1966 data collected by Coleman on medical innovation. The network contains 246 nodes and three different types of connections.
– YouTube Network: The YouTube network [14] was constructed from data crawled from the popular video sharing site in December 2008. The crawler collected information about contacts, favorite videos, and subscriptions. In total, it reached 848,003 users, with 15,088 users sharing all of the information types. The network contains five different interaction types.

Table 2: The density of each dimension for the real-world social network datasets

Network                 Dimension          Density
Students Cooperation    Partner            1.41 × 10^-2
                        Same computer      1.35 × 10^-3
                        Same time          5.69 × 10^-3
Medical Innovation      Advice             1.59 × 10^-2
                        Discussion         1.87 × 10^-2
                        Friend             1.68 × 10^-2
YouTube                 Contact            6.74 × 10^-4
                        Co-contact         7.28 × 10^-2
                        Co-subscription    4.90 × 10^-2
                        Co-subscribed      1.97 × 10^-2
                        Favorite           8.91 × 10^-2

[Figure 1: bar charts of Modularity (vertical axis, 0 to 1) on dimensions D1 to D3 for SGA D1, SGA D2, SGA D3, PMM, and MOGA-MDNet; panel (a) Students network, panel (b) Medical Innovation network.]

To discover communities in each network we applied our algorithm ten times. For each run, the set of Pareto-optimal solutions returned by the algorithm was examined, and the modularity of each dimension for the best structure was calculated. The average modularity of each dimension for the best structure over the ten runs was then calculated and reported. The PMM algorithm was also applied ten times to divide each network into k communities, with k ranging from 2 to 60. For each k, the average modularity over the ten runs was calculated, and the highest modularity value was selected. Figure 1 summarizes the results of applying the single-objective genetic algorithm (SGA), MOGA-MDNet, and PMM to the Students and Medical Innovation networks. First, analyzing the SGA results, we observed that the
result of optimizing a single dimension with the GA is a modular community structure that maximizes Modularity only on the selected dimension; when we calculate the Modularity of the detected structure on the other dimensions, we find that it is low or less modular there. This is because SGA optimizes using the information of only one dimension, i.e., its connectivity pattern. This observation is clearly visible in the results for the Students network (Fig. 1a). MOGA overcomes this problem by using the information of all dimensions in the optimization process. As we can observe, MOGA-MDNet is able to detect a community structure that simultaneously maximizes Modularity in all network dimensions. The PMM algorithm was able to find a community structure that maximizes Modularity in all dimensions for the Medical Innovation network; however, its result was not as high as that of MOGA-MDNet on the same network. For the Students network, PMM fails to find a community structure that maximizes Modularity in all dimensions: as Fig. 1a shows, the PMM result has a high Modularity value in the third dimension and low Modularity in the other dimensions compared to MOGA-MDNet.

A final note regarding the result of MOGA-MDNet: the goal of MOGA is to find an optimized structure that maximizes the objective function in all dimensions. Looking at the Students network when applying MOGA-MDNet and SGA on the first dimension D1, we observe that the average Modularity in D1 was 0.8656 using MOGA-MDNet and 0.9739 using SGA-D1. This drop in Modularity in D1 is compensated by an increase in Modularity in the other dimensions, which is our ultimate goal. Table 3 shows in detail the average Modularity values obtained by SGA, MOGA-MDNet, and PMM for the Students and Medical Innovation networks.

Fig. 1: Average Modularity results obtained by SGA, MOGA-MDNet and PMM for the Students and Medical Innovation networks.

Table 3: Average Modularity results obtained by SGA, MOGA-MDNet and PMM for the Students and Medical Innovation networks.

Network               Method          D1        D2        D3
Students network      SGA - D1        0.9739    0.551     0.2891
                      SGA - D2        0.0466    0.9225    0.144
                      SGA - D3        0.1248    0.4612    0.8501
                      PMM (k = 19)    0.3842    0.5042    0.8657
                      MOGA-MDNet      0.8656    0.5933    0.8856
Medical Innovation    SGA - D1        0.712     0.57      0.476
                      SGA - D2        0.62      0.732     0.549
                      SGA - D3        0.415     0.474     0.751
                      PMM (k = 5)     0.535     0.52      0.52
                      MOGA-MDNet      0.699     0.743     0.776

Regarding the YouTube network, we first analyzed the SGA results using each dimension separately. We observed that the network dimensions conflict with each other to some extent, i.e., maximizing Modularity on one dimension leads to poor Modularity on the conflicting dimensions. For example, maximizing Modularity on D1 leads to a low Modularity value in D5, and vice versa: maximizing Modularity on D5 leads to a poor Modularity value in D1. This makes it a challenge for any community detection algorithm to detect a shared community structure that maximizes Modularity in each dimension as far as possible. Table 4 summarizes the results of SGA, MOGA-MDNet, and PMM for the YouTube network. As we can observe, MOGA-MDNet is able to detect a good community structure that maximizes Modularity in most dimensions.
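The compensation effect discussed for the Students and Medical Innovation networks can be checked with simple arithmetic on the Table 3 averages. The sketch below sums the reported Modularity values over the three dimensions for each method; the values are copied from Table 3, while the use of the summed modularity as a single summary score here is an assumption borrowed from the selection rule of Section 5.1.

```python
# Average Modularity per dimension (D1, D2, D3), copied from Table 3.
table3 = {
    "Students": {
        "SGA - D1": (0.9739, 0.551, 0.2891),
        "SGA - D2": (0.0466, 0.9225, 0.144),
        "SGA - D3": (0.1248, 0.4612, 0.8501),
        "PMM (k = 19)": (0.3842, 0.5042, 0.8657),
        "MOGA-MDNet": (0.8656, 0.5933, 0.8856),
    },
    "Medical Innovation": {
        "SGA - D1": (0.712, 0.57, 0.476),
        "SGA - D2": (0.62, 0.732, 0.549),
        "SGA - D3": (0.415, 0.474, 0.751),
        "PMM (k = 5)": (0.535, 0.52, 0.52),
        "MOGA-MDNet": (0.699, 0.743, 0.776),
    },
}

# Summed modularity over all dimensions for each method.
totals = {
    net: {method: round(sum(values), 4) for method, values in rows.items()}
    for net, rows in table3.items()
}
# Method with the largest summed modularity in each network.
best = {net: max(scores, key=scores.get) for net, scores in totals.items()}
```

On the Students network the sum is 1.8140 for SGA-D1 and 2.3445 for MOGA-MDNet, so the loss of about 0.11 in D1 is outweighed by the gains in D2 and D3; MOGA-MDNet also has the largest sum on the Medical Innovation network (2.218).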
5.3 Further Analysis
In the previous section our aim was to find the community structure that simultaneously maximizes modularity in all network dimensions; however, with respect to a certain query, different relations might have different importance. To find a community structure that takes into account the importance of certain relation(s), we first need to identify which relation(s) play an important role in such a structure. MOGA-MDNet is then applied to return the set of Pareto-optimal solutions, this set is examined, and the community structure that best maximizes Modularity in the dimension(s) representing these relation(s) is selected.
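The selection step described above can be sketched as a small helper that picks one solution from a Pareto set. The function name `select_solution`, the toy Pareto set, and the choice to rank by the sum of Modularity over the selected dimensions are assumptions made for illustration; the paper only states that the structure that best maximizes Modularity in the relevant dimension(s) is chosen.

```python
from typing import List, Optional, Sequence


def select_solution(front: List[Sequence[float]],
                    dims: Optional[Sequence[int]] = None) -> int:
    """Index of the solution with the largest summed modularity over `dims`.

    `front` holds one modularity vector per Pareto-optimal solution;
    `dims=None` means all dimensions count equally.
    """
    def score(objs: Sequence[float]) -> float:
        idx = range(len(objs)) if dims is None else dims
        return sum(objs[i] for i in idx)

    return max(range(len(front)), key=lambda k: score(front[k]))


# Hypothetical Pareto set for a 3-dimensional network (illustrative values only).
front = [(0.80, 0.30, 0.40), (0.60, 0.55, 0.50), (0.35, 0.70, 0.65)]

overall = select_solution(front)             # all dimensions
first_only = select_solution(front, [0])     # query that cares only about dimension 0
first_two = select_solution(front, [0, 1])   # query that cares about dimensions 0 and 1
```

With these toy values the three queries select three different solutions (indices 2, 0, and 1), which shows how a single run of the multi-objective algorithm can serve queries that weight relations differently.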
Table 4: Average Modularity results obtained by SGA, MOGA-MDNet and PMM for Youtube network.

Method          D1       D2       D3       D4       D5
SGA - D1        0.428    0.061    0.071    0.133    0.017
SGA - D2        0.2      0.228    0.052    0.085    0.032
SGA - D3        0.211    0.066    0.193    0.13     0.031
SGA - D4        0.267    0.061    0.088    0.328    0.02
SGA - D5        0.082    0.041    0.046    0.032    0.081
PMM (k = 10)    0.47     0.079    0.104    0.153    0.032
PMM (k = 20)    0.398    0.055    0.063    0.1      0.025
MOGA-MDNet      0.341    0.108    0.128    0.234    0.028

6 Conclusion and Future Work
This paper presented a multi-objective genetic algorithm for detecting communities in multidimensional networks. The method aims to find a community structure that simultaneously maximizes modularity, as an objective function, in all network dimensions, so the number of objectives to be optimized equals the number of dimensions. Experiments on synthetic and real-world networks showed the ability of the method to detect community structures that maximize modularity across all network dimensions. The algorithm has the advantage that it does not need any prior knowledge about the number of communities; it also keeps dimensions with intensive interaction from overriding the structural information in other dimensions. In addition, the algorithm can detect a shared community structure that takes into account the importance of certain relation(s). As the size of real-world social networks increases continuously, improving the scalability of the proposed method will be investigated in the future. We also plan to extend the algorithm to discover overlapping communities.
References
1. Stanley Wasserman. Social network analysis: Methods and applications, volume 8. Cambridge University Press, 1994.
2. Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
3. Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
4. Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
5. Michel Plantié and Michel Crampes. Survey on social community detection. In Social Media Retrieval, pages 65–85. Springer, 2013.
6. Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical Review E, 72(2):027104, 2005.
7. Jian Liu and Tingzhan Liu. Detecting community structure in complex networks using simulated annealing with k-means algorithms. Physica A: Statistical Mechanics and its Applications, 389(11):2300–2309, 2010.
8. Ahmed Ibrahem Hafez, Eiman Tamah Al-Shammari, Aboul Ella Hassanien, and Aly A Fahmy. Genetic algorithms for multi-objective community detection in complex networks. In Social Networks: A Framework of Computational Intelligence, pages 145–171. Springer, 2014.
9. Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale, and Dino Pedreschi. Multidimensional networks: foundations of structural analysis. World Wide Web, 16(5-6):567–593, 2013.
10. Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale, and Dino Pedreschi. Foundations of multidimensional network analysis. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 485–489. IEEE, 2011.
11. Michele Berlingerio, Michele Coscia, and Fosca Giannotti. Finding and characterizing communities in multidimensional networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 490–494. IEEE, 2011.
12. Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau, and Christian Gagné. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1):2171–2175, 2012.
13. Deng Cai, Zheng Shao, Xiaofei He, Xifeng Yan, and Jiawei Han. Community mining from multi-relational networks. In Knowledge Discovery in Databases: PKDD 2005, pages 445–452. Springer, 2005.
14. Lei Tang and Huan Liu. Uncovering cross-dimension group structures in multidimensional networks. In SDM Workshop on Analysis of Dynamic Networks, 2009.
15. YoungJa Park and ManSuk Song. A genetic algorithm for clustering problems. In Proceedings of the Third Annual Conference on Genetic Programming, pages 568–575, 1998.
16. TH Cormen, CE Leiserson, RL Rivest, and C Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2003.
17. Clara Pizzuti. GA-Net: A genetic algorithm for community detection in social networks. In Parallel Problem Solving from Nature–PPSN X, pages 1081–1090. Springer, 2008.
18. Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
19. Scott White and Padhraic Smyth. A spectral clustering approach to finding communities in graph. In SDM, volume 5, pages 76–84. SIAM, 2005.
20. Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
21. Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
22. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and Tanaka Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. Lecture Notes in Computer Science, 1917:849–858, 2000.
23. Michael Fire, Gilad Katz, Yuval Elovici, Bracha Shapira, and Lior Rokach. Predicting student exam’s scores by analyzing social network data. In Active Media Technology, pages 584–595. Springer, 2012.
24. James Samuel Coleman, Elihu Katz, Herbert Menzel, et al. Medical innovation: A diffusion study. Bobbs-Merrill, Indianapolis, 1966.