J Supercomput DOI 10.1007/s11227-017-2063-1
Adapting the TopLeaders algorithm for dynamic social networks Wenhao Gao1 · Wenjian Luo1 · Chenyang Bu1
© Springer Science+Business Media New York 2017
Abstract Evolutionary community discovery is an active research topic for dynamic (temporal) social networks. The communities detected in a dynamic network should give a reasonable partition of the current network without deviating drastically from the previous ones. This paper is an extended version of our previous work in Gao et al. (in: Proceedings of the 2016 international conference on big data and smart computing (BigComp), pp 53–60, 2016). First, an evolutionary community discovery algorithm named EvoLeaders, inspired by the TopLeaders algorithm, is proposed. Second, an improved TopLeaders algorithm (i.e., AutoLeaders) is proposed. Experiments on three classic data sets show that AutoLeaders correctly finds the number of communities and, at the same time, discovers reasonable communities. Third, the EvoAutoLeaders algorithm is proposed for detecting communities in a dynamic network. Experimental results on two real-world data sets demonstrate that EvoAutoLeaders is more suitable for dynamic scenarios than the TopLeaders and EvoLeaders algorithms.
Keywords Dynamic social network · Community discovery · Leader nodes
This paper is recommended by the BigComp 2016 conference as one of the selected papers.
Corresponding author: Wenjian Luo ([email protected])
Wenhao Gao ([email protected])
Chenyang Bu ([email protected])
1 Anhui Province Key Laboratory of Software Engineering in Computing and Communication, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, Anhui, China
W. Gao et al.
1 Introduction
A social network is a graph of relationships between individuals, where each edge represents an interaction between two individuals (e.g., email communication, cell phone communication and hyperlinks in blogs). One of the most important problems in social networks is the detection of communities. Within a community the connections are dense, while the connections between different communities are sparse [2–6]. Traditional approaches to social network analysis treat the network as a static graph, which aggregates the interactions over all time into one snapshot. However, in many real-world networks, the relationships between individuals evolve over time. If the temporal information in the network is omitted, some valuable communities may not be found and the temporal evolution of the communities cannot be detected. Recently, a growing body of literature has addressed community discovery and temporal evolution in dynamic networks [7–17]. A common model for such temporal or dynamic networks is a series of consecutive snapshots of the graph, where each snapshot corresponds to a time step. By considering the current snapshot together with the recent past snapshots, such a model is often used to detect the evolutionary communities in networks. Essentially, the detection of communities is a clustering problem over networks. One feasible clustering paradigm is evolutionary clustering, which processes time-evolving data to generate a sequence of clusterings [18]. Chakrabarti et al. [18] address the evolutionary clustering problem in the context of attributed data (rather than a graph), where cluster membership at time step t is also influenced by the clusters at recent past time steps. Their framework captures continuity with respect to clusters at previous time steps through the notion of temporal smoothness.
The framework divides the objective function into two parts: snapshot quality (Sq), which measures the clustering quality on the current snapshot, and history quality (Hq), which measures how similar the current clustering is to the previous one. Based on similar ideas, several evolutionary clustering algorithms have been proposed [7,9,19,20]. Chi et al. [7] extended the idea to graph-structured data and proposed the evolutionary spectral clustering algorithm, in which the graph cut metric is used to measure community structures and community evolution. More recently, several works based on the temporal smoothness framework have analyzed community structures and community evolution [8,9,21–23]. In this paper, the TopLeaders algorithm is adapted for dynamic social networks. TopLeaders considers each community as a set of follower nodes congregating close to a potential leader, where a leader node is the most central node in the corresponding community [24]. In [1], inspired by the TopLeaders algorithm, an evolutionary community discovery algorithm named EvoLeaders is given. This paper is an extended version of our previous work in [1]. The main contents of this paper are as follows.
1. An evolutionary community discovery algorithm named EvoLeaders, inspired by the TopLeaders algorithm, is given. By keeping the leader nodes temporally smooth, the communities detected at each time step reflect a valuable partition of the current snapshot without shifting too much from the previous ones.
2. An improved TopLeaders algorithm (i.e., AutoLeaders) is proposed. AutoLeaders can correctly find the number of communities and at the same time discover reasonable communities. Unlike the TopLeaders algorithm [24], AutoLeaders does not need a parameter that specifies the number of communities.
3. EvoAutoLeaders, which is based on the AutoLeaders algorithm, is proposed to detect the communities of a dynamic network. Experimental results on two real-world data sets demonstrate that EvoAutoLeaders is suitable for dynamic scenarios.
The rest of this paper is organized as follows. Section 2 defines the related notations. Section 3 describes the background on evolutionary clustering and the algorithms used in this paper (including TopLeaders and the community merging algorithm). Section 4 describes EvoLeaders. Section 5 describes the proposed AutoLeaders and EvoAutoLeaders algorithms in detail and provides experimental results. Finally, Sect. 6 concludes the paper briefly.
2 Notations
Let $G_t = (V_t, E_t)$ be a snapshot network at time step $t$, where $V_t = \{v_1^t, v_2^t, \ldots, v_{N_t}^t\}$ is the set of nodes in $G_t$ and $E_t$ is the set of edges in $G_t$. A dynamic network with $T$ time steps is denoted as a sequence of snapshot networks $G = \{G_1, G_2, \ldots, G_T\}$. In general, a network at time step $t$ can be represented by an adjacency matrix $W^t = (w_{ij}^t)_{N_t \times N_t}$. If there is no edge between $v_i^t$ and $v_j^t$, then $w_{ij}^t = 0$; otherwise, $w_{ij}^t = 1$. In this paper, $C_t = \{c_{t,1}, c_{t,2}, \ldots, c_{t,k_t}\}$ denotes the communities of a snapshot network $G_t$ ($1 \le t \le T$), where $k_t$ is the number of communities. The corresponding leader nodes in $G_t$ are $L_t = \{l_{t,1}, l_{t,2}, \ldots, l_{t,k_t}\}$, where $l_{t,i}$ ($1 \le i \le k_t$) is the leader node of community $c_{t,i}$.
3 Backgrounds
3.1 Evolutionary clustering
Evolutionary clustering has attracted much attention in recent years. An early classic work is that of Chakrabarti et al. [18] in 2006. They proposed the temporal smoothness framework, in which a history cost is incorporated into the cost function to ensure temporal smoothness. Chakrabarti et al. examined two classic clustering algorithms within the temporal smoothness framework: k-means and agglomerative hierarchical clustering. The goal of the framework is to optimize the following cost function:
$$C_{total} = \alpha \cdot Sq + (1 - \alpha) \cdot Hq \tag{1}$$
where $Sq$ measures the clustering quality when the solution is applied to the current snapshot, and $Hq$ measures the history quality when the solution is applied to the previous snapshot.
However, Chakrabarti et al. [18] focused only on attributed data rather than networks. The temporal smoothness framework has been applied to graph clustering by Chi et al. [7], Lin et al. [9], and others. Chi et al. [7] proposed the PCM and PCQ frameworks, which extend static spectral clustering to evolutionary clustering. The Facenet approach by Lin et al. [9] is based on Markov probability models, the Dirichlet distribution and nonnegative matrix factorization. Folino and Pizzuti [21] formulated the detection of communities with temporal smoothness as a multiobjective problem and proposed a method named DYNMOGA, which is based on the Genetic Algorithm (GA). Kim and Han [22] proposed a particle-and-density-based evolutionary clustering method to discover the evolution of communities. Tang et al. [23] used joint matrix factorization to find the community evolution in dynamic multi-mode networks.
3.2 The TopLeaders algorithm
The TopLeaders algorithm [24] regards each community as a set of follower nodes congregating close to a potential leader, where a leader node is the most central node in the corresponding community. Here, we briefly introduce the algorithm. Its first step is to find K initial leader nodes of the network, where the number of communities K can be obtained from prior knowledge of the given network or from existing algorithms such as FastModularity [25] (at a greatly increased time cost). The K community leader nodes are the K most central nodes in the network, and no two of them belong to the same community. To implement this strategy, TopLeaders starts from the most central node and adds the next most central one to the current set of leaders only if its intersection size with each existing leader is less than a predefined threshold. The centrality of a node in a community measures the relative importance of the node within the group.
For a community $C$ of size $N$, the degree centrality [26] of a node $n$ within $C$ is defined as
$$DC(n) = \frac{deg(n, C)}{N - 1} \tag{2}$$
where $deg(n, C)$ is the number of edges incident upon $n$ in $C$.

Algorithm 1. TopLeaders algorithm [24]
Input: A network G, and the number of communities
Output: leaders and corresponding communities
1: Initialize leader nodes
2: while there is a change in the leader nodes do
3:   for each node n ∈ G and n ∉ leaders do
4:     Associate n to an appropriate leader {Alg. 2}
5:   end for
6:   Pick a new leader with the highest centrality in each community to replace the old one
7: end while
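As a minimal sketch of the degree centrality in Eq. 2 and of the leader election in Step 6 of Algorithm 1, the Python snippet below may help. The adjacency-dictionary representation, the function names, the zero value for a singleton community and the tie-breaking rule (smallest node id) are assumptions made for illustration; they are not taken from [24].

```python
def degree_centrality(node, community, adjacency):
    """Degree centrality of `node` within `community` (Eq. 2).

    adjacency: dict mapping each node to the set of its neighbors.
    community: set of nodes that contains `node`.
    """
    size = len(community)
    if size < 2:
        return 0.0  # assumed convention for a singleton community
    # deg(n, C): edges incident on n whose other endpoint also lies in C
    degree_in_c = len(adjacency[node] & community)
    return degree_in_c / (size - 1)


def pick_leader(community, adjacency):
    """Most central node of a community; ties go to the smallest node id (assumption)."""
    return max(sorted(community),
               key=lambda n: degree_centrality(n, community, adjacency))


# Hypothetical toy graph: node 1 is connected to every other node.
adjacency = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
community = {1, 2, 3, 4}
print(degree_centrality(1, community, adjacency))
print(pick_leader(community, adjacency))
```

Only edges inside the community count toward the centrality, so the same node can have different centralities in different candidate communities.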
The second step is an iterative process that alternates between associating followers with appropriate leader nodes and electing new leaders. More specifically, nodes are first either associated with a leader or labeled as outliers (elaborated further in Algorithm 2); then, once all nodes have been handled, a new leader is picked in each community. The TopLeaders algorithm is described in Algorithm 1. For each node n in the network, its relation to each leader node l is measured by the number of neighbors they have in common. A node n may be tied between more than one leader node if only the neighborhood of depth 1 (which consists of the nodes directly connected to n) is considered, so different levels of neighborhood should be considered. More specifically, the neighborhood of depth 1 is assessed first. If node n is associated with more than one leader with the same number of common neighbors, the neighborhood depth of n is expanded by one. This procedure is repeated until the neighborhood depth reaches the threshold δ. Khorasgani et al. [24] pointed out that the TopLeaders algorithm is not sensitive to the value of δ. Thus, in this paper, δ is set to the same value as in [24] (i.e., δ = 2).
The process of associating a node n with its appropriate leader node is detailed in Algorithm 2. $\aleph(n, d)$ denotes the set of nodes in the neighborhood of depth d of node n; in this paper, $\aleph(n, d)$ includes node n itself. $|A|$ denotes the cardinality of the set A. To detect outliers in the network, an outlier threshold γ is introduced. For each leader node l, if the number of common nodes between l and the node n is no less than the given threshold γ, l is considered a potential leader of the node n.
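The association step described above can be sketched as follows. This is not a transcription of Algorithm 2, which is not reproduced in the text: the function names, the return convention, the choice to re-examine only the tied leaders at the next depth, and the handling of a tie that survives depth δ are all assumptions made for illustration.

```python
def build_adjacency(edges, nodes=()):
    """Undirected adjacency dict; `nodes` lets isolated nodes be included."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def neighborhood(node, depth, adjacency):
    """Nodes within `depth` hops of `node`, including `node` itself."""
    seen = {node}
    frontier = {node}
    for _ in range(depth):
        frontier = {m for n in frontier for m in adjacency[n]} - seen
        seen |= frontier
    return seen


def associate(node, leaders, adjacency, gamma=0, delta=2):
    """Return ("leader", l), ("tie", [l1, ...]) or ("outlier", None)."""
    tied = list(leaders)
    for depth in range(1, delta + 1):
        near_node = neighborhood(node, depth, adjacency)
        sizes = {l: len(near_node & neighborhood(l, depth, adjacency))
                 for l in tied}
        # leaders sharing fewer than gamma nodes are not potential leaders
        candidates = {l: s for l, s in sizes.items() if s >= gamma}
        if not candidates:
            return ("outlier", None)
        best = max(candidates.values())
        tied = sorted(l for l, s in candidates.items() if s == best)
        if len(tied) == 1:
            return ("leader", tied[0])
    return ("tie", tied)  # still ambiguous at depth delta


# Hypothetical graph: two triangles led by nodes 1 and 4, plus bridge nodes.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6),
         (3, 7), (6, 7), (7, 9), (2, 9)]
adjacency = build_adjacency(edges, nodes=[8])
print(associate(2, [1, 4], adjacency))
print(associate(7, [1, 4], adjacency))
print(associate(8, [1, 4], adjacency, gamma=1))
```

In this toy graph node 7 is tied at depth 1 and is only resolved at depth 2, which is the situation the depth threshold δ is meant to handle.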
3.3 The modularity Q and communities merging algorithm
Newman and Girvan [27] proposed a quality measure called modularity $Q$ to quantify the community quality of a network. The higher the value of $Q$, the stronger the community structure is. For convenience of calculation, Newman [28] rewrote the modularity $Q$ in terms of the adjacency matrix. The modularity of a network is expressed as
$$Q = \frac{1}{2M} \sum_{i,j \in I} \left( w_{ij} - \frac{d_i d_j}{2M} \right) \delta_{ij} \tag{3}$$
where $I = \{1, 2, \ldots, N\}$ is the set of indices of nodes in $G$; $d_i = \sum_{j \in I} w_{ij}$; $d_j = \sum_{i \in I} w_{ij}$; and $M$ is the number of edges in the un-weighted network $G$, with $M = \frac{1}{2} \sum_{i,j \in I} w_{ij}$. If nodes $v_i$ and $v_j$ are in the same community, then $\delta_{ij} = 1$; otherwise, $\delta_{ij} = 0$.
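A direct, unoptimized evaluation of Eq. 3 can be sketched in Python as below. The list-of-lists adjacency matrix, the label list and the toy graph (two triangles joined by one edge) are assumptions chosen for illustration.

```python
def modularity(w, labels):
    """Modularity Q of Eq. 3 for a symmetric 0/1 adjacency matrix `w`
    (list of lists) and one community label per node."""
    n = len(w)
    degrees = [sum(row) for row in w]
    two_m = sum(degrees)  # 2M: every edge is counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # delta_ij = 1
                q += w[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1

print(modularity(w, [0, 0, 0, 1, 1, 1]))  # the two triangles as communities
print(modularity(w, [0, 0, 0, 0, 0, 0]))  # everything in one community
```

For this graph the two-triangle partition gives Q = 5/14, while putting all nodes into a single community gives Q = 0, which follows directly from Eq. 3.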
To determine whether to merge two communities $C_p$ and $C_q$, a measure $Q_{C_p,C_q}$ was proposed in [8,25]. According to Ref. [8], let
$$B_{ij} = \frac{w_{ij}}{2M} - \frac{d_i}{2M} \cdot \frac{d_j}{2M} \tag{4}$$
Therefore,
$$Q = \sum_{i,j \in I} B_{ij} \delta_{ij} \tag{5}$$
By Eq. 5, $Q_{C_p} = \sum_{i,j \in I_p} B_{ij}$ and $Q_{C_q} = \sum_{i,j \in I_q} B_{ij}$, where $I_p$ ($I_q$) is the set of indices of nodes in $C_p$ ($C_q$). Let $C_{pq} = C_p \cup C_q$; then
$$Q_{C_p,C_q} = Q_{C_{pq}} - Q_{C_p} - Q_{C_q} = 2 \sum_{i \in I_p,\, j \in I_q} B_{ij} \tag{6}$$
$Q_{C_p,C_q} \le 0$ signifies that the overall community structure will be impaired if the two communities are merged, so the two communities should not be merged. If $Q_{C_p,C_q} > 0$, the higher $Q_{C_p,C_q}$ is, the more strongly the two communities should be merged [8].
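The merging measure of Eqs. 4 to 6 can be checked numerically with a short Python sketch. The helper names and the toy graph are assumptions for illustration; the sketch only evaluates the measure and does not reproduce the full merging procedure of [8].

```python
def b_matrix(w):
    """B_ij of Eq. 4 for a symmetric 0/1 adjacency matrix (list of lists)."""
    degrees = [sum(row) for row in w]
    two_m = sum(degrees)
    n = len(w)
    return [[w[i][j] / two_m - degrees[i] * degrees[j] / two_m ** 2
             for j in range(n)] for i in range(n)]


def community_q(b, indices):
    """Contribution of one community to Q (Eq. 5)."""
    return sum(b[i][j] for i in indices for j in indices)


def merge_gain(b, idx_p, idx_q):
    """Change in Q caused by merging two disjoint communities (Eq. 6)."""
    return 2.0 * sum(b[i][j] for i in idx_p for j in idx_q)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0] * 6 for _ in range(6)]
for a, b_node in edges:
    w[a][b_node] = w[b_node][a] = 1
b = b_matrix(w)

print(merge_gain(b, [0, 1], [2]))        # positive: node 2 belongs with 0 and 1
print(merge_gain(b, [0, 1, 2], [3, 4, 5]))  # negative: keep the triangles apart
```

The factor 2 in Eq. 6 arises because the sum over the merged community contains both the (i, j) and the (j, i) cross terms, and B is symmetric.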
The community merging algorithm in [8] is used in this paper. The details of the community merging algorithm are shown in Algorithm 3.
4 The EvoLeaders algorithm
In this section, our previous work [1] (i.e., EvoLeaders) is described. The algorithm consists of four main components: (1) getting the initial leader nodes; (2) finding the corresponding initial communities; (3) splitting the communities; (4) merging the communities [8] and updating the leader node in each community. The procedure of EvoLeaders is shown in Algorithm 4.
4.1 Getting the initial leader nodes
At each timestamp $t$ ($2 \le t \le T$), an updating strategy is proposed to get the initial leader nodes $L_t = \{l_{t,1}, l_{t,2}, \ldots, l_{t,k_t}\}$ in $G_t$. The strategy is described in Algorithm 5.
In Algorithm 5, $\aleph_t(l)$ denotes the direct neighbors of the leader node $l$ at timestamp $t$. We call the threshold λ the leader_threshold, which is set to 4 in this paper. At Step 6 in Algorithm 5, for the sake of the stability of the leader nodes, $l_{t-1,i}$ is assumed to remain a relatively central node in the current snapshot. So even if fewer than half of the nodes of $\aleph_{t-1}(l_{t-1,i})$ remain in $\aleph_t(l_{t-1,i})$, $l_{t-1,i}$ is still added to $L_t$. At Step 8, since fewer than half of the nodes of $\aleph_{t-1}(l_{t-1,i})$ remain in $\aleph_t(l_{t-1,i})$, $l_{t-1,i}$ alone is not enough to represent the remaining set of the corresponding community $c_{t-1,i}$. Direct neighbors of leader nodes are usually important nodes in a group, so the node with the highest centrality in remain_neighbors, which is the remaining set of the direct neighbors of leader node $l_{t-1,i}$, can be a useful supplement to $l_{t-1,i}$. At Step 11, if $l_{t-1,i}$ is not contained in the current network $G_t$, the node with the highest centrality in remain_neighbors is picked to represent the remaining set of the corresponding community $c_{t-1,i}$. Initializing the leader nodes in this way has two advantages. First, the new initial leader nodes $L_t$ do not deviate too much from $L_{t-1}$. Second, $L_t$ can represent the relatively important nodes in the current snapshot network $G_t$.
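The three cases discussed above (Steps 6, 8 and 11) can be sketched in Python as follows. Since Algorithm 5 itself is not reproduced in the text, this is an illustration under stated assumptions rather than the authors' procedure: remain_neighbors is taken to be the previous direct neighbors that still exist in the current snapshot, centrality is approximated by the degree in the current snapshot, and the role of the threshold λ is not modeled because the text does not specify it.

```python
def initial_leaders(prev_leaders, prev_adj, cur_adj):
    """Sketch of the leader-updating strategy (assumptions in the lead-in).

    prev_adj / cur_adj: dicts mapping node -> set of direct neighbors
    at time steps t-1 and t.
    """
    leaders = []
    for leader in prev_leaders:
        # previous direct neighbors that still exist in the current snapshot
        remain = {n for n in prev_adj[leader] if n in cur_adj}
        # most central remaining neighbor (degree in G_t, ties -> smallest id)
        backup = (max(sorted(remain), key=lambda n: len(cur_adj[n]))
                  if remain else None)
        if leader in cur_adj:
            leaders.append(leader)  # Step 6: keep the old leader for stability
            kept = prev_adj[leader] & cur_adj[leader]
            if 2 * len(kept) < len(prev_adj[leader]) and backup is not None:
                leaders.append(backup)  # Step 8: supplement a weakened leader
        elif backup is not None:
            leaders.append(backup)  # Step 11: replace a vanished leader
    return list(dict.fromkeys(leaders))  # drop duplicates, keep order


# Hypothetical snapshots: leader 1 loses most neighbors, leader 7 disappears.
prev_adj = {1: {2, 3, 4, 5}, 7: {8, 9}}
cur_adj = {1: {2}, 2: {1}, 3: {4, 6}, 4: {3, 6},
           6: {3, 4, 9}, 8: {9}, 9: {6, 8}}
print(initial_leaders([1, 7], prev_adj, cur_adj))
```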
4.2 Splitting the communities
At the stage of splitting the communities, for each initial community $c_{t,i}$, the nodes that did not belong to the corresponding community at the last timestamp are isolated. Details of the community splitting algorithm are shown in Algorithm 6. In the input of Algorithm 6, each value in the source label set $S_{label}$ of the initial communities $C_t$ is assigned the corresponding value in the source label set $S_{label}$ of the initial leader nodes $L_t$. The source label set $S_{label}$ is one of the outputs of Algorithm 4. The community splitting algorithm is applied for two reasons. (1) After the initial communities $C_t = \{c_{t,1}, c_{t,2}, \ldots, c_{t,k_t}\}$ are found, each community $c_{t,i}$ ($1 \le i \le k_t$) may contain some nodes that were not in the corresponding community at the last snapshot. Isolating these nodes helps to ensure continuity with respect to the communities at the previous timestamp. (2) Isolating these nodes also gives them a chance to be merged into another, more appropriate community at the following community merging stage.
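A rough Python sketch of the splitting idea is given below. Algorithm 6 is not reproduced in the text, so the details are assumptions: communities are matched to their predecessors through a shared source label, an "isolated" node is represented as a singleton community, and a community without a predecessor is split completely.

```python
def split_communities(initial, previous):
    """Isolate nodes that were not in the corresponding previous community.

    initial:  dict source_label -> set of nodes (initial communities at t)
    previous: dict source_label -> set of nodes (communities at t-1)
    Returns a list of node sets.
    """
    result = []
    for label, members in initial.items():
        prev_members = previous.get(label, set())
        kept = members & prev_members
        if kept:
            result.append(kept)  # the part that is continuous with t-1
        for node in sorted(members - prev_members):
            result.append({node})  # each newcomer becomes a singleton
    return result


# Hypothetical labels "a" and "b" link communities across the two snapshots.
initial = {"a": {1, 2, 3, 4}, "b": {5, 6}}
previous = {"a": {1, 2, 9}, "b": {5, 6, 7}}
print(split_communities(initial, previous))
```

The singletons produced here are the nodes that the subsequent merging stage may reattach to whichever community yields a positive merging measure.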
4.3 Two versions of EvoLeaders
In Algorithm 2, a leader node l can be added to CList only if it has the maximum number of common neighbors of depth 1 with node n. We call the EvoLeaders algorithm that takes this strategy EvoLeaders1. However, when the initial leader nodes obtained by Algorithm 5 are not accurate, EvoLeaders1 may associate node n with the wrong leader node. So we add to CList some leader nodes that do not have the maximum number of common neighbors of depth 1 with node n but may still be appropriate leaders of n. More specifically, in Algorithm 2, the leader nodes that have more than max_size − (5% · max_size + 1) common neighbors of depth 1 with node n are assigned to CList, where max_size is the maximum number of common neighbors of depth 1 with node n. We call the EvoLeaders algorithm that takes this strategy EvoLeaders2. Additionally, in both versions of EvoLeaders, we slightly modify Step 10 in Algorithm 2: n is associated with one of the leaders in CList chosen at random.
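The difference between the two candidate rules can be illustrated with a small Python sketch. The function name, the dictionary of common-neighbor counts and the example numbers are hypothetical; only the bound max_size − (5% · max_size + 1) and the random final choice come from the text.

```python
import random


def candidate_leaders(common_sizes, relaxed=False):
    """Build CList from depth-1 common-neighbor counts.

    common_sizes: dict leader -> number of common neighbors with node n.
    relaxed=False follows the EvoLeaders1 rule, relaxed=True the EvoLeaders2 rule.
    """
    max_size = max(common_sizes.values())
    if relaxed:
        bound = max_size - (0.05 * max_size + 1)
        return sorted(l for l, s in common_sizes.items() if s > bound)
    return sorted(l for l, s in common_sizes.items() if s == max_size)


# Hypothetical counts for three leaders.
sizes = {"A": 10, "B": 9, "C": 8}
strict = candidate_leaders(sizes)                # only the maximum
relaxed = candidate_leaders(sizes, relaxed=True)  # bound is 10 - 1.5 = 8.5
print(strict, relaxed)

# Modified Step 10: pick one candidate at random.
chosen = random.choice(relaxed)
print(chosen)
```

Because the final choice is random, repeated runs can give different communities, which is consistent with averaging the EvoLeaders results over several runs.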
5 The EvoAutoLeaders algorithm and experimental results
We detail our extended work in this section, which includes two aspects. First, an improved TopLeaders algorithm (i.e., AutoLeaders) is proposed to determine the number of communities automatically. Second, the EvoAutoLeaders algorithm is proposed to detect the communities of a dynamic network.
5.1 An improved TopLeaders algorithm
5.1.1 Algorithm descriptions of AutoLeaders
The major weakness of TopLeaders [24] is that the number of leader nodes (i.e., the number of communities) must be set manually or obtained from another algorithm. To determine the number of communities automatically, an improved TopLeaders algorithm named AutoLeaders (Algorithm 8) is proposed. AutoLeaders starts by initializing some leader nodes (i.e., Algorithm 7). Algorithm 7 is inspired by the observation that a leader node is usually the most central node within its neighborhood and has few edges to other leader nodes.
Fig. 1 An example of initializing leader nodes in AutoLeaders
In Algorithm 7, $\aleph(n)$ denotes the set of neighbors of node n; in this paper, $\aleph(n)$ includes node n itself. We illustrate the principle of Algorithm 7 with a simple example in Fig. 1, assuming η is 0.5. Node 9 has the highest degree in $\aleph(9)$, so node 9 can be selected as a leader node. Obviously, the other nodes except node 3 have no chance of being a leader. Node 3 has the second highest degree in $\aleph(3)$ (the highest is node 9), and $|\aleph(3) \cap \aleph(9)| = 2$, $|\aleph(3)| = 5$, so $\frac{|\aleph(3) \cap \aleph(9)|}{|\aleph(3)|} = 0.4 < \eta$. So node 3 can also be selected as a leader node. From Fig. 1, it is easy to see that the detected leader nodes are in accord with the ground truth.
Because multiple levels of neighbors are considered in Algorithm 2, some nodes in a community may be disconnected from the rest of the community. At Step 6 in Algorithm 8, if some nodes in a community are disconnected within the community, no leader is picked in this community. After the iterations (Steps 2–7), each node is associated with a leader and the initial communities are formed. If some communities can be merged to improve the community quality, Algorithm 3 is applied at Step 8. At the end of the algorithm, the remaining hub nodes are assigned to the first appropriate leader according to their direct common neighbors. To avoid having to provide the number of communities, we introduce two parameters (η and θ). Although this increases the number of parameters, the ranges of η and θ are limited and good results can be obtained through slight adjustment of the two parameters. Based on the experimental results, we find that η can be set to 0.5 or 0.6 and θ can be set between 6 and 10.
5.1.2 Experiments about AutoLeaders
In this subsection, we demonstrate the effectiveness of AutoLeaders by comparing its results with those of TopLeaders on three well-known data sets. These data sets are used in [24] to demonstrate the performance of TopLeaders.
A. Data Sets
Karate Club: the data set is from the "Karate Club" study of Zachary [29]. It includes 34 nodes, each representing a member of the club. Edges between nodes represent relations between members. Because of a disagreement between the administrator and the teacher of the club, the club split into two small communities.
Strike: the data set is a communication network of 24 employees in a sawmill [24]. An edge between two nodes means that the two employees discussed with each other very often. This data set is usually divided into three groups.
Football: the data set is the schedule of American football games in 2006 [30]. It includes 180 nodes (115 nodes in 11 communities, 4 hubs and 61 outliers), connected by 788 edges.
B. Evaluation metrics
The detected communities are evaluated by comparing them with the ground truth and by computing their modularity Q. To compare with the ground truth, two measures are employed: purity and the Adjusted Rand Index (ARI). For data sets with outliers, the outliers are treated as one community when computing purity and ARI. Assume a network $G = (V, E)$, detected communities $C = \{c_1, c_2, \ldots, c_k\}$ and ground truth $R = \{r_1, r_2, \ldots, r_{k'}\}$.
Purity: purity is the ratio of correctly assigned nodes [31]. Its value ranges from 0 to 1. The higher the purity, the more the detected communities agree with the ground truth. The purity is defined in Eq. 7, where $n = |V|$.
$$purity(C, R) = \frac{1}{n} \sum_{j} \max_{i} |c_j \cap r_i| \tag{7}$$
Adjusted Rand Index (ARI): the ARI [32] ranges from −1 to 1, and the detected communities are in full agreement with the ground truth when the ARI equals 1. Using a contingency table, the ARI is given by Eq. 8.
$$ARI = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{n_{i.}}{2} \cdot \sum_{j} \binom{n_{.j}}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{n_{i.}}{2} + \sum_{j} \binom{n_{.j}}{2} \right] - \left[ \sum_{i} \binom{n_{i.}}{2} \cdot \sum_{j} \binom{n_{.j}}{2} \right] / \binom{n}{2}} \tag{8}$$
where $n = |V|$, and $n_{ij}$ is the number of nodes assigned to both $c_i$ and $r_j$. $n_{i.} = \sum_j n_{ij}$ is the number of nodes assigned to $c_i$, and $n_{.j} = \sum_i n_{ij}$ is the number of nodes assigned to $r_j$.
Modularity: the modularity is used to evaluate the quality of the obtained communities when there is no ground truth. The definition of the modularity is given in Eq. 3. Similar to the experimental comparisons in [24], for networks with discovered outliers, we compute the modularity without the outliers, purely for a fair comparison.
C. Experimental results
The parameter η is set to 0.5 and θ to 6. There are no outliers in the Karate and Strike data sets, so the parameter γ is set to 0 for them. The Football data set has outliers, so γ is set to 3 for this data set. The visualized results of the communities detected by AutoLeaders are shown in Fig. 2. Table 1 shows a comparison between AutoLeaders and the TopLeaders algorithm. The TopLeaders algorithm needs the number of communities as an input parameter, so the actual number of communities of each data set is given to TopLeaders as input. The experimental results demonstrate that the proposed AutoLeaders correctly discovers the number of communities in each data set. For the Karate and Strike data sets, the communities detected by AutoLeaders are identical to the ground truth. For the Football data set, the modularity and purity of AutoLeaders are higher than those of TopLeaders, while the ARI of AutoLeaders is slightly lower. In general, without needing the number of communities as input, the proposed AutoLeaders algorithm still discovers reasonable communities.
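The purity (Eq. 7) and ARI (Eq. 8) used in this comparison can be sketched in Python as follows. Representing communities as lists of node sets and the small example partitions are assumptions for illustration; the sketch does not guard against the degenerate case in which the ARI denominator is zero.

```python
from math import comb


def contingency(detected, truth):
    """n_ij: number of nodes in detected community i and true group j."""
    return [[len(c & r) for r in truth] for c in detected]


def purity(detected, truth):
    """Eq. 7: each detected community is credited with its best-matching group."""
    n = sum(len(c) for c in detected)
    return sum(max(row) for row in contingency(detected, truth)) / n


def adjusted_rand_index(detected, truth):
    """Eq. 8, computed from the contingency table."""
    table = contingency(detected, truth)
    n = sum(sum(row) for row in table)
    sum_ij = sum(comb(x, 2) for row in table for x in row)
    sum_i = sum(comb(sum(row), 2) for row in table)        # row totals n_i.
    sum_j = sum(comb(sum(col), 2) for col in zip(*table))  # column totals n_.j
    expected = sum_i * sum_j / comb(n, 2)
    return (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)


# Hypothetical six-node example in which node 4 is misplaced.
truth = [{1, 2, 3}, {4, 5, 6}]
detected = [{1, 2, 3, 4}, {5, 6}]
print(purity(detected, truth), adjusted_rand_index(detected, truth))
print(purity(truth, truth), adjusted_rand_index(truth, truth))
```

Purity never decreases when communities are split further, whereas the ARI corrects for chance agreement, which is one reason the two measures can rank methods differently.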
Fig. 2 Results on the three data sets discovered by AutoLeaders. a Karate, b Strike, and c Football
Table 1 The comparison of results in three data sets
Data set          Method            k     ARI     Purity  Modularity
Karate K = 2      TopLeaders (2)    -     1.0     1.0     0.371
                  AutoLeaders       2     1.0     1.0     0.371
Strike K = 3      TopLeaders (3)    -     1.0     1.0     0.548
                  AutoLeaders       3     1.0     1.0     0.548
Football K = 11   TopLeaders (11)   -     0.988   0.977   0.513
                  AutoLeaders       11    0.972   0.994   0.565
Column k indicates the number of communities obtained by AutoLeaders. Bold values in each column denote the best values of evaluation metrics
5.2 EvoAutoLeaders
5.2.1 Algorithm descriptions of EvoAutoLeaders
In this subsection, EvoAutoLeaders is proposed to detect the communities of a dynamic network. The details of EvoAutoLeaders are described in Algorithm 9. For dynamic networks, the communities detected at the current time step should stay smooth with respect to the previous ones. In EvoAutoLeaders, two strategies are adopted to ensure temporal smoothness. The first strategy (Step 4 in Algorithm 9) is based on the observation that leader nodes stay relatively stable in a dynamic network and that previous leader nodes are also relatively central nodes in the current snapshot network. Specifically, the initial leaders at time step $t$ consist of the leader nodes found by Algorithm 7 and the previous leader nodes $L_{t-1}$. In the second strategy (Step 7 in Algorithm 9), the intersection size of node n and a leader node l is determined not only by their common neighbors in the current network $G_t$, but also by their common neighbors in $G_{t-1}$. Specifically, the intersection size of node n and a leader node l in the neighborhood of depth d is defined in Eq. 9.
$$Intersection(n, l, d) = \alpha \cdot |\aleph(n, d) \cap \aleph(l, d)|_{G_t} + (1 - \alpha) \cdot |\aleph(n, d) \cap \aleph(l, d)|_{G_{t-1}} \tag{9}$$
Fig. 3 A toy example of evolutionary communities (panels: time step t − 1 and time step t)
We illustrate the second strategy with a toy example in Fig. 3, which shows the relationships among nine nodes at time steps t − 1 and t. At time step t − 1, node 1 and node 6 are the leader nodes. According to the number of common neighbors with the leader nodes, the nine nodes are divided into two communities, colored blue and red, respectively. Assume α is 0.7. At time step t, the edge between node 2 and node 3 is gone. As a result, considering only the current snapshot, node 3 could be associated with either leader node 1 or leader node 6. However, according to Eq. 9, Intersection(3, 1, 1) = 0.7 · 2 + (1 − 0.7) · 3 = 2.3 and Intersection(3, 6, 1) = 0.7 · 2 + (1 − 0.7) · 2 = 2. So node 3 should be associated with node 1, such that the communities detected at time step t are consistent with the previous ones.
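The calculation in this toy example can be reproduced with a short Python sketch of Eq. 9 at depth 1. The edge lists below are hypothetical: they are not read off Fig. 3, and were only chosen so that the common-neighbor counts match the numbers quoted in the text (2 and 3 for leader 1, 2 and 2 for leader 6).

```python
def build_adjacency(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def closed_neighbors(node, adjacency):
    """Depth-1 neighborhood that includes the node itself."""
    return adjacency.get(node, set()) | {node}


def intersection(n, l, adj_cur, adj_prev, alpha):
    """Eq. 9 for depth d = 1."""
    cur = len(closed_neighbors(n, adj_cur) & closed_neighbors(l, adj_cur))
    prev = len(closed_neighbors(n, adj_prev) & closed_neighbors(l, adj_prev))
    return alpha * cur + (1 - alpha) * prev


# Hypothetical snapshots; the only change at time t is the loss of edge 2-3.
edges_prev = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3),
              (3, 7), (3, 8), (6, 7), (6, 8), (6, 9)]
edges_cur = [e for e in edges_prev if e != (2, 3)]
adj_prev, adj_cur = build_adjacency(edges_prev), build_adjacency(edges_cur)

score_1 = intersection(3, 1, adj_cur, adj_prev, alpha=0.7)
score_6 = intersection(3, 6, adj_cur, adj_prev, alpha=0.7)
print(round(score_1, 4), round(score_6, 4))
```

With α = 1 the two scores would tie at 2, so it is the history term that breaks the tie in favor of the previous leader.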
5.2.2 Experimental results
In this subsection, EvoAutoLeaders and EvoLeaders (including EvoLeaders1 and EvoLeaders2) are tested and compared on two real-world data sets (the Enron email data set and the Catalano social network). Their performance is also compared with that of AutoLeaders and the TopLeaders algorithm [24].
A. Data Sets
Enron email data set: the Enron email data set contains the emails exchanged between employees of the Enron Corporation from 1991 to 2002. The original data set [33] contains around 517,431 emails of 151 users. In the experiment, we adopt a clean version [34] of this data set described in [35], which contains a subset of 252,759 emails of the 151 staff members. We concentrate on the year 2001 since it contains the largest number of emails. We divide it into 12 subsets by month, and for each subset a graph reflecting the relationship between each pair of staff members is constructed.
Catalano social network: the Catalano social network was originally used in the IEEE VAST 2008 Challenge [36]. The network is a set of cell phone call records from Isla Del Sueño over a ten-day period in June 2006, narrowed to about 400 unique cell phones during this period. These records are used to build a sequence of snapshot networks, one per day, where each node represents a unique cell phone and an edge between two cell phones is created if any phone call between them occurs during that day.
B. Evaluation metrics
There is no ground truth for the two real-world data sets, so three kinds of modularity are used to measure the quality of the community structure [7,8,10].
Snapshot Modularity (SQ): the first measure, Snapshot Modularity (SQ), computes the modularity Q (Eq. 3) of the communities $C_t$ detected at the current snapshot on the current network $G_t$:
$$SQ = Q(G_t, C_t) \tag{10}$$
History Modularity (HQ): the second measure, History Modularity (HQ) [8], evaluates the modularity Q of the communities $C_t$ with respect to the last snapshot network $G_{t-1}$. A high HQ means that the community structure does not shift drastically from the last timestamp to the current timestamp:
$$HQ = Q(G_{t-1}, C_t) \tag{11}$$
Dynamic Modularity (DQ): in a dynamic network, the community structure detected at one snapshot should be not only a good partition for that snapshot, but also a reasonable partition for the previous snapshot [10]. The third measure, Dynamic Modularity (DQ) [10], is a trade-off between SQ and HQ. A higher DQ means a higher overall community quality:
$$DQ = \alpha \cdot SQ + (1 - \alpha) \cdot HQ \tag{12}$$
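Eq. 12 is a simple weighted average, which the following Python sketch makes concrete. The function name is an assumption; the input values are the SQ and HQ figures reported for the Enron email data set at time step 2, and α = 0.7 is the trade-off used in the experiments.

```python
def dynamic_modularity(sq, hq, alpha=0.7):
    """Eq. 12: trade-off between snapshot and history modularity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sq + (1.0 - alpha) * hq


# Reported values for the Enron email data set, time step 2.
dq_evoauto = dynamic_modularity(sq=0.6551, hq=0.6475)  # EvoAutoLeaders
dq_auto = dynamic_modularity(sq=0.6347, hq=0.6221)     # AutoLeaders
print(round(dq_evoauto, 4), round(dq_auto, 4))
```

Rounded to four decimals, these agree with the DQ values reported for the two methods at that time step (0.6528 and 0.6309).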
C. Experimental results
In AutoLeaders and EvoAutoLeaders, the parameter η is 0.6 and θ is 10 for both real-world data sets. In EvoLeaders and TopLeaders, the numbers of communities at the first time step of the Enron email data set and the Catalano social network are 6 and 10, respectively, and the threshold θ is set to 6 for the Enron email data set and 10 for the Catalano social network. Because we do not want to detect any outliers in the two data sets, the parameter γ is set to 0. The trade-off parameter α is set to 0.7. The results of EvoLeaders and TopLeaders are averaged over ten runs.¹ The experimental results on the Enron email data set are reported in Tables 2, 3 and 4, and the experimental results on the Catalano social network are shown in Tables 5, 6 and 7. To make the differences easier to see, the results for Dynamic Modularity are also shown in Figs. 4 and 5, respectively. From Figs. 4, 5 and Tables 2, 3, 4, 5, 6 and 7, the proposed AutoLeaders and EvoAutoLeaders outperform TopLeaders and EvoLeaders on both real-world data sets. At the same time, EvoAutoLeaders outperforms AutoLeaders on both real-world data sets. EvoLeaders attains a slightly higher Dynamic Modularity than TopLeaders in most cases over both data sets. The experimental results demonstrate that the proposed
¹ At Step 10 of Algorithm 2: associate n to one of leaders in CList randomly.
Table 2 The Dynamic Modularity (DQ) of the Enron email data set in 2001
Method          2       3       4       5       6       7       8       9       10      11      12
EvoAutoLeaders  0.6528  0.6508  0.6359  0.5444  0.6569  0.6221  0.5190  0.5817  0.5151  0.5580  0.6043
AutoLeaders     0.6309  0.5437  0.5808  0.4681  0.6168  0.5634  0.4471  0.5257  0.4588  0.5463  0.5656
TopLeaders      0.5915  0.5287  0.5419  0.3696  0.5161  0.4453  0.4284  0.4221  0.4061  0.4174  0.5132
EvoLeaders1     0.6107  0.5609  0.6216  0.4203  0.5239  0.5561  0.4375  0.4745  0.4526  0.4860  0.5310
EvoLeaders2     0.5828  0.5633  0.5753  0.4038  0.5157  0.5481  0.3960  0.4854  0.4387  0.4655  0.5211
Bold values in each column denote the best values of evaluation metrics

Table 3 The Snapshot Modularity (SQ) of the Enron email data set in 2001
Time step       1       2       3       4       5       6       7       8       9       10       11      12
EvoAutoLeaders  0.6168  0.6551  0.6467  0.6409  0.5413  0.6609  0.6203  0.5212  0.5830  0.5153   0.5588  0.6073
AutoLeaders     0.6168  0.6347  0.5441  0.5901  0.4644  0.6206  0.5652  0.4527  0.5281  0.4596   0.5465  0.5685
TopLeaders      0.5917  0.6323  0.5483  0.5752  0.3962  0.6059  0.4924  0.4138  0.4514  0.3906   0.4541  0.5699
EvoLeaders1     0.5917  0.6350  0.5761  0.6279  0.3861  0.5824  0.5752  0.4203  0.5061  0.4403   0.5063  0.5529
EvoLeaders2     0.5917  0.5989  0.5576  0.5737  0.3759  0.5825  0.5612  0.3665  0.5198  0.42287  0.4842  0.5508
Bold values in each column denote the best values of evaluation metrics

Table 4 The History Modularity (HQ) of the Enron email data set in 2001
Time step       2       3       4       5       6       7       8       9       10      11      12
EvoAutoLeaders  0.6475  0.6602  0.6241  0.5517  0.6475  0.6262  0.5140  0.5787  0.5147  0.5562  0.5974
AutoLeaders     0.6221  0.5427  0.5592  0.4766  0.6077  0.5592  0.4339  0.5200  0.4569  0.5459  0.5587
TopLeaders      0.4961  0.4829  0.4640  0.3074  0.3065  0.3356  0.4623  0.3537  0.4424  0.3319  0.3807
EvoLeaders1     0.5494  0.5256  0.6069  0.5003  0.3872  0.5117  0.4775  0.4009  0.4811  0.4388  0.4798
EvoLeaders2     0.5452  0.5764  0.5790  0.4688  0.3600  0.5177  0.4650  0.4050  0.4756  0.4217  0.4518
Bold values in each column denote the best values of evaluation metrics

Table 5 The Dynamic Modularity (DQ) of the Catalano social network
Time step       2       3       4       5       6       7       8       9       10
EvoAutoLeaders  0.6617  0.6440  0.6600  0.6546  0.6602  0.6478  0.6176  0.6606  0.6501
AutoLeaders     0.6404  0.6228  0.6170  0.6177  0.6270  0.6229  0.6005  0.6126  0.6196
TopLeaders      0.4713  0.4438  0.4876  0.4668  0.4684  0.4363  0.3469  0.4338  0.4669
EvoLeaders1     0.4991  0.4746  0.4945  0.4487  0.4623  0.4625  0.4706  0.4756  0.4727
EvoLeaders2     0.4913  0.4788  0.5022  0.4720  0.4806  0.4406  0.3702  0.4276  0.4487
Bold values in each column denote the best values of evaluation metrics

Table 6 The Snapshot Modularity (SQ) of the Catalano social network
Time step       1       2       3       4       5       6       7       8       9       10
EvoAutoLeaders  0.5847  0.6606  0.6426  0.6601  0.6550  0.6603  0.6457  0.6105  0.6612  0.6501
AutoLeaders     0.5847  0.6398  0.6206  0.6173  0.6185  0.6283  0.6221  0.5932  0.6135  0.6194
TopLeaders      0.4621  0.5229  0.4819  0.5349  0.5103  0.5147  0.4774  0.3945  0.4719  0.5171
EvoLeaders1     0.4621  0.5348  0.5006  0.5105  0.4715  0.4793  0.4859  0.4865  0.4957  0.5039
EvoLeaders2     0.4621  0.5255  0.5019  0.5221  0.4889  0.4980  0.4591  0.3789  0.4487  0.4743
Bold values in each column denote the best values of evaluation metrics

Table 7 The History Modularity (HQ) of the Catalano social network
Time step       2       3       4       5       6       7       8       9       10
EvoAutoLeaders  0.6644  0.6473  0.6598  0.6536  0.6600  0.6526  0.6341  0.6591  0.6503
AutoLeaders     0.6418  0.6282  0.6163  0.6157  0.6239  0.6247  0.6176  0.6105  0.6201
TopLeaders      0.3509  0.3549  0.3771  0.3652  0.3603  0.3404  0.2357  0.3448  0.3497
EvoLeaders1     0.4158  0.4141  0.4571  0.3956  0.4228  0.4080  0.4336  0.4288  0.4001
EvoLeaders2     0.4114  0.4248  0.4559  0.4325  0.4400  0.3974  0.3500  0.3782  0.3890
Bold values in each column denote the best values of evaluation metrics
AutoLeaders and EvoAutoLeaders can obtain better community structures than the compared algorithms, and that EvoAutoLeaders is more suitable than AutoLeaders for discovering communities in dynamic networks.
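The comparison above can be made concrete by counting, per time step, how often one algorithm attains a higher Dynamic Modularity than another. The sketch below does this for the Enron email data set; the values are transcribed from Table 2 (time steps 2 to 12), while the helper names `count_wins` and `best_per_step` are hypothetical and introduced only for illustration.

```python
# DQ values of the Enron email data set, time steps 2-12, transcribed from Table 2.
DQ_ENRON = {
    "EvoAutoLeaders": [0.6528, 0.6508, 0.6359, 0.5444, 0.6569, 0.6221,
                       0.5190, 0.5817, 0.5151, 0.5580, 0.6043],
    "AutoLeaders":    [0.6309, 0.5437, 0.5808, 0.4681, 0.6168, 0.5634,
                       0.4471, 0.5257, 0.4588, 0.5463, 0.5656],
    "TopLeaders":     [0.5915, 0.5287, 0.5419, 0.3696, 0.5161, 0.4453,
                       0.4284, 0.4221, 0.4061, 0.4174, 0.5132],
    "EvoLeaders1":    [0.6107, 0.5609, 0.6216, 0.4203, 0.5239, 0.5561,
                       0.4375, 0.4745, 0.4526, 0.4860, 0.5310],
    "EvoLeaders2":    [0.5828, 0.5633, 0.5753, 0.4038, 0.5157, 0.5481,
                       0.3960, 0.4854, 0.4387, 0.4655, 0.5211],
}

def count_wins(scores, a, b):
    """Number of time steps at which algorithm a scores strictly higher than b."""
    return sum(x > y for x, y in zip(scores[a], scores[b]))

def best_per_step(scores):
    """Name of the highest-scoring algorithm at each time step."""
    names = list(scores)
    n_steps = len(scores[names[0]])
    return [max(names, key=lambda name: scores[name][i]) for i in range(n_steps)]

print(count_wins(DQ_ENRON, "EvoAutoLeaders", "AutoLeaders"))  # 11 of 11 steps
print(count_wins(DQ_ENRON, "EvoLeaders1", "TopLeaders"))      # 11 of 11 steps
print(count_wins(DQ_ENRON, "EvoLeaders2", "TopLeaders"))      # 8 of 11 steps
print(set(best_per_step(DQ_ENRON)))                           # {'EvoAutoLeaders'}
```

On this data set, EvoAutoLeaders has the highest DQ at every time step, while EvoLeaders2 exceeds TopLeaders at 8 of the 11 steps, which is consistent with the "in most cases" wording used for EvoLeaders.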
Fig. 4 The comparison of the Dynamic Modularity in the Enron email data set
Fig. 5 The comparison of the Dynamic Modularity in the Catalano social network
6 Conclusions and future work
In this paper, we address the evolutionary community discovery problem from the perspective of leader nodes. First, by preserving the temporal smoothness of the leader nodes, we propose an evolutionary community discovery algorithm named EvoLeaders, which is inspired by the TopLeaders algorithm. Second, in order to cluster a network without prior knowledge of the number of communities, we propose an improved TopLeaders algorithm (i.e., AutoLeaders). On three classic data sets, experimental results show that AutoLeaders can correctly find the number of communities and at the same time discover reasonable communities. Third, to discover communities in dynamic networks, we propose the EvoAutoLeaders algorithm, which can ensure temporal smoothness. Experimental results on two real-world data sets demonstrate that EvoAutoLeaders achieves better performance than AutoLeaders, TopLeaders and EvoLeaders. In the future, it would be interesting to study how to handle dynamic networks in which the characteristics of leader nodes are not clear, and how to deal with dynamic data that change drastically over time.
Acknowledgements This work is partly supported by Anhui Provincial Natural Science Foundation (No. 1408085MKL07). This paper is recommended by the BigComp 2016 conference (the Third International Conference on Big Data and Smart Computing) as one of the selected papers.
References
1. Gao W, Luo W, Bu C (2016) Evolutionary community discovery in dynamic networks based on leader nodes. In: Proceedings of the 2016 International Conference on Big Data and Smart Computing (BigComp), pp 53–60
2. Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826
3. Giatsoglou M, Vakali A (2013) Capturing social data evolution using graph clustering. IEEE Internet Comput 17(1):74–79
4. Papadakis H, Panagiotakis C, Fragopoulou P (2013) Locating communities on graphs with variations in community sizes. J Supercomput 65(2):543–561
5. Lee W, Lee JJ, Kim J (2014) Social network community detection using strongly connected components. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 596–604
6. Lee W, Leung CK-S, Lee JJ (2011) Mobile web navigation in digital ecosystems using rooted directed trees. IEEE Trans Ind Electron 58(6):2154–2162
7. Chi Y et al (2007) Evolutionary spectral clustering by incorporating temporal smoothness. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 153–162
8. Guo C, Wang J, Zhang Z (2014) Evolutionary community structure discovery in dynamic weighted networks. Phys A Stat Mech Appl 413:565–576
9. Lin Y-R et al (2008) FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th International Conference on World Wide Web, pp 685–694
10. Takaffoli M, Rabbany R, Zaïane OR (2013) Incremental local community identification in dynamic social networks. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp 90–94
11. Tantipathananandh C, Berger-Wolf T, Kempe D (2007) A framework for community identification in dynamic social networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 717–726
12. Kumar R et al (2005) On the bursty evolution of blogspace. World Wide Web 8(2):159–178
13. Palla G, Barabási A-L, Vicsek T (2007) Quantifying social group evolution. Nature 446(7136):664–667
14. Hu Z et al (2015) Community level diffusion extraction. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp 1555–1569
15. Zhao Q, Bhowmick SS, Gruenwald L (2006) Cleopatra: evolutionary pattern-based clustering of web usage data. In: Proceedings of the Pacific-Asia Knowledge Discovery and Data Mining, pp 323–333
16. Sohn J-S, Chung I-J (2013) Dynamic FOAF management method for social networks in the social web environment. J Supercomput 66(2):633–648
17. Li Y et al (2016) Influential node tracking on dynamic social network: an interchange greedy approach. IEEE Trans Knowl Data Eng 29(2):359–372
18. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 554–560
19. Xu KS, Kliger M, Hero AO III (2014) Adaptive evolutionary clustering. Data Min Knowl Discov 28(2):304–336
20. Chen G, Luo W, Zhu T (2014) Evolutionary clustering with differential evolution. In: Proceedings of the 2014 IEEE Congress on Evolutionary Computation, pp 1382–1389
21. Folino F, Pizzuti C (2014) An evolutionary multiobjective approach for community discovery in dynamic networks. IEEE Trans Knowl Data Eng 26(8):1838–1852
22. Kim M-S, Han J (2009) A particle-and-density based evolutionary clustering method for dynamic networks. Proc VLDB Endow 2(1):622–633
23. Tang L et al (2008) Community evolution in dynamic multi-mode networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 677–685
24. Khorasgani RR, Chen J, Zaïane OR (2010) Top leaders community detection approach in information networks. In: 4th SNA-KDD Workshop on Social Network Mining and Analysis
25. Clauset A, Newman ME, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):066111
26. Bonacich P (1987) Power and centrality: a family of measures. Am J Sociol 92(5):1170–1182
27. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113
28. Newman ME (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036104
29. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473
30. Xu X et al (2007) Scan: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 824–833
31. Schütze H (2008) Introduction to information retrieval. In: Proceedings of the International Communication of Association for Computing Machinery Conference
32. Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. In: Proceedings of the International Conference on Artificial Neural Networks, pp 175–184
33. http://www.cs.cmu.edu/~enron/
34. http://www.cs.purdue.edu/homes/jpfeiff/enron.html
35. Shetty J, Adibi J (2004) The Enron email dataset database schema and brief statistical report. Information Sciences Institute Technical Report, University of Southern California
36. http://www.cs.umd.edu/hcil/VASTchallenge08/tasks.html