IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014
MultiComm: Finding Community Structure in Multi-Dimensional Networks
Xutao Li, Michael K. Ng, and Yunming Ye
Abstract—The main aim of this paper is to develop a community discovery scheme in a multi-dimensional network for data mining applications. In online social media, networked data consists of multiple dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of users who interact significantly on these media entities. In a co-citation network, we are interested in finding a group of authors who relate to other authors significantly on publication information in titles, abstracts, and keywords as multiple dimensions/entities in the network. The main contribution of this paper is a framework (MultiComm) that identifies a seed-based community in a multi-dimensional network by evaluating the affinity between two items in the same type of entity (same dimension) or in different types of entities (different dimensions). Our idea is to calculate the probabilities of visiting each item in each dimension, and to compare their values to generate communities from a set of seed items. To evaluate the quality of the communities generated by the proposed algorithm, we develop and study a local modularity measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data sets suggest that the proposed framework is able to find a community effectively. Experimental results also show that the proposed algorithm is more accurate than the other tested algorithms in finding communities in multi-dimensional networks.
Index Terms—Multi-dimensional networks, community, transition probability tensors, local modularity, affinity calculation
1 INTRODUCTION
Recently, there has been growing interest in studying and analyzing large networks, such as social networks, genetic networks, and co-citation networks [1], [2], [3], [4], [5]. In these networks, each node is an item corresponding to a dimension or an entity in a network. Each edge indicates a relationship between two nodes, for instance, a contact between two users in a social network, an interaction between two genes in a genetic network, or a citation between two papers or two authors in a co-citation network. Analyzing these networks enables us to understand their topological properties and structural organization. One such objective is to detect communities or modules in large networks. One approach is to partition the network into sub-networks so that nodes in each sub-network are densely connected while nodes in different sub-networks are loosely connected. For example, the concepts of edge betweenness [1] and modularity [6], and the maximum-flow-minimum-cut theory [7], are used to divide the network. Spectral methods [8], [9], [10], [11] have also been proposed to partition the entire network to detect communities. In some applications, the primary concern is to discover a specific community containing a set of nodes of interest
X. Li and Y. Ye are with the Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Room 202, C Building, HIT Campus, Xili University Town, Shenzhen, China. E-mail: [email protected], [email protected].
M.K. Ng is with the Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong. E-mail: [email protected].
Manuscript received 2 Mar. 2012; revised 6 Nov. 2012; accepted 2 Mar. 2013; date of publication 19 Mar. 2013; date of current version 18 Mar. 2014. Recommended for acceptance by H. Zha. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2013.48
rather than to partition the whole network into several groups. For example, we may be interested in a particular sharing group in a social network, or a set of genes along a pathway in a genetic network. To find such community structure in a network, Bagrow and Bolt [12] developed an algorithm that searches the structure of a network starting from a seed node. Clauset [13] proposed a local modularity measure and then developed a seed-based community discovery algorithm. In [14], [15], Luo et al. designed a new modularity measure and developed a KL-like algorithm to explore such community structure. However, these methods were developed for uni-dimensional networks.
In this paper, we are interested in developing such a seed-based community discovery scheme for a multi-dimensional network. There are many data mining and information retrieval applications in which networks contain several dimensions/entities [16], [17]. In online social media, networked data consists of multiple dimensions/entities containing tags, photos, comments, and stories [18]. We are interested in finding a group of users who interact significantly on these media entities. In a co-citation network, we are interested in finding a group of authors who cite or collaborate with each other (or a set of papers which are related to each other) significantly on publication information in titles, abstracts, and keywords as multiple dimensions/entities in the network [19]. Fig. 1a shows an example of an academic publication network where papers are labeled with concepts, and each paper is associated with several keywords and authors. In this network, there are four dimensions/entities. However, items in three dimensions (author, keyword, and paper) are related among themselves, and items in two dimensions (paper and concept) are related to each other. We can make use of a tensor and a matrix
Fig. 1. (a) An example of an academic publication network. (b) The representation of the multi-dimensional network in (a): a tensor is used to represent the interactions among items in three dimensions/entities: author, keyword and paper, and a matrix is used to represent the interactions between items in two dimensions/entities: concept and paper.
to represent their interactions, see Fig. 1b. The detailed definitions are given in Section 3. Multiple interactions among entities/dimensions should be incorporated and studied in order to identify useful and important community structure in such a multi-dimensional network.
The main aim of this paper is to propose an algorithm, MultiComm, to identify a seed-based community structure in a multi-dimensional network such that the involved items of the entities inside the community interact significantly and, meanwhile, are not strongly influenced by the items outside the community. In our proposal, a community is constructed starting with a seed consisting of one or more items of the entities believed to be participating in a viable community. Given the seed item, we iteratively adjoin new items by evaluating the affinity between the items to build a community in the network. As there are multiple interactions among the items from different dimensions/entities in a multi-dimensional network, the main challenge is how to evaluate the affinity between two items in the same type of entity (from the same dimension/entity) or in different types of entities (from different dimensions/entities). For example, in Fig. 2, the affinity between a paper “A” and a paper “B” (the same type of entity), and the affinity between a paper “A” and a keyword “C” (different types of entities), are required in order to decide whether the papers “A” and “B”, or the paper “A” and the keyword “C”, should be put together in a community. We also need a criterion to evaluate the quality of the communities generated by the proposed algorithm, and thus we study a local modularity measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data suggest that the proposed framework is able to find a community effectively. Experimental results also show that the proposed algorithm is more accurate than the other tested algorithms in finding communities.
The rest of the paper is organized as follows. In Section 2, we review the related work and explain why the proposed algorithm is better than the other existing algorithms. In Section 3, we describe the notation used in this paper and present how to evaluate the affinity between items. In Section 4, we present the proposed framework and algorithm. In Section 5, we demonstrate the usefulness of the proposed algorithm by presenting the experimental results. In Section 6, we give some concluding remarks.
Fig. 2. The affinity between two items in the same/different type of entities.
2 THE RELATED WORK
2.1 Community Discovery
In the literature, many methods have been proposed to extract community structures from uni-dimensional networks. In [20], Ding et al. proposed to find communities based on a min-max cut principle, i.e., minimizing the connections between communities while maximizing the connections within a community. Following this principle, it has been shown that the corresponding optimization problem can be relaxed and solved by finding the eigenvector associated with the second smallest eigenvalue of its Laplacian matrix. In [13], Clauset proposed a local modularity measuring the sharpness of a subgraph boundary, and then developed a greedily growing algorithm based on this modularity for exploring community structure. Later, Luo et al. designed a new local modularity based on the “indegree” and “outdegree” of a subgraph [14], and then proposed three algorithms for the identification of communities [15]. In 2006, Anderson and Lang investigated methods for growing communities by using random walk techniques [21]. Their basic idea is to generate communities by simulating a “truncated” random walk for a small number of steps starting from a distribution concentrated on the seed set. Recently, Mehler and Skiena [22] also proposed a general method for expanding network communities from an input seed set. In this method, they studied several heuristic scoring criteria (neighbor count, juxtaposition count, neighbor ratio, etc.) to select the most promising next member for community expansion. During the expansion, part of the seeds are set aside as a validation set to decide when to stop. Based on matrix blocking techniques, Chen and Saad proposed a method to extract dense subgraphs from sparse graphs for discovering dense communities [23]. Their basic idea is to reorder the adjacency matrix to find dense diagonal blocks, each of which represents a dense community.
However, all these methods were developed for uni-dimensional networks and thus may not yield good performance for multi-dimensional networks. As an example, we consider a network consisting of three entities/dimensions (A, B and C). Entity A includes five items $\{A_1, A_2, A_3, A_4, A_5\}$, entity B includes six items $\{B_1, B_2, B_3, B_4, B_5, B_6\}$ and entity C includes four items $\{C_1, C_2, C_3, C_4\}$. The interactions between them are represented as a tensor $\mathcal{A}$ of size $5 \times 6 \times 4$. When items $A_i$, $B_j$ and $C_k$ interact, the $(i, j, k)$ position of $\mathcal{A}$ is set to 1; otherwise, it is zero. $\mathcal{A}(:,:,1)$, $\mathcal{A}(:,:,2)$, $\mathcal{A}(:,:,3)$ and $\mathcal{A}(:,:,4)$ are given as follows:
[Matrices $\mathcal{A}(:,:,1)$, $\mathcal{A}(:,:,2)$, $\mathcal{A}(:,:,3)$ and $\mathcal{A}(:,:,4)$: four $5 \times 6$ binary matrices; their entries are not legible in this copy.]
respectively. In this example, we generate a community consisting of items $\{A_1, A_2, A_3, B_1, B_2, B_3, C_1, C_2\}$ in the network. In order to employ uni-dimensional network community discovery methods, we must change the above tensor data into the adjacency matrix data as follows:
where 1 in the above matrix refers to the interaction between the items. For instance, when items $A_i$, $B_j$ and $C_k$ interact, the interactions between $A_i$ and $B_j$, $B_j$ and $C_k$, and $A_i$ and $C_k$ are shown in the matrix form. From this example, we observe two disadvantages of handling the problem in matrix form. The first issue is that there is no direct interaction between the items of the same entity (all the diagonal blocks are zero). In order to group the items of the same entity in the community, we must make use of the items in the other entities. The second issue is that some interactions are duplicated in the matrix form. For instance, the items $A_1$, $B_1$ and $C_1$ interact and the items $A_1$, $B_1$ and $C_2$ interact, thus $A_1$ and $B_1$ interact, but we can neither differentiate the two interactions nor capture the correlation between items in different dimensions/entities that is present in the original tensor data. We report that uni-dimensional community discovery algorithms such as Clauset's cannot find the correct community in this example. The community discovered by Clauset's algorithm is $\{A_1, A_2, A_3, B_1, B_2, B_3, C_2, C_4\}$. In this paper, we propose a scheme to extract communities from tensor data arising from multi-dimensional networks.
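The two disadvantages of the matrix form can be illustrated with a small sketch. The code below is not the paper's example: the $2 \times 2 \times 2$ tensor `toy` and the helper name `flatten_to_adjacency` are hypothetical, chosen only to show how a binary three-way tensor collapses into a tripartite adjacency matrix.

```python
import numpy as np

def flatten_to_adjacency(tensor):
    """Collapse a binary 3-way tensor into a symmetric adjacency matrix.

    Nodes are ordered as [A items, B items, C items]. Every nonzero
    entry (i, j, k) contributes the pairs A_i-B_j, B_j-C_k and A_i-C_k.
    """
    n1, n2, n3 = tensor.shape
    n = n1 + n2 + n3
    adj = np.zeros((n, n), dtype=int)
    for i, j, k in zip(*np.nonzero(tensor)):
        a, b, c = int(i), n1 + int(j), n1 + n2 + int(k)
        for u, v in ((a, b), (b, c), (a, c)):
            adj[u, v] = adj[v, u] = 1
    return adj

# Hypothetical toy data (not the tensor used in the paper).
toy = np.zeros((2, 2, 2), dtype=int)
toy[0, 0, 0] = 1  # A1, B1, C1 interact
toy[0, 0, 1] = 1  # A1, B1, C2 interact
toy[1, 1, 1] = 1  # A2, B2, C2 interact

adj = flatten_to_adjacency(toy)
same_entity_links = adj[:2, :2].sum() + adj[2:4, 2:4].sum() + adj[4:, 4:].sum()
```

In this sketch `same_entity_links` is zero (the diagonal blocks are empty), and the two triples that share $A_1$ and $B_1$ leave a single entry for the pair $A_1$–$B_1$, so the matrix no longer records which third item accompanied each interaction.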
2.2 MetaFac
In multi-dimensional network analysis, networks have more than two types of entities. Most existing methods are based on tensor factorization [18], [24], [25]. Tensor factorization is a generalized approach for analyzing multi-way interactions among entities.
In [18], Lin et al. proposed MetaFac (MetaGraph Factorization), a framework that extracts community structures from various social contexts and interactions. In this method, the interactions between different entities are represented as a set of tensors, some of which have overlapping dimensions. These tensors are then decomposed into matrices simultaneously, using the KL-divergence as a measure of approximation cost. Since the KL-divergence is used, the factorization matrices can represent the prior probabilities of different communities and the conditional probabilities of each item in these communities. Based on them, the posterior probability that each item belongs to a particular community can be computed, and then the final community results can be obtained. However, this method requires selecting the number of decompositions (low-rank approximation) in the tensor factorization, which may not be known in advance. Moreover, the computed tensor factorization may not be unique, as there are several numerical methods (e.g., the alternating least squares procedure) for computing such a factorization and the results depend on the initial guess. In contrast, we compute visiting probabilities of items in different dimensions in a multi-dimensional network to calculate the affinity between the items and to find a community. This probability calculation is analyzed in the next section. Using the example in Section 2.1, we find that MetaFac detects the community $\{A_1, A_2, A_3, A_5, B_1, B_2, B_3, B_4, C_1, C_2, C_4\}$, which is not the correct one.
2.3 Random Walk with Restart
The proposed affinity calculation in Section 3 is based on the idea of random walk with restart. There are many applications using random walk and related methods, such as image annotation [26], connection subgraph identification [27], cluster discovery [28], and bi-relational network analysis [29]. The idea of random walk with restart is to consider a random particle that starts from node $i$ and iteratively moves to its neighbors with probability proportional to their edge weights. Also, at each step, it has some probability $\alpha$ ($0 < \alpha < 1$) of returning to node $i$. The relevance score of node $j$ with respect to node $i$ is defined as the steady-state probability $p_j$ that the particle will be at node $j$ [30]:
$$p = (1 - \alpha) W p + \alpha e_i,$$
where $p = [p_j]$ is the steady-state probability vector giving the relevance scores of the different nodes, $W$ is the normalized weighted matrix associated with the graph, and $e_i$ is the starting vector whose $i$th element is 1 and whose other elements are 0. The relevance score can capture the global structure of the graph and the multi-facet relationship between two nodes. However, random walk and related methods have so far dealt only with simple networks. The main contribution of this paper is to develop random walk and related methods based on tensors for finding communities in a multi-dimensional network. Using the proposed algorithm in Section 3, we can find the ground truth community $\{A_1, A_2, A_3, B_1, B_2, B_3, C_1, C_2\}$ exactly in the network example in Section 2.1.
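The steady-state equation of the random walk with restart can be approximated by simple fixed-point iteration. The following is a minimal sketch rather than the implementation of [30]: it assumes that $W$ is obtained by column-normalizing the adjacency matrix and that the graph has no isolated nodes; the function name, the default $\alpha = 0.15$, and the path graph are illustrative choices.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - alpha) * W @ p + alpha * e_seed until it stabilizes.

    Assumption: W is the column-normalized adjacency matrix and every
    node has at least one edge (no zero columns).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    W = adjacency / adjacency.sum(axis=0)  # each column sums to one
    e = np.zeros(adjacency.shape[0])
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1.0 - alpha) * (W @ p) + alpha * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical toy graph: a path 0 - 1 - 2 - 3, restarting at node 0.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
scores = random_walk_with_restart(path, seed=0)
```

On this path graph the scores of nodes 1, 2 and 3 decrease with their distance from the seed, which is the kind of relevance ordering the restart term is meant to produce.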
3 THE MATHEMATICAL FORMULATION
3.1 Preliminary
In this section, we describe notation and present some preliminary knowledge on tensors for representing multi-dimensional networks. Let $\mathbb{R}$ be the real field. We call $\mathcal{A} = [a(i_1, \ldots, i_m)]$, where $a(i_1, \ldots, i_m) \in \mathbb{R}$ for $i_k = 1, \ldots, n_k$, $k = 1, 2, \ldots, m$, a real $(n_1 \times \cdots \times n_m)$-dimensional tensor. $\mathcal{A}$ is called non-negative if $a(i_1, \ldots, i_m) \ge 0$. A positive (or non-negative) vector means all its entries are positive (or non-negative). It is denoted by $x > 0$ (or $x \ge 0$).
Definition 1. An $m$-dimensional network with the size of the $k$th dimension being $n_k$ ($k = 1, 2, \ldots, m$) is represented by a set of $S$ tensors $\{\mathcal{A}^{(s)}\}_{s=1}^{S}$, where $\mathcal{A}^{(s)} = [a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})]$ is a real $(n_{j_1(s)} \times n_{j_2(s)} \times \cdots \times n_{j_{l_s}(s)})$-dimensional tensor describing the interactions of items among $l_s$ dimensions with indices $j_1(s), j_2(s), \ldots, j_{l_s}(s)$ between 1 and $m$ (they are not necessarily distinct): $a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = 1$ if the $i_{j_1(s)}$th item in the $j_1(s)$th dimension, the $i_{j_2(s)}$th item in the $j_2(s)$th dimension, $\ldots$, the $i_{j_{l_s}(s)}$th item in the $j_{l_s}(s)$th dimension interact; otherwise, $a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = 0$.
It is clear that each $\mathcal{A}^{(s)}$ is a non-negative tensor. For example, suppose that there are four dimensions/entities (papers, authors, keywords and concepts), and there are $n_1$ papers, $n_2$ authors, $n_3$ keywords and $n_4$ concepts in the network shown in Fig. 1. The network is represented by two tensors, i.e., $S = 2$. $\mathcal{A}^{(1)}$ is an $(n_1 \times n_2 \times n_3)$-dimensional tensor representing the interactions among the papers, authors and keywords, i.e., $l_1 = 3$, $j_1(1) = 1$, $j_2(1) = 2$ and $j_3(1) = 3$. $\mathcal{A}^{(2)}$ is an $(n_1 \times n_4)$-dimensional tensor (matrix) representing the interactions between the papers and concepts, i.e., $l_2 = 2$, $j_1(2) = 1$ and $j_2(2) = 4$. In this example, when the $i_{j_1(1)}$th paper is related to the $i_{j_2(1)}$th author and contains the $i_{j_3(1)}$th keyword, we set $a^{(1)}(i_{j_1(1)}, i_{j_2(1)}, i_{j_3(1)})$ to 1; otherwise, we set it to 0. When the $i_{j_1(2)}$th paper is related to the $i_{j_2(2)}$th concept, we set $a^{(2)}(i_{j_1(2)}, i_{j_2(2)})$ to 1; otherwise, we set it to 0.
As we consider the calculation of probabilities of the items of each dimension in a non-negative tensor arising from a multi-dimensional network, and study the likelihood that we will arrive at any particular item in a network, we construct an $(n_{j_1(s)} \times n_{j_2(s)} \times \cdots \times n_{j_{l_s}(s)})$-dimensional transition probability tensor $\mathcal{P}^{(s,t)} = [p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})]$ by normalizing the entries of $\mathcal{A}^{(s)}$ with respect to the index $i_{j_t(s)}$ between 1 and $n_{j_t(s)}$ ($t = 1, 2, \ldots, l_s$) as follows:
$$p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = \frac{a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})}{\sum_{i_{j_t(s)}=1}^{n_{j_t(s)}} a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})}.$$
These numbers give the estimates of the following conditional probabilities:
$$p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = \mathrm{Prob}\Big[X_k^{(j_t(s))} = i_{j_t(s)} \,\Big|\, \underbrace{X_{k-1}^{(j_u(s))} = i_{j_u(s)}}_{\forall u = 1, \ldots, l_s,\ u \ne t}\Big], \qquad (1)$$
where $X_k^{(j_u(s))}$ is a random variable referring to the visit to any particular item of the $j_u(s)$th dimension/entity at time $k$. Here $p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})$ can be interpreted as the probability of visiting the $i_{j_t(s)}$th item of the $j_t(s)$th dimension in a network given that the $i_{j_u(s)}$th item of the $j_u(s)$th dimension is currently visited, for all $u = 1, \ldots, l_s$ with $u \ne t$. We note that if $a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})$ is equal to 0 for all $1 \le i_{j_t(s)} \le n_{j_t(s)}$, this is called a dangling node [32], and the values of $p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})$ can be set to $1/n_{j_t(s)}$ (an equal chance of visiting any item in the $j_t(s)$th dimension). With the above construction, we have
$$0 \le p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) \le 1$$
and
$$\sum_{i_{j_t(s)}=1}^{n_{j_t(s)}} p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = 1. \qquad (2)$$
We call $\mathcal{P}^{(s,t)}$ ($t = 1, 2, \ldots, l_s$) the transition probability tensors derived from $\mathcal{A}^{(s)}$. $\mathcal{P}^{(s,t)}$ can be viewed as a high-dimensional analog of transition probability matrices in Markov chains [31]. It is necessary to know the connectivity among the items of the entities within a tensor. We remark that the concept of irreducibility has been used for the PageRank matrix in order to compute the PageRank vector [32].
Definition 2. An $(n_1 \times \cdots \times n_m)$-dimensional tensor $\mathcal{T}$ is called irreducible if for any $j$ and $j'$ (the other indices being fixed) the $n_j \times n_{j'}$ matrices $[t(\ldots, i_j, \ldots, i_{j'}, \ldots)]$ are irreducible. If $\mathcal{T}$ is not irreducible, then we call $\mathcal{T}$ reducible.
When $\mathcal{A}^{(s)}$ is irreducible, any two items in the same dimension or in different dimensions in a network can be connected via the other items. As we would like to make use of probability distributions to define an affinity between two items of the entities, irreducibility is a reasonable assumption that we will use in the following analysis and discussion. It is clear that when $\mathcal{A}^{(s)}$ is irreducible, the corresponding tensors $\mathcal{P}^{(s,t)}$ ($s = 1, 2, \ldots, S$ and $t = 1, 2, \ldots, l_s$) are also irreducible.
According to (1), we calculate the probability of visiting an item in the $j_t(s)$th dimension by multiplying the probabilities of visiting the other items in the other dimensions with $p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})$. Here we deal with a multiplication operation of a tensor with several vectors. Let $x_{j_u(s)}$ be vectors of length $n_{j_u(s)}$ and $[x_{j_u(s)}]_{i_{j_u(s)}}$ be the value of the $i_{j_u(s)}$th item in the $j_u(s)$th dimension ($u = 1, 2, \ldots, l_s$). Let $\mathcal{P}^{(s,t)} \otimes_{u=1, u \ne t}^{l_s} x_{j_u(s)}$ be a vector in $\mathbb{R}^{n_{j_t(s)}}$ such that
$$\Big[\mathcal{P}^{(s,t)} \otimes_{u=1, u \ne t}^{l_s} x_{j_u(s)}\Big]_{i_{j_t(s)}} = \underbrace{\sum_{i_{j_1(s)}=1}^{n_{j_1(s)}} \sum_{i_{j_2(s)}=1}^{n_{j_2(s)}} \cdots \sum_{i_{j_{l_s}(s)}=1}^{n_{j_{l_s}(s)}}}_{\text{except the sum for } i_{j_t(s)}} p^{(s,t)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) \underbrace{[x_{j_1(s)}]_{i_{j_1(s)}} [x_{j_2(s)}]_{i_{j_2(s)}} \cdots [x_{j_{l_s}(s)}]_{i_{j_{l_s}(s)}}}_{\text{except the term } [x_{j_t(s)}]_{i_{j_t(s)}}}$$
for $i_{j_t(s)} = 1, 2, \ldots, n_{j_t(s)}$. It is easy to check that when $x_{j_u(s)}$ are probability distribution vectors, i.e.,
$$\sum_{i_{j_u(s)}=1}^{n_{j_u(s)}} [x_{j_u(s)}]_{i_{j_u(s)}} = 1,$$
then the output vector $\mathcal{P}^{(s,t)} \otimes_{u=1, u \ne t}^{l_s} x_{j_u(s)}$ is also a probability distribution vector. By using the publication network in Fig. 1 as an example, we can calculate
the probabilities of visiting the items in the “paper” dimension given the probabilities that the items in the “author” and “keyword” dimensions are visited in $\mathcal{P}^{(1,1)}$ derived from $\mathcal{A}^{(1)}$: $x_1 = \mathcal{P}^{(1,1)} x_2 x_3$ and
$$[x_1]_{i_1} = \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} p^{(1,1)}(i_1, i_2, i_3) [x_2]_{i_2} [x_3]_{i_3};$$
the probabilities of visiting the items in the “author” dimension given the probabilities that the items in the “paper” and “keyword” dimensions are visited in $\mathcal{P}^{(1,2)}$ derived from $\mathcal{A}^{(1)}$: $x_2 = \mathcal{P}^{(1,2)} x_1 x_3$ and
$$[x_2]_{i_2} = \sum_{i_1=1}^{n_1} \sum_{i_3=1}^{n_3} p^{(1,2)}(i_1, i_2, i_3) [x_1]_{i_1} [x_3]_{i_3};$$
the probabilities of visiting the items in the “keyword” dimension given the probabilities that the items in the “author” and “paper” dimensions are visited in $\mathcal{P}^{(1,3)}$ derived from $\mathcal{A}^{(1)}$: $x_3 = \mathcal{P}^{(1,3)} x_1 x_2$ and
$$[x_3]_{i_3} = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} p^{(1,3)}(i_1, i_2, i_3) [x_1]_{i_1} [x_2]_{i_2}.$$
Similarly, we can calculate the probabilities of visiting the items in the “paper” dimension given the probabilities that the items in the “concept” dimension are visited in $\mathcal{P}^{(2,1)}$ derived from $\mathcal{A}^{(2)}$: $x_1 = \mathcal{P}^{(2,1)} x_4$ and
$$[x_1]_{i_1} = \sum_{i_4=1}^{n_4} p^{(2,1)}(i_1, i_4) [x_4]_{i_4}.$$
In the above example, there is no overlapping dimension in $\mathcal{A}^{(1)}$ or $\mathcal{A}^{(2)}$, which refer to paper-author-keyword interactions and paper-concept interactions, respectively. Our model can also handle the case where overlapping dimensions appear in a tensor; see the two examples in Experiment 3.
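The construction of a transition probability tensor and its multiplication with vectors can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `transition_tensor` and `contract_except` and the $3 \times 2 \times 2$ toy tensor are hypothetical, and dangling fibers are filled with the uniform value $1/n_{j_t(s)}$ as described above.

```python
import numpy as np

def transition_tensor(A, t):
    """Normalize A along axis t so that every fiber along t sums to one.

    Fibers that are entirely zero (dangling) receive the uniform
    value 1 / n_t.
    """
    A = np.asarray(A, dtype=float)
    sums = A.sum(axis=t, keepdims=True)
    safe = np.where(sums > 0, sums, 1.0)  # avoid division by zero
    return np.where(sums > 0, A / safe, 1.0 / A.shape[t])

def contract_except(P, t, vectors):
    """Multiply P by vectors[axis] along every axis except t.

    `vectors` maps an axis number to a vector. Axes are contracted from
    the highest to the lowest so the remaining axis numbers stay valid.
    """
    out = P
    for axis in sorted(vectors, reverse=True):
        if axis == t:
            raise ValueError("axis t must not be contracted")
        out = np.tensordot(out, vectors[axis], axes=([axis], [0]))
    return out

# Hypothetical toy data: 3 papers, 2 authors, 2 keywords.
A = np.zeros((3, 2, 2))
for i, j, k in [(0, 0, 0), (1, 0, 0), (1, 1, 1), (2, 1, 1)]:
    A[i, j, k] = 1.0

P1 = transition_tensor(A, t=0)             # plays the role of P^(1,1)
x2 = np.array([0.5, 0.5])                  # distribution over authors
x3 = np.array([0.5, 0.5])                  # distribution over keywords
x1 = contract_except(P1, 0, {1: x2, 2: x3})
```

Because every fiber of `P1` along the paper axis sums to one and `x2`, `x3` are probability vectors, `x1` is again a probability vector, which is the property stated above.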
3.2 The Affinity between Two Items
A community is constructed starting with a seed consisting of one or more items of the entities believed to be participating in a viable community. Given the seed item, we iteratively adjoin new items by evaluating the affinity between the items to build a community in the network. In this section, we present how to evaluate the affinity between two items in the same type of entity (from the same dimension/entity) or in different types of entities (from different dimensions/entities). Our idea is to calculate the affinity based on the probabilities of visiting other items in a network from a given set of items. Motivated by the idea of topic-sensitive PageRank [33] and random walk with restart [30], we consider a random walker that chooses randomly among the available interactions among the items in different dimensions and, with probability $\alpha$, goes back to the set of items in the current community. Based on this concept, we set up the following tensor equations to calculate the required probabilities of visiting items in the $v$th dimension in the whole network ($v = 1, 2, \ldots, m$):
$$x_v = (1 - \alpha) \left[ \frac{1}{N_v} \sum_{s=1,\ j_t(s)=v}^{S} \mathcal{P}^{(s,t)} \otimes_{u=1, u \ne t}^{l_s} x_{j_u(s)} \right] + \alpha z_v, \qquad (3)$$
where $z_v$ is a probability vector that is constructed by setting the entries that correspond to the seed items or the current items in the $v$th dimension in the community to ones, and then normalizing it; the parameter $0 < \alpha < 1$ controls the probability of a random jump to the items in the current community; and $N_v$ is the number of tensors involving the $v$th dimension in the network. As there is no prior knowledge favoring a particular tensor in the network, we assume an equal chance of using each of these $N_v$ tensors involving the $v$th dimension in the random walk, and the factor $1/N_v$ is used in the calculation of probabilities. By using the publication network in Fig. 1 as an example, we calculate the probabilities of visiting the items in the “paper,” “author,” “keyword” and “concept” dimensions with given initial probability vectors $z_1$, $z_2$, $z_3$ and $z_4$ in these dimensions by solving the following set of tensor equations with $N_1 = 2$ and $N_2 = N_3 = N_4 = 1$:
$$x_1 = (1 - \alpha) \left[ \tfrac{1}{2} \mathcal{P}^{(1,1)} x_2 x_3 + \tfrac{1}{2} \mathcal{P}^{(2,1)} x_4 \right] + \alpha z_1,$$
$$x_2 = (1 - \alpha) \mathcal{P}^{(1,2)} x_1 x_3 + \alpha z_2,$$
$$x_3 = (1 - \alpha) \mathcal{P}^{(1,3)} x_1 x_2 + \alpha z_3,$$
$$x_4 = (1 - \alpha) \mathcal{P}^{(2,2)} x_1 + \alpha z_4.$$
We note that only the “paper” dimension is involved in both $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$. In the theorem below, we state that (3) is solvable, and the proof is given in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.48.
Theorem 1. Suppose $0 < \alpha < 1$ and $\mathcal{A}^{(s)}$ ($1 \le s \le S$) are irreducible. Then (3) is solvable, its solution $x_v$ is non-negative, and the summation of all the entries of $x_v$ is equal to 1 (i.e., $x_v$ is a probability distribution vector) for $v = 1, 2, \ldots, m$. Moreover, $x_v$ for $v = 1, 2, \ldots, m$ are positive.
Under a certain assumption on $\{\mathcal{P}^{(s,t)}\}_{s=1}^{S}$, the solutions $x_v$ for $v = 1, 2, \ldots, m$ are unique; see the Appendix, available in the online supplemental material. Next we present an efficient iterative algorithm to solve the tensor equations in (3) to obtain $x_v$ ($v = 1, 2, \ldots, m$) for the probabilities of the items in the multi-dimensional network. In the algorithm, the computations require several iterations through the collection to adjust the approximate probability values of items of the entities in the multi-dimensional network so that they more closely reflect their theoretical true values (the underlying probability distributions $\{x_v\}_{v=1}^{m}$). The iterative method is similar to the power method for computing the eigenvector corresponding to the largest eigenvalue of a matrix [32]. The main computational cost of the above algorithm depends on the cost of performing tensor operations in Step 3. Assuming that there are $O(N)$ nonzero entries (sparse data) in $\mathcal{P}^{(s,t)}$, the cost of these tensor calculations is $O(N)$ arithmetic operations.
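A power-method-like iteration for the four tensor equations of the publication-network example might look as follows. The paper's own algorithm listing is not reproduced in this section, so this is only a sketch under assumptions: all four vectors are updated simultaneously from the previous iterate, every dimension has a seed item, and the names `normalize` and `solve_affinity`, the default $\alpha = 0.5$, and the toy data are hypothetical.

```python
import numpy as np

def normalize(A, axis):
    """Transition tensor: normalize along `axis`, uniform for dangling fibers."""
    A = np.asarray(A, dtype=float)
    s = A.sum(axis=axis, keepdims=True)
    return np.where(s > 0, A / np.where(s > 0, s, 1.0), 1.0 / A.shape[axis])

def solve_affinity(A1, A2, z, alpha=0.5, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for x1..x4 (paper, author, keyword, concept).

    A1: paper x author x keyword tensor; A2: paper x concept matrix;
    z: list of four seed probability vectors.
    """
    P11, P12, P13 = (normalize(A1, t) for t in range(3))
    P21, P22 = normalize(A2, 0), normalize(A2, 1)
    x = [np.array(zv, dtype=float) for zv in z]
    for _ in range(max_iter):
        x1, x2, x3, x4 = x
        new = [
            # "paper" appears in both tensors, so N_1 = 2 and each gets weight 1/2
            (1 - alpha) * 0.5 * (np.einsum('ijk,j,k->i', P11, x2, x3) + P21 @ x4)
            + alpha * z[0],
            (1 - alpha) * np.einsum('ijk,i,k->j', P12, x1, x3) + alpha * z[1],
            (1 - alpha) * np.einsum('ijk,i,j->k', P13, x1, x2) + alpha * z[2],
            (1 - alpha) * (P22.T @ x1) + alpha * z[3],
        ]
        change = sum(np.abs(n - o).sum() for n, o in zip(new, x))
        x = new
        if change < tol:
            break
    return x

# Hypothetical toy network: 3 papers, 3 authors, 2 keywords, 2 concepts.
A1 = np.zeros((3, 3, 2))
for idx in [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0), (2, 2, 1)]:
    A1[idx] = 1.0
A2 = np.zeros((3, 2))
for idx in [(0, 0), (1, 0), (2, 1)]:
    A2[idx] = 1.0

# Seed: first paper, first author, first keyword, first concept.
z = [np.eye(3)[0], np.eye(3)[0], np.eye(2)[0], np.eye(2)[0]]
x1, x2, x3, x4 = solve_affinity(A1, A2, z)
```

Each update is a convex combination of probability vectors, so the iterates stay probability distributions regardless of how many iterations are run; whether the iteration converges for a given $\alpha$ depends on the uniqueness conditions discussed in the Appendix.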
THE PROPOSED ALGORITHM
After solving a set of tensor equations in (3), we obtain the probability distributions of visiting each item in each dimension in the multi-dimensional network. These probability distributions fxv gm v¼1 can be viewed as an affinity vector because it indicates the affinity of the items in each dimension to the items in the current community. Based on their probability values, we can determine the candidate items in different dimensions that are closely related to the current items in the community. In the next section, we will define the goodness criterion in order to determine the “best” community.
4.1 Local Modularity in a Multi-Dimensional Network
In [13], Clauset defined a measure of community structure for a graph. The idea is that a good community should have a sharp boundary, i.e., it has few connections from its boundary to the rest of the network, while having a greater proportion of connections from the boundary back into the community. Here we extend this idea to define a local modularity of a community in a multi-dimensional network. Let $\mathbb{B}$ denote the boundary set of the community, composed of the items in an $m$-dimensional network that have connections to items outside of the community explored. Let $\mathcal{B}^{(s)} = [b^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)})]$ denote the boundary-adjacency tensor for $\mathcal{A}^{(s)}$ $(s = 1, 2, \ldots, S)$, which can be computed as

$$
b^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) =
\begin{cases}
1, & \text{if } a^{(s)}(i_{j_1(s)}, i_{j_2(s)}, \ldots, i_{j_{l_s}(s)}) = 1 \text{ and the } i_t\text{th item of the } t\text{th dimension is in } \mathbb{B}, \\
0, & \text{otherwise.}
\end{cases}
$$

Thus, we define the local modularity $r_s$ for the tensor $\mathcal{A}^{(s)}$ to be given by

$$
r_s = \frac{\sum_{i_{j_1(s)}, \ldots, i_{j_{l_s}(s)}} b^{(s)}(i_{j_1(s)}, \ldots, i_{j_{l_s}(s)})\, d(i_{j_1(s)}, \ldots, i_{j_{l_s}(s)})}{\sum_{i_{j_1(s)}, \ldots, i_{j_{l_s}(s)}} b^{(s)}(i_{j_1(s)}, \ldots, i_{j_{l_s}(s)})}, \tag{4}
$$

where the entry $d(i_{j_1(s)}, \ldots, i_{j_{l_s}(s)})$ of the indicator tensor $\mathcal{D}$ is one when some $i_{j_v(s)}$th items of the $j_v(s)$th dimension are in $\mathbb{B}$ and the other $i_{j'_v(s)}$th items of the $j'_v(s)$th dimension are in the community, and zero otherwise. Here the denominator in (4) is the number of connections with one or more items in $\mathbb{B}$, while the numerator in (4) is the number of those connections that have no item outside the community. By considering the fraction of boundary connections which are internal to the community, we ensure that the measure of local modularity for the tensor $\mathcal{A}^{(s)}$ lies on the interval $0 \le r_s < 1$, where its value is directly proportional to the sharpness of the boundary given by $\mathbb{B}$.

Here we use the multi-dimensional network example in Section 2 to demonstrate the definition and computation of modularity. Assume that the current community that has been explored is $\{A_1, A_2, B_1, C_1\}$. Then the boundary set $\mathbb{B}$ is $\{A_1, A_2, B_1, C_1\}$, because we find from the adjacency tensor $\mathcal{A}$ that these four items in the current community have connections to items outside of this community. In this case, we have the boundary adjacency tensor $\mathcal{B}$ where $\mathcal{B}(:,:,1)$, $\mathcal{B}(:,:,2)$, $\mathcal{B}(:,:,3)$ and $\mathcal{B}(:,:,4)$ are given as follows:

[Matrices $\mathcal{B}(:,:,1)$ to $\mathcal{B}(:,:,4)$: four $5 \times 6$ binary matrices whose entries are not legible in this copy.]
respectively, and the indicator tensor $\mathcal{D}$ where $\mathcal{D}(:,:,1)$, $\mathcal{D}(:,:,2)$, $\mathcal{D}(:,:,3)$ and $\mathcal{D}(:,:,4)$ are given as follows:

$$
\mathcal{D}(:,:,1) =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\qquad
\mathcal{D}(:,:,2) = \mathcal{D}(:,:,3) = \mathcal{D}(:,:,4) = 0_{5 \times 6},
$$

respectively. We see that $d(1,1,1) = 1$ because the item $B_1$ is in the boundary set and the items $A_1$ and $C_1$ are in the current community. Similarly, $d(2,1,1) = 1$ because the item $B_1$ is in the boundary set and the items $A_2$ and $C_1$ are in the current community. Thus, the local modularity of this example is

$$
r = \frac{\sum_{i_1=1}^{5} \sum_{i_2=1}^{6} \sum_{i_3=1}^{4} b(i_1, i_2, i_3)\, d(i_1, i_2, i_3)}{\sum_{i_1=1}^{5} \sum_{i_2=1}^{6} \sum_{i_3=1}^{4} b(i_1, i_2, i_3)} = \frac{1 + 1}{24} = \frac{1}{12}.
$$

When communities vary in different subsets of dimensions, we can make use of $r_s$ to identify the dimension whose highest-probability item joins the community. In Algorithm 1, we summarize the MultiComm algorithm for finding a community in a multi-dimensional network. We remark that after one community is determined, we can apply the MultiComm algorithm again to find another community. The proposed algorithm allows an item to belong to different communities.

5
EXPERIMENTAL RESULTS
In this section, we present experiments on both synthetic data sets and real data sets to demonstrate the performance of the proposed algorithm.

5.1 Data Sets and Evaluation Metrics
In the comparison, we study the results obtained with the MetaFac algorithm [18], the Clauset algorithm [13] and the LWP algorithm [15]. There are three community discovery algorithms in the LWP method, and we implemented the KL-like algorithm since it has been shown in [15] that this algorithm performs better than the other two. For the Clauset algorithm and the LWP algorithm, we put all the items as nodes in a single graph, and two nodes are connected if their corresponding items interact in the original multi-dimensional network. This setting cannot distinguish the interactions among different dimensions. MetaFac extracts communities from a multi-dimensional network based on tensor factorization. The number of decompositions in the factorization must be specified. We use three data sets to test the performance of MultiComm and the other three comparison algorithms. In the first two experiments, we test the performance based on two synthetic data sets. In the third experiment, we use two journal publication data sets to construct multi-dimensional networks to identify communities.
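To illustrate the single-graph setting used for the Clauset and LWP baselines, the following minimal Python sketch flattens multi-way interactions into one undirected graph. The function name `flatten_to_graph`, the tuple-based input format and the toy data are assumptions made for this illustration; they are not taken from the implementations compared in this paper.

```python
from itertools import combinations


def flatten_to_graph(interactions, dim_names):
    """Turn multi-way interactions into the edge set of one undirected graph.

    interactions: iterable of tuples holding one item index per dimension.
    dim_names: one label per tuple position, so that items of different
               dimensions become distinct nodes (hypothetical format).
    """
    edges = set()
    for entry in interactions:
        nodes = [(dim_names[pos], idx) for pos, idx in enumerate(entry)]
        # Every pair of items taking part in the same interaction is linked.
        for u, v in combinations(nodes, 2):
            if u != v:
                edges.add(frozenset((u, v)))
    return edges


# Toy three-way interactions over dimensions A, B and C (hypothetical data).
toy = [(0, 0, 0), (1, 0, 0), (1, 2, 3)]
graph = flatten_to_graph(toy, ["A", "B", "C"])
print(len(graph))
```

In the toy data, the pair (B, 0) and (C, 0) takes part in two different interactions but yields a single edge, so the flattened graph no longer records which multi-way interaction an edge came from.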
In this paper, two metrics are employed to evaluate the accuracy of the algorithms, i.e., F-measure and NMI (Normalized Mutual Information) [34]. Given an identified community, we compute the F-measure as follows:

$$
\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},
$$

where the precision and recall are calculated in terms of the ground-truth community. For NMI, we consider the community discovery problem as a binary class problem over all items, and compute the normalized mutual information score between the identified partition $\pi_a$ and the ground-truth partition $\pi_b$ as follows:

$$
\mathrm{NMI}(\pi_a, \pi_b) = \frac{\sum_{h=1}^{2} \sum_{l=1}^{2} n_{h,l} \log\left(\dfrac{n \cdot n_{h,l}}{n_h^{(a)} n_l^{(b)}}\right)}{\sqrt{\left(\sum_{h=1}^{2} n_h^{(a)} \log \dfrac{n_h^{(a)}}{n}\right)\left(\sum_{l=1}^{2} n_l^{(b)} \log \dfrac{n_l^{(b)}}{n}\right)}},
$$

where $n$ is the total number of items, and $n_h^{(a)}$, $n_l^{(b)}$ and $n_{h,l}$ represent the number of items in the $h$th class of the partition $\pi_a$, the number of items in the $l$th class of the partition $\pi_b$, and the number of items in both the $h$th class and the $l$th class, respectively. Both F-measure and NMI take values between 0 and 1. The higher their values, the better the accuracy of the algorithm.
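The two metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: communities are given as sets of item identifiers, the two classes of each partition are "in the community" and "not in the community", and the names `f_measure` and `binary_nmi` are hypothetical. Returning zero when a partition has an empty class is a convention chosen here, not one specified in the paper.

```python
import math


def f_measure(found, truth):
    """Harmonic mean of precision and recall of a found community."""
    found, truth = set(found), set(truth)
    hits = len(found & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(found)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def binary_nmi(found, truth, all_items):
    """NMI between the two-class partitions induced by found and truth."""
    all_items = set(all_items)
    found, truth = set(found) & all_items, set(truth) & all_items
    n = len(all_items)
    part_a = [found, all_items - found]
    part_b = [truth, all_items - truth]
    mutual = 0.0
    for class_a in part_a:
        for class_b in part_b:
            n_hl = len(class_a & class_b)
            if n_hl:  # empty intersections contribute nothing
                mutual += n_hl * math.log(n * n_hl / (len(class_a) * len(class_b)))
    h_a = sum(len(c) * math.log(len(c) / n) for c in part_a if c)
    h_b = sum(len(c) * math.log(len(c) / n) for c in part_b if c)
    denom = math.sqrt(h_a * h_b)
    return mutual / denom if denom > 0 else 0.0
```

For example, with ten items, a ground-truth community {0, 1, 2, 3} and a found community {0, 1, 2, 4}, precision and recall are both 3/4, so the F-measure is 0.75.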
Fig. 3. The generated tensor with one community. (a) $m = 2$, $\beta = 0.5$, $\gamma = 0.1$; (b) $m = 2$, $\beta = 0.25$, $\gamma = 0.1$; (c) $m = 2$, $\beta = 0.6$, $\gamma = 0.4$.
5.2 Experiment 1
In this experiment, we generate an $m$-dimensional network represented by a tensor. We construct one “ground-truth” community and add noisy interactions to the tensor, and then check how well different algorithms can recover this community. Two parameters, $\beta$ and $\gamma$, control the data generation. The parameter $\beta$ controls how strong the interactions among items in the community are. The parameter $\gamma$ controls how many noisy interactions there are in the network. More precisely, these two parameters are defined as follows:

$$
\beta = \frac{p_1}{q_1} \quad \text{and} \quad \gamma = \frac{p_2}{q_2},
$$
where $p_1$ is the number of interactions between items in the community, $q_1$ is the number of interactions when items in the community are assumed to be fully connected, $p_2$ is the number of interactions between items in the community and items outside of the community, and $q_2$ is the number of interactions when all the items are fully connected. For example, we show in Fig. 3 three generated 2-dimensional networks, i.e., $m = 2$. The two axes represent the two different dimensions/entities. Each value on an axis refers to an item in the corresponding dimension/entity. Here the number of items in each dimension is 200, and thus $\mathcal{A}$ is a $(200 \times 200)$ 2-dimensional tensor, i.e., $S = 1$. A point in the figure represents an interaction between two items. We see from Fig. 3a that the region $[1, 100] \times [1, 66]$ in the generated network has more points concentrated together, and this region corresponds to a community in the network. The points sparsely distributed in the other parts are the noisy interactions among items of the two dimensions. When the value of $\beta$ is large, the items in the community interact strongly; compare Fig. 3a with $\beta = 0.5$ and Fig. 3b with $\beta = 0.25$. Also, when the value of $\gamma$ is large, there are more noisy interactions among items; compare Fig. 3c with $\gamma = 0.4$ and Fig. 3b with $\gamma = 0.1$. We note that in these 2-dimensional networks, the following tensor equations must be solved in order to compute the required probabilities:

$$
x_1 = (1 - \alpha) P^{(1,1)} x_2 + \alpha z_1 \quad \text{and} \quad x_2 = (1 - \alpha) P^{(1,2)} x_1 + \alpha z_2,
$$

where $P^{(1,1)}$ and $P^{(1,2)}$ are two transition probability matrices of size $200 \times 200$ derived from $\mathcal{A}$.

We also construct two “ground-truth” communities and add noisy interactions in the generated networks. The two communities can overlap, i.e., an item can belong to two communities, see Fig. 4. In the figure, we assume that 25 percent of the items of each dimension in a community are overlapped. For the other generated multi-dimensional networks in Tables 1 and 2, the corresponding tensor equations can be set up similarly.

Fig. 4. The generated tensor with two overlapping communities. (a) $m = 2$, $\beta = 0.5$, $\gamma = 0.1$; (b) $m = 2$, $\beta = 0.25$, $\gamma = 0.1$; (c) $m = 2$, $\beta = 0.6$, $\gamma = 0.4$.

In Fig. 5, we show how F-measure and NMI for the discovered community change with the value of $\alpha$. We see that the values of F-measure and NMI increase when $\alpha$ increases up to around 0.4. For $\alpha > 0.4$ the values of F-measure and NMI are about the same. Therefore, we set $\alpha = 0.85$ to be the default value in MultiComm in the following experiments.

Tables 1 and 2 show the performance of the four algorithms for several values of $m$, $\beta$ and $\gamma$. The accuracy results of Clauset, LWP and MultiComm are averaged over ten runs of the corresponding algorithm with randomly selected seed items in the community. Since MetaFac computes the tensor factorization, it is independent of seed items. In MetaFac, we set the number of decompositions to be 2 for the tensors in Table 1 and select the community with the largest F-measure value from one of the decompositions. For the tensors in Table 2, we set the number of decompositions to be 3, and select the two communities with the largest and the second largest F-measure values from the tensor decomposition. According to Tables 1 and 2, we find that the performance of MultiComm is better than that of the other three algorithms. We note that in Table 2, 25 percent of the items of each dimension in a community are overlapped in the generated multi-dimensional networks. We have also tested the performance of the four methods for different percentages of overlapped items in a community. The results show that MultiComm gives higher values of F-measure and NMI than the other three methods.

TABLE 1 The Performance of Four Different Methods (One Community)

TABLE 2 The Performance of Four Different Methods (Two Communities)

Moreover, Fig. 6 shows how the local modularity changes against the number of items joined in the community on two generated 3-dimensional networks. We see from the figure that the local modularity no longer increases significantly once more than 200 items (the size of the ground-truth community) have joined the community. This phenomenon indicates that the local modularity can be used as a stopping criterion in the discovery of a community.
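The data generation for the 2-dimensional case can be sketched as follows. This is only one possible reading of the definitions of $\beta$ and $\gamma$: each cell inside the community block is filled independently with probability $\beta$, and each cell outside the block is filled with a probability chosen so that the expected number of noisy interactions is $\gamma q_2$. The function name `generate_network`, the independent sampling and the seed handling are assumptions of this sketch, not a description of the generator used in the experiments.

```python
import random


def generate_network(n1, n2, comm_rows, comm_cols, beta, gamma, seed=0):
    """Sample a binary n1 x n2 network with one planted community block."""
    rng = random.Random(seed)
    rows, cols = set(comm_rows), set(comm_cols)
    q1 = len(rows) * len(cols)  # interactions if the community were fully connected
    q2 = n1 * n2                # interactions if all items were fully connected
    # Noise is only drawn outside the block, so rescale gamma accordingly.
    noise_prob = min(1.0, gamma * q2 / (q2 - q1)) if q2 > q1 else 0.0
    inside, noise = set(), set()
    for i in range(n1):
        for j in range(n2):
            if i in rows and j in cols:
                if rng.random() < beta:
                    inside.add((i, j))
            elif rng.random() < noise_prob:
                noise.add((i, j))
    return inside, noise, q1, q2


# Setting resembling Fig. 3a: 200 items per dimension, community [1,100] x [1,66]
# (zero-based indices here), beta = 0.5 and gamma = 0.1.
inside, noise, q1, q2 = generate_network(200, 200, range(100), range(66), 0.5, 0.1)
beta_hat = len(inside) / q1
gamma_hat = len(noise) / q2
print(round(beta_hat, 2), round(gamma_hat, 2))
```

Because the cells are sampled independently, the empirical ratios `beta_hat` and `gamma_hat` only approximate the requested values; a generator that fixes $p_1$ and $p_2$ exactly would sample without replacement instead.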
Fig. 5. The changes of F-measure and NMI with respect to the values of $\alpha$. (a) Two community data generated with $m = 3$, $\beta = 5.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$; (b) Two community data generated with $m = 3$, $\beta = 3.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$.
Fig. 6. The local modularity changes against the number of items joined in the community. (a) One community data generated with $m = 3$, $\beta = 5.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$; (b) Two community data generated with $m = 3$, $\beta = 5.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$.

5.3 Experiment 2
In this experiment, we generate several tensors to represent a multi-dimensional network in order to test the performance of the proposed method. We consider an $m$-dimensional network represented by $S$ tensors. Similar to Experiment 1, we generate a “ground-truth” community in the network, and add noisy interactions randomly in the network. For example, we show in Fig. 7 a three-dimensional network generated by two 2-dimensional tensors $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ with $m = 3$, $S = 2$, $l_1 = 2$ ($j_1(1) = 1$ and $j_2(1) = 2$) and $l_2 = 2$ ($j_1(2) = 1$ and $j_2(2) = 3$). In this example, the community is composed of items 1 to 100 in the first dimension, items 1 to 66 in the second dimension and items 101 to 166 in the third dimension. We see from the figure that items 1 to 66 in the first dimension and items 1 to 66 in the second dimension are strongly linked together in the first tensor, and items 34 to 100 in the first dimension and items 101 to 166 in the third dimension are strongly linked together in the second tensor.

Fig. 7. The generated tensor with one community with $m = 3$, $\beta = 0.5$ and $\gamma = 0.1$.

Here the following tensor equations must be solved in MultiComm in order to compute the required probabilities:

$$
\begin{aligned}
x_1 &= (1 - \alpha)\left(\tfrac{1}{2} P^{(1,1)} x_2 + \tfrac{1}{2} P^{(2,1)} x_3\right) + \alpha z_1, \\
x_2 &= (1 - \alpha) P^{(1,2)} x_1 + \alpha z_2, \\
x_3 &= (1 - \alpha) P^{(2,2)} x_1 + \alpha z_3,
\end{aligned}
$$

where $P^{(1,1)}$ and $P^{(1,2)}$ are two transition probability matrices of size $200 \times 200$ derived from $\mathcal{A}^{(1)}$, and $P^{(2,1)}$ and $P^{(2,2)}$ are two transition probability matrices of size $200 \times 200$ derived from $\mathcal{A}^{(2)}$. For the other generated multi-dimensional networks, the corresponding tensor equations can be set up similarly.
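A simple way to solve the three coupled equations above is a fixed-point iteration, sketched below in plain Python. The sketch assumes that each transition probability matrix is obtained by normalizing the columns of the corresponding adjacency matrix (or of its transpose) and that all-zero columns are replaced by uniform columns; this normalization, the stopping rule and the function names are assumptions of this illustration rather than the procedure defined earlier in the paper.

```python
def column_stochastic(matrix):
    """Scale each column to sum to one; all-zero columns become uniform."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        total = sum(matrix[i][j] for i in range(n_rows))
        for i in range(n_rows):
            out[i][j] = matrix[i][j] / total if total else 1.0 / n_rows
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def restart_mix(walk, z, alpha):
    """Combine a walk step with the seed vector: (1 - alpha) * walk + alpha * z."""
    return [(1.0 - alpha) * w + alpha * s for w, s in zip(walk, z)]


def solve_three_dim(a1, a2, z1, z2, z3, alpha=0.85, tol=1e-12, max_iter=1000):
    """a1 links dimensions 1 and 2, a2 links dimensions 1 and 3."""
    p11 = column_stochastic(a1)             # dimension 1 given dimension 2
    p12 = column_stochastic(transpose(a1))  # dimension 2 given dimension 1
    p21 = column_stochastic(a2)             # dimension 1 given dimension 3
    p22 = column_stochastic(transpose(a2))  # dimension 3 given dimension 1
    x1, x2, x3 = list(z1), list(z2), list(z3)
    for _ in range(max_iter):
        walk1 = [0.5 * (u + v) for u, v in zip(matvec(p11, x2), matvec(p21, x3))]
        new1 = restart_mix(walk1, z1, alpha)
        new2 = restart_mix(matvec(p12, x1), z2, alpha)
        new3 = restart_mix(matvec(p22, x1), z3, alpha)
        change = sum(abs(a - b) for a, b in zip(new1 + new2 + new3, x1 + x2 + x3))
        x1, x2, x3 = new1, new2, new3
        if change < tol:
            break
    return x1, x2, x3


# Toy network (hypothetical): items 0 and 1 of every dimension form a block.
block = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
uniform = [1.0 / 3] * 3
x1, x2, x3 = solve_three_dim(block, block, [1.0, 0.0, 0.0], uniform, uniform)
```

With the seed placed on item 0 of the first dimension, the items of the second and third dimensions that are linked to the seed's block receive higher probabilities than the unrelated item, which is the behaviour the affinity comparison relies on.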
Tables 3 ($S = 2$) and 4 ($S = 3$) show the performance of the four methods for several values of $m$, $\beta$ and $\gamma$. For each $m$-dimensional network, we generate one community and set $l_1 = l_2 = \cdots = l_S = l = \lceil (2m - S + 1)/(S + 1) \rceil$. For example, when $m = 5$ and $S = 2$, $l$ is equal to 3, i.e., for a five-dimensional network, there are two 3-dimensional tensors to represent the network. For MetaFac, we set the number of decompositions to be 2 and select the community with the largest F-measure value from one of the decompositions. The accuracy results of Clauset, LWP and MultiComm are averaged over ten runs of the corresponding algorithm with randomly selected seed items in the community. We see from the two tables that the performance of MultiComm is better than those of the other three methods for different settings.

TABLE 3 The Performance of Four Different Methods on Multi-Dimensional Networks Generated with $S = 2$

TABLE 4 The Performance of Four Different Methods on Multi-Dimensional Networks Generated with $S = 3$

In Section 2, we used a simple example to illustrate that MultiComm has some advantages over the other three algorithms (MetaFac, Clauset and LWP) in handling multi-dimensional networks. In Experiments 1 and 2, we also show that MultiComm outperforms the other three algorithms. For MetaFac, the computed tensor decomposition may not be unique. Therefore, it may suffer from the local minima problem. For Clauset and LWP, the multi-dimensional network data must be converted into a matrix form in order to apply the methods. In the matrix setting, there are two disadvantages: (i) there is no direct interaction between the same entities; (ii) the interactions are duplicated in the matrix form. However, MultiComm has good theoretical properties, see Theorem 1 and the Appendix, available in the online supplemental material. Also, MultiComm generates communities according to interactions in the multi-dimensional networks by using the tensor representation directly. Thus the performance of MultiComm would be better than that of the other three algorithms.

Fig. 8. The local modularity changes against the number of items joined in the community. (a) The multi-dimensional network generated with $S = 2$, $m = 5$, $l = 3$, $\beta = 5.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$; (b) The multi-dimensional network generated with $S = 3$, $m = 7$, $l = 3$, $\beta = 5.0 \times 10^{-3}$ and $\gamma = 5.0 \times 10^{-4}$.

In addition, Fig. 8 shows how the local modularity changes with respect to the number of items joined in the community on two generated multi-dimensional networks. As each of these two multi-dimensional networks is represented by multiple tensors, here the local modularity refers to the average value of the local modularities corresponding to these tensors. For instance, when $S = 3$ we have three tensors $\mathcal{A}^{(1)}$, $\mathcal{A}^{(2)}$ and $\mathcal{A}^{(3)}$, and the local modularity is $(r_1 + r_2 + r_3)/3$, where $r_s$ is defined as in (4). We see from the figures that the local modularity no longer increases significantly beyond 350 items in Fig. 8a and beyond 450 items in Fig. 8b. These item numbers are indeed the sizes of the ground-truth communities in the generated 5-dimensional network and 7-dimensional network, respectively. These results demonstrate the usefulness of the local modularity designed in Section 4.1.
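The averaged local modularity can be sketched in Python as follows. The sketch assumes that each tensor is stored as a tuple of dimension labels together with the set of its nonzero index tuples, and that a community is a set of (dimension, item) pairs; these data structures, the function names and the toy data are assumptions of this illustration. Each $r_s$ is computed as in (4): the fraction of connections touching the boundary set whose items all lie in the community.

```python
def members(dims, entry):
    """Pair each index of a nonzero entry with its dimension label."""
    return [(d, i) for d, i in zip(dims, entry)]


def boundary_set(tensors, community):
    """Community items that share a connection with at least one outside item."""
    boundary = set()
    for dims, entries in tensors:
        for entry in entries:
            items = members(dims, entry)
            if any(x not in community for x in items):
                boundary.update(x for x in items if x in community)
    return boundary


def local_modularity(dims, entries, community, boundary):
    """r_s: internal boundary connections over all boundary connections."""
    touching = internal = 0
    for entry in entries:
        items = members(dims, entry)
        if any(x in boundary for x in items):
            touching += 1
            if all(x in community for x in items):
                internal += 1
    return internal / touching if touching else 0.0


def average_local_modularity(tensors, community):
    boundary = boundary_set(tensors, community)
    scores = [local_modularity(d, e, community, boundary) for d, e in tensors]
    return sum(scores) / len(scores)


# Toy data (hypothetical): one three-way tensor and one two-way tensor.
community = {("A", 1), ("A", 2), ("B", 1), ("C", 1)}
t1 = (("A", "B", "C"), {(1, 1, 1), (2, 1, 1), (1, 2, 1), (3, 1, 1), (1, 1, 2), (2, 3, 4)})
t2 = (("A", "B"), {(1, 1), (1, 2)})
r_avg = average_local_modularity([t1, t2], community)
```

In the toy data, all six connections of the first tensor touch the boundary and two of them are internal, giving $r_1 = 1/3$; the second tensor gives $r_2 = 1/2$, so the average is $5/12$.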
5.4 Experiment 3
In this experiment, we use the SIAM journal data and DBLP conference data to construct multi-dimensional networks and test the performance of MultiComm. For the SIAM journal data, we consider the papers published in SJMAEL (SIAM Journal on Matrix Analysis and Applications) from volume 18 to volume 32, in SJNAAM (SIAM Journal on Numerical Analysis) from volume 34 to volume 49, and in SJOCE3 (SIAM Journal on Scientific Computing) from volume 18 to volume 33. For the DBLP data, we consider the papers published in the KDD conference from 1999 to 2010 and in SIGIR from 2000 to 2010.

Journal/Conference-related Communities: We construct multi-dimensional networks as follows. The first step is to preprocess the data by keeping the authors that have at least two papers, together with their papers in the collection. We construct four tensors to represent each multi-dimensional network, namely, the paper-author-keyword tensor, the paper-paper citation tensor, the author-author collaboration tensor and the paper-category concept tensor. For the SIAM data set, the category concepts refer to the AMS codes in each paper. For the DBLP data set, the category concepts are provided in each paper. The descriptions of these tensors in the multi-dimensional SIAM and DBLP networks are shown in Table 5. There are four dimensions (papers, authors, keywords, category concepts) in each network, i.e., $m = 4$. Each network is represented as four tensors, i.e., $S = 4$. $\mathcal{A}^{(1)}$ is a tensor representing the interactions among papers, authors and keywords with $l_1 = 3$, $j_1(1) = 1$, $j_2(1) = 2$ and $j_3(1) = 3$. $\mathcal{A}^{(2)}$ is a tensor representing the citation interactions among papers with $l_2 = 2$, $j_1(2) = 1$ and $j_2(2) = 1$. (Here $l_2 = 2$ because the citation interactions are directed.) $\mathcal{A}^{(3)}$ is a tensor representing the collaboration interactions among authors with $l_3 = 1$ and $j_1(3) = 2$. $\mathcal{A}^{(4)}$ is a tensor representing the interactions among papers and concepts with $l_4 = 2$, $j_1(4) = 1$ and $j_2(4) = 4$. In MultiComm, the following tensor equations are built to calculate the required probabilities:

$$
\begin{aligned}
x_1 &= (1 - \alpha)\left(\tfrac{1}{4} P^{(1,1)} x_2 x_3 + \tfrac{1}{4} P^{(2,1)} x_1 + \tfrac{1}{4} P^{(2,2)} x_1 + \tfrac{1}{4} P^{(4,1)} x_4\right) + \alpha z_1, \\
x_2 &= (1 - \alpha)\left(\tfrac{1}{2} P^{(1,2)} x_1 x_3 + \tfrac{1}{2} P^{(3,1)} x_2\right) + \alpha z_2, \\
x_3 &= (1 - \alpha) P^{(1,3)} x_1 x_2 + \alpha z_3, \\
x_4 &= (1 - \alpha) P^{(4,2)} x_1 + \alpha z_4,
\end{aligned}
$$

where $P^{(1,1)}$, $P^{(1,2)}$ and $P^{(1,3)}$ are three transition probability tensors derived from $\mathcal{A}^{(1)}$, $P^{(2,1)}$ and $P^{(2,2)}$ are two transition probability tensors derived from $\mathcal{A}^{(2)}$, $P^{(3,1)}$ is a transition probability tensor derived from $\mathcal{A}^{(3)}$, and $P^{(4,1)}$ and $P^{(4,2)}$ are two transition probability tensors derived from $\mathcal{A}^{(4)}$. We test the usefulness of MultiComm by evaluating the three journal communities (SJMAEL, SJNAAM and SJOCE3) discovered in the SIAM network and the two conference communities (KDD and SIGIR) in the DBLP network.
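Products such as $P^{(1,1)} x_2 x_3$ in the equations above contract a three-way transition probability tensor with two probability vectors. The Python sketch below shows one way to do this for a sparse binary tensor stored as a set of index triples. It assumes that $P^{(1,1)}$ is obtained by normalizing each $(j, k)$ fiber of the tensor along the first index, and that fibers with no nonzero entry are treated as uniform; the function name `contract_mode1`, the storage format and the treatment of empty fibers are assumptions of this illustration.

```python
def contract_mode1(entries, n1, x2, x3):
    """Return y with y[i] = sum over (j, k) of p(i, j, k) * x2[j] * x3[k].

    entries: set of (i, j, k) index triples of a binary three-way tensor.
    n1: number of items in the first dimension.
    x2, x3: probability vectors (each summing to one) of dimensions 2 and 3.
    """
    # Count the nonzero entries of each (j, k) fiber to normalize it.
    fiber_totals = {}
    for _, j, k in entries:
        fiber_totals[(j, k)] = fiber_totals.get((j, k), 0) + 1
    y = [0.0] * n1
    for i, j, k in entries:
        y[i] += x2[j] * x3[k] / fiber_totals[(j, k)]
    # Probability mass on empty fibers is spread uniformly over dimension 1.
    covered = sum(x2[j] * x3[k] for j, k in fiber_totals)
    leftover = (1.0 - covered) / n1
    return [value + leftover for value in y]


# Toy tensor (hypothetical): 3 papers, 2 authors, 2 keywords.
toy_entries = {(0, 0, 0), (1, 0, 0), (2, 1, 1)}
y = contract_mode1(toy_entries, 3, [0.5, 0.5], [0.5, 0.5])
```

Handling the empty fibers through the single `leftover` term keeps the work proportional to the number of nonzero entries, and the result remains a probability vector, so it can be used directly in a fixed-point iteration over the equations.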
We report the values of F-measure and NMI by comparing the discovered communities with the ground-truth journal or conference labels on the paper entity. We remark that the ground-truth labels of the other entities are not available. The four methods are run on the SIAM network and the DBLP network. For MetaFac, we set the number of decompositions to be 3 for the SIAM network and 2 for the DBLP network. Table 6 shows the results of the four methods. For Clauset, LWP and MultiComm, we use 10 papers and 10 authors that are densely connected in the corresponding community as seed items. We see from the table that the performance of MultiComm is better than those of the other three methods on both networks. We also see that LWP performs much worse on real data than on the synthetic data in Experiments 1 and 2. This is because the real data contains many noisy interactions and the community structure is less clear than in the synthetic data. Moreover, the LWP method stops too early, as it cannot find items to move into or out of the current community such that its modularity is increased. For example, it stops with a community of 66 items when we try to identify the KDD community on the DBLP network.

TABLE 5 The Descriptions of SIAM Multi-Dimensional Network and DBLP Multi-Dimensional Network
The notation nz represents the number of nonzero entries in the corresponding tensor.

TABLE 6 The Results of Four Methods on SIAM and DBLP Data Sets for Journal/Conference Communities
For SIAM data, there are 899 papers, 1,416 papers and 1,421 papers from SJMAEL, SJNAAM and SJOCE3 respectively. For DBLP data, there are 992 papers and 1,370 papers from KDD and SIGIR respectively. The $\beta$ and $\gamma$ values are computed based on the citation interactions of the paper entity, as the labels of the other entities are not available.

Category Concept-Related Communities: In this setting, we construct the multi-dimensional networks by using the tensors $\mathcal{A}^{(1)}$, $\mathcal{A}^{(2)}$ and $\mathcal{A}^{(3)}$ because we want to discover category concept-related communities. The corresponding tensor equations to be solved in MultiComm are:

$$
\begin{aligned}
x_1 &= (1 - \alpha)\left(\tfrac{1}{3} P^{(1,1)} x_2 x_3 + \tfrac{1}{3} P^{(2,1)} x_1 + \tfrac{1}{3} P^{(2,2)} x_1\right) + \alpha z_1, \\
x_2 &= (1 - \alpha)\left(\tfrac{1}{2} P^{(1,2)} x_1 x_3 + \tfrac{1}{2} P^{(3,1)} x_2\right) + \alpha z_2, \\
x_3 &= (1 - \alpha) P^{(1,3)} x_1 x_2 + \alpha z_3.
\end{aligned}
$$

In MetaFac, we set the number of decompositions to be 10, and select the decomposition that corresponds to the largest F-measure for each testing category concept. Table 7 shows the results of the four methods. Again, we see from the table that the performance of MultiComm is better than those of the other three methods on both networks in the discovery of category concept-related communities. Similarly, the performance of LWP is poor because it stops too early.

TABLE 7 The Results of Four Methods on SIAM and DBLP Data Sets for Category Concept Communities
For SIAM data, there are 665 papers, 355 papers, 670 papers, 262 papers and 274 papers labeled as 65F10, 65F15, 65N30, 65N15 and 65N55, respectively. For DBLP data, there are 137 papers, 104 papers, 105 papers and 243 papers labeled as clustering, performance evaluation, information filtering and retrieval models respectively. The $\beta$ and $\gamma$ values are computed based on the citation interactions of the paper entity, as the labels of the other entities are not available.
6
CONCLUDING REMARKS
In this paper, we have proposed a framework (MultiComm) to determine communities in a multi-dimensional network based on the probability distribution of each dimension/entity computed from the network. Both theoretical and experimental results have demonstrated that the proposed algorithm is efficient and effective. On the other hand, in social networks, user actions are constantly changing and co-evolving. In future work, the proposed model needs to be adapted to be time-varying. As probability distributions are non-stationary in this situation, we must consider and study statistical dependence in time-varying Markov chains [35] for items of different dimensions to obtain the affinities among them, in order to find the evolution of communities across different time stamps.
ACKNOWLEDGMENTS
The work of X. Li was supported in part by NSFC under Grant No. 61100190. The work of M. Ng was supported in part by Centre for Mathematical Imaging and Vision, HKRGC Grant No. 201812 and HKBU FRG Grant No. FRG2/11-12/127. The work of Y. Ye was supported in part by NSFC under Grant No. 61272538, National Key Technology R&D Program of MOST China under Grant No. 2012BAK17B08, and Shenzhen Strategic Emerging Industries Program under Grant Nos. ZDSY20120613125016389 and JCYJ20120613135329670. The authors would also like to thank SIAM for providing the SIAM journal data for experiments. Yunming Ye is the corresponding author.
REFERENCES
[1] M. Girvan and M. Newman, “Community Structure in Social and Biological Networks,” Proc. Nat’l Academy of Sciences USA, vol. 99, no. 12, p. 7821, 2002.
[2] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network Motifs: Simple Building Blocks of Complex Networks,” Science, vol. 298, no. 5594, p. 824, 2002.
[3] M. Newman, “The Structure and Function of Complex Networks,” SIAM Rev., vol. 45, no. 2, pp. 167-256, 2003.
[4] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society,” Nature, vol. 435, no. 7043, pp. 814-818, 2005.
[5] S. Strogatz, “Exploring Complex Networks,” Nature, vol. 410, no. 6825, pp. 268-276, 2001.
[6] M. Newman and M. Girvan, “Finding and Evaluating Community Structure in Networks,” Physical Rev. E, vol. 69, no. 2, p. 026113, 2004.
[7] G. Flake, S. Lawrence, C. Giles, and F. Coetzee, “Self-Organization and Identification of Web Communities,” Computer, vol. 35, no. 3, pp. 66-70, 2002.
[8] B. Yang, J. Liu, and J. Feng, “On the Spectral Characterization and Scalable Mining of Network Communities,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 2, pp. 326-337, Feb. 2012.
[9] J. Ruan and W. Zhang, “An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks,” Proc. Seventh IEEE Int’l Conf. Data Mining (ICDM ’07), pp. 643-648, Jan. 2007.
[10] M. Shiga, I. Takigawa, and H. Mamitsuka, “A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network,” Proc. 13th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’07), pp. 647-656, 2007.
[11] S. Smyth, “A Spectral Clustering Approach to Finding Communities in Graphs,” Proc. Fifth SIAM Int’l Conf. Data Mining (SDM ’05), 2005.
[12] J. Bagrow and E. Bollt, “Local Method for Detecting Communities,” Physical Rev. E, vol. 72, no. 4, p. 046108, 2005.
[13] A. Clauset, “Finding Local Community Structure in Networks,” Physical Rev. E, vol. 72, no. 2, p. 026132, 2005.
[14] F. Luo, Y. Yang, C. Chen, R. Chang, J. Zhou, and R. Scheuermann, “Modular Organization of Protein Interaction Networks,” Bioinformatics, vol. 23, no. 2, p. 207, 2007.
[15] F. Luo, J.Z. Wang, and E. Promislow, “Exploring Local Community Structures in Large Networks,” Web Intelligence and Agent Systems, vol. 6, no. 4, pp. 387-400, 2008.
[16] M. Ng, X. Li, and Y. Ye, “MultiRank: Co-Ranking for Objects and Relations in Multi-Relational Data,” Proc. 17th ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD ’11), pp. 1217-1225, 2011.
[17] X. Li, M. Ng, and Y. Ye, “HAR: Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search,” Proc. 12th SIAM Int’l Conf. Data Mining (SDM ’12), pp. 141-152, 2012.
[18] Y. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher, “Metafac: Community Discovery via Relational Hypergraph Factorization,” Proc. 15th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’09), pp. 527-536, 2009.
[19] L. Tang, H. Liu, and J. Zhang, “Identifying Evolving Groups in Dynamic Multimode Networks,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 1, pp. 72-85, Jan. 2012.
[20] C.H.Q. Ding, X. He, H. Zha, M. Gu, and H.D. Simon, “A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering,” Proc. IEEE Int’l Conf. Data Mining, pp. 107-114, 2001.
[21] R. Andersen and K.J. Lang, “Communities from Seed Sets,” Proc. 15th Int’l Conf. World Wide Web (WWW ’06), pp. 223-232, 2006.
[22] A. Mehler and S. Skiena, “Expanding Network Communities from Representative Examples,” ACM Trans. Knowledge Discovery from Data, vol. 3, no. 2, article 7, 2009.
[23] J. Chen and Y. Saad, “Dense Subgraph Extraction with Application to Community Detection,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 7, pp. 1216-1230, July 2012.
[24] Y. Chi, S. Zhu, Y. Gong, and Y. Zhang, “Probabilistic Polyadic Factorization and Its Application to Personalized Recommendation,” Proc. 17th ACM Conf. Information and Knowledge Management (CIKM ’08), pp. 941-950, 2008.
[25] E. Acar and B. Yener, “Unsupervised Multiway Data Analysis: A Literature Survey,” IEEE Trans. Knowledge and Data Eng., vol. 21, no. 1, pp. 6-20, 2009.
[26] J. Pan, H. Yang, C. Faloutsos, and P. Duygulu, “Automatic Multimedia Cross-Modal Correlation Discovery,” Proc. 10th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’04), pp. 653-658, 2004.
[27] H. Tong and C. Faloutsos, “Center-Piece Subgraphs: Problem Definition and Fast Solutions,” Proc. 12th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’06), pp. 404-413, 2006.
[28] K. Macropol, T. Can, and A. Singh, “RRW: Repeated Random Walks on Genome-Scale Protein Networks for Local Cluster Discovery,” BMC Bioinformatics, vol. 10, no. 1, article 283, 2009.
[29] J. Xia, D. Caragea, and W. Hsu, “Bi-relational Network Analysis Using a Fast Random Walk with Restart,” Proc. IEEE Ninth Int’l Conf. Data Mining (ICDM ’09), pp. 1052-1057, 2009.
[30] H. Tong, C. Faloutsos, and J. Pan, “Random Walk with Restart: Fast Solutions and Applications,” Knowledge and Information Systems, vol. 14, no. 3, pp. 327-346, 2008.
[31] S. Ross, Introduction to Probability Models. Academic Press, 2007.
[32] L. Page, S. Brin, R. Motwani, and T. Winograd, “The Pagerank Citation Ranking: Bringing Order to the Web,” Technical Report, Stanford InfoLab, 1998.
[33] T. Haveliwala, “Topic-Sensitive PageRank,” Proc. 11th Int’l Conf. World Wide Web (WWW ’02), 2002.
[34] A. Strehl and J. Ghosh, “Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions,” J. Machine Learning Research, vol. 3, pp. 583-617, 2003.
[35] W. Ching and M. Ng, Markov Chains: Models, Algorithms and Applications. Int’l Series on Operations Research and Management Science, Springer, 2006.

Xutao Li received the BS and MS degrees in computer science from Lanzhou University and Harbin Institute of Technology in China in 2007 and 2009, respectively. He is currently working toward the PhD degree in the Department of Computer Science, Harbin Institute of Technology. His research interests include data mining, graph mining and social network analysis, especially tensor based learning and mining algorithms.
Michael K. Ng received the BSc degree in 1990 and MPhil degree in 1992 from the University of Hong Kong, and the PhD degree in 1995 from the Chinese University of Hong Kong. He is currently a professor in the Department of Mathematics at the Hong Kong Baptist University. His research interests include bioinformatics, image processing, scientific computing and data mining, and he serves on the editorial boards of international journals. For more information visit http://www.math.hkbu.edu.hk/mng.

Yunming Ye received the PhD degree in computer science from Shanghai Jiao Tong University. He is currently a professor in the Shenzhen Graduate School, Harbin Institute of Technology. His research interests include data mining, text mining, and ensemble learning algorithms.