HCDF: A Hybrid Community Discovery Framework

Keith Henderson∗  Tina Eliassi-Rad∗  Spiros Papadimitriou†  Christos Faloutsos‡
∗ Lawrence Livermore National Laboratory  † IBM T.J. Watson  ‡ Carnegie Mellon University
{keith, eliassi}@llnl.gov
[email protected]  [email protected]

Abstract

We introduce a novel Bayesian framework for hybrid community discovery in graphs. Our framework, HCDF (short for Hybrid Community Discovery Framework), can effectively incorporate hints from a number of other community detection algorithms and produce results that outperform the constituent parts. We describe two HCDF-based approaches which are: (1) effective, in terms of link prediction performance and robustness to small perturbations in network structure; (2) consistent, in terms of effectiveness across various application domains; (3) scalable to very large graphs; and (4) nonparametric. Our extensive evaluation on a collection of diverse and large real-world graphs, with millions of links, shows that our HCDF-based approaches (a) achieve up to 0.22 improvement in link prediction performance as measured by area under the ROC curve (AUC), (b) never have an AUC that drops below 0.91 in the worst case, and (c) find communities that are robust to small perturbations of the network structure as defined by Variation of Information (an entropy-based distance metric).

1 Introduction

Graph analysis methods have recently attracted significant interest. In this paper, we focus on the problem of discovering community structures in large, real-world graphs. In this context, three desirable properties are:

P1. Effective: The global connectivity patterns of the network are successfully factored into communities, which are highly predictive of individual links and robust to small perturbations in network structure.

P2. Scalable: Time and space complexity are strictly sub-quadratic w.r.t. the number of nodes, scaling up to graphs with millions of links.

P3. Nonparametric: The number of communities need not be specified a priori; instead, it is determined from the data itself.

Graphs have attracted interest because they arise naturally in many different settings, and across a very diverse set of domains, including bibliographic analysis,
social networks, the World-Wide Web, computer networks, product recommendations, and so on. Consequently, a fourth property is also desirable:

P4. Consistent: The effectiveness of the discovered community structure, as defined in (P1), is consistently high across a wide range of data sets.

Despite the recent interest, approaches that possess all of the above properties, (P1) through (P4), are rare. Here, we introduce two hybrid approaches that exhibit all of them. Our approaches are based on a novel Bayesian framework, HCDF (for Hybrid Community Discovery Framework), which can successfully incorporate communities discovered via other, non-Bayesian approaches as hints that lead to improved effectiveness of results and consistency across various domains.

Why is hybrid community discovery a good idea? Intuitively, soft clustering approaches with mixed membership (most of them Bayesian) have more leeway to explain a node's links, compared to hard clustering approaches (most of them non-Bayesian). On one hand, if a node's links cannot be explained by a single membership, soft clustering has an advantage. On the other hand, if a node's links can be explained almost equally well by a number of single and mixed memberships, hard clustering may make a simpler assignment. Therefore, combining these two clustering approaches can potentially lead to improved community factorization. However, designing a framework that can successfully combine different approaches is not a trivial task. We extensively studied three orthogonal aspects, which we summarize next.

Core Bayesian model. This serves as the foundation upon which HCDF is built. We chose Latent Dirichlet Allocation on Graphs (LDA-G) [12] as the core Bayesian method for community detection, because it is simple and satisfies properties (P1) to (P3).

Hint sources. Among known methods, we examined two that also satisfy properties (P1) to (P3). The first is Fast Modularity (FM) [5, 18], a modularity-based partitioning method. The second is Cross-Associations (XA) [3], which is an MDL-based approach. Both methods produce hard clusters (also referred to as
communities or groups).

Coalescing strategies. We explored three options for incorporating hints into the core Bayesian model: seed, which uses hints only as an initial configuration for the inference procedure; prior, which propagates hints from one configuration to the next; and attribute, which incorporates hints as additional link-attributes.

To quantitatively measure the effectiveness of the discovered community structure, we use link prediction performance and robustness to small perturbations in the network structure.

Why is link prediction a good measure? A factorization of the graph's connectivity structure into a number of communities is good if it can be used to predict the presence or absence of links between a pair of nodes based on their respective communities. Therefore, we evaluate effectiveness by randomly holding out a number of links, building the model under evaluation, and then trying to predict the held-out links. We use area under the ROC curve (AUC) as an accurate measure of performance.¹

¹ AUC is equal to the probability of a model ranking a randomly chosen positive instance higher than a randomly chosen negative instance. The default value for AUC is 0.5. In our case, a positive instance is a non-zero entry in the adjacency matrix, while a negative instance is a zero entry.

Why is robustness to small perturbations in the network structure a good measure? As Karrer et al. argue, community structures that are "significant / believable" should be able to withstand small perturbations in the network structure [13]. We utilize their quantification of robustness, measured by Variation of Information ∆: an entropy-based distance metric that measures the distance between two clusterings, one on the original network and one on the perturbed network.

We describe an extensive evaluation on several real data sets from a diverse range of domains (see Section 4). Our methods, HCD-X (which utilizes hints from Cross-Associations as link-attributes) and HCD-M (which utilizes hints from Fast Modularity as link-attributes), achieve significant improvements in effectiveness across the board, while maintaining scalability. We observe up to 0.22 improvement in AUC scores for link prediction. The average AUC scores for HCD-M and HCD-X (across nine data sets and five trial runs) are 0.95 and 0.96, respectively. Furthermore, the AUC of our hybrid methods never drops below 0.91 in the worst case. With respect to robustness to small perturbations in the network structure, both of our hybrid methods maintain low values for Variation of Information, ∆ ∈ [0, 1]. For example, with 10% of the links randomly rewired (while maintaining the same degree distribution), our hybrid methods have on average a ∆ of 0.19 (indicating that "significant / believable" communities were found).

Summarizing, the main contributions of the paper are:

• We propose a generic Bayesian framework, HCDF, for hybrid community discovery, and present three ways of coalescing hints from various community discovery algorithms.

• We propose two novel solutions, HCD-X and HCD-M, for the task of community discovery that are (1) effective in terms of link prediction and robustness to small perturbations in network structure, (2) scalable, (3) nonparametric, and (4) consistent across various application domains.

• We present results from an extensive evaluation of HCD-X and HCD-M on several large real-world graphs, with millions of links. Our results demonstrate that HCD-X and HCD-M are highly effective and consistent across different application domains.

The rest of the paper is organized as follows. Next, we review the related work. We present our proposed method in Section 3, followed by experimental results and discussion. We conclude the paper in Section 5.
2 Related Work

Bayesian Approaches to Community Discovery in Graphs. There are only a handful of scalable Bayesian approaches to community discovery in graphs [26, 23, 22, 15, 12]; most of them extend Latent Dirichlet Allocation (LDA) [2], a latent variable model for topic modeling. SSN-LDA [23] and LDA-G [12] are the simplest adaptations of LDA for community discovery in graphs. GWN-LDA [22] introduces a Gaussian distribution with an inverse-Wishart prior on an LDA-based model to find communities in social networks with weighted links. LDA-based models have also been used to find communities in textual attributes and relations [24, 15]. For simplicity's sake, we chose LDA-G for the Bayesian constituent of HCD-X and HCD-M; we could have easily chosen one of the other scalable Bayesian approaches. The effectiveness and consistency of these non-hybrid Bayesian models for community discovery on a diverse collection of real-world graphs is unknown. In [15], the authors propose a multi-step approach involving LDA to cluster large document collections containing both text and relations. In [16, 10], the authors propose a two-step approach to find clusters in data containing attributes and relations (namely, citation graphs and a gene-interaction network). The strategies used to incorporate information from one step to the next are different in HCDF than in these methods.
Also, it is not clear how these methods would perform across a diverse collection of real-world graphs (beyond citation networks and biological networks).

Non-Bayesian Approaches to Community Discovery in Graphs. There are numerous non-Bayesian methods for community detection, including the popular METIS [14], spectral methods [4], maximum-flow methods [9], co-clustering [7], multi-level clustering (Graclus) [6], and others [25]. None of these methods satisfy the nonparametric property of our problem statement, because they require the user to specify the number of clusters k, or some other related threshold. Cross-Associations (XA) [3] is one of the few methods that automatically determines the number of clusters, as the one that minimizes the description length of the whole graph. Graphically, XA rearranges the rows and columns of the adjacency matrix such that each row-group/column-group intersection is as dense or as sparse as possible. The runtime complexity of XA for a graph G = [V, E] is O(|E|), which is usually much smaller than O(|V|²). Extensions of XA that detect communities in time-evolving graphs are timeFalls [8] and GraphScope [20]; they determine communities, as well as their merges and splits, as time grows. Although scalable and effective, XA cannot handle hints, neither in its base form nor in its extensions. Another popular method that also does not require the number of communities a priori is based on the concept of "modularity" [18], a spectral partitioning method. However, its original version has prohibitive complexity. Its fast version (FM) [5] addresses the runtime complexity, with O(|V| · log²(|V|)), though it operates only on the largest connected component of a graph and not on the entire graph. None of the modularity approaches are able to handle hints.

Different Notions of a Good Community. FM's notion of a good community is one in which the density of intra-community links exceeds that of inter-community links, relative to the expected number. XA's notion of a good community is based on minimizing the total encoding cost, where community members tend to have the same set of neighboring nodes and may not be linked to each other at all. LDA-G's notion of a good community is similar to XA's, except that community structure is found by maximizing likelihood and community members tend to have the same set of neighboring nodes in similar proportions (i.e., it produces soft memberships in communities).
3 Proposed Method

Table 1 lists the notation used in this paper. We first briefly describe LDA-G, which is the core Bayesian model in our proposed HCDF, as well as an extension to LDA-G that handles hints as additional observed attributes. Then, we provide details on HCDF.

3.1 Preliminaries

LDA-G [12] is an extension of Latent Dirichlet Allocation (LDA) [2] for use on graphs rather than text corpora. LDA-G models each source node in the graph as a multinomial distribution over some set of communities Z. The cardinality of Z is unknown a priori and is learned during inference (by adopting a Dirichlet prior). In LDA-G, each source node generates a series of communities from its multinomial, and each community is a multinomial distribution over target nodes. Any time a community is generated by a source node, that community generates a target node from its distribution. The source-node-to-community and community-to-target-node distributions are learned using MCMC techniques (most commonly Gibbs sampling) [11]. To simplify inference, it is assumed that the roles of a node as a source node and as a target node are probabilistically independent. The generative model for LDA-G is as follows:

(3.1)  v_i | z_i, φ^(z_i) ∼ Discrete(φ^(z_i))
(3.2)  φ ∼ Dirichlet(β)
(3.3)  z_i | θ^(u_i) ∼ Discrete(θ^(u_i))
(3.4)  θ ∼ Dirichlet(α)
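For intuition, the following is a minimal sketch of this generative process. The fixed K, graph sizes, and symmetric hyperparameters are assumptions of the illustration (LDA-G itself learns the number of communities), and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_src, N_tgt = 5, 100, 100          # assumed sizes for illustration only
alpha, beta = 1.0, 1.0                 # symmetric Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=N_src)   # source -> community (3.4)
phi = rng.dirichlet(beta * np.ones(N_tgt), size=K)      # community -> target (3.2)

def generate_links(u, degree):
    """Generate `degree` outgoing links for source node u (Equations 3.1 and 3.3)."""
    z = rng.choice(K, size=degree, p=theta[u])            # sample communities
    return [(u, rng.choice(N_tgt, p=phi[k])) for k in z]  # sample targets

links = [e for u in range(N_src) for e in generate_links(u, degree=3)]
```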
Then, LDA-G's update equation is:

(3.5)  p(z_i = k | z_{-i}, v) ∝ (n_u^k + α)/(n_u + αK) · (n_k^v + β)/(n_k + βN)
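As a concrete reference, here is a minimal sketch of one collapsed Gibbs step implementing Equation 3.5. The count arrays mirror Table 1, the fixed K is an assumption of the illustration, and this is not the authors' code.

```python
import numpy as np

def sample_community(u, v, z_old, n_uk, n_u, n_kv, n_k, alpha, beta, K, N, rng):
    """One collapsed Gibbs step for a single link <u, v> (Equation 3.5).

    n_uk[u, k]: count of community k for source-node u
    n_u[u]    : total count for source-node u
    n_kv[k, v]: count of target-node v in community k
    n_k[k]    : total count for community k
    """
    # Remove the link's current assignment from the counts.
    n_uk[u, z_old] -= 1; n_u[u] -= 1
    n_kv[z_old, v] -= 1; n_k[z_old] -= 1

    # Unnormalized posterior over communities (Equation 3.5).
    p = ((n_uk[u, :] + alpha) / (n_u[u] + alpha * K)) * \
        ((n_kv[:, v] + beta) / (n_k + beta * N))
    p /= p.sum()

    # Sample the new community and restore the counts.
    z_new = rng.choice(K, p=p)
    n_uk[u, z_new] += 1; n_u[u] += 1
    n_kv[z_new, v] += 1; n_k[z_new] += 1
    return z_new
```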
Unlike most approaches to community discovery, LDA-G only requires present links (i.e., non-zero entries in the adjacency matrix). This property helps keep its runtime and space complexities low. Its runtime complexity is O(NKM); its space complexity is O(N(K + M)). Since K ≪ N and M ∼ log(N), LDA-G runs in O(N · log(N)) ≈ O(|E|).

3.2 Extending LDA-G to Handle Node-Attributes

For use in HCDF, we extend LDA-G to graphs with attributes by augmenting communities with a new collection of multinomial distributions, one per link-attribute. Each node-attribute generates two link-attributes, since source and target nodes are treated separately and independently. The generative model for LDA-G with link-attributes adds the following generators to the aforementioned LDA-G model:

(3.6)  a_{ij} | z_i, ρ_j^(z_i) ∼ Discrete(ρ_j^(z_i))
(3.7)  ρ ∼ Dirichlet(γ)

Figure 1 depicts the graphical model for LDA-G with attributes. The observables are the target nodes v and the attributes a; the latent variables are the communities z. This model is appropriate for categorical attributes. However, attributes can be binary-valued or real-valued, so long as they are given the appropriate priors. We use Gibbs sampling [11] for the inference procedure. The runtime complexity of LDA-G on graphs with attributes is O(NKMA) ≈ O(|E|). The update equation for LDA-G on graphs with attributes is:

(3.8)  p(z_i = k | z_{-i}, u, v, a_1, a_2, ..., a_A) ∝ (n_u^k + α)/(n_u + αK) · (n_k^v + β)/(n_k + βN) · Π_{i=1}^{A} (n_k^{a_i} + γ)/(n_k + A_i γ)
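Equation 3.8 differs from Equation 3.5 only by one extra factor per link-attribute. The sketch below shows the resulting per-link posterior; the count arrays for the attribute counts n_k^{a_i} and the fixed K are assumptions of the illustration, not the authors' implementation.

```python
import numpy as np

def community_posterior(u, v, attrs, n_uk, n_u, n_kv, n_k, n_ka, alpha, beta, gamma, K, N, A_i):
    """Normalized p(z_i = k | ...) over k from Equation 3.8.

    attrs   : list of attribute values a_1..a_A on link <u, v>
    n_ka[j] : K x (number of values of attribute j) count matrix
    A_i[j]  : number of possible values of attribute j
    """
    p = ((n_uk[u, :] + alpha) / (n_u[u] + alpha * K)) * \
        ((n_kv[:, v] + beta) / (n_k + beta * N))
    for j, a in enumerate(attrs):                  # extra product term per attribute
        p *= (n_ka[j][:, a] + gamma) / (n_k + A_i[j] * gamma)
    return p / p.sum()
```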
Symbol | Definition
V | set of nodes
E | set of links
N | number of nodes = |V|
M | average node degree
Z | set of communities
K | number of communities = |Z|
R | set of attributes to be added to the graph as hints
A | number of attributes = |R|
u_i | ith source node (analogous to a document in LDA)
v_i | ith target node (analogous to a word in LDA)
z_i | ith community (analogous to a topic in LDA)
ζ_i | community for graph element i
ζ_i^r | row-group for graph element i
ζ_i^c | column-group for graph element i
a_{ij} | jth attribute value on link i
A_i | maximum number of possible values for attribute i
n_u^k | count of community k in source-node u
n_u | Σ_{k=1}^{K} n_u^k
n_k^v | count of target-node v in community k
n_k | Σ_{v=1}^{N} n_k^v
n_k^{a_i} | count of attribute a_i in community k
θ | Dirichlet prior on source-to-community distributions
α | hyperparameter for Dirichlet on source-to-community
φ | Dirichlet prior on community-to-target distributions
β | hyperparameter for Dirichlet on community-to-target
ρ | Dirichlet prior on attribute distributions
γ | hyperparameter for Dirichlet on attributes

Table 1: Notations used in the paper.
[Figure 1: Graphical model for LDA-G with Attributes.]
3.3 Hybrid Community Detection Framework (HCDF)

HCDF takes any pair of ⟨non-Bayesian, Bayesian⟩ community discovery algorithms and a coalescing strategy, and produces an algorithm for finding community structure in graphs.

3.3.1 HCDF(algorithm1, LDA-G, ATTR)

HCDF's coalescing strategy, ATTRIBUTE (or ATTR for short), incorporates the communities found by algorithm1 into LDA-G by treating them as attributes. In other words, the group assignments of algorithm1 become attributes of the input graph. Then, LDA-G learns a model of the community structure given both the network structure and these attributes.

To make our discussion more concrete, assume we have chosen XA for algorithm1.
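Before the full algorithm, here is a minimal sketch of the ATTR coalescing step under this choice: the hint algorithm's hard row- and column-group assignments become categorical link-attributes that the attributed LDA-G of Section 3.2 then consumes. The dictionaries `row_group` and `col_group` are hypothetical XA outputs; this is an illustration, not the authors' code.

```python
def links_with_hints(edges, row_group, col_group):
    """Attach two categorical link-attributes to every link <u, v>:
    the hint algorithm's row-group of the source u and column-group of the target v."""
    return [((u, v), (row_group[u], col_group[v])) for (u, v) in edges]

# Example usage (hypothetical XA output):
# attributed_edges = links_with_hints(E, row_group=xa_rows, col_group=xa_cols)
# ...then run LDA-G with attributes (Equation 3.8) over attributed_edges.
```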
Algorithm 1 presents our HCD-X algorithm (an instantiation of HCDF). Given a graph G = [V, E], HCD-X runs XA and treats its row- and column-groups as link-attributes. Then, HCD-X runs LDA-G on the attributed graph (i.e., G with these link-attributes). HCD-X treats each node's row-group (i.e., cluster label) as one categorical attribute and its column-group as another categorical attribute (rather than treating each row-group/column-group intersection as a single attribute). In graphs with non-symmetric adjacency matrices, these row- and column-groups are often different. In Algorithm 1, XA can be replaced with any other scalable community detection algorithm; specifically, when we replace XA with FM, we refer to the result as HCD-M.

Algorithm 1 HCDF(XA, LDA-G, ATTR) = HCD-X
Require: Graph G = [V, E]
  /* Run Cross-Associations on G */
  [rowGroups, colGroups] = XA(G)
  R = [rowGroups, colGroups]
Ensure: R ≡ {a_i}, ∀i ∈ [1, A] and ∀<u, v> ∈ E
  /* Run LDA-G on graph G with attributes R and hyperparameters γ ≈ 10 and α = β ≈ 1 */
  /* Set the prior */
  a_{i1} ← a_i; a_{i2} ← a_i
  Initialize all count variables n_u, n_u^k, n_k, n_k^v, n_k^{a_{i1}}, n_k^{a_{i2}} to 0
  for each link <u, v> ∈ E do
    Sample community z_i = k using Equation 3.8
    Increment count variables n_u, n_u^k, n_k, n_k^v, n_k^{a_{i1}}, n_k^{a_{i2}} by 1
  end for
  /* Run Gibbs sampling */
  for each edge <u, v> ∈ E with <u, v> in community k do
    Decrement count variables n_u, n_u^k, n_k, n_k^v, n_k^{a_{i1}}, n_k^{a_{i2}} by 1
    Sample community z_i = k using Equation 3.8
    Increment count variables n_u, n_u^k, n_k, n_k^v, n_k^{a_{i1}}, n_k^{a_{i2}} by 1
  end for

3.3.2 HCDF(algorithm1, LDA-G, SEED)

HCDF's coalescing strategy, SEED, is the simplest way of fusing the output of a given community discovery algorithm, algorithm1, with LDA-G. HCDF(algorithm1, LDA-G, SEED) merely uses the communities from algorithm1 to set the initial configuration for LDA-G, then discards this information (i.e., forgets the hints) in subsequent sampling. Algorithm 2 outlines HCDF(algorithm1, LDA-G, SEED), where Gibbs sampling is used to conduct inference in LDA-G and algorithm1's communities are used to initialize the count variables.

Algorithm 2 HCDF(algorithm1, LDA-G, SEED)
Require: Graph G = (V, E)
  /* Run community discovery algorithm algorithm1 on G */
  hint = algorithm1(G)
Ensure: hint ≡ [s_u, s_u^k, s_k, s_k^v]
  /* Run LDA-G on graph G */
  /* Set the prior; initialize all count variables */
  n_u ← s_u; n_u^k ← s_u^k; n_k ← s_k; n_k^v ← s_k^v
  /* Select initial configuration */
  for each edge <u, v> ∈ E do
    Sample community z_i = k using Equation 3.5
    Increment count variables n_u, n_u^k, n_k, n_k^v by 1
  end for
  /* Forget the hint */
  n_u ← n_u − s_u; n_u^k ← n_u^k − s_u^k; n_k ← n_k − s_k; n_k^v ← n_k^v − s_k^v
  /* Run Gibbs sampling */
  for each edge <u, v> ∈ E with <u, v> in community k do
    Decrement count variables n_u, n_u^k, n_k, n_k^v by 1
    Sample community z_i = k using Equation 3.5
    Increment count variables n_u, n_u^k, n_k, n_k^v by 1
  end for

3.3.3 HCDF(algorithm1, LDA-G, PRIOR)

HCDF's coalescing strategy, PRIOR, is similar to SEED, but it does not discard algorithm1's hints (i.e., community information) after choosing an initial configuration. Instead, the multinomial distributions for each source-node and community are comprised of (1) the empirical prior imparted by algorithm1's results and (2) the learned distribution from LDA-G's inference. There is no speed penalty for retaining the empirical prior, because the Gibbs sampler requires only the n_u^k and n_k^v counts (which refer to the count of community k in source-node u and the count of target-node v in community k, respectively). Algorithm 3 presents HCDF(algorithm1, LDA-G, PRIOR).
Algorithm 3 HCDF(algorithm1, LDA-G, PRIOR)
Require: Graph G = (V, E)
  /* Run community discovery algorithm algorithm1 on G */
  hint = algorithm1(G)
Ensure: hint ≡ [s_u, s_u^k, s_k, s_k^v]
  /* Run LDA-G on graph G */
  /* Set the prior; initialize all count variables */
  n_u ← s_u; n_u^k ← s_u^k; n_k ← s_k; n_k^v ← s_k^v
  /* Select initial configuration */
  for each edge <u, v> ∈ E do
    Sample community z_i = k using Equation 3.5
    Increment count variables n_u, n_u^k, n_k, n_k^v by 1
  end for
  /* Run Gibbs sampling */
  for each edge <u, v> ∈ E with <u, v> in community k do
    Decrement count variables n_u, n_u^k, n_k, n_k^v by 1
    Sample community z_i = k using Equation 3.5
    Increment count variables n_u, n_u^k, n_k, n_k^v by 1
  end for
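The only difference between SEED (Algorithm 2) and PRIOR (Algorithm 3) is whether the hint counts are subtracted back out before Gibbs sampling. A minimal sketch under assumed count-array names (not the authors' code):

```python
import numpy as np

def init_counts(hint, keep_hint):
    """hint: dict of arrays {'n_u': s_u, 'n_uk': s_uk, 'n_k': s_k, 'n_kv': s_kv}
    derived from algorithm1's communities (hypothetical names).
    keep_hint=False gives SEED: the hint is forgotten after the initial configuration.
    keep_hint=True  gives PRIOR: the hint counts stay in place as an empirical prior."""
    counts = {name: np.array(s, dtype=float) for name, s in hint.items()}
    # ... sample an initial configuration for every edge with Equation 3.5,
    #     incrementing `counts` as in Algorithms 2 and 3 ...
    if not keep_hint:
        for name, s in hint.items():   # SEED: subtract the hint back out
            counts[name] -= s
    # ... run Gibbs sampling with `counts` ...
    return counts
```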
4 Experiments

This section presents an empirical comparison of link-prediction performance and community robustness to small perturbations in network structure. Link prediction demonstrates a community structure's ability to model the underlying connectivity of the graph. Community robustness to small perturbations in the network structure measures the "significance / believability" of the discovered community structure. Specifically, if a community structure is significant, then it should be able to withstand small perturbations to network structure [13].

4.1 Experimental Setup
4.1.1 Data Sets and Algorithms

Tables 2 and 3 summarize the graphs used in our experiments.

Internet Topology Data: The Autonomous Systems (AS) graph² is an AS-level connectivity graph collected on May 26, 2001. It includes Oregon route-views, Looking Glass data, and Routing Registry data.

IP Traffic Data: IP1 through IP5 are composed of IP traffic³ collected at the perimeter of an enterprise network over five days in 2007.

PubMed Data: AxK and AxA are collections of data from the PubMed database.⁴ AxK is a bipartite Author × Knowledge graph, where author nodes connect to knowledge-theme⁵ nodes. A link exists from an author u to a knowledge-theme k for every article in which u is a coauthor and k is a theme appearing in the abstract. AxA is a coauthorship graph. It is composed of 4555 connected components; the largest connected component has 8763 authors (approximately 24% of the entire graph).

WWW Data: This graph was originally created to measure the diameter of the Web [1]. It was collected by a Web crawler that started from an nd.edu site.⁶

Table 4 outlines the different algorithms used in our experiments. Details of each method were provided in Sections 2 and 3.

4.1.2 Experimental Methodology for Link Prediction

Our experimental methodology for measuring link-prediction performance on a single graph is a three-step process. First, we randomly select a subset of links from a graph. When selecting links, we pick both present and absent links. A present link has a non-zero entry in the graph's adjacency matrix; an absent link has a zero entry. The selected links comprise the held-out test set. Second, each algorithm is given the remaining graph in which to discover communities.⁷ Third, each algorithm uses its discovered communities to estimate the probability that a link in the held-out test set was a present link. These estimates are then used to calculate the area under the ROC curve (AUC).

For each data set, we ran five trials. In each trial, the held-out test set contains 1000 randomly chosen links, with 500 present links and 500 absent links. Across the different methods (XA, LDA-G, HCD-X, etc.), the held-out test sets are the same for each graph. The reported AUC scores are averages of the AUC scores over the five trials.

Using Communities to Predict Links. To predict a link between source-node u and target-node v in LDA-G, HCD-X-SEED, and HCD-X-PRIOR, we use the following formula:
(4.9)  score_LDA-G(<u, v>) = Σ_{k∈K} p(ζ = k | Z, E)
Recall that K is the number of communities discovered, Z is the set of discovered communities, and E is the set of the graph's links. p(ζ = k | Z, E) is the same as LDA-G's update equation (see Equation 3.5).
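A minimal sketch of the held-out evaluation in Section 4.1.2, with a hypothetical score(u, v) function standing in for any of the scoring rules in Equations 4.9 through 4.12; this is an illustration, not the authors' evaluation harness.

```python
import random

def auc(scored):
    """AUC from (score, label) pairs: probability that a random positive
    outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = ties = 0
    for s in pos:
        for t in neg:
            if s > t:
                wins += 1
            elif s == t:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def evaluate(present_links, absent_links, score, n=500, seed=0):
    """Hold out n present and n absent links, score each pair, and return the AUC."""
    rng = random.Random(seed)
    test = [(l, 1) for l in rng.sample(present_links, n)] + \
           [(l, 0) for l in rng.sample(absent_links, n)]
    return auc([(score(u, v), y) for (u, v), y in test])
```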
² Available at http://topology.eecs.umich.edu/data.html.
³ This data is proprietary.
⁴ PubMed is a repository containing millions of citations from biomedical articles (see http://www.pubmedcentral.nih.gov/).
⁵ Knowledge themes were extracted based on term frequency in PubMed abstracts.
⁶ Available at http://www.nd.edu/~networks/resources.htm.
⁷ LDA-G handles the held-out links as "unknown" observations. However, XA and FM treat them as absent links (with zero entries in the adjacency matrix).
Real-World Graph | Acronym | |V| | |E| | # Components | % of V in LCC
Autonomous Systems | AS | 11,461 | 32,730 | 1 | 1
Day 1: IP × IP | IP1 | 34,449 | 303,175 | 4 | 99.98%
Day 2: IP × IP | IP2 | 33,732 | 320,754 | 8 | 99.96%
Day 3: IP × IP | IP3 | 34,661 | 428,596 | 2 | 99.99%
Day 4: IP × IP | IP4 | 34,730 | 425,368 | 2 | 99.99%
Day 5: IP × IP | IP5 | 33,981 | 112,271 | 13 | 99.92%
PubMed Author × Knowledge | AxK | 37,346 (A) & 117 (K) | 119,443 | 1 | 1
PubMed Coauthorship | AxA | 37,225 | 143,364 | 4,556 | 23.54%
WWW Graph | WWW | 325,729 | 1,497,135 | 1 | 1

Table 2: Summary of real-world graphs used in experiments. LCC refers to the Largest Connected Component.

Data Graph | Average Degree | Clustering Coefficient | Average Path | Diameter | # of Articulation Points | % of V that are Articulation Points
AS | 2.86 | 0.258 | 3.67 | 11 | 828 | 7.2%
IP1 | 8.80 | 0.198 | 3.23 | 7 | 1,258 | 3.7%
IP2 | 9.51 | 0.18 | 3.22 | 8 | 1,208 | 3.6%
IP3 | 12.37 | 0.198 | 3.04 | 6 | 920 | 2.7%
IP4 | 12.25 | 0.216 | 3.07 | 7 | 841 | 2.4%
IP5 | 3.30 | 0.058 | 3.54 | 7 | 1,524 | 4.5%
AxK | 3.19 | 0 | 1.00 | 1 | 54 | 0.1%
AxA | 3.85 | 0.49 | 8.85 | 23 | 1,467 | 3.9%
WWW | 4.60 | 0.28 | 11.38 | 58 | 21,780 | 6.7%

Table 3: Some characteristics of real-world graphs used in experiments. An articulation point is a node such that its removal increases the number of connected components. (The values for average path and diameter are approximate. For each data graph, we computed these by selecting 500 nodes uniformly at random and doing single-source-shortest-paths from each of the selected nodes.)

Algorithm Tested | Acronym
Cross-Association | XA
Fast Modularity | FM
Latent Dirichlet Allocation on Graphs | LDA-G
HCDF(XA, LDA-G, ATTRIBUTE) | HCD-X
HCDF(FM, LDA-G, ATTRIBUTE) | HCD-M
HCDF(XA, LDA-G, SEED) | HCD-X-SEED
HCDF(XA, LDA-G, PRIOR) | HCD-X-PRIOR

Table 4: List of methods used in experiments.
To predict a link between source-node u and target-node v in HCD-X and HCD-M, we use the following formula:

(4.10)  score_HCD(<u, v>) = Σ_{k∈K} p(ζ = k | Z, E, R)

Recall that R is the attribute-set. p(ζ = k | Z, E, R) is the same as the update equation for LDA-G on graphs with attributes (see Equation 3.8).

For XA and FM, we use density to predict links. Specifically, we use the following equations to predict a link between source-node u and target-node v:

(4.11)  score_FM(u, v) = |{<i, j> ∈ E : (ζ_i ≡ ζ_u) & (ζ_j ≡ ζ_v)}| / (|ζ_u| · |ζ_v|)

where ζ_i returns the FM-assigned community for node i, and

(4.12)  score_XA(u, v) = |{<i, j> ∈ E : (ζ_i^r ≡ ζ_u^r) & (ζ_j^c ≡ ζ_v^c)}| / (|ζ_u^r| · |ζ_v^c|)

where ζ_i^r is the row-group assigned to node i and ζ_i^c is the column-group assigned to node i.
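The density scores of Equations 4.11 and 4.12 can be computed directly from hard group assignments; a minimal sketch with hypothetical names (not the authors' code):

```python
from collections import Counter

def density_score(edges, group_of_src, group_of_tgt):
    """score(u, v) = |links between group(u) and group(v)| / (|group(u)| · |group(v)|).
    For FM (Equation 4.11) both maps hold the FM communities; for XA (Equation 4.12)
    group_of_src holds row-groups and group_of_tgt holds column-groups."""
    pair_links = Counter((group_of_src[i], group_of_tgt[j]) for i, j in edges)
    src_size = Counter(group_of_src.values())
    tgt_size = Counter(group_of_tgt.values())

    def score(u, v):
        gu, gv = group_of_src[u], group_of_tgt[v]
        return pair_links[(gu, gv)] / (src_size[gu] * tgt_size[gv])

    return score
```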
4.1.3 Experimental Methodology for Community Robustness

We adapt the experimental methodology described in [13] to measure community robustness. Specifically, we perturb a data graph by randomly reassigning a number of its links. The rewiring parameter c ∈ [0, 1] determines the fraction of links rewired. Links are reassigned in a way that preserves the expected degree of each node in the graph. Then, a community detection algorithm is applied to the original and perturbed graphs, and the Variation of Information [13], ∆, is calculated between the two community assignments C (with c = 0) and C′ (with c ∈ [0, 1]):

(4.13)  ∆(C, C′) = H(C | C′) + H(C′ | C)

H is the entropy function; thus, H(C′ | C) measures the information needed to describe C′ given C. ∆ treats each assignment as a message; it is a symmetric entropy-based measure of the distance between these messages. ∆ is bounded between 0 and log(N) for hard clustering assignments, but in the case of mixed-membership models the measure can grow without bound. For consistency across data sets, we generate hard clusters for all algorithms and normalize by reporting ∆/log(N). For LDA-G, HCD-X, and HCD-M, we assign each node to the group that contains a plurality of its outgoing links in the Maximum Likelihood (ML) configuration. Ties are broken by selecting the largest group (i.e., the group that appears most often in the ML configuration). For XA, we only consider the row assignments and discard the column assignments. Since FM only operates on the largest connected component (LCC) of a graph, we extract the LCC of the original graph before perturbing the network for FM experiments. Thus, only the LCC graph, G′, is perturbed, and ∆ values are calculated over it. For HCD-M, we remove G′ from the original graph and replace it with the perturbed versions of G′. Except for the Coauthorship graph, the graphs used in our experiments have LCCs that contain over 99.99% of the nodes in the graph; the Coauthorship graph's LCC covers only 24% of the nodes. We apply this method to each data set and each community discovery algorithm, considering various values of the rewiring probability c.

4.2 Results on Link Prediction

Figure 2 shows the performance of the methods listed in Table 4 on the IP×IP graphs. HCD-X (black line) and HCD-M (red line) are the only two methods that consistently produce high AUC results (> 0.93) on link prediction. HCD-X performs better than HCD-M on these data sets since XA produces better hints than FM here. We have included error bars on the LDA-G link-prediction curve, which show that we would not reach HCD-X's high results even if we ran LDA-G multiple times.

[Figure 2: Link prediction results on various IP×IP graphs.]

Figure 3 presents the performance of the methods listed in Table 4 on the PubMed graphs, the Web graph, and the Internet AS graph. Again, HCD-X and HCD-M are the only two robust methods across these diverse data sets. They produce community structures that consistently predict links with high AUC results (> 0.91). The same cannot be said for the other methods.

[Figure 3: Link prediction results on various graphs from PubMed, WWW, and Internet.]
Figure 4 depicts the effectiveness of different coalescing strategies in HCD-X. As expected, direct incorporation of XA-provided hints as link-attributes provides the best link-prediction performance. The next best coalescing strategy is PRIOR, since it propagates the hints from one configuration to the next. SEED is the simplest and weakest coalescing strategy, since it forgets the hints after setting the initial configuration.

[Figure 4: Effectiveness of different coalescing strategies. Results show average AUC on link prediction with HCDF(XA, LDA-G, strategy), where strategy is one of SEED, PRIOR, and ATTRIBUTE.]

Figure 5 shows improvements in the average AUC scores (w.r.t. link prediction) when comparing HCD-X to LDA-G, HCD-X to XA, HCD-M to LDA-G, and HCD-M to FM across our diverse collection of real graphs. On average, the hybrid methods offer significant improvements, an increase of up to 0.22 in average AUC. There are rare cases in which HCD-X and HCD-M do not perform as well as their constituents. These cases arise when the constituent providing the hints generates communities that predict link structure poorly (e.g., FM's average AUC on PubMed AxA is 0.63).

[Figure 5: Typical link prediction performance improvement for HCD-X and HCD-M over their constituents, where ∆(M1, M2) = AverageAUC(M1) − AverageAUC(M2).]

Figure 6 depicts the worst-case average AUC across the methods listed in Table 4. The hybrid methods (HCD-X and HCD-M) have worst-case average AUC scores above 0.9. The non-hybrid methods (FM, LDA-G, and XA) do not show such consistent effectiveness.

[Figure 6: Worst-case performance across all graphs.]

4.3 Results on Community Robustness

Figure 7 depicts the Variation of Information, ∆, for the PubMed Author×Knowledge graph as we vary the rewiring probability. A lower value of ∆ indicates a more robust community structure. HCD-X has a lower ∆ than its constituent parts, LDA-G and XA. HCD-M has a lower ∆ than LDA-G and a comparable one to FM. A community discovery algorithm can artificially produce a low ∆ by finding a trivial community structure (e.g., by placing all nodes in one community). To guard against this, we also need to look at other properties of the community structure, such as the number of communities and the largest community fraction. In the Author×Knowledge graph, all five methods find on average tens of communities and the average largest community fraction is less than 31%. So, the lower ∆ results here are not artificial.
[Figure 7: Robustness across various algorithms and rewiring probabilities for the PubMed Author×Knowledge graph.]

Figure 8 shows ∆ at rewiring probability c = 0.2 for the IP graphs and for the communities discovered on them by LDA-G, FM, and HCD-M. Points to note here are: (1) HCD-M has much lower ∆ values than LDA-G; and (2) HCD-M has comparable ∆ values to FM, but it has a higher AUC (on average +0.15) and more consistent link prediction performance than FM (as previously shown in Figure 2). As we describe in the next section, XA produces artificially low ∆ values; therefore, we exclude XA and HCD-X from Figure 8.
[Figure 8: Variation of Information for the IP×IP graphs at rewiring probability c = 0.2.]
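For reference, the normalized Variation of Information plotted in Figures 7 and 8 (Equation 4.13, divided by log N) can be computed from two hard clusterings as in the following sketch; the dict-of-labels inputs are an assumption of this illustration, not the authors' code.

```python
import math
from collections import Counter

def variation_of_information(C1, C2):
    """∆(C, C') = H(C|C') + H(C'|C) for two hard clusterings given as
    node -> label dicts over the same node set; normalized by log N."""
    nodes = list(C1)
    N = len(nodes)
    joint = Counter((C1[n], C2[n]) for n in nodes)
    size1 = Counter(C1[n] for n in nodes)
    size2 = Counter(C2[n] for n in nodes)
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / N
        # H(C|C') + H(C'|C) = sum_{a,b} p(a,b) [log p(b)/p(a,b) + log p(a)/p(a,b)]
        vi += p_ab * (math.log(size2[b] / n_ab) + math.log(size1[a] / n_ab))
    return vi / math.log(N)
```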
For brevity, we omit showing robustness results for the other data graphs in our experiments; their results are generally similar to Figures 7 and 8. The next section presents a summary discussion of our experiments on other graphs.

4.4 Discussion

We first present some general observations from the link prediction results. (1) HCD-X and HCD-M are consistent w.r.t. the link prediction performance of their discovered community structures on a diverse set of real-world graph data. (2) HCD-X and HCD-M always improve on one of their constituents and usually improve on both. (3) Restarting LDA-G multiple times (with different random seeds to discover a good configuration) does not improve link-prediction performance sufficiently (as shown in Figure 2). This suggests that available computation time is better spent by combining methods than by applying a single method repeatedly. (4) On our sparsest IP graph, IP5, with roughly 34K nodes and 112K links, all non-hybrid methods perform only fairly (∼0.8 average AUC), whereas HCD-X and HCD-M perform extremely well (∼0.96 average AUC).

There are at least two explanations for the superior performance of hybrid methods over non-hybrid methods. First, consider the improvement over LDA-G. XA and FM optimize different functions than LDA-G does; thus, we expect them to compensate for each other's shortcomings when applied together, as in HCD-X. The hints offered by XA, for example, can be used by HCD-X to make decisions about community structures that LDA-G is ambiguous about. Second, consider the improved performance of the hybrid methods over the non-Bayesian methods. Neither XA nor FM can generate mixed-membership models for communities (i.e., soft clustering). In such hard-clustering models, each node belongs to a single community and this community must explain all of the node's links. Because LDA-G is a mixed-membership model, it can loosen this restriction and learn models that explain all of a node's links, rather than trying to explain most of them.

Of the non-Bayesian methods studied (XA and FM), XA typically produces more predictive community structures. This is not unexpected. As shown by Stone [19], Akaike's criterion (MDL) and cross-validation (MLE, short for Maximum Likelihood Estimate) essentially pick the same model. Also note that an MDL-based approach can be represented in a Bayesian framework with a Universal (Solomonoff) prior probability function and a uniform likelihood function [21]. We empirically observed the relationship between compression and good prediction in our experiments, where methods with higher average AUC produced community-sorted adjacency matrices that have more whitespace and less "snow." Figure 9 shows this phenomenon for the IP5 graph, where the original IP5 adjacency matrix and its corresponding community-sorted matrices produced by LDA-G, FM, XA, HCD-M, and HCD-X are depicted. The matrices with more whitespace (i.e., less "snow") have higher average AUC on link prediction.

The community-sorted matrices were produced as follows. For FM, the rows and columns of the adjacency matrix are sorted by their FM-assigned communities, so that rows/columns that belong to the same community are next to each other. For XA, rows are sorted by their XA-assigned row-groups; similarly, columns are sorted by their XA-assigned column-groups.
[Figure 9: Plots showing the adjacency matrix for the IP5 graph and its corresponding community-sorted matrices produced by existing approaches (FM, XA, and LDA-G) and proposed methods (HCD-X and HCD-M) on that graph. Panel labels: Adjacency Matrix; LDA-G (Average AUC: 0.795); FM (Average AUC: 0.821); XA (Average AUC: 0.821); HCD-M (Average AUC: 0.955); HCD-X (Average AUC: 0.968). The matrices associated with the proposed methods have more whitespace (i.e., less "snow"), achieve better clustering, and have higher average AUC on link prediction.]
LDA-G has high ∆ values even at c = 0.2, but the clusters do not deteriorate significantly beyond that point. This suggests that it suffers significantly from small changes in network topology. The hybrid methods have much less variation than LDA-G, are comparable to FM, and have slightly higher variation than XA. However, as discussed above in most cases XA’s ∆ is artificially low with one community containing a substantial fraction (> 67%) of the nodes. Lastly, we observed that algorithms with higher AUC on link prediction tend to have lower ∆ values (assuming the ∆ values are not artificially low). Lastly, HCD-X can also be applied to time-evolving graphs by using the appropriate constituents. For example, LDA-G can be applied to time-evolving graph by simply using the ML configuration from time t − 1 as the prior for time t. GraphScope [20] details extensions of XA to time-evolving graphs. through the use of
appropriate constituents; for instance, Newman et al. [17] describe distributed inference for LDA.

5 Conclusions

We present a novel Bayesian framework, HCDF, for community discovery in large graphs. HCDF produces algorithms that are (1) effective, in terms of link prediction performance and community structure robustness; (2) consistent, in terms of effectiveness across various application domains; (3) scalable to very large graphs; and (4) nonparametric. A key aspect of HCDF is its effectiveness in incorporating hints from a number of other community detection algorithms. HCDF produces results that outperform each of the constituents. On link prediction across a collection of diverse real-world graphs, HCDF methods achieve (1) improvements of up to 0.22 in average AUC and (2) an average AUC of 0.96. HCDF methods also find communities that are robust to small perturbations of the network structure as defined by Variation of Information.

6 Acknowledgements

We thank Branden Fitelson for introducing us to Stone's article [19] and for subsequent discussions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF422685).

References
[1] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the World-Wide Web. Nature, 401, 1999.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[3] D. Chakrabarti, S. Papadimitriou, D. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, 2004.
[4] F. R. K. Chung. Spectral Graph Theory. AMS Bookstore, 1997.
[5] A. Clauset, M. Newman, and C. Moore. Finding community structure in very large networks. Phys. Rev. E, 70:066111, 2004.
[6] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1944–1957, November 2007.
[7] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD, 2003.
[8] J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, and M. Grobelnik. Monitoring network evolution using MDL. In ICDE, 2008.
[9] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, 2000.
[10] R. Ge, M. Ester, B. J. Gao, Z. Hu, B. Bhattacharya, and B. Ben-Moshe. Joint cluster analysis of attribute data and relationship data: The connected k-center problem, algorithms and applications. ACM TKDD, 2(2):1–35, 2008.
[11] T. Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical report, Stanford University, 2002. Available at http://citeseer.ist.psu.edu/613963.html.
[12] K. Henderson and T. Eliassi-Rad. Applying latent Dirichlet allocation to group discovery in large graphs. In ACM SAC, 2009.
[13] B. Karrer, E. Levina, and M. E. J. Newman. Robustness of community structure in networks. Phys. Rev. E, 77:046119, 2008.
[14] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning for irregular graphs. SIAM Review, 41(2):278–300, 1999.
[15] H. Li, Z. Nie, W.-C. Lee, C. L. Giles, and J.-R. Wen. Scalable community discovery on textual data with relations. In CIKM, 2008.
[16] F. Moser, R. Ge, and M. Ester. Joint cluster analysis of attribute and relationship data without a-priori specification of the number of clusters. In KDD, 2007.
[17] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. NIPS, 20:1081–1088, 2007.
[18] M. E. J. Newman. Modularity and community structure in networks. PNAS USA, 103:8577, 2006.
[19] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B, 39:44–47, 1977.
[20] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. GraphScope: Parameter-free mining of large time-evolving graphs. In KDD, 2007.
[21] P. M. B. Vitányi and M. Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2):446–464, 2000.
[22] H. Zhang, C. L. Giles, H. C. Foley, and J. Yen. Probabilistic community discovery using hierarchical latent Gaussian mixture model. In AAAI, 2007.
[23] H. Zhang, B. Qiu, C. L. Giles, H. C. Foley, and J. Yen. An LDA-based community structure discovery approach for large-scale social networks. In IEEE ISI, 2007.
[24] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW, 2006.
[25] H. Zhou and D. Woodruff. Clustering via matrix powering. In PODS, 2004.
[26] L. Zhou, Y. Liu, J. Wang, and Y. Shi. HSN-PAM: Finding hierarchical probabilistic groups from large-scale networks. In ICDM Workshops, 2007.