Probabilistic Model for Discovering Topic based Communities in Social Networks

Mrinmaya Sachan, Danish Contractor, Tanveer A. Faruquie, L. V. Subramaniam

IBM Research India
New Delhi, India

{mrsachan,dcontrac,ftanveer,lvsubram}@in.ibm.com

ABSTRACT

Social graphs have received renewed interest as a research topic with the advent of social networking websites. These online networks provide a rich source of data to study user relationships and interaction patterns on a large scale. In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. We assume that community memberships depend on the topics of interest shared between users and on the link relationships between them in the social graph topology. In addition, we make use of the nature of interaction to gauge user interests. Our model allows communities to be related to multiple topics, and each user in the graph can be a member of multiple communities. This gives insight into user interests and the topical distribution in communities. We show the effectiveness of our model using a real-world data set and compare it with existing community discovery methods.

Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Applications - Data mining; G.3 [Probability and Statistics]: Probabilistic algorithms

General Terms
Algorithms

Keywords
Community Detection, Social Networks, Probabilistic methods

1. INTRODUCTION

With the rise of online social networking websites, Social Network Analysis has gained renewed interest. A Social Network is a graph that develops with interaction between users who may be friends, acquaintances, colleagues, etc. in the real world and/or those who share similar interests, likes or dislikes.

These online social graphs provide a rich source of data for studying community and user relationships. A network feature that has been emphasized in recent work is community structure, i.e. the gathering of vertices into groups such that there is higher relatedness among vertices within a group. The discovery of communities of “similar” users in a social graph is an important problem and finds applications in areas as diverse as sociology, biology and computer science, where systems are often represented as graphs.

In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. It assumes that community memberships depend on the topics of interest among users and on their link relationships in the social graph. Users can belong to multiple communities and a community can be related to multiple topics. Further, a user may be interested in multiple topics. This is useful in modelling user interests or the roles users play in the network. Finally, we observe that the “type” of interaction between users can be used to emphasize their interest in topics, and thus community membership. For example, two users engaging in conversations related to politics (posting “replies” to each other) would be more likely to be members of a community on politics than someone who has occasionally “broadcast” posts on politics (for example, during the elections in a country). Hence, we also model communication types. To the best of our knowledge, we are the first to use all three of topics, social graph topology and the nature of user interactions in the discovery of latent communities from a social graph.

The paper is structured as follows. In Section 2, we review some prior work on latent community discovery. In Section 3, we describe the TURCM model and present Gibbs sampling equations to infer its parameters. In Section 4, we give experiments to validate the model. Section 5 concludes.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’11, October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00.

2. PRIOR WORK

Traditional community detection methods perform a hard partitioning of the nodes in the graph and do not allow nodes (users) to have membership in multiple communities. They also do not account for inter-user interactions outside the graph link. Recently, with the popularity of LDA, there has been significant interest in Bayesian probabilistic models for community discovery such as Simple Social Network LDA (SSN-LDA) [8], Generic Weighted Network LDA (GWN-LDA) [7], the Hybrid Community Discovery Framework (HCDF) [2] and the Hierarchical Social Network-Pachinko Allocation Model (HSN-PAM) [6]. Although they allow for mixed community memberships, they too rely primarily on the link structure in a social graph to learn communities.

Some community discovery methods have tried to utilize the semantic content of social graphs. In [9], the authors propose the CUT (Community-User-Topic) model, which uses the semantic content to discover communities in an email social graph. Communities are modelled as random mixtures over users, who are in turn modelled by their interests (represented by the topical distribution concerning them). The model, however, does not utilize the link information in a graph while discovering communities.

While models like CUT are suitable when community members communicate actively with each other, models like SSN-LDA assume that users who are inter-connected share similar interests. However, this is not always true in real-world scenarios. For example, in social networks there are many users who are members of communities but do not actively contribute to them. Thus, both the graph structure and the interactions between users should be used to model the formation of communities.

Recently, researchers have started looking for one integrated probabilistic model that combines both the content and the link information available in social networks. The Community-Author-Recipient-Topic (CART) model [5] was one of the first attempts to combine previous link-based community discovery methods with content-based community discovery. Although both the CART model and our models use topics and social graph topology to model communities, their generative processes differ significantly. While our models assume that communities are generated based on users, recipients, topics and the links connecting them, the CART model generates the authors (users) and recipients of a post from latent communities. Since CART, there have been a few more attempts to combine content and links to obtain community structures more effectively [1, 4]. These models treat links as binary, whereas our model utilizes the nature of interactions between users while modelling communities. This is analogous to using weights in a graph for community discovery.

3. COMMUNITY DISCOVERY MODELS

Notation: Let $\mathbf{U}$ be the set of users in the social network under consideration. Let $\mathbf{R}_i$ be the set of neighbors (recipients) of user (sender) $u_i \in \mathbf{U}$. For a given user (sender) $u_i \in \mathbf{U}$ and its neighbor (recipient) $u_j \in \mathbf{R}_i$, let $\mathbf{P}_{ij}$ be the set of posts (messages) sent by $u_i$ to $u_j$. Overall, let $\mathbf{P}_i = \bigcup_{u_j \in \mathbf{R}_i} \mathbf{P}_{ij}$ be the set of posts (messages) sent by user (sender) $u_i$. Let $N_p$ be the number of words in a given post $p$, $p \in \bigcup_{u_i \in \mathbf{U}} \mathbf{P}_i$. Also let the cardinality of all these sets (in boldface) be represented by their corresponding capitalized symbols. The sender $u_i$, the recipient $u_j$, the set of words $W_p$ for each post $p$ and the type of posts $X_p$ are observable variables, while the communities, $c$, and topics, $z$, are considered as latent variables.

3.1 Topic User Recipient Community Model

In this section we describe our generative model for latent community discovery in such networks, the Topic User Recipient Community Model (TURCM). TURCM tries to discover communities by integrating the content being discussed by users, in the form of latent topics, with the type of posts generated by them. User interests are modelled via topics of mutual interest to both the sender and the recipient. We use a mixture of unigrams approach in our work, essentially assuming that the entire post is on one topic. As we shall see later, this is a reasonable assumption and allows our model to scale much better. Consequently, we experienced a significant decrease in training time over CUT and CART. Scalability is a major issue in the application of such probabilistic approaches in Social Network Analysis.

We also use the “type” of communication to improve community discovery. The idea is that two users who share a series of posts (messages) with each other are likely to be communicating on certain common topics, indicating similar interests, and can therefore be members of the same community. The type of communication indicates the strength of association between two users and their interest in a topic. The type of interaction varies from one social graph to another. User uploads/downloads, comments, wall posts, shared photographs, tags, etc. can be considered as interactions in social networks. For example, in the traditional model of email, an initiated email, a reply, or a forwarded email could be types. We can also include email list subscriptions, bulk emails, etc. as types if we have the required information.

Motivated by SSN-LDA [8], we represent every user as a combination of his interaction space. Each user ($u_i$) is characterized by a social interaction profile representing all its $P_i$ interactions (post types): $SIP(u_i) = \{X_1, \ldots, X_{P_i}\}$. The social interaction profile of users is represented as random mixtures over latent community variables. Each community is in turn defined as a distribution over the interaction space. This idea is analogous to LDA, where the social interaction profile is a document, interactions are words in that document and communities are latent topics.

The number of topics $Z$ and the number of communities $C$ to be discovered are specified a priori as model parameters. Let the size of the vocabulary from which the communications between users are composed be $V$. The number of different types of communication is $K$. Let $\mathrm{Dir}_X(\alpha)$ denote an $X$-dimensional symmetric Dirichlet with scalar parameter $\alpha$ and $\mathrm{Mult}(\cdot)$ denote the discrete multinomial distribution.

Table 1: Topics extracted from Enron

California Power    Gas Transportation    Trading    Deals
power               gas                   price      meeting
transmission        energy                market     contract
energy              enron                 dollar     report
calpx               transco               nymex      enron
california          chris                 trade      deal

Let $\mathbf{W}$ be the set of observable words in the corpus and $\mathbf{X}$ be the set of interaction types observed on the social graph among the set $\mathbf{U}$ of users (senders) and the recipients $\mathbf{R}$. Let $\mathbf{Z}$ and $\mathbf{C}$ be the latent topic and community assignments for every post. The generative process for the model is given below, followed by the joint likelihood $L$ of users, recipients, posts, interaction types, and topic and community assignments given by the TURCM model.

1. For each of the topics, $1 \le z \le Z$, sample a $V$-dimensional multinomial, $\vec{\lambda}_z \sim \mathrm{Dir}_V(\delta)$.

2. For each of the communities, $1 \le c \le C$, sample a $K$-dimensional social interaction mixture $\vec{\phi}_c \sim \mathrm{Dir}_K(\beta)$.

3. For the $i$th user $u_i$, $1 \le u_i \le U$:

   (a) For each neighbor/recipient $1 \le u_j \le N_i$, where $N_i$ denotes the number of neighbors of $u_i$:

       i. Sample a $Z$-dimensional vector of topic proportions, $\vec{\eta}_{u_i,u_j} \sim \mathrm{Dir}_Z(\nu)$. This represents that the post is of mutual interest to the sender and the recipient.

       ii. For each topic $z \in [1 : Z]$, sample a $C$-dimensional multinomial, $\vec{\theta}_{u_i,u_j,z} \sim \mathrm{Dir}_C(\alpha)$, representing the community proportions for that topic and the sender-recipient pair.

       iii. For each post $p$ ($1 \le p \le P_{ij}$) sent by $u_i$ to $u_j$, having $N_p$ words:

            A. Choose a topic assignment $z_p \sim \mathrm{Mult}(\vec{\eta}_{u_i,u_j})$, $z_p \in [1 : Z]$, for the post.

            B. Choose a community assignment $c_p \sim \mathrm{Mult}(\vec{\theta}_{u_i,u_j,z_p})$, $c_p \in [1 : C]$.

            C. Choose a social interaction type $X_p \sim \mathrm{Mult}(\vec{\phi}_{c_p})$, $X_p \in [1 : K]$, for the post.

            D. For every slot $j$, $1 \le j \le N_p$, in $p$, choose a word $w_j \sim \mathrm{Mult}(\vec{\lambda}_{z_p})$.
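The generative steps above can be read as a sampling procedure. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The names `generate_posts`, `posts_per_pair` and `words_per_post` are hypothetical. Every post is given the same length for brevity, whereas the model treats $N_p$ as observed per post. Symmetric Dirichlet draws are obtained by normalizing Gamma variates from the standard library.

```python
import random


def sample_dirichlet(rng, dim, concentration):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probs):
    """Draw one index from a discrete distribution given as a list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_posts(posts_per_pair, words_per_post, Z, C, K, V,
                   alpha=0.5, beta=0.5, nu=0.5, delta=0.5, seed=0):
    """Sample synthetic posts following generative steps 1-3.

    posts_per_pair maps (sender, recipient) to a number of posts.
    All argument names and default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Step 1: a word distribution lambda_z for every topic.
    lam = [sample_dirichlet(rng, V, delta) for _ in range(Z)]
    # Step 2: an interaction-type distribution phi_c for every community.
    phi = [sample_dirichlet(rng, K, beta) for _ in range(C)]
    posts = []
    # Step 3: loop over sender-recipient pairs.
    for (sender, recipient), n_posts in posts_per_pair.items():
        # 3(a)i: topic proportions for this pair.
        eta = sample_dirichlet(rng, Z, nu)
        # 3(a)ii: community proportions per topic for this pair.
        theta = [sample_dirichlet(rng, C, alpha) for _ in range(Z)]
        # 3(a)iii: generate each post.
        for _ in range(n_posts):
            z = sample_index(rng, eta)        # A: one topic per post
            c = sample_index(rng, theta[z])   # B: community given topic
            x = sample_index(rng, phi[c])     # C: interaction type
            words = [sample_index(rng, lam[z])
                     for _ in range(words_per_post)]  # D: words from topic
            posts.append({"sender": sender, "recipient": recipient,
                          "topic": z, "community": c,
                          "type": x, "words": words})
    return posts
```

For example, `generate_posts({("a", "b"): 3, ("b", "a"): 2}, words_per_post=5, Z=4, C=3, K=2, V=10)` returns five post records, each carrying one topic, one community and one interaction type, which mirrors the mixture-of-unigrams assumption that a whole post is on a single topic.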

$$L = P(\mathbf{W}, \mathbf{X}, \mathbf{U}, \mathbf{R}, \mathbf{Z}, \mathbf{C}, \theta, \phi, \eta, \lambda \mid \alpha, \beta, \nu, \delta)$$

This can be factorized into:

$$L = P(\mathbf{W} \mid \mathbf{Z}; \lambda)\, P(\mathbf{X} \mid \mathbf{C}; \phi)\, P(\mathbf{Z} \mid \mathbf{U}, \mathbf{R}; \eta)\, P(\mathbf{C} \mid \mathbf{Z}, \mathbf{U}, \mathbf{R}; \theta)\, P(\theta \mid \alpha)\, P(\phi \mid \beta)\, P(\eta \mid \nu)\, P(\lambda \mid \delta)$$

Next, we give a Gibbs sampling based approach to infer TURCM's parameters. We represent $\bar{W}_p$ as the set of words in a given post $p$, $N_p$ as the number of words in the post and $N_{wp}$ as the number of times a given word $w$ occurs in $p$. Let $\mathbf{C}_{-p}$, $\mathbf{X}_{-p}$, $\mathbf{Z}_{-p}$ and $\bar{W}_{-p}$ represent the community, post type and topic assignments and the set of words, excluding post $p$. The Gibbs update equations are:

$$P(c_p = c \mid \mathbf{C}_{-p}, \mathbf{Z}, \mathbf{U}, \mathbf{R}) = \frac{n^{-p}_{c(urz)} + \alpha}{\sum_{c'} n^{-p}_{c'(urz)} + C\alpha}$$

$$P(x_p = x \mid \mathbf{X}_{-p}, \mathbf{C}) = \frac{n^{-p}_{xc} + \beta}{\sum_{x'} n^{-p}_{x'c} + X\beta}$$

$$P(z_p = z \mid \mathbf{Z}_{-p}, \mathbf{U}, \mathbf{R}) = \frac{n^{-p}_{z(ur)} + \nu}{\sum_{z'} n^{-p}_{z'(ur)} + Z\nu}$$

$$P(\bar{W}_p = \bar{w} \mid \bar{W}_{-p}, \mathbf{Z}) = \frac{\prod_{w \in \bar{W}_p} \prod_{i=0}^{N_{wp}-1} \left(n^{-p}_{wz} + i + \delta\right)}{\prod_{i=0}^{N_p-1} \left(\sum_{w'} n^{-p}_{w'z} + i + V\delta\right)}$$

where $n^{-p}_{a(b_1 \ldots b_v)}$ represents the number of times $a$ ($1 \le a \le A$) is generated from the combination of variables $b_1 \ldots b_v$ in the model ($1 \le b_i \le B_i$, $1 \le i \le v$), excluding post $p$. In the second equation, $X$ in the denominator counts the distinct interaction types, that is, $X = K$. The topics $P(w|z)$, community memberships $P(u|c)$ and topic proportions for a community $P(z|c)$ are estimated using a Maximum Likelihood estimate. The $P(u|c)$ computation gives us the community as a distribution over users, while $P(z|c)$ gives us the topical interest in a community.

4. EXPERIMENTS

In this section, we evaluate the TURCM model on the ENRON email corpus and compare it with the CUT and CART models. For all our simulations, we set the number of communities $C$ at 10 and the number of topics $Z$ at 20. These choices are later shown to be close to optimal in Sections 4.1 and 4.2. We ran 1000 iterations in the burn-in period and took 250 samples (every fourth sample) in the next 1000 iterations.

In Table 1, we list a few topics ($\vec{\lambda}_z$) discovered by TURCM. We give the top 5 words to visualize each topic. Here “calpx” is the California Power Corporation, “transco” is a gas transportation company and “nymex” is the New York Mercantile Exchange. We see that TURCM is able to discover meaningful topics, thus validating the assumption that each post is often associated with a single topic.

4.1 Community Analysis

Next, we evaluate the quality of the communities discovered by the TUCM and TURCM models against the communities discovered by the CUT and CART models. Newman proposed modularity, a measure of goodness for community structure. Unlike most other works, it assumes that a good division of the network is not merely one in which the number of edges running between groups is small. Rather, it is one in which the number of edges between groups is smaller than expected. Since the output of all these probabilistic models (TURCM, CART and CUT) is a fuzzy community structure, in which each node has a certain probability of belonging to a certain community, we use a fuzzy version of modularity proposed in [3]. Table 2 compares the fuzzy modularity of the TURCM model with that of its peers (CUT and CART).

Table 2: Fuzzy modularity on the Enron dataset

Number of Communities    6        8        10       12       14
TURCM                    0.198    0.271    0.339    0.331    0.283
CART                     0.152    0.249    0.302    0.294    0.255
CUT                      0.133    0.231    0.266    0.278    0.227

The high modularity values support our assumption that people who share common interests and are inter-connected with each other in the social graph often form communities. Modularity is important because one is often interested in tightly knit communities, where people know each other as well as share common interests, for reasons such as networking and task assignment. Methods that form communities purely on interest can end up with disparate people (who do not know each other and are disconnected in the graph) in one community. This is shown by the much weaker numbers for the CUT model. The number of topics was set to 20 for these experiments.

4.2 Perplexity Analysis

Next, we explore the perplexity of our model. We choose two previous probabilistic models for community discovery (CUT and CART) and compare our models with them in terms of perplexity. Figure 1 gives a perplexity comparison that shows a significant improvement by the TURCM model.
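The text does not spell out how perplexity is computed. As a hedged illustration, the sketch below uses the standard definition, the exponential of the negative average per-word log-likelihood, combined with the mixture-of-unigrams assumption of Section 3.1 (one topic per post). The function names, the representation of a post as a pair of word indices and topic proportions, and the requirement that all word probabilities are strictly positive are assumptions made for this example, not details taken from the paper.

```python
import math


def post_log_likelihood(words, topic_probs, word_probs):
    """Return log p(words) for one post under a mixture of unigrams.

    topic_probs[z] plays the role of the topic proportions for the
    sender-recipient pair, and word_probs[z][w] the role of P(w|z).
    Word probabilities are assumed to be strictly positive.
    """
    log_terms = []
    for z, p_z in enumerate(topic_probs):
        if p_z <= 0.0:
            continue  # a topic with zero weight contributes nothing
        log_terms.append(
            math.log(p_z) + sum(math.log(word_probs[z][w]) for w in words))
    # Log-sum-exp over topics for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def perplexity(posts, word_probs):
    """Standard perplexity: exp(-total log-likelihood / total word count).

    posts is a list of (words, topic_probs) pairs.
    """
    total_log_lik = sum(post_log_likelihood(words, topic_probs, word_probs)
                        for words, topic_probs in posts)
    total_words = sum(len(words) for words, _ in posts)
    return math.exp(-total_log_lik / total_words)
```

Lower values indicate that the model assigns higher probability to the held-out words. As a sanity check, a model whose topics are all uniform over a vocabulary of size $V$ has perplexity exactly $V$.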

Figure 1: Perplexity comparison on the ENRON dataset

Figure 2: Perplexity on the Enron dataset against the number of topics

Figure 3: Perplexity on the Enron dataset against the number of communities

TURCM's improvement in perplexity arises because the model generates topic-based communities and accounts for link types, which are important descriptors of the strength of the relationship between users. The results also conform to our intuition that the post topics should model the joint interest of the sender and the recipient. For these reasons, CUT does not perform well.

Finally, in order to comment on the model parameters (number of topics and communities), we analyze how model perplexities behave as the model parameters are changed. Figure 2 plots the perplexities against the number of topics. The number of communities was set to 10 for this experiment. It can be roughly concluded that the perplexities attain their minimum at around 20-25 topics. Figure 3 plots the perplexities against the number of communities. The number of topics was set to 10 for this experiment. Again, it can be roughly concluded that the perplexities attain their minimum at around 10 communities. Similar insights about the number of communities could be obtained from Table 2, where the fuzzy modularities optimize around 10 communities for both datasets. This analysis not only helps us conclude that our model outperforms the two baselines (CUT and CART) independent of the model parameters (number of topics and number of communities) but also gives an estimate of the optimal model parameter settings.

5. CONCLUSION

We posited that communities are formed by users who communicate on topics of mutual interest, are connected to each other in the social graph and share frequent communication of certain types. Then, we argued that interaction types are important indicators of the strength of association between users and proposed a probabilistic scheme that incorporates all three of these features to discover communities more effectively. Finally, we showed superior community discovery results over our closest peers.

6. REFERENCES

[1] J. Chang and D. Blei. Relational topic models for document networks. In AIStats, 2009.
[2] K. Henderson, T. Eliassi-Rad, S. Papadimitriou, and C. Faloutsos. HCDF: A hybrid community discovery framework. In SDM 10, 2010.
[3] J. Liu. Fuzzy modularity and fuzzy community structure in networks. The European Physical Journal B - Condensed Matter and Complex Systems, 77:547–557, 2010. 10.1140/epjb/e2010-00290-3.
[4] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: joint models of topic and author community. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 665–672, New York, NY, USA, 2009. ACM.
[5] N. Pathak, C. DeLong, A. Banerjee, and K. Erickson. Social topics models for community extraction. In Proceedings of the 2nd SNA-KDD Workshop, 2008.
[6] H. Zhang. HSN-PAM: Finding hierarchical probabilistic groups from large-scale networks, 2010.
[7] H. Zhang, C. L. Giles, H. C. Foley, and J. Yen. Probabilistic community discovery using hierarchical latent Gaussian mixture model. In Proceedings of the Conference on Artificial intelligence, 2007.
[8] H. Zhang, B. Qiu, C. L. Giles, H. C. Foley, and J. Yen. An LDA-based community structure discovery approach for large-scale social networks. In IEEE Conference on Intelligence and Security Informatics, pages 200–207, 2007.
[9] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In Proceedings of the International Conference on World Wide Web, 2006.
