2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Evolutionary Community Discovery from Dynamic Multi-Relational CQA Networks

Zhongfeng Zhang (1), Qiudan Li (1), Daniel Zeng (1, 2)
(1) Institute of Automation, Chinese Academy of Sciences; (2) Department of Management Information Systems, University of Arizona
[email protected], [email protected], [email protected]

Abstract

As a knowledge-sharing platform, Community Question Answering (CQA) services have attracted much attention from both academia and industry. This paper studies the problem of mining evolutionary community structures in CQA through the analysis of time-varying, multi-relational data among users and content. We propose a unified framework for this problem, which makes the following contributions: 1) We propose an AT-LDA model, which combines the author-topic model with topological structure analysis, to discover densely connected communities and the community topics in a unified process; 2) Our framework captures community structures and their evolution with temporal smoothing given by historic community structures. Empirical evaluation on a real-world dataset shows that interesting communities and their evolution patterns can be detected.

1. Introduction

Community Question Answering (CQA) has recently become a popular platform for users to seek and share information. Online users are increasingly seeking advice from CQA websites such as Yahoo! Answers and Baidu Zhidao. Yahoo! Answers alone has attracted more than 90 million unique users worldwide (see footnote 1). In the CQA scenario, an asker posts an information need as a question, which is then answered by other participants. A multi-relational network is formed during this interactive process, with the actors being users, questions and answers. We refer to a question together with its related answers as a question thread. We assume that a user is interested in the topic of a question thread if he or she spends time asking or answering in the thread. A community is a collection of users with similar interests. The topic of a community is thus defined as the topic that the majority of users in the community are concerned with. The combination of communities formed in the network is referred to as the community structure of the network. Since the interests of users may change over time, temporal information is also an important factor for communities. Consequently, extracting user communities in the multi-relational networks of CQA and tracking their evolution is a crucial aspect of CQA analysis. This analysis could help understand the structural properties of CQA networks and assist the task of finding authoritative users within or between communities, which in turn could help provide better services for CQA users, such as recommending relevant questions or managing the quality of content.

We represent the relationship (friendship) among users in a multi-relational network as the topological structure of the network. Given a multi-relational network and its associated topological structure, the task of community extraction is to extract communities in which users are densely connected and have coherent interests toward the community topic. The problem of Evolutionary Community Discovery is to extract communities at each timestep and to track the evolution of these communities.

Communities and their evolution have been studied on the topological structure of a network [4][6] by detecting densely connected user groups. Authorship has also been incorporated into topic models to infer semantically similar communities [5] by exploring the content generated by users. The latter models usually ignore the topological structure, while the former methods seldom take content information into consideration. Studies on homophily have shown that two friends are more likely to share common interests and vice versa [8]. The friendship connections are represented by the topological structure. Thus, the quality of community structures could be improved by integrating both the topological structure and user-generated content.

In this paper, a unified framework combining the author-topic model with topological structure analysis is proposed to mine evolutionary communities. We aim to answer two fundamental questions: How can the communities and the topics of communities be explored simultaneously from the CQA interactive network? How can the evolution of the communities be tracked over time? Our research contributes to the literature in the following aspects: 1) We propose an AT-LDA model, which combines the author-topic model with topological structure analysis for community detection in multi-relational networks of CQA, to discover densely connected communities and community topics in a unified process; 2) Our framework captures community structures and their evolution with temporal smoothing given by historic community structures.

Footnote 1: http://help.yahoo.com/l/us/yahoo/answers/overview/overview55778.html

978-0-7695-4191-4/10 $26.00 © 2010 IEEE DOI 10.1109/WI-IAT.2010.189

2. Evolutionary Community Discovery

In this section, we first present our AT-LDA model to extract the community structure for each timestep. Then, temporal smoothing is applied to track the community evolution patterns in a natural way.

Figure 1. The graphical model representation of (a) AT model [5]; (b) LDA model [6]; (c) AT-LDA model.

The LDA model has been applied to discover communities in social networks [6]; it is capable of capturing overlapping community structure by assuming that a user may have multiple interests and join multiple communities. The author-topic (AT) model was used to discover authors' involvement in different topics [5]. The graphical model representations of AT and LDA are shown in Figure 1(a) and (b), respectively. We propose an AT-LDA model that combines AT and LDA into a unified framework, as shown in Figure 1(c). The proposed model considers textual content and the topological structure of the network simultaneously in the process of community discovery, and allows us to discover communities and the topic of each community in a unified way. As Figure 1(c) shows, the left part of the proposed model (the AT part) deals with the relationship between users and question threads, and the right part (the LDA part) analyzes the topological structure among users.

In our model, users are modeled as probabilistic mixtures over communities, while the topic of each community is characterized by a multinomial distribution over words. The process of generating a question thread can be described as follows: first, a user $x$ is chosen at random for each individual word token in the question thread; then, a community is sampled from the community distribution $\theta$ given the user; finally, a word is chosen from the community topic distribution over words $\phi$. The distribution over words within a question thread can be calculated as:

$$
\begin{aligned}
p(w_i \mid \theta, \phi, a_d)
&= \sum_{u \in a_d} \sum_{j=1}^{K} p(w_i, z_i = j, x = u \mid \theta, \phi, a_d) \\
&= \sum_{u \in a_d} \sum_{j=1}^{K} p(w_i \mid z_i = j, \phi_j)\, p(z_i = j \mid x = u, \theta_u)\, p(x = u \mid a_d) \\
&= \frac{1}{|a_d|} \sum_{u \in a_d} \sum_{j=1}^{K} \phi_{w_i, j}\, \theta_{j,u}
\end{aligned}
\tag{1}
$$

where $\phi_{w_i, j} = p(w_i \mid z_i = j, \phi_j)$ is the probability of word $w_i$ under community $j$; $a_d$ is the set of users for a question thread $d$; $p(x = u \mid a_d)$ is assumed to be uniform over the elements of $a_d$, and deterministic if $|a_d| = 1$; and $\theta_{j,u} = p(z_i = j \mid x = u, \theta_u)$ is the probability that the $j$th community is sampled given user $u$.

The generative procedure of the topological structure for user $u$ is described as follows: first, a community $z$ is sampled from the community distribution $\theta$ given $u$; then, a user $p$ is sampled from the distribution over users $\Omega$ given community $z$. The topological relationship between user $u$ and user $p$ can thus be calculated as:

$$
p(c_{p,u} \mid \theta, \Omega)
= \sum_{k=1}^{K} p(x = p \mid z_u = k)\, p(z_u = k \mid x = u, \theta_u)
= \sum_{k=1}^{K} \Omega_{p,k}\, \theta_{k,u}
\tag{2}
$$

where each user $p$ belongs to community $k$ with probability $\Omega_{p,k} = p(x = p \mid z = k)$. The community distribution $\theta_{j,u}$ consists of two major parts: the semantic part, derived from the question threads associated with user $u$, and the topology part, derived from the topological structure of user $u$. We write the first part as $p(z_i = j \mid u, w)$ and the second part as $p(z_i = j \mid u, c)$. The community distribution for user $u$ can thus be regularized as:

$$
\theta_{j,u} = \lambda\, p(z_i = j \mid u, w) + (1 - \lambda)\, p(z_i = j \mid u, c)
\tag{3}
$$

The adjustable weight factor $\lambda$ can be set between 0 and 1 to control the balance between the semantic part and the topology part.

The historic community structure also contains valuable information related to the current community structure [1]. The quality of the discovered community structure could be improved by considering recent historic communities. To this end, we perform temporal smoothing on community structures. Our approach models the shift between community structures as a Markov chain. Specifically, instead of randomly initializing the community structure, we fold the network at timestep t into the model learned at timestep t-1. With temporal smoothing, the kth community at timestep t evolves smoothly from the kth community at timestep t-1. Following the discussion above, the generative procedure of AT-LDA for timestep t is described in Algorithm 1. For parameter estimation in Algorithm 1, Gibbs sampling is adopted.
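To make Eqs. (1) to (3) concrete, the following is a minimal Python sketch of how the three quantities could be evaluated once the distributions are known. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary and list layout of `theta`, `phi` and `omega`, and the toy numbers in the demo are all hypothetical, and the sketch does not perform the Gibbs sampling that estimates these distributions.

```python
from typing import Dict, List, Sequence


def regularized_theta(semantic: Sequence[float],
                      topology: Sequence[float],
                      lam: float) -> List[float]:
    """Eq. (3): theta_{j,u} = lam * p(z=j | u, w) + (1 - lam) * p(z=j | u, c)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(semantic) != len(topology):
        raise ValueError("both parts must cover the same K communities")
    return [lam * s + (1.0 - lam) * t for s, t in zip(semantic, topology)]


def word_probability(word: str,
                     authors: Sequence[str],
                     theta: Dict[str, Sequence[float]],
                     phi: Sequence[Dict[str, float]]) -> float:
    """Eq. (1): (1 / |a_d|) * sum_{u in a_d} sum_j phi[j][word] * theta[u][j]."""
    total = 0.0
    for u in authors:
        total += sum(phi_j.get(word, 0.0) * theta[u][j]
                     for j, phi_j in enumerate(phi))
    return total / len(authors)


def link_probability(p: str,
                     u: str,
                     theta: Dict[str, Sequence[float]],
                     omega: Sequence[Dict[str, float]]) -> float:
    """Eq. (2): sum_k omega[k][p] * theta[u][k]."""
    return sum(omega_k.get(p, 0.0) * theta[u][k]
               for k, omega_k in enumerate(omega))


if __name__ == "__main__":
    # Hypothetical toy model with K = 2 communities and two users.
    theta = {
        "u1": regularized_theta([0.9, 0.1], [0.7, 0.3], lam=0.2),
        "u2": [0.5, 0.5],
    }
    phi = [{"driver": 0.6, "game": 0.4}, {"driver": 0.1, "game": 0.9}]
    omega = [{"u1": 0.5, "u2": 0.5}, {"u1": 0.2, "u2": 0.8}]
    print(word_probability("driver", ["u1", "u2"], theta, phi))
    print(link_probability("u2", "u1", theta, omega))
```

With a small `lam`, the topology part dominates the mixed community distribution, so a user's links weigh more heavily than the words of the threads the user took part in.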

Algorithm 1: Generative procedure for the AT-LDA model
1) Draw the community structure $\theta^t \mid \theta^{t-1}, \alpha$. Draw the topic distribution on words for communities $\phi^t \mid \phi^{t-1}, \beta$.
2) For each question thread d = 1, …, D, given the related users $a_d$:
   For each word $w_i$ (i = 1, …, $N_d$):
      Conditioned on $a_d$, choose a user $x_i \sim \mathrm{Uniform}(a_d)$
      Conditioned on $x_i$, choose a community $z_i \sim \mathrm{Discrete}(\theta^t)$
      Conditioned on $z_i$, choose a word $w_i \sim \mathrm{Discrete}(\phi^t)$
3) For each of the L users in user u's topological structure:
      Choose a community $z_i \sim \mathrm{Discrete}(\theta^t)$
      Choose a user p connected with u, such that $c_{p,u} \sim \mathrm{Discrete}(\Omega^t)$

3. An Experimental Study

In this section, we empirically evaluate our algorithm on a "windows 7" dataset collected from Yahoo! Answers. With the keyword query "windows 7", we collected about 9,452 resolved question threads posted between Sep. 25, 2009 and Dec. 14, 2009, with more than 22,000 users who asked or answered on this subject. Without loss of generality, we split the whole dataset into 5 timesteps, each spanning half a month. In the following sections, we use Ik (k = 1, …, 5) to denote the kth timestep. For all the experiments in the following sections, we set the parameters $\alpha = \beta_1 = \beta_2 = 0.01$.

For each timestep, we first extract all the question threads and the related users who have posted the question or an answer in each thread. We then construct the topological structure among users: if a user i has answered a question asked by user j during the timestep, there is an edge between user i and user j. We extract communities and their topics for each timestep with our AT-LDA model. Finally, we track the evolution of communities across timesteps.

3.1. Community Number Selection

The modularity function Q [4] has been widely adopted to measure the quality of community structures in which a user is clustered into at most one community. The soft modularity Qs was introduced in [3] to handle overlapping community structures. A better community structure generally has a higher value of Qs. We test the modularity value for each timestep under different settings of the community number. Figure 2 shows the results when $\lambda = 0.2$ and the number of iterations is set to 500. The modularity reaches relatively high values when the community number is between 8 and 10 for each timestep. For simplicity, we fix the community number to 8 for each timestep in the following sections.

Figure 2. Modularity varying with community number. [Plot: modularity (y-axis) against community number (x-axis, 2 to 20), one curve per timestep: I1: Sep.25-Oct.10, I2: Oct.11-Oct.26, I3: Oct.27-Nov.11, I4: Nov.12-Nov.27, I5: Nov.28-Dec.14.]

3.2. Performance Comparison

To show the performance of our model, we use the LDA, NMF (non-negative matrix factorization) and MetaFac models as benchmark algorithms. The LDA and NMF models detect overlapping community structure from the topological structure [6][7]. The MetaFac model [2] performs tensor factorization on the multi-relational network for overlapping community discovery. Figure 3 (a) plots the modularity value for different timesteps. Both the LDA and AT-LDA models significantly outperform the NMF and MetaFac algorithms. Our AT-LDA model outperforms LDA, with p-value = 0.047 for a paired t-test. It can also be concluded that, with temporal smoothing, the community structure discovered by AT-LDA has lower variation over time.

We then examine the topics of the discovered communities. In the absence of ground truth, we compare our AT-LDA model with the AT model by computing perplexity on held-out data, a measure that has been used extensively for topic models. Perplexity measures the ability of a probabilistic model to generalize to unseen data, and lower perplexity indicates better generalization performance. In our experiments, the perplexity is calculated by averaging over 10-fold cross-validation; for each fold we randomly split the dataset into a training set (90%) for model construction and a held-out set (10%) for testing. Figure 3 (b) shows the perplexity comparison for different timesteps. Our model attains lower perplexity, with p-value = 0.012 for a paired t-test.
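The two evaluation measures used in Sections 3.1 and 3.2 can be sketched in a few lines of Python. This is a hedged illustration rather than the evaluation code behind the reported numbers. Two assumptions should be noted: `modularity` implements the standard hard-partition Newman-Girvan Q of [4] (each user in exactly one community), not the soft modularity Qs of [3] that the experiments report, and `perplexity` takes any callable `word_prob(word, authors)` that returns the model probability of a held-out word, which for AT-LDA would be Eq. (1). All function and parameter names are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: Sequence[Edge],
               community_of: Dict[Hashable, Hashable]) -> float:
    """Newman-Girvan Q = sum_c [ L_c / m - (d_c / (2m))^2 ] for a hard partition.

    L_c is the number of edges inside community c, d_c the total degree of
    its members, and m the number of edges in the undirected graph.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph needs at least one edge")
    inside: Dict[Hashable, int] = {}
    degree: Dict[Hashable, int] = {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            inside[ca] = inside.get(ca, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def perplexity(threads: Sequence[Tuple[Sequence[str], Sequence[str]]],
               word_prob: Callable[[str, Sequence[str]], float]) -> float:
    """exp(-(sum of log p(w)) / (number of tokens)) over held-out threads.

    Each thread is a pair (authors, words); lower values mean the model
    assigns higher probability to the unseen words.
    """
    log_likelihood = 0.0
    n_tokens = 0
    for authors, words in threads:
        for w in words:
            log_likelihood += math.log(word_prob(w, authors))
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("held-out data contains no tokens")
    return math.exp(-log_likelihood / n_tokens)


if __name__ == "__main__":
    # Two triangles joined by a single edge, split into their natural halves.
    edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
    halves = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
    print(modularity(edges, halves))
    # A model that spreads probability uniformly over a 4-word vocabulary.
    held_out = [(["u1"], ["window", "driver"]), (["u2"], ["game"])]
    print(perplexity(held_out, lambda w, authors: 0.25))
```

A uniform model over V words has perplexity V, which gives a useful sanity bound: a trained topic model should score well below the vocabulary size on held-out threads.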

Figure 3. (a) Modularity for AT-LDA, LDA and NMF models; (b) Perplexity for AT-LDA and AT models. [Plots: (a) modularity per timestep I1 to I5 for AT-LDA, LDA, NMF and MetaFac; (b) perplexity per timestep I1 to I5 for AT-LDA and AT.]

3.3. Community Topics and Evolution

In this section, we examine the evolution of community topics across timesteps. Figure 4 illustrates two discovered evolutionary community threads. Thread a is probably about driver installation problems with Windows 7. It starts from hardware driver problems (sound card, motherboard, etc.) during I1. With the release of Windows 7 on Oct. 22nd, driver installation for different Windows 7 versions is discussed during I2 and I3. The discussion then shifts to internet security problems and entertainment (e.g. games) during I4 and I5, respectively. Thread b seems to focus on entertainment on Windows 7, with the keyword "game" appearing frequently. Program security problems are first discussed during I1. The clean installation and upgrade of Windows 7 are the focus during I2. After the release of Windows 7, user experiences are discussed during I3. Driver problems with graphics cards are discussed during I4, and iPod transfer for Windows 7 receives the most attention during I5.

Generally speaking, with the release of Windows 7, a successful product compared to Windows Vista, upgrading or installation problems and hardware driver problems are the two major types of problems that concern users. Discovering communities and their topics provides an alternative feedback channel for manufacturers to better grasp users' demands. Such a channel enriches the foundations for businesses to improve their services, thus improving customer satisfaction. For individual users, it offers a chance to gain insight into the user experience of a product from a huge number of users, providing precise information for a potential consumer to make an informed purchase decision.

Figure 4. Example of evolutionary community threads. Timesteps: I1: Sep.25-Oct.10, I2: Oct.11-Oct.26, I3: Oct.27-Nov.11, I4: Nov.12-Nov.27, I5: Nov.28-Dec.14. Top keywords (stemmed) per thread:

Thread a
   I1: drive, comput, go, get, mayb, sound, thank, product, card, motherboard
   I2 and I3: window, premium, instal, upgrad, vista, version, comput, system, drive, driver; window, program, upgrad, microsoft, instal, work, comput, vista, drive, problem
   I4: window, comput, click, free, internet, us, secur, download, softwar, registri
   I5: game, laptop, card, come, call, love, core, peopl, driver, plai

Thread b
   I1: window, vista, us, run, download, free, make, game, secur, program
   I2: window, version, instal, microsoft, laptop, comput, come, vista, drive, upgrad
   I3: window, look, peopl, point, link, instal, like, tell, start, feel
   I4: window, card, graphic, comput, game, work, file, look, drive, video
   I5: laptop, processor, transfer, comput, memori, core, ipod, price, game, graphic

4. Conclusion

Analyzing communities and their evolution in a dynamic way has been a challenging problem with broad potential applications. In this paper, we propose a framework for extracting evolutionary community structures and their topics in a unified process. Empirical studies on a large-scale real-life dataset show the effectiveness of our method.

5. Acknowledgement

This research is supported by the projects 863 (No. 2006AA010106), 973 (No. 2007CB311007), NSFC (No. 60703085, 70890084, 60875049 and 60621001), and Chinese Academy of Sciences (No. 2F07C01).

6. References

[1] Y. Chi, X. Song, D. Zhou, K. Hino, and B. Tseng. Evolutionary spectral clustering by incorporating temporal smoothness. In KDD 2007, pp. 153-162.
[2] Y. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher. MetaFac: community discovery via relational hypergraph factorization. In KDD 2009, pp. 527-536.
[3] Y. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. Tseng. Analyzing communities and their evolutions in dynamic social networks. ACM Transactions on Knowledge Discovery from Data, special issue on Social Computing, Behavioral Modeling, and Prediction (3:2), 2009.
[4] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113, 2004.
[5] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In KDD 2004, pp. 306-315.
[6] H. Zhang, C. Giles, H. Foley, and J. Yen. Probabilistic Community Discovery Using Hierarchical Latent Gaussian Mixture Model. In AAAI 2007, pp. 663-668.
[7] S. Zhang, R.S. Wang, and X.S. Zhang. Uncovering fuzzy community structure in complex networks. Phys. Rev. E 76(4), 046103, 2007.
[8] H. Lauw, J. C. Shafer, R. Agrawal, and A. Ntoulas. Homophily in the digital world: A LiveJournal case study. IEEE Internet Computing (14:2), 2010, pp. 15-23.
