Predicting the Tendency of Topic Discussion on the Online Social Networks Using a Dynamic Probability Model

Yadong Zhou1, Xiaohong Guan1,2, Zhefei Zhang1, Beibei Zhang1

1 SKLMS Lab and MOE KLNNIS Lab, Xi’an Jiaotong University, P.R. China
2 Center for Intelligent and Networked Systems and TNLIST Lab, Tsinghua University, P.R. China

{ydzhou, xhguan, zfzhang, bbzhang}@sei.xjtu.edu.cn

ABSTRACT
Topic discussion is a significant phenomenon on online social networks and is likely to attract increasing attention in the near future. In this paper, we predict the tendency of topic discussion on online social networks using a dynamic probability model. We analyze the process of topic discussion and give a formulation of it. Three main factors (individual interest, group behavior, and time lapse) are analyzed and quantified, and on that basis we propose a dynamic probability model that predicts each user’s behavior, i.e., attending the topic discussion or not, and from it the number of attending users. Most of the parameters of the model can be calculated by maximum likelihood (ML) estimation, and the remaining three parameters are set by experience. Experiments show that our model can predict the tendency of topic discussion accurately. We also simulate different settings of the three experience parameters and study how to select suitable values for them.

Categories and Subject Descriptors G.3 [PROBABILITY AND STATISTICS]: stochastic processes, statistical computing; J.4 [SOCIAL AND BEHAVIORAL SCIENCES]: Sociology.

General Terms Algorithms, Experimentation, Human Factors

Keywords Topic Discussion, Group Dynamics, Online Social Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WebScience'08, June 19, 2008, Pittsburgh, Pennsylvania, USA. Copyright 2008 ACM 978-1-59593-XXX-X/08/06…$5.00.

1. INTRODUCTION
As one of the most important media, the Internet provides various communication platforms, such as BBS (Bulletin Board System), forums and blog sites, for users to discuss all kinds of topics with each other. Through the process of discussion, an online social network is created, which is in effect a set of groups of users with particular patterns of communication between them [1]. Nowadays online social networks play an increasingly important role in people’s lives. Many corporations use them to publish news and advertisements about their products, hoping to attract the attention and discussion of valuable users.

On the other hand, an online social network can be a powerful platform for spreading rumors and illegal information, which is harmful to society. It is therefore important to explore how information spreads and how relevant topics are discussed on online social networks. We need an effective theory to predict the tendency of topic discussion and to estimate its influence. Such a theory could help in deciding strategies for publishing business advertisements and in preventing the spread of rumors and illegal information.

Sociologists have investigated many properties of traditional social networks [2], including representation, structure and evolution. In recent years, research on web social networks has attracted much attention [3], and studies indicate that the web is a small-world and scale-free network [4]. As a great many Web 2.0 sites emerge and become popular, sociologists and computer scientists have begun to investigate the properties of these sites. Gilad Mishne [5] analyzes blog comments and the relationship between weblog popularity and commenting patterns. K. S. Emaili [6] builds a recommender system for weblogs based on link structure. D. Liben-Nowell [7] finds a strong correlation between friendship and geographic location in social networks using data from LiveJournal. Backstrom [8] examines friendship links and community membership in LiveJournal. Alan Mislove [9] measures and analyzes the structural properties of online social networks based on large-scale data collected from four popular sites. The studies above concentrate mainly on the structural properties of online social networks and do not address the prediction problem raised at the beginning of this section. We therefore present a dynamic probability model for predicting the tendency of topic discussion in this paper.

Dynamic models of information diffusion and disease spreading have attracted many researchers and have been widely studied in recent years.
Related work has developed along two lines, the epidemic model and the rumor model, of which the former is the more basic. Research on mathematical epidemiology has grown exponentially since the middle of the 20th century [10-12]. Traditionally, both the SIR and the SIS models are considered under the homogeneous mixing hypothesis [13], which is a strong and questionable assumption. Recent research shows that many real networks have the small-world property and are “scale-free” networks [14]. Starting with the work of Pastor-Satorras [15], there has been a burst of activity on understanding the effects of network topology on the rate and patterns of disease spread [15-16]. The standard rumor model is the DK model, proposed by Daley and Kendall [17]. Recently, several authors [18-19] have explored this model and slight variations of it on top of complex topologies. These studies are usually based on classifying users into three states, proposing dynamic differential equation models for the change in the number of users in each state, and analyzing the dynamic process.

However, these approaches still have some problems. First, the epidemic model and the rumor model describe only the change in the number of users, and miss other important features of the dynamic process, such as the detailed behavior of each user. Second, they require the numbers of susceptible, infected and removed users, as well as the transmission rate and recovery rate, as input parameters, which are difficult to obtain in the real world. Third, they can only describe a large-scale infection process, which is too demanding a condition for topic discussion; usually only hundreds of users are involved in a hot topic discussion on an online social network, as our experimental data confirm. Thus, the existing models cannot properly describe the dynamic process of topic discussion on online social networks.

In this paper, we propose a dynamic probability model that can predict the tendency of topic discussion on online social networks more appropriately. We consider a topic discussion process as the sum of the dynamic behaviors of users who belong to a group, where the behavior is attending the topic discussion or not at a certain time point. Here, we define a group as the set of users who are discussing the topic. As the discussion of the topic proceeds, the corresponding group develops as well. Therefore, the number of users who are discussing the topic equals the number of users who belong to the group at any time point. We study the factors that influence a user’s behavior and find three main ones: individual interest, group behavior and time lapse. By quantifying these three factors, we propose a dynamic probability model and obtain the trend probability of each user’s behavior. Most of the parameters in the model can be calculated by MLE (maximum likelihood estimation), and three other parameters are set by experience.
We then test the validity of the model on real data, consisting of BMY BBS data and SINA blog data, as introduced in Section 4. The experimental results show that the model can predict the tendency of topic discussion accurately. In addition, we simulate different settings of the three experience parameters and study how to select suitable values for them.

This paper is organized as follows. Section 2 presents the dynamic formulation of the topic discussion process. Section 3 proposes the dynamic probability model. The experimental results and discussion are presented in Section 4. Concluding remarks are given in the last section.

2. DYNAMIC FORMULATION OF TOPIC DISCUSSION PROCESS
Normally, the dynamic process of topic discussion on online social networks can be regarded as the discussion of one topic, about some lasting content on web sites, by many users, whose number and identity change as time passes. In this paper, we describe the dynamic process as follows.

Definition 1: A group is defined as the set of users who discuss the topic; the size and composition of the group change dynamically as time passes.

In the process of topic discussion, some new users join the discussion and some existing users leave it. If we define all the users who have a chance to discuss the topic as a universal set, then the group consisting of the users who are discussing at a given time is a subset. Let A be the universal set, a a user, and q the size of the set:

\[ A = \{a_1, a_2, \ldots, a_q\} \tag{1} \]

We consider A to be a constant set whose elements do not change as time goes on. G(t) is the dynamic group for topic T, whose elements change over time:

\[ G(t) = \{g_1, g_2, \ldots, g_{m(t)}\}, \quad G(t) \subseteq A \ \text{and} \ m(t) \le q \quad (t = 1, 2, \ldots) \tag{2} \]

where, for every $g_i \in G(t)$ $(1 \le i \le m(t))$, the user represented by $g_i$ is attending the discussion about topic T at time t. Formulas 1 and 2 describe the evolution of a group from the perspective of changing sets of individuals. We can also describe it from the perspective of changing individual behavior:

\[ a \in A \ \text{and} \ a \in G(1),\ a \notin G(2),\ a \in G(3),\ \ldots,\ a \in G(N) \tag{3} \]

where, in this example, the user participates at every time except time 2. (In this paper, we write “user’s behavior” as shorthand for the behavior of attending the discussion or not.) From formula 3, the behavior sequence for user a can be expressed as formula 4:

\[ a : x = (x_1, x_2, \ldots, x_N) \ \text{and} \ x_i \in \{0, 1\} \quad (1 \le i \le N) \tag{4} \]

where $x_i = 1$ means that user a discusses the topic at time i and $x_i = 0$ means that the user does not. For the behavior of user a in formula 3, we get

\[ x_1 = 1,\ x_2 = 0,\ x_3 = 1,\ \ldots,\ x_N = 1 \tag{5} \]

If we sum the behavior sequences of all users, as in formula 5, the result gives another view of the dynamic process of group evolution. Therefore, we can predict the tendency of the group by predicting the tendency of individual behavior, and a dynamic probability model of individual behavior is presented below.
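The relationship between the per-user behavior sequences of formula 4 and the group size m(t) of formula 2 can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `group_size_series`, the dictionary layout, and the sample data are all hypothetical.

```python
from typing import Dict, List


def group_size_series(behavior: Dict[str, List[int]]) -> List[int]:
    """Return m(t) for t = 1..N by summing x_t over every user in A.

    `behavior` maps a (hypothetical) user id to its 0/1 sequence
    x = (x_1, ..., x_N) as in formula 4.
    """
    sequences = list(behavior.values())
    if not sequences:
        return []
    n_steps = len(sequences[0])
    if any(len(x) != n_steps for x in sequences):
        raise ValueError("all behavior sequences must have the same length N")
    # m(t) is the number of users whose x_t equals 1.
    return [sum(x[t] for x in sequences) for t in range(n_steps)]


# Made-up sample data: three users observed over four time steps.
behavior = {
    "a1": [1, 0, 1, 1],
    "a2": [0, 0, 1, 0],
    "a3": [1, 1, 1, 0],
}
print(group_size_series(behavior))  # [2, 1, 3, 1]
```

Because each x_t is 0 or 1, the column sums equal the size of G(t) at each time, which is why predicting individual behavior is enough to predict the group tendency.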

3. DYNAMIC PROBABILITY MODEL
In the topic discussion process, a user’s behavior is mostly affected by three factors: individual interest, group behavior, and time lapse. We therefore propose three hypotheses.

Hyp.1 (individual interest factor): the more times a user has attended the discussion about topic T at present and in the past, the more probably he attends the discussion at the next time.

Hyp.2 (group behavior factor): the larger the increase in the number of users attending the discussion about topic T at present and in the past, the more probably a user attends the discussion at the next time.

Hyp.3 (time lapse factor): the longer the interval between the present and the peak time, the less probably users attend the discussion at the next time.

Hyp.1 reflects each user’s behavior habits and supposes that each user has his own interest in certain content. If a user often attended the discussion in a past period, he is probably quite interested in the topic and is more likely to discuss it again in the future. Based on Hyp.1, we give the behavior tendency function f(X) for user a at time n (1 [...] 1, l_j > 1, λ > 0 (10)

where P(x_n) is the probability of the tendency of attending the discussion or not; P(x_n) is positively correlated with χ(X). Based on formulas 1 and 5, the behavior of all q users in the universal set A can be expressed as

\[ a_1 : x_{a_1},\quad a_2 : x_{a_2},\quad \ldots,\quad a_q : x_{a_q} \tag{11} \]

We estimate the parameters in formula 10 by the MLE method. The results are as follows:

\[ \ln L(q) = \ln\left[\chi(x_{a_1}) \cdot \chi(x_{a_2}) \cdots \chi(x_{a_q})\right] = \sum_{A} \left\{ \sum_{i=1}^{s} \left\{ \left[2 - (x_n - x_{n-i})^2\right] \ln k_i - k_i \right\} + \sum_{j=1}^{s'} \left\{ \left[1 + r^{(2x_n - 1)}\right] \ln l_j - l_j \right\} \right\} \tag{12} \]

\[ \frac{d \ln L(q)}{d k_i} = \sum_{A} \left\{ \frac{2 - (x_n - x_{n-i})^2}{k_i} - 1 \right\} = 0 \ \Rightarrow\ k_i = \frac{\sum_{A} \left[2 - (x_n - x_{n-i})^2\right]}{q} \tag{13} \]

\[ \frac{d \ln L(q)}{d l_j} = \sum_{A} \left\{ \frac{1 + r^{(2x_n - 1)}}{l_j} - 1 \right\} = 0 \ \Rightarrow\ l_j = \frac{\sum_{A} \left\{1 + r^{(2x_n - 1)}\right\}}{q} \tag{14} \]

According to formulas 13 and 14, we can estimate the parameters in formula 10 from sample data, and calculate the value of the behavior tendency function χ(X) with x_n set to 1 and to 0. With x_n = 1, χ(X) is positively correlated with P(x_n=1); with x_n = 0, χ(X) is positively correlated with P(x_n=0). After normalizing χ(X), we obtain the values of P(x_n=1) and P(x_n=0).
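The estimator in formula 13 and the normalization step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors’ implementation. The function names are hypothetical. The text says only that χ(X) is normalized, so the form χ(x_n=1) / (χ(x_n=1) + χ(x_n=0)) used here is an assumption. Reading the predicted number of participants as the sum of per-user probabilities is likewise an assumption. The l_j estimator of formula 14 is omitted because the quantity r is not defined in the surviving text.

```python
from typing import List, Sequence, Tuple


def estimate_k(sequences: Sequence[Sequence[int]], n: int, s: int) -> List[float]:
    """Formula 13: k_i = (1/q) * sum over users of [2 - (x_n - x_{n-i})^2].

    `n` is a 0-based index into each sequence and must satisfy n >= s,
    so that x_{n-i} exists for every i = 1..s.
    """
    q = len(sequences)
    if q == 0:
        raise ValueError("need at least one behavior sequence")
    if s < 1 or n < s:
        raise ValueError("require 1 <= s <= n")
    if any(len(x) <= n for x in sequences):
        raise ValueError("every sequence must contain index n")
    return [
        sum(2 - (x[n] - x[n - i]) ** 2 for x in sequences) / q
        for i in range(1, s + 1)
    ]


def normalize(chi_attend: float, chi_absent: float) -> Tuple[float, float]:
    """ASSUMPTION: P(x_n=1) = chi(x_n=1) / (chi(x_n=1) + chi(x_n=0))."""
    total = chi_attend + chi_absent
    if total <= 0:
        raise ValueError("tendency values must have a positive sum")
    return chi_attend / total, chi_absent / total


def expected_participants(p_attend: Sequence[float]) -> float:
    """ASSUMPTION: predicted group size is the sum of per-user P(x_n=1)."""
    return sum(p_attend)


# Made-up sample: three users, four time steps, window s = 2, target index n = 3.
sequences = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
k = estimate_k(sequences, n=3, s=2)
p_attend, p_absent = normalize(3.0, 1.0)
print(k, p_attend, p_absent)
```

Each term 2 - (x_n - x_{n-i})^2 equals 2 when the user behaves at time n as at time n-i and 1 otherwise, so k_i always lies between 1 and 2 and grows with how consistently users repeat their behavior from i steps earlier.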

4. EXPERIMENT RESULTS
4.1 Validity of the Model
In the experiment, we collect topic data from the BMY BBS site (http://bbs.xjtu.edu.cn/) of Xi’an Jiaotong University and the SINA blog site (http://blog.sina.com.cn). The BMY BBS site, which has 364 boards and 37,290 registered users and on which more than 3,000 subjects are discussed per day, is one of the most influential BBS sites in CERNET. The SINA blog site, which has 2,600,000 registered users, is one of the most influential blog sites in China. We captured BMY BBS topic data from Nov. 1, 2006 to Dec. 8, 2006, and selected “job info discussion about Huawei (a company name)” as topic 1 and “job info discussion about ZTE (a company name)” as topic 2. We captured SINA blog topic data from Oct. 6, 2007 to Oct. 30, 2007, and selected “17th National Congress of CCP” as topic 3. Fig.1 (a, b, c) compares the real data with the predicting results for topics 1, 2 and 3. The x axis is time in days and the y axis is the number of participators in users; the blue line with diamond markers corresponds to the real data, and the black line with cross markers corresponds to the predicting result. We set the values of s, s’ and λ to 7, 3 and 0.5 respectively, and calculate the other parameters by MLE as given in formulas 13 and 14.

Fig.1 Comparison of real data and predicting data for topics 1, 2, 3: (a) predicting result for topic 1; (b) predicting result for topic 2; (c) predicting result for topic 3

Fig.2 Influence of parameters s, s’, λ on the predicting result: (a) influence of s for topic 1; (b) influence of s for topic 2; (c) influence of s’ for topic 1; (d) influence of s’ for topic 2; (e) influence of λ for topic 1; (f) influence of λ for topic 2

As shown in Fig.1 (a, b, c), the probability model in formula 10 predicts the tendency of topic discussion properly. Some obvious errors appear in Fig.1 (c), for several reasons. The most important is an abrupt outside news event (the convening of the congress), which is difficult to predict from previous data. Although some errors appear at the peak time, the predicting result performs satisfactorily on both the overall trend and most details. The factors that most influence the precision of our model are listed below.

(1) Holidays. Fewer users log on to the site on holidays than on business days, so the number of participators is smaller. This is especially evident on the BBS site.
(2) Abrupt outside news. Abrupt news can seriously influence the number of participators, as Fig.1 (c) clearly shows.

4.2 Influence of the Experience Parameters
Having tested the validity of the probability model by the simulation experiment in 4.1, we now discuss the influence of the three experience parameters by selecting different values for them.

Experience parameter s is the effective duration of the individual interest factor in formula 10: the user’s behavior at time n is associated only with the behavior from time n-s to time n-1. Using the data of topic 1 and topic 2, we run the simulation process described in 4.1 with s’ and λ set to 1 and 0.5 and s set to 5, 9 and 13 in turn. The other parameters are again calculated by the MLE method. The comparison is shown in Fig.2 (a, b). When s is 5, the predicting curve fluctuates markedly and follows the trend of the real data, although several details are not matched well; when s is 9, the predicting curve fluctuates slightly and coincides well with both the trend and the details; when s is 13, the predicting curve is smooth and fits the mean value of the real data, although several details are not matched well. Thus, a small value of s makes the predicting result follow the trend of the real data, a middle value makes it follow the details, and a large value makes it follow the mean value. We can select a proper value for s according to the aim of the application.

Experience parameter s’ is the effective duration of the group behavior factor in formula 10: the user’s behavior at time n is associated only with the group behavior from time n-s’ to time n-1. Using the data of topic 1 and topic 2, we run the simulation process described in 4.1 with s and λ set to 7 and 0.5 and s’ set to 3, 4 and 5 in turn. The other parameters are again calculated by the MLE method. The comparison is shown in Fig.2 (c, d). With s’ set to 3, 4 and 5, the predicting curves all coincide properly with the real data, and the differences between them are small. We can select a value of 3 for s’ in applications to reduce the algorithm complexity.

Experience parameter λ is the lapse exponential coefficient, which is related to the rate at which a user’s interest decays as time passes. Using the data of topic 1 and topic 2, we run the simulation process described in 4.1 with s and s’ set to 7 and 3 and λ set to 0.3, 0.6 and 0.9 in turn. The other parameters are again calculated by the MLE method. The comparison is shown in Fig.2 (e, f). When λ is 0.3, the predicting curve lies above the real data; when λ is 0.6, it coincides with the real data; when λ is 0.9, it lies below the real data. Moreover, the predicting curve for λ=0.3 is the upper envelope of the real data, the curve for λ=0.6 follows the mean value of the real data, and the curve for λ=0.9 is the lower envelope. So we can select a small or large value of λ to obtain the upper or lower bound, and a middle value of λ to track the mean value of the real data.
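The ordering of the three λ curves can be illustrated with a small Python sketch. The exact time lapse term of formula 10 is not available in this text, so the weight exp(-λ·d) below is purely an assumed stand-in for an exponential lapse factor, where d is the number of days since the peak time as in Hyp.3. The function name `lapse_weight` is hypothetical.

```python
import math


def lapse_weight(lam: float, days_since_peak: float) -> float:
    """ASSUMED decay weight exp(-lam * d); a stand-in, not the paper's exact term."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if days_since_peak < 0:
        raise ValueError("days_since_peak must be non-negative")
    return math.exp(-lam * days_since_peak)


# Compare the three values of lambda used in the experiment over a few days.
for lam in (0.3, 0.6, 0.9):
    weights = [round(lapse_weight(lam, d), 3) for d in range(4)]
    print(lam, weights)
```

Under this assumption a larger λ shrinks the weight faster once the peak has passed, which agrees with the observation that the λ=0.9 curve lies below the real data and the λ=0.3 curve lies above it.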

5. CONCLUSION
In this paper, we analyze the process of topic discussion and provide a formulation of it. Based on three factors (individual interest, group behavior, and time lapse), we propose a dynamic probability model for predicting the tendency of topic discussion. Most parameters in the model can be calculated by MLE, while the other three are set by experience. Experiments demonstrate that the model can predict the trend of topic discussion accurately. We also simulate different settings of the three experience parameters and study how to select suitable values for a given requirement. We find that the process of topic discussion is influenced by the structure of the social networks on the sites. In future work, we will analyze the influence of social network structure in order to improve the precision of our model.

6. ACKNOWLEDGMENTS The research presented in this paper is supported in part by the National Natural Science Foundation (60574087), 863 High Tech Development Plan (2007AA01Z475, 2007AA01Z480, 2007AA01Z464) and 111 International Collaboration Program, of China.

7. REFERENCES
[1] J. Scott. 2000. Social Network Analysis: A Handbook, 2nd ed. Sage Publications, London.
[2] S. Wasserman, K. Faust. 1994. Social Network Analysis: Methods and Applications. Cambridge University Press.
[3] M. Jamali, H. Abolhassani. 2006. Different Aspects of Social Network Analysis. International Conference on Web Intelligence (WI'06).
[4] R. Albert, H. Jeong. 1999. Internet: Diameter of the World-Wide Web. Nature, Vol. 401, Issue 6749, p. 130.
[5] G. Mishne, N. Glance. 2006. Leave a reply: An analysis of weblog comments. WWW 2006 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics.
[6] K. S. Emaili, M. Neshati, M. Jamali, H. Abolhassani. 2006. Comparing performance of recommendation techniques in the blogsphere. ECAI'06 Workshop on Recommender Systems, Riva del Garda, Italy.
[7] D. Liben-Nowell, J. Novak, R. Kumar, P. Raghavan, A. Tomkins. 2005. Geographic Routing in Social Networks. Proceedings of the National Academy of Sciences, 102(33):11623–11628.
[8] L. Backstrom, D. Huttenlocher, J. Kleinberg. 2006. Group Formation in Large Social Networks: Membership, Growth, and Evolution. Proc. 12th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.
[9] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, B. Bhattacharjee. 2007. Measurement and Analysis of Online Social Networks. IMC'07, October 24-26, 2007, San Diego, California, USA.
[10] N. T. J. Bailey. 1975. The Mathematical Theory of Infectious Diseases and Its Applications. Hafner Press, New York.
[11] R. M. Anderson, R. M. May. 1992. Infectious Diseases in Humans. Oxford University Press, Oxford.
[12] O. Diekmann, J. Heesterbeek. 2000. Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation. Wiley, New York.
[13] R. M. Anderson, R. M. May. 1992. Infectious Diseases in Humans. Oxford University Press, Oxford.
[14] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, D.-U. Hwang. 2006. Complex networks: Structure and dynamics. Physics Reports 424, 175–308.
[15] R. Pastor-Satorras, A. Vespignani. 2001. Epidemic Spreading in Scale-Free Networks. Physical Review Letters, Vol. 86, No. 14.
[16] Y. Moreno, J. B. Gómez, A. F. Pacheco. 2003. Epidemic incidence in correlated complex networks. Physical Review E 68, 035103(R).
[17] D. J. Daley, D. G. Kendall. 1964. Epidemics and Rumours. Nature 204, 1118.
[18] D. H. Zanette. 2002. Dynamics of rumor propagation on small-world networks. Physical Review E 65, 041908.
[19] Y. Moreno, M. Nekovee, A. F. Pacheco. 2004. Dynamics of rumor spreading in complex networks. Physical Review E 69, 066130.
