2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing

Modeling and Predicting the Re-Post Behavior in Sina Weibo

Xinjiang Lu, Zhiwen Yu, Bin Guo, Xingshe Zhou School of Computer Science Northwestern Polytechnical University Xi’an, China [email protected], [email protected], [email protected], [email protected]

Abstract: The study of human behavior patterns is of utmost importance to many areas, such as disease spread, resource allocation, and emergency response. Because of their widespread availability and use, online social networks (OSNs) have become an attractive proxy for studying human behavior. One interesting and challenging problem concerning OSNs is how much attention a user's post can gain. In this paper, we tackle this problem by exploring approaches to predict the number of reposts that any given post will obtain in Sina Weibo, a popular microblogging service in China. Specifically, we propose a RepostsTree-based method that models the reposting process in a temporally dynamic manner. Experiments on data collected from the real world indicate that our method is effective for repost prediction.

Keywords: Social media; Sina Weibo; repost prediction

I. INTRODUCTION

With the development of the Internet, the place for expressing oneself and contacting peers is shifting rapidly from offline to online. Online social networks are becoming increasingly popular places to exchange information and meet like-minded people. In popular social networking sites such as Google+ and Twitter, people share activity updates with their neighbors or followers, who in turn disseminate the received updates to others. In this way, content generated by a user propagates through the network to a wide user population. Because of their usefulness and wide accessibility, social networks are rapidly changing the public discourse of human society.

Content-centric social networks, such as Twitter and Sina Weibo, promote content spread by allowing users to connect with people who share common interests but may not be their friends. The extent to which a social network spreads content is a key metric that affects both user engagement and network revenue. From the social network's perspective, higher content spread promotes user engagement, which in turn improves user retention and audience growth. It also provides more opportunities for monetizing content via online ads, the sale of virtual goods, and so on. Given these benefits, it is important for researchers to investigate the dissemination of content in social networks.

One significant challenge in this context is whether we can predict how much attention an online post will ultimately gain. Researchers have found that user attention is allocated in a rather asymmetric way: most posts get few views, re-propagations, or downloads, whereas a few receive most of the attention [21]. While it is possible to predict the distribution of attention over many items, it is notably difficult to predict the amount that will be devoted over time to any given item.

Twitter is the most influential social media platform in the world, with a large number of users and a huge amount of content distributed every day. Sina Weibo¹ is another remarkable microblogging service provider. Although it was launched in 2009, several years after Twitter, its number of users, volume of content, and traffic rank make it the most popular microblogging site in China, and many of its features differ from those of Twitter [17]. However, unlike Twitter, whose potential applications and information propagation mechanisms have been well studied, Sina Weibo has received far less attention. Recent research mainly focuses on characterizing the web structure of Sina Weibo and comparing Weibo with Twitter. Very few studies investigate how content spreads in Sina Weibo in order to predict information diffusion there.

Like Twitter, Sina Weibo has a retweet-like mechanism, called repost, for content propagation. In this paper, we explore approaches to model and predict reposting patterns in Sina Weibo. We are motivated to investigate the feasibility of building a repost model with topic-free measures. We are interested not only in general content features such as URLs and mentions, but also in hidden features such as the modality of the media types in the content of a microblog, and in contextual features unique to Sina Weibo such as VerifiedOrNot (see Section 4.2). We built a tree-based predictive model that takes into account users' relationships and reposting behavior patterns in Weibo.

The rest of this paper is organized as follows. Section 2 discusses prior work on microblogging-like social networks. Section 3 describes our prediction model in detail. Section 4 introduces the data sets used in our experiments, analyzes the features that affect repostability, and presents experimental results for both the linear regression model and the tree-based model. Section 5 concludes our work.

978-0-7695-5046-6/13 $26.00 © 2013 IEEE DOI 10.1109/GreenCom-iThings-CPSCom.2013.166

¹ http://en.wikipedia.org/wiki/Sina_Weibo

II. RELATED WORK

Social networks, on an unprecedented scale, serve as fertile ground for examining social interactions [2]. Microblogging online social networks focus on sharing information and, as such, have been studied extensively in the context of information diffusion. For example, diffusion and influence have been modeled in blogs [6, 7], emails [8], and sites such as Twitter [9]. Leskovec et al. [10] studied the explicit graph of product recommendations, Sun et al. [11] studied cascading in page fanning, and Bakshy et al. [12] examined the exchange of user-created content.

For Twitter, many studies have analyzed retweeting behaviors and the factors related to them. Regarding retweeting behaviors, various motivations are explored in [22], while the propagation graph and its statistics are studied in [20] and [23]. Regarding retweeting-related factors, [24] and [1] find that retweeted and normal tweets differ in dimensions such as the inclusion of URLs and hashtags, publish time, wording, author publicity, and even the URL shortening service used.

Some researchers have tried to build models that predict the retweeting decisions of a targeted network. The authors of [25] address this problem by means of constrained optimization using factor graphs in a generative manner. In [26], the authors solve the same problem using conditional random fields and, to improve prediction runtime performance, investigate approaches to partition the social graph and construct the network relations for retweet prediction. Yet such fine-grained predictive models for retweeting may not be readily adopted in practice, since it is difficult to know which users the content will reach.

In contrast to the work on understanding or characterizing the retweet mechanism of Twitter, few studies have tried to predict the number of retweets that a given tweet will obtain. One study of this problem is [1], which builds a regressor to predict the aggregate number of retweets for a given tweet. However, since the number of followers of a Twitter user ranges from none to millions, it is very difficult to estimate information spread with such an aggregate prediction.

Up to now, few studies of Sina Weibo have been conducted. New features in Sina Weibo that differ from Twitter have been analyzed [17, 19]. Qu et al. [15] focused on empirical descriptions of the spreading pattern of disaster-related information in Sina Weibo. Chen et al. [16] compared Weibo and Twitter in terms of web structure, and other researchers made such comparisons from a cultural perspective [13, 14]. Another interesting study of Sina Weibo was conducted by Sandes et al. [3], who proposed a logical relationship model and explored query optimization in Weibo. Zhang et al. [4] did some initial work on predicting retweet behavior in Sina Weibo, yet their study is limited to the correlation between the number of followers and retweet probability.

Much work has used statistical tools to explain and predict online human activity [27]. Barabási [28] reported that deliberate human activity is inherently non-Poissonian. The authors of [29] give a Poissonian explanation for the heavy tails in e-mail communication. Similar to our approach, the authors of [30] give a log-linear model based on Gaussian random variables to model and predict the dynamics of the popularity growth of online content. However, previous researchers mainly detected reposts (retweets in Twitter) associated with the original microblogs by textual analysis (e.g., checking for "RT", "//@", etc.), an approach that may suffer from the free-text style of reposted microblogs [24]. We instead leverage the real-world reposting lists of given microblogs in Sina Weibo and build a tree-based model to generate a dividing rule for the original time series of reposting behaviors.

III. TREE-BASED PREDICTION MODEL

To gain some intuition about the reposting activity pattern in Sina Weibo, consider a fictitious user A who publishes a microblog M. After M is posted, a subset of the followers of A (those who happen to be on Sina Weibo within a certain time window after M is posted) have approximately equal probability of viewing M. Of course, the exact time at which each follower browses the post differs, and the later a follower browses, the less likely the post is to be viewed and hence reposted. The process in which A publishes M and her followers repost M within a certain time duration can be approximately regarded as a single Poisson process [5]. After M is reposted, the reposters' followers, who may or may not be following A, have a similar chance of viewing M and reposting it again. This can be regarded as another single Poisson process, whose parameter may differ from that of the former one. Thus, the whole reposting process is a compound Poisson process that consists of one or more individual Poisson processes.

Based on this hypothesis, we propose a RepostsTree-based method to model the reposting process in a dynamic manner. First, we explain how to construct such a RepostsTree and give an illustrative example; then, we formulate the RepostsTree-based prediction model.

A. RepostsTree Construction

We propose the RepostsTree model, which describes the temporal dynamics of reposting in a hierarchical fashion. The original microblog is represented as the root node, and the boosting reposts are placed as children or grandchildren of the root. The boosting reposts are the ones with a high contribution value. Here we define contribution as the capacity of a specific repost to fuel the reposting process; in other words, the higher the contribution value, the more reposts the original post will gain as a whole. To simplify our method, we use the followers count as the contribution measurement, and we evaluate this value in both relative and absolute ways. Reposts that cannot be considered boosting reposts are treated as members of the time series attached to the boosting reposts.


The structure of the RepostsTree is based on the follower/followee relationships of the authors who published the post and its reposts. However, users of Sina Weibo commonly add or delete followees, and the Sina Weibo open API enforces a strict policy on obtaining the follower IDs of a specific user, so it is difficult to obtain the complete follower/followee relationships of the users involved in a given reposting process. To mitigate this problem, we collect some reposts in an "orphan collector". The "orphan collector" gathers the reposts published by users who did not follow (or who may have followed, without our being able to recognize it) any of the earlier users who published boosting reposts of the same microblog.

The detailed steps for constructing the RepostsTree are illustrated in Fig. 1, where O is the original microblog, N is the repost currently being processed, and RepostsList is the repost-list collected via the Sina Weibo open API². Note that whether the RepostsList is empty is relative: when the delay between N and O is longer than the window of our training set, we also consider the RepostsList to be empty, although it is not really empty. R is the root node of the first tree. The RepostsTree is actually a forest rather than a single tree, since it may contain more than one root.
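The construction rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Repost`, `Node`, and `build_reposts_forest` are hypothetical, the 10,000-follower threshold is the example value quoted in the text, the relative condition is checked against the original author's followers count, and a repost whose author follows several boosting authors is attached to the most recent one (the text does not specify a tie-breaking rule).

```python
from dataclasses import dataclass, field

ABSOLUTE_THRESHOLD = 10_000  # example value from the text; an assumption here


@dataclass
class Repost:
    author: str
    followers: int   # followers count of the repost's author
    delay: float     # time delay relative to the original microblog
    followees: set = field(default_factory=set)  # authors this user is known to follow


@dataclass
class Node:
    author: str
    delay: float
    children: list = field(default_factory=list)  # boosting reposts
    series: list = field(default_factory=list)    # delays of non-boosting reposts


def build_reposts_forest(root_author, root_followers, reposts):
    """Return (roots, orphans) after scanning the reposts in time order."""
    roots = [Node(root_author, 0.0)]
    nodes = list(roots)   # every boosting node created so far, oldest first
    orphans = []          # the "orphan collector"
    for r in sorted(reposts, key=lambda x: x.delay):
        privileged = (r.followers > ABSOLUTE_THRESHOLD    # absolute condition
                      or r.followers > root_followers)   # relative condition
        # Assumption: attach to the most recent boosting node whose author is followed.
        parent = next((n for n in reversed(nodes) if n.author in r.followees), None)
        if parent is not None and privileged:
            child = Node(r.author, r.delay)
            parent.children.append(child)
            nodes.append(child)
        elif parent is not None:
            parent.series.append(r.delay)
        elif privileged:
            new_root = Node(r.author, r.delay)   # root of another tree in the forest
            roots.append(new_root)
            nodes.append(new_root)
        else:
            orphans.append(r.delay)
    return roots, orphans
```

Keeping `nodes` as a flat list in creation order makes the parent lookup a simple reverse scan, which is adequate for a sketch; a dictionary keyed by author would be the obvious alternative for long repost-lists.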

B. An Example

To illustrate the RepostsTree construction process, we give the following example (Fig. 2). At the beginning, the original microblog (denoted R, with author A) is placed as the root node of the RepostsTree, and the "orphan collector" is empty. First, we scan the repost-list of R and pop the first repost (denoted N) from the list. Suppose the user who published N is one of the followers of A and the contribution value of N satisfies the absolute condition (e.g., the author of N has more than 10,000 followers); N is then placed as a child node of R (Fig. 2.a). Next, the second repost is popped from the list (we reuse the symbol N for it, and likewise below). Its author is again one of the followers of A, but N satisfies neither the absolute condition nor the relative condition (i.e., that the author of N has more followers than A). We call the absolute and relative conditions together the privilege conditions. Hence, N is placed as a member of the sub-time series attached to R (Fig. 2.b). Then some further reposts are popped, none of which satisfies the privilege conditions, so they are appended to the sub-time series of the existing nodes according to the follower/followee relationships. Fig. 2.c depicts the situation in which a repost N satisfies the relative condition (i.e., its author has more followers than A). When the author of a repost N did not follow anyone who published the original microblog or a boosting repost before him/her, and N does not satisfy the privilege conditions, N is placed in the "orphan collector" (Fig. 2.d). After that, a repost whose author likewise follows none of the earlier authors, but which does satisfy the privilege conditions, is popped; it is placed as the root node of another tree (Fig. 2.e). Finally, the RepostsTree is constructed as shown in Fig. 2.f.

Figure 1. The procedure for constructing the RepostsTree based on user relationships

Figure 2. Illustrative example of the process of constructing the RepostsTree (panels a to f; node labels R and N, plus the "orphan collector")

² http://open.weibo.com/wiki/2/statuses/repost_timeline

C. The Prediction Model

Let $X = \{x_i\}$ be the time series of the repost-list, where each $x_i$ represents the time delay between the original microblog and repost $i$. With the RepostsTree constructed, we can divide $X$ into several sub-time series, so that

$$X = \bigcup_{i=1}^{n} X_i .$$

The dividing rule is based on the RepostsTree: each node D and its direct children form a subset, which includes the time series attached to D and the time delays between the direct children of D and the original microblog. Given this division of $X$, we apply maximum likelihood estimation to each subset $X_i$ to calculate the parameter $\lambda$ of a single Poisson distribution. Finally, we obtain the set of compound Poisson distributions $P = \{P_1, P_2, \ldots, P_n\}$, where $n$ is determined by the division above. Given the distribution set $P$ obtained after the original microblog has been published for a time duration $t_0$, we get the predicted reposts count for the time duration $t_0 + \Delta t$ by summing the predicted values of each $P_i$ for that duration. Note that our model predicts in a dynamic way, which includes the construction of the RepostsTree and the division of the time series.

IV. EXPERIMENTAL RESULTS

A. Data Sets

Two data sets were collected for analysis. First, we collected 13 million microblogs with the Sina Weibo open API³, from which we selected 142K microblogs (the 142K data set) for an exploratory data analysis using Principal Component Analysis (PCA) and linear regression modeling. In addition, we selected 294 popular microblogs and collected the full repost-list of each (the 294 repost-lists data set); this data set was used for tree-based modeling and prediction.

1) The 142K Data Set

The purpose of our exploratory data analysis is to understand the features related to repostability. Conceptually, one would like to focus on a set of microblogs, use their features as independent variables, and treat the reposts count as the dependent variable to be predicted from the samples. For our exploratory analysis, we arbitrarily chose a date and collected microblogs published since that date (12,928,673 microblogs published by about 100K users since January 1, 2012). Reposts were filtered from this sample to yield 888,888,9 original microblogs. We then randomly selected 142,498 microblogs that had been posted for more than one day. Of these 142K microblogs, 117,498 were reposted more than once and 130 were reposted over 10,000 times.

2) The 294 Repost-Lists Data Set

After downloading the repost-lists of the 294 original microblogs, we found that 41 of them contain fewer than 10K reposts, which may result from deletion of the original microblog or occasional unavailability of the Weibo open platform. We filtered out these lists, which yielded a set of 253 repost-lists.

B. Data Analytics and Feature Extraction

From the 142K data set, we extracted a set of language-independent features. We divided these features into two sets: the first concerns the content (HasPicOrNot, Modality, URL, Mention, MaxMediaWeight), and the second mainly relates to the authors (Followers, Followees, Favorites, Statuses, Activity, VerifiedOrNot, Days). The features are defined in TABLE I.

TABLE I. MICROBLOG FEATURES

Feature          Description
HasPicOrNot      Whether or not the microblog includes a picture
Modality         Number of media types included
MaxMediaWeight   Maximum media weight value
Mention          Sum of the followers counts of the users mentioned in a microblog
URL              Maximum value of the numerical properties of the URLs included in a microblog (clicks count, share count, and comments count)
Followers        Number of users who follow the author of a microblog
Followees        Number of friends that the author is following
Statuses         Number of microblogs generated by the author since the creation of the account
Favorites        Number of microblogs favorited by a user
Activity         Activity level of the author
VerifiedOrNot    Whether or not the author is verified
Days             Number of days since the author created the Sina Weibo account
Reposts          Number of reposts recorded for a given microblog

The Modality feature in TABLE I aims to capture the variety of media types; for example, if the content of a microblog contains image, audio, and video, its Modality is 3. This feature is extracted by pulling the URLs out of the content and detecting each URL's type via the Sina Weibo open API⁴. In Sina Weibo, a URL has one of five types: ordinary web page, video, music, activity, and voting. Each type has an associated value (ordinary web page is 0, video is 1, music is 2, etc.). According to our observation, voting and activity URLs are more attractive than video and music URLs, so we arbitrarily assign a weight to each media type (URL type): voting and activity are each assigned 3, video and music are each assigned 2, and ordinary web page is assigned 0. The MaxMediaWeight feature is then the maximum weight among the media types in a microblog.
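The two media-related features can be computed as in the short Python sketch below. It assumes the media and URL types of a microblog have already been detected; the string labels (`"voting"`, `"webpage"`, and so on) and the function names are hypothetical, while the weights are the ones stated in the text.

```python
# Weights stated in the text: voting and activity -> 3, video and music -> 2,
# ordinary web page -> 0. The string keys are illustrative labels only.
MEDIA_WEIGHTS = {"voting": 3, "activity": 3, "video": 2, "music": 2, "webpage": 0}


def modality(media_types):
    """Number of distinct media types present in a microblog."""
    return len(set(media_types))


def max_media_weight(url_types):
    """Largest weight among the URL types in a microblog (0 if there are none)."""
    return max((MEDIA_WEIGHTS[t] for t in url_types), default=0)
```

For instance, a microblog containing an image, an audio clip, and a video has `modality(["image", "audio", "video"])` equal to 3, matching the example in the text.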

³ http://open.weibo.com/wiki/2/statuses/user_timeline

⁴ http://open.weibo.com/wiki/2/short_url/info.html

To obtain accurate reposts counts, we built the 142K data set by eliminating the microblogs that had been published for less than 24 hours.

1) Pearson Correlation Analysis

We first check the relevance between the extracted features and the reposts count using the Pearson correlation coefficient. The 142K data set was used in this experiment, and the result is shown in Fig. 3. We observe that the reposts count is most strongly associated with MaxMediaWeight among the content features and with Followers among the context features. Interestingly, Followees and VerifiedOrNot are marginally negatively associated with reposting. However, the Pearson correlation coefficients of all features are less than 0.2, which suggests that there is little linear correlation between the features and the repostability of given content, and that further underlying factors remain to be uncovered.

2) PCA Analysis

PCA (Principal Component Analysis) is widely used to reveal the underlying structure that maximally accounts for the variance in a data set. We used the varimax rotation technique to produce orthogonal factors. TABLE II presents the factors (principal components) extracted by PCA. There are many criteria for determining the proper number of factors, such as the Kaiser criterion, the variance-explained criterion, and the scree plot test [31]. Each criterion suggests retaining at least 5 factors (the first 5 factors). This indicates that our extracted features are suitable independent variables for explaining the repostability of a microblog.

TABLE II. PRINCIPAL COMPONENTS FROM THE ANALYSIS

Factor  Eigenvalue  % Variance  Cumulative % Variance
1       2.4349      18.7303     18.7303
2       1.6076      12.3663     31.0966
3       1.3361      10.2778     41.3744
4       1.1371      8.7468      50.1212
5       1.0056      7.7355      57.8567
6       0.9945      7.6501      65.5068
7       0.9294      7.1492      72.6561
8       0.9157      7.0439      79.7000
9       0.7763      5.9715      85.6715
10      0.6591      5.0699      90.7414
11      0.5656      4.3508      95.0923
12      0.3624      2.7875      97.8798
13      0.2756      2.1202      100.0000

TABLE III. RESULTS OF LINEAR REGRESSION MODEL

Variable         Estimate    Confidence interval
(Constant)       -134.1185   [-183.8915, -84.3454]
HasPicOrNot      -148.8293   [-182.4933, -115.1652]
Modality         -42.5813    [-62.6177, -22.5448]
MaxMediaWeight   249.7550    [219.8201, 279.6899]
Mention          0.00002     [0.00001, 0.00003]
URL              0.0003      [0.0002, 0.0004]
Followers        0.0002      [0.0002, 0.0002]
Followees        -0.0146     [-0.0260, -0.0032]
Statuses         0.0010      [0.0001, 0.0019]
Favorites        0.0214      [0.0133, 0.0295]
Activity         20.9612     [10.1684, 31.7539]
VerifiedOrNot    -20.5952    [-38.6995, -2.4909]
Days             0.0461      [0.0120, 0.0803]

R² = 0.0296, F = 362.4810, p = 0.0000, s² = 1.3722 × 10⁶
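As a quick check of the factor-retention statement in the PCA analysis, the Kaiser criterion (retain factors whose eigenvalue exceeds 1) can be applied directly to the eigenvalues listed in TABLE II. The Python sketch below is illustrative only; the function names are hypothetical, and the cumulative variance is recomputed from the rounded eigenvalues, so it agrees with the table only up to rounding.

```python
# Eigenvalues copied from TABLE II.
EIGENVALUES = [2.4349, 1.6076, 1.3361, 1.1371, 1.0056, 0.9945, 0.9294,
               0.9157, 0.7763, 0.6591, 0.5656, 0.3624, 0.2756]


def kaiser_count(eigenvalues):
    """Number of factors with eigenvalue greater than 1 (Kaiser criterion)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def cumulative_variance(eigenvalues):
    """Cumulative percentage of variance explained by the leading factors."""
    total = sum(eigenvalues)
    running, out = 0.0, []
    for ev in eigenvalues:
        running += ev
        out.append(100.0 * running / total)
    return out


retained = kaiser_count(EIGENVALUES)                         # factors kept
explained = cumulative_variance(EIGENVALUES)[retained - 1]   # % variance they cover
```

With these inputs the criterion keeps the first five factors, which together account for a little under 58% of the variance, in line with the cumulative column of TABLE II.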

C. Results of the Linear Regression Model

Linear regression is a traditional approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. In our scenario, y is the Reposts feature and X consists of the first twelve features in TABLE I. We used the least-squares approach to fit our linear regression model on the 142K data set. The results are shown in TABLE III (with confidence level 1 - α = 0.99). They agree with the finding that the author feature Followers has a significant effect on repost probability. The content features Mention and URL also have significant effects on repostability. The p value is less than α, suggesting that the linear regression model holds. The statistic s² is very high, which indicates that exogenous influence plays an important role in reposting in Sina Weibo.
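To make the fitted model concrete, the sketch below evaluates the linear predictor using the point estimates from TABLE III. The function name `predict_reposts` is hypothetical, and treating features that are not supplied as 0 is a convenience of this sketch rather than anything stated in the paper.

```python
# Point estimates copied from TABLE III.
COEFFICIENTS = {
    "HasPicOrNot": -148.8293, "Modality": -42.5813, "MaxMediaWeight": 249.7550,
    "Mention": 0.00002, "URL": 0.0003, "Followers": 0.0002, "Followees": -0.0146,
    "Statuses": 0.0010, "Favorites": 0.0214, "Activity": 20.9612,
    "VerifiedOrNot": -20.5952, "Days": 0.0461,
}
INTERCEPT = -134.1185


def predict_reposts(features):
    """Fitted value of the linear model; features not supplied are treated as 0."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value
                           for name, value in features.items())
```

Under these estimates, one million additional followers raises the fitted reposts count by only about 200, and the fitted value can be negative for small accounts, which is consistent with the low R² reported in the table.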

Figure 3. Correlation coefficients between reposts count and extracted features

D. Results of the Tree-Based Model

As mentioned in Section 3.1, our RepostsTree is based on the relationships among the users who participated in the reposting of a given microblog in Sina Weibo. To make our relationship map cover as many follower/followee ties as possible, we collect the followee IDs of the users involved in the reposting process. The Sina Weibo open API⁵ returns at most 5,000 followee IDs per user, and, as we observed, users of microblogging sites rarely follow more than 5,000 users. Hence, we collect the follower/followee relationships from the following side (that is, we collect the followee IDs of the relevant users), and the symmetry of the follower/followee relationship in microblogging systems [3] makes this approach feasible.

To simulate the real situation in which reposts accumulate over time, we carry out our experiments recursively. For instance, after a microblog M has been published for one hour, we use our tree-based model to obtain the compound Poisson distribution set P (see Section 3.3) from the repost-list of M, calculate the predicted reposts count at, say, two hours, and compare the predicted value with the real reposts count. We then repeat this procedure to compare the dynamics of the predicted reposts with the real data we collected. According to Kwak et al. [20], the retweeting probability in Twitter drops rapidly as time passes, and more than 50% of retweets take place within one hour. Here we show our reposting prediction results within 72 hours, although our approach is not limited to 72 hours.

First, we test our tree-based prediction model on the 253 repost-lists at 2 hours, 10 hours, and 24 hours after the original microblog was published. The learning step is set to 1 hour, that is, Δt is 1 hour (see Section 3.3). The predictive error rates are illustrated in Fig. 4. Note that the error rate for case 139 (WeiboID = 3492391866790186) fluctuates severely, because too much data is missing from the repost-list of this microblog.
After eliminating this low-quality sample, we present the prediction error rate in Fig. 5. As time elapses, the predictive error rate drops, which suggests that our approach learns iteratively: the more empirical knowledge it has learned, the better the precision it obtains. So far, our approach has demonstrated its predictive precision only on popular content (with higher reposts counts), so we also need to investigate whether it works well on less popular content. We tested 182 unpopular microblogs with fewer than 10,000 recorded reposts; the result is shown in Fig. 6. The precision on unpopular content is also acceptable, although it is not as good as on popular content.

Figure 5. The mean value, median and variance of relative error rate on the 252 popular cases

Figure 6. The error rate on the 182 unpopular cases

Figure 4. The error rates for tree-based model on predicting the reposts count after the original microblog published for 2, 10 and 24 hours

Figure 7. Reposts count as a function of time with learning step 1 hour

⁵ http://open.weibo.com/wiki/2/friendships/friends/ids

Figure 8. The inter-event time delay distributions for popular cases

E. Case Analysis

We randomly selected a specific microblog (Weibo ID = 3467177816566372) to evaluate the effectiveness of the tree-based model in predicting reposting dynamics. First, we use the time series data within the first hour as the training set and the next hour as the test set. Then we add the second hour of data to the training set and use the third hour as the test set, and so on, so our learning approach works in an iterative way. From Fig. 7, we can see that our model predicts the reposting dynamics of a given piece of content well. In the later phase of reposting, the estimated reposts count exceeds the real data. This is because the time delay between consecutive reposts becomes much longer as time elapses, so the heavy tails dominate the Poisson process (see Section 4.6). In Fig. 7, the learning step is again set to one hour.
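One iteration of this train-then-test loop can be sketched in Python as follows. The paper does not spell out the estimator, so the sketch makes explicit assumptions: each sub-time series is treated as a homogeneous Poisson process that starts at the delay of the node heading it, its rate is estimated by the standard maximum likelihood estimate (events observed divided by the length of the observation window), and the prediction is the sum of each process's expected count at t0 + Δt. The function names and the input layout are hypothetical.

```python
def fit_rates(subsets, t0):
    """Estimate one Poisson rate per sub-time series observed up to time t0.

    `subsets` is a list of (start, delays) pairs, where `start` is the delay
    of the node that heads the subset and `delays` are the repost delays in
    it. Assumption: the MLE of a homogeneous Poisson rate is k / T for k
    events observed over a window of length T.
    """
    rates = []
    for start, delays in subsets:
        window = t0 - start
        if window > 0:  # skip subsets that have not started before t0
            observed = sum(1 for d in delays if d <= t0)
            rates.append((start, observed / window))
    return rates


def predict_count(rates, t0, dt):
    """Predicted cumulative reposts at t0 + dt, summed over all processes."""
    return sum(rate * (t0 + dt - start) for start, rate in rates)


def relative_error(predicted, actual):
    """Relative error rate of a prediction against the real reposts count."""
    return abs(predicted - actual) / actual
```

Repeating `fit_rates` and `predict_count` with t0 advanced by one learning step each time reproduces the expanding training window described above. Because the estimated rates stay constant beyond t0, a sketch like this would also overestimate once real reposting slows down.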

F. Why Does Poisson Work?

As mentioned in Section 2, online human activities follow non-Poisson statistics, which are described by heavy-tailed distributions. The inter-event time distribution of a Poisson process decays exponentially, forcing consecutive events to follow each other at relatively regular time intervals and forbidding very long waiting times. In contrast, slowly decaying heavy-tailed processes allow for very long periods of inactivity that separate bursts of intensive activity [28]. We conducted experiments to investigate the inter-event distributions of reposting behaviors in Sina Weibo (Fig. 8). Indeed, the time delays between consecutive reposts approximately follow a power law, which is a typical characteristic of heavy-tailed distributions. However, as Fig. 8 shows, in most of the reposting cases the heavy tails appear mainly in the final phase, which differs from other online human behaviors [28]; this is why our approach works. We also conducted experiments on unpopular posts, where this phenomenon is less pronounced than for popular ones.

V. CONCLUSION

In this paper, we examined the factors that affect the repostability of content in Sina Weibo using Pearson correlation analysis, PCA, and linear regression. In general, we find that, in addition to intrinsic features such as content and context, exogenous elements that traditional correlation analysis fails to capture also play important roles in users' reposting behavior on microblogging sites. We proposed a dynamic approach to predicting users' reposting patterns that takes user follower/followee relationships into account. Experimental results validate the effectiveness of our model. There are still limitations that need further study. For example, our predictive method works in an iterative manner, which may result in some loss of performance. We plan to extend our work with model-free methods, which make no assumptions about the patterns of users' reposting behaviors or the structure of the underlying statistical process.

ACKNOWLEDGMENT

This work was partially supported by the National Basic Research Program of China (No. 2012CB316400), the National Natural Science Foundation of China (No. 61222209, 61103063), and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20126102110043).

REFERENCES

[1] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. "Want to be retweeted? Large scale analytics on factors impacting retweet in twitter network." In SOCIALCOM '10, pp. 177-184.
[2] C. Castellano, S. Fortunato, and V. Loreto. "Statistical physics of social dynamics." Rev. Mod. Phys., vol. 81(2), May 2009, pp. 591-646.
[3] E. Sandes, W. Li, and A. d. Mello. "Logical model of relationship for online social networks and performance optimizing of queries." WISE 2012, pp. 726-736.
[4] H. Zhang, Q. Zhao, H. Liu, K. Xiao, J. He, X. Du, and H. Chen. "Predicting retweet behavior in weibo social network." WISE 2012, pp. 737-743.
[5] F. A. Haight. Handbook of the Poisson Distribution. Wiley, New York, 1967.
[6] E. Adar and L. A. Adamic. "Tracking information epidemics in blogspace." In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence, 2005, pp. 207-214.
[7] M. Gomez Rodriguez, J. Leskovec, and A. Krause. "Inferring networks of diffusion and influence." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '10), pp. 1019-1028.
[8] D. Liben-Nowell and J. Kleinberg. "Tracing information flow on a global scale using internet chain-letter data." Proceedings of the National Academy of Sciences, 105(12):4633, 2008.
[9] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. "Everyone's an influencer: quantifying influence on twitter." In 3rd ACM Conference on Web Search and Data Mining, Hong Kong, 2011. ACM Press.
[10] J. Leskovec, L. A. Adamic, and B. A. Huberman. "The dynamics of viral marketing." In Proceedings of the 7th ACM conference on Electronic commerce, pp. 228-237.
[11] E. S. Sun, I. Rosenn, C. A. Marlow, and T. M. Lento. "Gesundheit! modeling contagion through facebook news feed." In Proceedings of the 3rd Int'l AAAI Conference on Weblogs and Social Media, 2009.
[12] E. Bakshy, B. Karrer, and L. Adamic. "Social influence and the diffusion of user-created content." In Proc. of the 10th ACM conference on Electronic commerce, pp. 325-334.
[13] L. Yu, S. Asur, and B. A. Huberman. "What trends in Chinese social media." The 5th SNA-KDD Workshop, 2011.
[14] Q. Gao, F. Abel, G. Houben, and Y. Yu. "A comparative study of users' microblogging behavior on sina weibo and twitter." UMAP 2012, pp. 88-101.
[15] Y. Qu, C. Huang, P. Zhang, and J. Zhang. "Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake." CSCW 2011, pp. 25-34.
[16] S. Chen, H. Zhang, M. Lin, S. Lv.
“Comparison of microblogging service between sina weibo and twitter.” In Proc. of the 2011 International Conference on Computer Science and Network Technology (ICCSNT), pp. 2259-2263. L. Liu, K. Jia. “Detecting spam in Chinese microblogs - a study on sina weibo.” In Proc. of the 8th International

[18]

[19]

[20]

[21]

[22]

[23]

[24] [25]

[26]

[27]

[28] [29]

[30] [31]

969

Conference on Computational Intelligence and Security(CIS), Nov. 2012, pp. 578-581. L. L. Yu, S. Asur, B. A. Huberman. “Artificial inflation: the true story of trends in Sina Weibo.” CoRR abs/1202.0327 (2012) J. Chen, J. She. “An analysis of verifications in microblogging social networks - sina weibo.” In Proc. of the 32nd International Conference on Distributed Computing Systems Workshops (ICDCSW), June 2012, pp. 147-154. H. Kwak, C. Lee, H. Park, and S. Moon. “What is twitter, a social network or a news media?” In Proc. of the 19th International World Wide Web Conference, April 2010, pp. 591-600. F. Wu. and B. A. Huberman. “Novelty and collective attention.” Proc. of the National Academy of Sciences, pp. 104, 45, Nov, 2007. S. Golder. “Tweet , tweet , retweet: conversational aspects of retweeting on twitter.” In 43rd Hawaii International Conference on System Sciences (HICSS), IEEE Press, Jan. 2010, pp. 1–10. W. Galuba and K. Aberer. “Outtweeting the twitterers – predicting information cascades in microblogs.” In Conference on Online social networks (WOSN), 2010. D. Zarrella. The Science of ReTweets, 2009. Z. Yang, J. Guo, K. Cai, J. Tang, J. Li, L. Zhang, and Z. Su. “Understanding retweeting behaviors in social networks.” In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM’10), pp. 1633-1636. H. K. Peng, J. Zhu, D. Piao, R. Yan, Y. Zhang. “Retweet modeling using conditional random fields”. ICDM’11 Workshops on Data Mining Technologies for Computational Collective Intelligence. Dec, 2011. K. C. Sia, C. J, K. Hino, Y. Chi, S. Zhu, and B. L. Tseng. “Monitoring RSS Feeds Based on User Browsing Pattern,” In Proceedings of the International Conference on Weblogs and Social Media, Mar. 2007, pp. 161–168. A.-L. Barabási. “The origin of bursts and heavy tails in human dynamics.” Nature. Vol. 435, May 2005, pp. 207–211. R. Dean Malmgren, Daniel B. Stouffer, Adilson E. Motter, Luis A. Nunes Amaral. 
“A Poissonian explanation for heavytails in e-mail communication.” CoRR abs/0901.0585 (2009) G. Szabó, B. A. Huberman. “Predicting the popularity of online content.” Commun. ACM (CACM) 53(8):80-88 (2010) https://en.wikipedia.org/wiki/Principal_component_analysis