Hot Post Prediction in BBS Forums Based on Multifactor Fusion Fei Xiong, Yun Liu, Jiang Zhu, Jie Lian, Ying Zhang

Hot Post Prediction in BBS Forums Based on Multifactor Fusion

1Fei Xiong, 2Yun Liu, 3Jiang Zhu, 4Jie Lian, 5Ying Zhang
1, 2 Corresponding Author, 4 Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing Jiaotong University, Beijing 100044, China, [email protected], [email protected], [email protected]
3, 5 Carnegie Mellon University Silicon Valley, Moffett Field, CA 94035, USA, [email protected], [email protected]

Abstract
Bulletin Board Systems (BBS) have become more and more common in the age of Web 2.0. A BBS may attract a large number of users and provoke public responses. To predict online emergencies in BBS forums, hot posts need to be detected as early as possible. In this paper, we put forward a fine-grained predictive model for hot posts. Given a post, we would like to predict its hotness using its early data. Based on an analysis of data collected from a real BBS, topic hotness is measured by the extent and intensity of influence. Factors that may affect the diffusion of posts are extracted, including content influence, short-term trend influence and time influence. A logistic regression model is used to combine these factors into the probability that the input post becomes hot. Simulation results show that our model can detect a majority of hot posts within a short time after the posts are published, with good precision.

Keywords: BBS, Hot Post Prediction, Feature Fusion, Topic Detection

1. Introduction
With the development of Web 2.0 [1], the Internet has become an important medium in people's lives. Many people surf the Internet, read news, and obtain information every day. Compared with traditional media, the Internet makes it easy and convenient for users to obtain information [2,3]. Social networks on the Internet provide a public and free place where people can express their ideas or emotions and communicate with each other. Because of these advantages, information in online social networks often diffuses faster than in the offline world.

As a typical representative of online media, BBS has demonstrated its strength as an effective new medium. It offers easy access for spreading and sharing information with the public. In BBS forums, after a user publishes a post, other users can reply to the post and discuss it with the author. The users that publish replies are called participants. All users, including anonymous users, can read posts, but only registered users can join in interactions and publish posts and replies. As users present their points of view without restriction in BBS, some hot posts attract a lot of users and have significant influence. If we are able to predict hot posts in advance, we can provide early warning of online emergencies.

Many related studies concentrate on topic detection and tracking (TDT). The task of TDT is to detect and organize documents about a certain topic from diverse sources of information. The clustering algorithm is very important for topic detection. By using temporal semantic references, a hierarchically applied incremental clustering algorithm can discover the structure of a topic [4]. The temporal references improve the performance of online detection systems. In Reference [5], similarity between user interests was characterized by interest weights, and a hierarchy of classes for a user-interest ontology was generated to detect innovative topics. Compared with the classic vector space model used in TDT analysis, a temporal discriminative probability model with feature selection and temporal weights [6] is an effective method of topic detection. Topics from special sources of information, such as chat messages, were also studied [7]. In fact, documents about a topic often evolve and shift over time. To improve the accuracy of topic-model-based detection, the authors of [8,9] proposed a topic transition model based on the hidden Markov model, and incorporated topic discovery and document classification with topic transition. Research on TDT may help to understand the formation and evolution of hot topics.

Journal of Convergence Information Technology(JCIT) Volume7, Number12, July 2012 doi:10.4156/jcit.vol7.issue12.16


What makes a topic a hot topic? There are many definitions. In Reference [10], topics were sorted in decreasing order of interest predicted by a trend index and information retrieval measures. From the time series of topics, the average percentage of increase and the non-interpolated average precision rate were used as the predictor. Topics in professional blogs were chosen by the number of related blog posts, comments and clicks, combined with the user participation degree and the opinion communication degree [11]. In [12], the authors argued that a hot topic should appear in many stories on many news channels, and should have strong continuity. They represented hot topics as multi-dimensional sentence vectors. Hot words [13] were extracted according to their frequency in the collection and the variation of their usage over time. Based on these hot words, documents were clustered and hot topics were selected by analyzing topic trends. In [14], hot topics in BBS were defined using aging theory, and a specific form of energy function was proposed to rank the hotness of topics. To date, however, hot topics do not have a uniform definition.

In this paper, we concentrate on the scope and intensity of the influence of posts in BBS, so we emphasize the participants of each post, and try to detect hot posts as early as possible. We extract features from the data of each post in its first several hours, and fuse the features to predict whether the post will become hot. The whole data set is split into two parts to train and verify our model.

The remainder of the paper is organized as follows. In Section 2, we describe the data set used in our work. In Section 3, we define hot posts for BBS. Section 4 discusses the features that may affect the diffusion of posts, and Section 5 fuses these features into a probability model. Experiment results are presented in Section 6. We conclude the paper in Section 7.

2. Data Acquisition
Tianya BBS (www.tianya.cn) is one of the most famous online forums in China. As of August 2011, more than 56 million users had joined this network, and the number of active online users stays above one million. Because of its huge number of participants, Tianya BBS is often an important source of online emergencies. Many events that happen in the offline world spread first in Tianya BBS. Research on information diffusion in Tianya BBS can help us to better understand other social networks.

We collected data from the economic board of Tianya with our directed crawler. Both posts and replies were downloaded, as well as user information. After more than eight hours of crawling, we had downloaded 11011 posts and more than 300 thousand corresponding replies. Duplicate and redundant posts were removed, and spam posts were also excluded. The final number of posts collected is 9994.

3. Definition of Hot Posts
Hot posts attract a lot of attention and provoke public responses within a short time. Since many users take part in discussions on hot posts, these posts have wide influence. The impact of hot posts is not limited to online networks, but extends to the offline world. Hot posts generally have a large number of reads and replies; however, a post with plentiful replies is not necessarily a hot post. It may instead be a serialized story or a local discussion. Other posts attract a few users each day and last a long time. Although these posts may eventually accumulate a great many participants, they are not likely to cause a conspicuous effect or an online emergency. They do not diffuse explosively, and should not be treated as hot posts.

We use the maximal number of daily participants of each post to measure its hotness. Figure 1 illustrates the relationship between the number of reads and the maximum daily participants for different posts in Tianya BBS. Many users do not reply to a post but have read it, implying that they are influenced by the post. The number of reads clearly increases with the maximal number of daily participants. From Fig. 1, if 20 participants reply to a post on one day, at least 1000 anonymous users have read the post. Posts that have more than 50 daily participants attract 10 thousand reads on average. Therefore, the hotness of posts can be reflected by the maximum number of daily participants. We define a threshold value on this quantity: if the maximum number of daily participants for a post is above the threshold, the post is considered a hot post.

Figure 2 shows the distribution of maximum daily participants. The distribution decays as a power law, and the power exponent is -1.6518. Usually fewer than 10 users reply to a post on a day, but a small proportion of posts have more than 20 daily participants. In particular, the number of daily participants for several posts is over 1000, meaning they have the widest influence.

Figure 1. The number of reads for each post versus its maximal number of daily participants. [Log-log plot; x-axis: number of maximum daily participants; y-axis: number of reads.]

Figure 2. Distribution of maximal number of daily participants for different posts. [Log-log plot; x-axis: number of maximum daily participants; y-axis: proportion of posts.]
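As an illustration, the hotness measure defined in this section can be sketched in Python. The record layout `(post_id, user_id, timestamp)`, the function names, and the choice of calendar days as the daily unit are assumptions made for this sketch; the paper does not specify them.

```python
from collections import defaultdict
from datetime import datetime


def max_daily_participants(replies):
    """Return {post_id: maximal number of distinct repliers on any single day}.

    `replies` is an iterable of (post_id, user_id, timestamp) tuples. This
    layout is hypothetical, and "day" is taken to mean a calendar day.
    """
    users_per_day = defaultdict(set)
    for post_id, user_id, ts in replies:
        users_per_day[(post_id, ts.date())].add(user_id)

    hotness = defaultdict(int)
    for (post_id, _day), users in users_per_day.items():
        hotness[post_id] = max(hotness[post_id], len(users))
    return dict(hotness)


def label_hot(hotness, threshold=20):
    """A post is hot when its maximum daily participant count is above the threshold."""
    return {post_id: count > threshold for post_id, count in hotness.items()}


replies = [
    ("p1", "u1", datetime(2011, 8, 1, 9)),
    ("p1", "u2", datetime(2011, 8, 1, 10)),
    ("p1", "u1", datetime(2011, 8, 1, 11)),  # same user, same day: counted once
    ("p1", "u3", datetime(2011, 8, 2, 9)),
    ("p2", "u1", datetime(2011, 8, 1, 9)),
]
hotness = max_daily_participants(replies)
labels = label_hot(hotness, threshold=1)  # tiny threshold, only for this toy data
```

Counting distinct users rather than replies keeps a long exchange between two people from looking like wide participation, which matches the distinction drawn above between hot posts and local discussions.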

4. Feature Extraction
We extract factors that are related to the attraction of posts. Whether a post will attract abundant participants depends on how interesting its content is and on user activity. We suppose that a post's hotness is decided by all of these features together. Since the features will be integrated in a model, they should be quantified and separable. Features can be divided into three groups: content influence, short-term trend influence and time influence. We first describe the observations on which our features are based, followed by the respective features.

Content influence. Users browse the forum, select the posts they are most interested in, and read and reply to them. If a post belongs to a hot topic, or has been read or replied to by many users that share the same interest, the post is more likely to become hot. On the contrary, posts that belong to cold topics can hardly attract participants.

Short-term trend influence. The online forum is a public environment. After a user replies to a post, the post is pushed to the top of all posts in the forum, and is more likely to be noticed by new users. Posts that attract more participants have more opportunities to be pushed to the front and attract new replies more easily. This phenomenon is called "rich get richer", and leads to a scale-free distribution. We investigated the distribution of the number of reads and replies in Tianya BBS, and found that it follows a power law [15]. Therefore, we use the read and reply characteristics of a post in its early stage to assess the spreading trend of the post.

Time influence. Users' read and reply behavior coincides with their daily habits. When there are many active users in the network, posts published at that time diffuse faster. It has been shown that a large proportion of hot posts are concentrated at about 12:00 p.m. [16], while posts created late at night are hard to spread widely.

4.1. Content Influence
Topic cluster. The posts belong to different topic clusters. Since the data were collected from the economic board of Tianya BBS, each cluster represents a kind of economic activity, consisting of original reports, discussions and subsequent reports. Reports on the course of an economic or social event should be assigned to the same cluster, tracking the evolution of the event. For instance, posts about unemployment of workers and bankruptcy of companies in 2008 are assigned to the financial crisis cluster. Topic clusters differ in their attraction for users. Some topic clusters, such as economic crisis, taxes and tariffs, and occasional accidents, tend to attract more users. On the contrary, other clusters such as stock investment, macroeconomic regulation and control, and economic analysis seem unlikely to become hot, since only very small groups of people care about these special or technical topics.

Posts are assigned to topic clusters based on content similarity, which is measured by the cosine of the angle between the vectors of two posts. Texts written in some languages, such as Chinese and Korean, do not use blank spaces to separate words as English does. Therefore, we first need to segment words from continuous text. The ICTCLAS toolkit [17] is used to segment the words in our data. According to the frequency of words occurring in different posts, each post $V$ is converted to a word vector $(w_1, w_2, \ldots, w_n)$. The weight $w_i$ represents the TF-IDF value of the $i$-th word in post $V$. We use the K-Means algorithm for topic clustering based on the cosine distance between posts. As some posts are very long, only the 2000 words with the highest weight values are kept in each post.

As shown in Fig. 3, the distribution of the average number of participants in each cluster follows two different scalings. The proportion of clusters first increases with the average number of participants; when the average number of participants is above 6, the proportion of clusters decays as a power law. Most clusters have fewer than 10 participants per post on average, but a tiny fraction of clusters attract more than 20 participants. Posts from popular clusters may be more intriguing, and hence the cluster characteristic can be a measure of a post's attraction. If two posts present the same event and are assigned to the same cluster, the earlier post is likely to draw more attention. We therefore obtain two cluster features: the average number of participants for the cluster that a post belongs to, and the rank of the post's publishing time within its cluster.

Figure 3. Distribution for average number of participants over posts of each cluster. [Log-log plot; x-axis: average number of participants; y-axis: proportion of clusters.]
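The vectorisation and similarity steps described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: posts are already segmented into tokens (the paper uses ICTCLAS for this), the weighting is raw term frequency times `log(N / df)` (the paper does not give its exact TF-IDF variant), and the function names are hypothetical. The K-Means step itself is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_terms=2000):
    """Convert tokenised posts into sparse TF-IDF vectors (term -> weight dicts).

    `docs` is a list of token lists. The weighting tf * log(N / df) is one
    common TF-IDF variant, chosen here as an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Keep only the highest-weighted terms, mirroring the 2000-word cap.
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
        vectors.append(dict(top))
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["bank", "crisis", "unemployment"],
    ["crisis", "bankruptcy", "unemployment"],
    ["stock", "investment", "analysis"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # share two terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # share no terms
```

Sparse dictionaries keep the sketch short; a clustering step would then group posts whose pairwise cosine similarity is high.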


Title and words. In online forums, users can only see the titles of posts at first. If they find the title of a post interesting enough, they click the title to enter the post page and read the main body. Thus an appealing title may get more clicks and prompt more users to interact with the post. Nominal entities and sentiment expressions may strike a chord with users. We therefore include a series of such features: whether the title contains nominal entity words; how frequently the nominal entities appear in the title; whether the title contains sentiment words; how frequently these sentiment words appear in the title. In addition, if named entities such as the names of famous people or corporations appear in the main body of a post, users may recognize these people or corporations, and may be drawn to the post. The following features are also included: the number of nouns for personal names in the main body of a post; the number of nouns for places in the main body; the number of nouns for company or community names; the number of sentiment words that appear in the main body. In all, 4 title features and 4 main body features are included in this part. Overall, content influence accounts for 2 cluster features and 8 word features.
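A minimal sketch of the eight title and body features listed above is given here. The lexicons (`entity_words`, `sentiment_words`, `person_names`, `place_names`, `org_names`) are hypothetical inputs, and "how frequently" is interpreted as the share of title tokens that match, which is an assumption; the paper does not state how its entity and sentiment words were obtained or normalised.

```python
def word_features(title_tokens, body_tokens, entity_words, sentiment_words,
                  person_names, place_names, org_names):
    """Return the 4 title features and 4 main-body features as a dict.

    All lexicons are caller-supplied sets (hypothetical). Title frequencies
    are computed as matching tokens divided by title length (an assumption).
    """
    def count(tokens, lexicon):
        return sum(1 for t in tokens if t in lexicon)

    def frequency(tokens, lexicon):
        return count(tokens, lexicon) / len(tokens) if tokens else 0.0

    return {
        "title_has_entity": int(count(title_tokens, entity_words) > 0),
        "title_entity_freq": frequency(title_tokens, entity_words),
        "title_has_sentiment": int(count(title_tokens, sentiment_words) > 0),
        "title_sentiment_freq": frequency(title_tokens, sentiment_words),
        "body_person_count": count(body_tokens, person_names),
        "body_place_count": count(body_tokens, place_names),
        "body_org_count": count(body_tokens, org_names),
        "body_sentiment_count": count(body_tokens, sentiment_words),
    }


features = word_features(
    title_tokens=["Apple", "layoffs", "shock"],
    body_tokens=["Apple", "Beijing", "Jobs", "sad", "Apple"],
    entity_words={"Apple"},
    sentiment_words={"shock", "sad"},
    person_names={"Jobs"},
    place_names={"Beijing"},
    org_names={"Apple"},
)
```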

4.2. Short-term Trend Influence
Assuming we have the number of reads and replies for each post in its early stage, we use this short-term data to infer the diffusion trend of the post in the following period. Some posts that hardly attract users early on may still become hot after several hours. If there are only a few posts in a period of time, every post in that period is more easily noticed by users. We use the data within the first T hours after each post is created, and define the features as follows: the number of replies within the first T hours; the number of participants within the first T hours; the total number of posts in the first T hours; the average number of participants over the posts in the first T hours. In total, short-term trend influence accounts for 4 features.
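The four short-term trend features can be sketched as below. The post representation (a dict with `created` and `replies`) is hypothetical, and the last two features are interpreted as referring to the posts created on the board during the same T-hour window, which is one plausible reading of the description rather than a confirmed detail.

```python
from datetime import datetime, timedelta


def trend_features(post, all_posts, window_hours=4):
    """Compute the 4 short-term trend features for `post`.

    Each post is a dict with 'created' (datetime) and 'replies', a list of
    (user_id, timestamp) pairs. This layout is an assumption of the sketch.
    """
    start = post["created"]
    end = start + timedelta(hours=window_hours)

    def early_participants(p):
        # Distinct users replying to post p inside the window [start, end).
        return {user for user, ts in p["replies"] if start <= ts < end}

    own_replies = [ts for _user, ts in post["replies"] if start <= ts < end]
    # Posts competing for attention: those created inside the same window.
    concurrent = [p for p in all_posts if start <= p["created"] < end]
    n_posts = len(concurrent)
    avg = (sum(len(early_participants(p)) for p in concurrent) / n_posts
           if n_posts else 0.0)
    return {
        "replies": len(own_replies),
        "participants": len(early_participants(post)),
        "total_posts": n_posts,
        "avg_participants": avg,
    }


t0 = datetime(2011, 8, 1, 12)
hour = timedelta(hours=1)
post_a = {"created": t0, "replies": [("u1", t0 + hour), ("u2", t0 + 2 * hour),
                                     ("u1", t0 + 3 * hour), ("u3", t0 + 5 * hour)]}
post_b = {"created": t0 + hour, "replies": [("u4", t0 + 2 * hour)]}
post_c = {"created": t0 + 6 * hour, "replies": []}
trend = trend_features(post_a, [post_a, post_b, post_c], window_hours=4)
```

In the toy data, the reply from `u3` arrives after the 4-hour window and `post_c` is created after it, so neither contributes to the features of `post_a`.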

4.3. Time Influence
The diffusion of a post is closely correlated with the time of day and time of week when the post is published. There are more active users in the BBS forum around noon (12:00 p.m.), while users seldom interact late at night. We calculate the number of participants at each hour of the day, averaged over all days of the training data, since more active users may lead to more replies. Besides this daily pattern, we also notice that users interact with each other more frequently at weekends than on weekdays. Each day of the week is split into four periods, that is, 0:00 a.m. – 6:00 a.m., 6:00 a.m. – 12:00 p.m., 12:00 p.m. – 6:00 p.m., and 6:00 p.m. – 12:00 a.m. Thus a week consists of 28 periods, and the average number of participants in each period is computed as the week feature. We then obtain 2 features for a post: the average number of participants at its publishing hour of the day and in its publishing period of the week.

The BBS network is a dissipative structural system that is far from equilibrium. Many new users enter the network every day, while some old users lose interest and withdraw from the BBS. We find that the number of participants in the system keeps growing, and the proportion of hot posts increases slowly over time. The distribution of user activity follows a power law [15]. Since the hotness of posts is also related to the group of active users, we introduce 2 further features: the total count of participants during the past day and during the past week.
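The hour-of-day and period-of-week averages can be sketched as follows. The helper names and the `(user_id, timestamp)` reply layout are assumptions. As a simplification, time slots with no replies at all on a given day or week are left out of the average instead of being counted as zero.

```python
from collections import defaultdict
from datetime import datetime


def week_period(ts):
    """Map a timestamp to one of 28 weekly periods (7 days x four 6-hour blocks).

    Monday 0:00-6:00 is period 0 and Sunday 18:00-24:00 is period 27.
    """
    return ts.weekday() * 4 + ts.hour // 6


def average_participants(replies, slot_of, occurrence_of):
    """Average number of distinct participants per occurrence of each time slot.

    `slot_of` maps a timestamp to a slot (e.g. hour of day); `occurrence_of`
    identifies the concrete day or week the timestamp falls in. Occurrences
    without any reply are not counted (a simplification of this sketch).
    """
    users = defaultdict(set)
    for user_id, ts in replies:
        users[(slot_of(ts), occurrence_of(ts))].add(user_id)
    totals, counts = defaultdict(int), defaultdict(int)
    for (slot, _occurrence), members in users.items():
        totals[slot] += len(members)
        counts[slot] += 1
    return {slot: totals[slot] / counts[slot] for slot in totals}


replies = [
    ("u1", datetime(2011, 8, 1, 12, 5)),   # Monday
    ("u2", datetime(2011, 8, 1, 12, 30)),
    ("u3", datetime(2011, 8, 1, 3, 0)),
    ("u1", datetime(2011, 8, 2, 12, 10)),  # Tuesday
]
hour_avg = average_participants(replies, lambda ts: ts.hour, lambda ts: ts.date())
period_avg = average_participants(
    replies, week_period, lambda ts: tuple(ts.isocalendar()[:2]))
```

A post's two time-of-publication features are then lookups into `hour_avg` and `period_avg` using its creation timestamp, with both tables computed from training data only.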

5. Feature Fusion
We now have 18 features for each post, and we use the logistic regression model [18] to combine them. Whether a post will become hot is denoted by the label $y$. We denote by $x$ the observation of a post in Tianya BBS. A set of features of $x$ is generated as the vector $\mathbf{h}(x)$, and the weight of feature $h_i(x)$ is $w_i$. Under the logistic regression model, the predicted hotness of post $x$ is

$$P(y = 1 \mid x) = \frac{1}{1 + \exp\left(-\mathbf{w}^{T}\mathbf{h}(x)\right)} = \frac{1}{1 + \exp\left(-\sum_i w_i h_i(x)\right)} \qquad (1)$$

In this model, if the weight $w_i$ of feature $i$ is positive, the probability $P(y = 1 \mid x)$ increases with this feature. If the probability is above 0.5, it is reasonable to consider the post a hot post.
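To make Eq. (1) concrete, the following sketch evaluates the probability and fits the weights on toy data. The training routine is plain batch gradient ascent on the log-likelihood, used here only for illustration; it is not the solver of reference [18], and the function names, learning rate and epoch count are assumptions.

```python
import math


def predict_hot_probability(weights, features):
    """Eq. (1): P(y = 1 | x) = 1 / (1 + exp(-sum_i w_i * h_i(x)))."""
    score = sum(w * h for w, h in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train(samples, labels, lr=0.1, epochs=500):
    """Fit weights by batch gradient ascent on the log-likelihood (illustrative)."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in zip(samples, labels):
            error = y - predict_hot_probability(weights, x)
            for i, h in enumerate(x):
                grad[i] += error * h
        weights = [w + lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


# Toy data: a constant bias feature plus one feature, e.g. early participants.
samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 4.0], [1.0, 5.0]]
labels = [0, 0, 1, 1]
weights = train(samples, labels)
p_quiet = predict_hot_probability(weights, [1.0, 0.0])
p_busy = predict_hot_probability(weights, [1.0, 5.0])
is_hot = p_busy > 0.5  # decision rule from the text
```

With all weights at zero the model outputs 0.5 for every post, so the 0.5 decision threshold corresponds to a weighted feature sum of zero.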

6. Experiment Results
To examine the model discussed above, we use the data of Tianya BBS to evaluate its performance. This data set includes 9994 posts and 48964 users. We select 4000 posts as the training data set, and treat the others as the testing data set to predict hot posts. The average number of participants for each cluster, and the average number of participants at each time of day or week, are calculated only over the training data. We set the time window T to 4 hours, and use the data in the first 4 hours of a post to predict whether it will become hot. The threshold value for hot post detection is varied from 20 to 50. For a threshold of 20, 910 of all posts are hot posts, and for a threshold of 50, only 213 posts become hot. Because of the limited size of the data set we have collected so far, we use cross-validation to evaluate the model. With a threshold of 20, 10-fold cross-validation over the training data achieves 95.325% accuracy, and with a threshold of 50 it achieves 98.05%. The recall, precision and accuracy of the estimation results are calculated over the whole testing data. In the trained logistic regression model, the weights of title and word features are positive, but those of global features about other posts, such as the total number of posts, are negative.

From Fig. 4, the accuracy of the prediction results is above 96%, and increases slightly with the threshold value. The recall is a little lower than the precision, for only a few posts can become hot. The precision is about 78%, nearly independent of the threshold. We notice that the recall drops visibly as the threshold increases. With a higher threshold value, extremely few posts are treated as hot posts, and some important posts may be neglected under this strict condition. The hot post samples in the training data are then not adequate to generate a predictive model, so more data containing hot posts are required for training.

To investigate the importance of multiple features for prediction, we consider only two features: the counts of replies and participants for each post during the first 4 hours. For a threshold of 20 with these two features, the recall over the testing data is 13.8%, and the precision is 12.12%, implying that these short-term trend features alone are not enough to distinguish hot posts.

To provide early warning of online emergencies, hot posts should be detected as soon as possible. Thus, we explore the dependence of our model on the time window T, as shown in Fig. 5. The recall of the prediction results increases with the period T, but the accuracy and precision remain almost unchanged. For T = 2, the recall is larger than 0.5, meaning that more than half of the hot posts can be detected by our model within two hours after the posts are published. In particular, about 75% of hot posts are correctly classified within their first 6 hours. However, when T = 2 and only the numbers of replies and participants are considered, the recall is less than 10%.
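The three evaluation measures used in this section can be computed as in the sketch below (the function name and the toy labels are illustrative only). The example also shows why accuracy alone says little when hot posts are rare: most posts are correctly predicted as not hot even when half of the hot posts are missed.

```python
def evaluate(predicted, actual):
    """Return precision, recall and accuracy for boolean hot-post predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Ten toy posts, two of them actually hot.
actual = [True, False, True] + [False] * 7
predicted = [True, True, False] + [False] * 7
scores = evaluate(predicted, actual)
```

Here one hot post is found, one is missed and one cold post is flagged, giving precision and recall of 0.5 while accuracy is still 0.8.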


Figure 4. Accuracy, recall, precision as a function of the threshold value. [Plot; x-axis: threshold value, ticks from 10 to 60; y-axis: performance, 0 to 1; curves: accuracy, recall, precision.]

Figure 5. Accuracy, recall, precision versus T for a threshold value of 20. [Plot; x-axis: T, ticks from 0 to 8; y-axis: performance, 0 to 1; curves: accuracy, recall, precision.]

7. Conclusions
In this paper, we proposed a predictive method to detect hot posts in BBS. We collected data from Tianya BBS and described hot posts in terms of their participants. We extracted a series of features related to topic hotness, covering content, short-term trend and time influence. These features were fused by the logistic regression model to predict the appearance of hot posts. We split the whole data set into training data and testing data, and ran a simulation study to examine the impact of the features. Experiment results show that our model is effective at identifying hot posts within a short time. In future work, we will collect more data to train our model and run simulations. Furthermore, we will investigate user behavior in BBS, and integrate our model with microscopic characteristics to improve prediction performance.


8. Acknowledgment
This work was partially supported by the State Natural Sciences Fund under Grant 60972012, the Beijing Natural Science Foundation under Grant 4112045, the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant 20100009110002, the Fundamental Research Funds for the Central Universities under Grant 2011YJS005, the Academic Discipline and Postgraduate Education Project of Beijing Municipal Commission of Education, the Service Business of Scientists and Engineers Project under Grant 2009GJA00048, and the Beijing Excellent Talents Training Fund.

9. References
[1] Fei Xiong, Yun Liu, Xiameng Si, Fei Ding, “Network Model with Synchronously Increasing Nodes and Edges Based on Web 2.0”, Acta Physica Sinica, vol. 59, no. 10, pp. 6889-6895, 2010.
[2] Wei-Feng Tung, Yu-Ren Chen, “User-Based Social Ranking Service Design for Tagging Search and Recommendation”, Journal of Convergence Information Technology, vol. 6, no. 10, pp. 385-390, 2011.
[3] Renjie Zhou, Huiqiang Wang, Guangsheng Feng, Bingyang Li, Wenjin Jin, Xu Lu, “Research on Patterns of User Interactions and Media Popularity on Online Social Networks”, Journal of Convergence Information Technology, vol. 7, no. 2, pp. 269-276, 2012.
[4] A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz-Shulcloper, “Detecting Events and Topics by Using Temporal References”, Advances in Artificial Intelligence, vol. 2527, pp. 11-20, 2002.
[5] Makoto Nakatsuji, Makoto Yoshida, Toru Ishida, “Detecting Innovative Topics Based on User-interest Ontology”, Journal of Web Semantics, vol. 7, no. 2, pp. 107-120, 2009.
[6] Qi He, Kuiyu Chang, Ee-Peng Lim, Arindam Banerjee, “Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1795-1808, 2010.
[7] Haichao Dong, Siu Cheung Hui, Yulan He, “Structural Analysis of Chat Messages for Topic Detection”, Online Information Review, vol. 30, no. 5, pp. 496-516, 2006.
[8] Jianping Zeng, Shiyong Zhang, “Variable Space Hidden Markov Model for Topic Detection and Analysis”, Knowledge-Based Systems, vol. 20, no. 7, pp. 607-613, 2007.
[9] Jianping Zeng, Shiyong Zhang, “Incorporating Topic Transition in Topic Detection and Tracking”, Expert Systems with Applications, vol. 36, no. 1, pp. 227-232, 2009.
[10] Yuen-Hsien Tseng, Yu-I Lin, Yi-Yang Lee, Wen-Chi Hung, Chun-Hsiang Lee, “A Comparison of Methods for Detecting Hot Topics”, Scientometrics, vol. 81, no. 1, pp. 73-90, 2009.
[11] Erzhong Zhou, Ning Zhong, Yuefeng Li, “Hot Topic Detection in Professional Blogs”, Active Media Technology, vol. 6890, pp. 141-152, 2011.
[12] Kuan-Yu Chen, L. Luesukprasert, S.-C.T. Chou, “Hot Topic Extraction Based on Timeline Analysis and Multidimensional Sentence Modeling”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, pp. 1016-1025, 2007.
[13] Zhongfeng Zhang, Qiudan Li, “QuestionHolic: Hot Topic Discovery and Trend Analysis in Community Question Answering Systems”, Expert Systems with Applications, vol. 38, no. 6, pp. 6848-6855, 2011.
[14] Donghui Zheng, Fang Li, “Hot Topic Detection on BBS Using Aging Theory”, Web Information Systems and Mining, vol. 5854, pp. 129-139, 2009.
[15] Fei Xiong, Yun Liu, Jiang Zhu, et al., “A Dissipative Network Model with Neighboring Activation”, European Physical Journal B, vol. 84, no. 1, pp. 115-120, 2011.
[16] Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu, Ying Zhang, “Statistically Modeling the Effectiveness of Disaster Information in Social Media”, In Proceedings of the IEEE Global Humanitarian Technology Conference, pp. 431-436, 2011.
[17] Hua-Ping Zhang, Hong-Kui Yu, De-Yi Xiong, Qun Liu, “HHMM-based Chinese Lexical Analyzer ICTCLAS”, In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 184-187, 2003.


[18] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin, “LIBLINEAR: A Library for Large Linear Classification”, Journal of Machine Learning Research, vol. 9, no. 6, pp. 1871-1874, 2008.
