2013 5th Conference on Information and Knowledge Technology (IKT)
Content Diffusion Prediction in Social Networks

Ali Balali, Aboozar Rajabi, Sepehr Ghassemi, Masoud Asadpour and Hesham Faili
School of Electrical and Computer Engineering
University of Tehran
Tehran, Iran
{balali.a67, ab.rajabi, s_ghassemi, asadpour, hfaili}@ut.ac.ir

Abstract-Social networks are valuable resources for analyzing users’ natural behavior. User profile information, social links, and the opinions exchanged among users in these networks can be used by social analysts to discover users’ mental and behavioral patterns. In this paper, news agencies are used as the social medium for detecting the factors that affect how content diffuses among the public. We believe that the volume of comments on a piece of content shows how well it has spread and attracted attention. We therefore extract features of the content to predict the volume of comments. To this end, the content of a news article and its publication time are considered as two critical factors. A novel method for predicting content diffusion is proposed and its accuracy is evaluated. The promising results of our experiments indicate that these factors yield an accuracy of at least 70%.

Keywords-Social Networks; Text mining; Content Diffusion;

I. INTRODUCTION

Nowadays, social network analysis is a useful tool in sociology, anthropology, geography, social psychology, linguistics, organizational studies, economics, modern biology, communication and information science. The study of social networks has a long history in the research community, e.g. in simulating complex sets of behaviors among individuals, from interpersonal relationships to worldwide communications [1,2]. With the invention of the Internet and the rise of Web 2.0 websites, online social networks have pervaded our daily lives. The content that users publish on these websites every day contains valuable information. Many websites now allow their visitors to add content collaboratively, at least by commenting on articles.

News agencies, which have a wide range of viewers with different opinions, are among the important websites for content analysis. News articles are grouped into categories such as politics, economics and society, so they are a proper choice for analyzing society’s opinion on different topics. News agencies usually allow visitors to express their opinions on the articles or on each other’s comments.

When an article catches the attention of visitors, it means the article addresses something that is important to society. It is very important for governments and social organizations to detect such issues and respond to them accordingly. For example, predicting social tension is of vital importance. Another interesting topic is the detection of trends in society.

A common way to detect trends is to focus on the content that is published online every minute. Word patterns are then analyzed to find the patterns that appear most frequently in the published content. We think that, in addition to the content itself, the comments that discuss it can be useful in trend analysis as well.

If we find the features of a news article that catch people’s attention, we can predict whether a topic will become a hot trend in the near future. Detecting the factors that make content popular has many more applications, e.g. choosing a proper publication time for an article, optimizing content (especially word choice) for better visibility in search engines, marketing and advertisement, choosing a good slogan for electoral campaigns, and ranking documents in search engines based on the importance of the words they use.

In this paper we focus on features that are easily acquired from news articles, i.e. the main content of an article, its publish time, and its comments. Some news agencies report the number of visitors of each article; however, we do not use it here, as we think the volume of comments left on an article is more informative for our purpose [3]. Users might open a news article but leave it soon or not read it at all, whereas commenting on an article indicates not only that the visitor has read it but also that it was important to them.

In this paper we try to predict the diffusion of content and extract the features that affect the diffusion process. In summary, the contributions of this paper are:
• Proposing a new method for determining whether content will diffuse, based on its text, publish time, and volume of comments.
• Achieving at least 70% accuracy in predicting the volume of comments for news articles.
• Proposing a method for weighting words in order to measure their impact on the spread of content.

The paper is organized as follows. Literature and related works are reviewed in Section II. Section III outlines the methodology and the features we propose for predicting content diffusion. Section IV presents the collected dataset and the evaluation metrics, followed by the evaluation results. Conclusions and future work are given in Section V.

978-1-4673-6490-4/13/$31.00 ©2013 IEEE
II. RELATED WORKS

Some papers have focused on detecting different kinds of social relations among the users of websites and have extracted a social network from them. Jamali and Rangwala [17] used data from the Digg website to extract a social network among users based on their comments. They defined a collaboration network among users who leave comments on the same posts. Similarly, in [18] the social network is built among users who comment on each other’s posts. Mishne and Glance [12] carried out a popularity analysis showing that the number of comments is correlated with the number of viewers and the number of referring links in weblogs. Although the numbers of viewers and comments are related, the number of comments better indicates the importance and hotness of the subject. Various metrics are used in different papers to determine whether an article has attracted visitors’ attention, e.g. the number of comments [2,3] and the number of visits. Users’ comments are used in different applications, such as user behavior analysis and measuring the importance of users [4], summarizing posts [5], improving clustering [6], predicting the volume of comments [3,7,8] and measuring the reputation of a blog [12].

Among the features used to predict the volume of comments, the following are worth mentioning:
• surface features, such as the month of the year, day of the week, hour of the week, and hour of the day;
• textual features, such as words, the repetition frequency of words in a post, and the length of the post;
• semantic features, such as the use of negative sentences and the names of people, organizations, and places; and
• real-world features, such as weather conditions.

In some papers, prediction is done in a particular context such as politics [9]. In addition, the number of comments can be affected by the context, as it is related to specific communities [10,11]; e.g. blogs are mostly associated with younger people or students. Sun et al. [13] investigated opinion analysis through the polarity of comments. Their proposed method, DWC (Distance-Weighted Count), is an unsupervised method that, unlike the PWC (Polarized Word Count) method, considers the weight of the polarized words. Similarly to [13], Symoneaux et al. [14] analyzed like/dislike comments in order to find the reasons behind selecting a specific product.

III. OUR METHODOLOGY

In this section, our proposed method for predicting content diffusion based on comments is explained. We classify articles into 3 classes: (1) “Without Comment”, (2) “Moderately Commented” and (3) “Highly Commented”. The “Without Comment” class means the diffusion rate is low and the news article does not affect society; in other words, there is no interest in the news. The “Moderately Commented” class indicates an average influence and diffusion rate, while the “Highly Commented” class indicates a high diffusion rate and a wide spread of the article.

The classifier we use is the Random Forest Classifier (RFC). The RFC [15] consists of multiple decision trees, and its output is the mode of the classes predicted by those trees. The RFC builds a model from the training data; each decision tree in the model then classifies the test data, using the defined features to choose one of the three classes introduced above. We use the implementation available in the Weka (Waikato Environment for Knowledge Analysis) software [16].

The features used by our classifier are the news title, news content, news topic and publish time. We calculate a weight for every word, composed of three metrics: Term Frequency (TF), Inverse Document Frequency (IDF) and the number of comments. TF is the number of times a specific word appears in an article. IDF indicates whether a word is general or not and is calculated as follows:

$$
\mathrm{IDF}(w) = \log\left(\frac{N}{n_w}\right) \tag{1}
$$

where N is the total number of news articles in the dataset and n_w is the number of news articles that contain the word w.

These three metrics are then combined into a weight for the word w:

$$
\mathrm{Weight}(w) =
\begin{cases}
\displaystyle \sum_{i \in \mathrm{articles}} \big(\log(\mathrm{TF}(w,i)) + 1\big) \cdot \mathrm{IDF}(w) \cdot C_i & \text{if } C_i > \bar{C} \\[2ex]
\displaystyle -\sum_{i \in \mathrm{articles}} \big(\log(\mathrm{TF}(w,i)) + 1\big) \cdot \mathrm{IDF}(w) \cdot \frac{1}{C_i + 1} & \text{otherwise}
\end{cases}
\tag{2}
$$

where i is a news article, TF(w,i) is the frequency of word w in article i, IDF(w) is the IDF of word w, C_i is the number of comments received by article i, and \bar{C} is the average number of comments received per article. If a word appears in an article that has received more comments than average, its weight is increased; otherwise, its weight is decreased. The weight reduction is inversely proportional to the number of comments. Words that appear in the test data but not in the training data are ignored.

Because longer titles or bodies yield larger summed weights, the weights are normalized by the length of the article’s title and body. We therefore define the following features. The first is the weight of the title of an article:

$$
\mathrm{Weight}(\mathrm{title}) = \frac{\sum_{w \in \mathrm{title}} \mathrm{Weight}(w)}{\log(|\mathrm{title}|)} \tag{3}
$$

where |title| is the number of words in the title. The second is the weight of the content, computed over the words of the content:

$$
\mathrm{Weight}(\mathrm{content}) = \frac{\sum_{w \in \mathrm{content}} \mathrm{Weight}(w)}{\log(|\mathrm{content}|)} \tag{4}
$$

where |content| is the number of words in the content.
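The word-weighting scheme of Eqs. (1) to (4) can be illustrated with a short sketch. This is only one possible reading, not the authors' implementation: it assumes natural logarithms, checks the case condition of Eq. (2) once per article, assumes the content weight is normalized by log(|content|) in the same way as the title weight, and adds a guard for one-word texts, where log(1) = 0. The function names, the article dictionaries and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def idf(articles):
    """Eq. (1): IDF(w) = log(N / n_w); natural log is an assumption."""
    n_articles = len(articles)
    doc_freq = Counter(w for a in articles for w in set(a["words"]))
    return {w: math.log(n_articles / n_w) for w, n_w in doc_freq.items()}


def word_weights(articles):
    """Eq. (2), with the case condition checked once per article."""
    idf_of = idf(articles)
    mean_comments = sum(a["comments"] for a in articles) / len(articles)
    weights = defaultdict(float)
    for a in articles:
        for w, tf in Counter(a["words"]).items():
            base = (math.log(tf) + 1.0) * idf_of[w]
            if a["comments"] > mean_comments:
                weights[w] += base * a["comments"]        # rewarded by C_i
            else:
                weights[w] -= base / (a["comments"] + 1)  # penalised by 1/(C_i + 1)
    return dict(weights)


def normalized_weight(words, weights):
    """Eqs. (3)-(4): summed word weights divided by log(number of words)."""
    total = sum(weights.get(w, 0.0) for w in words)  # unseen words add nothing
    # Guard (our addition): log(1) = 0, so very short texts are left unscaled.
    denom = math.log(len(words)) if len(words) > 1 else 1.0
    return total / denom


# Hypothetical toy training set: token lists and comment counts.
train = [
    {"words": ["election", "result", "election"], "comments": 10},
    {"words": ["weather", "result"], "comments": 0},
    {"words": ["weather", "sport"], "comments": 2},
]
weights = word_weights(train)
title_weight = normalized_weight(["election", "result"], weights)
```

In this toy set the average is 4 comments per article, so "election" (seen only in the 10-comment article) gets a positive weight, while "weather" and "sport" (seen only in articles at or below the average) get negative weights.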
A third feature, the sum of the weights of the title and content, is also added to the feature set. We also add some other features, listed in Table 1.

TABLE 1. SOME EXTRA FEATURES USED IN THE CLASSIFIER

| Feature Name | Description |
|---|---|
| Category | Category of the news article (politics, culture, economics, sport and society) |
| Publish time | In "yyyy-MM-dd HH:mm:ss" format |
| Publish year, Publish month, Publish day, Publish hour, Publish minute | Publish time related information |
| Length of the title | Number of words of the title |
| Length of the content | Number of words of the article |

IV. EXPERIMENTS AND EVALUATION

In this section we describe our dataset and the experiments we ran on it to evaluate the proposed methodology. In the experiments with the RFC, the maximum height of the output trees is set to 7 to prevent overfitting.

A. Dataset

We ran our experiments on data from two popular news agencies in Iran: 1) Tabnak (Professional News Site, www.tabnak.ir), which is in the global top 1000 by traffic rank on Alexa (the web information company, www.alexa.com) and is the 11th most popular website in Iran; according to Alexa (as of April 2013), Tabnak is also the most famous online news agency in Iran. 2) Alef (The Analytical News Agency, www.alef.ir), which is ranked 3033 globally. The Tabnak and Alef datasets used in this paper are available for future work at http://ece.ut.ac.ir/NLP/resources.html. Besides the popularity of these websites, the wide range of news categories they cover and their multilevel commenting structure are the other reasons for choosing them. Multilevel commenting enables visitors to see related comments and replies in one place. Some statistics about these sites are shown in Table 2.

TABLE 2. SOME STATISTICAL INFORMATION ABOUT THE COLLECTED DATA

| | No. of Articles | No. of Comments | Alexa rank in Iran | Average comments per news |
|---|---|---|---|---|
| Tabnak | 20872 | 127091 | 11 | 6 |
| Alef | 10010 | 81997 | 48 | 8 |

We divide the news articles into 3 classes: the No-Comments class is assigned to articles that have no comments, the Moderately-Commented class to articles that have from 1 to 6 comments, and the Highly-Commented class to articles that have more than 6 comments. Table 3 shows the number and percentage of articles in each class. Most of the news articles (57-67%) are in the “No Comments” class and around 20% are highly commented.

TABLE 3. THE DATASETS’ STATISTICS PER CLASS

| | No Comments | Moderately Commented | Highly Commented |
|---|---|---|---|
| Tabnak | 14033 (67%) | 3125 (15%) | 3714 (18%) |
| Alef | 5653 (~57%) | 2526 (25%) | 1831 (18%) |

B. Evaluation Metrics

To evaluate the results of the experiments, we use the standard accuracy, precision, recall, and F1-score. Accuracy is the percentage of predictions that are correct, precision is the percentage of positive predictions that are correct, recall is the percentage of positively labeled instances that are predicted as positive, and F1-score is the harmonic mean of precision and recall. They are computed as follows:

$$
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}
$$

$$
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}
$$

$$
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}
$$

$$
F1\text{-score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}
$$

where TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively. We applied 5-fold cross-validation to minimize bias.

C. Performance Evaluation

The accuracy of the classifier for each website is shown in Table 4. The results show that the accuracies achieved on both websites exceed 70%. We could achieve higher accuracies by adding some very specific features, e.g. important events such as elections. On these occasions people leave more comments and participate more in discussions; however, our goal in this paper was to be more general, so these kinds of features were not added.

We use the performance obtained with only the 6 time-related features, i.e. date, year, month, day, hour and minute, as a baseline [3]. As shown in Table 4, this baseline produces rather good results. This shows that on specific days corresponding to important events the rate of comments increases.
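As a small illustration of the class thresholds from the dataset description and of the metrics in the Evaluation Metrics subsection, the sketch below labels articles by comment count and scores a set of predictions. The paper does not say how the binary definitions of precision and recall were extended to three classes; the one-vs-rest treatment here is an assumption, and the function names and toy predictions are hypothetical.

```python
from collections import Counter

CLASSES = ("No Comments", "Moderately Commented", "Highly Commented")


def assign_class(n_comments):
    """Thresholds from the dataset description: 0, 1 to 6, more than 6."""
    if n_comments == 0:
        return CLASSES[0]
    return CLASSES[1] if n_comments <= 6 else CLASSES[2]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """One-vs-rest precision, recall and F1 for a single class."""
    pairs = Counter((t == label, p == label) for t, p in zip(y_true, y_pred))
    tp = pairs[(True, True)]
    fp = pairs[(False, True)]
    fn = pairs[(True, False)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical comment counts and classifier outputs for six articles.
comment_counts = [0, 0, 3, 6, 7, 20]
y_true = [assign_class(c) for c in comment_counts]
y_pred = [
    "No Comments", "Moderately Commented", "Moderately Commented",
    "Highly Commented", "Highly Commented", "Highly Commented",
]

print(accuracy(y_true, y_pred))
print(per_class_scores(y_true, y_pred, "Highly Commented"))
```

Treating each class in turn as the positive class is what makes per-class tables of precision, recall and F1 possible for a three-class problem.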
Using only content-based features, i.e. content length, title length, content weight, title weight, category, and the sum of content and title weights, leads to around 70% accuracy on both datasets. These results show that content is also important in determining the popularity of news articles, even more so than the time-based features.

TABLE 4. ACCURACY OF THE CLASSIFIER

| Dataset | All features | Time-based features (baseline) | Content-based features |
|---|---|---|---|
| Alef | 72.76% | 63.59% | 69.63% |
| Tabnak | 70.82% | 67.79% | 69.62% |

Table 5 contains the precision, recall and F1-scores on Alef’s dataset for each class of articles. Table 6 shows the same information for the baseline features. As Table 5 shows, the “Without Comment” class is easier to detect than the two other classes. The “Moderately Commented” class is the hardest to determine: the presence of certain words indicates whether content has a high potential to receive numerous comments, but finding words that indicate content will receive a medium number of comments is very hard.

TABLE 5. PRECISION, RECALL, AND F1-SCORE FOR ALEF WEBSITE

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Without Comment | 0.757 | 0.896 | 0.82 |
| Moderately Commented | 0.567 | 0.357 | 0.438 |
| Highly Commented | 0.762 | 0.72 | 0.74 |

TABLE 6. PRECISION, RECALL, AND F1-SCORE FOR ALEF WEBSITE USING TIME-BASED FEATURES (BASELINE)

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Without Comment | 0.696 | 0.856 | 0.768 |
| Moderately Commented | 0.465 | 0.378 | 0.417 |
| Highly Commented | 0.567 | 0.311 | 0.402 |

In conclusion, using content features along with date features increases accuracy, precision, recall and F1-score. The biggest difference between our proposed method and the baseline is in determining the “Highly Commented” class, i.e. 34% in F1-score (0.74 versus 0.402). This shows that the role of words is very important in determining the “Highly Commented” class.

D. Evaluation of the features

In this section, we evaluate the role of the features listed in Table 7 in the performance of the classifier. We compare them by their gain ratio, as computed by Weka. As the table shows, the “minute” feature has zero gain ratio and is useless. In contrast, the “year” and “month” features are quite useful. The “year” feature is ranked 4th among the features. This is due to the fact that the number of comments has grown in recent years.

TABLE 7. EVALUATION OF THE FEATURES

| Rank | Feature | Score (Gain Ratio) |
|---|---|---|
| 1 | Date | 0.27003 |
| 2 | Weight of content | 0.24734 |
| 3 | Sum of the weights of content and title | 0.22835 |
| 4 | Year | 0.16018 |
| 5 | Month | 0.12087 |
| 6 | Content Length | 0.09891 |
| 7 | Category | 0.06674 |
| 8 | Weight of Title | 0.0382 |
| 9 | Title Length | 0.01444 |
| 10 | Day | 0.00813 |
| 11 | Hour | 0.00361 |
| 12 | Minute | 0 |

In conclusion, the “date” and then “weight of content” features have the highest impact on the performance of the classifier. It is worth mentioning that the content-based features were more accurate than the time-based features, especially in detecting the “Highly Commented” class.

E. Some facts about the time-based features

In this section, some statistics on the time-based features are presented that help explain why these features are important in classification.

Fig. 1 and 2 show the total number of articles and comments published by Tabnak from 2007 to 2012. The growth in articles and comments shows why the “year” feature is important in classification. The growth may have several causes: 1) The Internet has pervaded the social life of Iranians, and online content is becoming one of their main sources of knowledge and information. 2) Online websites and social networks have played an essential role in the important events of recent years, especially the presidential election in 2009 and its post-election events. 3) People tend to express their opinions and beliefs because online websites allow them to do so anonymously.

Fig. 3 and 4 show the total number of articles and comments published in Alef during different hours of the day. The results are almost the same for Tabnak, so we omit them. Most articles and comments for both sites are published between 10:00 and 14:00. This is the time when people generally take a break and have free time to read news articles and leave comments. The publish hour therefore contains very little information about the importance of the news, as most important news articles are published during these hours.

V. CONCLUSION AND FUTURE WORKS

In this paper, two important factors in content diffusion, i.e. the content and the publish time, were analyzed. We designed a classification method to predict the success of articles in creating dialogue among visitors. We measured this success by counting the number of comments received by the article. A set of time-based and content-based features was fed into a Random Forest Classifier and the analysis was done on two online news agencies. The results showed that the proposed model is capable of prediction with more than 70% accuracy. We showed that some features, like publish hour, are not informative for this purpose, whereas the publish date and the weight measure introduced for content were the most informative features. In future work, we would like to add features such as the importance of events corresponding to the date (like elections and holidays) and find out whether they can increase prediction accuracy.
REFERENCES

[1] L. Freeman, The Development of Social Network Analysis. Vancouver: Empirical Press, 2006.
[2] A. Kaltenbrunner, et al., "Homogeneous temporal activity patterns in a large online communication space," in Proc. BIS 2007 Workshop on Social Aspects of the Web (SAW 2007), 2007.
[3] M. Tsagkias, W. Weerkamp, and M. de Rijke, "Predicting the volume of comments on online news stories," in Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), p. 1765, 2009.
[4] J. Chan, et al., "Decomposing Discussion Forums and Boards Using User Roles," presented at the Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[5] M. Hu, et al., "Comments-Oriented Blog Summarization by Sentence Extraction," presented at CIKM’07, November 6–8, 2007, Lisboa, Portugal.
[6] B. Li, et al., "Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments," presented at ACMSE’07, March 23-24, 2007, Winston-Salem, N. Carolina, USA.
[7] G. Mishne, Applied Text Analytics for Blogs. PhD thesis, University of Amsterdam, Amsterdam, 2007.
[8] J. O'Neill and D. Martin, "Text chat in action," presented at the 2003 International ACM SIGGROUP Conference on Supporting Group Work, Sanibel Island, Florida, USA, 2003.
[9] T. Yano, et al., "Predicting response to political blog posts with topic models," presented at Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, 2009.
[10] M. Gumbrecht, "Blogs as 'Protected Space'," presented at WWW’04, May 18–21, 2004, New York, NY, USA.
[11] L. Wang and D. W. Oard, "Context-based Message Expansion for Disentanglement of Interleaved Text Conversations," in Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pp. 200–208, 2009.
[12] G. Mishne and N. Glance, "Leave a Reply: An Analysis of Weblog Comments," Third Annual Workshop on the Weblogging Ecosystem, 2006.
[13] S. Sun, G. Kong, and C. Zhao, "Polarity Words Distance-Weight Count For Opinion Analysis Of Online News Comments," Procedia Engineering, vol. 15, pp. 1916-1920, Jan. 2011.
[14] R. Symoneaux, M. V. Galmarini, and E. Mehinagic, "Comment analysis of consumer’s likes and dislikes as an alternative tool to preference mapping. A case study on apples," Food Quality and Preference, vol. 24, no. 1, pp. 59-66, Apr. 2012.
[15] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," vol. 11, 2009.
[17] S. Jamali and H. Rangwala, "Digging Digg: Comment Mining, Popularity Prediction, and Social Network Analysis," 2009 International Conference on Web Information Systems and Mining, pp. 32-38, Nov. 2009.
[18] Y. Wu, L.-dong Sun, and H.-tao Liu, "Property Description and Comparison of BLOG Community at Different Scales," no. 1, 2010.

Figure 1. Total number of news articles published by Tabnak in different years (x-axis: year, 2007 to 2012; y-axis: count)

Figure 2. Total number of comments received by Tabnak in different years

Figure 3. Total number of comments received by Alef in different hours

Figure 4. Total number of news articles published by Alef in different hours (x-axis: hour; y-axis: count)