Mining Domain Information from Social Contents Based on News Categories

Yin-Fu Huang & Chen-Ting Huang
National Yunlin University of Science and Technology
Touliu, Yunlin, Taiwan, R.O.C.
+886-5-5342601 Ext. 4314
[email protected], [email protected]

ABSTRACT
In this paper, to help users find the domain information they are interested in from tweets, and to further support the work of domain explorers and social content retrieval, we propose a classification method for social contents in Twitter based on news categories. The method combines traditional “bag of words” features with explicit features extracted from tweets to facilitate the classification. Since data sparseness is always a serious problem when “bag of words” features are used in text classification, we further apply dimensionality reduction methods to the “bag of words” features and observe their performance. The experimental results show that 1) our proposed method achieves good performance in tweet classification and 2) using dimensionality reduction methods achieves higher accuracy than not using them.
Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis.
General Terms Algorithms, Documentation, Experimentation.
Keywords Twitter, New York Times, short text, classification, dimensionality reduction, hashtag.
1. INTRODUCTION
Recently, since social media information has increased rapidly, social network mining has become a popular topic. Twitter is a well-known social networking application with vast amounts of data (tweets) posted daily by users. Much valuable information hidden in tweets can be used to develop new applications, such as social information summaries based on users' interests, interest-oriented social information retrieval, popular topics, event exploration, and mining related information to support decision making for organizations. However, these rich-meaning tweets, covering different cultures, languages, and personal cognition, make mining work more difficult. To explore tweet contents effectively, our study in this paper classifies these comprehensive tweets into the categories used by the New York Times. Classified tweets can not only help users find the contents they want to see, but also support more detailed analyses, such as finding popular topics in tweets related to technology information and other applications. Since tweet contents cannot exceed the 140 characters imposed by Twitter, mining tweet contents becomes a short text classification problem. In other words, using “bag of words” features in text classification incurs the data sparseness problem. Furthermore, when sample spaces become very large, a large amount of computation cost follows. Therefore, achieving good performance in tweet classification while reducing the computation costs incurred in model learning and prediction is the goal of our study. In this paper, in addition to the traditional “bag of words” features, we consider extra features specific to tweets. On the other hand, we also compare different dimensionality reduction methods and inspect their performance in tweet classification.

The remainder of this paper is organized as follows. Section 2 introduces previous studies related to ours. In Section 3, we present the system framework and describe each component. In Section 4, through experiments, we observe and analyze the performance of different feature spaces and dimensionality reduction methods in tweet classification. Finally, we draw conclusions in Section 5.

IDEAS '15, July 13 - 15, 2015, Yokohama, Japan. © 2015 ACM. ISBN 978-1-4503-3414-3/15/07…$15.00. DOI: http://dx.doi.org/10.1145/2790755.2790776
2. RELATED WORK
Previous research related to Twitter includes the following. Zhao et al. exploited the Latent Dirichlet Allocation (LDA) model to explore topic distributions in Twitter and in a traditional medium, the New York Times (NYT) [9]. They compared the differences between the two media and confirmed that Twitter is a better source of entity-oriented topics than NYT. Unlike traditional classification methods using only the “bag of words” model, Sriram et al. also used a small set of domain-specific features, such as “@username” at the beginning of tweets, authors, slang, time-event phrases, and opinion words, extracted from author profiles and texts to classify tweets into the categories News, Events, Opinions, Deals, and Private Messages [7]. However, our work differs from that of Sriram et al.: since we further identify the news category of social contents, these features cannot be used in our classification model directly. In addition to Twitter-related research, text classification is also strongly related to our work. To improve the classification performance on short and sparse texts, Phan et al. collected a large set of hidden topics from large-scale data collections such as Wikipedia and MEDLINE, and integrated them into the training data to reduce the impact of data sparseness [6]. Besides, Gabrilovich and Markovitch observed the performance of SVM and C4.5 in text categorization when feature selection is applied to datasets with many redundant features [4]. Their results show that feature selection is essential to improving SVM performance in text categorization when a dataset has many redundant features.
3. SYSTEM FRAMEWORK
A tweet may be considered as belonging to different categories, depending on the recognition of different people. Here, we use hashtags matching the category names to collect the tweets used in our system. The collected tweets are related to 12 categories categorized by the New York Times: Arts, Business, Health, NYRegion, Opinion, Politics, Science, Sports, Style, Technology, Travel, and World. In total, 3000 tweets for each category are crawled using twitter4j, with the thresholds of Favorite Count and Retweet Count greater than 1 for each tweet, and collected as the dataset of the system. To identify the category of tweet contents, we propose a system framework comprising 5 components, i.e., 1) preprocessing, 2) building dictionary, 3) generating samples, 4) dimensionality reduction, and 5) building classifier, as shown in Figure 1. First, the preprocessing removes URLs, special characters, and hashtags identical to the 12 category names from tweet contents. Then, we collect news from the New York Times and extract keywords to build a dictionary used to filter the “bag of words” in tweets. Next, the samples are generated with features integrating the “bag of words” features in tweets and the other explicit features extracted in the preprocessing. To find relevant features, we reduce sample spaces using dimensionality reduction methods. Finally, a classification model is built to observe the performance of different combinations of feature spaces and dimensionality reduction methods.
[Figure 1 depicts the pipeline: Social Contents → Preprocessing → Building Dictionary → Dictionary → Generating Samples → Samples → Dimensionality Reduction → Building Classifier → SVM Model.]
Figure 1. System framework.
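The crawling step retains only tweets whose Favorite Count and Retweet Count both exceed 1. The authors crawled with twitter4j in Java; the following Python sketch shows the same threshold filter on hypothetical records (the dictionary field names are illustrative, not twitter4j's API):

```python
def passes_thresholds(tweet, min_favorites=1, min_retweets=1):
    """Keep a tweet only if both engagement counts exceed the thresholds."""
    return (tweet["favorite_count"] > min_favorites
            and tweet["retweet_count"] > min_retweets)

# Hypothetical crawled records
crawled = [
    {"text": "Breaking science news", "favorite_count": 5, "retweet_count": 3},
    {"text": "low-engagement tweet", "favorite_count": 1, "retweet_count": 0},
]
kept = [t for t in crawled if passes_thresholds(t)]
```

Only the first record survives, since both of its counts are greater than 1.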
3.1 Preprocessing
The preprocessing extracts relevant information from tweets to facilitate the classification. Since tweets can include noise useless for the classification, it should be removed step by step. First, we remove duplicate tweets, and then URLs, punctuation, special characters, and stop words (such as “n't”, “'ll”, “'ve”, “'s”, “'re”, etc.) in tweets. Then, we remove tweets whose word counts are less than 10 from the collected tweets.
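The cleaning steps above can be sketched as follows. This is an illustrative Python approximation, not the authors' implementation; the exact tokenization and retained character classes are assumptions:

```python
import re

# Contraction stop words listed in Sec. 3.1
STOP_SUFFIXES = ("n't", "'ll", "'ve", "'s", "'re")

def clean_tweet(text):
    """Strip URLs, contraction stop words, punctuation, and special characters."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    for suffix in STOP_SUFFIXES:                     # remove contraction stop words
        text = text.replace(suffix, " ")
    text = re.sub(r"[^A-Za-z0-9#@\s]", " ", text)    # punctuation / special characters
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

def keep(text, min_words=10):
    """Retain only tweets with at least min_words words after cleaning."""
    return len(clean_tweet(text).split()) >= min_words
```

For example, `clean_tweet("Check http://t.co/abc it's great!!")` yields `"Check it great"`, which `keep` then discards for being shorter than 10 words.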
Finally, 20564 tweets in total remain as the social content dataset for the experiments, as shown in Table 1. Besides, we also extract hashtags other than the 12 category names, user entities, and authors from each tweet to support the classification.

Table 1. Tweets collected in each category
Categories      Number of tweets
Arts            1579
Business        1867
Health          1906
NYRegion        1739
Opinion         957
Politics        1834
Science         1388
Sports          1922
Style           1609
Technology      2199
Travel          1669
World           1895
3.2 Building Dictionary
On the other hand, we also collect New York Times (NYT) news from 02/01/2013 to 02/01/2014, according to the 12 categories. Keywords are extracted from the collected news and used to build a dictionary for filtering the “bag of words” in tweets. In total, we collect 10700 news articles. After removing URLs, punctuation, special characters, and stop words appearing in the Natural Language Toolkit (NLTK) Corpus of English Stopwords [1] from these articles, we calculate the Document Frequency (DF) of all terms in the news texts. To make sure that each keyword is relevant, we remove the terms with DF = 1 or with DF within the top 0.6%. Finally, a dictionary is built with the remaining 67923 keywords over all categories for extracting the “bag of words” in tweets. The numbers of news articles and keywords in each category are shown in Table 2.
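The DF-based filtering (drop terms with DF = 1 and the top 0.6% most frequent terms) can be sketched as follows, assuming the news texts are already tokenized; how ties are broken at the 0.6% cutoff is an assumption:

```python
import math
from collections import Counter

def build_dictionary(docs, top_frac=0.006):
    """docs: list of token lists. Keep terms with DF > 1 that are not
    among the top_frac (0.6%) most frequent terms by DF."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # count each term once per document
    terms = sorted(df, key=df.get, reverse=True)
    n_top = math.ceil(len(terms) * top_frac)     # size of the top-0.6% slice
    too_common = set(terms[:n_top])
    return {t for t in terms if df[t] > 1 and t not in too_common}
```

On a toy corpus where "the" appears in every document and "banana" in only one, both are filtered out, leaving only mid-frequency terms as keywords.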
3.3 Generating Samples
After extracting social contents and building the keyword dictionary, we generate samples for building the classifier. Two kinds of features are considered as the attributes of samples to identify tweets. The first kind is the traditional “bag of words” features, filtered from tweets using the dictionary. In total, 14082 words are common to the tweets and the dictionary, and they form the binary “bag of words” features of samples. These “bag of words” features are also called implicit features in this paper.
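A binary “bag of words” over a fixed dictionary can be illustrated with scikit-learn's CountVectorizer (the authors used Weka; the dictionary words and tweets here are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical keyword dictionary (the real one holds 14082 words)
dictionary = ["election", "senate", "vaccine", "startup"]
vectorizer = CountVectorizer(vocabulary=dictionary, binary=True)

tweets = ["Senate vote on the election tonight", "New vaccine trial results"]
X = vectorizer.transform(tweets)   # sparse binary matrix, one row per tweet
```

Each row marks which dictionary words occur in the tweet: the first row sets the "election" and "senate" columns, the second only "vaccine".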
Table 2. News and keywords in each category
Categories      Number of news  Number of keywords
Arts            902             35989
Business        879             27519
Health          680             20063
NYRegion        922             30015
Opinion         967             30431
Politics        894             22495
Science         809             24850
Sports          966             28337
Style           990             39036
Technology      801             22749
Travel          999             29625
World           891             26075
For the second kind, some information in tweets is considered useful for tweet classification, such as hashtags, which allow users to define contents themselves based on folksonomy. Hashtags can be used to share information with other users having the same interests, and help users quickly find tweets related to the topics (or tags) they want to see. However, there are too many hashtags, as shown in Table 3, to use them directly as the features of samples. Furthermore, some hashtags duplicate the “bag of words” features. Thus, we build 12 hash concepts representing the 12 news categories, where each one is organized with the top 20 hashtags appearing most frequently in the tweets of its corresponding category. For each social content (or sample), we calculate the number of its tags covered by the 20 hashtags of a hash concept as the feature value of that hash concept. Therefore, 12 numeric features corresponding to the 12 hash concepts form part of the explicit features, as shown in Figure 2. In addition to the hash concepts, we further consider one nominal feature called “author” and 3 binary features representing the presence of user entities such as “@username”, “currency”, and “percentage” signs [7]. All these features, called explicit features, are shown in Figure 2.

[Figure 2 lists the explicit features: 12 numeric hash-concept features, a nominal author feature, and binary @username, currency, and percentage features.]
Figure 2. Explicit features.
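A hash-concept feature counts how many of a tweet's hashtags fall into a category's top-20 hashtag set. A minimal sketch, with hypothetical (and abbreviated) hashtag sets standing in for the real top-20 lists:

```python
# Hypothetical top hashtags per category (each real concept holds the top 20)
HASH_CONCEPTS = {
    "Politics": {"#election", "#senate", "#congress"},
    "Health":   {"#flu", "#vaccine", "#wellness"},
}

def hash_concept_features(tweet_tags, concepts=HASH_CONCEPTS):
    """Return one numeric feature per category: the overlap size between
    the tweet's hashtags and that category's top-hashtag set."""
    tags = set(tweet_tags)
    return {cat: len(tags & top) for cat, top in concepts.items()}

feats = hash_concept_features(["#election", "#senate", "#food"])
```

Here the tweet overlaps the Politics concept on two tags and the Health concept on none, so the feature vector is Politics = 2, Health = 0.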
Table 3. Hashtags in social content samples
Categories      Number of hashtags
Arts            813
Business        1341
Health          1730
NYRegion        1482
Opinion         594
Politics        1326
Science         1031
Sports          1637
Style           1893
Technology      922
Travel          1447
World           1843
3.4 Dimensionality Reduction
Data sparseness is always a serious problem when “bag of words” features are used in text classification. Thousands of features incur a large amount of computation cost, not only for building a classifier but also for using the classifier in prediction. Although dimensionality reduction methods can be used to reduce feature dimensions, classification accuracy usually decreases. Nevertheless, we use dimensionality reduction methods to generate different sample spaces and observe which sample spaces still keep good performance. Here, two feature selection methods, Information Gain (IG) and Chi-Square (CHI), and one feature extraction method, Principal Component Analysis (PCA), implemented in WEKA [5], are used to reduce sample spaces.
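The three reduction methods were run in WEKA; an equivalent sketch with scikit-learn on synthetic data, using mutual information as a stand-in for Information Gain alongside chi-square selection and PCA:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Synthetic stand-in for the tweet samples: 200 samples, 50 features
X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_nonneg = X - X.min()   # chi2 requires non-negative feature values

X_chi = SelectKBest(chi2, k=10).fit_transform(X_nonneg, y)         # CHI selection
X_ig = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)  # IG analogue
X_pca = PCA(n_components=10).fit_transform(X)                      # PCA extraction
```

Note the difference exploited later in Section 4.2: the two selection methods keep 10 of the original features, while PCA produces 10 new numeric components.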
3.5 Building Classifier
In this paper, we use SVM [2] to train the classification model since it has achieved good performance in previous studies [3, 4, 8]. Although many studies claimed that SVM performance in text classification cannot be improved by feature selection methods, some studies [4] still showed that feature selection is essential to improving SVM performance when many redundant features exist. Whether feature selection or feature extraction methods are used to reduce the sample spaces, the reduced samples are normalized between 0 and 1 to build SVM models with the optimal parameters C and γ.
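The normalize-then-train step can be sketched with scikit-learn (the authors used LIBSVM [2] via Weka; the Iris data here merely stands in for the tweet samples, and the parameter values are those reported in Section 4):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Scale features into [0, 1] as in Sec. 3.5, then fit an RBF-kernel SVM
model = make_pipeline(MinMaxScaler(), SVC(C=2**5, gamma=2**-7))
model.fit(X, y)
acc = model.score(X, y)   # training accuracy of the fitted model
```

Putting the scaler inside the pipeline ensures the same [0, 1] normalization learned on training data is reapplied at prediction time.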
4. EXPERIMENTAL RESULTS
In this section, we observe and analyze the performance of different feature spaces and dimensionality reduction methods in tweet classification. The experiments, implemented in Java using the Weka library, were run on a platform with an i7-4790 CPU and 32 GB of memory. In the first experiment, different feature spaces are used to investigate which feature space is the best set of sample attributes for tweet classification. In the second experiment, different dimensionality reduction methods are used to inspect which reduced samples perform better than the others in tweet classification. Furthermore, we also compare the execution time of the different dimensionality reduction methods in dimensionality reduction, training, and testing. In these experiments, we use the SVM models built with the optimal parameters C = 2^5 and γ = 2^-7. The classification accuracy is evaluated with 10-fold cross-validation. Finally, comparisons between our method and other existing methods are presented.
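The 10-fold cross-validation protocol with the stated parameters can be sketched as follows (a scikit-learn stand-in for the authors' Java/Weka setup; the digits dataset is only a placeholder for the tweet samples):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
model = make_pipeline(MinMaxScaler(), SVC(C=2**5, gamma=2**-7))

# 10-fold cross-validation: each fold is held out once for testing
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=10))
mean_acc = scores.mean()
```

The reported accuracy corresponds to the mean over the 10 held-out folds.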
4.1 Experiment 1
As mentioned in Section 3.3, the samples can be generated using two kinds of features: 1) the traditional “bag of words”, called implicit features (14082 features in total), and 2) explicit features specific to tweets (16 features in total), including the 12 hash concepts, “author”, and the “@username”, “currency”, and “percentage” signs. In this experiment, we investigate which feature space is the best for tweet classification. The features can be combined into 5 kinds of feature spaces: 1) only implicit features (IM), 2) only explicit features (EX), 3) implicit plus explicit features (IMEX), 4) implicit plus explicit features without hash concepts (IMEXWOHC), and 5) only hashtags. As illustrated in Figure 3, using IMEX features achieves the best accuracy, 92.32%, in identifying social contents, which is higher than using IM features (81.14%) or EX features (71.93%). This verifies that both implicit and explicit features are relevant to tweet classification, so they should be considered together. Furthermore, we also investigate whether the hash concepts proposed in this paper play a major role in tweet classification. The results in Figure 3 reveal that using IMEXWOHC features achieves an accuracy of only 81.6%, although this is still higher than using IM features (81.14%). Therefore, extracting the hash concepts in tweets and regarding them as part of the sample attributes is a correct step in tweet classification. On the other hand, if we use only the 12839 hashtags collected from tweets (which form the 12 hash concepts) as the sample attributes, we obtain the worst accuracy, 60.26%. This indicates that using hashtags alone cannot effectively identify social contents.
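Building the IMEX feature space amounts to horizontally stacking the sparse binary implicit features with the dense explicit features; a sketch with toy shapes (not the paper's 14082 + 16):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical shapes: 4 tweets, 6 binary "bag of words" (implicit) features,
# and 3 explicit features (e.g., hash-concept counts)
implicit = csr_matrix(np.random.default_rng(0).integers(0, 2, size=(4, 6)))
explicit = np.array([[2, 0, 1], [0, 1, 0], [3, 0, 0], [0, 0, 2]])

# IMEX: implicit plus explicit features, kept sparse for the classifier
imex = hstack([implicit, csr_matrix(explicit)])
```

Keeping the combined matrix sparse matters here, since the implicit part is mostly zeros due to data sparseness.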
Figure 3. Accuracy of using different feature spaces.
4.2 Experiment 2
Although using IMEX features achieves a very high accuracy of 92.32%, it takes a large amount of computation cost to deal with 20564 samples with 14098 dimensions (i.e., 14082 implicit features plus 16 explicit features). In this experiment, we use different dimensionality reduction methods to reduce the sample spaces and inspect which method still keeps good performance in tweet classification. Here, three dimensionality reduction methods, 1) Principal Component Analysis (PCA), 2) Information Gain (IG), and 3) Chi-Square (CHI), are used to reduce the sample spaces. The reduction target is the implicit features, because they far outnumber the explicit features. As shown in Figure 4, PCA achieves a top accuracy of 92.62% for samples with 3016 dimensions, whereas IG and CHI achieve even higher top accuracies of 93.12% and 93.13%, respectively. This illustrates that dimensionality reduction methods can still achieve higher accuracy than the original dimensions (92.32%). Besides, we observe that the top accuracy of PCA appears away from either end of the sample-size curve, while the top accuracies of IG and CHI appear at the right end of the curve. Unlike the two feature selection methods IG and CHI, PCA is a feature extraction method based on minimal loss of information. Although PCA cannot guarantee that its information loss is harmless to classification, in this study it achieves higher accuracy using far fewer dimensions than the original ones.
Figure 4. Accuracy of using different dimensionality reduction methods.
Furthermore, we also compare the execution time of the different dimensionality reduction methods in 1) dimensionality reduction, 2) SVM training, and 3) SVM testing. First, we find that if the original dimensions (i.e., 14098 dimensions) are used to train and test the SVM model, it costs 3 hours 32 minutes 6 seconds in training and 1 minute 18 seconds in testing. For the dimensionality reduction itself, computing IG and CHI scores requires 1 minute 30 seconds and 1 minute 27 seconds, respectively, whereas PCA requires 91 hours 10 minutes 31 seconds. Unlike the feature selection methods, PCA achieves dimensionality reduction by transforming the original samples with binary values into new samples with numeric values, which makes the execution time much longer. For the SVM training time shown in Fig. 5, the cost of PCA increases dramatically as the feature dimension increases, and PCA is clearly more expensive than IG and CHI at every dimension, especially larger ones. For 10016 dimensions, PCA costs almost 4 hours to train the SVM model, which is even more than the training time (3 hours 32 minutes 6 seconds) using the original dimensions (14098 dimensions), whereas IG and CHI only cost about 1 hour to train the SVM model.
Figure 5. SVM training time of using different dimensionality reduction methods.
For the SVM test time, measured on average over 10 random test samples, IG and CHI cost 33 seconds and 35 seconds, respectively, whereas PCA costs 1 minute 9 seconds, as shown in Fig. 6. Therefore, although PCA has higher accuracy than IG and CHI when the reduced samples have fewer than 6016 dimensions, as shown in Fig. 4, its test time is higher than those of IG and CHI.
Figure 6. SVM test time of using different dimensionality reduction methods.

4.3 Comparisons and Discussions
Here, qualitative comparisons between our method and other existing methods are illustrated in Table 4. First, Gabrilovich and Markovitch verified that feature selection can improve SVM performance in text categorization, but they used a news dataset completely different from social contents [4]. Besides, the Outlier Count (OC), the number of features whose information gain is three standard deviations above the average, strongly correlates with the magnitude of improvement obtained through feature selection. At lower OC values, feature selection is highly beneficial; conversely, at higher OC values, feature selection degrades accuracy. Since the Reuters 21578 news dataset has a high OC value of 78, feature selection brings no improvement in accuracy there. Then, to improve the classification performance on short and sparse texts, Phan et al. collected 50 hidden topics from large-scale data collections such as Wikipedia and MEDLINE, and integrated them into the training data to reduce the impact of data sparseness [6]. However, web snippets differ from social contents: the words in the former are somewhat formal, whereas the words in the latter are more colloquial. In general, their method achieves good performance thanks to the topics generated from Wikipedia; for tweets, however, this may not hold, since the topics generated from tweet contents differ from those generated from Wikipedia. Finally, Sriram et al. used 8 domain-specific features extracted from author profiles and texts to classify tweets into 5 categories [7]. Although their study is similar to ours, their categories are completely different from ours. Besides, the 8 features they intentionally chose to classify those 5 categories are strongly correlated with the categories, whereas the explicit features extracted from tweets in this paper (especially hashtags) do not correlate so strongly with the news categories we use.
In summary, our study classifies tweets based on news categories; we not only propose an effective way to classify tweets, but also employ dimensionality reduction methods to reduce the high computation costs incurred in SVM training and testing. The experimental results show that using IG and CHI to reduce the dimensions to 10016 still keeps good performance while cutting the prediction time by more than 50%. Therefore, IG and CHI are better choices than PCA for tweet classification when feature selection is required.

Table 4. Comparisons among all methods
                        Gabrilovich and       Phan et al. [6]        Sriram et al. [7]     Ours
                        Markovitch [4]
Dataset                 Reuters 21578 news    Google 12340 snippets  Twitter 5407 tweets   Twitter 20564 tweets
Original feature dim.   -                     -(BOF)+50              6747(BOF)+8           14082(BOF)+16
Class                   10                    8                      5                     12
Dim. reduction          IG, CHI, BNS          No                     No                    PCA, IG, CHI
Classifier              SVM                   MaxEnt                 Naïve Bayes           SVM
Accuracy                92.2% (all features)  83.73%                 95% (only 8 features) 93.13%
BOF: Bag of Features; BNS: Bi-Normal Separation; -: unavailable
5. CONCLUSIONS
Facing comprehensive tweet contents, we propose an approach to classify them into the categories used by the New York Times, based on the implicit features (the traditional “bag of words”) extracted from tweet contents and the explicit features, including the 12 hash concepts generated from hashtags and the other tags in tweets. The experimental results show that using both the implicit and explicit features achieves the best accuracy in tweet classification; the experiments also verify that using hash concepts is more effective than using raw hashtags as the sample attributes. Besides, to reduce the computation costs incurred in model learning and prediction by the thousands of features, we also compare different dimensionality reduction methods on the feature spaces to inspect which method still keeps good performance in tweet classification. The experimental results show that dimensionality reduction methods can still achieve higher accuracy than the original dimensions in tweet classification. In the future, we will explore the topic distribution in classified tweets and further develop applications based on social content retrieval.
6. ACKNOWLEDGMENTS This work was supported by National Science Council of R.O.C. under grant MOST 103-2221-E-224-049.
7. REFERENCES
[1] Bird, S., Klein, E., and Loper, E. 2009. Natural Language Processing with Python. O'Reilly Media Inc.
[2] Chang, C. C. and Lin, C. J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2, 3, 27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] Forman, G. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research. 3, 1289-1305.
[4] Gabrilovich, E. and Markovitch, S. 2004. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the 21st International Conference on Machine Learning. 41-48.
[5] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. 2009. The WEKA data mining software: an update. SIGKDD Explorations Newsletter. 11, 1, 10-18.
[6] Phan, X. H., Nguyen, L. M., and Horiguchi, S. 2008. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web. 91-100.
[7] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and Demirbas, M. 2010. Short text classification in twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 841-842.
[8] Tong, S. and Koller, D. 2002. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research. 2, 45-66.
[9] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., and Li, X. 2011. Comparing twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval. 338-349.