Data-Driven Feature Word Selection for Clustering Online News Comments

Heeryon Cho
Jong-Seok Lee
School of Computer Science, Kookmin University, Seoul, South Korea
School of Integrated Technology & Yonsei Institute of Convergence Technology, Yonsei University, Incheon, South Korea

Abstract: Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can reduce the burden of manual analysis, but appropriate feature words must be selected for successful clustering. In this paper, we present a data-driven feature word selection method that yields structurally superior clustering of online comments. The 1,000 most frequent nouns appearing across the entire collection of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct the local noun set. The intersection of the local and overall noun sets is taken to construct the global noun set. The global noun set is removed from the corresponding local noun set to construct the distinct noun set. The 250 most frequent nouns are then selected from each of the local, global, and distinct noun sets for K-means clustering. The clustered results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Online comments clustered using distinct nouns produced structurally superior clusters compared to those clustered using local or global nouns.

Keywords: feature word selection; online news comments; K-means clustering; cluster validation index; text analysis

I. INTRODUCTION
One of the key characteristics that distinguish the pre- and post-social media eras is the way in which public opinion is shaped and viewed. Prior to the widespread usage of social media, mass media assumed the unique role of setting and disseminating public opinion [1]. However, everyday usage of social media has facilitated many-to-many direct sharing of individual opinions, giving rise to a drastic change in how the general public shapes and views public opinion. Nowadays, mass media is no longer the sole conveyor of public opinion; individuals are actively forming their own versions of public opinion by analyzing the opinions shared through social media. An exemplary case of public opinion formed over social media is the reader-provided comments attached to online news articles. Such online news articles cast light upon pressing social issues, and they often trigger heated discussion among interested readers via online comments. Much insight can be gained from such online comments, but the sheer number of texts that need to be reviewed precludes the reader from reading through all the comments. Fortunately, the large pool of online comments often contains similar opinions on the subject, and these can be grouped into a manageable number of clusters. By clustering the online comments into several groups of similar opinions, we can more efficiently understand and analyze the online comments as a whole.

In order to successfully cluster the vast number of online comments, the problem of data sparsity must be addressed, since the performance of clustering algorithms declines drastically under high dimensionality and data sparsity [2]. One way to reduce the dimensionality of the feature space is to perform feature selection. Existing works have proposed various methods for feature selection, including document frequency, information gain, χ2 statistics, term strength, term contribution, and term variance [3,4,5]. However, few have investigated a set-theoretic approach to feature selection. In this paper, we propose a simple yet effective set-theoretic approach to selecting feature words for clustering Korean online news comments. Simple set operations such as intersection and difference are performed on sets of common nouns to select relatively more specific feature words for better clustering. We demonstrate the effectiveness of our approach by quantitatively evaluating internal cluster validity using three cluster validation indices: Dunn [6], PBM [7], and Silhouette [8].
978-1-4673-8796-5/16/$31.00 2016 IEEE
II. ONLINE NEWS COMMENT
Naver (http://www.naver.com/) is one of the leading web portals in South Korea, and among its many services is Naver News (http://news.naver.com/), which provides online news to its visitors. At the bottom of each news article in Naver News is an online comment box which displays the readers’ online comments. Fig. 1 shows a sample online comment box placed at the bottom of a news article.
Fig. 1. Online comment box given in Naver News
BigComp 2016
For this study, we crawled the Korean online comments attached to 2,078 news articles covering politics, economy, and local news in 2013. In total, more than 7.44 million online comments were collected.

TABLE I. NEWS TITLE AND COMMENT SIZE OF NINE NEWS ARTICLES

No.  Article Date   News Title                                        News Category  Comment Size
1    29 Dec. 2012   Income Gap Widens                                 Economy        1,481
2    5 Jan. 2013    NIS^a Agent Feels No Remorse for Her Act          Politics       3,021
3    22 Jan. 2013   Increase in Rate of Elderly Support Accelerates   Economy        1,206
4    12 Feb. 2013   Effect of Cigarette Price Increase Analyzed       Local News     2,726
5    19 Mar. 2013   Government Introduces Grandma Childcare           Local News     3,197
6    12 Apr. 2013   Three-Strike-Out for Stalking Proposed            Local News     3,073
7    2 Jun. 2013    Congressmen’s Ugly Overseas Trip Scorned          Politics       1,741
8    5 Jun. 2013    Air Force Enforces Compulsory Smoking Ban         Politics       2,170
9    25 Dec. 2013   Personal Debt Write Off Raises Heated Debate      Economy        3,220
     Total Comment Size                                                              21,835

a. National Intelligence Service (NIS) is the chief intelligence agency of South Korea.

III. METHOD

A. Set Operation for Feature Word Selection
The raw online comments were fed to a Korean morphological analyzer (Korean Language Technology (KLT) version 2.3.0) to extract tokenized words and part-of-speech information. From the word tokens, only common nouns were used to create the overall noun set and the three types of noun sets: local, global, and distinct. Common nouns were chosen based on the intuition that they carry the key concepts contained in opinions. Hereafter, we refer to common nouns simply as nouns.

Each of the local, global, and distinct noun sets was created as follows. A frequency-sorted list of nouns was generated by first creating a unique noun list per online comment, then merging all comments’ unique noun lists, and finally sorting the merged list according to the unique nouns’ frequency. We used the entire 7.44 million online comments to create such a noun frequency list and took the top 1,000 most frequent nouns to prepare the overall noun set. The local noun set was obtained by creating a noun list from the online comments attached to each news article; that is, one local noun set was created per news article. The global noun set was obtained by taking the intersection of the overall and local noun sets. The distinct noun set was obtained by removing the global noun set from the corresponding local noun set. We then selected the top 250 most frequent nouns from each of the local, global, and distinct noun sets to construct the final three noun sets. The set relationship among the overall, local, global, and distinct noun sets is visualized in Fig. 2.
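The set operations of Section III-A can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the alphabetical tie-breaking, and the assumption that each comment has already been reduced to a list of nouns by a morphological analyzer are ours.

```python
from collections import Counter


def noun_frequencies(comments):
    """Count, for each noun, the number of comments in which it occurs."""
    counts = Counter()
    for nouns in comments:
        counts.update(set(nouns))  # unique noun list per comment
    return counts


def rank_nouns(counts):
    """Sort nouns by descending frequency (ties broken alphabetically, an assumption)."""
    return sorted(counts, key=lambda noun: (-counts[noun], noun))


def build_noun_sets(article_comments, overall_nouns, top_n=250):
    """Build the local, global, and distinct noun sets for one news article.

    article_comments: list of noun lists, one per comment of the article.
    overall_nouns: the overall noun set (top 1,000 nouns in the paper).
    """
    overall = set(overall_nouns)
    ranked_local = rank_nouns(noun_frequencies(article_comments))
    ranked_global = [n for n in ranked_local if n in overall]        # local ∩ overall
    ranked_distinct = [n for n in ranked_local if n not in overall]  # local − global
    return {
        "local": ranked_local[:top_n],
        "global": ranked_global[:top_n],
        "distinct": ranked_distinct[:top_n],
    }
```

For example, with three toy comments and an overall set of {"elderly", "tax"}, `build_noun_sets(..., top_n=2)` keeps the generic nouns in the global set and pushes article-specific nouns such as "chaebol" into the distinct set.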
Fig. 2. Set-relation among overall, local, global, and distinct noun set

In essence, the local noun set corresponds to selection by document frequency, since document frequency counts the number of documents in which a term occurs. What differs is that we select various subsets of the document frequency list in relation to the overall document frequency list.

For the experiment, nine news articles and their online comments were selected. The titles of the nine news articles and the sizes of their online comments are listed in Table I. A total of 21,835 comments were used in the clustering experiment. For each news article, three types of 250-word vectors were constructed using the local, global, and distinct noun sets. The open source statistical software R [9] was used for clustering. Specifically, we performed K-means clustering using the Hartigan-Wong algorithm [10]. The maximum number of iterations for convergence was set to 300 (iter.max=300), while random start was performed 100 times (nstart=100).

B. Cluster Validation Index
In order to quantitatively assess cluster quality, we used three cluster validation indices, namely, Dunn [6], PBM [7], and Silhouette [8]. Each measure was calculated using R’s ‘clusterCrit’ package [11]. These three indices reward structurally better clustered results by giving higher values to results that have small within-cluster distances and large between-cluster distances. The value of each measure is calculated as follows.

1) Dunn: Let $d_{min}$ denote the minimal distance between points of different clusters and $d_{max}$ the largest within-cluster distance. The Dunn index ($C_{Dunn}$) is defined as the quotient of $d_{min}$ and $d_{max}$:

$$ C_{Dunn} = \frac{d_{min}}{d_{max}} $$

2) PBM: Let $D_B$ denote the largest distance between two cluster barycenters. Let $E_W$ denote the sum of the distances of the points of each cluster to their barycenter, and let $E_T$ denote the sum of the distances of all the points to the barycenter $G$ of the entire data set. The PBM index ($C_{PBM}$) is defined as follows:

$$ C_{PBM} = \left( \frac{1}{K} \times \frac{E_T}{E_W} \times D_B \right)^2 $$

$K$ indicates the number of clusters. Note that $E_T$ is a constant which does not depend on the partition or on the number of clusters.
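As a rough illustration of the Dunn and PBM definitions in Section III-B, the following Python sketch computes both indices for a small labeled point set. It is not the ‘clusterCrit’ implementation used in the paper; the function names and the choice of Euclidean distance are assumptions made for the example, and at least two clusters (one of them with more than one point) are assumed.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def barycenter(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def dunn_index(points, labels):
    """C_Dunn = d_min / d_max over all pairs of points."""
    d_min, d_max = math.inf, 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean(points[i], points[j])
            if labels[i] == labels[j]:
                d_max = max(d_max, d)  # largest within-cluster distance
            else:
                d_min = min(d_min, d)  # smallest between-cluster distance
    return d_min / d_max


def pbm_index(points, labels):
    """C_PBM = ((1 / K) * (E_T / E_W) * D_B) ** 2."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centers = {label: barycenter(members) for label, members in clusters.items()}
    g = barycenter(points)
    e_t = sum(euclidean(p, g) for p in points)
    e_w = sum(euclidean(p, centers[label]) for p, label in zip(points, labels))
    keys = list(centers)
    d_b = max(euclidean(centers[a], centers[b])
              for i, a in enumerate(keys) for b in keys[i + 1:])
    return (e_t * d_b / (len(centers) * e_w)) ** 2
```

Both functions grow with larger between-cluster separation and shrink with larger within-cluster spread, which matches the reading of the indices given in the text.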
3) Silhouette: For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. Let $b(i)$ be the lowest average dissimilarity of $i$ to any other cluster of which $i$ is not a member. Then the silhouette width $s(i)$ can be calculated as follows:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $$

The mean of the silhouette widths for a given cluster $I_k$ is called the cluster mean silhouette and is denoted as $S_k$:

$$ S_k = \frac{1}{n_k} \sum_{i \in I_k} s(i) $$

where $n_k$ is the number of data in cluster $I_k$. Finally, the global silhouette index is the mean of the mean silhouettes through all the clusters:

$$ C_{Silhouette} = \frac{1}{K} \sum_{k=1}^{K} S_k $$

The higher the calculated values of Dunn, PBM, and Silhouette, the better the clustered result in terms of cluster structure.

IV. RESULTS

A. Quantitative Evaluation
The number of clusters was varied from two to six for each news article in the experiment. Although choosing an appropriate number of clusters is in itself a major issue in clustering, the online news comment clustering problem imposes a natural restriction: since our goal is to present a manageable number of clustered results, we presume two to six clusters to be adequate. To see how well the clusters were structured, Dunn, PBM, and Silhouette values were calculated for each number of clusters on each news article’s online comments. Figure 3 shows the Dunn, PBM, and Silhouette values averaged across the nine news articles’ online comments for each number of clusters. The horizontal axis indicates the number of clusters: e.g., ‘c2’ indicates two clusters. The vertical axis indicates the index values averaged over the nine news articles’ online comments. We see that the clustered results using the distinct noun set perform best among the three noun sets.

Fig. 3. Distinct, local, and global noun sets’ Dunn, PBM, and Silhouette values averaged across nine news articles

Figure 4 shows the normalized and averaged values of the three indices for each news article’s online comments. Here we also see that the distinct noun set performs the best. In both Figs. 3 and 4, the online comments clustered using the distinct noun set exhibit the highest internal cluster validation index values across all numbers of clusters and across the various news articles’ online comments.

B. Qualitative Evaluation
Table II shows the clustered result with five clusters for the online comments of article no. 3 in Table I, “Increase in Rate of Elderly Support Accelerates”. The rightmost ‘Shared’ column contains nouns shared by all three types of nouns. The top ten most frequent nouns, excluding the shared nouns, appearing within each cluster of each noun type are listed in both English (top) and Korean (bottom). We see that more specific concepts such as chaebol, post-retirement preparation, euthanasia, private education expenses, capitalism, and robot appear in the distinct noun set. We think that such differentiation in the feature words allows the between-cluster distance to become greater. Note that the verb do and the adverb truly are mistakenly introduced as nouns. These errors are due to KLT, but they are few. The cluster sizes also differ for the distinct nouns; a more dramatic size difference is observed. For example, cluster 2 contains 1,031 comments (315+716) for the distinct noun set, but 796 (80+716) and 799 (83+716) comments for the local and global noun sets, respectively. This tendency is also observed in other clustered results, and we believe it reflects the majority vs. minority opinion formation observed in online news comments.
Fig. 4. Normalized and averaged index values for individual news articles
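The evaluation procedure described in Sections III and IV-A (cluster each article's comment vectors with two to six clusters, then score every partition) can be sketched roughly as follows. Note the assumptions: the paper uses R's Hartigan-Wong K-means and the ‘clusterCrit’ package, whereas this self-contained sketch swaps in a plain Lloyd-style K-means with random restarts, uses Euclidean distance, assigns a silhouette width of 0 to singleton clusters, and uses hypothetical function names throughout.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, n_start=100, max_iter=300, seed=0):
    """Lloyd-style K-means with random restarts (not Hartigan-Wong)."""
    rng = random.Random(seed)
    best_labels, best_cost = None, math.inf
    for _ in range(n_start):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:  # an emptied cluster keeps its old center
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def silhouette_index(points, labels):
    """Mean over clusters of the cluster mean silhouettes S_k."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    cluster_means = []
    for label, members in clusters.items():
        widths = []
        for i in members:
            if len(members) == 1:
                widths.append(0.0)  # singleton convention (assumption)
                continue
            a = sum(dist(points[i], points[j]) for j in members if j != i) / (len(members) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for m, other in clusters.items() if m != label)
            widths.append((b - a) / max(a, b))
        cluster_means.append(sum(widths) / len(widths))
    return sum(cluster_means) / len(cluster_means)


def evaluate(points, ks=range(2, 7)):
    """Silhouette index of the K-means partition for each number of clusters."""
    return {k: silhouette_index(points, kmeans(points, k)) for k in ks}
```

Running `evaluate` once per noun set (local, global, distinct) on the same article and comparing the returned values per number of clusters mirrors the kind of comparison summarized in Fig. 3, under the assumptions stated above.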
V. CONCLUSION
We presented a simple yet effective set-theoretic feature selection method for structurally superior clustering of Korean online news comments. The evaluation experiment using three internal cluster validation indices, Dunn, PBM, and Silhouette, showed that the distinct noun set formed structurally better clusters than the local and global noun sets.
TABLE II. COMPARISON OF TOP TEN FREQUENT NOUNS IN DISTINCT, LOCAL, AND GLOBAL NOUNS’ CLUSTERED RESULTS

Each entry gives the feature type and cluster size, followed by the frequency-based ranked words (English / Korean).

Cluster 1
Distinct (size 85): support, income, chaebol, elderly support, youth, employment, post-retirement preparation, birthrate, permanent position, childbirth / 부양, 소득, 재벌, 노인부양, 청년, 취직, 노후준비, 출산율, 정규직, 출산
Local (size 79): support, elderly, offspring, parents, parents, now, problem, thought, people, us / 부양, 노인, 자식, 부모, 부모님, 지금, 문제, 생각, 국민, 우리
Global (size 64): youth, elderly, now, thought, us, government, welfare, world, generation, do / 젊은이, 노인, 지금, 생각, 우리, 정부, 복지, 세상, 세대, 하지
Shared (size 35): youth, support, elderly, us, offspring, old people, do, national pension, pension, self / 젊은이, 부양, 노인, 우리, 자식, 늙은이, 하지, 국민연금, 연금, 자기

Cluster 2
Distinct (size 315): temporary position, euthanasia, birthrate, post-retirement, welfare for the aged, old age, employment, each, elderly support, condolences / 비정규직, 안락사, 출산률, 노후, 노인복지, 고령, 취직, 각자, 노인부양, 애도
Local (size 80): youth, birthrate, take care, us, self, government, thought, now, offspring, parents / 젊은이, 출산율, 모시, 우리, 자기, 정부, 생각, 지금, 자식, 부모님
Global (size 83): parents, offspring, parents, self, problem, thought, now, reality, truly, welfare / 부모님, 자식, 부모, 자기, 문제, 생각, 지금, 현실, 진짜, 복지
Shared (size 716): now, our country, offspring, thought, parents, tax, parents, marriage, government, self / 지금, 우리나라, 자식, 생각, 부모님, 세금, 부모, 결혼, 정부, 자기

Cluster 3
Distinct (size 25): birthrate, private education expenses, education expenses, children, hereafter, condolences, system, our parents, emigration, life expectancy / 출산율, 사교육비, 교육비, 자녀, 향후, 애도, 시스템, 우리부모님, 이민, 수명
Local (size 78): human, elderly, now, tax, parents, marriage, people, thought, country, youth / 사람, 노인, 지금, 세금, 부모, 결혼, 국민, 생각, 나라, 젊은이
Global (size 78): human, elderly, now, tax, parents, thought, marriage, people, country, benefit / 사람, 노인, 지금, 세금, 부모, 생각, 결혼, 국민, 나라, 혜택
Shared (size 4): birthrate, human, solved, elderly, government, welfare, here, employment, needs, country / 출산율, 사람, 해결, 노인, 정부, 복지, 여기, 취업, 필요, 국가

Cluster 4
Distinct (size 20): take care, support, tuition, pay, respect, underage children, must emigrate, capitalism, fund, North Korea / 모시, 부양, 학비, 내야, 공경, 어린이, 이민가야, 자본주의, 자금, 이북
Local (size 116): elderly, youth, now, welfare, country, thought, offspring, government, parents, problem / 노인, 젊은이, 지금, 복지, 나라, 생각, 자식, 정부, 부모, 문제
Global (size 124): elderly, offspring, now, parents, problem, thought, country, us, government, welfare / 노인, 자식, 지금, 부모, 문제, 생각, 국가, 우리, 정부, 복지
Shared (size 3): elderly, take care, parents, ultimately, thought, beggar, long time, politician, public servant, live / 노인, 모시, 부모, 결국, 생각, 거지, 오래, 정치인, 공무원, 살기

Cluster 5
Distinct (size 3): plan, local taxes, small & medium-sized businesses, national tax, robot, research & development, old-age pension / 기획, 지방세, 중견중소기업, 국세, 로봇, 연구개발, 노령연금
Local (size 95): country, now, our country, really, thought, Republic of Korea, offspring, youth, us, do / 나라, 지금, 우리나라, 정말, 생각, 대한민국, 자식, 젊은이, 우리, 하지
Global (size 99): country, now, our country, Republic of Korea, really, thought, people, tax, welfare, Japan / 나라, 지금, 우리나라, 대한민국, 정말, 생각, 국민, 세금, 복지, 일본
Shared (size 0): -

ACKNOWLEDGMENT
This work was in part supported by the Convergence Research Center (CRC) Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (MSIP), Korea (NRF2015R1A5A7037615), and in part by the Ministry of Science, ICT and Future Planning (MSIP), Korea, under the “IT Consilience Creative Program” (IITP-2015-R0346-15-1008) supervised by the Institute for Information & Communications Technology Promotion (IITP).

REFERENCES
[1] M. McCombs, Setting the Agenda: The Mass Media and Public Opinion, Polity, 2013.
[2] C. C. Aggarwal and P. S. Yu, “Finding generalized projected clusters in high dimensional spaces,” in Proc. of the 2000 ACM SIGMOD Int’l Conf. on Management of Data (SIGMOD ’00), pp. 70-81, ACM, New York, NY, USA, 2000.
[3] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proc. of the 14th Int’l Conf. on Machine Learning (ICML ’97), pp. 412-420, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[4] T. Liu, S. Liu, and Z. Chen, “An evaluation on feature selection for text clustering,” in Proc. of the 20th Int’l Conf. on Machine Learning (ICML ’03), pp. 488-495, Washington DC, 2003.
[5] L. Liu, J. Kang, J. Yu, and Z. Wang, “A comparative study on unsupervised feature selection methods for text clustering,” in Proc. of 2005 IEEE Int’l Conf. on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE ’05), pp. 597-601, 2005.
[6] J. Dunn, “Well separated clusters and optimal fuzzy partitions,” Journal of Cybernetics, 4, pp. 95-104, 1974.
[7] S. Bandyopadhyay, M.K. Pakhira, and U. Maulik, “Validity index for crisp and fuzzy clusters,” Pattern Recognition, 37, pp. 487-501, 2004.
[8] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, 20, pp. 53-65, 1987.
[9] R Core Team, “R: A Language and Environment for Statistical Computing,” R Foundation for Statistical Computing, Vienna, Austria, 2013.
[10] J. Hartigan and M. Wong, “A K-Means Clustering Algorithm,” Applied Statistics, 28, pp. 100-108, 1979.
[11] B. Desgraupes, “Clustering Indices,” Package clusterCrit for R, Lab Modal’X, University Paris Ouest, 2013. (Dunn, p. 10-11; PBM, p. 14; Silhouette, p. 18)