2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Preprocessing of Slovak Blog Articles for Clustering
Tomas Kuzar
Pavol Navrat
Institute of Informatics and Software Engineering FIIT STU Bratislava, Slovakia
[email protected]
Institute of Informatics and Software Engineering FIIT STU Bratislava, Slovakia
[email protected]
Abstract—Web content clustering is an important part of the topic detection and tracking problem. In this paper we focus on the preprocessing phase of web content clustering, specifically for blog articles published in the Slovak language. We evaluate the impact of different data preprocessing methods on the success of blog clustering. We found that applying various text manipulation techniques during preprocessing can improve the quality of the resulting clusters. Cluster quality is measured by traditional clustering metrics: precision, recall and F-measure.
Keywords: Text Preprocessing, Categorization, Text Mining
I. INTRODUCTION
Business users often ask: how can we best exploit the potential of social media? Social media are growing in influence very quickly and consist of an enormous amount of user-generated content. This information is already being used for many purposes: entertainment, brand building, marketing, sales support, web mining, information search, and so on. A large number of methods and algorithms are available for processing such volumes of social media data. We found that business areas such as Customer Relationship Management, Product Management and Marketing can benefit from knowledge retrieved from social media: the better these areas know their target audience, the more tailor-made the solutions they prepare can be. To obtain internet-based information about their customers, businesses need to gather, process and analyze information from social media and integrate it into their internal enterprise applications.

In this paper we focus on processing the information gathered from social media. We collect articles from a Slovak blogging portal and manually annotate them according to their main topic. Unstructured articles must be converted into a structured representation before machine learning techniques can be applied; this transformation is called preprocessing. After the blog data are preprocessed, an unsupervised clustering technique can be applied to divide the blogs into clusters. We found that applying various text manipulation techniques during preprocessing can significantly improve the quality of the clusters. Cluster quality is measured by traditional clustering metrics: precision, recall and F-measure.

The paper is structured as follows: the related work section presents the state of the art in data preprocessing. Section three describes our dataset and the data preprocessing techniques we use. Section four presents the results of the experiments. The last section concludes our findings and proposes future work.
II. RELATED WORK

Data preprocessing consists mainly of term extraction and term selection. Basic term extraction can be performed by tokenization, where terms are delimited by whitespace. The authors of [2] addressed the problem of compound splitting. Other text segmentation tasks are related to multiple-word alignment [6], phrase segmentation and named entity segmentation. N-gram segmentation can also be used to cope with multi-word terms. Most languages attach suffixes and infixes to terms; normalization extracts the base forms of the terms and is language dependent. A widely used stemmer for the English language was developed by Martin Porter, and various adaptations of stemmers have been developed and used since. Stemming removes affixes from words algorithmically and converts each term to its stem. Lemmatization identifies, for each inflected word form in a document or query, its basic form, the lemma. The disadvantage of lemmatization is that terms not contained in the dictionary cannot be lemmatized. Another problem of both stemming and lemmatization is ambiguity, which can be addressed by N-gram analysis or by stochastic algorithms. Word-based N-gram analysis splits text into N-grams, so that each term carries its context; this context can drive a heuristic during stemming, and when the heuristic is probability-based, stochastic algorithms for term extraction are used. According to [7], [8], the choice of normalization technique depends not only on the language but also on the application domain and the machine learning method. Different techniques can be applied alone or in combination.

Usually the number of extracted terms is too high and only the most important terms should be selected. According to the literature, several term selection methods can be applied: the authors of [7] used principal component analysis (PCA) for dimensionality reduction, while other authors [8] use the basic TF-IDF method. In our experiments we use the LDA probabilistic topic model [1] for dimensionality reduction. Document representation is important in many text-related tasks; we decided to represent each document by its LDA topics. Our broader aim is to create a model of discussion behavior on a blog portal in order to plan article publishing effectively and to track activity in industry segments on the web.
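To make this document representation concrete, the following is a minimal sketch of deriving an article-topic matrix with LDA. The paper does not name an implementation, so the gensim library, the toy tokenized articles and the topic count are all assumptions here.

```python
# Minimal sketch: representing articles as LDA topic mixtures and
# collecting them into an article-topic matrix (gensim is assumed).
from gensim import corpora, models

# Toy tokenized "articles" (hypothetical; the real input is the
# tokenized, preprocessed Slovak blog corpus).
tokenized = [
    ["vlak", "stanica", "meskanie", "doprava"],
    ["skola", "ziak", "ucitel", "trieda"],
    ["kostol", "viera", "modlitba", "omsa"],
]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train a small LDA model; num_topics would be 4 or 10 in the paper's setup.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=20)

# Article-topic matrix: one row of topic probabilities per article.
article_topic = [
    [prob for _, prob in lda.get_document_topics(doc, minimum_probability=0.0)]
    for doc in bow_corpus
]
for row in article_topic:
    print(row)
```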
A. Clustering and classification

Text classification and clustering are widely used methods in text-related tasks performed on web data. The authors of [3] dealt with domain categorization, the authors of [4] concentrated on categorizing private versus public blogs, and the team in [5] focused on categorizing product features. A widely used method is K-Nearest Neighbors (KNN), which we use in our experiments.

III. OUR APPROACH

Our dataset consists of articles and blogs of the Slovak newspaper publisher www.sme.sk. It includes over 30,000 articles and 10,000 blogs. As depicted in Figure 1, most blogs are published during the working week.

Figure 1. Publishing activity from May to July. [Bar chart of the number of published blogs per weekday (Monday to Sunday) for May, June and July; counts range from 0 to about 900.]

Blog discussions follow a similar trend: most blog comments are also added during the working week. Blog discussions represent specific contextual information about the text; the discussion count is taken into account in preprocessing later (subsection E).

Figure 2. Daily discussion counts from May to June. [Bar chart of the number of discussion posts per weekday (Monday to Sunday) for May and June; counts range from 0 to about 15,000.]

After downloading the articles we manually picked representative ones and labeled them with four labels: traffic, education, religion and complaints. This labeled blog subset is the input. As a first step we tokenized the dataset. We then applied some of the preprocessing methods described below, and used the resulting articles as input for building an LDA model. One of the outputs of the LDA model is an article-topic matrix, which served as the input for the KNN algorithm (mentioned in section II), where we set the number of clusters to build. Finally, we compared the built clusters with the labeled corpus and calculated precision (1), recall (2) and F-measure (3):

Precision = \frac{truepositives}{truepositives + falsepositives}    (1)

Recall = \frac{truepositives}{truepositives + falsenegatives}    (2)

F\text{-}measure = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (3)
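As a worked illustration of equations (1)-(3), the following sketch computes the three metrics for a single cluster evaluated against a single manual label; the counts are hypothetical.

```python
# Worked sketch of equations (1)-(3) for one cluster evaluated against
# one manually assigned label; all counts are hypothetical.
def precision(tp, fp):
    return tp / (tp + fp)                      # equation (1)

def recall(tp, fn):
    return tp / (tp + fn)                      # equation (2)

def f_measure(p, r):
    return 2 * p * r / (p + r)                 # equation (3)

# Suppose a cluster contains 40 articles labeled 'traffic' and 10 others,
# while 15 'traffic' articles ended up in other clusters.
p = precision(tp=40, fp=10)    # 0.800
r = recall(tp=40, fn=15)       # 0.727
print(p, r, f_measure(p, r))   # F-measure ~ 0.762
```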
We applied the steps above (term extraction, LDA model building, K-Nearest Neighbors clustering and calculation of the clustering metrics) several times, using different term extraction methods and combinations of them. We used four methods to enhance term extraction: usage information consideration, lemmatization, taxonomy-based term extraction, and an extraction method based on the lexical similarity of terms. Our data preprocessing process is depicted in Figure 3; a sketch of how such variants can be composed follows.
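The following is a hypothetical illustration of how the compared preprocessing variants can be composed as interchangeable steps applied in order before the LDA and clustering stages; the function names, configurations and stop-word list are ours, not the paper's.

```python
# Hypothetical composition of preprocessing variants: each configuration
# is a list of term extraction steps applied in order before LDA.
STOP_WORDS = {"a", "ale", "na", "v", "sa"}     # illustrative subset

def tokenize(text):
    return text.lower().split()

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

CONFIGURATIONS = {
    "basic": [tokenize],
    "enhanced": [tokenize, remove_stop_words],  # + lemmatization in the paper
}

def preprocess(text, steps):
    data = text
    for step in steps:
        data = step(data)
    return data

print(preprocess("Vlaky na stanici a meskanie", CONFIGURATIONS["enhanced"]))
# ['vlaky', 'stanici', 'meskanie']
```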
Figure 3. Data preprocessing tasks. [Diagram of the preprocessing pipeline and its variants.]

Figure 3 also shows an alternative approach to preprocessing Slovak articles: the articles can be translated into English using the Google Translate service and afterwards processed with the various methods developed for the English language.

A. Stop words identification

We created a list of terms that are filtered out before the term extraction process starts. The list mainly includes conjunctions, prepositions and pronouns.

B. Lemmatization

For lemmatization we used a Slovak dictionary with more than 100,000 lemmas. A problem of a lemmatization dictionary is the ambiguity of word forms: e.g., the lemma for the word form 'je' can be either 'byť' or 'jesť'. We resolve the ambiguity by using a fixed one-to-one replacement.
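The following is a minimal sketch of subsections A and B combined: stop word filtering followed by dictionary lemmatization with a fixed one-to-one replacement for ambiguous forms. The word lists are tiny illustrative fragments, not the paper's actual resources.

```python
# Sketch of subsections A and B: stop word filtering, then dictionary
# lemmatization with one fixed lemma per word form (the one-to-one
# replacement used to sidestep ambiguity).
STOP_WORDS = {"a", "ale", "na", "v", "to"}

# Hypothetical fragment of the >100,000-entry lemma dictionary; the
# ambiguous form 'je' is fixed to the single lemma 'byť'.
LEMMAS = {"je": "byť", "vlaky": "vlak", "meškali": "meškať"}

def stop_and_lemmatize(tokens):
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]     # unknown forms stay as-is

print(stop_and_lemmatize(["vlaky", "meškali", "a", "je", "to", "problém"]))
# ['vlak', 'meškať', 'byť', 'problém']
```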
C. Taxonomy-based term extraction

In our research we use the Eurovoc taxonomy (http://europa.eu/eurovoc), which is known for its multilingualism. The main advantage of using Eurovoc is the existence of its Slovak mutation; its characteristics are summarized in Table I.

TABLE I. CHARACTERISTICS OF THE EUROVOC TAXONOMY

Object Name                Num. records
Domains                    10
General categories         127
Basic expressions          6797
Connections                7132
Hierarchical connections   6825
Associative connections    4814
Synonyms                   6386
Language mutations         23

Taxonomy-based term extraction searches for Eurovoc terms in the articles and, after an exact match, replaces the matched term with a more general term from the Eurovoc hierarchy.
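A minimal sketch of this replacement step follows; the child-to-broader-term mapping is a hypothetical fragment, not actual Eurovoc data.

```python
# Sketch of subsection C: exact matching of article terms against a
# hypothetical fragment of the Eurovoc hierarchy, replacing a matched
# term with its broader (more general) term.
EUROVOC_BROADER = {
    "železnica": "doprava",        # railway        -> transport
    "autobus": "doprava",          # bus            -> transport
    "gymnázium": "vzdelávanie",    # grammar school -> education
}

def generalize(terms):
    return [EUROVOC_BROADER.get(t, t) for t in terms]

print(generalize(["železnica", "meškanie", "gymnázium"]))
# ['doprava', 'meškanie', 'vzdelávanie']
```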
D. Lexical classes building

The Slovak language uses a large number of suffixes. We created a method, which can be considered a basic stemmer for Slovak, for grouping lexically similar terms into one term. We calculate lexical similarity only for terms longer than three characters: if two terms agree on more than 75% of their length, they are mapped onto the same lexical term.
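A minimal sketch of this similarity test follows, assuming the agreement is measured on a shared prefix (which matches the suffix-rich morphology of Slovak); the paper does not spell out the comparison direction.

```python
# Sketch of subsection D: two terms longer than three characters are
# grouped into one lexical class when they agree on more than 75% of
# the term length. Prefix agreement is assumed here.
def same_lexical_class(a, b):
    if min(len(a), len(b)) <= 3:
        return False
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b)) > 0.75

print(same_lexical_class("učiteľka", "učiteľke"))  # True:  7/8 shared
print(same_lexical_class("vlakom", "vlaky"))       # False: 4/6 shared
```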
E. Usage information consideration

During the information gathering process we downloaded not only the articles but also the counts of their discussion posts. We suppose that the discussion post count can improve the quality of the clusters, since some topics have a higher average discussion count than others. We added the discussion count to the article-topic matrix; the evaluation of this step is out of the scope of this paper.
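A minimal sketch of this enrichment follows, assuming the count is scaled and appended as an extra matrix column; the scaling choice and the numbers are ours.

```python
# Sketch of subsection E: appending each article's discussion post count
# (scaled to [0, 1]) as an extra column of the article-topic matrix.
import numpy as np

article_topic = np.array([[0.7, 0.2, 0.1],    # toy topic mixtures
                          [0.1, 0.8, 0.1]])
discussion_counts = np.array([120.0, 15.0])   # hypothetical post counts

scaled = discussion_counts / discussion_counts.max()
enriched = np.hstack([article_topic, scaled.reshape(-1, 1)])
print(enriched)
# [[0.7   0.2   0.1   1.   ]
#  [0.1   0.8   0.1   0.125]]
```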
IV. EXPERIMENTS AND EVALUATION

Figure 4 shows the experimental clustering results, measured by F-measure, for several combinations of the term extraction methods described in section III. It depicts four preprocessing settings: basic processing (tokenization only), enhanced processing (stop word identification and lemmatization), eurovoc (stop word identification, lemmatization and taxonomy-based term extraction) and lexical classes (stop word identification and lexical classes building).

Figure 4. Quality of topic clusters. [Bar chart of F-measure (roughly 0.4 to 0.9) for the four preprocessing settings, on the 4-topic and the 10-topic sets.]
We performed experiments on two sets: 4 topics and 10 topics. As depicted in Figure 4, basic and enhanced processing increased the F-measure on both sets. The eurovoc and lexical classes settings were successful only in the 4-topic case and decreased the F-measure in the 10-topic case. The reason for the decrease on the 10-topic dataset is the domain-specific dictionary, which does not cover all ten topics.
V. CONCLUSION
In this paper we focused on the term extraction methods applied in the preprocessing phase of blog-oriented text mining. Our experiments consisted of several steps: term extraction, LDA-based term selection, clustering and F-measure-based evaluation. We tried several combinations of term extraction methods and found that only lemmatization consistently enhanced the quality of the clusters. We focused on blog articles published in the Slovak language. The development of preprocessing methods suitable for Slovak required a considerable amount of effort,
and moreover some tasks depend on dictionaries that are not always available. We aim to compare the preprocessing methods suitable for Slovak with English preprocessing methods applied to automatically translated Slovak texts, as depicted in Figure 3. As future work we aim to examine the terms and conditions of using Google Translate for Slovak blog clustering in more detail. Moreover, we want to focus on a term extraction method enriched by extensive knowledge of the blogosphere.

ACKNOWLEDGMENT

This work was partially supported by the grants VEGA 1/0508/09 and KEGA 345-032STU-4/2010, and it is a partial result of the Research & Development Operational Programme project Support of Center of Excellence for Smart Technologies, Systems and Services, ITMS 26240120029, co-funded by the ERDF.

REFERENCES

[1] Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
[2] Khaitan, S.: Data-driven compound splitting method for English compounds in domain names. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM Press, 2009, pp. 207-214.
[3] Hashimoto, Ch.: Blog categorization exploiting domain dictionary and dynamically estimated domains of unknown words. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 2008, pp. 69-72.
[4] Elgersma, E.: Personal vs non-personal blogs. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2008, pp. 723-724.
[5] Guo, H.: Product feature categorization with multilevel latent semantic association. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM, 2009, pp. 1087-1096.
[6] Bhargava, A.: Multiple word alignment with profile hidden Markov models. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, 2009, pp. 43-48.
[7] Korenius, T.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, ACM Press, 2004, pp. 625-633.
[8] Sun, L.: User-driven development of text mining resources for cancer risk assessment. In: Proceedings of the Workshop on BioNLP, 2009, pp. 108-116.