Enhancing Concept Based Modeling Approach for Blog Classification
Ramesh Kumar Ayyasamy1, Saadat M. Alhashmi1, Siew Eu-Gene2, and Bashar Tahayna1
1 School of Information Technology, 2 School of Business, Monash University, Malaysia
{ramesh.kumar,alhashmi,siew.eu-gene,bashar.tahayna}@monash.edu
Abstract. Blogs are user-generated content that discusses a wide variety of topics. Over the past 10 years, social web content has grown at a fast pace, and research projects are looking for ways to channel this information using text classification techniques. Existing classification techniques follow only boolean (or crisp) logic. This paper extends our previous work with a framework in which fuzzy clustering is optimized with fuzzy similarity to perform blog classification. Wikipedia, a knowledge base widely accepted by the research community, was used for feature selection and classification. Our experimental results indicate that the proposed framework significantly improves precision and recall in classifying blogs.
Keywords: Blog classification, Wikipedia, Fuzzy clustering, Fuzzy similarity.
1 Introduction
Blog classification plays an important role in many information management and retrieval tasks. It refers to the task of assigning blogs to one or more predefined categories. Blog classification is more cumbersome than ordinary web page classification, because blog posts are updated frequently and bloggers disseminate information and present their ideas on a wide range of topics. According to Blogpulse (http://www.blogpulse.com) statistics, there are 169.56 million identified blogs, with 96,488 new blogs and 502,525 blogposts indexed in the last 24 hours. This shows that weblogs are growing rapidly and form a rich source of information that calls for efficient and effective automatic classification and categorization techniques.

Text clustering has received significant attention in recent years in machine learning and text mining applications such as web pages and blogs. Text clustering typically builds on the vector space model [1], in which the entire text collection is represented as a term-by-document matrix. Clustering of blogposts enables automatic categorization and facilitates certain types of blog search. Any clustering method requires the documents to be embedded in a suitable similarity space, so that documents that are similar to each other lie close together. In contrast with boolean logic, where binary sets have two-valued logic, fuzzy logic variables may have a truth value that ranges in degree between 0 and 1. Fuzzy logic is therefore a many-valued logic [2]. The difference between fuzzy clustering and regular clustering is that in the former each data point can belong to more than one cluster, with its degree of belonging fuzzified in accordance with certain membership functions.

Although fuzzy C-means clustering has not been widely researched for blog classification, there is a considerable amount of literature [5-7] on document clustering. Mendes and Sacks [5] analyzed various clustering methods to discover relevant document relationships. Their experiments with various test documents [5] showed that fuzzy C-means clustering performs better than the hard K-means algorithm. Miyamoto [6] developed a method using fuzzy clustering for fuzzy multisets. This work used two dissimilarity measures for computing cluster centers. Saraçoglu et al. [7] used fuzzy clustering to find similar documents. This work is similar to ours, but their clustering is done using training documents, whereas our work is automatic and does not depend on training documents. Widyantoro and Yen [8] adopted a fuzzy similarity approach that originated from Rocchio's algorithm for text classification. This research work [8] used only the fuzzy term-category relation to predict the category of documents.

Despite the widespread adoption of clustering, only a few studies have investigated using Wikipedia as a knowledge base for document clustering [10-13]. Gabrilovich and Markovitch [10] applied the structural knowledge repository Wikipedia as a feature generation technique. Their work confirmed that features generated from Wikipedia background knowledge can help text categorization. In our previous work [11] we followed two-valued (boolean) logic and utilized Wikipedia to index and classify text documents. Huang et al. [12] clustered documents using Wikipedia knowledge and treated each Wikipedia article as a concept. This work extracted related terms such as synonyms, hyponyms and associated terms. Hu et al. [13] used Wikipedia concept features for document clustering.

In our work, we use concepts derived from Wikipedia to classify blogposts. In this paper, we develop a framework in which fuzzy clustering is optimized with fuzzy similarity for blog classification. Wikipedia, a knowledge base widely accepted by the research community, was used for our classification. The remainder of the paper is organized as follows: Section 2 describes our proposed framework, and in Section 3 we evaluate the proposed framework with a real data set and discuss the experimental results. Finally, we conclude the paper in Section 4.

Y. Wang and T. Li (Eds.): Knowledge Engineering and Management, AISC 123, pp. 409–416. springerlink.com © Springer-Verlag Berlin Heidelberg 2011
2 Our Proposed Framework
We present our fuzzy clustering framework (Figure 1), which classifies blogposts in several stages using fuzzy clustering and fuzzy similarity. Our contribution is threefold: (i) we use Wikipedia to find the membership values of categories; (ii) from the blogposts, we find the membership values of concepts using n-gram based concept extraction; and (iii) we optimize fuzzy C-means clustering with fuzzy similarity to perform blog classification.

2.1 n-gram Based Concept Extraction
We use a relatedness measurement method that computes the relatedness among words based on co-occurrence analysis. In the simplest case, we map each word to a concept, and the combination of two or more concepts/words can create a new concept (a compound concept). For example, assume A is a word mapped to a concept Con1, and B is a word mapped to a concept Con2. Then there is a probability that the combination of A and B produces a new concept. Consider A = Yellow (a Color concept) and B = River (a Nature concept). Then A+B → "Yellow River", a geographical place in China (a Place concept), whereas B+A → "River Yellow" yields no concept.

[Figure 1: framework diagram. Components shown: Blog posts; Wikipedia; n-gram based concept extraction; fuzzy blog post-concept relation; fuzzy concept-category relation; membership value of concepts; membership value of categories; fuzzy C-means clustering, computed using fuzzy similarity; classified blogs. Three sample categories are shown (Culture and arts, Geography and places, People and self); 12 categories are classified in total.]
Fig. 1. Our Proposed Framework
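The Yellow River example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `concept_index` dictionary is a hypothetical stand-in for Wikipedia article titles, and the greedy longest-match strategy is an assumption of this sketch, not the extraction algorithm of [11].

```python
def extract_concepts(tokens, concept_index, max_n=3):
    """Greedy longest-match lookup of word n-grams in a concept index.

    Word order is preserved, so "yellow river" and "river yellow"
    are looked up as different n-grams.
    """
    concepts = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in concept_index:
                concepts.append((phrase, concept_index[phrase]))
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return concepts


# Hypothetical toy index standing in for Wikipedia titles.
concept_index = {"yellow": "Color", "river": "Nature", "yellow river": "Place"}

in_order = extract_concepts("the yellow river floods".split(), concept_index)
reversed_order = extract_concepts("river yellow".split(), concept_index)
print(in_order)        # the compound concept is found
print(reversed_order)  # only the two single-word concepts are found
```

In this sketch the ordered bigram maps to the Place concept, while the reversed word order falls back to the two single-word concepts and never produces a compound one.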
Since reordering A and B can yield a different concept or no concept at all, we do not consider reordered combinations. For effective classification, we mine existing knowledge bases, which can provide a conceptual corpus. Wikipedia is one of the best online knowledge bases, providing individual English articles for more than 3 million concepts. An important feature of Wikipedia is that each wiki page has a URI that unambiguously represents its concept. Our concept-based representation method was guided by the research carried out in concept vectorization and categorization [11]. Two words co-occur if they appear in an article within some distance of each other. Typically, the distance is a window of k words. The window limits the co-occurrence analysis to an arbitrary range. A window is needed because the number of co-occurring combinations becomes too large when the number of words in an article is large. Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same article, even more so when the number of words in an article is huge. The n-gram based concept extraction algorithm is described in our earlier work [11].

2.2 Blogpost-Concept Relation Based on Fuzzy Logic
Our framework utilizes n-gram based concept extraction to identify the concepts that belong to each blogpost. We use a set of blogposts $P = \{p_1, p_2, p_3, \ldots, p_n\}$ and a set of concepts collected from the blogposts, $Con = \{Con_1, Con_2, Con_3, \ldots, Con_m\}$, $m \in \mathbb{Z}^+$, to determine the fuzzy relation from blogpost to concept. Each blogpost is represented by concept-frequency pairs $bp = \{(Con_1, w_1), (Con_2, w_2), \ldots, (Con_m, w_m)\}$, where $w_m$ denotes the occurrence frequency of $Con_m$ in the blog post. The relevance of blogposts to concepts is expressed by a fuzzy relation $R : P \times Con \to [0,1]$, where the membership value of this relation, denoted by $\mu_R(P_n, Con_i)$, specifies the degree of relevance of blogpost $P_n$ to concept $Con_i$. The membership values of this relation are determined by the n-gram based concept extraction technique, which yields each blog post together with its concepts. The membership value $\mu_R(P_n, Con_i)$ is given by (1), with $\mathrm{dist}$ computed from (2):

$$\mu_R(P_n, Con_i) = \frac{\mathrm{dist}(P_n, Con_i)}{\max_{m,k} \mathrm{dist}(P_m, Con_k)} \tag{1}$$

$$\mathrm{dist}(P_n, Con_i) = \frac{\sum \{\, w_i \mid w_i \in (bp)_l,\ (bp)_l \in bp,\ Con((bp)_l) = Con_i \,\}}{\sum \{\, w_i \mid w_i \in (bp)_l,\ (bp)_l \in bp \,\}} \tag{2}$$
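As a rough illustration of the max-normalisation in (1), the following Python sketch computes membership values from raw concept counts. It rests on an assumption that the text does not spell out: dist(P_n, Con_i) is read here as the share of concept Con_i's total frequency that falls in blogpost P_n. The function name and the toy counts are hypothetical.

```python
def membership_values(freq):
    """freq[post][concept] holds the occurrence count of a concept in a post.

    Returns mu[post][concept] in [0, 1], obtained by dividing every
    dist value by the largest dist value, as in Eq. (1).
    """
    # Total frequency of each concept over all posts (assumed denominator of Eq. 2).
    totals = {}
    for counts in freq.values():
        for concept, w in counts.items():
            totals[concept] = totals.get(concept, 0) + w

    # Assumed reading of Eq. (2): share of the concept's occurrences in this post.
    dist = {
        post: {c: w / totals[c] for c, w in counts.items() if totals[c] > 0}
        for post, counts in freq.items()
    }

    # Eq. (1): normalise by the maximum dist over all post/concept pairs.
    peak = max(v for row in dist.values() for v in row.values())
    return {post: {c: v / peak for c, v in row.items()} for post, row in dist.items()}


# Hypothetical toy counts for two blogposts.
freq = {"p1": {"river": 3, "yellow": 1}, "p2": {"river": 1, "yellow": 1}}
mu = membership_values(freq)
print(mu)
```

The pair with the largest dist value always receives membership 1, and every other pair is scaled relative to it, which keeps all values inside [0, 1].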
In (2), $\mathrm{dist}(P_n, Con_i)$ is the distribution obtained from the total number of occurrences of concept $Con_i$ in blogpost $P_n$. The blogposts are grouped together according to their concepts, and the occurrence frequency of each blogpost for each concept is collected by summing up the blogpost frequencies of the individual concepts.

2.3 Concept-Category Relation Based on Fuzzy Logic
We use a set of concepts $Con = \{Con_1, Con_2, Con_3, \ldots, Con_m\}$, $m \in \mathbb{Z}^+$, and the Wikipedia category set $C = \{c_1, c_2, c_3, \ldots, c_{12}\}$ to determine the fuzzy relation from concept to category. Each category is represented by concept-frequency pairs $cp = \{(Con_1, w_1), (Con_2, w_2), \ldots, (Con_m, w_m)\}$, where $w_m$ denotes the occurrence frequency of $Con_m$ in the category. The relevance of concepts to categories is expressed by a fuzzy relation $R : Con \times C \to [0,1]$, where the membership value of this relation, denoted by $\mu_R(Con_i, c_j)$, specifies the degree of relevance of concept $Con_i$ to category $c_j$. The membership value $\mu_R(Con_i, c_j)$ is given by (3), with $\mathrm{dist}$ computed from (4):

$$\mu_R(Con_i, c_j) = \frac{\mathrm{dist}(Con_i, c_j)}{\max_{k,l} \mathrm{dist}(Con_k, c_l)} \tag{3}$$

$$\mathrm{dist}(Con_i, c_j) = \frac{\sum \{\, w_i \mid w_i \in (cp)_k,\ (cp)_k \in cp,\ c((cp)_k) = c_j \,\}}{\sum \{\, w_i \mid w_i \in (cp)_k,\ (cp)_k \in cp \,\}} \tag{4}$$
In (4), $\mathrm{dist}(Con_i, c_j)$ is the distribution obtained from the total number of occurrences of concept $Con_i$ in category $c_j$. The concepts are grouped together according to their categories, and the occurrence frequency of each concept for each category is collected by summing up the concept frequencies of the individual categories.

2.4 Fuzzy C-Means Clustering
Fuzzy C-means clustering takes as input the membership values of concepts (1) and the membership values of categories (3). We utilize fuzzy C-means clustering (FCMC), which was developed by Dunn [3] and improved by Bezdek [4]. FCMC allows a piece of data to belong to one or more clusters with different membership degrees (between 0 and 1), so that the boundaries between clusters are fuzzy. Let the blogposts $P = \{p_1, p_2, p_3, \ldots, p_n\}$ be a finite dataset, where each $p_i = \{p_{i1}, p_{i2}, p_{i3}, \ldots, p_{if}\}$ has $f$ features, and let $U$ be the fuzzy matrix. To classify $P$ into $c$ ($2 \le c \le n$) categories, let $V = (v_1, v_2, v_3, \ldots, v_c) \in CON$ be the cluster centers, or prototype matrix, $V \in \mathbb{R}^{c \times f}$. To optimize the fuzzy clustering, a fuzzy matrix $U \in M_{fcn}$ is selected, where

$$M_{fcn} = \Big\{ U \in \mathbb{R}^{c \times n} : u_{ik} \in [0,1];\ \sum_{i=1}^{c} u_{ik} = 1;\ 0 < \sum_{k=1}^{n} u_{ik} < n \Big\} \tag{5}$$

FCMC iterates the following steps until a satisfactory objective is reached. The objective is defined as reducing the total error to a given threshold value $\varepsilon$, or stopping after a certain number of iterations has been completed. $U$ and $V$ are calculated by minimizing the objective function $J_m$ [4]:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m d_{ki}^2 = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m \lVert P_k - v_i \rVert^2 \tag{6}$$

where $m$ is the fuzzifier, $m > 1$, which controls the fuzziness of the method; $u_{ki}$ is the membership degree of $P_k$ in cluster $i$ (the same entry is written $u_{ik}$ in (5)); $P_k$ is the $k$th $f$-dimensional measured data point; $v_i$ is the $f$-dimensional center of cluster $i$; and $\lVert \cdot \rVert$ is any norm expressing the similarity between a measured data point and a center. Both parameters need to be specified beforehand. $d_{ki}^2 = \lVert P_k - v_i \rVert^2$ is the squared Euclidean distance between data object $P_k$ and cluster center $v_i$. Thus the fuzzy clustering is carried out by minimizing the objective function (6), updating the memberships $u_{ki}$ by (7) and the cluster centers $v_i$ by (8):

$$u_{ki} = \left[ \sum_{l=1}^{c} \left( \frac{\lVert P_k - v_i \rVert}{\lVert P_k - v_l \rVert} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{7}$$

where

$$v_i = \frac{\sum_{k=1}^{n} u_{ki}^m \, P_k}{\sum_{k=1}^{n} u_{ki}^m} \tag{8}$$

The choice of an appropriate objective function (6) is the key to the success of the cluster analysis. The iteration of (7) and (8) terminates when

$$\max_{k,i} \left\{ \left| u_{ki}^{(l+1)} - u_{ki}^{(l)} \right| \right\} < \Phi \tag{9}$$
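The alternating updates (7) and (8) with the stopping rule (9) can be sketched in plain Python as follows. This is a minimal sketch, not the implementation used in the experiments: the initial centres are supplied by the caller (the text does not describe an initialisation), the fuzzifier defaults to m = 2, and the six two-dimensional points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_memberships(points, centres, m):
    """Eq. (7): membership of every point in every cluster."""
    U = []
    for p in points:
        d = [euclidean(p, v) for v in centres]
        zero = [i for i, di in enumerate(d) if di == 0.0]
        if zero:
            # The point sits exactly on a centre: give it full membership there.
            U.append([1.0 / len(zero) if i in zero else 0.0 for i in range(len(d))])
        else:
            power = 2.0 / (m - 1.0)
            U.append([1.0 / sum((di / dl) ** power for dl in d) for di in d])
    return U


def update_centres(points, U, m):
    """Eq. (8): centres as membership-weighted means of the points."""
    centres = []
    for i in range(len(U[0])):
        weights = [row[i] ** m for row in U]
        total = sum(weights)
        centres.append([sum(w * p[j] for w, p in zip(weights, points)) / total
                        for j in range(len(points[0]))])
    return centres


def fuzzy_c_means(points, centres, m=2.0, phi=1e-5, max_iter=100):
    """Alternate Eqs. (8) and (7) until the change in U drops below phi (Eq. 9)."""
    U = update_memberships(points, centres, m)
    for _ in range(max_iter):
        centres = update_centres(points, U, m)
        new_U = update_memberships(points, centres, m)
        change = max(abs(a - b) for old, new in zip(U, new_U) for a, b in zip(old, new))
        U = new_U
        if change < phi:
            break
    return U, centres


# Hypothetical feature vectors forming two well-separated groups.
posts = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
U, V = fuzzy_c_means(posts, centres=[posts[0], posts[3]])
for row in U:
    print([round(u, 3) for u in row])
```

Each row of U sums to 1, matching the constraint in (5), and a point can be assigned to the cluster in which it has the highest membership when a crisp label is needed.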
In the termination criterion (9), $\Phi$ is between 0 and 1. Thus, for a finite set of blogposts, fuzzy C-means clustering returns a list of cluster centers $V = (v_1, v_2, v_3, \ldots, v_c) \in CON$ and a fuzzy matrix $U = [u_{ki}]$.

2.5 Fuzzy Similarity
The aim of using fuzzy similarity is to identify the category of a blogpost based on fuzziness. One could argue that fuzzy C-means clustering already returns the cluster centers, which are effectively the identified categories of the blogposts. In our framework, we improve the fuzzy C-means clustering result by computing the fuzzy similarity between the cluster centers and the list of concepts returned by n-gram based concept extraction. Our earlier work [9] examined six fuzzy similarity measures built on fuzzy conjunction and disjunction operators: Einstein, Hamacher, Bounded Difference, Algebraic, Drastic and Min-Max. The empirical results in [9] showed that the similarity measure using Einstein operators performed better in classifying blogs than the other measures. We therefore use the Einstein similarity measure for classifying blogs, measuring the similarity between the list of concepts returned by n-gram based concept extraction and the fuzzy C-means cluster centers.
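The following Python sketch shows how an Einstein-based similarity could be computed between a blogpost's concept memberships and a cluster center. The Einstein product and sum are the standard fuzzy conjunction and disjunction operators; the ratio form of the similarity (total conjunction over total disjunction) is an assumption of this sketch, since the measure is not restated in the text, and the vectors are hypothetical.

```python
def einstein_product(a, b):
    """Einstein fuzzy conjunction (t-norm) of two values in [0, 1]."""
    return (a * b) / (2.0 - (a + b - a * b))


def einstein_sum(a, b):
    """Einstein fuzzy disjunction (t-conorm) of two values in [0, 1]."""
    return (a + b) / (1.0 + a * b)


def einstein_similarity(x, y):
    """Assumed ratio form: sum of conjunctions over sum of disjunctions."""
    num = sum(einstein_product(a, b) for a, b in zip(x, y))
    den = sum(einstein_sum(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0


def best_cluster(post_vector, centres):
    """Index of the cluster centre most similar to the blogpost vector."""
    scores = [einstein_similarity(post_vector, v) for v in centres]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical concept memberships for one blogpost and two cluster centres.
post = [0.9, 0.8, 0.1]
centres = [[0.8, 0.9, 0.0], [0.1, 0.2, 0.9]]
print(best_cluster(post, centres))
```

Under this sketch, a blogpost is assigned to the category whose cluster center has the highest similarity to its concept vector.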
3 Experiments
This section describes the experiment that tests the efficiency of our proposed framework. We carried out experiments using part of the TREC Blogs08 dataset (http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html), a well-known dataset in the blog mining research area. The dataset consists of a crawl of feeds, the associated permalinks, blog homepages, blogposts and blog comments, and it combines English and non-English blogs. We used a blogpost extraction program named BlogTEX (http://sourceforge.net/projects/blogtex) to extract blog posts from the TREC Blogs08 dataset. During preprocessing, we removed HTML tags and kept only English blog posts for our experiment. To test our approach, we collected 41,178 blog posts and assigned them to categories based on the concepts found in each post. Table 1 shows the CategoryTree structure used in our experiment. We downloaded the Wikipedia database dumps on 15 February 2010 and extracted 3,207,879 Wikipedia articles. After pre-processing and filtering, we used 3,101,144 article titles, which are organized into 145,990 subcategories.
Table 1. The CategoryTree structure used in this experiment

Categorical Index               # of documents   # of concepts
General Reference                          878            3758
Culture and Arts                          8580           18245
Geography and Places                      4558           23220
Health and Fitness                        4082           21409
History and Events                        2471           12746
Mathematics and Logic                      213            2561
Natural and Physical sciences             1560            8530
People and Self                           6279           10125
Philosophy and Thinking                   2345            7591
Religion and Belief                       2345            8315
Society and Social sciences               4657           13170
Tech and Applied sciences                 3210           26310
We ran the experiment following the framework described in Section 2, applying fuzzy similarity to the fuzzy C-means result, and classified the blogposts into the categories listed in Table 1. To evaluate the proposed framework, we use two popular evaluation measures, precision and recall. Precision is the number of "categories found and correct" divided by the "total categories found". Recall is the number of "categories found and correct" divided by the "total categories correct". We evaluated the classification performance among the top 10, 20, and 50 of the classified blogs respectively. In our dataset, many of the blogposts are categorized under Culture and Arts and under Geography and Places, and very few under Mathematics and Logic.

Table 2. Evaluation Result

Categorical Index               Precision   Recall
General Reference                   0.810    0.764
Culture and Arts                    0.792    0.663
Geography and Places                0.841    0.735
Health and Fitness                  0.928    0.762
History and Events                  0.787    0.683
Mathematics and Logic               0.741    0.769
Natural and Physical sciences       0.910    0.785
People and Self                     0.850    0.694
Philosophy and Thinking             0.962    0.815
Religion and Belief                 0.827    0.897
Society and Social sciences         0.942    0.895
Tech and Applied sciences           0.903    0.823
Average                            0.8577   0.7654
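The precision and recall definitions used for Table 2 can be written as a short Python sketch. The function name and the category sets below are hypothetical and only illustrate the two ratios; they are not taken from the experiment.

```python
def precision_recall(found, correct):
    """Precision and recall of a set of found categories against the correct set."""
    found, correct = set(found), set(correct)
    hits = len(found & correct)  # "categories found and correct"
    precision = hits / len(found) if found else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical example: three categories found, four categories correct.
found = {"Culture and Arts", "Geography and Places", "People and Self"}
correct = {"Culture and Arts", "Geography and Places",
           "History and Events", "Religion and Belief"}
p, r = precision_recall(found, correct)
print(round(p, 3), round(r, 3))
```

In this toy case two of the three found categories are correct, and two of the four correct categories are found.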
From Table 2 we can note that Mathematics and Logic and Religion and Belief are the only categories whose recall exceeds their precision. Overall, our framework achieved an average precision of 85.77% and an average recall of 76.54% in classifying blogs.
4 Conclusions
In this paper, we presented a framework that optimizes fuzzy clustering with a concept based modeling approach for blog classification. We demonstrated the framework's effectiveness by measuring precision and recall on a real blog dataset. We conclude that using the Wikipedia knowledge base combined with fuzzy clustering and fuzzy similarity measures produces better results than traditional clustering techniques. In future work, we plan to extend our experiment to the full TREC Blogs08 dataset and to combine blog features such as blog tags, in order to measure clustering performance and scalability.
References
1. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. J. Information Processing & Management 24, 513–523 (1988)
2. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
3. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. of Cybernetics 3(1), 32–57 (1973)
4. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell (1981)
5. Mendes, M.E.S., Sacks, L.: Evaluating fuzzy clustering for relevance-based access. In: IEEE International Conference on Fuzzy Systems, pp. 648–653 (2003)
6. Miyamoto, S.: Fuzzy multisets and fuzzy clustering of documents. In: 10th IEEE International Conference on Fuzzy Systems, pp. 1191–1194 (2001)
7. Saraçoglu, R., Tütüncü, K., Allahverdi, N.: A fuzzy clustering approach for finding similar documents using a novel similarity measure. Expert Systems with Applications 33(3), 600–605 (2007)
8. Widyantoro, D.H., Yen, J.: A Fuzzy Similarity Approach in Text Classification Task. In: IEEE International Conference on Fuzzy Systems, pp. 653–658 (2000)
9. Ayyasamy, R.K., Tahayna, B., Alhashmi, S., Eu-gene, S.: Concept Based Modeling Approach for Blog Classification using Fuzzy Similarity. In: 8th IEEE International Conference on Fuzzy Systems and Knowledge Discovery, pp. 1007–1011 (2011)
10. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In: AAAI, Park (2006)
11. Ayyasamy, R.K., Tahayna, B., Alhashmi, S., Eu-gene, S., Egerton, S.: Mining Wikipedia Knowledge to improve Document Indexing and Classification. In: 10th Int. Conf. on Information Science, Signal Processing and their Applications, pp. 806–809 (2010)
12. Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering Documents Using a Wikipedia-Based Concept Representation. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 628–636. Springer, Heidelberg (2009)
13. Hu, J., Fang, L., Cao, Y., Hua-Jun Zeng, H., Li, H.: Enhancing Text Clustering by Leveraging Wikipedia Semantics. In: ACM SIGIR, pp. 179–186 (2008)