Enhancing Concept Based Modeling Approach for Blog Classification

Ramesh Kumar Ayyasamy¹, Saadat M. Alhashmi¹, Siew Eu-Gene², and Bashar Tahayna¹

¹ School of Information Technology, ² School of Business, Monash University, Malaysia
{ramesh.kumar,alhashmi,siew.eu-gene,bashar.tahayna}@monash.edu

Abstract. Blogs are user-generated content that discusses a wide variety of topics. Over the past 10 years, social web content has grown at a fast pace, and research projects are finding ways to channel this information using text classification techniques. Existing classification techniques follow only Boolean (or crisp) logic. This paper extends our previous work with a framework in which fuzzy clustering is optimized with fuzzy similarity to perform blog classification. Wikipedia, a knowledge base widely accepted by the research community, was used for feature selection and classification. Our experimental results show that the proposed framework significantly improves precision and recall in classifying blogs.

Keywords: Blog classification, Wikipedia, Fuzzy clustering, Fuzzy similarity.

1 Introduction

Blog classification plays an important role in many information management and retrieval tasks. It refers to the task of assigning blogs to one or more predefined categories. Blog classification is more cumbersome than ordinary web page classification because blog posts are frequently updated and bloggers disseminate information and present their ideas on a wide range of topics. According to Blogpulse¹ statistics, there are 169.56 million identified blogs, with 96,488 new blogs and 502,525 indexed blogposts in the last 24 hours. This shows that weblogs are growing at a rapid rate and form a rich source of information that requires efficient and effective automatic classification and categorization techniques.

Text clustering has received significant attention in recent years in machine learning and text mining applications involving web pages and blogs. Text clustering typically builds on the vector space model [1], in which the entire text collection is represented as a term-by-document matrix. Clustering blogposts enables automatic categorization and facilitates certain types of blog search. For any clustering method to group similar documents, the documents have to be embedded in a suitable similarity space. In contrast with Boolean logic, where variables take one of two values, fuzzy logic variables may have a truth value that ranges in degree between 0 and 1; fuzzy logic is therefore a many-valued logic [2]. The difference between fuzzy clustering and regular clustering is that in the former each data point may belong to more than one cluster, with its degrees of membership determined by membership functions.

Although fuzzy C-means clustering has not been widely researched for blog classification, there is a considerable amount of literature [5–7] on document clustering. Mendes and Sacks [5] analyzed various clustering methods to discover relevant document relationships. Their experiments with various test documents [5] showed that fuzzy C-means clustering performs better than the hard K-means algorithm. Miyamoto [6] developed a method using fuzzy clustering for fuzzy multisets; this work used two dissimilarity measures for computing cluster centers. Saraçoglu et al. [7] used fuzzy clustering to find similar documents. Their work is similar to ours, but their clustering relies on training documents, whereas our approach is automatic and does not depend on training documents. Widyantoro and Yen [8] adopted a fuzzy similarity approach, derived from Rocchio's algorithm, for text classification. That work [8] used only the fuzzy term-category relation to predict the category of documents.

Despite the widespread adoption of clustering, only a few studies have investigated using Wikipedia as a knowledge base for document clustering [10–13]. Gabrilovich et al. [10] applied Wikipedia, a structured knowledge repository, as a feature generation technique. Their work confirmed that features generated from Wikipedia as background knowledge can help text categorization. In our previous work [11], we followed two-valued (Boolean) logic and utilized Wikipedia to index and classify text documents. Huang et al. [12] clustered documents using Wikipedia knowledge, treating each Wikipedia article as a concept; this work extracted related terms such as synonyms, hyponyms and associated terms. Hu et al. [13] used Wikipedia concept features for document clustering.

In our work, we use concepts derived from Wikipedia to classify blogposts. In this paper, we develop a framework in which fuzzy clustering is optimized with fuzzy similarity for blog classification. Wikipedia, a knowledge base widely accepted by the research community, is used for our classification. The remainder of the paper is organized as follows: Section 2 describes our proposed framework; Section 3 evaluates the framework on a real data set and discusses the experimental results; Section 4 concludes the paper.

¹ http://www.blogpulse.com

Y. Wang and T. Li (Eds.): Knowledge Engineering and Management, AISC 123, pp. 409–416. springerlink.com © Springer-Verlag Berlin Heidelberg 2011

2 Our Proposed Framework

We present our fuzzy clustering framework (Figure 1), which classifies blogposts in several stages using fuzzy clustering and fuzzy similarity. Our contribution is threefold: (i) we use Wikipedia to find the membership values of categories; (ii) we find the membership values of concepts in blogposts using n-gram based concept extraction; and (iii) we optimize fuzzy C-means clustering with fuzzy similarity to perform blog classification.

2.1 n-gram Based Concept Extraction

We use a relatedness measurement method that computes the relatedness among words based on co-occurrence analysis. As a simple case, we map each word to a concept; a combination of two or more of these concepts/words can create a new concept (a compound concept). For example, assume A is a word mapped to a concept Con1, and B is a word mapped to a concept Con2. Then there is a probability that the combination of A and B produces a new concept. Consider A = Yellow (a Color concept) and B = River (a Nature concept). Then A+B → Yellow River, a geographical place in China (a Place concept), whereas B+A → River Yellow, which is not a concept.

[Figure 1 diagram. Labels: Blog posts; Wikipedia; n-gram based concept extraction; Fuzzy blog post-Concept relation; Fuzzy concept-Category relation; Membership value of concepts; Membership value of categories; Fuzzy C-means clustering; Compute using fuzzy similarity; Classified blogs*, with sample categories Culture and arts, Geography and places, and People and self.]

* Sample 3 categories are shown here (Total classified categories: 12)

Fig. 1. Our Proposed Framework
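
The compound-concept idea (Yellow + River giving Yellow River, but not River Yellow) can be illustrated with a minimal Python sketch. It assumes a tiny hand-made set of concept titles standing in for Wikipedia article titles and a longest-match-first scan over word n-grams. The function name, the `max_n` parameter and the toy title set are hypothetical; the actual extraction algorithm is the one described in [11], which this sketch does not claim to reproduce.

```python
from typing import List, Set


def extract_concepts(text: str, titles: Set[str], max_n: int = 3) -> List[str]:
    """Greedy longest-match lookup of word n-grams in a set of concept titles.

    `titles` is a hypothetical stand-in for Wikipedia article titles
    (lower-cased). Word order is preserved, so "river yellow" is never
    matched against the title "yellow river".
    """
    words = text.lower().split()
    concepts = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest n-gram first so compound concepts win over their parts.
        for n in range(min(max_n, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in titles:
                concepts.append(candidate)
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no concept starts at this word
    return concepts


# Toy title set (assumption, not real Wikipedia data).
TITLES = {"yellow", "river", "yellow river", "china"}

print(extract_concepts("the yellow river in china", TITLES))
print(extract_concepts("a river yellow with silt", TITLES))
```

In the first call the bigram is recognized as one compound concept; in the second, the reversed order only yields the two single-word concepts.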

Since the order of A and B can yield a different concept or no concept at all, we do not consider reordered combinations. For effective classification, we mine an existing knowledge base that can provide a conceptual corpus. Wikipedia is one of the best online knowledge bases for this purpose: its English edition provides individual articles for more than 3 million concepts. An important feature of Wikipedia is that each wiki page has a URI that unambiguously represents the concept. Our concept-based representation method was guided by the research carried out in concept vectorization and categorization [11].

Two words co-occur if they appear in an article within some distance of each other. Typically, the distance is a window of k words. The window limits the co-occurrence analysis to a bounded range. A window is needed because the number of co-occurring combinations becomes too large when the number of words in an article is large. Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same article, even more so when the number of words in an article is large. The n-gram based concept extraction algorithm was briefly discussed in Ayyasamy et al.'s work [11].
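
Windowed co-occurrence counting can be sketched as follows. This is only an illustration under assumptions: whitespace tokenization, unordered word pairs, and a window in which two words co-occur when they are at most `k` positions apart. The function name is hypothetical, and the relatedness measure built on top of such counts is not shown.

```python
from collections import Counter
from typing import List


def cooccurrence_counts(words: List[str], k: int) -> Counter:
    """Count unordered word pairs that appear within k positions of each other."""
    counts: Counter = Counter()
    for i, left in enumerate(words):
        # Only look ahead up to k positions, so the work grows with
        # len(words) * k instead of len(words) ** 2 for whole-article pairs.
        for right in words[i + 1:i + 1 + k]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


tokens = "yellow river flows through china".split()
counts = cooccurrence_counts(tokens, k=2)
print(counts[("river", "yellow")])   # adjacent words: inside the window
print(counts[("china", "yellow")])   # four positions apart: outside the window
```

Widening the window (for example `k=4` here) makes the distant pair count as well, which is exactly the trade-off between coverage and the number of combinations described above.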

2.2 Blogpost-Concept Relation Based on Fuzzy Logic

Our framework utilizes n-gram based concept extraction to identify the concepts that belong to each blogpost. We use a set of blogposts $P = \{p_1, p_2, p_3, \ldots, p_n\}$ and a set of concepts collected from the blogposts, $Con = \{Con_1, Con_2, Con_3, \ldots, Con_m\}$, $m \in \mathbb{Z}^+$, to determine the fuzzy relation from blogpost to concept. Each blogpost is represented by concept-frequency pairs $bp = \{(Con_1, w_1), (Con_2, w_2), \ldots, (Con_m, w_m)\}$, where $w_m$ denotes the occurrence frequency of $Con_m$ in the blogpost. The relevance of blogposts to concepts is expressed by a fuzzy relation $R : P \times Con \to [0,1]$, where the membership value $\mu_R(P_n, Con_i)$ specifies the degree of relevance of blogpost $P_n$ to concept $Con_i$. The membership values of this relation are determined by the n-gram based concept extraction technique, which operates on a blogpost and its concepts. The membership value $\mu_R(P_n, Con_i)$ is given by (1), which is computed from (2):

$$\mu_R(P_n, Con_i) = \frac{dist(P_n, Con_i)}{\max_{m,k}\, dist(P_m, Con_k)} \tag{1}$$

$$dist(P_n, Con_i) = \frac{\sum_{w_i \in (bp)_l,\ (bp)_l \in bp,\ Con((bp)_l) = Con_i} w_i}{\sum_{w_i \in (bp)_l,\ (bp)_l \in bp} w_i} \tag{2}$$

where $dist(P_n, Con_i)$ is the distribution of concept $Con_i$ in blogpost $P_n$, based on the total number of occurrences of the concept in the blogpost. The blogposts are grouped together according to their concepts, and the occurrence frequency of each blogpost for each concept is collected by summing up the blogpost frequencies of the individual concepts.
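
A small Python sketch may make (1) and (2) concrete. It reads (2) as the share of a blogpost's concept occurrences that belong to a given concept, which is one interpretation of the formula rather than a confirmed implementation, and then applies the max-normalization of (1). The toy blogposts, frequencies and function names are hypothetical.

```python
from typing import Dict

# Hypothetical concept-frequency pairs bp for two blogposts (toy data).
posts: Dict[str, Dict[str, int]] = {
    "p1": {"yellow river": 3, "china": 1},
    "p2": {"china": 2, "football": 2},
}


def dist(bp: Dict[str, int], concept: str) -> float:
    """Equation (2), read as: frequency of `concept` over all frequencies in bp."""
    total = sum(bp.values())
    return bp.get(concept, 0) / total if total else 0.0


def membership(all_posts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Equation (1): divide every dist value by the largest dist value."""
    raw = {p: {c: dist(bp, c) for c in bp} for p, bp in all_posts.items()}
    largest = max(v for row in raw.values() for v in row.values())
    return {p: {c: v / largest for c, v in row.items()} for p, row in raw.items()}


mu = membership(posts)
print(mu["p1"]["yellow river"])  # the largest dist value maps to 1.0
print(mu["p2"]["china"])
```

Because of the division by the global maximum, every membership value lies in [0, 1] and at least one blogpost-concept pair has membership exactly 1.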

2.3 Concept-Category Relation Based on Fuzzy Logic

We use a set of concepts $Con = \{Con_1, Con_2, Con_3, \ldots, Con_m\}$, $m \in \mathbb{Z}^+$, and the Wikipedia category set $C = \{c_1, c_2, c_3, \ldots, c_{12}\}$ to determine the fuzzy relation from concept to category. Each category is represented by concept-frequency pairs $cp = \{(Con_1, w_1), (Con_2, w_2), \ldots, (Con_m, w_m)\}$, where $w_m$ denotes the occurrence frequency of $Con_m$ in the category. The relevance of concepts to categories is expressed by a fuzzy relation $R : Con \times C \to [0,1]$, where the membership value $\mu_R(Con_i, c_j)$ specifies the degree of relevance of concept $Con_i$ to category $c_j$. The membership value $\mu_R(Con_i, c_j)$ is given by (3), which is computed from (4):

$$\mu_R(Con_i, c_j) = \frac{dist(Con_i, c_j)}{\max_{k,l}\, dist(Con_k, c_l)} \tag{3}$$

$$dist(Con_i, c_j) = \frac{\sum_{w_i \in (cp)_k,\ (cp)_k \in cp,\ c((cp)_k) = c_j} w_i}{\sum_{w_i \in (cp)_k,\ (cp)_k \in cp} w_i} \tag{4}$$

where $dist(Con_i, c_j)$ is the distribution of concept $Con_i$ in category $c_j$, based on the total number of occurrences of the concept in the category. The concepts are grouped together according to their categories, and the occurrence frequency of each concept for each category is collected by summing up the concept frequencies of the individual categories.
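
The concept-to-category relation of (3) and (4) can be sketched in the same spirit. Here (4) is read as the share of a concept's occurrences that fall in a given category, with the denominator summed over all categories; this reading, the toy frequencies and the function name are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical concept-frequency pairs cp for two categories (toy data).
cp: Dict[str, Dict[str, int]] = {
    "Geography and places": {"yellow river": 8, "china": 6},
    "Culture and arts": {"yellow river": 2, "opera": 5, "china": 6},
}


def concept_category_membership(pairs_by_category):
    """Return mu[concept][category] following equations (3) and (4)."""
    # Total occurrences of each concept over all categories (denominator of (4)).
    totals: Dict[str, int] = {}
    for pairs in pairs_by_category.values():
        for concept, w in pairs.items():
            totals[concept] = totals.get(concept, 0) + w
    # Equation (4): share of the concept's occurrences found in each category.
    dist = {
        concept: {cat: pairs.get(concept, 0) / total
                  for cat, pairs in pairs_by_category.items()}
        for concept, total in totals.items()
    }
    # Equation (3): divide by the largest dist value over all pairs.
    largest = max(v for row in dist.values() for v in row.values())
    return {c: {cat: v / largest for cat, v in row.items()} for c, row in dist.items()}


mu = concept_category_membership(cp)
print(mu["yellow river"]["Geography and places"])
print(mu["opera"]["Culture and arts"])
```

A concept that occurs in only one category (such as "opera" in the toy data) receives full membership in that category and zero membership elsewhere.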

2.4 Fuzzy C-Means Clustering

Fuzzy C-means clustering takes as input the membership values of concepts (1) and the membership values of categories (3). We utilize fuzzy C-means clustering (FCMC), which was developed by Dunn [3] and improved by Bezdek [4]. FCMC allows one piece of data to belong to one or more clusters with different membership degrees (between 0 and 1) and with fuzzy boundaries between clusters. Let the set of blogposts $P = \{p_1, p_2, p_3, \ldots, p_n\}$ be a finite dataset, where each $p_i = \{p_{i1}, p_{i2}, p_{i3}, \ldots, p_{if}\}$ has $f$ features, and let $U$ be the fuzzy matrix. To classify $P$ into $c$ ($2 \le c \le n$) categories, let $V = (v_1, v_2, v_3, \ldots, v_c)$ CON be the cluster centers or prototype matrix, $V \in \mathbb{R}^{c \times f}$. To optimize the fuzzy clustering, a fuzzy matrix $U \in M_{fcn}$ is selected, where

$$M_{fcn} = \left\{ U \in \mathbb{R}^{c \times n} : u_{ik} \in [0,1];\ \sum_{i=1}^{c} u_{ik} = 1;\ 0 < \sum_{k=1}^{n} u_{ik} < n \right\} \tag{5}$$
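
The three conditions in (5) can be checked mechanically. The following sketch, with a hypothetical function name and toy matrices, tests whether a c × n matrix stored as a list of rows is a valid fuzzy partition; the numerical tolerance is an added assumption to cope with floating-point rounding.

```python
from typing import List


def is_fuzzy_partition(U: List[List[float]], tol: float = 1e-9) -> bool:
    """Check the three conditions of equation (5) for a c x n matrix U.

    U[i][k] is the membership of data point k in cluster i.
    """
    c, n = len(U), len(U[0])
    in_range = all(0.0 <= u <= 1.0 for row in U for u in row)
    # Each data point distributes a total membership of 1 over the clusters.
    columns_sum_to_one = all(
        abs(sum(U[i][k] for i in range(c)) - 1.0) <= tol for k in range(n)
    )
    # No cluster is empty and no cluster absorbs every point completely.
    rows_non_trivial = all(0.0 < sum(row) < n for row in U)
    return in_range and columns_sum_to_one and rows_non_trivial


fuzzy = [[0.7, 0.2, 0.5],
         [0.3, 0.8, 0.5]]
degenerate = [[1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]]
print(is_fuzzy_partition(fuzzy))       # True
print(is_fuzzy_partition(degenerate))  # False: one cluster holds everything
```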

FCMC executes the following steps iteratively until a satisfactory objective is reached. The stopping condition is either that the total error falls below a given threshold value $\varepsilon$ or that a certain number of iterations has been completed. $U$ and $V$ are calculated by minimizing the objective function $J_m$ [4]:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m d_{ki}^2 = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ki})^m \lVert P_k - v_i \rVert^2 \tag{6}$$

where $m$ is the fuzzifier, $m > 1$, which controls the fuzziness of the method; $u_{ki}$ is the membership degree of $P_k$ in cluster $i$ (the entry written $u_{ik}$ in (5)); $P_k$ is the $k$th $d$-dimensional measured data point; $v_i$ is the $d$-dimensional center of cluster $i$; and $\lVert \cdot \rVert$ is any norm expressing the similarity between a measured data point and a center. Both parameters, $c$ and $m$, need to be specified beforehand. $d_{ki}^2 = \lVert P_k - v_i \rVert^2$ is the squared Euclidean distance between data object $P_k$ and cluster center $v_i$. Thus the fuzzy clustering is carried out by minimizing the objective function (6), updating the memberships $u_{ki}$ by (7) and the cluster centers $v_i$ by (8):

$$u_{ki} = \left[ \sum_{l=1}^{c} \left( \frac{\lVert P_k - v_i \rVert}{\lVert P_k - v_l \rVert} \right)^{\frac{2}{m-1}} \right]^{-1} \tag{7}$$

$$v_i = \frac{\sum_{k=1}^{n} u_{ki}^{m}\, P_k}{\sum_{k=1}^{n} u_{ki}^{m}} \tag{8}$$

The choice of an appropriate objective function (6) is the key to the success of the cluster analysis. The iteration of (7) and (8) terminates when

$$\max_{k,i} \left\{ \left| u_{ki}^{(l+1)} - u_{ki}^{(l)} \right| \right\} < \Phi \tag{9}$$

where the superscript $(l)$ denotes the iteration step and $\Phi$ is between 0 and 1. Thus, for a finite set of blogposts, fuzzy C-means clustering returns a list of cluster centers $V = (v_1, v_2, v_3, \ldots, v_c)$ CON and a fuzzy matrix $U$.
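
The iteration described by (6) to (9) can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiments: it assumes the Euclidean norm, a random initial membership matrix drawn with a fixed seed, a default fuzzifier of m = 2, and hypothetical names (`fuzzy_c_means`, `phi`, `max_iter`). Memberships are stored as `U[i][k]`, cluster i by data point k.

```python
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_c_means(points: List[Vector], c: int, m: float = 2.0,
                  phi: float = 1e-5, max_iter: int = 100,
                  seed: int = 0) -> Tuple[List[Vector], List[List[float]]]:
    """Return (V, U) with U[i][k] = membership of points[k] in cluster i."""
    rng = random.Random(seed)
    n, f = len(points), len(points[0])
    # Random initial memberships, normalized so each point's column sums to 1.
    U = [[rng.random() + 1e-12 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    V: List[Vector] = []
    for _ in range(max_iter):
        # Equation (8): centers are membership-weighted means of the points.
        V = []
        for i in range(c):
            weights = [U[i][k] ** m for k in range(n)]
            total = sum(weights)
            V.append([sum(w * p[j] for w, p in zip(weights, points)) / total
                      for j in range(f)])
        # Equation (7): memberships from relative distances to the centers.
        new_U = [[0.0] * n for _ in range(c)]
        for k, p in enumerate(points):
            d2 = [sq_dist(p, v) for v in V]
            zero = [i for i, d in enumerate(d2) if d == 0.0]
            for i in range(c):
                if zero:  # point coincides with a center: crisp membership
                    new_U[i][k] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    # (d_i / d_l) ** (2 / (m - 1)), written with squared distances
                    new_U[i][k] = 1.0 / sum((d2[i] / d) ** (1.0 / (m - 1.0)) for d in d2)
        # Equation (9): stop when the largest membership change is below phi.
        change = max(abs(new_U[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = new_U
        if change < phi:
            break
    return V, U


points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
V, U = fuzzy_c_means(points, c=2)
for k in range(len(points)):
    print([round(U[i][k], 3) for i in range(2)])
```

On this toy data the two lower-left points end up with their highest membership in one cluster and the two upper-right points in the other, while every point keeps a small non-zero membership in the opposite cluster; that residual membership is what distinguishes the fuzzy result from a hard K-means assignment.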

2.5 Fuzzy Similarity

The aim of using fuzzy similarity is to identify the category of a blogpost based on fuzziness. One could argue that fuzzy C-means clustering already returns the cluster centers, which are nothing but the identified categories of the blogposts. In our framework, we improve the fuzzy C-means clustering result by computing the fuzzy similarity between the cluster centers and the list of concepts returned by n-gram based concept extraction. Our earlier work [9] examined six fuzzy similarity measures based on fuzzy conjunction and disjunction operators: Einstein, Hamacher, Bounded Difference, Algebraic, Drastic and Min-Max. The empirical results in [9] showed that the similarity measure using Einstein operators achieved the best performance in classifying blogs. We therefore use the Einstein similarity measure to classify blogs, measuring the similarity between the list of concepts returned by n-gram based concept extraction and each fuzzy C-means cluster center.
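
As an illustration only, the sketch below computes a fuzzy similarity with Einstein operators and uses it to pick a category. It assumes the common ratio form, the sum of pairwise fuzzy conjunctions divided by the sum of pairwise fuzzy disjunctions, together with the standard Einstein product and Einstein sum. The exact measure is defined in [9] and is not restated in this paper, so the formula, the function names and the toy membership vectors are all assumptions.

```python
from typing import Dict, List


def einstein_product(a: float, b: float) -> float:
    """Einstein t-norm (fuzzy conjunction)."""
    return (a * b) / (2.0 - (a + b - a * b))


def einstein_sum(a: float, b: float) -> float:
    """Einstein t-conorm (fuzzy disjunction)."""
    return (a + b) / (1.0 + a * b)


def einstein_similarity(x: List[float], y: List[float]) -> float:
    """Assumed ratio form: sum of conjunctions over sum of disjunctions."""
    num = sum(einstein_product(a, b) for a, b in zip(x, y))
    den = sum(einstein_sum(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0


def classify(post: List[float], centers: Dict[str, List[float]]) -> str:
    """Pick the category whose cluster center is most similar to the post."""
    return max(centers, key=lambda name: einstein_similarity(post, centers[name]))


# Toy membership vectors over three concepts (hypothetical values).
centers = {
    "Geography and places": [0.9, 0.1, 0.6],
    "Culture and arts": [0.1, 0.9, 0.3],
}
post = [0.8, 0.0, 0.5]
print(classify(post, centers))
```

For membership values in [0, 1] the denominator of the Einstein product stays between 1 and 2, so the conjunction is always well defined; the similarity itself is 1 only when both vectors have full membership in every concept.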

3 Experiments

This section describes the experiments that test the efficiency of our proposed framework. We carried out experiments using part of the TREC BLOGs08² dataset, a well-known dataset in the blog mining research area. The dataset consists of a crawl of feeds, associated permalinks, blog homepages, blogposts and blog comments, and combines both English and non-English blogs. We used a blogpost extraction program named BlogTEX³ to extract blog posts from the TREC Blog dataset. During preprocessing, we removed HTML tags and used only English blog posts for our experiment. To test our approach, we collected 41,178 blog posts and assigned them to categories based on the blog concepts found to be related to each category. Table 1 shows the CategoryTree structure used in our experiment. We downloaded the Wikipedia database dumps on 15 February 2010 and extracted 3,207,879 Wikipedia articles. After pre-processing and filtering, we used 3,101,144 article titles, which are organized into 145,990 subcategories.

² http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
³ http://sourceforge.net/projects/blogtex


Table 1. The CategoryTree structure used in this experiment

Categorical Index              # of documents   # of concepts
General Reference                         878            3758
Culture and Arts                         8580           18245
Geography and Places                     4558           23220
Health and Fitness                       4082           21409
History and Events                       2471           12746
Mathematics and Logic                     213            2561
Natural and Physical sciences            1560            8530
People and Self                          6279           10125
Philosophy and Thinking                  2345            7591
Religion and Belief                      2345            8315
Society and Social sciences              4657           13170
Tech and Applied sciences                3210           26310

We ran our experiment as described in our framework, applying fuzzy similarity on top of the fuzzy C-means method, and classified blogposts into the categories listed in Table 1. To evaluate the proposed framework, two popular evaluation measures, precision and recall, are used. Precision is the number of "categories found and correct" divided by the "total categories found". Recall is the number of "categories found and correct" divided by the "total categories correct". We evaluated the classification performance among the top 10, 20, and 50 of the classified blogs. In our dataset, many of the blogposts are categorized under Culture and Arts and under Geography and Places, and very few are categorized under Mathematics and Logic.

Table 2. Evaluation Result

Categorical Index              Precision   Recall
General Reference                  0.810    0.764
Culture and Arts                   0.792    0.663
Geography and Places               0.841    0.735
Health and Fitness                 0.928    0.762
History and Events                 0.787    0.683
Mathematics and Logic              0.741    0.769
Natural and Physical sciences      0.910    0.785
People and Self                    0.850    0.694
Philosophy and Thinking            0.962    0.815
Religion and Belief                0.827    0.897
Society and Social sciences        0.942    0.895
Tech and Applied sciences          0.903    0.823
Average                           0.8577   0.7654
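
The two evaluation measures can be expressed in a few lines of Python. The sketch evaluates a single category over sets of blogpost identifiers, which is an assumed reading of the "found" and "correct" counts defined above; the identifiers and the function name are hypothetical and are not taken from the experiment.

```python
from typing import Iterable, Set, Tuple


def precision_recall(found: Iterable[str], correct: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall for one category over sets of blogpost ids."""
    found_set: Set[str] = set(found)
    correct_set: Set[str] = set(correct)
    hits = len(found_set & correct_set)  # found and correct
    precision = hits / len(found_set) if found_set else 0.0
    recall = hits / len(correct_set) if correct_set else 0.0
    return precision, recall


# Hypothetical post ids: what the classifier assigned vs. the reference labels.
found = ["b1", "b2", "b3", "b4"]
correct = ["b2", "b3", "b4", "b5", "b6"]
p, r = precision_recall(found, correct)
print(f"precision={p:.2f} recall={r:.2f}")
```

With three of the four assigned posts correct and three of the five reference posts found, the toy category has a higher precision than recall, the same pattern that most rows of Table 2 show.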

Table 2 shows that for Mathematics and Logic and for Religion and Belief, precision is lower than recall, unlike the other categories. Overall, our framework achieved an average precision of 85.77% and an average recall of 76.54% in classifying blogs.

4 Conclusions

In this paper, we presented a framework that optimizes fuzzy clustering with a concept-based modeling approach for blog classification. We demonstrated the framework's effectiveness by measuring precision and recall on a real blog dataset. We conclude that using the Wikipedia knowledge base combined with fuzzy clustering and fuzzy similarity measures produces better results than traditional clustering techniques. In future work, we plan to extend our experiments to the full TREC Blogs08 dataset and to combine blog features such as blog tags in order to measure clustering performance and scalability.

References

1. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. J. Information Processing & Management 24, 513–523 (1988)
2. Zadeh, L.A.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
3. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. of Cybernetics 3(1), 32–57 (1973)
4. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell (1981)
5. Mendes, M.E.S., Sacks, L.: Evaluating fuzzy clustering for relevance-based access. In: IEEE International Conference on Fuzzy Systems, pp. 648–653 (2003)
6. Miyamoto, S.: Fuzzy multisets and fuzzy clustering of documents. In: 10th IEEE International Conference on Fuzzy Systems, pp. 1191–1194 (2001)
7. Saraçoglu, R., Tütüncü, K., Allahverdi, N.: A fuzzy clustering approach for finding similar documents using a novel similarity measure. Expert Systems with Applications 33(3), 600–605 (2007)
8. Widyantoro, D.H., Yen, J.: A Fuzzy Similarity Approach in Text Classification Task. In: IEEE International Conference on Fuzzy Systems, pp. 653–658 (2000)
9. Ayyasamy, R.K., Tahayna, B., Alhashmi, S., Eu-gene, S.: Concept Based Modeling Approach for Blog Classification using Fuzzy Similarity. In: 8th IEEE International Conference on Fuzzy Systems and Knowledge Discovery, pp. 1007–1011 (2011)
10. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In: AAAI, Park (2006)
11. Ayyasamy, R.K., Tahayna, B., Alhashmi, S., Eu-gene, S., Egerton, S.: Mining Wikipedia Knowledge to improve Document Indexing and Classification. In: 10th Int. Conf. on Information Science, Signal Processing and their Applications, pp. 806–809 (2010)
12. Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering Documents Using a Wikipedia-Based Concept Representation. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 628–636. Springer, Heidelberg (2009)
13. Hu, J., Fang, L., Cao, Y., Hua-Jun Zeng, H., Li, H.: Enhancing Text Clustering by Leveraging Wikipedia Semantics. In: ACM SIGIR, pp. 179–186 (2008)
