A Fuzzy Similarity Approach in Text Classification Task Dwi H. Widyantoro and John Yen Department of Computer Science Texas A&M University College Station, TX 77844-3112
Abstract-We present a fuzzy similarity approach to the text categorization problem. The effectiveness of various fuzzy conjunction and disjunction operators used in the fuzzy similarity formula, together with several document representations, was evaluated on test sets from three text document collections. Based on the empirical results obtained from these collections, a special case of the fuzzy similarity formula performs very well. Keywords: text categorization, fuzzy relation, fuzzy similarity, fuzzy operators, document representation.
I. INTRODUCTION
Text categorization is the problem of assigning a textual document to one or more categories. This task has gained significant attention as the volume of data becomes increasingly unmanageable with the spread of the World Wide Web and the rapid development of Internet technology. As a result, high-precision methods that perform automatic text categorization are needed to reduce the impact of this information boom. For example, they help people find interesting Web pages efficiently (Pazzani et al., 1996), filter electronic mail messages (Sahami et al., 1998), filter netnews (Lang, 1995), etc. Various approaches have been investigated for the text categorization task. Several of them use techniques drawn from machine learning, such as decision tree induction (Quinlan, 1986; Apte et al., 1998), Bayesian methods (Tzeras and Hartmann, 1993), neural networks (Wiener et al., 1995) and k-nearest neighbor (Masand and Linoff, 1992), to induce the category of a document from training examples. Other approaches use mathematical tools as the core of the algorithm. For example, LSI (Deerwester et al., 1990) uses Singular Value Decomposition (SVD) to reduce the dimensionality of the vector space that represents words and documents. Another example is the Support Vector Machine (SVM), which turns the text classification problem into a quadratic programming (QP) problem and then finds the optimal solution of that QP problem (Joachims, 1998; Dumais et al., 1998). In this paper, we describe a fuzzy similarity approach to the text classification task. The approach originates from the Rocchio algorithm adapted to this problem: in Rocchio, a cluster center is created for each category from the training documents, and the similarity between a test document and a category is measured using the cosine coefficient (Rocchio, 1971).
In the fuzzy similarity approach, a fuzzy term-category relation is developed, where the set of membership degrees of words in a particular category represents the cluster prototype of the learned model. Based on this relation, the similarity between a document and a category's cluster center is calculated using fuzzy conjunction and disjunction operators, and the calculated similarity represents the membership degree of the document in the category. Additionally, we investigated the effect of several document representations on the accuracy of prediction. We observed that the simplest document representation, in which the document to be categorized is represented as a boolean feature vector, performs very well, and this representation greatly simplifies the fuzzy similarity formula. The remainder of this paper is structured as follows. Section 2 describes the fuzzy term-category relation, the fuzzy similarity approach for text categorization, and a special case of the fuzzy similarity. In Section 3, we describe our evaluation process and present the experimental results on several document collections, followed by conclusions in Section 4.
II. TEXT CATEGORIZATION
This section describes the fuzzy similarity approach to the text categorization problem. A notion of the membership degree of words in each category as well as the fuzzy similarity formula will be presented. We also present a simple analysis of how the representation of a document affects the fuzzy similarity formula. Finally, we describe how to use the fuzzy similarity measure for categorizing text.
A. Fuzzy Term-Category Relations
The text categorization problem involves two finite crisp sets: a set of recognized terms T = {t1, t2, …, tn} and a set of relevant categories C = {c1, c2, …, cm}. Inspired by fuzzy information retrieval (Miyamoto, 1990), a fuzzy binary relation is used to describe the relation between a term and a category. The relevance of terms to categories is therefore expressed by a fuzzy relation R : T × C → [0,1], where the membership value R(ti, cj) specifies the degree of relevance of term ti to category cj. The membership values of this relation are determined from a set of training examples, each containing a text document and its category.
Let D = { ⟨d1, c(d1)⟩, ⟨d2, c(d2)⟩, …, ⟨dn, c(dn)⟩ } be a set of training examples consisting of n text documents and their crisp classifications. Although the classification of a text document could be represented as a fuzzy set (i.e., membership degrees in multiple categories), we assume that the classification of a document is crisp, or binary. Binary classification eases the evaluation and the comparison of the performance of this approach with that of non-fuzzy approaches. Additionally, the existing document classifications in text document collections are all binary, which makes it difficult, if not impossible, to provide a training set with fuzzy sets of document classes. Each document in the training set is represented by a set of term-frequency pairs, d = { ⟨t1, w1⟩, ⟨t2, w2⟩, …, ⟨tm, wm⟩ }, where wj is the occurrence frequency of term tj in the document. Given a set of training documents D, the membership value of R(ti, cj), denoted µR(ti, cj), is calculated as follows. First, all documents are grouped according to their categories, and the occurrence frequency of each term in each category is collected by summing the term frequencies of the individual documents in that category. Then µR(ti, cj) is calculated as the total number of occurrences of term ti in category cj divided by the total frequency of ti over all categories. This process is expressed in (1):

    µR(ti, cj) = Σ{ wi : ti ∈ dk, dk ∈ D, c(dk) = cj } / Σ{ wi : ti ∈ dk, dk ∈ D }    (1)
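As an illustration, Equation (1) can be sketched in Python. The data structures here (dictionaries of per-document term frequencies and a list of (document, category) pairs) are our own illustrative choice, not from the paper:

```python
from collections import defaultdict

def term_category_memberships(training_docs):
    """Compute mu_R(t, c) as in Eq. (1): the summed frequency of term t
    in documents of category c, divided by its total frequency over all
    categories. `training_docs` is a list of (term_freqs, category)
    pairs, where term_freqs maps each term to its count in one document."""
    freq = defaultdict(float)    # (term, category) -> summed frequency
    total = defaultdict(float)   # term -> frequency over all categories
    categories = set()
    for term_freqs, category in training_docs:
        categories.add(category)
        for term, w in term_freqs.items():
            freq[(term, category)] += w
            total[term] += w
    return {(t, c): freq[(t, c)] / total[t]
            for t in total for c in categories}
```

Applied to training documents whose per-category totals match Table I(b), this reproduces the membership values of Table I(c).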
Thus, the membership of term ti in category cj will be one if the term occurs only in documents classified under category cj. Conversely, a term occurring evenly in documents of many categories is penalized, so that its membership value in each category will be low. The value of µR(ti, cj) is essentially the distribution of term ti over categories. Table I gives a simple illustration of constructing the fuzzy term-category relation. Suppose documents d1 and d2 belong to category c1, and documents d3 and d4 belong to category c2. The distribution of each term in each document is described in Table I(a). Based on this information, the occurrence frequency of each term in each category is collected from each document (see Table I(b)). Finally, the membership degree of each term in each category is calculated; the results are shown in Table I(c).

TABLE I
THE MEMBERSHIP VALUES OF THE TERM-CATEGORY BINARY RELATION.
(a) documents, their categories and the frequency of occurrence of each term in each document, (b) summary of term-category frequencies, (c) the membership values of the term-category relation µR(ti, cj).

(a)
Doc  Cat  t1  t2  t3  t4  t5  t6
d1   c1   2   1   2   -   -   1
d2   c1   3   2   -   -   -   -
d3   c2   -   -   1   2   3   -
d4   c2   -   -   -   3   1   1

(b)
Term  t1  t2  t3  t4  t5  t6
c1    5   3   2   0   0   1
c2    0   0   1   5   4   1

(c)
Term  t1  t2  t3    t4  t5  t6
c1    1   1   0.67  0   0   0.5
c2    0   0   0.33  1   1   0.5

TABLE II
TYPICAL PAIRS OF T-NORMS AND T-CONORMS.

t-norm t(x, y)                                   t-conorm s(x, y)
Einstein product:  x·y / (2 - [x + y - x·y])     Einstein sum:  (x + y) / (1 + x·y)
Algebraic product: x·y                           Algebraic sum: x + y - x·y
Hamacher product:  x·y / (x + y - x·y)           Hamacher sum:  (x + y - 2·x·y) / (1 - x·y)
Minimum:           min{x, y}                     Maximum:       max{x, y}
Drastic product:   min{x, y} if max{x, y} = 1,   Drastic sum:   max{x, y} if min{x, y} = 0,
                   0 otherwise                                  1 otherwise
Bounded difference: max{0, x + y - 1}            Bounded sum:   min{1, x + y}

B. Fuzzy Similarity Measure
Once the membership values of the fuzzy term-category relation are known, we need a way to measure the similarity between a test document to be categorized and the categories' cluster centers, which are represented by the membership values of the terms in each category. The use of fuzzy similarity is motivated by the fact that the category of a document cannot be determined from a single term alone; rather, it is determined from a set of terms that co-occur in training documents classified under the same category. This scheme differs from fuzzy information retrieval, which uses composition rules to compute the set of documents relevant to a query given a fuzzy binary relation between terms and documents. Let a test document be d = { ⟨t1, µd(t1)⟩, ⟨t2, µd(t2)⟩, …, ⟨tm, µd(tm)⟩ }, where µd(ti) represents the membership degree of term ti in d. Given a binary relation R(T, C), the similarity between d and a category cj is given by (2):

    sim(d, cj) = [ Σ_{t∈d} µR(t, cj) ⊗ µd(t) ] / [ Σ_{t∈d} µR(t, cj) ⊕ µd(t) ]    (2)
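A minimal sketch of Equation (2), parameterized by the operator pair; the function names and dictionary structures are ours, not the paper's:

```python
def fuzzy_similarity(doc_memberships, mu_R, category, t_norm, t_conorm):
    """Eq. (2): sum of fuzzy conjunctions over sum of fuzzy disjunctions
    of mu_R(t, c) and mu_d(t), taken over the terms of the document.
    `doc_memberships` maps each term t in the test document to mu_d(t)."""
    num = sum(t_norm(mu_R.get((t, category), 0.0), mu_d)
              for t, mu_d in doc_memberships.items())
    den = sum(t_conorm(mu_R.get((t, category), 0.0), mu_d)
              for t, mu_d in doc_memberships.items())
    return num / den if den else 0.0

# One operator pair from Table II: algebraic product and algebraic sum.
def algebraic_product(x, y):
    return x * y

def algebraic_sum(x, y):
    return x + y - x * y
```

Any other pair from Table II can be passed in the same way, e.g. `min`/`max` for the Minimum/Maximum pair.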
where ⊗ and ⊕ denote the fuzzy conjunction and disjunction operators, respectively. It is interesting to observe how the various fuzzy operators perform. Table II lists the typical pairs of conjunction and disjunction operators considered in our work. In what follows, we use t-norm to refer to fuzzy conjunction operators and t-conorm to refer to fuzzy disjunction operators. We consider two alternatives for determining µd(t), the membership value of a term in a document. In the first alternative, the test document to be categorized is represented as a boolean feature vector that assigns µd(t) = 1 to every term t that occurs in d at least once and zero to every term not found in d. Since only terms that appear in d are used in the computation of the fuzzy similarity defined in Equation 2, µd(t) = 1 for all t ∈ d. In the second alternative, µd(t) is assigned a finer-grained value determined by the relative occurrence frequency of term t in d. In particular, the terms with the largest occurrence frequency are assigned a membership value of one, and the membership values of terms with smaller occurrence frequencies are proportional to that of the largest. Equation 3 defines the membership value assignment in the second alternative:
    µd(ti) = wi / max_{t∈d} {w}    (3)
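The two document representations can be written as follows (the helper names are illustrative):

```python
def boolean_features(term_freqs):
    """First alternative: mu_d(t) = 1 for every term occurring in d."""
    return {t: 1.0 for t in term_freqs}

def relative_frequency_features(term_freqs):
    """Second alternative (Eq. 3): mu_d(t) = w_t / max_{t in d} w."""
    w_max = max(term_freqs.values())
    return {t: w / w_max for t, w in term_freqs.items()}
```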
C. A Special Case of Fuzzy Similarity
We observed that the fuzzy similarity measure in Equation 2 can be simplified by representing the document to be categorized as a boolean feature vector (the first alternative), that is, µd(t) = 1 for all t ∈ d. The simplified formula is unique regardless of the choice of fuzzy conjunction and disjunction operators.

Theorem 1. Given a fuzzy term-category relation R and a document d represented as a boolean feature vector, the membership degree of d in a category cj is determined by

    sim(d, cj) = (1/m) Card[R(t, cj)],  t ∈ d    (6)

where m is the total number of distinct terms in document d.

Proof. Since µd(t) = 1 for all t ∈ d, the numerator and denominator of Equation 2 can be rewritten as t(µR(t, cj), 1) and s(µR(t, cj), 1), respectively. By the boundary condition of t-norm operators, t(µR(t, cj), 1) = µR(t, cj). Meanwhile, it is a proven property that all t-conorm operators are bounded below by max and bounded above by the drastic sum sDS, that is,

    max(µR(t, cj), 1) ≤ s(µR(t, cj), 1) ≤ sDS(µR(t, cj), 1)

Since µR(t, cj) ∈ [0,1], we have max(µR(t, cj), 1) = 1, and according to Table II the drastic sum value sDS(µR(t, cj), 1) is one in all cases. Therefore 1 ≤ s(µR(t, cj), 1) ≤ 1, and hence s(µR(t, cj), 1) = 1. Thus Equation 2 can be rewritten as

    sim(d, cj) = Σ_{t∈d} µR(t, cj) / Σ_{t∈d} 1    (7)

Letting m be the number of distinct terms in d, the denominator of (7) can be replaced by m:

    sim(d, cj) = (1/m) Σ_{t∈d} µR(t, cj)    (8)

The sum of µR(t, cj) is the cardinality of the fuzzy subset of the term-category relation R restricted to the terms occurring in d and category cj. Therefore,

    sim(d, cj) = (1/m) Card[R(t, cj)],  t ∈ d
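Theorem 1 makes the boolean-vector case straightforward to implement; a sketch under the same illustrative structures as above:

```python
def scfsim(doc_terms, mu_R, category):
    """Special case of Eq. (2) for boolean feature vectors: the mean of
    mu_R(t, c) over the m distinct terms of the document (Eq. 6)."""
    terms = set(doc_terms)  # m distinct terms
    return sum(mu_R.get((t, category), 0.0) for t in terms) / len(terms)
```

Note that no t-norm or t-conorm appears: the theorem guarantees the same value for every operator pair in Table II.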
Hence, the theorem is proved.

D. Document Classification
The fuzzy similarity measure between a document d and a category cj defined in Equation 2 can be used as the membership degree of d in category cj, denoted µcj(d). Thus,

    µcj(d) = sim(d, cj)    (4)

By calculating µcj(d) for all categories, we obtain a fuzzy set of the document's categories, denoted CAT(d), such that

    CAT(d) = { µc1(d)/c1, µc2(d)/c2, …, µcm(d)/cm }

In the text classification task, however, we have to choose from among these known categories the one that best represents the category of the document. From CAT(d), it can be taken as the category whose membership value is the highest, which is expressed by the following equation:

    cj = CAT(d)_α,  α = Height[CAT(d)]    (5)
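Classification then reduces to taking the height of CAT(d). A sketch (the function signature, including the pluggable `similarity` argument, is our own illustrative choice):

```python
def classify(doc_memberships, mu_R, categories, similarity):
    """Build CAT(d) via Eq. (4), then return the category at its height
    (Eq. 5). `similarity` is any sim(d, c) implementation, e.g. Eq. (2)
    with a chosen operator pair, or the SCFSim special case."""
    cat_d = {c: similarity(doc_memberships, mu_R, c) for c in categories}
    return max(cat_d, key=cat_d.get)
```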
Hence, the category of d is the α-cut of the fuzzy set CAT(d) at the value that is the height of CAT(d).

III. EVALUATION
In this section we describe the data sets used for evaluation, the experimental procedures and the results. The naïve Bayes classifier, with which the performance of the fuzzy similarity is compared, is also described briefly.

A. Data and Procedure
We used three different text document collections to evaluate the effectiveness of the fuzzy similarity approach on the text categorization problem. The first collection is the Reuters-21578 distribution 1.0, which is available at http://www.research.att.com/~lewis. This collection contains 135 overlapping topics and 21,578 stories obtained from the Reuters newswire in 1987. Of these stories, 12,902 had been assigned to 118 categories, each story belonging to one or several categories. The Reuters data set represents a large collection with unevenly sized categories: one category ("earn") contains about 4000 documents, while many other categories have fewer than ten documents. We divided the stories into training and test documents according to the "ModApte" split, in which 9,603 documents are marked as the training set and 3,299 documents as the test set. For evaluation, we use only the top ten categories, which cover about 77% of the documents. Since the training and test sets of this
collection are fixed, we ran the experiment on this collection only once. For a document in the training set that was assigned to several categories, we used the document to train each category to which it belonged. All documents are pre-processed by removing stop words defined in a stop list, stemming words and identifying bigrams. The stop list contains 293 common words (e.g., "a", "the", "of", "although", "even though") that are not useful as category discriminators. Word stemming converts various word forms into their root; for example, the words assistance, assisting and assistant are all converted to the root assist. This process is useful for reducing the number of features. Finally, bigrams are extracted from two-word phrases that co-occur in a document; bigrams are occasionally good category predictors. The second data set is the UseNet NewsGroups collection gathered by Ken Lang. The collection consists of 19,997 documents evenly divided among 20 UseNet discussion groups: 19 categories contain 1000 documents each and one category has 997 documents. About 4% of the documents appear in two of the newsgroups (Joachims, 1997) and are therefore assigned to two categories. On this collection, we performed only stop-word removal during pre-processing. Two thirds of the documents are used as the training set and the remaining third as the test set. The experimental results on this collection were averaged over 10 runs, where on each run the training and test documents were chosen randomly. The third data set, which represents a small collection, is our own collection (OnlineNews), consisting of 577 text documents in HTML format. The documents were collected from various online newspapers and magazines (e.g., USA Today, Times, Excite, Yahoo!, etc.) at different times in 1998. This collection is divided into six general topics: money, sport, health, weather, technology and world news.
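The pre-processing pipeline described above can be sketched as follows. Stemming is omitted for brevity, and the stop list here is a tiny stand-in for the paper's 293-word list:

```python
import re

STOP_WORDS = {'a', 'the', 'of', 'although'}  # stand-in for the 293-word list

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase and tokenize, drop stop words, then append bigrams
    formed from adjacent surviving tokens."""
    tokens = [w for w in re.findall(r'[a-z]+', text.lower())
              if w not in stop_words]
    bigrams = [f'{x} {y}' for x, y in zip(tokens, tokens[1:])]
    return tokens + bigrams
```

In a fuller version, a stemmer (e.g. Porter's algorithm) would be applied to `tokens` before the bigrams are formed.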
Since the documents are in HTML format, we removed all HTML tags, links and JavaScript. The rest of the document pre-processing is similar to that of the Reuters data set. As with the NewsGroups data set, 2/3 of the documents in this collection are used as the training set and 1/3 as the test set. The training and test documents were selected at random, and the results of the experiment were averaged over 30 runs.

B. Performance Comparison
We used the naïve Bayes classifier as a baseline to compare with our approach under exactly the same conditions, since this classifier is popular as an effective probabilistic approach to classification tasks. Briefly, naïve Bayes assumes that attributes are conditionally independent given a class. Under this assumption, the probabilities of words occurring at different text positions are independent; moreover, occurrences of the same word at different positions are given a single probability estimate regardless of position. Given a class, the probability of a document in that class, based on the words found in the document, is the product of the probabilities of the individual words and the prior class probability. To avoid zero-count words dampening the information in the other probabilities, the Laplace estimate is used to compute the word probabilities. The naïve Bayes classification then maximizes the probability of the words that actually occur in the document, and is defined as follows:
    c* = arg max_{ci} P(ci) Π_k P(wk | ci)

Although the independence assumption adopted by this classifier is incorrect in practice, it does not necessarily degrade classification accuracy (Domingos and Pazzani, 1997).

C. Results
The performance reported in this subsection is measured by the accuracy of predicting the category of test documents, defined as the percentage of test documents that are correctly categorized. For a document in the Reuters test set that was assigned to several categories, a prediction is considered correct if it matches one of the assigned categories. We did not apply this procedure to the NewsGroups documents because we could not determine which documents in that data set were cross-posted. Since the performance of naïve Bayes was sensitive to the presence of irrelevant words, we pruned some words so that only a subset of the vocabulary obtained from the training set was used. In particular, words that appear in the documents fewer than a certain number of times were removed from the vocabulary before computing the word probabilities. We present only the best results from the naïve Bayes experiments after varying the number of removed words. The results of the fuzzy similarity experiments were obtained using the full vocabulary of the training set. For brevity, we use SCFSim to refer to the special case of fuzzy similarity, i.e., fuzzy similarity with test documents represented as boolean feature vectors. Drastic, MinMax, Bounded, Hamacher, Einstein and Algebraic refer to fuzzy similarity using the corresponding fuzzy conjunction and disjunction operators with weighted feature vectors of the test documents. Table III summarizes the experimental results on all data sets. On the Reuters data set, the special case of fuzzy similarity (SCFSim) performs slightly better than the naïve Bayes classifier.
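The baseline can be sketched as a multinomial naïve Bayes with Laplace smoothing. This is our own minimal implementation of the standard algorithm, not the authors' code:

```python
import math
from collections import Counter, defaultdict

def train_nb(training_docs):
    """training_docs: list of (tokens, category). Returns per-class word
    counts, per-class document counts, and the vocabulary."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for tokens, c in training_docs:
        class_counts[c] += 1
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify_nb(tokens, word_counts, class_counts, vocab):
    """argmax_c P(c) * prod_k P(w_k | c), with Laplace-smoothed word
    probabilities, computed in log space for numerical stability."""
    n_docs = sum(class_counts.values())
    def log_posterior(c):
        total = sum(word_counts[c].values())
        lp = math.log(class_counts[c] / n_docs)
        for w in tokens:
            if w in vocab:  # words outside the vocabulary are ignored
                lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return lp
    return max(class_counts, key=log_posterior)
```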
The accuracy of SCFSim (89.2%) is 0.3% higher than that of naïve Bayes (88.9%), and naïve Bayes outperforms the fuzzy similarity variants that use weighted feature vectors of the test documents.

TABLE III
THE SUMMARY OF EXPERIMENTAL RESULTS (PREDICTION ACCURACY, %).

Method       Reuters   NewsGroups   OnlineNews
SCFSim       89.2      90.7         95.7
Algebraic    88.1      86.6         96.7
Einstein     87.8      86.2         97.0
Bounded      81.1      78.4         96.9
Hamacher     86.4      69.7         64.8
Drastic      76.5      69.6         96.9
MinMax       80.6      47.4         29.1
Naïve Bayes  88.9      89.8         89.3
On the NewsGroups data set, SCFSim again performs best among the methods. Using a t-test, the accuracy of SCFSim (90.7%) is significantly better than that of the naïve Bayes classifier (89.8%). Joachims reported that on this data set the best prediction accuracy was 89.6% using naïve Bayes, which is very close to our result with the same algorithm, and 91.8% for Probabilistic TFIDF (PrTFIDF) (Joachims, 1997). However, Joachims handled documents belonging to two categories (4%) in the same way as we did the multiple-category documents of the Reuters set, whereas we did not do this for the NewsGroups collection. The ranking of Algebraic and Einstein on this collection is consistent with their ranking on the Reuters set. It is not surprising that Algebraic Product & Sum performs well, since the fuzzy similarity formula using these operators is exactly the Jaccard coefficient, a widely used similarity measure in the information retrieval literature (Salton, 1983). Table IV describes the per-category accuracy of the three best methods. The best performance on the OnlineNews collection is obtained by Einstein (97%), followed by Drastic (96.9%), Bounded (96.9%), Algebraic (96.7%) and SCFSim (95.7%). The accuracies of these methods are significantly higher than the prediction accuracy of naïve Bayes (89.3%). Although SCFSim is not the best on this data set, its accuracy is still very high. In addition to the two alternatives of test document representation described above, we also performed experiments in which the normalized term frequency (TF) and the TFIDF weighting scheme were used to estimate the membership degrees of words occurring in the documents. However, the performance obtained with these two weighting schemes was even worse.
We conjecture that the performance drop caused by fine-tuning the word weights in the test documents, represented by µd(t), is due to the different meanings of µd(t) and the membership value of the fuzzy term-category relation, µR(ti, cj). In particular, the value of µd(t) calculated by any of the weighting schemes captures the importance of a word relative to the other words in the same document, and this value can differ across documents for the same word. On the other hand, the value of µR(ti, cj) represents the distribution of a word over categories and provides evidence of how much the word votes for each category. Additionally, the values of µR(ti, cj) are calculated from the statistical information of the training documents, so they adequately capture the model of each category. In the fuzzy similarity formula, the value of µd(t), which carries no information about the classification task, distorts the estimate µR(ti, cj), which is the more relevant quantity. Therefore, prediction accuracy improves when the distortion of µR(ti, cj) by µd(t) is reduced in the fuzzy similarity formula. In fact, the values of µR(ti, cj) are left intact by the fuzzy conjunction operators whenever all µd(t) are set to one. Additionally, by representing test documents as boolean feature vectors, all values resulting from the fuzzy disjunction become one, providing a normalization factor for the "votes" collected by µR(ti, cj). This is the
condition in which the performance of the fuzzy similarity formula is maximized.

IV. CONCLUSION
We have described in this paper a fuzzy similarity approach to the text classification task. Using the fuzzy term-category relation derived from the same membership function, the representation of the test documents to be categorized becomes a critical factor. Algebraic and Einstein Product & Sum perform consistently well across data sets when the document features are weighted. The best performance is achieved by the special case of fuzzy similarity, in which documents are represented as boolean features indicating the presence or absence of words and only the fuzzy term-category relation is used to predict the category of documents.

REFERENCES
[1] Apte, C., Damerau, F. and Weiss, S. (1998) Text Mining with Decision Rules and Decision Trees. In Proceedings of the Conference on Automated Learning and Discovery, CMU, June.
[2] Deerwester, S., Dumais, S., Furnas, G., Landauer, T. and Harshman, R. (1990) Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
[3] Domingos, P. and Pazzani, M. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 29, 103-130.
[4] Dumais, S., Platt, J., Heckerman, D. and Sahami, M. (1998) Inductive Learning Algorithms and Representations for Text Categorization. In Proceedings of the 1998 ACM 7th International Conference on Information and Knowledge Management, 148-155.
[5] Joachims, T. (1997) Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the International Conference on Machine Learning (ICML'97), 143-151.
[6] Joachims, T. (1998) Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning (ECML), Springer Verlag.
[7] Lang, K. (1995) NewsWeeder: Learning to Filter News. In Proceedings of the 12th International Conference on Machine Learning, 331-339, Lake Tahoe, CA.
[8] Masand, B., Linoff, G. and Waltz, D. (1992) Classifying News Stories Using Memory Based Reasoning. In Proceedings of the 15th Annual ACM/SIGIR Conference on Research and Development in Information Retrieval, 59-65.
[9] Miyamoto, S. (1990) Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer Academic Publishers, Boston.
[10] Pazzani, M., Muramatsu, J. and Billsus, D. (1996) Syskill & Webert: Identifying Interesting Web Sites. In Proceedings of the 13th National Conference on Artificial Intelligence, 54-61, Portland, OR.
[11] Quinlan, J.R. (1986) Induction of Decision Trees. Machine Learning, 1:81-106.
[12] Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, 17-26, Morgan Kaufmann Publishers.
[13] Rocchio, J. (1971) Relevance Feedback in Information Retrieval. In Gerard Salton (Ed.), The SMART Retrieval System - Experiments in Automatic Document Processing, 313-323. Prentice-Hall, Englewood Cliffs, NJ.
[14] Salton, G. and McGill, M.J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.
[15] Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E. (1998) A Bayesian Approach to Filtering Junk E-mail. AAAI '98 Workshop on Text Categorization, July.
[16] Tzeras, K. and Hartmann, S. (1993) Automatic Indexing Based on Bayesian Inference Networks. In Proceedings of the 16th Annual ACM/SIGIR Conference on Research and Development in Information Retrieval, 22-34.
[17] Wiener, E., Pedersen, J. and Weigend, A. (1995) A Neural Network Approach to Topic Spotting. Fourth Annual Symposium on Document Analysis and Information Retrieval.