Automatic Text Categorization

Oren B. Yeshua
[email protected]
December 5, 2006

1 Introduction

The Internet and its widespread adoption have ignited a ceaseless proliferation of electronic documents of all kinds. In an effort to organize all of this information, the problem of automatic text categorization (ATC) has come to the fore. Given a particular document, we would like to assign it to a category to which it belongs. Practical applications abound, and the resulting algorithms are used daily by millions in news filters and aggregators, web indices (e.g. the Yahoo! Directory), and spam filters. The potential uses in the national defense and intelligence-gathering community make ATC an active area of research even with several effective approaches already in place.

As with most natural language challenges, the general problem of text categorization is ill-posed, and we will need additional assumptions to make it concrete. For the purposes of this survey, we consider text categorization to be a multi-label classification problem: given a natural language document, the task is to assign to it the single label/category (from a fixed number of categories) which best describes the document. Typically the labels are categories representing the possible topics a document may refer to; however, as we shall see, there has been a good deal of work done in non-topic-based categorization as well.

Naturally, the gold standard is provided by human categorization, and tagged corpora are used to measure results. One of the most commonly used corpora for ATC is the Reuters-21578 corpus. Assembled in 1987, the corpus contains 21,578 documents that appeared in the Reuters news media during that year. With one notable exception, all of the papers in this survey report results on Reuters-21578, so comparisons made between the various approaches will be benchmarked by this corpus. Furthermore, the key statistic that is consistently reported is the precision/recall "break-even" point. A trade-off between precision and recall can be achieved by adjusting the confidence threshold of a binary classifier; the "break-even" point refers to the point at which precision equals recall. It should be noted that ATC is not a binary classification task, so some method of averaging is necessary to combine the results of the various binary classifications. Some results are reported using micro-averaging while others use macro-averaging. Furthermore, there are several standard train/test "splits" used with the Reuters corpus, and different researchers may have chosen different splits. For the purposes of this survey, we will not concern ourselves with these details. While every attempt will be made to choose comparable values from among those reported, the values will be used to make only "high-level" comparisons between the various approaches.
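To make the micro/macro distinction concrete, the following Python sketch contrasts the two averaging schemes over a handful of binary per-category classifiers. The per-category counts are invented purely for illustration:

# Minimal sketch of micro- vs. macro-averaged precision over binary
# per-category classifiers. The counts below are invented for illustration.
categories = {
    # category: (true positives, false positives)
    "earn":  (900, 100),   # a large, easy category
    "acq":   (450,  50),
    "wheat": (  5,  15),   # a small, hard category
}

# Macro-average: compute precision per category, then average the values.
per_cat = [tp / (tp + fp) for tp, fp in categories.values()]
macro_precision = sum(per_cat) / len(per_cat)

# Micro-average: pool all decisions, then compute a single precision.
total_tp = sum(tp for tp, _ in categories.values())
total_fp = sum(fp for _, fp in categories.values())
micro_precision = total_tp / (total_tp + total_fp)

print(f"macro: {macro_precision:.3f}")  # pulled down by the small category
print(f"micro: {micro_precision:.3f}")  # dominated by the large category

Micro-averaging weights each classification decision equally (so large categories dominate), while macro-averaging weights each category equally.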

2 Background

One of the earliest algorithms applied to the task of ATC is Rocchio's algorithm, a technique first developed for information retrieval. Like most of the techniques we will discuss, Rocchio's algorithm makes the "bag-of-words" assumption. Essentially, it views a document as a collection of words, each used with varying frequency, and ignores the relative ordering of the words. Explicitly, it treats a document as a vector whose ith entry is the term frequency (TF) of the ith word in the vocabulary. Since such a vector would be extremely long and require an inordinate amount of computation time, simple techniques are used to eliminate certain words from being used as features. Commonly, words with extremely low occurrence (< 3) in the corpus are eliminated, along with "stop" words that are so frequent as to impart little value to the categorization task. Rather than simple TF, TF-IDF weighting is often used to create feature weights that better reflect the importance of a given feature/word towards categorizing the document. In TF-IDF weighting, TF again refers to term frequency (the number of occurrences of a given word in the document) and IDF is the inverse document frequency. The document frequency (DF) is the number of documents in the corpus in which the given word appears at least once. The TF-IDF weight of word w_i in document d is then given by

$$\mathrm{TFIDF}(w_i, d) = \mathrm{TF}(w_i, d) \cdot \log\left(\frac{\text{num docs in corpus}}{\mathrm{DF}(w_i)}\right) \qquad (1)$$
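As a quick illustration of equation (1), here is a minimal Python sketch. The toy corpus is invented for the example; a real system would also apply the stop-word removal and minimum-frequency cutoff described above:

import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus_docs):
    """Weight each term in a tokenized document by TF * log(N / DF),
    mirroring equation (1). Assumes every term in the document also
    appears somewhere in the corpus (so DF is never zero)."""
    n_docs = len(corpus_docs)
    tf = Counter(doc_tokens)
    # Document frequency: number of corpus documents containing the term.
    df = lambda w: sum(1 for d in corpus_docs if w in d)
    return {w: count * math.log(n_docs / df(w)) for w, count in tf.items()}

corpus = [{"oil", "prices", "rose"}, {"wheat", "prices", "fell"}, {"oil", "exports"}]
print(tfidf_vector(["oil", "oil", "prices"], corpus))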

Rocchio's algorithm proceeds to compute a "prototype" vector for each class from the positive and negative examples in the training set, and uses the cosine similarity between a document vector and the prototype of each class to determine the categorization (the largest cosine similarity implies the smallest angular distance and is chosen as the closest match). It has been shown (Joachims) that the basic Rocchio algorithm is not particularly well suited to the task of categorization (in comparison with more sophisticated machine learning techniques), but it serves as a good baseline, and it also introduced several recurring concepts such as the "bag-of-words" model and TF-IDF weights.
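A minimal sketch of the Rocchio scheme follows. The positive/negative weighting (beta, gamma) is one common parameterization, and the specific values and toy data are illustrative assumptions, not taken from the surveyed papers:

import numpy as np

def rocchio_prototypes(X, y, beta=16.0, gamma=4.0):
    """One prototype per class: a weighted mean of positive examples minus
    a weighted mean of negative examples (a common Rocchio parameterization;
    the beta/gamma values here are illustrative)."""
    prototypes = {}
    for c in np.unique(y):
        pos = X[y == c]
        neg = X[y != c]
        prototypes[c] = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
    return prototypes

def classify(x, prototypes):
    """Assign the class whose prototype has the largest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(prototypes, key=lambda c: cos(x, prototypes[c]))

# Tiny illustrative run with 3-dimensional TF-IDF-style vectors.
X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.3]])
y = np.array(["grain", "grain", "oil"])
print(classify(np.array([0.8, 0.2, 0.1]), rocchio_prototypes(X, y)))  # -> grain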

3 Machine Learning & Support Vector Machines

In 1997, Thorsten Joachims introduced the use of Support Vector Machines (SVMs) to the ATC domain in [Joachims, 1997]. Joachims uses the same basic setup as was described for the Rocchio algorithm. Each document is represented as a feature vector of TF-IDF weights for non-stop words that appear at least 3 times in the training data. The document feature vectors are normalized to unit length to account for varying document sizes.

Joachims goes to great lengths to justify the appropriateness of SVMs for the ATC task. A problem that most ATC classifiers face is the large number of features (distinct words in the training corpus), often on the order of 10^4. A simple approach, common to other ML applications, is to utilize some form of feature selection to reduce the size of the feature vectors by using only a subset of features that are deemed most important to the classification. One popular and effective method of feature selection is the information gain criterion, where the features are ranked in order of the information¹ they possess regarding the classification and features ranked below some threshold are discarded. Joachims demonstrates empirically that this approach is not particularly effective for ATC by training a Naive Bayes classifier only on the set of features ranked lowest in information gain. This classifier was still much better than random guessing at categorizing the Reuters corpus, indicating that very few of the features are irrelevant in ATC. What we are left with is a long but sparse feature vector (since any given document is likely to use only a small fraction of the total words in the vocabulary) - a prime candidate for learning via the SVM [Kivinen et al., 1995]. It is also pointed out that most ATC tasks are linearly separable² and that exceptions are often caused by errors in the datasets. Experiments indicate that the Ohsumed categories (a popular medical document corpus) are all linearly separable, and many of the Reuters categories are linearly separable as well. SVM classification involves finding a maximum-margin linear separator, so SVMs are a natural choice for any linearly separable classification task.

Joachims compares the results of SVM classification with those of several other popular and previously demonstrated ML techniques for ATC. The first is the Rocchio algorithm, which we have already discussed; it scores a precision/recall break-even point of 79.9% on the Reuters corpus. Additionally, results are reported for Naive Bayes, k-Nearest Neighbors, and Decision Tree classification. Briefly, Naive Bayes (NB) is a probabilistic classifier that makes a strong (and untrue) independence assumption. When applied to ATC, the assumption is that a given word occurs independently of the other words in the document. Despite this assumption, NB performs rather well in ATC, with a P/R break-even point of 72.0% on the Reuters corpus in this study (random guessing yields 21%). Decision Trees (DTs) are efficient multi-label classifiers but are prone to overfitting the training data. They are a simple data structure, typically built using the information gain criterion previously discussed [see Russell & Norvig]. The C4.5 DT learning algorithm does somewhat better than NB on the Reuters corpus, with a break-even point of 79.4%. Finally, the k-Nearest Neighbors (k-NN) approach is simply to classify a sample based on a weighted average of the labels (+1/-1) of its k nearest neighbors. In the case of ATC, "nearness" is determined by cosine similarity, and the multi-label classification is achieved via multiple binary classifications. Of the four ML algorithms used for comparison in this study, k-NN is the most effective for ATC, with a break-even point of 82.3%. The above ML algorithms were the convention at the time of publication, and their results on ATC had already been firmly established. The results reported, however, were reproduced under the same experimental setup used for the SVM so as to verify the validity of the comparison.

¹ Here we use "information" in the information-theoretic sense of the word, which is closely related to entropy - see Shannon.
² The positive and negative examples can be separated by a hyperplane in the feature space - see Kearns & Vazirani, 1994.
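To make the information gain criterion concrete, here is a small Python sketch that ranks binary "term present" features against a binary category label. The toy data is ours; a real system would rank tens of thousands of terms this way:

import math

def entropy(p):
    """Binary entropy in bits; 0 by convention at p in {0, 1}."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(docs, labels, term):
    """IG of a binary 'term present' feature with respect to a binary label.
    docs is a list of token sets, labels a parallel list of 0/1 values."""
    n = len(docs)
    base = entropy(sum(labels) / n)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without):
        if subset:
            cond += len(subset) / n * entropy(sum(subset) / len(subset))
    return base - cond  # entropy of label minus entropy after observing term

docs = [{"oil", "barrel"}, {"oil", "opec"}, {"wheat", "crop"}, {"corn", "crop"}]
labels = [1, 1, 0, 0]  # 1 = "oil" category
terms = sorted({t for d in docs for t in d})
ranked = sorted(terms, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranked)  # "oil" and "crop" carry the most information here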


Several results are reported for the SVM using both polynomial and RBF kernels with varying parameters. In general, the SVM classifier significantly outperformed the others on all the corpora tested, with a break-even point around 86% on the Reuters corpus. This paper represents the results of the straightforward application of machine learning techniques to ATC. In a sense, it is the culmination of such investigation to date, as SVMs are still the state of the art in ML. The SVM affords several advantages over the previous ML algorithms when applied to ATC, including improved performance, robust performance across domains (some of the other algorithms fluctuate, demonstrating strong performance on certain datasets and poor performance on others), automatic tuning of parameters, and efficient learning in the presence of many features (no need for feature selection). On the other hand, it is by no means the last word on ATC research in general. Even with the improved results of the SVM classifier, there is still much room for investigation and improvement. The remaining papers we will review all expand on the work done thus far.
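For the curious reader, the setup Joachims describes can be approximated in a few lines with a modern library. The sketch below uses scikit-learn (an anachronism relative to the 1997 study), and the toy data and minimum-frequency setting are our own illustrative assumptions:

# A sketch of the Joachims-style setup: TF-IDF features, unit-length
# document vectors, and a one-vs-rest linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["oil prices rose sharply", "wheat harvest fell", "crude oil exports grew"]
train_labels = ["oil", "grain", "oil"]  # toy stand-ins for Reuters categories

clf = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", norm="l2"),  # unit-length vectors
    OneVsRestClassifier(LinearSVC()),  # one binary classifier per category
)
clf.fit(train_docs, train_labels)
print(clf.predict(["opec raised oil output"]))  # plausibly -> ['oil']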

4 Boosting

Much of the success in ML to date has been the direct result of advances in computational learning theory (CLT) - the branch of theoretical computer science devoted to exploring the inherent complexities involved in learning and teasing out what is possible from what is not. Many ML algorithms, including SVMs, grew out of exactly such formal inquiries. Another direction taken by CLT is known as accuracy boosting. It turns out that given a weak learning algorithm, one can increase (boost) its accuracy (arbitrarily, in theory) by repeated application of the algorithm, taking a weighted majority vote among the hypotheses generated. Intuitively, the result is achieved as follows: each time the weak learner is run, it is given training data that the previous hypothesis failed to classify correctly. There is a significant difference in approach here, in that the particular weak learning algorithm used can be any of the methods we've described thus far, and even much weaker ones. In a sense, boosting is a meta-level learning algorithm. Robert Schapire proved that boosting was possible by presenting the first boosting algorithm [Schapire, 1990]. Since then, many variations have been developed that work well in practice. At the center of these is the AdaBoost algorithm [Freund & Schapire, 1997].

In BoosTexter, Schapire and Singer present a variant of AdaBoost tailored toward ATC and compare the results with those of the previous work we have discussed. Two new boosting algorithms are introduced in the paper. The first, AdaBoost.MH, is a straightforward adaptation of AdaBoost to multi-label classification. The usual reduction from multi-way classification to binary classification is employed, breaking the multi-label problem into a series of binary classifications. Additionally, another boosting algorithm, AdaBoost.MR - designed to natively support multi-label classification - is presented. We will omit description of that algorithm here, as AdaBoost.MH outperformed it by most performance measures.

As previously mentioned, boosting is intended to work as a meta-algorithm, independent of any particular choice of learning algorithm. However, this study focused on a very simple set of weak hypotheses, perhaps to highlight the role of boosting in the learning process. The hypotheses utilized all have the same basic form - if term appears in document, then category is label - where term is any unigram or bigram appearing in the training data (as usual, stop-list words are removed from the corpus prior to processing), and label is either one of the potential categories or a real number indicating a confidence that the document in question belongs to a specific category (real-valued hypotheses are used with real AdaBoost, which runs more efficiently than the discrete version). The weak learning algorithm works as follows:



for each term_i:
    score_i <- Z_t given h_i   (where h_i is the hypothesis constructed from term_i)
return h_{i*} where i* = argmin_i score_i

Analysis of AdaBoost indicates that the best choice of hypothesis is the one that minimizes Z_t (the normalization factor in the computation that updates the probability distribution over examples after iteration t of boosting). The paper presents derivations for the hypothesis scores in the algorithm above, as well as for α_t (the weighting factor for hypothesis t in the final vote), for each of three variants of AdaBoost.MH (discrete, real, and real-abstaining). The results reported are extensive. Along with comparisons among the various boosting variants, boosting was performed on a wide variety of corpora and compared with a full range of ATC techniques, including several outside of the ML field. Results on an ASR task reduced to ATC are reported as well. To maintain continuity, we note only that BoosTexter was able to outperform its competition in nearly every scenario, given sufficient iterations of boosting. On the Reuters corpus, AdaBoost.MH achieved a break-even point of 86%, whereas k-NN (the next highest performer) was reported at 85%. Given the slightly different results for k-NN, we hesitate to draw a direct comparison with the SVM (which was conspicuously absent from this study), but it suffices to say that BoosTexter (at least at the time of publication) easily rivaled the state of the art in ATC.
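The following sketch shows a stripped-down binary version of this procedure - discrete AdaBoost over single-term decision stumps - rather than the full AdaBoost.MH. It is meant only to make the select-and-reweight loop and the role of Z_t concrete; the data and round count are invented:

import math

def weak_learner(docs, labels, weights, vocab):
    """Pick the single-term stump h(d) = +1 if term in d else -1 (or its
    negation) that minimizes the weighted error, which for discrete
    AdaBoost is equivalent to minimizing the normalizer Z_t."""
    best = None
    for term in vocab:
        for sign in (+1, -1):
            err = sum(w for d, y, w in zip(docs, labels, weights)
                      if sign * (1 if term in d else -1) != y)
            if best is None or err < best[0]:
                best = (err, term, sign)
    return best  # (weighted error, term, polarity)

def adaboost(docs, labels, rounds=5):
    """Binary discrete AdaBoost over term-presence stumps (a simplification
    of AdaBoost.MH to a single category)."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, term, sign = weak_learner(docs, labels, weights, vocab)
        if err >= 0.5:
            break
        err = max(err, 1e-10)  # avoid division by zero on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)  # weight in the final vote
        ensemble.append((alpha, term, sign))
        # Reweight: emphasize the examples this stump got wrong.
        weights = [w * math.exp(-alpha * y * sign * (1 if term in d else -1))
                   for d, y, w in zip(docs, labels, weights)]
        z = sum(weights)  # Z_t, the normalization factor
        weights = [w / z for w in weights]
    return ensemble

docs = [{"oil", "barrel"}, {"oil", "opec"}, {"wheat", "crop"}, {"corn", "crop"}]
labels = [+1, +1, -1, -1]
print(adaboost(docs, labels, rounds=3))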

5 Non-Topic-Based Text Categorization

We break momentarily from the race to push the envelope of ATC performance to explore some of the work that has branched off from the standard ATC formulation. In the lightheartedly titled Thumbs Up? Sentiment Classification using Machine Learning Techniques, Bo Pang and Lillian Lee explore an area adjacent to the topic classification we have seen thus far. Suppose we would like to categorize a document by the opinion or sentiment of the author rather than by the subject of its content. Pang and Lee use the movie review domain, which affords a wealth of examples available online, most of which have an associated human labeling (thumbs up/down, stars, etc.). They also discuss previous work in non-topic-based text categorization, which we relate here.


Text categorization has branched into several adjacent areas. Researchers have attempted to automatically determine the "style" of a document using ATC techniques, where style can refer to the publisher (NY Times vs. The Daily News), author, genre (e.g. editorial), "brow" (high-brow vs. low-brow), register, native-language background, and more. Attempts at identifying the use of subjective language have successfully recognized that an opinion is expressed, without identifying what opinion is expressed. Prior to Thumbs Up?, sentiment-based classification was mostly knowledge-based, focusing on semantic understanding of the text or of individual words (CU's own McKeown, 1997 is cited in this regard). The novelty claimed is the application of ML-based ATC to sentiment classification.

The corpus used was collected from the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. Ratings were extracted and converted to a standard categorization of {positive | neutral | negative}. Neutral reviews were discarded, leaving a corpus of 752 negative and 1301 positive reviews from a total of 144 reviewers. Preliminary tests to gauge the difficulty of the problem revealed that a short human-generated list of indicator words chosen by intuition does quite poorly, besting random guessing by only a small margin. The hypothesis to be tested is whether the movie review sentiment classification problem can be posed as a binary-classification ATC problem. To that end, NB, MaxEnt, and SVM classifiers were each applied to the corpus. In preparation, rating indicators and HTML were removed from the documents, but no stemming or stop list was used. Both unigrams and bigrams were used as features, with a TF cutoff of 4 for unigrams and 7 for bigrams (lower cutoffs did not improve results). Accuracies as high as 82% were achieved with the SVM classifier, with NB and MaxEnt performing slightly below. The researchers attempted several variations - bigram inclusion/exclusion, negation tags (the tag NOT_ was prepended to every word between a negation word such as "isn't" or "not" and the following punctuation mark), POS tagging, and word position - in an attempt to boost the accuracy to the levels demonstrated in topic-based ATC (per-category levels in the 90s are typical, and these correspond to binary classifications such as this sentiment task), but to no avail. They conclude that the sentiment task is inherently more difficult because it is more dependent on semantic meaning, but that the standard bag-of-words approach worked better than anticipated. The authors leave us with a telling example of the difficulties faced in the movie review domain: "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However,..."
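A sketch of the negation transformation follows; the tokenizer and the particular list of negation words are our own illustrative assumptions, not taken from the paper:

import re

NEGATIONS = {"not", "isn't", "doesn't", "no", "never", "didn't"}  # illustrative list

def tag_negation(text):
    """Prefix NOT_ to every token between a negation word and the next
    punctuation mark, per the transformation described in Thumbs Up?
    (the exact negation list and tokenizer here are assumptions)."""
    out, negating = [], False
    for tok in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if tok in NEGATIONS:
            negating = True
            out.append(tok)
        elif tok in ".,!?;":
            negating = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return " ".join(out)

print(tag_negation("This isn't a good movie, though the cast is good."))
# -> "this isn't NOT_a NOT_good NOT_movie , though the cast is good ."

The intent is that "good" and "NOT_good" become distinct features, so a negated compliment no longer votes for the positive class.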

6 Encyclopedic Knowledge

Let's fast-forward to the present day and see if we can't catch up with the current state of the art in ATC. In an article published this year, Gabrilovich and Markovitch of the Technion go beyond the now-exhausted bag-of-words approach in an effort to reach new levels of accuracy. Clearly, the standard term features are insufficient, which brings us to the realm of feature generation. It is theorized that what was lacking from previous attempts at categorization is a priori knowledge of the categories. The a priori insight has suffused modern AI from cognitive psychology to connectionist modeling - it seems only natural to bring it to bear on the ATC problem as well. A first attempt at incorporating knowledge was the ever-popular WordNet, but it was found that the hierarchical organization of WordNet was not well suited for this task. Inherently, it is a random access task [Valiant, 2000] that requires associations all across the knowledge base. Naturally, the investigators took to the largest repository of human knowledge in existence - Wikipedia.

The approach is relatively simple, requiring only statistical techniques and no parsing of the articles or explicit knowledge representation, although the devil is in the details (dealing with the enormous size of Wikipedia and making the right associations was the major challenge). A document is first passed to a feature generator, which finds Wikipedia articles that relate to the document. A second step filters out articles that are deemed extraneous. Finally, the article titles are added to the feature vector for the document, enriching the feature space. From there everything proceeds as usual: the classifier is learned using an SVM with a linear kernel, and subsequent test documents are categorized by the resulting classifier. The magic happens in the feature generator, where a document is matched with related Wikipedia articles. The authors describe a centroid classifier [Han & Karypis, 2000] at the core of the generator, made possible by an inverted index. They also stress the importance of their multi-resolution approach, which involves feature generation at the word, sentence, paragraph, and document levels to implicitly perform word sense disambiguation and address polysemy.

The results are quite striking. Not only did the knowledge-based approach break the "plateau" achieved by the SVM on the Reuters dataset (a 1.5% improvement in break-even point), but significant improvements of over 18% and 30% were achieved on individual categories in Ohsumed and RCV1 respectively. The Wikipedia-based classifier was even tested on the movie review domain we encountered earlier and posted a 3.6% improvement. The most improvement came in the classification of short documents that were previously unclassifiable due to the limited amount of information available to the classifier. The authors note that this is just the tip of the iceberg in the use of Wikipedia as a knowledge base. Further work will involve following links between articles to generate more associations.
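The sketch below caricatures the core of the feature generator: documents are matched against precomputed concept (article) vectors by cosine similarity, and the best-matching titles become extra features. The toy concept vectors stand in for the real inverted index over all of Wikipedia, and the filtering and multi-resolution machinery are omitted:

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb + 1e-12)

def generate_features(doc_tokens, concept_vectors, top_k=3):
    """Match a document against precomputed Wikipedia concept (article)
    vectors and return the top-scoring article titles as extra features.
    The concept vectors here are toy stand-ins for the real system's
    inverted index over all of Wikipedia."""
    doc_vec = Counter(doc_tokens)
    scored = sorted(concept_vectors.items(),
                    key=lambda kv: cosine(doc_vec, kv[1]), reverse=True)
    return [title for title, _ in scored[:top_k]]

concepts = {
    "Petroleum": {"oil": 1.0, "crude": 0.8, "barrel": 0.6},
    "OPEC":      {"opec": 1.0, "oil": 0.7, "quota": 0.5},
    "Wheat":     {"wheat": 1.0, "crop": 0.7, "harvest": 0.6},
}
doc = ["oil", "prices", "and", "opec", "quota", "talks"]
print(generate_features(doc, concepts, top_k=2))  # -> ['OPEC', 'Petroleum']
# These titles would then be appended to the document's bag-of-words
# feature vector before SVM training.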

7 Conclusion

We've discussed the development of the ATC field from its inception to the current state of the art through the channel of four seminal papers in the field. Many approaches to ATC from various fields were covered (among those that didn't make the cut are rule-based approaches and techniques from data mining, which have also contributed to this burgeoning area of investigation). The current state of the art is quite impressive, with precision/recall break-even points routinely in the 90-94% range on a wide variety of datasets and categorization tasks. However, there is still plenty of room for improvement, and the infusion of knowledge-based approaches is just beginning.

References

[1] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. 1997.
[2] R. Schapire, Y. Singer. BoosTexter: A Boosting-based System for Text Categorization. 2000.
[3] B. Pang, L. Lee, S. Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP, 2002.
[4] E. Gabrilovich, S. Markovitch. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. 2006.
[5] T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. 1997.
[6] S. Dumais, J. Platt, D. Heckerman, M. Sahami. Inductive Learning Algorithms and Representations for Text Categorization. 1998.
[7] M. Kearns, U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.
[8] Y. Freund, R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
[9] L. Valiant. Circuits of the Mind. Oxford University Press, 2000.
[10] S. Russell, P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
[11] M. Antonie, O. Zaiane. Text Document Categorization by Term Association. 2002.

