TEXT CATEGORIZATION
Building a kNN classifier for the Reuters-21578 collection

Anita Krishnakumar
University of California, Santa Cruz
Department of Computer Science
[email protected]

December 4, 2006

Abstract

Categorization of texts into topical categories has gained booming interest over the past few years. The rapid growth of online information has created a growing need for tools that help in finding, filtering and managing high-dimensional data. Building a text classifier by hand is time-consuming and costly, so automated text categorization has gained a lot of importance. A general inductive process automatically builds a classifier by learning, from a set of previously classified documents, the characteristics of one or more categories. In this project we look at the main approaches that have been taken towards text categorization. The k-nearest neighbour algorithm is then used to build a classifier for the Reuters collection.

1 Introduction

Text categorization is the process of identifying the class to which a text document belongs. This generally involves learning, for each class, its representation from a set of documents that are known to be members of that class. A number of statistical classification and machine learning techniques have been applied to text categorization, including regression models, nearest neighbour classifiers, decision trees, Bayesian classifiers, support vector machines and neural networks. In this project a study of different methods for text categorization has been carried out, and previous work on text categorization with the Reuters-21578 collection is surveyed. Results of my experiment on the Reuters collection using the kNN algorithm are also presented.


2 Feature Extraction

2.1 Pre-processing

The first step is to convert the documents, which are strings of characters, into a representation suitable for the learning algorithm. The transformation usually involves:
- removal of HTML or other tags,
- removal of commonly occurring stop-words,
- word-stemming.

2.2 Indexing

The vector space model is the most commonly used representation, in which each document is represented as a vector of its words. The vector usually holds the weight $a_{ik}$ of each word $i$ in document $k$. The more frequently a word appears in the document, the more relevant it is to the topic of the document.

Let $f_{ik}$ be the frequency of word $i$ in document $k$, $N$ the number of documents in the collection and $n_i$ the total number of times word $i$ occurs in the whole collection. Some of the different weighting techniques are described below.

Boolean weighting
The weight is 1 if the word occurs in the document and 0 otherwise:

$$a_{ik} = \begin{cases} 1 & \text{if } f_{ik} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Word frequency weighting
The weight is the frequency of the word in the document:

$$a_{ik} = f_{ik}$$

tf × idf weighting
The weight of a word $i$ in document $k$ is directly proportional to the frequency of the word in the document and inversely proportional to how often the word occurs in the rest of the collection:

$$a_{ik} = f_{ik} \cdot \log\frac{N}{n_i}$$


Entropy weighting
The weight for word $i$ in document $k$ is given by:

$$a_{ik} = \log(f_{ik} + 1)\left(1 + \frac{1}{\log N}\sum_{j=1}^{N}\frac{f_{ij}}{n_i}\log\frac{f_{ij}}{n_i}\right)$$

where $\frac{1}{\log N}\sum_{j=1}^{N}\frac{f_{ij}}{n_i}\log\frac{f_{ij}}{n_i}$ is the average uncertainty or entropy of word $i$.
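To make the weighting schemes concrete, here is a minimal Python sketch computing tf-idf weights exactly as defined above (it is illustrative only, not the project's Java/Lucene indexing code); boolean and plain term-frequency weights fall out of the same counts. Note that with this report's definition of $n_i$ (total occurrences rather than document frequency), the log factor can be zero for very common words.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one dict of word -> tf-idf weight per document.

    f_ik is the in-document frequency, N the number of documents, and n_i the
    total number of times word i occurs in the whole collection (Section 2.2)."""
    N = len(docs)
    n = Counter()                      # n_i: total occurrences of word i in the collection
    for doc in docs:
        n.update(doc)

    weighted = []
    for doc in docs:
        f = Counter(doc)               # f_ik: frequency of word i in document k
        weighted.append({w: f_ik * math.log(N / n[w]) for w, f_ik in f.items()})
    return weighted

docs = [["oil", "price", "oil"], ["wheat", "price"], ["oil", "wheat", "trade"]]
print(tfidf_weights(docs))
```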

2.3 Dimensionality Reduction

2.3.1 Feature Selection

Here we remove non-informative words from the documents to improve categorization effectiveness and reduce computational complexity. Two feature selection methods described in [9] are document frequency thresholding and information gain.

Document Frequency Thresholding
The document frequency of a word is the number of documents in which the word occurs. We choose a predetermined threshold value and remove words whose document frequency is less than the threshold.

Information Gain
Information gain is the number of bits of information obtained for category prediction by knowing the presence or absence of a word in a document. Let $c_1, \ldots, c_m$ denote the set of possible categories. The information gain of a word $w$ is defined to be:

$$IG(w) = -\sum_{j=1}^{m} P(c_j)\log P(c_j) + P(w)\sum_{j=1}^{m} P(c_j \mid w)\log P(c_j \mid w) + P(\bar{w})\sum_{j=1}^{m} P(c_j \mid \bar{w})\log P(c_j \mid \bar{w})$$

$P(c_j)$ is estimated from the fraction of documents in the total collection that belong to class $c_j$, and $P(w)$ from the fraction of documents in which the word $w$ occurs. $P(c_j \mid w)$ is computed as the fraction of documents from class $c_j$ that have at least one occurrence of word $w$, and $P(c_j \mid \bar{w})$ as the fraction of documents from class $c_j$ that do not contain word $w$. Words with information gain less than a predetermined threshold are removed.
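As an illustration only, the following Python sketch estimates the information gain of a single word from labelled documents, following the estimates described above; the function name and the toy data are made up for the example.

```python
import math
from collections import Counter

def information_gain(word, docs, labels):
    """Information gain of `word` per the formula above.
    docs: list of sets of words; labels: class label of each document."""
    N = len(docs)
    classes = Counter(labels)                         # documents per class c_j
    with_word = [i for i, d in enumerate(docs) if word in d]
    without_word = [i for i, d in enumerate(docs) if word not in d]
    p_w = len(with_word) / N                          # P(w)

    def avg_log_prob(doc_ids):
        # sum_j P(c_j | subset) log P(c_j | subset)
        if not doc_ids:
            return 0.0
        cls = Counter(labels[i] for i in doc_ids)
        total = len(doc_ids)
        return sum((c / total) * math.log(c / total) for c in cls.values())

    ig = -sum((classes[c] / N) * math.log(classes[c] / N) for c in classes)
    ig += p_w * avg_log_prob(with_word)
    ig += (1.0 - p_w) * avg_log_prob(without_word)
    return ig

docs = [{"oil", "opec"}, {"oil", "barrel"}, {"wheat", "crop"}]
labels = ["crude", "crude", "grain"]
print(information_gain("oil", docs, labels))
```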

2.3.2 Re-parameterisation: Latent Semantic Indexing

Latent Semantic Indexing (LSI) assumes that there is some underlying or latent structure in the pattern of word usage across documents, and that statistical techniques can be used to estimate that structure. LSI uses SVD (Singular Value Decomposition), a technique closely related to eigenvector decomposition and factor analysis. If we have an $M \times N$ word-by-document matrix $A$, where $M$ is the number of distinct words and $N$ the number of documents, the SVD of $A$ is given by $A = U \Sigma V^T$. $U$ and $V$ have orthonormal columns and $\Sigma$ is the diagonal matrix of singular values. The singular values in $\Sigma$ are ordered by size; the $k$ largest may be kept and the remaining smaller ones set to zero. The product of the resulting matrices is a matrix $A_k$ which is an approximation to $A$ with rank $k$:

$$A_k = U_k \Sigma_k V_k^T$$

To compare a word $i$ with a document $j$, one takes the cosine between the $i$th row of the matrix $U_k \Sigma_k^{1/2}$ and the $j$th row of the matrix $V_k \Sigma_k^{1/2}$.

To compare a new document with the documents in the training set, one starts with its document vector $d$ and derives the reduced representation $\hat{d} = d^T U_k \Sigma_k^{-1}$. Taking the cosine between $\hat{d}$ and the rows of $V_k$ gives the degree of similarity between the new document and the documents in the training set.
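A minimal numpy sketch of the truncated SVD and the folding-in step reconstructed above; the toy matrix and function names are illustrative, and this is not the implementation used in the project.

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI decomposition of a word-by-document matrix A (M x N)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # U_k (M x k), Sigma_k (k x k), V_k (N x k)

def fold_in(d, U_k, S_k):
    """Project a new document vector d (length M): d_hat = d^T U_k Sigma_k^{-1}."""
    return d @ U_k @ np.linalg.inv(S_k)

# toy word-by-document matrix: rows = words, columns = documents
A = np.array([[2., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])
U_k, S_k, V_k = lsi(A, k=2)
d_hat = fold_in(np.array([1., 0., 1.]), U_k, S_k)
# cosine similarity between the folded-in query and each training document (rows of V_k)
sims = (V_k @ d_hat) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(d_hat))
print(sims)
```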

3 Methods for Text Categorization

Let $d = (d_1, \ldots, d_M)$ be the document vector to be classified and $c_1, \ldots, c_m$ the possible topics. Let $N$ be the number of documents in the training set, with classes $y_1, \ldots, y_N$, and let $N_j$ be the number of training documents for which the true class is $c_j$.

3.1 Naive Bayes

The naive Bayes classifier ignores possible dependencies, namely correlations, among the inputs and reduces a multivariate problem to a group of univariate problems:

$$P(c_j \mid d) \propto P(c_j) \prod_{i=1}^{M} P(d_i \mid c_j)$$

Despite the fact that the assumption of conditional independence is generally not true for word appearance in documents, the naive Bayes classifier is surprisingly effective.
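A minimal multinomial naive Bayes sketch in Python, scoring classes in log space as in the factorization above. The add-one smoothing and the toy data are implementation details added for the example, not something specified in this report.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors and per-class word counts from token lists."""
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    return prior, word_counts, vocab

def classify_nb(doc, prior, word_counts, vocab):
    """argmax_c log P(c) + sum_i log P(d_i | c), with add-one smoothing."""
    N = sum(prior.values())
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(word_counts[c].values())
        score = math.log(prior[c] / N)
        for w in doc:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["oil", "price", "opec"], ["wheat", "crop"], ["oil", "barrel"]]
labels = ["crude", "grain", "crude"]
prior, wc, vocab = train_nb(docs, labels)
print(classify_nb(["oil", "opec"], prior, wc, vocab))   # -> "crude"
```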

3.2 K-nearest neighbours

To classify an unknown document vector $d$, the k-nearest neighbour (kNN) algorithm matches the document vector of the query document against all the documents in the training set and uses the k nearest neighbours to predict the class of the query document. The similarity may be computed using the Euclidean distance or the cosine similarity between documents. kNN is a lazy, instance-based learning algorithm. The main computation is the online scoring of the training documents in order to find the k nearest neighbours of a query. The kNN method has linear learning time, but its classification time increases linearly with the number of training samples.


3.3 Decision trees

In this approach, the query document is matched against a decision tree to determine whether the document is relevant to the user or not. The tree is constructed from the training samples. One of the most popular approaches for this task is the CART algorithm. CART builds a binary decision tree by splitting the set of training vectors at each node according to a function of one of the vector elements. To find the element that is the best splitter, the entropy function is used to calculate the impurity of the split:

$$\mathrm{Entropy}(t) = -\sum_{j=1}^{m} P(c_j \mid t)\log P(c_j \mid t)$$

where $P(c_j \mid t)$ is the probability of a training sample being in class $c_j$ given that it falls into node $t$. It is estimated as $P(c_j \mid t) = N_j(t)/N(t)$, where $N_j(t)$ and $N(t)$ are the number of samples of class $c_j$ and the total number of samples at node $t$, respectively. To choose the best splitter we pick the vector element that minimizes the impurity after the split. The process is repeated until we find no split that significantly decreases the diversity at a node. Each leaf is assigned a class. The final step is the pruning of the tree, where we remove the branches that provide the least information. These branches can be identified using the adjusted error rate of a tree $T$:

$$AE(T) = E(T) + \alpha \cdot \mathrm{leaves}(T)$$

where $E(T)$ is the error rate of the tree, $\mathrm{leaves}(T)$ is the number of leaves in the tree and $\alpha$ is a parameter. We choose another training set to evaluate the candidate pruned trees, and the tree that gives the lowest overall error is the best pruned tree.
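The sketch below illustrates only the splitting criterion described above: the entropy impurity of a node and an exhaustive search for the single-feature binary split that minimizes the weighted impurity. The recursive tree construction and pruning steps of CART are omitted, and the toy data are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_j P(c_j|t) log P(c_j|t) for the samples at a node."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())

def best_split(X, y):
    """Pick the (feature, threshold) whose binary split minimises the weighted impurity.
    X: list of feature vectors, y: class labels."""
    best, best_impurity = None, float("inf")
    for f in range(len(X[0])):
        for threshold in sorted({x[f] for x in X}):
            left = [y[i] for i, x in enumerate(X) if x[f] <= threshold]
            right = [y[i] for i, x in enumerate(X) if x[f] > threshold]
            if not left or not right:
                continue
            impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if impurity < best_impurity:
                best, best_impurity = (f, threshold), impurity
    return best, best_impurity

X = [[0.0, 2.0], [0.1, 1.5], [0.9, 0.2], [1.0, 0.1]]
y = ["grain", "grain", "crude", "crude"]
print(best_split(X, y))   # splits on feature 0 with zero impurity
```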

3.4 Rocchio's algorithm

Rocchio's algorithm is a classic method for document routing or filtering in information retrieval. The dot product of the document vectors or the Jaccard similarity measure is used to compute the distance between one document and all the others. The prototype vector for a class $c_j$ is given by the average vector over all the training documents that belong to class $c_j$. Learning is very fast for this method.
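A minimal sketch of the prototype-vector idea: each class prototype is the mean of its training vectors, and a document is assigned to the class whose prototype scores highest under the dot product. The toy vectors are illustrative only.

```python
import numpy as np

def rocchio_prototypes(X, y):
    """Prototype for each class = the mean of its training document vectors."""
    return {c: X[np.array(y) == c].mean(axis=0) for c in set(y)}

def classify(d, prototypes):
    """Assign d to the class whose prototype has the highest dot product with d."""
    return max(prototypes, key=lambda c: float(d @ prototypes[c]))

X = np.array([[1.0, 0.0, 2.0],
              [2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])
y = ["crude", "crude", "grain"]
protos = rocchio_prototypes(X, y)
print(classify(np.array([1.0, 0.2, 1.0]), protos))   # -> "crude"
```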

3.5 Support Vector Machines

Support Vector Machines (SVMs) yield good generalization performance on a wide variety of classification problems, including text categorization [4]. The SVM integrates dimension reduction and classification, and is applicable only to binary classification problems. The SVM classifies a vector $d$ to either $-1$ or $+1$ using

$$f(d) = \mathrm{sign}\!\left(w^T d + b\right) = \mathrm{sign}\!\left(\sum_{i=1}^{N}\alpha_i y_i K(d, d_i) + b\right)$$

Here $\{d_i\}_{i=1}^{N}$ is the set of training vectors as before and $\{y_i\}_{i=1}^{N}$, with $y_i \in \{-1, +1\}$, are the corresponding classes. $K(d, d_i)$ is called a kernel and is often chosen as a polynomial of degree $d$.

The training of the SVM consists of determining the $w$ that maximizes the distance between the training samples of the two classes. An interesting property of the SVM is that the decision surface is determined only by the data points which lie exactly at a distance $1/\|w\|$ from the decision plane. These points are called the support vectors; they are the only effective elements in the training set, and if all other points were removed the algorithm would learn the same decision function. This property makes the SVM theoretically unique and different from many other methods, such as naive Bayes or kNN, where all points are used to train the model.
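The following sketch only evaluates the kernelized decision function written above; the training step (solving for the multipliers and the bias) is omitted, and the support vectors, alphas and labels below are made-up values for illustration.

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (np.dot(x, z) + 1.0) ** degree

def svm_decision(d, support_vectors, alphas, ys, b, kernel=poly_kernel):
    """sign( sum_i alpha_i * y_i * K(d, d_i) + b ), summed over the support vectors only."""
    score = sum(a * y * kernel(d, sv) for a, y, sv in zip(alphas, ys, support_vectors)) + b
    return 1 if score >= 0 else -1

# tiny illustration with made-up (already "learned") multipliers
support_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alphas, ys, b = [0.5, 0.5], [1, -1], 0.0
print(svm_decision(np.array([0.9, 0.1]), support_vectors, alphas, ys, b))   # -> 1
```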

3.6 Neural networks

A neural network is a network of units, where the input units represent terms, the output units represent the categories of interest, and the weights on the edges connecting units represent dependence relations. A typical way of training a neural network is backpropagation, where the term weights of a training document are loaded into the input units and, if a misclassification occurs, the error is 'backpropagated' so as to change the parameters of the network and eliminate or minimize the error.

3.7 Linear Least Squares Fit

Linear Least Squares Fit (LLSF) is a multivariate regression model that is automatically learned from a set of training documents and their categories. The training data are represented in the form of input/output vector pairs, where the input vector is a document in the vector space model and the output vector consists of the categories of the corresponding document. By solving a linear least-squares fit on the training pairs of vectors, one can obtain a matrix of word-category regression coefficients. Sorting the resulting category weights yields a ranked list of categories for the input document, and thresholding on these weights yields the category assignments.
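A minimal sketch of the regression view above using numpy's least-squares solver; the document matrix D and category-indicator matrix C are toy examples, not data from this project.

```python
import numpy as np

# Rows of D are document vectors (vector space model); rows of C are the
# corresponding 0/1 category indicator vectors.
D = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
C = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Solve min_W ||D W - C||_F: W is the matrix of word-category regression coefficients.
W, *_ = np.linalg.lstsq(D, C, rcond=None)

# Scoring a new document gives one weight per category; rank or threshold these weights.
d_new = np.array([1.0, 0.0, 1.0])
print(d_new @ W)
```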

4 Related Work

The papers described here are the works of Joachims [4], Dumais et al. [3], Shapire et al. [7], Yang [9] and Weiss et al. [8]. Table 1 shows the number of training and test documents used by each author, the number of topics, the indexing and feature reduction methods, the type of classification problem, and the evaluation criterion.


Author     Train   Test   Topics   Indexing    Reduction   Method       Measure
Joachims   9603    3299   90       tfc         IG          binary       break-even
Dumais     9603    3299   118      boolean     MI          binary       break-even
Shapire    9603    3299   *        tf x idf    None        multiclass   precision
Weiss      9603    3299   95       frequency   ?           binary       break-even
Yang       7789    3309   93       ltc         χ²          binary       break-even

Table 1: Summary of previous work.

Author     Rocchio   Bayes   kNN    Tree   SVM
Joachims   79.9      72.0    82.3   79.4   86.0
Dumais     61.7      75.2    -      78.9   x
Shapire    x         x       -      -      -
Weiss      78.7      73.4    86.3   79.0   86.3
Yang       75.0      71.0    85.0   -      -

Table 2: Summary of previous work (2).

An x in the table signifies that the method was tested but a performance measure other than the break-even point was used; a dash indicates that no break-even value was reported. Except for Yang, all the other authors have used the ModApte split of the Reuters collection. The number of topics each author has used is given in the Topics column; the work of Shapire does not specify what category set is used for training, and they have not used the standard set, which has 118 categories. The indexing method each author has used is listed in the next field, and the dimensionality reduction technique is also given: Joachims has used information gain, Dumais mutual information and Yang the χ² statistic. Most of the authors have treated text categorization as a binary classification problem, except Shapire, who uses a multi-class formulation. All authors have used the precision/recall break-even point; Shapire has used three other measures, which are described in [7]. Only the methods described in Section 3 are discussed here.

Table 2 summarizes the precision/recall break-even points. Shapire has tested the Rocchio and Bayes algorithms, but the performance measures reported are not break-even points. For the kNN, decision tree, naive Bayes and SVM methods the results show only slight variations; the average break-even points are 84.5, 79.1, 72.9 and 86.4, respectively. The break-even point for the Rocchio algorithm reported by Dumais alone is low (61.7), while the average for the other authors is 77.9.

5 Methodology

5.1 Data set

The Reuters-21578 collection is publicly available at:
http://www.daviddlewis.com/resources/testcollections/reuters21578/


The ModApte split is used to divide the Reuters collection into a training set and a test set. According to the documentation, this should lead to 9,603 documents in the training set and 3,299 in the test set. However, for many of these documents the topic or body was missing, so they were removed from the data set.

5.2 Feature Extraction

Each of the documents was converted from the original SGML format to a word vector.

5.2.1 Preprocessing

In this phase, the SGML tags are removed first. I use Matlab to remove all the SGML tags, extract each document from the Reuters file and write it into an individual file, so that it can be easily indexed. The individual words were then extracted and stop words were removed using a list of frequent English words. Word stemming was performed using the Porter stemmer. I use Lucene for stemming and to extract the word vectors from each document.
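The project itself used Matlab for tag removal and Lucene for stemming and indexing; purely as an illustration of the same steps, here is a minimal Python sketch with a tiny made-up stop-word list and a crude suffix stripper standing in for the Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "said"}   # small illustrative list

def preprocess(sgml_text):
    """Strip SGML/HTML tags, lower-case, drop stop words, apply a crude suffix stemmer."""
    text = re.sub(r"<[^>]+>", " ", sgml_text)          # remove SGML tags
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:                                    # very rough stand-in for Porter stemming
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("<BODY>Oil prices increased, the trader said.</BODY>"))
```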

5.2.2 Indexing

Tf-idf weighting is used to index the documents. The weight $a_{ik}$ for word $i$ in document $k$ is given by

$$a_{ik} = f_{ik} \cdot \log\frac{N}{n_i}$$

where $f_{ik}$ is the frequency of word $i$ in document $k$, $N$ is the number of training documents and $n_i$ is the total number of times word $i$ occurs in the whole collection.

5.2.3 Dimensionality reduction

Feature selection is performed here using document frequency thresholding. Words occurring in just one document are removed, based on the assumption that rare words do not affect category prediction.
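A minimal sketch of this document-frequency filter: with min_df=2, words that occur in only one document are dropped. The function and data names are illustrative.

```python
from collections import Counter

def df_threshold(docs, min_df=2):
    """Keep only words whose document frequency is at least min_df.
    docs: list of token lists; returns the filtered documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # count each word once per document
    return [[w for w in doc if df[w] >= min_df] for doc in docs]

docs = [["oil", "price", "opec"], ["oil", "barrel"], ["wheat", "price"]]
print(df_threshold(docs))   # "opec", "barrel" and "wheat" occur in one document and are dropped
```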

5.3 Method used

The kNN method is a very simple approach, and from the survey of previous work we see that kNN shows good performance on text categorization tasks. Hence, this method was chosen for classification. Investigations on suitable choices for k show that the performance of kNN is relatively stable over a large range of k values. The value of k chosen for the experiments here is 25.

Topic          Recall   Precision
earnings       0.94     0.91
acquisitions   0.93     0.89
money-fx       0.87     0.60
crude          0.88     0.70
grain          0.81     0.78
trade          0.89     0.66
interest       0.80     0.71
ship           0.85     0.77
wheat          0.67     0.52
corn           0.31     0.63

Table 3: Summary of recall and precision for the 10 most frequent topics.

To classify the query document $q$ into one of the categories, we first compute the cosine similarity between the term-frequency vector of $q$ and each of the documents $d_i$ in the training set. The cosine similarity is given by the formula

œ ~ Pž E f  !

 ¢ "Ÿ ¢   ¡ ¢  ¢ Ÿ  ¡

Then the documents are ranked according to the decreasing order of cosine similarity values and the first k documents are returned as the nearest neighbours.
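A minimal sketch of this ranking step in Python, using sparse term-weight dictionaries and a simple majority vote among the k nearest neighbours (the score-thresholding strategy used for the actual category assignments in Section 6 is not shown, and k is kept small here; the experiments used k = 25).

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two sparse term-weight dicts."""
    common = set(q) & set(d)
    dot = sum(q[w] * d[w] for w in common)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def knn_classify(q, train, labels, k=3):
    """Rank training documents by cosine similarity to q and take a majority vote
    over the k nearest neighbours."""
    ranked = sorted(range(len(train)), key=lambda i: cosine(q, train[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [{"oil": 2.0, "opec": 1.0}, {"oil": 1.0, "barrel": 1.0}, {"wheat": 2.0, "crop": 1.0}]
labels = ["crude", "crude", "grain"]
print(knn_classify({"oil": 1.0, "price": 0.5}, train, labels, k=2))   # -> "crude"
```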

6 Experimental Results

To evaluate the performance of the model, I use precision and recall.

Table 3 shows the summary of recall and precision for the 10 most frequent categories. Both precision and recall decrease as the number of training samples decreases. Considering only the ten most frequent categories, the precision/recall break-even point is 79.8%, using a thresholding value of 0.3.
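For reference, per-topic precision and recall can be computed from predicted and true topic sets as in the following sketch; the example assignments are made up, not taken from the experiments.

```python
def precision_recall(predicted, actual, topic):
    """Precision and recall for one topic.
    predicted, actual: lists of sets of topics, one set per test document."""
    tp = sum(topic in p and topic in a for p, a in zip(predicted, actual))
    fp = sum(topic in p and topic not in a for p, a in zip(predicted, actual))
    fn = sum(topic not in p and topic in a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = [{"earn"}, {"acq"}, {"earn"}, set()]
actual = [{"earn"}, {"earn"}, {"earn"}, {"acq"}]
print(precision_recall(predicted, actual, "earn"))   # -> (1.0, 0.666...)
```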

7 Summary

With the dramatic rise in the use of the Internet, there has been an explosion in the volume of online documents and electronic mail. Text categorization, assigning free text to one or more predefined categories based on its content, is an important component in many real-world tasks such as sorting email into folders, topic identification and other information management tasks. This project has reviewed progress in the field, with emphasis on work where the Reuters-21578 collection has been used for evaluation.

A typical text categorization process consists of the following steps: preprocessing, indexing, dimensionality reduction, and classification. In this report different approaches for all these steps have been discussed. Moreover, a summary of the results from previous work on text categorization where the Reuters-21578 collection has been used for evaluation is given. The following methods were reviewed: naive Bayes, the k-nearest neighbour algorithm, decision trees, Rocchio's algorithm, support vector machines, neural networks and linear least squares fit. All the approaches seemed to perform reasonably well. In this project, results from my experiments using the Reuters-21578 collection are also described. In the experiments, tf-idf was used for word indexing, dimensionality reduction was performed by removing rare words, and the k-nearest neighbour algorithm was used for classification. kNN is a very simple approach, but it has been shown to have approximately the same performance as more complicated methods. The results gave a precision/recall break-even point of approximately 79.3%, which is comparable to the other studies reported for the Reuters-21578 collection.

8 Appendix

8.1 Matlab code for extracting individual files from the Reuters files

http://anitakumar.multiply.com/journal/item/2

8.2 Java code

The following program performs:
- removal of stop-words and stemming,
- tf-idf indexing,
- the kNN algorithm,
- precision/recall calculation.

http://anitakumar.multiply.com/journal/item/1

References

[1] Apte, C., Damerau, F., and Weiss, S. Text mining with decision trees and decision rules.

[2] Attardi, G., Gulli, A., and Sebastiani, F. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, European Symposium on Telematics, Hypermedia and Artificial Intelligence (Varese, IT, 1999), C. Hutchison and G. Lanzarone, Eds., pp. 105–119.

[3] Dumais, S., Platt, J., Heckerman, D., and Sahami, M. Inductive learning algorithms and representations for text categorization. In CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management (New York, NY, USA, 1998), ACM Press, pp. 148–155.

[4] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In ECML '98: Proceedings of the 10th European Conference on Machine Learning (London, UK, 1998), Springer-Verlag, pp. 137–142.

[5] Yang, Y., and Pedersen, J. O. Feature selection in statistical learning of text categorization, 1997.

[6] Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys 34, 1 (2002), 1–47.

[7] Shapire, R., and Singer, Y. BoosTexter: A system for multi-label text categorization, 1998.

[8] Weiss, S. M., Apte, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T., and Hampp, T. Maximizing text-mining performance. IEEE Intelligent Systems 14, 4 (1999), 63–69.

[9] Yang, Y. An evaluation of statistical approaches to text categorization. Information Retrieval 1, 1-2 (1999), 69–90.

[10] Yang, Y., and Liu, X. A re-examination of text categorization methods. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New York, NY, USA, 1999), ACM Press, pp. 42–49.

