Experiments with a multi-label text classifier on the Reuters collection

Domonkos Tikk
Department of Telecommunications and Media Informatics
Budapest University of Technology and Economics
H-1117 Budapest, Magyar Tudósok körútja 2., Hungary
Email: [email protected]

György Biró
Department of Informatics
Eötvös Loránd University
H-1117 Budapest, Pázmány sétány 1/c, Hungary
Email: [email protected]
Abstract— Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic, where documents are assigned to leaf-level categories of a category tree (called a taxonomy). The algorithm applies an iterative learning module that allows the classifier to be created gradually by a weight adjustment method. We experimented on the well-known Reuters-21578 document corpus with different taxonomies. Results show that our approach outperforms existing ones by up to 10%.
I. INTRODUCTION

Traditionally, document categorization has been performed manually. However, as the number of documents explosively increases, the task is no longer amenable to manual categorization, which requires a vast amount of time and cost. This has led to numerous research efforts on automatic document classification. A text classifier assigns a document to the appropriate category or categories, also called topics, in a predefined set of categories. Originally, research in text categorization addressed the binary problem, where a document is either relevant or not w.r.t. a given category. In real-world situations, however, the great variety of sources and hence categories usually poses a multi-class classification problem, where a document belongs to exactly one category selected from a predefined set [1], [2], [3], [4], [5], [6]. Even more general is the multi-label problem, where a document can be classified into more than one category. While binary and multi-class problems have been investigated extensively [7], multi-label problems have received very little attention [8]. As the number of topics becomes larger, multi-class categorizers face the problem of complexity that may incur a rapid increase of time and storage, and compromise the perspicuity of the categorized subject domain. A common way to manage complexity is to use a hierarchy¹, and text is no exception [9]. Internet directories and large on-line databases are often organized as hierarchies; see, e.g., Yahoo and IBM's patent database².

¹ In general, a hierarchy is considered to be an acyclic digraph; in this paper we restrict our investigation to tree-structured hierarchies.
² http://www.yahoo.com, http://www.ibm.com/patents
This paper describes a hierarchical text categorization approach. The main part of the approach is an iterative learning module that gradually trains the classifier to recognize constitutive characteristics of categories and hence to discriminate typical documents belonging to different categories. The iterative learning helps to avoid overfitting the training data and to refine the characteristics of categories. Characteristics of categories are captured by typical terms occurring frequently in documents assigned to them. We represent categories by weight vectors, called category descriptors (or simply descriptors), assigned to terms occurring in the document collection. The higher the weight of a term in a category descriptor, the more significant it is to the given category. Category descriptors are tuned during the iterative learning based on their sample training documents.

We report on our experience with our approach on the Reuters-21578 newswire benchmark. This document collection has been the most widely used benchmark corpus among researchers in the field of text categorization. A comparative study of different approaches using a flat, i.e., non-hierarchical, category system on this corpus can be found in [7, page 38]. One of the first works on hierarchical text classifiers, by Koller and Sahami [10], also reported on experiments with two taxonomies on a subset of the Reuters collection. Chakrabarti et al. [9] also applied their hierarchical text classifier to this corpus, but without a taxonomy. Wibowo and Williams [11] applied the hierarchical category system of Hayes [12] to the corpus, but their results were not superior to flat results. On the other hand, D'Alessio et al. [13] defined 3 different hierarchies that comprise all categories of the original flat setting, and they achieved better accuracy than had been reached so far by flat categorizers. We applied all explicitly given hierarchies on the Reuters corpus to compare our algorithm to the others.

The paper is organized as follows. Section II presents the approach in detail. Section III reports on our experience. The conclusion is drawn in Section V.

II. THE PROPOSED METHOD

The core idea of the categorization method is the training algorithm that sets and maintains category descriptors in a way
that allows the classifier to correctly categorize most training, and consequently, test documents. We start the categorization with empty descriptors. We now briefly describe the training procedure. First, when classifying a training document, we compare it with the category descriptors and assign the document to the category of the most similar descriptor. When this procedure fails to find the correct category, we raise the weight of those terms in the correct category's descriptor that also appear in the given document. If a document is assigned to a category incorrectly, we lower the weight of those terms in its descriptor that appear in the document. We tune the category descriptors by finding the optimal weight for each term in each category descriptor with this awarding–penalizing method. The training algorithm is executed iteratively and ends when the performance of the classifier cannot be improved significantly any further. See the block diagram in Figure 1 for an overview and Subsection II-B for details of the training algorithm. For test documents the classifier works in one pass, omitting the feedback cycle.
Fig. 1. The flowchart of the training algorithm. Training documents are preprocessed (removal of function words, stemming, dimensionality reduction, term indexing) and fed to the classifier; correctly classified documents pass to the performance check (Q-measure), while for misclassified documents the weights of co-occurring terms in the topic descriptors are lowered, and when the correct category is not found they are raised.
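The following Python skeleton illustrates one possible way to organize the loop of Figure 1; it is a minimal sketch rather than the authors' implementation, and classify, adjust_weights, quality, and doc.topics are placeholder names for the components detailed below (cf. Eqs. (5)–(8)).

def train(classifier, train_docs, max_iter=20, var_max=0.95):
    # Iterate over the training set; feed misclassifications back into
    # the category descriptors until the Q-measure stops improving.
    q_best = 0.0
    for _ in range(max_iter):
        for doc in train_docs:
            predicted = classifier.classify(doc)       # greedy descent, Eq. (5)
            if set(predicted) != set(doc.topics):      # feedback cycle of Fig. 1
                classifier.adjust_weights(doc, predicted)   # Eqs. (6)-(7)
        q = classifier.quality(train_docs)             # Q-measure, Eq. (8)
        if q < var_max * q_best:                       # stopping rule, Sec. II-B
            break
        q_best = max(q_best, q)
    return classifier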
The rest of this section is organized as follows. Subsection II-A describes the topic hierarchy, the vector space model, and the descriptors. Subsection II-B presents the classification and training methods.

A. Definitions

Let C be the fixed finite set of categories organized in a topic hierarchy. In this paper, we deal with tree-structured topic hierarchies.
Let D be a set of text documents and d ∈ D an arbitrary element of D. In general, documents are pre-classified under the categories of C; in our case, into leaf categories. We differentiate between training documents, d ∈ D_Train, and test documents, d ∈ D_Test, where D_Train ∩ D_Test = ∅ and D_Train ∪ D_Test = D. Training documents are used to inductively construct the classifier. Test documents are used to test the performance of the classifier; they do not participate in the construction of the classifier in any way. Each document d_j ∈ D is classified into a leaf category of the hierarchy. No document belongs directly to a non-leaf category; rather, we assume that a parent category owns the documents of its child categories, i.e., each document belongs to a topic path containing the nodes (representing categories) from the root to a leaf. Formally,

topic(d_j) = {c_1, …, c_q} ⊆ C    (1)

determines the set of topics document d_j belongs to along its topic path (c_1 is the highest). Note that the root is not included in the topic set, as it owns all documents. As a consequence of our previous condition, c_q is a leaf category; the index refers to the depth of the category.
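As a concrete illustration of (1), the Python fragment below recovers a topic path from a leaf category in a tree-structured taxonomy; the parent map and category names are our own toy example (borrowed from Table I), not data prescribed by the method.

# Child-to-parent map of a toy taxonomy; "root" is our own convention.
PARENT = {
    "corn": "grain business",
    "wheat": "grain business",
    "grain business": "root",
}

def topic_path(leaf_category):
    # Walk from the leaf up to (but excluding) the root, then reverse,
    # so c_1 is the highest category and c_q the leaf, as in Eq. (1).
    path = []
    node = leaf_category
    while node != "root":
        path.append(node)
        node = PARENT[node]
    return tuple(reversed(path))

# topic_path("corn") -> ("grain business", "corn")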
Texts cannot be directly interpreted by a classifier. Therefore, an indexing procedure that maps a text d into a compact representation of its content needs to be uniformly applied to all documents (training and test). We choose to use only words as the meaningful units representing text, because the use of n-grams (word sequences of length n) dramatically increases the storage requirement of the model, and, as reported in [14], [15], the use of representations more sophisticated than simple words does not increase effectiveness significantly. Like most research works, we use the vector space model, where a document d_j is represented by a vector of term weights

d_j = (w_1j, …, w_|T|j),    (2)

where T is the set of terms that occur at least once in the training documents D_Train, and 0 ≤ w_kj ≤ 1 represents the relevance of the kth term to the characterization of document d_j. Before indexing the documents, function words (i.e., articles, prepositions, conjunctions, etc.) are removed, and stemming (grouping words that share the same morphological root) is performed on T. We utilize the well-known tf×idf weighting [16], which defines w_kj in proportion to o_kj, the number of occurrences of the kth term in the document, and in inverse proportion to n_k, the number of documents in the collection in which the term occurs at least once:

w_kj = o_kj · log(N / n_k).    (3)

Term vectors (2) are normalized before training.
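A minimal sketch of this indexing step, assuming the documents are already stopword-filtered and stemmed token lists: it computes the tf×idf weights of Eq. (3) and normalizes the vectors of Eq. (2). The function name and data layout are our own.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists, already stopword-filtered and stemmed.
    # Returns one L2-normalized {term: weight} dict per document.
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # n_k of Eq. (3)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                     # o_kj of Eq. (3)
        w = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})   # normalization
    return vectors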
We characterize categories analogously to documents. To each category c_i ∈ C is assigned a vector of descriptor term weights

descr(c_i) = ⟨v_1i, …, v_|T|i⟩,    (4)

where the weights 0 ≤ v_ki ≤ 1 are set during training. All weights are initially set to 0. The descriptor of a category can be interpreted as the prototype of a document belonging to it.

B. Classification and training

1) Classification: When classifying a document d ∈ D, its term vector (2) is compared to the topic descriptors (4), and based on the result the classifier selects (normally) a unique category. At a given stage of the classification only a subset of topics is considered as potential categories; this is based on the following greedy selection process. The classification method works downward in the topic hierarchy level by level. First, it determines the best among the top-level categories. Then its child categories are considered and the most likely one is selected. The considered categories are always siblings linked under the winner category of the previous level. Classification ends when a leaf category is found.

Let us assume that at an arbitrary stage of the classification of document d_j we have to select from k categories c_1, …, c_k ∈ C. Then we calculate the conformity of the term vector of d_j with each topic descriptor descr(c_1), …, descr(c_k), and select the category that gives the highest conformity measure. We applied the unnormalized cosine measure that calculates this value as a function f of the sum of products of document and descriptor term weights:

conf(d_j, descr(c_i)) = f( Σ_{k=1}^{|T|} w_kj · v_ki ),    (5)
where f : ℝ → [0, 1] is an arbitrary smoothing function with lim_{x→0} f(x) = 0 and lim_{x→∞} f(x) = 1. The smoothing function is applied (analogously to control theory) to alleviate the oscillating behavior of training.

McCallum [17] criticized the greedy topic selection method because it requires high accuracy at internal (non-leaf) nodes. In order to partly alleviate the risk of a high-level misclassification, we control the selection of the best category by a minimum conformity parameter conf_min ∈ [0, 1], i.e., the greedy selection algorithm continues only while conf(d_j, descr(c_best)) ≥ conf_min is satisfied, where c_best is the best category at the given level.
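The greedy, threshold-controlled descent can be sketched as follows. Here tanh is used only as one admissible smoothing function f (it satisfies f(0) = 0 and f(x) → 1 as x → ∞); the children map and the "root" convention are our own assumptions, not part of the paper.

import math

def conformity(doc_vec, descriptor, f=math.tanh):
    # Eq. (5): smoothed sum of products of document and descriptor weights.
    return f(sum(w * descriptor.get(t, 0.0) for t, w in doc_vec.items()))

def classify(doc_vec, children, descriptors, conf_min=0.1):
    # Greedy top-down selection: at each level pick the child with the
    # highest conformity. `children` maps a category to the list of its
    # child categories ([] for leaves).
    path, node = [], "root"
    while children[node]:
        best = max(children[node],
                   key=lambda c: conformity(doc_vec, descriptors[c]))
        if conformity(doc_vec, descriptors[best]) < conf_min:
            break                    # "category not found" branch of Fig. 1
        path.append(best)
        node = best
    return path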
2) Training: In order to improve the effectiveness of classification, we apply supervised iterative learning, i.e., we check the correctness of the selected categories for training documents and, if necessary (when a document is classified incorrectly), we modify term weights in the category descriptors. The classifier can commit two kinds of error: it can misclassify a document d_j into c_i, and, usually simultaneously, it can fail to determine the correct category of d_j. Our weight modifier method is able to cope with both types of error. We scan all the decisions made by the classifier and proceed as follows.

For each considered category c_i at a given level we accumulate a vector δ(c_i) = ⟨δ(v_1i), …, δ(v_|T|i)⟩, where

δ(v_ki) = α · (conf_req − conf(d_j, descr(c_i))) · w_kj,    1 ≤ k ≤ |T|,    (6)

and conf_req = 1 when c_i ∈ topic(d_j), 0 otherwise. Here α ≥ 0, α ∈ ℝ, is the learning rate. The category descriptor weight v_ki is updated as v_ki + δ(v_ki), 1 ≤ k ≤ |T|, whenever category c_i takes part in an erroneous classification. If d_j is misclassified into c_i, then (conf_req − conf(d_j, descr(c_i))) is negative, hence the weights of terms co-occurring in d_j and descr(c_i) are reduced in descr(c_i). In the other case, if c_i is the correct but unselected category of d_j, then (conf_req − conf(d_j, descr(c_i))) is positive, thus the weights of co-occurring terms are increased in descr(c_i).

We also experimented with a more sophisticated weight setting method where the previous momentum of the weight modifier is also taken into account in the determination of the current weight modifier. Let δ^(n)(v_ki) be the weight modifier in the nth training cycle, and δ^(0)(v_ki) = 0 for all 1 ≤ k ≤ |T|. Then the weight modifier of the next training cycle is δ^(n+1)(c_i) = ⟨δ^(n+1)(v_1i), …, δ^(n+1)(v_|T|i)⟩, and its elements are calculated as

δ^(n+1)(v_ki) = α · (conf_req − conf(d_j, descr(c_i))) · w_kj + β · δ^(n)(v_ki),    (7)

where β ∈ [0, 1] is the momentum coefficient. The values of α and β can be uniform for all categories, or can depend on the level of the category. We found that a lower value, typically in [0.05, 0.2], is better if the number of training documents is plentiful, i.e., higher in the hierarchy, while a higher value is favorable when only a few training documents are available for the given category, i.e., at leaf categories.
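A sketch of one weight update, combining Eqs. (6) and (7): prev_delta holds δ^(n) and the return value is δ^(n+1). The clamping to [0, 1] follows the descriptor definition (4); the parameter defaults and function name are merely illustrative.

def update_descriptor(descriptor, prev_delta, doc_vec, conf_value, is_correct,
                      alpha=0.1, beta=0.5):
    # Apply Eqs. (6)-(7) to the descriptor of a category that took part
    # in an erroneous decision. conf_value is conf(d_j, descr(c_i)).
    conf_req = 1.0 if is_correct else 0.0
    delta = {}
    for term, w in doc_vec.items():            # only co-occurring terms change
        d = alpha * (conf_req - conf_value) * w + beta * prev_delta.get(term, 0.0)
        delta[term] = d
        # keep weights in [0, 1] as required by definition (4)
        descriptor[term] = min(1.0, max(0.0, descriptor.get(term, 0.0) + d))
    return delta                               # becomes prev_delta next cycle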
The number of nonzero weights in the category descriptors increases as the training algorithm operates. In order to avoid their proliferation, we propose to set descriptor term weights below a certain threshold to zero.

The training cycle is repeated until the given maximal number of iterations is reached or the performance of the classifier reaches a quasi-maximal value. We use the following optimization (or quality) function to measure inter-training effectiveness of the classifier for a document d:

Q(d) = (#(correctly found topics of d) / #(total topics of d)) · 1 / (1 + #(incorrectly found topics of d)).

The overall Q is calculated as the average of the Q(d) values:

Q = ( Σ_{d ∈ D_Train} Q(d) ) / |D_Train|.    (8)
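The quality function can be computed directly from the inferred and original topic paths; the set-based representation below is our own reading of (8).

def q_measure(found, true):
    # Q(d) for one document; `found` and `true` are the sets of categories
    # on the inferred and the original topic path, respectively.
    correct = len(found & true)
    incorrect = len(found - true)
    return (correct / len(true)) * (1.0 / (1.0 + incorrect))

def overall_q(decisions):
    # Average of Q(d) over the training documents, Eq. (8);
    # `decisions` is a list of (found, true) set pairs.
    return sum(q_measure(f, t) for f, t in decisions) / len(decisions)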
The quality measure Q is more sensitive to small changes in the effectiveness of the classifier than, e.g., the F-measure [18] that we use to qualify the final performance of the classifier (see Section III). Therefore, it is more suitable for inter-training utilization. By setting a maximum variance value var_max (typically in [0.9, 1.0]), we stop training when the actual Q drops below var_max · Q_best, where Q_best is the best Q achieved so far during training.

III. EXPERIMENTAL RESULTS

A. Taxonomies

The Reuters-21578³ newswire benchmark data collection has been used to test the effectiveness of the proposed classifier. The collection contains 135 independent categories, i.e., originally they are not organized in a hierarchy. In order to use the collection as a benchmark of hierarchical categorization nevertheless, and to prove its superiority to flat categorization, several authors organized the original categories into hierarchies. Mainly, we make use of 2 (partial) hierarchies that contain only a small subset of the total set of categories (Koller and Sahami [10]), and 3 other (complete) hierarchies proposed by D'Alessio et al. [13, page 10–11] (experiments E3, E4, E5), but we also applied our classifier to the original flat category system.

Tables I and II depict the two partial taxonomies. We could not obtain from Koller and Sahami the original training/testing document setting they had used for the experiments in [10], so we used all documents from the specified categories according to the adopted ModApté [14] split. As some documents are attached to more than one leaf category, the total number of documents is less than the sum of the training and testing documents.

³ The Reuters-21578 collection may be freely downloaded from http://www.daviddlewis.com/resources/testcollections/reuters21578/.

TABLE I
Description of the Hier1 data set on Reuters-21578 (after [10])

major topic       minor topic   training   testing
grain business    corn             182        56
grain business    wheat            212        70
money effects     dlr              131        44
money effects     interest         347       131
crude oil         nat-gas           75        30
crude oil         ship             196        89
Total                             1068       392

TABLE II
Description of the Hier2 data set on Reuters-21578 (after [10])

major topic          minor topic   training   testing
business             acq†             1657       721
business             earn             2877      1087
veg. oil business    oilseed           124        47
veg. oil business    palmoil            30        10
Total                                 4661      1857

† We substituted c-bonds by acq because category c-bonds was removed from Reuters-21578.

In the case of E3, E4, and E5, we replaced the "one-of-M" relationships [13] in all taxonomies by binary ones. Here we also adopted the ModApté [14] split that removes all unlabelled documents, resulting in a total of 7760 training and 3009 test documents. This means that we did not use the 1843 training and 290 test documents without topics. Some documents in the Reuters collection are classified into more than one topic; the average number of categories per document is 1.238. The distribution of documents among categories is quite uneven: there exist categories with several thousand training documents and others with none at all. The number of categories without training and test documents is 19.

B. Dimensionality reduction
When dealing with a large document collection, the large number of terms, |T|, can cause problems in document processing, indexing, and also in category induction. Therefore, before indexing and category induction, many authors apply a pass of dimensionality reduction (DR) to reduce the size of |T| to |T′| ≪ |T| [7]. Besides speeding up categorization, papers have also reported that DR can increase the performance of the classifier by a few percent if only a certain subset of terms is used to represent documents (see, e.g., [10], [11]). In our previous experiments [19] we also found that performance can be increased slightly (by less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant.

We applied DR by removing the least frequent terms in the overall collection. In general, we reduced |T| by disregarding terms that either occur less than min_occur times in the entire training document set, or occur more often than a certain threshold in the training set, i.e., if n_k/|D_Train| ≥ max_freq. By the former process we disregard words that are not significant for the classification, while by the latter we ignore words that are not discriminative enough between categories. The typical values are min_occur ∈ [1, 10] and max_freq ∈ [0.05, 1.0].
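The described frequency-based filtering might look as follows in Python, with the typical parameter values quoted above as defaults; the function name and data layout are ours.

from collections import Counter

def reduce_dimensionality(docs, min_occur=10, max_freq=0.2):
    # Drop terms that occur fewer than min_occur times in the whole training
    # set, and terms that occur in at least a max_freq fraction of the
    # training documents (n_k / |D_Train| >= max_freq).
    total = Counter(t for doc in docs for t in doc)       # total occurrences
    df = Counter(t for doc in docs for t in set(doc))     # document freq. n_k
    n_docs = len(docs)
    keep = {t for t in total
            if total[t] >= min_occur and df[t] / n_docs < max_freq}
    return [[t for t in doc if t in keep] for doc in docs]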
C. Performance evaluation

In our view, the quality of a hierarchical classification is better if a document is classified incorrectly only from an intermediate level than if it is assigned a completely incorrect topic path. In such cases, when the correct leaf category of a document is not found but the original and the inferred topic paths partially coincide, the classification is not completely wrong. This view is reflected in our quality measure Q (see (8)), and we also adopt it when calculating the final effectiveness of the classifier (both for training and test documents) by means of the F1 measure.

Many authors used accuracy or the break-even point to measure the effectiveness of categorization. The former has been criticized due to its insensitivity [7, page 34], [6], while the latter has recently been criticized also by its proposer, Lewis [20], [7, footnote 19], "since there may be no parameter setting that yields the break-even", and "to have equal recall and precision value is not necessarily desirable".

The Fβ-measure, originally defined by van Rijsbergen [18], is a combination of the micro-averaged recall (ρ) and precision (π) values:

F_β = ((β² + 1) · π · ρ) / (β² · π + ρ),    0 ≤ β < +∞.
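For completeness, a small sketch of this micro-averaged Fβ computation over a set of classification decisions; the (found, true) pair representation is our own convention, shared with the Q-measure sketch above.

def micro_f(decisions, beta=1.0):
    # F_beta from micro-averaged precision (pi) and recall (rho);
    # `decisions` is a list of (found, true) category-set pairs, root excluded.
    tp = sum(len(f & t) for f, t in decisions)   # true positives
    fp = sum(len(f - t) for f, t in decisions)   # false positives
    fn = sum(len(t - f) for f, t in decisions)   # false negatives
    pi = tp / (tp + fp) if tp + fp else 0.0
    rho = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    return (b2 + 1) * pi * rho / (b2 * pi + rho) if b2 * pi + rho else 0.0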
Like many authors, we use the F1-measure to test the effectiveness of the classifier on all categories, i.e., not only on leaf categories (but obviously excluding the root, as it is, by definition, not included in topic paths; see (1)). We also used another measure, called Top, that checks whether the best category given by the classifier is one of the real categories. This measure shows whether the best guess of the classifier is correct, which is an important aspect in real-world applications.
TABLE IV
Results with the E3–E5 and flat taxonomies on the Reuters-21578 collection

            Our method
taxonomy    F1      π       ρ       Top     # iter.   var_max   conf_min
E3          91.46   90.68   92.25   94.8    12        0.93      0.12
E4          93.62   93.33   93.92   96.2    12        0.96      0.12
E5          92.89   92.43   93.36   95.7    14        0.95      0.14
flat        89.43   88.58   90.30   93.4    14        0.95      0.14
D. Results

Table III gives the results achieved with the Hier1 and Hier2 taxonomies on the Reuters-21578 corpus. We indicate the corresponding results from [10] as a reference, despite the fact that our experiments were performed with a different setting, because (1) we could not obtain the original document setting from the authors, (2) we also used documents that are pre-classified into more than one category, and (3) they used accuracy to measure the effectiveness of the method, which has been criticized by many authors due to its insensitivity [7, page 34], [6]. The results are very good, especially on Hier2, due to the small number of categories and the large number of documents. The conf_min parameter is set to 0.1, and var_max is set to 0.98.

We applied DR and checked its effect on the results. We found that it has no significant effect on the effectiveness of the classifier (within 1% difference from the results obtained with the full dictionary), but it considerably reduces the time required for training, since it decreases the average size of the document vectors. The full dictionary in the case of Hier1 (Hier2) contains 6139 (12713) words, and the average document size is 52.8 (31.3) terms, respectively. With typical DR parameters min_occur = 10 and max_freq = 0.2, these values diminish to 42.6 and 24.2, respectively, while the duration of training falls to 63% (37.5%) of the original. As a consequence, we propose to use DR methods, as DR has no negative effect on the effectiveness but speeds up the training.

TABLE III
Results on the Hier1 and Hier2 subsets of the Reuters-21578 collection
            best result            our method (FRT)
taxonomy    from [10] (accuracy)   F1      π       ρ       Top
Hier1       0.940                  0.944   0.944   0.944   0.972
Hier2       0.909                  0.990   0.989   0.991   0.995
TABLE V
Reference results of D'Alessio et al. [13] with the E3–E5 and flat taxonomies on the Reuters-21578 collection

taxonomy    F1     π      ρ
E3          79.8   81.4   78.3
E4          82.8   85.9   79.9
E5          82.5   86.4   79.0
flat        78.9   80.3   77.6
Table IV presents the results achieved with the E3–E5 taxonomies [13] and with the regular flat category system (the reference results of D'Alessio et al. are given in Table V). A quasi-optimal setting of the parameters conf_min and var_max is also indicated; one may observe that there is no significant difference in their values across the various taxonomies. We remark that our results show the same relationship among the taxonomies as reported by D'Alessio et al.: E4 yields the best result (see Table V). Wibowo and Williams [11] experimented on this collection with another hierarchy [12]; perhaps this is the reason why their best result, 73.74%, is even lower than what has been achieved by flat categorizers. Our method achieved remarkable results on the flat category system as well. The obtained result, 89.43%, is somewhat better than the best known result of flat categorizers (87.8%, Weiss et al. [4], committee of decision trees).

IV. ACKNOWLEDGEMENT

This work was funded by the Hungarian Scientific Research Fund (OTKA) Grant No. D034614.

V. CONCLUSION

We proposed a new method for text categorization, which uses an iterative supervised learning method to train the classifier. We demonstrated the effectiveness of the algorithm on the well-known Reuters-21578 corpus with six topic hierarchies of different sizes. The main advantage of our algorithm is that it builds up the classifier gradually by supervised iterative learning, so intermediate results can be fed back into the training process. We intend to extend the experiments with our algorithm to other, larger document corpora in the near future.

REFERENCES
[1] K. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in Proc. of the 21st Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), Melbourne, Australia, 1998, pp. 96–103.
[2] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," University of Dortmund, Dept. of Informatics, Dortmund, Germany, Technical Report, 1997.
[3] D. D. Lewis and M. Ringuette, "A comparison of two learning algorithms for text classification," in Proc. of the Third Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 81–93.
[4] S. M. Weiss, C. Apte, F. J. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp, "Maximizing text-mining performance," IEEE Intelligent Systems, vol. 14, no. 4, pp. 2–8, July/August 1999.
[5] E. Wiener, J. O. Pedersen, and A. S. Weigend, "A neural network approach to topic spotting," in Proc. of the 4th Annual Symposium on Document Analysis and Information Retrieval, 1993, pp. 22–34.
[6] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, no. 1–2, pp. 69–90, 1999.
[7] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March 2002.
[8] L. Aas and L. Eikvil, "Text categorisation: A survey," Norwegian Computing Center, Report NR 941, 1999.
[9] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan, "Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies," The VLDB Journal, vol. 7, no. 3, pp. 163–178, 1998.
[10] D. Koller and M. Sahami, "Hierarchically classifying documents using very few words," in International Conference on Machine Learning, vol. 14. San Mateo, CA: Morgan Kaufmann, 1997.
[11] W. Wibowo and H. E. Williams, "Simple and accurate feature selection for hierarchical categorisation," in Proc. of the 2002 ACM Symposium on Document Engineering, McLean, Virginia, USA, 2002, pp. 111–118.
[12] P. J. Hayes and S. P. Weinstein, "CONSTRUE/TIS: a system for content-based indexing of a database of news stories," in Proc. of the 2nd Conference on Innovative Applications of Artificial Intelligence (IAAI-90), A. Rappaport and R. Smith, Eds. Menlo Park: AAAI Press, 1990, pp. 49–66.
[13] S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum, "The effect of using hierarchical classifiers in text categorization," in Proc. of the 6th International Conference Recherche d'Information Assistée par Ordinateur (RIAO-00), Paris, France, 2000, pp. 302–313.
[14] C. Apte, F. J. Damerau, and S. M. Weiss, "Automated learning of decision rules for text categorization," ACM Trans. on Information Systems, vol. 12, no. 3, pp. 233–251, July 1994.
[15] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in Proc. of the 7th ACM Int. Conf. on Information and Knowledge Management (CIKM-98), Bethesda, MD, 1998, pp. 148–155.
[16] G. Salton and M. J. McGill, An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[17] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng, "Improving text classification by shrinkage in a hierarchy of classes," in Proc. of ICML-98, 1998.
[18] C. J. van Rijsbergen, Information Retrieval, 2nd ed. London: Butterworths, 1979.
[19] D. Tikk, J. D. Yang, and S. L. Bang, "Hierarchical text categorization using fuzzy relational thesaurus," submitted to Kybernetika.
[20] D. D. Lewis, "An evaluation of phrasal and clustered representations on a text categorization task," in Proc. of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, 1992, pp. 37–50.