Experiment with a hierarchical text categorization method on the WIPO-alpha patent collection∗

Domonkos Tikk
Department of Telecom. and Media Informatics
Budapest University of Technology and Economics
1117 Budapest, Magyar Tudósok krt. 2., Hungary
Email: [email protected]

György Biró
Department of Informatics
Eötvös Loránd Science University
1117 Budapest, Pázmány st. 1/c, Hungary
Email: [email protected]

∗ This work was funded by the Hungarian Scientific Research Fund (OTKA) Grants No. D34614 and T34212.

Abstract

Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. This paper focuses on the special case when the categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be built gradually by a trial-and-error-like method. We present a software tool that has been developed on the basis of the algorithm to illustrate its capability on a large data collection. We experimented on a very large benchmark collection, the WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database, which consists of about 75000 XML documents distributed over 5000 categories. Our software is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.

1. Introduction

Traditionally, document categorization has been performed manually. However, as the number of documents has increased explosively, the task is no longer amenable to manual categorization, which would require a vast amount of time and cost. This has led to much research on automatic document classification. A text classifier assigns a document to the appropriate category or categories, also called topics, from a predefined set of categories. Originally, research in text categorization addressed the binary problem, where a document is either relevant or not w.r.t. a given category. In real-world situations, however, the great variety of different sources and hence categories usually poses a multi-class classification problem, where a document belongs to exactly one category selected from a predefined set [2, 7, 15, 17]. Even more general is the case of the multi-label problem, where a document can be classified into more than one category. While binary and multi-class problems have been investigated extensively [10], multi-label problems have received much less attention [1].

As the number of topics becomes larger, multi-class categorizers face the problem of complexity, which may incur a rapid increase in time and storage requirements and compromise the perspicuity of the categorized subject domain. A common way to manage complexity is to use a hierarchy (in this paper we restrict our investigation to tree-structured hierarchies), and text is no exception [3]. Internet directories and large online databases are often organized as hierarchies; see, e.g., Yahoo (http://www.yahoo.com).

Patent databases are typically ones where the use of a hierarchical category system is a necessity. Patents cover a very wide range of topics, and each field can be further divided into subtopics until a reasonable level of specialization is reached. (The hierarchy of topics is often called a taxonomy.) The International Patent Classification (IPC) is a standard taxonomy developed and administered by WIPO (World Intellectual Property Organization) for classifying patents and patent applications. The use of patent documents and the IPC for research into automated categorization is interesting for the following reasons [4]:

1. IPC covers a huge range of topics and uses a diverse technical and scientific vocabulary.

2. IPC is a complex, hierarchical taxonomy, where over 40 million documents have been classified worldwide. The number of documents classified each year is rising fast.

3. Domain experts in national patent offices currently classify patent documents fully manually. These experts have an intimate knowledge of the IPC system.


4. Patent documents are often available in several languages. Professional translators have already performed large numbers of translations manually.

As a courtesy of WIPO, we could experiment with the WIPO-alpha English patent database issued in late 2002 (the collection is available after registration at http://www.wipo.int/ibis/datasets/index.html), a large collection (3 GB) of about 75000 XML documents distributed over 5000 categories on four levels. Due to the strongly business-sensitive nature of the research, studies on the WIPO-alpha collection are not publicly available. Our primary purpose with this database is to analyze the applicability of our algorithm, which has been tested successfully on smaller corpora (see [11, 12, 13]), to a very large real-world collection in terms of efficiency and feasibility (time and space requirements).

The paper is organized as follows. Section 2 presents our classifier approach. Section 3 reports on our experience with the WIPO-alpha database. Conclusions are drawn in Section 4.

2. The classifier

This section describes our hierarchical text categorization approach [12, 13]. The main part of the approach is an iterative learning module that gradually trains the classifier to recognize the constitutive characteristics of categories and hence to discriminate typical documents belonging to different categories. The iterative learning helps to avoid overfitting the training data and to refine the characteristics of categories.

Characteristics of categories are captured by typical terms occurring frequently in documents of the corresponding categories. We represent categories by weight vectors, called category descriptors (or simply descriptors), where an element of this vector reflects the importance of a term (typically a word) in discriminating the given category from others. The higher the weight of a term in a category descriptor, the more significant it is to the given category. Category descriptors are tuned during the iterative learning based on the sample training documents. The training algorithm sets and maintains category descriptors in a way that allows the classifier to categorize correctly most of the training, and consequently, test documents. We start the categorization with empty descriptors.

We now briefly describe the training procedure. First, when classifying a training document we compare it with the category descriptors and assign the document to the category of the most similar descriptor. When this procedure fails to find the correct category, we raise the weights of those terms in the correct category's descriptor that also appear in the given document. If a document is assigned to a category incorrectly, we lower the weights of those terms in the descriptor that appear in the document. We tune the category descriptors by finding the optimal weights for each term in each category descriptor with this awarding-penalizing method. The training algorithm is executed iteratively and ends when the performance of the classifier cannot be improved significantly any further. For test documents the classifier works in one pass, omitting the feedback cycle. See [12, 13] for a detailed description. A brief summary of the method is given next.

2.1. Notations

Let C be the fixed set of topics (categories) into which the elements d ∈ D of the document collection are classified. The set D is partitioned into D_Train and D_Test, where D_Train ∩ D_Test = ∅ and D_Train ∪ D_Test = D. Each d_j ∈ D is classified into a leaf category of the topic hierarchy; a parent category owns the documents of its child categories. Formally,

$$\mathrm{topic}(d_j) = \{c^1, \ldots, c^q \in \mathcal{C}\}, \tag{1}$$

where the upper index refers to the depth of the topic.

Texts cannot be directly interpreted by a classifier. Because of this, an indexing procedure that maps a text d into a compact representation of its content needs to be applied uniformly to all documents (training and test). We apply the usual vector space model, where a document d_j is represented by a vector of term weights

$$d_j = (w_{1j}, \ldots, w_{|T|j}), \tag{2}$$

where T is the set of terms that occur at least once in D_Train, and 0 ≤ w_kj ≤ 1 represents the relevance of the kth term to the characterization of the document d_j. We experimented with tf×idf (3) and entropy (4) weighting [9]:

$$w_{kj} = o_{kj} \cdot \log\frac{N}{n_k}, \tag{3}$$

$$w_{kj} = \log(o_{kj}+1)\cdot\Bigl(1 + \frac{1}{\log N}\sum_{i=1}^{N} \frac{f_{ki}}{n_k}\,\log\frac{f_{ki}}{n_k}\Bigr). \tag{4}$$

Here o_kj is the number of occurrences of the kth term in d_j; f_ki is the occurrence count of the kth term in the ith training document; n_k is the number of documents in which the kth term occurs at least once; and N = |D_Train|. Term vectors (2) are normalized before training.

We characterize categories by a vector of descriptor term weights

$$\mathrm{descr}(c_i) = (v_{1i}, \ldots, v_{|T|i}), \quad c_i \in \mathcal{C}, \tag{5}$$

where the weights 0 ≤ v_ki ≤ 1 are set during training. The descriptor of a category can be interpreted as the prototype of a document belonging to it.
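To make the weighting schemes concrete, here is a minimal Python sketch of (3) and (4) together with the normalization step. The function names and the dict-based data layout are our own illustration, not part of the described system, and f_k is assumed to be the list of per-document occurrence counts of the kth term over D_Train.

```python
import math

def tfidf_weight(o_kj, n_k, N):
    """tf-idf weighting, eq. (3): w_kj = o_kj * log(N / n_k)."""
    return o_kj * math.log(N / n_k)

def entropy_weight(o_kj, f_k, n_k, N):
    """Entropy weighting, eq. (4).

    o_kj: occurrences of term k in document d_j
    f_k:  per-document occurrence counts of term k over the N training docs
    n_k:  number of training documents containing term k
    """
    s = sum((f / n_k) * math.log(f / n_k) for f in f_k if f > 0)
    return math.log(o_kj + 1) * (1 + s / math.log(N))

def normalize(vec):
    """L2-normalize a term-weight vector (dict: term -> weight), as is
    done with the term vectors (2) before training."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}
```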

2.2. Classification and training

The classification method works downward in the taxonomy (a greedy-type algorithm). When classifying a document d_j, its term vector (2) is compared to the topic descriptors (5) as

$$\mathrm{conf}(d_j, \mathrm{descr}(c_i)) = f\Bigl(\sum_{k=1}^{|T|} w_{kj}\, v_{ki}\Bigr), \tag{6}$$

which results in a ranking of topics (f is a smoothing function). From the ranking the best topic c is selected. If c is not at the lowest level, the algorithm continues with the children of c. The algorithm starts at the root of the taxonomy and ends when a leaf category is found. One may also consider the best k topics of the ranking at each stage; hence with a broader search an ordered list of topics can be obtained. For details see [12, 13]. In order to partly alleviate the risk of a high-level misclassification [8], we control the selection of the best category by a minimum conformity parameter: conf(d_j, descr(c)) ≥ conf_min ∈ [0, 1] is required to go downward in the hierarchy.

The classifier can commit two kinds of error: it can misclassify a document d_j into c_i, and, usually simultaneously, it can fail to determine the correct category of d_j. The following training algorithm aims at minimizing both types of error. We scan all the decisions made by the classifier and proceed as follows. For each considered category c_i at a given level we accumulate a vector δ(c_i) = (δ(v_{1i}), ..., δ(v_{|T|i})), where

$$\delta(v_{ki}) = \alpha \cdot \bigl(\mathrm{conf}_{\mathrm{req}} - \mathrm{conf}(d_j, \mathrm{descr}(c_i))\bigr) \cdot w_{kj}, \quad 1 \le k \le |T|, \tag{7}$$

and conf_req = 1 when c_i ∈ topic(d_j), and 0 otherwise. Here α ≥ 0 is the learning rate. The category descriptor weight v_ki is updated as v_ki + δ(v_ki), 1 ≤ k ≤ |T|, whenever category c_i takes part in an erroneous classification. If d_j is misclassified into c_i, then (conf_req − conf(d_j, descr(c_i))) is negative, hence the weights of terms co-occurring in d_j and c_i are reduced in descr(c_i). In the other case, if c_i is the correct but unselected category of d_j, then (conf_req − conf(d_j, descr(c_i))) is positive, thus the weights of terms co-occurring in d_j and c_i are increased in descr(c_i). The training cycle is repeated until the given maximal number of iterations is reached or the performance of the classifier attains a quasi-maximal value. We use the quality function Q (introduced in [12, 13]) to measure the inter-training effectiveness of the classifier; it reflects small changes in effectiveness more sensitively than, e.g., the F measure.
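The descent and the award-penalize update can be sketched as follows. This is an illustrative reading of eqs. (6) and (7), not the authors' implementation: the smoothing function f is taken as the identity, the taxonomy is held in plain dicts, and the clamp to [0, 1] merely enforces the weight bounds stated in Section 2.1.

```python
def conf(doc_vec, descr):
    """Conformity of a document and a category descriptor, eq. (6),
    with the smoothing function f taken as the identity."""
    return sum(w * descr.get(t, 0.0) for t, w in doc_vec.items())

def classify(doc_vec, root, children, descriptors, conf_min=0.0):
    """Greedy top-down descent through the taxonomy.

    children:    category -> list of child categories (empty at leaves)
    descriptors: category -> dict term -> descriptor weight v_ki
    Returns the selected leaf, or the last node whose best child fell
    below the minimum conformity parameter conf_min.
    """
    node = root
    while children.get(node):
        best = max(children[node], key=lambda c: conf(doc_vec, descriptors[c]))
        if conf(doc_vec, descriptors[best]) < conf_min:
            break  # stop descending: risk of high-level misclassification
        node = best
    return node

def update_descriptor(descr, doc_vec, conf_val, is_correct, alpha=0.1):
    """Award-penalize update, eq. (7):
    v_ki <- v_ki + alpha * (conf_req - conf) * w_kj, clamped to [0, 1]."""
    conf_req = 1.0 if is_correct else 0.0
    delta = alpha * (conf_req - conf_val)
    for t, w in doc_vec.items():
        descr[t] = min(1.0, max(0.0, descr.get(t, 0.0) + delta * w))
```

In a training iteration, update_descriptor would be called both for the incorrectly selected category (is_correct=False, lowering co-occurring weights) and for the correct but unselected one (is_correct=True, raising them).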

3. Experiments on WIPO-alpha

3.1. The document collection

The WIPO-alpha collection consists of 3 GB of XML documents (in total about 75000 documents); each document is assigned one main category, and can also be linked to several other categories. The IPC taxonomy has four levels, termed (top-down): section, class, subclass, and main group. The collection is provided as two sub-collections: a training set of 46324 documents and a test set of 28926 documents. The training collection consists of documents roughly evenly spread across the IPC main groups, subject to the restriction that each subclass contains between 20 and 2000 documents. The test collection consists of documents distributed roughly according to the frequency of a typical year's patent applications, subject to the restriction that each subclass contains between 10 and 1000 documents. All documents in the test collection also have attributed IPC symbols, so there is no blind data. Each document includes a title, a list of inventors, a list of applicant companies or individuals, an abstract, a claims section, and a long description. IPC codes (topic labels) have been attributed to each document. A very detailed description of the collection can be found in [4].

3.2. Performance measure

We have adopted three heuristic evaluation measures for categorization success proposed by the provider of the WIPO-alpha collection [4]. Let us suppose that the algorithm returns an ordered list of predicted IPC codes, where the order is determined by the confidence level (see (6)) used as weight. Then we can define the following measures (see Figure 1):

1. Top prediction (briefly: Top). The top category predicted by the classifier is compared with the main IPC class, shown as [mc] in Figure 1.

2. Three guesses (Top 3). The top three categories predicted by the classifier are compared with the main IPC class. If a single match is found, the categorization is deemed successful. This measure is adapted to evaluating categorization assistance, where a user ultimately makes the decision. In this case, it is tolerable that the correct guess appears second or third in the list of suggestions.

3. All classes (Any). We compare the top prediction of the classifier with all classes associated with the document, in the main IPC symbol and in additional IPC symbols, shown as (ic) in Figure 1. If a single match is found, the categorization is deemed successful.


Figure 1. Explanation of the three evaluation measures Top, Top 3, and Any [4]

Although [4] suggests using these measures solely at the IPC class level, in our experiments we also use them at the lower subclass and main group levels.
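The three measures reduce to a few comparisons per document. A minimal sketch in Python follows; the function name and data layout are ours, not from [4].

```python
def evaluate(predicted, main_ipc, all_ipc):
    """Top / Top 3 / Any success rates over a set of test documents.

    predicted: doc id -> ordered list of predicted categories (best first)
    main_ipc:  doc id -> main IPC category ([mc])
    all_ipc:   doc id -> set of all attributed IPC categories ([mc] and (ic))
    """
    n = len(predicted)
    top = sum(preds[0] == main_ipc[d] for d, preds in predicted.items())
    top3 = sum(main_ipc[d] in preds[:3] for d, preds in predicted.items())
    any_ = sum(preds[0] in all_ipc[d] for d, preds in predicted.items())
    return top / n, top3 / n, any_ / n
```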

3.3. Dimensionality reduction

When dealing with a huge document collection, the large number of terms, |T|, can cause problems in document processing, indexing, and also in category induction. Therefore, before indexing and category induction many authors apply a dimensionality reduction (DR) pass to reduce the size of |T| to |T'| ≪ |T| [10]. Besides speeding up the categorization, papers have also reported that DR can increase the performance of the classifier by a few percent if only a certain subset of terms is used to represent documents (see, e.g., [6, 16]). In our previous experiments [13] we also found that performance can be increased slightly (by less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant.

We reduced |T| by disregarding terms that either occur fewer than minoccur times in D_Train, or occur more often than a certain threshold, i.e., if n_k/|D_Train| ≥ maxfreq. By the former process we disregard words that are not significant for the classification, while by the latter we ignore words that are not discriminative enough between categories. The typical values are minoccur ∈ [1 .. 10] and maxfreq ∈ [0.05 .. 1.0]; a sketch of this filter is given below.

The construction of patent documents provides another way of DR as well. One may select certain XML fields as the basis of the term set (dictionary), and index the other parts of the documents using this dictionary. E.g., the long description section can be ignored for dictionary creation because it may contain a lot of dummy words.
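The frequency-based filter amounts to two threshold tests per term; a sketch, with our own function name and data layout:

```python
def reduce_dictionary(occurrences, doc_freq, n_train,
                      minoccur=2, maxfreq=0.25):
    """Keep terms occurring at least minoccur times in D_Train and whose
    document frequency satisfies n_k / |D_Train| < maxfreq.

    occurrences: term -> total occurrence count in D_Train
    doc_freq:    term -> n_k, number of training documents with the term
    """
    return {t for t in occurrences
            if occurrences[t] >= minoccur
            and doc_freq[t] / n_train < maxfreq}
```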

3.4. Results

We present the obtained results by means of a series of figures (Figures 2-5) and a summarizing table (Table 1). We differentiate the results based on the confidence level. Here 0.0 means that all guesses are considered, while 0.8 means that only those decisions are considered where the confidence level is not less than 0.8. Obviously, the higher the confidence level, the lower the number of considered documents. The figures show all three performance measures at the class, subclass and main group levels by confidence levels increasing in steps of 0.1. Table 1 compares some significant values of each parameter setting.

The best results have been achieved after 7 iterations when only the fields of inventors, applicants, title and abstract are used for dictionary creation (the "ipta" setting), entropy weighting (4) is used, minoccur = 2 and maxfreq = 0.25 (Figure 2). The other settings are modifications of this one; below we note only the modified values.

Figure 2. Setting "ipta" with entropy weighting, minoccur = 2 and maxfreq = 0.25. a) Precision by confidence levels. b) Comparison of precisions, extrapolated to 100% recall

The next setting delivered very similar results to "ipta"; it is obtained when words occurring in the claims fields are also used for dictionary creation (the "iptac" setting). See Figure 3.

Figure 3. Setting "iptac". Precision by confidence levels

We investigated the effect of using only the main category of each patent document for training. This experiment was suggested by the developers of the collection [4]; it can be argued that this selection makes ambiguous training documents (those having several topics) unambiguous for training purposes. The obtained results did not support this hypothesis: they were inferior to the ones achieved with the regular setting (Figure 4).

Figure 4. Setting: only main categories used for training. Precision by confidence levels

We also investigated the use of tf×idf weighting (3). The obtained results are considerably worse than the ones obtained with entropy weighting, but the time required for indexing the collection decreased by about 30%; this is because tf×idf weighting requires one pass fewer to index the collection. See Figure 5. The inferior results are also due to the high maxvar parameter, which is 0.5 here, while it is 0.01 in the other settings. Consequently, the number of documents taken into account at high confidence levels is significantly higher.

Figure 5. Use of tf×idf weighting. a) Precision by confidence levels. b) Comparison of precisions, extrapolated to 100% recall

It is worth noting that the graph of the Top 3 measure drops as the confidence level increases. The reason is that at lower confidence levels more categories are returned and, based on the results, in some cases a category returned with lower confidence is the correct one; that category is left out at higher confidence levels. When the confidence level goes even higher, far fewer documents are considered, and the Top 3 values increase again. (This reasoning does not apply to the tfidf setting with high variance, because there the number of considered documents is low even at low confidence levels.) One can observe that the relationship between the evaluation measures is Top < Any < Top 3, except when the tf×idf weighting scheme is applied: then Any gives the best values. Naturally, the lower we go in the taxonomy, the more imprecise the predictions become. At the IPC class level the Top 3 measure attains 85.5% at several settings with the lowest confidence level (i.e., when essentially all documents are considered), which is quite a significant result. This value suggests that the algorithm can be used for large document corpora in real-world applications.

Eval. ms.  Setting   cl./0.0   cl./0.8   s.cl./0.0   m.g./0.0
Top        ipta        65.75     92.93       53.25      36.89
           iptac       65.50     92.93       53.14      36.78
           main        62.81     84.37       49.41      32.28
           tfidf       64.04     83.56       50.76      33.75
Top3       ipta        85.56     92.93       75.05      55.44
           iptac       85.61     92.93       75.00      55.58
           main        83.61     84.37       70.89      48.98
           tfidf       70.71     84.11       56.57      37.05
Any        ipta        73.68     95.83       62.45      46.46
           iptac       73.41     95.64       62.28      46.38
           main        71.32     94.97       58.97      41.51
           tfidf       72.22     90.38       60.18      42.71

Table 1. Summary of results on the WIPO-alpha patent collection (correct guesses, %; columns give IPC level / confidence level)

Because of the very large taxonomy and range of documents, the results at the main group level seem quite weak. However, if we consider that human experts perform this categorization with about 64% accuracy [14], this result turns out to be much more significant.

Let us briefly remark on the time efficiency of the method. Our experiments were executed on a regular PC (Linux OS, 2 GHz processor, 1 GB RAM). Indexing the entire training collection took around one hour with entropy weighting and just over 40 minutes with tf×idf weighting. The training algorithm (7 iterations) required about 2 hours with each setting. Additional iterations did not improve the results significantly.

4. Conclusion

We presented an automated text classifier and its application to the WIPO-alpha English patent collection in the IPC taxonomy. IPC covers all areas of technology and is currently used by the industrial property offices of many countries. Patent classification is indispensable for the retrieval of patent documents in the search for prior art. Such retrieval is crucial to patent-issuing authorities, potential inventors, research and development units, and others concerned with the application or development of technology. An efficient automated patent classifier is a crucial component of an automated classification assistance system for categorizing patent applications in the IPC, which is a main aim at WIPO [4]. The presented method is a prominent candidate for this purpose.

References

[1] L. Aas and L. Eikvil. Text categorisation: A survey. Report NR 941, Norwegian Computing Center, 1999.
[2] K. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In Proc. of the 21st Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pages 96-103, Melbourne, Australia, 1998.
[3] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7(3):163-178, 1998.
[4] C. J. Fall, A. Törcsvári, and G. Karetka. Readme information for WIPO-alpha autocategorization training set, December 2002. http://www.wipo.int/ibis/datasets/wipo-alpha-readme.html.
[5] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14, San Mateo, CA, 1997. Morgan Kaufmann.
[6] D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text classification. In Proc. of the Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, 1994.
[7] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. of ICML-98, 1998. http://www2.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz.
[8] G. Salton and M. J. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[9] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, March 2002.
[10] D. Tikk and Gy. Biró. Experiments with multilabel text classifier on the Reuters collection. To appear at International Conference on Computational Cybernetics (ICCC03), Siófok, Hungary, 2003.
[11] D. Tikk, Gy. Biró, and J. D. Yang. A hierarchical text categorization approach and its application to FRT expansion. Submitted to Fuzzy Sets and Systems.
[12] D. Tikk, J. D. Yang, and S. L. Bang. Hierarchical text categorization using fuzzy relational thesaurus. To appear in Kybernetika.
[13] A. Törcsvári. Personal communication.
[14] S. M. Weiss, C. Apte, F. J. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):2-8, July/August 1999.
[15] W. Wibovo and H. E. Williams. Simple and accurate feature selection for hierarchical categorisation. In Proc. of the 2002 ACM Symposium on Document Engineering, pages 111-118, McLean, Virginia, USA, 2002.
[16] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69-90, 1999. http://citeseer.nj.nec.com/yang97evaluation.html.

