A Modular System for Rule-based Text Categorisation

Marco Del Tredici, Malvina Nissim
Expert System, University of Bologna
[email protected], [email protected]

Abstract
We introduce a modular rule-based approach to text categorisation which is more flexible and less time-consuming to build than a standard rule-based system, because it works with a hierarchical structure and allows for the reusability of rules. When compared to currently more widespread machine learning models on a case study, our modular system shows competitive results, and it has the advantage of reducing manual effort over time, since fewer and fewer rules must be written when moving to a (partially) new domain, whereas annotation of training data is always required in the same amount.

Keywords: text categorisation, rule-based, hierarchical structure
1. Introduction and Background
Automatic text categorisation is the task of classifying a set of unknown documents into a finite number of preselected categories; it is also known as "document classification". Nowadays, thanks to the availability of a large quantity of pre-classified documents in digital form and of effective learning algorithms, the dominating approach is based on supervised machine learning techniques, where a classifier is built by learning from a set of manually labelled documents (Sebastiani, 2002; Sebastiani, 2005). Statistical approaches have thus gradually been favoured over rule-based ones, which had however achieved competitive results (Dejong, 1982; Jacobs and Rau, 1988; Hayes and Weinstein, 1990; Goodman, 1990), also because of portability issues: creating new annotated sets for training statistical models is generally less time-consuming and requires a lesser degree of expertise than creating entire new sets of rules (Sebastiani, 1999); see also Yang and Liu (1999) and Steinbach et al. (2000) for a comparison including unsupervised methods. However, the Pascal challenge on Large Scale Hierarchical Text Classification, now at its fourth edition (http://lshtc.iit.demokritos.gr/LSHTC4_CALL), highlights the need for ways of dealing with large amounts of data where distributions are skewed at different levels of the hierarchy, implying that learning methods are challenged by the distribution and statistical dependence of the classes (Kosmopoulos et al., 2010). In this paper we explore the possibility of building a more flexible, reusable rule-based system aimed at improving portability while eliminating the annotation effort typical of supervised machine learning approaches. Specifically, rather than creating a unique set of rules that directly defines a target category, we suggest a double categorisation process, in which atomic categories are first created as independent units, and then combined into complex structures corresponding to the target classes. Such a modular rule-based system also naturally lends itself to the aforementioned hierarchical document classification task, as it reflects the structural dependencies of the categories (see Section 5. for a detailed discussion of this issue).
2. Method
The method we propose is a modular, two-stage categorisation process. Final categories, which we call complex categories (Hayes and Weinstein, 1990), can be seen as the sum of several basic categories, which we call atomic categories. Starting from the final target categories, several relevant atomic categories are created and stored in a database as independent units. Complex categories are then built up by aggregating appropriate atomic categories among those available, with a specific weighting strategy (see the details of phase 2 below). By combining them in different ways, new complex structures can be formed. The underlying guiding principle, which addresses the portability problem typical of rule-based systems, is that atomic categories constitute atoms of information that can be reused. As an advantage over statistical methods, no new training sets have to be annotated when dealing with new categories; as an advantage over standard rule-based approaches, no new rules have to be written completely from scratch when including new categories or moving to (partially) new domains. The two phases of the categorisation process require specific tools, and are described below.

• In phase 1, atomic categories are created with COGITO® Studio, a programming environment in which sets of rules are manually written using a declarative language. At the heart of COGITO® Studio is a semantic network (similar to the WordNet hierarchy (Fellbaum, 1998)), which is used to perform word sense disambiguation and entity recognition. Within the same environment, rules are then written to define the conditions that an input text has to satisfy to be placed in a given atomic category. Every time such conditions are verified, a score is assigned to the corresponding category. All created atomic categories are stored in a database.

• In phase 2, a Java application that we developed combines the atomic categories in the database into complex categories. When assigning the target category, a weighting system assigns different scores to matching shared atomic categories (lower score) and matching class-specific atomic categories (higher score). New complex categories can be built at any time by combining the existing atomic categories in different ways (a minimal sketch of this combination step is given at the end of this section).

We apply our method to a case study (Section 3.), and then compare it to standard machine learning approaches (Section 4.), both in terms of the time required to build the systems and in terms of their performance. In Section 5., we also discuss some issues related to the hierarchical aspects of the categories. Finally, in Section 6., we provide a summary of our contribution and outline directions for future work.
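As anticipated above, here is a minimal sketch of the phase-2 combination step. The data structures, weights and method names are illustrative assumptions and do not reproduce the actual Java application: atomic-category scores are assumed to arrive from the phase-1 rules, and each complex category is scored by summing the scores of its atomic categories, weighting class-specific ones more heavily than shared ones.

    import java.util.*;

    /** Sketch of phase 2: combining atomic-category scores into complex-category scores.
     *  Names and weighting constants are illustrative, not the actual implementation. */
    public class ComplexCategoryScorer {

        // shared atomic categories contribute less than class-specific ones
        private static final double SHARED_WEIGHT = 1.0;
        private static final double SPECIFIC_WEIGHT = 2.0;

        /** complex category name -> atomic categories it is built from */
        private final Map<String, Set<String>> complexCategories;

        public ComplexCategoryScorer(Map<String, Set<String>> complexCategories) {
            this.complexCategories = complexCategories;
        }

        /** An atomic category is "shared" if it appears in more than one complex category. */
        private boolean isShared(String atomic) {
            int count = 0;
            for (Set<String> atoms : complexCategories.values()) {
                if (atoms.contains(atomic)) count++;
            }
            return count > 1;
        }

        /** Ranks all complex categories for one document, given the atomic-category
         *  scores produced by the phase-1 rules for that document. */
        public List<Map.Entry<String, Double>> rank(Map<String, Double> atomicScores) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Set<String>> complex : complexCategories.entrySet()) {
                double score = 0.0;
                for (String atomic : complex.getValue()) {
                    Double s = atomicScores.get(atomic);
                    if (s == null) continue;   // this atomic category did not match the document
                    score += s * (isShared(atomic) ? SHARED_WEIGHT : SPECIFIC_WEIGHT);
                }
                scores.put(complex.getKey(), score);
            }
            List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
            ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
            return ranked;   // first and second entries: the system's first and second choices
        }
    }

The first two entries of the returned ranking correspond to the system's first and second choices, which are the basis of the strict and relaxed evaluation modes described in Section 3.2.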
3. Case study
We tested our method within a project on homeland security in Italy. Digital documents had to be automatically classified as pertaining to one of several pre-defined categories in the general areas of terrorism and bullying. For the former, the final categories are "Islamic terrorism" (TI), "Christian terrorism" (TCC), "mafia terrorism" (TM), "far-left terrorism" (TDS), "far-right terrorism" (TDD) and "separatist terrorism" (TS), while for the latter we have "bullying at school" (BU) and "cyberbullying" (CB). Finally, we added a more generic category for "criminality" (CR).

3.1. Modular categories and rules
According to the approach described above, in phase 1 we developed the appropriate atomic categories with their corresponding rules. For "Islamic terrorism", for instance, we have the following atomic categories: Islamic fanaticism, violence, threat, weapons, Islamic terrorist groups, anti-Americanism, attack, crime. We wrote a set of rules to match each of these atomic categories. In phase 2, the atomic categories are combined to create complex categories by means of the Java application. Figure 1 provides an example of the creation of the complex category "Islamic terrorism". As another example, the category "Christian terrorism" can be split into: Christian fanaticism, violence, threat, weapons, Christian terrorist groups, anti-Semitism, attack, crime. As we can see in Figure 2, five out of eight atomic categories overlap with those already created for "Islamic terrorism" and could therefore be reused. In other words, no new rules had to be written for 62.5% of the complex category. We followed the same procedure for the other categories. To highlight the domain independence and wider reusability of the atomic categories, we used them not only in the field of terrorism but also in the field of bullying, so that new rules only had to be written for new atomic categories (e.g. school, cyberspace, adolescence), while we could reuse violence, weapons, threat and crime. Overall, for nine target classes, we created 66 atomic categories and a total of about 1000 rules. Approximately 65% of the rules were used for realising two or more complex categories, saving a considerable amount of time and resources.

Figure 1: Creation of the complex category "Islamic terrorism" using atomic categories.
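To make the reuse figure concrete, the following fragment (a toy illustration; representing complex categories as plain sets of atomic-category names is an assumption, not the actual data model) computes the overlap between the two classes discussed above:

    import java.util.*;

    /** Toy illustration of atomic-category reuse between two complex categories. */
    public class ReuseExample {
        public static void main(String[] args) {
            Set<String> islamicTerrorism = new HashSet<>(Arrays.asList(
                "Islamic fanaticism", "violence", "threat", "weapons",
                "Islamic terrorist groups", "anti-Americanism", "attack", "crime"));
            Set<String> christianTerrorism = new HashSet<>(Arrays.asList(
                "Christian fanaticism", "violence", "threat", "weapons",
                "Christian terrorist groups", "anti-Semitism", "attack", "crime"));

            Set<String> reused = new HashSet<>(christianTerrorism);
            reused.retainAll(islamicTerrorism);   // atomic categories already in the database

            double reuseRatio = (double) reused.size() / christianTerrorism.size();
            System.out.printf("reused: %d of %d atomic categories (%.1f%%)%n",
                    reused.size(), christianTerrorism.size(), 100 * reuseRatio);
            // prints: reused: 5 of 8 atomic categories (62.5%)
        }
    }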
3.2. Evaluation
We created a test set by annotating 260 randomly selected documents (all downloaded from the web), ensuring balance across categories. All documents are in Italian, and their average length is around 425 words. Two native speakers (not involved in the document selection procedure) annotated the texts, classifying each of them into one of the nine target categories listed above. Reliability was assessed via the kappa statistic (Cohen, 1960), which was measured at .892. Agreement per category was measured via precision, recall, and f-score, and we found that only two categories had an f-score below .900, namely "far-right terrorism" (TDD) (.828) and "Christian terrorism" (TCC) (.777). In a reconciliation phase, the annotators discussed the 25 instances on which there was disagreement, in order to eventually create a reference gold standard set. Its distribution is given in Table 1, along with the IAA f-score for each category.

Table 1: Distribution of documents in the test set, and inter-annotator agreement (IAA) per category (f-score).

  category   # docs   % docs   IAA (f-score)
  TDS          30      11.5      0.965
  TDD          29      11.1      0.828
  TI           33      12.7      0.926
  TCC          22       8.5      0.777
  TM           29      11.1      0.929
  TS           28      10.8      0.928
  CR           31      11.9      0.908
  BU           30      11.5      0.900
  CB           28      10.8      0.918
  total       260      99.9       –
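For completeness, the agreement figure reported above is the standard Cohen's kappa, where $p_o$ is the observed proportion of agreement between the two annotators and $p_e$ the proportion of agreement expected by chance given the annotators' category distributions:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

A value of .892 is conventionally read as almost perfect agreement.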
Figure 2: Sharing of atomic categories by complex categories ("Islamic terrorism" and "Christian terrorism").

The system was evaluated against the gold standard, measuring global accuracy as well as precision, recall, and f-score for each category. Since the system was designed to output the first and the second most likely category, evaluation was carried out in strict mode, considering only the first choice, and in relaxed mode, also considering the second choice. Accuracy was measured at .880 in strict mode and .988 in relaxed mode. Details of the evaluation per category are given in Table 2, and a confusion matrix between gold standard categories and system predictions in Table 3.
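A minimal sketch of how the two evaluation modes can be computed, assuming the system output for each document is available as a ranked pair of category labels (this representation is an illustrative assumption, not the actual implementation):

    import java.util.List;

    /** Sketch of strict vs relaxed accuracy over top-2 system outputs.
     *  Representing each prediction as a String[]{first, second} is an assumption. */
    public class TopTwoAccuracy {

        /** strict: only the first choice counts; relaxed: the second choice is accepted as well. */
        public static double accuracy(List<String[]> predictions, List<String> gold, boolean relaxed) {
            int correct = 0;
            for (int i = 0; i < gold.size(); i++) {
                String[] topTwo = predictions.get(i);
                boolean hit = topTwo[0].equals(gold.get(i))
                        || (relaxed && topTwo.length > 1 && topTwo[1].equals(gold.get(i)));
                if (hit) correct++;
            }
            return (double) correct / gold.size();
        }
    }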
Table 2: Evaluation overall (accuracy) and per category (precision, recall, f-score).

  global accuracy: strict 0.880, relaxed 0.988

  category   precision   recall   f-score
  TDS          1.000      0.966    0.983
  TDD          0.964      0.870    0.914
  TI           0.964      0.794    0.871
  TCC          0.916      0.578    0.708
  TM           1.000      0.933    0.965
  TS           1.000      0.925    0.962
  CR           0.620      0.968    0.756
  BU           0.805      0.935    0.868
  CB           0.916      0.846    0.879
While results are satisfactory for most categories, for two of them, namely "criminality" and "Christian terrorism", the system performed worse (.756 and .708 f-score, respectively). As for the latter, the low f-score is determined by poor recall (.578). Apart from observing that the level of agreement between annotators was also quite low, indicating that this is a harder category to identify, we can also note that this category is the one with the lowest percentage of non-shared atomic categories. This seems to imply that a proper balance between specificity and reuse is crucial for satisfactory results. The poor result for "criminality", on the contrary, is due to low precision (.620). It is not particularly surprising that a category that is more general than (and encompasses) the others gets assigned too often, as is evident from the CR column in Table 3, especially when rules pointing to more specific categories cannot be matched.
4. Comparison to statistical approaches
The reason why our approach is preferable to a standard rule-based system that does not envisage combination and reusability of atomic categories is rather obvious. However, most current categorisation systems are statistically based, and it is important to see how our proposal compares to them. A proper comparison is not straightforward: in order to show that our approach is interesting and worth pursuing, we would need to show either that it is easier and faster to create the atomic rules than to annotate the documents, or that our resulting system is competitive with respect to a supervised machine learning classifier. In both cases, we would need to annotate a larger set of documents, both to have a fair amount of annotated data to train a state-of-the-art classifier, and to check how long it takes to create a larger set of annotated data. Although we have not annotated more documents than what is already in our dataset, in this section we sketch two comparisons that should give an idea of the competitiveness of our approach, both in terms of the manual effort required (Section 4.1.) and in terms of actual performance (Section 4.2.).

4.1. Manual effort
Even if our annotated corpus is composed of only 260 documents, we can still make some observations regarding the effort required to write rules compared to that required to annotate documents. It took the two human annotators approximately one day to classify all the documents, while an expert linguist can write roughly 10 atomic categories per day, depending on the difficulty of the specific category.
Table 3: Confusion matrix of categorisation results (rows: gold standard, columns: system prediction).

         TDS  TDD  TI   TCC  TM   TS   CR   BU   CB
  TDS     29    0    0    0    0    0    1    0    0
  TDD      0   27    0    1    0    0    2    1    0
  TI       0    0   27    0    0    0    7    0    0
  TCC      0    1    1   11    0    0    4    1    1
  TM       0    0    0    0   28    0    2    0    0
  TS       0    0    0    0    0   25    2    0    0
  CR       0    0    0    0    0    0   31    1    0
  BU       0    0    0    0    0    0    1   29    1
  CB       0    0    0    0    0    0    0    4   22
Thus, since we implemented 66 atomic categories for 9 complex categories, the work of writing rules took us approximately 7 days, i.e. about 1.2 complex categories per day. At first glance, then, our system is more time-consuming than creating material to train a machine learning classifier. However, the key point of our approach is reusability, and this should also be taken into account when assessing manual effort. Indeed, for each new category, document annotation for machine learning classification has to be carried out from scratch, i.e. annotators cannot reuse the knowledge built up while annotating the previous categories, and the time needed for document annotation always remains the same. On the contrary, the knowledge encoded in the set of rules composing an atomic category remains available for subsequent categories. Reusing knowledge means saving time, and so the time required for the creation of complex categories progressively decreases as the number of available atomic categories increases. As an example, let us assume we need to create the new complex category "vandalism". As can be seen in Figure 3, five atomic categories already implemented for our homeland security project, and already employed in two different complex structures, can be reused to create the new complex category. Assuming this new structure will be composed of nine or ten atomic categories, the time required for its creation is at most half of what would have been required if no atomic categories were available. In our approach, time is thus saved incrementally.
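To make the incremental saving concrete, a rough estimate under the figures just given (about ten atomic categories per linguist-day; five of the roughly nine or ten atomic categories needed for "vandalism" already in the database) is:

\[
\frac{\text{new atomic categories to write}}{\text{atomic categories needed}} \;=\; \frac{9-5}{9} \approx 0.44 \quad\text{to}\quad \frac{10-5}{10} = 0.5,
\]

i.e. at most half of the rule-writing effort that would be needed if the database of atomic categories were empty. These figures are of course only indicative of the trend.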
4.2. Performance
A standard reference dataset for training and testing statistical models of text categorisation is the Reuters collection, a set of newswire documents classified under several categories in the field of economics (see (Yang, 1999) for an overview of the different versions of the Reuters dataset and models trained on them). Although the number of documents in the various versions of Reuters ranges between 13,000 and 21,400 (and the number of categories from 10 to 135), i.e. substantially larger than the dataset we used, there are also experiments using a number of documents/categories comparable to ours (Larkey, 1998), or even categories with less than ten instances (Yang, 1999). Making do with the amount of data that we have, we trained
a few bag-of-words models with well-established algorithms for text categorisation, namely Naive Bayes, Support Vector Machine, and k-Nearest Neighbours, the latter run with k = 1 and with k = 5 (Yang and Liu, 1999). For the implementation, including pre-processing such as filtering out stop words and converting strings to word vectors, we used the Weka workbench (Hall et al., 2009); a sketch of this setup is given after Table 4. We built all models with three different settings: one without any stop list in building the bag-of-words model (no stop), and two using stop lists of different sizes, one containing 138 words (stop 1) and a larger one with almost 400 (stop 2). In Table 4, we report accuracy for the different models in a ten-fold cross-validation setting.

Table 4: Accuracy for machine learning models (ten-fold cross-validation): Naive Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbours (k-NN).

              NB      SVM     k-NN (k=1)   k-NN (k=5)
  no stop    0.865   0.869     0.400        0.304
  stop 1     0.869   0.877     0.342        0.250
  stop 2     0.858   0.850     0.435        0.423
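The sketch below shows the kind of Weka pipeline behind these figures, assuming the 260 documents are stored in an ARFF file with one string attribute for the text and the nominal class as the last attribute; the file name, random seed and default StringToWordVector options are illustrative and not necessarily those actually used.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    /** Sketch of the bag-of-words baselines: NB, SVM (SMO) and k-NN (IBk)
     *  with ten-fold cross-validation. File name and options are illustrative. */
    public class BaselineModels {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("documents.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // nominal class is the last attribute

            Classifier[] learners = { new NaiveBayes(), new SMO(), new IBk(1), new IBk(5) };
            for (Classifier learner : learners) {
                // StringToWordVector turns the text attribute into word-count features;
                // a stop list could be plugged in via its stop-word handling options.
                FilteredClassifier fc = new FilteredClassifier();
                fc.setFilter(new StringToWordVector());
                fc.setClassifier(learner);

                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(fc, data, 10, new Random(1));
                System.out.printf("%s: accuracy %.3f%n",
                        learner.getClass().getSimpleName(), eval.pctCorrect() / 100.0);
            }
        }
    }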
As far as algorithms are concerned, the Naive Bayes and the Support Vector Machine models perform similarly, and dramatically better than k-Nearest Neighbours, independently of the value of k (in the literature on text categorisation, k-NN systems are developed with largely varying values of k, and best performance is reported for values ranging from 20 to 45 (Larkey and Croft, 1996; Yang, 1999; Sebastiani, 2002); on our data, however, the best, albeit unsatisfactory, results are obtained with lower values of k). Overall accuracy is comparable to that achieved by our system (see Table 2). Only marginal differences can be observed when employing a list of stop words, and the best results are obtained with the shorter list (stop 1). As mentioned, these are rather simple models, which could be improved if pursuing a purely machine learning setting; this is, however, not the focus of our contribution. For example, an improvement over the bag-of-words model was more recently achieved by including semantic-based features derived from lexical databases such as WordNet (Rosso et al., 2004; Navigli et al., 2011) and/or encyclopaedic resources such as Wikipedia (Wang et al., 2009). However, we did not implement anything of the sort, due to time constraints and because we consider it an upper bound rather than a system for comparison, given its complexity.

Figure 3: Reusability of existing atomic categories when extending to new domains (in this example "vandalism").
5. Hierarchical issues
Overall, while the scalability of the approach to a drastically larger number of categories must still be investigated, it is promising to see that accurate categorisation can be achieved by writing precise rules whose number can be substantially reduced by adopting a modular strategy, which also reflects the hierarchical organisation typical of categorisation tasks. Concerning precisely the hierarchical aspect, we want to make two further observations. First, the categories can indeed be thought of as the nodes of a taxonomy with different degrees of granularity: the deeper the node, the more specific the category. As pointed out by Sebastiani (2005), the choice regarding granularity in text categorisation tasks is inherently arbitrary, and it can only be made by taking into account the specific aims of a project. This might seem at odds with the concept of portability: how can atomic categories be reused in different projects if their level of granularity is project-specific? Even if this problem does exist in theory, from a practical point of view we never had to deal with it, since the atomic categories stored in the database were successfully reused in all the other projects. Thus, even if picking a good level of granularity for the atomic categories is a crucial issue, in practice it seems entirely feasible to choose the "right" level of granularity, where "right" means appropriate for a given task and geared towards maximising reusability. Additionally, if needed, atomic categories with different degrees of granularity can be created, stored in the same database, and used for the purposes of other projects.
Second, machine learning approaches suffer from scarce and/or skewed training data, as some categories might be represented by very few training documents. This is especially true for hierarchical text categorisation, as was noted in the context of the Pascal challenge (Kosmopoulos et al., 2010), since data sparseness is common for categories further down the hierarchy. In order to improve Naive Bayes hierarchical text classification when data is insufficient, McCallum et al. (1998), and subsequently Toutanova et al. (2001), exploit the hierarchy itself for smoothing, since more general classes can provide more reliable estimates of class-conditional term probabilities. The rationale is that a parent node exhibits some of the more general properties of its child nodes, and this information can be used when more specific information is missing. In other words, whenever necessary, an instance of a specific category gets described by means of a coarser category. A rule-based system does not suffer from data sparseness, but our modular approach does exploit the principles of a hierarchical organisation of categories, in a way that is somewhat complementary to the above. Indeed, we model finer-grained (atomic) categories and derive from them coarser-grained (complex) categories. Overall, it appears that a taxonomy of target classes is the best way to model text categorisation, as it provides a backoff strategy for statistical models in the case of data sparseness, and an inherently modular structure that allows for minimising rule writing and maximising the reusability of lower-level categories in the context of rule-based systems.
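For reference, the shrinkage-based smoothing mentioned above can be summarised as follows (our notation, giving the general idea rather than the exact formulation of the cited papers): the estimate for a term w in a leaf class c is a mixture of the maximum-likelihood estimates taken along the path from c up to the root,

\[
\hat{P}(w \mid c) \;=\; \sum_{i=0}^{k} \lambda_i\, P^{\mathrm{ML}}(w \mid c_i), \qquad \sum_{i=0}^{k} \lambda_i = 1,
\]

where c_0 = c, c_k is the root (or a uniform distribution above it), and the mixture weights are typically estimated on held-out data, e.g. with EM. Our modular system moves in the opposite direction along the same hierarchy: coarse (complex) categories are assembled from fine-grained (atomic) ones rather than backed off to.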
6. Summary and Outlook
We introduced a modular rule-based approach to text categorisation which addresses the problem of portability. Target classes are conceived as complex categories which are composed by combining a series of atomic categories, and it is for the latter that rules are written. The focus is on reusability: atomic categories can be used to create several complex categories, so that defining target classes never implies writing all rules from scratch. Such a system is clearly preferable to standard rule-based systems, at least in terms of time saving, since time consumption is usually considered their main drawback. Compared to mainstream, established statistical models, our system shows competitive levels of performance, achieving state-of-the-art figures. Also, our system does not suffer from data sparseness or skewed categories in the training data, which often affect machine learning methods. However, assessing manual effort is more difficult: annotating documents requires a smaller amount of time, which however stays constant when adding new target classes (and thus new documents to be annotated), whereas writing rules is more time-consuming at first, but the required amount of time decreases as the number of already written atomic categories grows. In other words, time saving is incremental: the more atomic categories are already defined, the less time is needed to create new target categories. Since this is the major issue under investigation for determining whether our approach is worth pursuing on a larger scale, we plan to devise a more accurate strategy to quantify the gain of building complex categories from reusable rules as opposed to annotating documents. This will also include investigating the cost of moving to more radically different domains, possibly with a higher number of categories.
7. Acknowledgments
We are grateful to Sara and Elena for performing the manual annotation of the documents in our gold standard dataset. We also would like to thank the anonymous reviewers, who provided helpful suggestions, especially as far as the comparison of our method to statistical approaches is concerned. All errors remain obviously our own.
8. References
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Dejong, G. (1982). An overview of the FRUMP system. In Lehnert, W. G. and Ringle, M. H., editors, Strategies for Natural Language Processing, pages 149–176.
Fellbaum, Christiane, editor. (1998). WordNet: An Electronic Lexical Database. Language, Speech, and Communication. The MIT Press.
Goodman, Marc. (1990). Prism: A case-based telex classifier. In Proceedings of the Second Conference on Innovative Applications of Artificial Intelligence, IAAI '90, pages 25–38. AAAI Press.
Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. (2009). The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10–18.
Hayes, Philip J. and Weinstein, Steven P. (1990). CONSTRUE/TIS: A system for content-based indexing of a database of news stories. In Rappaport, Alain T. and Smith, Reid G., editors, Proceedings of the 2nd Conference on Innovative Applications of Artificial Intelligence (IAAI-90), pages 49–64. AAAI Press.
Jacobs, P. S. and Rau, L. F. (1988). A friendly merger of conceptual analysis and linguistic processing in a text processing system. In Proceedings of the Fourth IEEE AI Applications Conference, pages 351–356, Los Alamitos, USA. IEEE Computer Society Press.
Kosmopoulos, Aris, Gaussier, Éric, Paliouras, Georgios, and Aseervatham, Sujeevan. (2010). The ECIR 2010 large scale hierarchical classification workshop. SIGIR Forum, 44(1):23–32.
Larkey, Leah S. and Croft, W. Bruce. (1996). Combining classifiers in text categorization. In Proceedings of SIGIR, pages 289–297.
Larkey, Leah S. (1998). Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 90–95, New York, NY, USA. ACM.
McCallum, Andrew K., Rosenfeld, Ronald, Mitchell, Tom M., and Ng, Andrew Y. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Shavlik, Jude W., editor, Proceedings of ICML-98, 15th International Conference on Machine Learning, pages 359–367, Madison, US. Morgan Kaufmann Publishers.
Navigli, Roberto, Faralli, Stefano, Soroa, Aitor, de Lacalle, Oier, and Agirre, Eneko. (2011). Two birds with one stone: Learning semantic models for text categorization and word sense disambiguation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2317–2320, New York, NY, USA. ACM.
Rosso, Paolo, Ferretti, Edgardo, Jiménez, Daniel, and Vidal, Vicente. (2004). Text categorization and information retrieval using WordNet senses. In Proceedings of the 2nd Global WordNet Conference (GWC'04), volume 2945, pages 299–304.
Sebastiani, Fabrizio. (1999). Tutorial on automated text categorisation. In Amandi, Analia and Zunino, Alejandro, editors, Proceedings of the 1st Argentinian Symposium on Artificial Intelligence (ASAI 1999), pages 7–35, Buenos Aires, Argentina.
Sebastiani, Fabrizio. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34:1–47.
Sebastiani, Fabrizio. (2005). Text categorization. In Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109–129. WIT Press.
Steinbach, Michael, Karypis, George, and Kumar, Vipin. (2000). A comparison of document clustering techniques. In Grobelnik, Marko, Mladenic, Dunja, and Milic-Frayling, Natasa, editors, KDD-2000 Workshop on Text Mining, pages 109–111, Boston, MA.
Toutanova, Kristina, Chen, Francine, Popat, Kris, and Hofmann, Thomas. (2001). Text classification in a hierarchical mixture model for small training sets. In Paques, Henrique, Liu, Ling, and Grossman, David, editors, Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, pages 105–113, Atlanta, US. ACM Press.
Wang, Pu, Hu, Jian, Zeng, Hua-Jun, and Chen, Zheng. (2009). Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems, 19(3):265–281.
Yang, Yiming and Liu, Xin. (1999). A re-examination of text categorization methods. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 42–49, New York, NY, USA. ACM.
Yang, Yiming. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2):69–90.