Applying an Existing Machine Learning Algorithm to Text Categorization

Isabelle Moulinier and Jean-Gabriel Ganascia
LAFORIA-IBP-CNRS, Universite Paris VI
4 place Jussieu, F-75252 Paris Cedex 05, FRANCE
Ph: +33 (1) 44 27 70 10  Fax: +33 (1) 44 27 70 00
[email protected]  [email protected]

Abstract. The information retrieval community is becoming increasingly interested in machine learning techniques, of which text categorization is an application. This paper describes how we have applied an existing similarity-based learning algorithm, Charade, to the text categorization problem, and compares the results with those obtained using decision tree construction algorithms. From a machine learning point of view, this study was motivated by the size of the data inspected in such applications. Using the same representation of documents, Charade offers better performance than earlier reported experiments with decision trees on the same corpus. In addition, the way in which learning with redundancy influences categorization performance is also studied.

1 Introduction

Text categorization, which can be defined as the automatic content-based assignment of predefined categories to texts, is a topic of increasing interest. Even though it is a rather simple task compared to other natural language processing problems, it possesses properties that are inherent to natural language data. Text categorization systems are primarily designed to assign categories to documents in order to support information retrieval, or to aid human indexers in the assignment task [1]. Other applications of categorization components include routing documents to topic-specific processing mechanisms [2, 3].

The manual assignment of categories to documents is a time-consuming and expensive task. Automating text categorization through knowledge engineering, although very effective, is also time-consuming and expensive [1]. These drawbacks are significant, since changes to the set of categories invalidate earlier assignments. Some research in machine learning has focused on the automatic construction of decision procedures; this research is also relevant to text categorization, where the main problem is to decide whether or not to assign a given category to a document. Given a set of documents with assigned categories, inductive learning yields rule-based or probabilistic classifiers that assign categories to new documents. Classifiers do not directly handle raw texts, but rather a set of features such as the words the documents contain. Data sets are

needed to train classifiers; these training sets can be artificially constructed or extracted from previous efforts such as indexed databases. Several of the different systems based on inductive learning have been evaluated on the same data, namely the Reuters corpus [4, 5]. As a result, and owing to its size, this corpus provides an interesting medium of comparison between learning algorithms.

The purpose of this paper is the experimental application of Charade, a similarity-based learning algorithm, to the text categorization problem. This was carried out through a comparison of Charade with decision tree construction algorithms on the Reuters data, since results for a decision tree approach have already been reported in [4]. The Charade system was chosen because certain of its properties, in particular its ability to generate redundant rules and the search technique it uses, differ from other inductive learning approaches.

Section 2 describes text categorization as a challenging application for machine learning. Section 3 gives a brief overview of Charade and its potential. Section 4 reports the experimental results obtained on the Reuters corpus and examines the influence of redundancy when learning is performed on textual data. Section 5 presents properties we believe a learning algorithm needs in order to deal successfully with textual data.

2 Text Categorization

Text categorization has been applied to support information retrieval on technical abstracts and newspaper articles, for business or research uses. It is concerned both with parts of documents, such as sentences or paragraphs, and with entire documents.

Text categorization presents challenging properties when machine learning is confronted with such a task. Only two of the most interesting aspects are described in this paper. First of all, the size of textual data is itself a challenge: a real-size corpus, composed of several thousand texts, may involve tens of thousands of words, and even more (hundreds of thousands of features) when phrases are considered. Secondly, natural language data have properties, such as synonymy or ambiguity, that existing learning algorithms can hardly handle. Moreover, categorization is context-dependent with regard to documents. Let us consider the story in Fig. 1. The keyword gold was relevant for a human indexer and was subsequently assigned to the story, whereas the terms lead and copper were overlooked as meaningless. Retaining this distinction is a hard task for inductive classifiers. However, natural language characteristics are not directly addressed here; they represent future research directions.

In order to make category decisions, a representation of texts must be chosen. Most categorization systems consider documents as sets of words: they ignore the original ordering of the text and treat the presence or absence of a word in a text as a binary feature. The local frequency of words in texts is sometimes chosen as an alternative to presence/absence. Other approaches rely on spatial and linguistic structures, multi-word phrases and other information as features.
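This set-of-words representation can be sketched in a few lines. The tokenizer, the sample vocabulary and the story fragment below are our own illustration, not the paper's actual preprocessing:

```python
def to_binary_features(text, vocabulary):
    """Map a raw text to binary features over a fixed vocabulary.

    Word order is discarded; each feature records only the presence (1)
    or absence (0) of a vocabulary word in the text.
    """
    present = set(text.lower().split())
    return {word: int(word in present) for word in vocabulary}

def to_frequency_features(text, vocabulary):
    """Alternative representation: local frequency of each word in the text."""
    tokens = text.lower().split()
    return {word: tokens.count(word) for word in vocabulary}

# Hypothetical vocabulary and story fragment, for illustration only.
vocab = ["gold", "copper", "lead", "wheat"]
story = "commercial production of gold copper lead and zinc"
binary = to_binary_features(story, vocab)     # gold/copper/lead present, wheat absent
freq = to_frequency_features(story, vocab)    # each present word occurs once here
```

Both functions deliberately lose the ordering of the text, which is exactly the simplification the set-of-words representation makes.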

16-MAR-1987 04:09:11.15
TOPICS: gold END-TOPICS
PLACES: japan END-PLACES
JAPAN'S DOWA MINING TO PRODUCE GOLD FROM APRIL
TOKYO, March 16 - said it will start commercial production of gold, copper, lead and zinc from its Nurukawa Mine in northern Japan in April. A company spokesman said the mine's monthly output is expected to consist of 1,300 tonnes of gold ore and 3,700 of black ore, which consists of copper, lead and zinc ores. A company survey shows the gold ore contains up to 13.3 grams of gold per tonne, he said. Proven gold ore reserves amount to 50,000 tonnes while estimated reserves of gold and black ores total one mln tonnes, he added. REUTER

Fig. 1. A categorized story from the Reuters newswire

Text representation is a preliminary step both for artificial intelligence (AI) approaches and for learning-based systems. AI techniques, similar to those used in expert systems dedicated to classification or diagnosis, apply to the text categorization problem. In such categorization systems [1], knowledge engineers introduce one or more intermediate concepts, structured in layers, between the input representation of texts and the output category assignments. They also write rules that map layers onto one another and determine whether to keep or remove concepts. For instance, in the story of Fig. 1, gold is an intermediate concept between the gold ore feature and the gold category.

Several learning techniques have been applied to the text categorization task: they use existing sets of categorized documents to construct classifiers by induction. These methods include Bayesian classification [6], neural networks [7], decision tree construction [4, 8], memory-based learning [9] and optimization techniques [5]. Most of these approaches have been proposed by the information retrieval community. Experiments on the Reuters corpus, composed of financial news stories from the Reuters newswire, have compared Bayesian classification and decision tree construction [4]. These experiments were conducted under the following constraints: text representation was identical, and feature extraction (i.e. the means by which words are turned into learning features) was carried out using the same technique. Before reporting the experimental results on the Reuters data set, we present the main outlines of the Charade system.
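The layered organization can be illustrated with a toy two-layer rule base. The mappings below are hypothetical stand-ins for hand-written rules, not those of an actual system such as CONSTRUE:

```python
# Sketch of the layered, hand-written rule approach: features map to
# intermediate concepts, and concepts map to output categories.
# These mappings are our own illustration.
feature_to_concept = {
    "gold ore": "gold",   # "gold" is the intermediate concept of Fig. 1
    "gold": "gold",
}
concept_to_category = {
    "gold": "gold",       # the concept is promoted to the final category
}

def categorize(features):
    """Apply the two rule layers in sequence: features -> concepts -> categories."""
    concepts = {feature_to_concept[f] for f in features if f in feature_to_concept}
    return {concept_to_category[c] for c in concepts if c in concept_to_category}
```

In this sketch, `categorize({"gold ore", "copper"})` yields only the gold category: copper reaches no intermediate concept and is dropped, mirroring how the human indexer overlooked it.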

3 An Overview of Charade

Most of the learning techniques used for a categorization task are based on induction, whatever their theoretical basis. Inductive learning algorithms aim at detecting empirical correlations between examples [10]. Top-down inductive systems (TDIS), such as ID3 [11] or CN2 [12], can be characterized by the use of an attribute-value representation and by a top-down search strategy. Symbolic algorithms furthermore deal with domain or learned knowledge expressed as axioms or graphs. Charade [13] belongs to this family. Informally, given a set of training examples, the Charade system extracts rules, written as k-DNF expressions, which cover some positive examples of the training set and no negative examples.

3.1 Generating k-DNF Expressions versus Constructing Decision Trees

Some TDIS, the ID3-like algorithms, generate decision trees, choosing an attribute at each step of the construction according to heuristic criteria. This tree-building method introduces a bias, which we describe intuitively below. Let us consider the learning set given in Table 1. It can be classified using the four production rules in Fig. 2.

Table 1. A toy learning set

        E1  E2  E3  E4  E5  E6  E7  E8  E9
A       a1  a1  a1  a2  a3  a2  a2  a2  a1
B       b1  b3  b1  b2  b2  b1  b1  b3  b4
D       d1  d2  d1  d2  d1  d1  d2  d2  d2
Class   c1  c1  c1  c2  c2  c3  c3  c1  c1

R1: if A = a1 then Class = c1
R2: if B = b2 then Class = c2
R3: if A = a2 and B = b1 then Class = c3
R4: if B = b3 then Class = c1

Fig. 2. Rule set classifying the learning set from Table 1
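The rule set of Fig. 2 can be checked mechanically against Table 1. The sketch below (our encoding, not Charade's) applies every rule whose premises hold:

```python
# Toy learning set from Table 1: (A, B, D) -> Class.
examples = {
    "E1": ("a1", "b1", "d1", "c1"), "E2": ("a1", "b3", "d2", "c1"),
    "E3": ("a1", "b1", "d1", "c1"), "E4": ("a2", "b2", "d2", "c2"),
    "E5": ("a3", "b2", "d1", "c2"), "E6": ("a2", "b1", "d1", "c3"),
    "E7": ("a2", "b1", "d2", "c3"), "E8": ("a2", "b3", "d2", "c1"),
    "E9": ("a1", "b4", "d2", "c1"),
}

# Rules R1-R4 from Fig. 2, each a (premises, conclusion) pair.
rules = [
    ({"A": "a1"}, "c1"),               # R1
    ({"B": "b2"}, "c2"),               # R2
    ({"A": "a2", "B": "b1"}, "c3"),    # R3
    ({"B": "b3"}, "c1"),               # R4
]

def classify(a, b):
    """Return the set of classes concluded by every rule that fires."""
    values = {"A": a, "B": b}
    return {concl for premises, concl in rules
            if all(values[attr] == v for attr, v in premises.items())}

# Every training example is classified correctly by at least one rule.
for name, (a, b, d, cls) in examples.items():
    assert cls in classify(a, b)
```

On the unseen example E10 = (a1, b2, d2, c1) discussed in Sect. 3.3, `classify("a1", "b2")` returns both c1 (from R1) and c2 (from R2): non-exclusive premises can produce conflicting conclusions.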

These rules are produced by Charade or CN2. However, no decision tree technique is able to build this knowledge base; any decision tree will contain empty, over-specific or useless nodes. Let us examine one such tree (Fig. 3), derived from the learning set with ID3.

Attribute B
  B = b1 -> Attribute A
              A = a1 -> Class = c1   (node 4)
              A = a2 -> Class = c3   (node 5)
              A = a3 -> (empty)      (node 6)
  B = b2 -> Class = c2               (node 1)
  B = b3 -> Class = c1               (node 2)
  B = b4 -> Class = c1               (node 3)

Fig. 3. One of the possible decision trees for the training set in Table 1

It is easy to see that node 1 corresponds to rule R2, node 2 to rule R4 and node 5 to rule R3. Rule R1 does not itself appear in the tree; it cannot be generated. In fact, node 4 corresponds to an over-specification of rule R1: all examples in node 4 are covered by the rule R1': if A = a1 and B = b1 then Class = c1, which is more specific than R1. Lastly, node 6 is empty and node 3 is useless once R1 is available. For the sake of clarity, this inability has been explained on a toy example, but the construction bias does not weaken on larger data sets. Charade, which has no such bias, overcomes this inability and produces more concise rules, for instance R1 instead of R1'.
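The bias can be made concrete by listing the root-to-leaf rules of the tree above (hard-coded from Fig. 3; the encoding is ours) and comparing them with R1:

```python
# Root-to-leaf paths of the Fig. 3 tree, written as (premises, class) pairs.
# Node numbers follow the figure; node 6 has no training example.
tree_rules = [
    ({"B": "b1", "A": "a1"}, "c1"),  # node 4: over-specification of R1
    ({"B": "b1", "A": "a2"}, "c3"),  # node 5: rule R3
    ({"B": "b1", "A": "a3"}, None),  # node 6: empty leaf
    ({"B": "b2"}, "c2"),             # node 1: rule R2
    ({"B": "b3"}, "c1"),             # node 2: rule R4
    ({"B": "b4"}, "c1"),             # node 3: useless once R1 is known
]

r1 = ({"A": "a1"}, "c1")

# R1 itself never appears among the tree's path rules...
assert r1 not in tree_rules
# ...yet the node-4 rule is strictly more specific than R1: its premises
# are R1's premise plus an extra condition on B.
assert set(r1[0].items()) < set(tree_rules[0][0].items())
```

Because every path rule must test the root attribute B, a one-premise rule on A alone can never be read off the tree, whatever split heuristic is used.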

3.2 Charade: a Theoretical Framework

The generation of k-DNF expressions conforms to a theoretical framework, presented briefly below. Charade uses a distributive-lattice formalism to generate production rule systems. Thanks to the properties of lattices, generalization does not rely on pattern matching. The proposed framework, as claimed in [14], covers classical TDIS.

The formalism is based on two distributive lattices: the description space lattice, D, which contains all possible descriptions¹, and the example lattice, E, which corresponds to the set of parts of the training set. Consider, for instance, the small training set given in Table 1. ((A = a1) ∧ (B = b3)) belongs to D, whereas {E1, E2, E3} belongs to E. As illustrated in Fig. 4, two functions between these lattices are then introduced: δ : D → E and γ : E → D. δ associates to each description the subset of training examples covered by this description; thus δ((A = a1) ∧ (B = b3)) = {E2}. γ generates the most specific description common to all the examples of a subset of the training set; for instance, γ({E1, E2, E3}) = ((A = a1) ∧ (Class = c1)). The two correspondences γ and δ define a Galois connection between D and E [15]. As shown in [14], this characteristic enables the construction of an inductive mechanism.

[Figure: γ(E) maps a subset E of examples to the description shared by all examples of E; δ(d) maps a description d to the set of examples covered by d.]

Fig. 4. The δ and γ functions

The inductive mechanism in Charade is based on a top-down exploration of the description space D, i.e. from general to specific. The search space can be reduced by two means: properties of rules and properties of the generated knowledge base. In the theoretical framework of Charade, it is established in [14] that once all examples covered by description d1 are also covered by description d2, descriptions more specific than d1 ∧ d2 are useless. This property allows drastic cuts in the top-down exploration of the description space. A secondary benefit is that Charade naturally tends to generate discriminant rules with a minimum number of premises. In a classification system (categorization systems are of this kind), requirements on rule generation may be specified. The minimum percentage of positive examples a rule has to cover may be such a requirement. This constraint, which considerably reduces the search space, is useful when large numbers of examples and descriptors are studied.

¹ A description is a conjunction of descriptors. A descriptor is an attribute-value pair.
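On the toy training set of Table 1, δ and γ can be sketched as naive set operations (our encoding of examples as sets of attribute-value descriptors, not Charade's internal representation):

```python
# Training set of Table 1, each example described as a set of
# attribute-value descriptors (the class is included as a descriptor).
examples = {
    "E1": {("A", "a1"), ("B", "b1"), ("D", "d1"), ("Class", "c1")},
    "E2": {("A", "a1"), ("B", "b3"), ("D", "d2"), ("Class", "c1")},
    "E3": {("A", "a1"), ("B", "b1"), ("D", "d1"), ("Class", "c1")},
    "E4": {("A", "a2"), ("B", "b2"), ("D", "d2"), ("Class", "c2")},
    "E5": {("A", "a3"), ("B", "b2"), ("D", "d1"), ("Class", "c2")},
    "E6": {("A", "a2"), ("B", "b1"), ("D", "d1"), ("Class", "c3")},
    "E7": {("A", "a2"), ("B", "b1"), ("D", "d2"), ("Class", "c3")},
    "E8": {("A", "a2"), ("B", "b3"), ("D", "d2"), ("Class", "c1")},
    "E9": {("A", "a1"), ("B", "b4"), ("D", "d2"), ("Class", "c1")},
}

def delta(description):
    """d : D -> E, the subset of training examples covered by a description."""
    return {name for name, desc in examples.items() if description <= desc}

def gamma(subset):
    """g : E -> D, the most specific description common to a non-empty subset."""
    return set.intersection(*(examples[name] for name in subset))
```

Running the two examples from the text, `delta({("A", "a1"), ("B", "b3")})` yields `{"E2"}` and `gamma({"E1", "E2", "E3"})` yields `{("A", "a1"), ("Class", "c1")}`, i.e. the description (A = a1) ∧ (Class = c1).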

Finally, some applications may benefit from redundancy. For instance, when working on a categorization task, humans often decide to assign a category to a document according to several criteria. This can be expressed as distinct rules covering the same example. Charade offers the opportunity to generate redundant rules. This property is discussed next.

3.3 Learning with Redundancy

Redundant learning has the faculty of producing various means of classifying a given training example. Redundancy is almost impossible with decision trees, whereas approaches that handle k-DNF expressions have this faculty. In fact, the bias introduced by decision tree construction methods forbids redundant learning: once a training example is assigned to a leaf (or to a subtree), it cannot be found in any other leaf (or subtree). Let us consider the learning set in Table 1. Given the decision tree in Fig. 3, it is easy to see that example E8 can only be found in leaf 2. On the contrary, the rule set given in Fig. 2, generated by both Charade and CN2, is redundant: example E8 is covered by rule R1 and rule R4.

Although it naturally tends to lessen useless redundancy, the Charade system supports redundant learning via the redundancy parameter μ. Take the case where μ is set to 2: Charade will attempt to generate a rule base in which each training example is covered at least twice. Gams, in [16], claims that learning may benefit from redundancy; this is illustrated by the previous example. However, redundancy may lead to less accurate prediction on unseen test sets. Since the premises of rules are not exclusive, each unseen (testing) example may fire several rules. Yet these rules may have distinct conclusions, i.e. assign the unseen example to different classes, which may lessen the predictive power of the learned rule base. For instance, let us consider example E10 = (a1, b2, d2, c1) as a testing example. Rule R1 and rule R2 from Fig. 2 both cover E10, which results in an obviously incorrect prediction from rule R2. It is worth noticing that the decision tree of Fig. 3 also leads to an incorrect class assignment. Figure 5 illustrates the difference between rule-based and tree-based approaches on the learning set from Table 1.
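The effect of the redundancy parameter can be mimicked by a greedy covering loop that keeps accepting rules until every training example is covered at least μ times. This is a simplified sketch under our own assumptions, not Charade's actual generation mechanism:

```python
def select_with_redundancy(candidate_rules, examples, covers, mu):
    """Greedily pick rules until each example is covered at least `mu` times.

    candidate_rules: iterable of rules, assumed ordered by preference.
    covers(rule, example) -> bool tells whether a rule covers an example.
    Returns the selected rule base (a subset of candidate_rules).
    """
    count = {ex: 0 for ex in examples}
    selected = []
    for rule in candidate_rules:
        # Keep a rule only if it helps some example still below the quota.
        if any(covers(rule, ex) and count[ex] < mu for ex in examples):
            selected.append(rule)
            for ex in examples:
                if covers(rule, ex):
                    count[ex] += 1
        if all(c >= mu for c in count.values()):
            break
    return selected
```

With μ = 1 the loop stops as soon as each example is covered once (precision-oriented); larger μ keeps accepting extra rules per example, which is the recall-oriented regime studied in Sect. 4.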

[Figure: on the Charade side, the sets of examples covered by rules R1, R2 and R4 overlap; E8 lies in the overlap of R1 and R4, and the test example E10 in the overlap of R1 and R2. On the decision tree side, the same examples are partitioned into the disjoint leaves 1 to 4.]

Fig. 5. Influence of redundancy on learning and testing examples

4 Experimental Results

Charade's performance was assessed by measuring its ability to reproduce manual assignments on a given data set. The Reuters data set provides a means of comparison, since it has been investigated using other techniques [1, 4, 5].

4.1 The Reuters Data Set

Over the past few years, the Reuters data set has emerged as a benchmark for text categorization. It consists of 21,450 Reuters newswire stories from 1987, which have been indexed manually using 135 financial topics to support document routing and retrieval for Reuters customers. The original corpus is divided into a training set containing the 14,704 stories dated April 7, 1987 or earlier, and a test set containing the other 6,746 stories. Several thousand documents in the data set have no topic assignments; we have chosen to ignore them, since we cannot learn from them. As a result, the raw data we have worked on comprises 7,789 training examples and 3,875 testing examples. The examples are described by 22,971 words, using Lewis' representation [4].

4.2 Evaluation Measurements

In order to compare our approach with published work, we use the performance measures recall and precision, as defined in information retrieval. Recall is the percentage of times a particular category should have been assigned to a document and was in fact assigned. Precision is the percentage of category assignments actually made that were correct. Recall and precision are designed to evaluate performance on a single category. Given a set of n documents and k categories, n · k assignment decisions are made. Micro-averaging [4] considers all n · k decisions as a single group and synthesizes recall and precision for the group. Text categorization systems classically have a parameter controlling their willingness to assign categories: the more willing the system is to make decisions, the more recall goes up while precision usually goes down. A break-even point, where recall equals precision, is often chosen to summarize the results.
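Micro-averaged recall and precision can be sketched as follows, pooling all n · k (document, category) decisions. The function and its input format are our own illustration:

```python
def micro_recall_precision(decisions):
    """Micro-averaged recall and precision.

    decisions: iterable of (assigned, relevant) booleans, one per
    (document, category) pair -- all n * k decisions pooled together.
    """
    tp = sum(1 for assigned, relevant in decisions if assigned and relevant)
    fp = sum(1 for assigned, relevant in decisions if assigned and not relevant)
    fn = sum(1 for assigned, relevant in decisions if not assigned and relevant)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

The break-even point is then found by sweeping the assignment threshold of the classifier and retaining the setting where the two values returned here coincide (interpolating between the two nearest settings if they never coincide exactly).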

4.3 Overall Performance

Our primary aim was to apply the Charade system to a text categorization task and to compare the results obtained using Charade with those obtained using DT-min10 [4], an ID3-like algorithm. To do so, we conducted a set of experiments.

From Texts to Learning Examples. In order to introduce as little bias as possible into our experimental comparison, we used Lewis' preprocessing, which was available with the Reuters corpus. Each document was then turned into a training example using an information gain criterion: for each category, the features with the best information measurement were retained as attributes in Charade's description language. We first represented learning examples using binary features, i.e. we were concerned solely with the presence or absence of words in texts. All categories assigned to a document were included in the description of its corresponding learning example. However, the induction step was treated as a multiple two-class problem: assignment decisions were learned separately for each category.

Experimental Results. We ran our experiments with various sizes of attribute set per category. The results, given in Table 2, show that Charade performs better than both of Lewis' approaches under the same preprocessing constraints. Experiments with 25 attributes are discussed in the following section.
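The information gain criterion used for attribute selection can be sketched as follows (the standard formulation for a binary word feature against binary category membership; the toy corpus is hypothetical):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a pos/neg split, in bits."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(docs, word, category):
    """Information gain of a binary word feature w.r.t. one category.

    docs: list of (set_of_words, set_of_categories) pairs.
    """
    pos = sum(1 for words, cats in docs if category in cats)
    gain = entropy(pos, len(docs) - pos)
    with_w = [(w, c) for w, c in docs if word in w]
    without_w = [(w, c) for w, c in docs if word not in w]
    for subset in (with_w, without_w):
        if subset:
            p = sum(1 for _, cats in subset if category in cats)
            gain -= len(subset) / len(docs) * entropy(p, len(subset) - p)
    return gain

# Tiny hypothetical corpus: (words in document, categories assigned).
docs = [({"gold", "mine"}, {"gold"}), ({"gold"}, {"gold"}),
        ({"wheat"}, set()), ({"corn"}, set())]
# "gold" perfectly predicts the gold category here, so its gain is maximal.
```

For each category, the vocabulary would be ranked by this gain and the top 25, 50, 100 or 200 words kept as attributes, matching the attribute-set sizes of Table 2.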

Table 2. Recall/Precision break-even points for various sizes of attribute set

Method          Representation      Break-even (%)
Charade         Bool. (50 att.)     73.8
Charade         Bool. (100 att.)    73.4
Charade         Bool. (200 att.)    71.5
Charade         Bool. (25 att.)     -- (cf. Table 3)
Decision Tree   Bool.               67.0
Bayes           Bool.               65.0

Using the learned knowledge base to classify unseen testing examples, we examined the performance of the rule base on the most frequently assigned category: recall is at 84.4% while precision peaks at 93.2%. The price of this result, however, is the large number of rules generated for this category, around 450. On the other hand, when two categories are semantically related, the stories indexed by these categories may be described by the same subset of attributes; in this case the Charade system, like most TDIS, produces classifiers with less outstanding performance.

As a means of further comparison, CONSTRUE, a hand-written rule-based text categorization system, achieves a break-even point of 90% [1] on a smaller test set (723 unseen news stories). In [5], results using a different representation of examples were presented, the difference residing in text representation, i.e. the extraction of features and their conversion into attributes. Figures of 75.5% for boolean attributes, and 80.5% when the local frequency of terms is used, are reported. However, a comparison with this approach is inconclusive, as it is difficult to assess the quality of learning separately from the quality of preprocessing.

Influence of Redundancy on Text Categorization. Earlier reported experiments did not consider the influence of redundancy on performance, whereas the experiments described in this paper do. No added redundancy (μ = 1) favors precision over recall, while higher redundancy favors recall over precision (cf. Table 3). Best performance, when both recall and precision are high, is obtained when each training example is covered 3 or 4 times. This backs up our earlier intuition: there may be several reasons to assign a given category to a document.

Table 3. Influence of redundancy on precision/recall

μ     Nb. of att.   Nb. of rules   Precision   Recall
1     50            1689           82.8%       65.2%
2     50            3227           77.2%       70.5%
3     50            4669           74.8%       73.3%
4     50            6005           72.2%       74.6%
5     50            7269           70.2%       75.6%
8     50            9982           69.1%       76.1%
4     25            3282           78.5%       62.4%
8     25            5442           74.0%       64.2%
16    25            8617           70.3%       65.3%
∞     25            30570          57.1%       66.4%

Experiments with 25 attributes lead to the following observations. It is not possible to relate redundancy directly to the willingness of assignment presented by Lewis. In addition, high redundancy tends to deteriorate precision considerably, whereas recall reaches an asymptote. This last result is correlated with a major point: redundant learning is no alternative to example representation, i.e. an inadequate representation will not benefit from redundancy, as is the case with 25 attributes.

5 Discussion

We have shown experimentally that a machine learning algorithm can be successfully applied to the text categorization task, even though some experiments have revealed limitations of the Charade system where the size of the description space prohibits extensive use of the characteristics of the learning algorithm.

In order to assess whether the question of the size of the description space is peculiar to Charade, we have started a series of experiments using the CN2 algorithm. These experiments are still in progress, but preliminary results suggest a salient difference between the two algorithms. Suppose that n binary features are relevant to a category. Each news story is described by exactly n values in the CN2 formalism. On the other hand, examples in Charade are only expressed in terms of the presence of words in a given document, i.e. the description of each example need not be complete. This difference in description induces differences in the learned knowledge base. CN2 mostly generates rules whose premises are expressed in terms of the absence of certain features, while Charade generates rules based on the presence of terms. In our view, assigning categories with regard to the absence of certain features makes little sense.

However, we have encountered some problems, as yet unsolved, with the description language based on binary features. They can be stated as follows. The basic representation has a major drawback when the corpus is composed of documents of different lengths. Documents are mapped onto the set of all attributes found relevant to some category, on an information basis. Hence shorter documents may be represented by a couple of terms, whereas longer ones may include several dozen attributes. Thus the representation of a short document may be included in the description of a longer one. Let us consider examples E and E', respectively described by D ∧ (Class = c1) and (D ∧ d) ∧ (Class = c2). When Charade is faced with E and E', it does not generate rules that cover E, since E' is a negative example to any rule "if D then Class = c1". Nevertheless, we believe that these problems will be lifted once the full descriptive power of Charade is taken into account. Charade supports learning using domain knowledge, which can, for instance, be expressed using hierarchies. In the context of natural language, synonymy or ambiguity can be considered as additional knowledge that could be provided to enhance learning. We intend to carry out future research in this direction with the help of a knowledge acquisition software program [17] in which Charade has been embedded.

To sum up, reducing the search space is one of our research directions, in order to address more complex problems. Several approaches are possible: exploring the description space using new heuristics, reducing the number of training examples, and refining the learned rule base by post-processing. Furthermore, generating rules with exceptions may help in exploring the description space and in overcoming the lack of descriptive power of the binary representation. Finally, we have studied text categorization as a first approach to text indexing with a controlled vocabulary. We shall now analyze how external knowledge can be represented and how it influences the performance of our learning system.

References

1. P. Hayes and S. Weinstein. CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.
2. P. Hayes, P. Andersen, I. Nirenburg, and L. Schmandt. TCS: A Shell for Content-Based Text Categorization. In Proceedings of the Sixth IEEE CAIA, pages 321-325, 1990.
3. G. DeJong. An Overview of the FRUMP system. In W. H. Lehnert and M. H. Ringle, editors, Strategies for Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, New Jersey, USA, 1982.
4. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, USA, 1994.
5. C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, July 1994.
6. M. E. Maron. Automatic indexing: an experimental inquiry. Journal of the Association for Computing Machinery, (8):404-417, 1961.
7. Y. Yang. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In Proc. of the 17th SIGIR, 1994.
8. N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, and K. Tzeras. AIR/X - a Rule-Based Multistage Indexing System for Large Subject Fields. In Proc. of RIAO'91, Barcelona, Spain, 1991.
9. B. Masand, G. Linoff, and D. Waltz. Classifying News Stories using Memory Based Reasoning. In Proc. of the 15th SIGIR, Copenhagen, Denmark, 1992.
10. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors. Machine Learning: an Artificial Intelligence Approach (Vol. 2). Morgan Kaufmann, Los Altos, California, USA, 1986.
11. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
12. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-284, 1989.
13. J.-G. Ganascia. Deriving the learning bias from rule properties. In J. E. Hayes, D. Michie, and E. Tyugu, editors, Machine Intelligence 12, pages 151-167. Clarendon Press, Oxford, 1991.
14. J.-G. Ganascia. TDIS: an Algebraic Formalization. In International Joint Conference on Artificial Intelligence, Chambery, France, 1993.
15. G. Birkhoff. Lattice Theory. American Mathematical Society, Providence, Rhode Island, third edition, 1967.
16. M. Gams. New measurements that highlight the importance of redundant knowledge. In M. Morik, editor, Proc. of the Fourth European Working Session on Learning, Montpellier, France, 1989. Pitman/Morgan Kaufman.
17. J. Thomas, J.-G. Ganascia, and P. Laublet. Model-driven knowledge acquisition and knowledge-based machine learning: an example of a principled association. In Workshop IJCAI 16, Chambery, France, 1993.
