Combining rule-based and case-based learning for iterative part-of-speech tagging

Alneu de Andrade Lopes, Alípio Jorge
LIACC - Laboratório de Inteligência Artificial e Ciências de Computadores
Universidade do Porto - R. do Campo Alegre 823, 4150 Porto, Portugal
E-mail: [email protected], [email protected]

Abstract. In this article we show how the accuracy of a rule-based first order theory may be increased by combining it with a case-based approach in a classification task. Case-based learning is used when the rule language bias is exhausted. This is achieved in an iterative approach. In each iteration, theories consisting of first order rules are induced and covered examples are removed. The process stops when it is no longer possible to find rules with satisfactory quality. The remaining examples are then handled as cases. The case-based approach proposed here is also, to a large extent, new. Instead of only storing the cases as provided, it has a learning phase where, for each case, it constructs and stores a set of explanations with support and confidence above given thresholds. These explanations have different levels of generality, and the maximally specific one corresponds to the case itself. The same case may have different explanations representing different perspectives of the case. Therefore, to classify a new case, the method looks for relevant stored explanations applicable to that case. The different possible views of the case given by the explanations correspond to considering different sets of conditions/features to analyze the case. In other words, they lead to different ways to compute the similarity between known cases/explanations and the new case to be classified (as opposed to the commonly used global metric). Experimental results obtained on a corpus of Portuguese texts for the task of part-of-speech tagging show a significant improvement.

1

Introduction

Often, computational natural language processing requires that each word in a given text is correctly classified according to its role. This classification task is known as part-of-speech tagging and it consists in assigning to each word in a given body of text an appropriate grammatical category like noun, article, ordinal number, etc., according to the role of the word in that particular context. These categories are called part-of-speech tags and may total a few tens, depending on the variants one considers for each particular category. The difficulty of this task lies in the fact that a given word may play different roles in different contexts. Although there is, for each word, a relatively small set of possible tags, for many words there is more than one tag.

Words with a single possible tag are handled by employing a simple lookup table (dictionary). The ambiguity for words with more than one possible tag is solved by considering the context of the word and possibly employing background knowledge. Tagging words manually is a tedious and error-prone activity, and learning approaches have proven useful in the automation of this task. In (Jorge & Lopes 1999) an iterative approach was proposed that can learn, from scratch, a recursive first order decision list able to tag words in a text. This approach proceeds by learning a theory containing context-free tagging rules in the first iteration and then by learning context-dependent recursive theories in subsequent iterations until all words can be tagged. The training data employed for each iteration consists of the words that could not be tagged in previous iterations. Despite the high predictive ability of the iterative tagging approach, it was clear that the theory produced in the last iteration was responsible for a large number of errors that had a large impact on the overall result. These remaining cases include noise, rare exceptions, or examples that cannot be expressed in the given language. Therefore, we expected that a case-based approach could cope with these residual cases better than an induced theory. However, experiments show that a traditional case-based reasoning algorithm, based on overlapping similarity measures and weighted features, does not improve the results of the rule-based approach (section 8). Besides, it is difficult to choose appropriate weights for the features. Although the closest neighbors are typically more relevant, and therefore require higher weights, it is often the case that the tag of one word is determined by other, more distant neighbors. Situations like these may be common in the residual cases handled in the last iteration.
The current approach exploits the concept of case understanding to allow the retrieval of similar cases that are good for solving the new one. For that, we have developed a case-based reasoning algorithm, RC2, consisting of two phases. In the learning phase, it constructs explanations of the cases. In the classification phase, it uses these explanations to classify new cases. Instead of using a fixed metric, each case is analyzed using an appropriate set of conditions defined by the constructed explanations. In summary, in this paper we present an iterative strategy that combines rule-based learning in the first iterations with case-based learning in the last one. This paradigm shift during learning overcomes the limitations of the rule language and increases the predictive accuracy both in the last iteration and overall. The approach proposed is new and will be described in some detail.

2

The Problem

A sentence is a sequence of words, and we are interested in assigning an appropriate tag to each of its constituents. The tag is an appropriate grammatical category. Usually, from the original sequence we take a part of it (a window) with some number of elements to the left and to the right of a given position, and we try to predict the value related to the element in the central position, in general employing background knowledge as well. In the task of part-of-speech tagging, the assignment must take into account the role of the word in that particular context. The difficulty of this task lies in the fact that a given word may play different roles in different contexts. Below we show two sequences representing two sentences.

Words: The  car  is  red  .    I   like  the  car  .
Tags:  art  n    v   adj  dot  pr  v     art  n    dot

We start by representing the text or the primary sequence of words to be tagged as a set of facts.

{word(s1,1,'The'), word(s1,2,car), word(s1,3,is), word(s1,4,red), word(s1,5,'.'),
 word(s2,1,'I'), word(s2,2,like), word(s2,3,the), word(s2,4,car), word(s2,5,'.')}

In the above example, s1 and s2 are sentence labels. The second argument is the position of the word within the sentence. Punctuation marks such as "." are regarded as words. The corresponding tags are also represented as facts.

{tag(s1,1,art), tag(s1,2,n), tag(s1,3,v), ..., tag(s2,5,dot)}
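For illustration only, the word/3 and tag/3 facts above can be mirrored as Python tuples; the data layout and the `tag_of` helper below are ours, not part of the paper's Prolog representation:

```python
# word(S, P, W) as a triple (sentence_id, position, word).
words = {
    ("s1", 1, "The"), ("s1", 2, "car"), ("s1", 3, "is"),
    ("s1", 4, "red"), ("s1", 5, "."),
    ("s2", 1, "I"), ("s2", 2, "like"), ("s2", 3, "the"),
    ("s2", 4, "car"), ("s2", 5, "."),
}
# tag(S, P, T) as a triple (sentence_id, position, tag).
tags = {
    ("s1", 1, "art"), ("s1", 2, "n"), ("s1", 3, "v"),
    ("s1", 4, "adj"), ("s1", 5, "dot"),
    ("s2", 1, "pr"), ("s2", 2, "v"), ("s2", 3, "art"),
    ("s2", 4, "n"), ("s2", 5, "dot"),
}

def tag_of(sentence, position):
    """Look up the tag at a sentence position, or None if untagged."""
    for (s, p, t) in tags:
        if s == sentence and p == position:
            return t
    return None
```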

3

The Iterative Induction Strategy

The iterative approach to part-of-speech tagging presented in Jorge & Lopes (1999) tackles the problem of learning recursive first order clauses and is mainly based on work done in the context of inductive program synthesis (Jorge & Brazdil 1996) and (Jorge 1998). In this approach, we start by inducing clauses that are able to determine the tag of some of the words in a given set, without any context information and with confidence above a given threshold. These are the first clauses to be applied in the classification. They are the base clauses of the recursive definition we want to induce and are not recursive. These clauses are also used to enrich the background knowledge, thus enabling and/or facilitating the synthesis of recursive clauses in the following iterations. Having obtained this first layer of clauses, let us call it T1, we are able to classify (tag) some of the words in the text used for training. Using the answers given by this theory T1, we may induce some recursive context clauses, thus obtaining theory T2. By iterating the process, we obtain a sequence of theories T1, T2, ..., Tn. The final theory is T = T1 ∪ T2 ∪ ... ∪ Tn. To induce each theory in the sequence we may apply a sort of covering strategy, by considering as training examples in iteration i only the ones that have not been covered by theories T1, ..., Ti-1. We stop iterating when all the examples have been

covered, or when we cannot find any clauses. To handle the remaining examples, we consider all clauses for selection regardless of their confidence (quality). The construction of each theory T1, T2, ... is done by a given learning algorithm. In this article the learning algorithm ALG used is the first order rule inducer CSC(RC1) for all but the last iteration, where we employ a case-based reasoning strategy.

Algorithm 1: Iterative Induction

Given:
  language L, examples E, background knowledge BK, confidence level C,
  learning algorithm ALG(E, BK, C)
Find:
  a theory T in L

Uncovered ← E, T ← ∅, i ← 1
Do
  Ti ← ALG(Uncovered, BK, C)
  T ← T ∪ Ti
  BK ← BK ∪ Ti
  Uncovered ← Uncovered − covered_examples(Ti)
  i ← i + 1
Until covered_examples(Ti) = ∅ or Uncovered = ∅
T ← T ∪ ALG(Uncovered, BK, 0)

Example: Assume that the background knowledge includes the definition of the predicate word/3 (describing the text) and window/9 defined as

window(P,L1,L2,L3,L4,R1,R2,R3,R4) ←
  L1 is P-1, L2 is P-2, L3 is P-3, L4 is P-4,
  R1 is P+1, R2 is P+2, R3 is P+3, R4 is P+4.
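Algorithm 1's covering loop can be sketched in Python; `toy_alg` and `toy_covered` below are hypothetical stand-ins for the actual learner CSC(RC1) and its coverage test, used only to make the sketch runnable:

```python
# Sketch of Algorithm 1 (iterative induction) with a pluggable learner.
def iterative_induction(examples, bk, confidence, alg, covered_examples):
    uncovered = set(examples)
    theory = set()
    while True:
        t_i = alg(uncovered, bk, confidence)       # induce one layer of rules
        theory |= t_i
        bk = bk | t_i                              # rules enrich the background
        covered = covered_examples(t_i, uncovered)
        uncovered -= covered
        if not covered or not uncovered:
            break
    # last layer: accept any clause regardless of confidence (threshold 0)
    theory |= alg(uncovered, bk, 0)
    return theory

# Toy demonstration: a "rule" simply memorizes an example whose score passes C.
def toy_alg(examples, bk, c):
    return {("rule", e) for e in examples if e >= c}

def toy_covered(theory, examples):
    return {e for e in examples if ("rule", e) in theory}

theory = iterative_induction({0.1, 0.9}, set(), 0.8, toy_alg, toy_covered)
```

With confidence 0.8, the first iteration covers only the 0.9 example; the final call with threshold 0 picks up the residual 0.1 example.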

In iteration 1, non-recursive rules like the following are induced:

tag(A,B,adj) ← word(A,B,portuguesa), !.
tag(A,B,n) ← word(A,B,documento), !.

These rules are defined solely in terms of the background predicate word/3. They do not depend on the context of the word to be tagged. Before proceeding to iteration 2, we add these rules to the background knowledge. In iteration 2, some words can be tagged using the rules induced in iteration 1, so rules can now be defined in terms of both the word to tag and its context. In this second iteration we also find many non-recursive rules. In subsequent iterations more clauses will appear until the stopping criterion is satisfied. Recursive rules like the following appear:

tag(A,B,art) ← window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    tag(A,L1,prep), tag(A,R1,n), tag(A,L2,n),
    tag(A,R2,virg), tag(A,L3,prep), !.
tag(A,B,art) ← window(A,B,L1,L2,L3,L4,R1,R2,R3,R4), word(A,B,a),
    tag(A,R2,prep), tag(A,R3,n), tag(A,R4,prep), !.

In general, the total number of iterations depends on the data, the language, and the underlying learning algorithm employed. For the experiments described in this article, the typical number of iterations was 5.

4

The Case-Based Approach in the Iterative Strategy

By observing the partial results of each theory T1, T2, ... produced by iterative induction, it was clear that the last theory in the sequence is responsible for a large number of wrong answers. This is not surprising, since the examples left for the last iteration are the most difficult ones: previous iterations failed to find good clauses for them in the given language. To improve the results we have shifted the bias in the last iteration by applying case-based reasoning.

There are two main approaches to computing similarity between cases, syntactic and semantic (Aamodt & Plaza 1994). In the syntactic methods, similarity is inversely proportional to the distance between cases, and a case is described as a vector of feature-values and a corresponding class. Semantic methods, on the other hand, employ background knowledge and are able to explain cases and to use these explanations to retrieve and adapt cases. For this work, we adopted this second view of cases.

Preliminary experiments we conducted employed syntactic methods. In these experiments we defined the cases as a set of features corresponding to a neighborhood of length 11 around the word to tag. The overlapping metric used divided the number of matching features by the total number of features representing the context of the word. Weights of features were manually set. The best result was not better than previous results obtained with rules only (Table 1 and Figure 1). Besides, setting the appropriate weight for each position in the window is a difficult task. Although closer neighbors tend to be more relevant for tagging, more distant words may be important in certain contexts.

These results motivated the use of the semantic approach to CBR. For that, we developed the new algorithm RC2 (Rules and Cases), which constructs explanations from cases and uses these explanations to classify new cases. Explanations are constructed at different levels of generality, enabling different views of the case.
These different views correspond to different case filters, each suited to a particular kind of case. A consequence of this is that, differently from usual case-based systems, we do not use a fixed metric to retrieve cases, but an appropriate set of conditions according to the new case being analyzed. In the following sections we describe in detail our concept of cases and the construction and use of explanations.

5

Cases

To decide which tag T should be assigned to a given word at a given position P in a sequence, the new case description must take into account the word at that position, the context of that word, or both. For practical reasons, this context was limited to the tags at the five positions to the left of P and the five positions to the right of P (a window of size 11). In general, the context may include any information regarding that word. In our approach, case descriptions can be regarded as ground clauses of the form:

tag(S,P,T) ← window(P,L1,L2,L3,L4,L5,R1,R2,R3,R4,R5), word(S,P,W),
    tag(S,R1,TR1), tag(S,R2,TR2), tag(S,R3,TR3), tag(S,R4,TR4), tag(S,R5,TR5),
    tag(S,L1,TL1), tag(S,L2,TL2), tag(S,L3,TL3), tag(S,L4,TL4), tag(S,L5,TL5).

For example, the case associated with position 2 in sentence s1 is described by the following ground clause:

tag(s1,2,n) ← window(2,1,0,-1,-2,-3,3,4,5,6,7), word(s1,2,car),
    tag(s1,3,v), tag(s1,4,adj), tag(s1,5,dot), tag(s1,6,pr), tag(s1,7,v),
    tag(s1,1,art), tag(s1,0,'?'), tag(s1,-1,'?'), tag(s1,-2,'?'), tag(s1,-3,'?').

Notice that the context here corresponds to the literals that define the neighborhood of the position being classified. Also notice that a case corresponds to a maximally specific clause in terms of the description of its context.
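A minimal sketch of building such a windowed case, assuming the text is a flat 1-based list of words with a parallel list of tags (out-of-range positions receive the '?' tag, as in the ground clause above); all names and the tuple layout are ours:

```python
# Represent a case as (word, left_tags, right_tags, tag) with a window of
# `width` tag positions on each side of the word being classified.
def make_case(words, tags, position, width=5):
    """Build the case for `position` from parallel word/tag lists (1-based)."""
    def tag_at(p):
        return tags[p - 1] if 1 <= p <= len(tags) else "?"
    left = [tag_at(position - i) for i in range(1, width + 1)]
    right = [tag_at(position + i) for i in range(1, width + 1)]
    return (words[position - 1], left, right, tags[position - 1])

# The running example, with the two sentences laid out as one sequence.
words = ["The", "car", "is", "red", ".", "I", "like"]
tags = ["art", "n", "v", "adj", "dot", "pr", "v"]
case = make_case(words, tags, 2)
```

The case for position 2 reproduces the ground clause above: word "car", left context ["art", "?", "?", "?", "?"], right context ["v", "adj", "dot", "pr", "v"], class "n".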

6

Case Explanations

For each case we have a set of explanations. These are clauses that are more general than the case, given some language constraints. Let C be a case and L be a clause language; the set of explanations exp(C) is

exp(C) = { E: A→B ∈ L | E θ-subsumes C }

As described below, we will construct a subset of these explanations and select only the ones that apply to a large number of cases. That will be measured by the support and confidence parameters, defined as:

Support(A→B) = #{ true instances of A∧B }
Cf(A→B) = #{ true instances of A∧B } / #{ true instances of A }

One explanation associated with the case in the previous section could be

tag(S,Pos,n) ← window(Pos,L1,L2,L3,L4,L5,R1,R2,R3,R4,R5), word(S,Pos,car),
    tag(S,R1,v), tag(S,R2,adj), tag(S,R3,dot), tag(S,R4,pr), tag(S,R5,v),
    tag(S,L1,art), tag(S,L2,'?'), tag(S,L3,'?'), tag(S,L4,'?'), tag(S,L5,'?').

Other explanations can be obtained by deleting literals in the body of the clause. Each explanation built by RC2 is obtained by generalizing a pair of cases of the same class. This is feasible since we are dealing with a relatively small set of residual cases (about 400). The number of explanations can also be controlled by defining an appropriate support threshold and language bias. To obtain the generalization of two cases C1 and C2, we first compute the least general generalization (lgg) of C1 and C2 and then remove literals of the form tag(X,Y,Z) where Y or Z are variables that occur nowhere else in the clause. The explanations with support and confidence above given thresholds are stored in a base of explanations. Besides support and confidence, each explanation is characterized by its level of generality, which is the number of literals defining the context used in the explanation.

Algorithm 2: Explanation Base Construction

Given:
  cases C, background knowledge BK, minimal support MS, minimal confidence MC

For each pair of cases (c1, c2) in C with the same class,
  construct the explanation exp = filtered lgg(c1, c2)
  such that Support(exp) ≥ MS and Cf(exp) ≥ MC

We call this set of explanations the Explanation-Base.
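Algorithm 2 can be sketched in a simplified propositional form, where a case is a (context_dict, class) pair and the filtered lgg keeps only the features the two cases agree on; this is an illustration of the idea under our own data layout, not the paper's relational lgg:

```python
# Propositional rendering of Algorithm 2: dropping the features where two
# cases disagree mirrors the removal of tag/3 literals whose variables
# occur nowhere else in the clause.
def filtered_lgg(case1, case2):
    ctx1, cls1 = case1
    ctx2, cls2 = case2
    assert cls1 == cls2, "only pairs of the same class are generalized"
    return ({k: v for k, v in ctx1.items() if ctx2.get(k) == v}, cls1)

def support_and_confidence(explanation, cases):
    """Support = #cases matching body and head; Cf = support / #body matches."""
    ctx, cls = explanation
    body = [c for c in cases if all(c[0].get(k) == v for k, v in ctx.items())]
    both = [c for c in body if c[1] == cls]
    return len(both), (len(both) / len(body) if body else 0.0)

def explanation_base(cases, min_support, min_confidence):
    base = []
    for i, c1 in enumerate(cases):
        for c2 in cases[i + 1:]:
            if c1[1] != c2[1]:
                continue  # only same-class pairs are generalized
            exp = filtered_lgg(c1, c2)
            s, cf = support_and_confidence(exp, cases)
            if s >= min_support and cf >= min_confidence:
                base.append((exp, s, cf))
    return base

# Two "n" cases sharing the same context generalize into one explanation.
cases = [
    ({"W": "car", "R1": "v", "L1": "art"}, "n"),
    ({"W": "boat", "R1": "v", "L1": "art"}, "n"),
    ({"W": "red", "R1": "dot", "L1": "v"}, "adj"),
]
base = explanation_base(cases, min_support=2, min_confidence=0.8)
```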

7

Using Explanations

The tagging of a corpus using a theory produced by iterative induction is also done in an iterative way. Initially, no word occurrence in the corpus is tagged. Then the induced theories T1, T2, ..., Tn are applied in sequence. Each theory tags some of the words, using the tagging done by previous theories. In the last iteration we use case-based classification. To tag one occurrence of a word in a sentence using explanations, we first represent that occurrence as a case in the case language defined in section 5. As described there, the case contains the context information for that particular position in the sentence. Since many of the words have already been tagged in previous iterations, the context of a word contains the known tags neighboring that word.

After this, we look for the explanation in the explanation-base that maximizes the similarity measure described below. This is, in some aspects, similar to a regular instance-based approach. The main difference is that here we may have, in the explanation-base, a set of explanations with different levels of generality and different views for each training case. Given a case explanation A and a case B, the similarity metric used combines an overlapping metric, given by the number of matched literals divided by the total number of literals in the case explanation (Dist), with the confidence of the explanation used (Cf) and the level of generality of the explanation (N):

Sim(A, B) = Dist × Cf × log(N × 10/M)

where M is the number of literals in the maximally specific explanation. The value of Sim ranges from 0 to 1. To count the matching literals of a case and an explanation, we first unify the head of the clause representing the case with the head of the clause representing the explanation. A literal Lc in the body of the case matches a literal Le in the body of the explanation if they are the same. We are assuming that all the variables in the explanation will be instantiated after unifying its head. When more than one explanation with the same level of generality and the same number of matched conditions applies to a case, the explanation with higher confidence is preferred.

The factor log(N × 10/M) is an ad hoc generality measure that gives more weight to more specific explanations. Considering that the approach is case-based, it is natural to prefer, under similar circumstances, explanations closer to the new case (the most specific ones). Experiments not reported here have confirmed that retrieving explanations by using this generality measure works better than structuring the retrieval process by level of generalization, starting with the most specific explanations. A maximally specific explanation can be seen as the case itself.
In this case, the confidence is typically 1, log(N × 10/M) becomes 1, and the similarity metric is reduced to the usual overlapping metric. Note that log(N × 10/M) is negative when N < M/10. This happens when the explanation used has less than 10% of the literals of the most specific one. For the part-of-speech tagging approach described here, this generality measure ranged from 0 (N = 1, M = 10) to 1 (N = 10, M = 10). It is important to note that the main difference between an explanation and a rule lies in the fact that when using one explanation we do not have to match all literals. Besides, in a rule-based approach it is necessary to select, from all the hypotheses, an appropriate set of rules. Here we only have to store the explanations.
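The similarity measure can be sketched as follows, assuming the base-10 logarithm (which makes the generality factor range from 0 at N = M/10 to 1 at N = M, matching the values reported above); the dictionary representation of literals is ours:

```python
import math

# Sim(A, B) = Dist x Cf x log10(N x 10 / M), where Dist is the fraction of
# the explanation's literals matched by the case, Cf its confidence, N its
# number of context literals, and M the number of literals in the maximally
# specific explanation.
def similarity(explanation_ctx, case_ctx, confidence, max_literals):
    matched = sum(1 for k, v in explanation_ctx.items()
                  if case_ctx.get(k) == v)
    dist = matched / len(explanation_ctx)      # overlapping component
    n = len(explanation_ctx)                   # level of generality
    generality = math.log10(n * 10 / max_literals)
    return dist * confidence * generality

# A maximally specific explanation (N = M = 10) matching its own case with
# confidence 1 reduces to the plain overlapping metric: Sim = 1.
full = {f"p{i}": "t" for i in range(10)}
# A one-literal explanation (N = 1, M = 10) gets generality log10(1) = 0.
single = {"p0": "t"}
```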

8

Results

In the experiments conducted, we observed that the use of a case-based approach in the last iteration of an iterative induction strategy instead of induced rules improves the accuracy results.

We have used the described approach in a set of experiments with a corpus of Portuguese text containing more than 5000 words. The corpus had been manually tagged and was divided into a training set with the first 120 sentences (about 4000 words) and a test set with the remaining 30 sentences (about 1000 words). The theories were induced using the information in the training set only, and then we measured the success rate of the theories on the test set. Notice that the learning task we consider here starts with no dictionary. In fact, the dictionary is learned and is expressed as rules that will be part of the final theory produced. In this experimental framework, tagging words of the test set is a hard task, since approximately 30% of the words do not occur in the training set.

We now give some details about the synthesis of the theory associated with the result shown in Table 1. In the first four iterations a large number of rules (more than 350) are induced. Some 350 appear in iteration 1 and do not take the context into account. In the experiments, the minimal confidence of the rules for each iteration was 0.8 and the minimal support was 2. In iteration 2 many recursive rules (about 200) appeared. The iterative induction algorithm went through three more iterations. The number of rules induced at each iteration tends to decrease very rapidly.

Table 1. Success rates on the test set with the Lusa corpus

Algorithm                                  Test accuracy
Iterative CSC(RC1) with CBR at it. 5       0.792
Iterative CSC(RC1) with rules only         0.806
Iterative CSC(RC1) with RC2 at it. 5       0.836

Table 1 shows the overall success rates obtained by using iterative induction with each one of three different algorithms in the last iteration (it. 5).

[Figure 1: plot of error (0 to 0.25) against coverage (50% to 100%) for the CBR, RC1, and RC2 variants.]

Fig. 1. Coverage × Error. The first four iterations use the CSC(RC1) algorithm and in the last one we use the algorithms CBR, RC1, and RC2.

Figure 1 shows the coverage vs. error rate obtained using CSC(ALG) in each iteration. In the first 4 iterations, ALG is the rule learner RC1. In the last iteration we have used the algorithm CBR (case-based reasoning with an overlapping metric), RC1 with a new set of quality parameters to answer the remaining cases, and RC2 (using explanations). The total number of iterations of the learning process depends on the data, the language, and the quality parameters (minimal confidence, support, and selection algorithm). The iterative induction stops when the coverage in a given iteration is close to zero. This strategy yielded 5 iterations. In the case of RC2, the untagged words at iteration 5 (about 400) were stored in a case-base and used to construct explanations (about 1300). The result shown for CBR in Table 1 was the best one achieved using a simple overlapping metric and setting the weights manually.

9

Related work

The system SKILit (Jorge & Brazdil 1996, Jorge 1998) used the technique of iterative induction to synthesize recursive logic programs from sparse sets of examples. Many other ILP (Inductive Logic Programming) approaches to the task of part-of-speech tagging exist. The ones most directly related to our work are (Cussens 1997) and (Dehaspe 1997), where relational learning algorithms are employed in the induction of rule-based taggers. More recently, Cussens et al. (1999) used the ILP system P-Progol to tag Slovene words. Lindberg and Eineborg (1999) used P-Progol, with linguistic background knowledge, to induce constraint grammars for tagging Swedish words. Horváth et al. (1999) tried different learning algorithms for tagging Hungarian. One of the systems that obtained good results was RIBL, a relational instance-based learning system. In (Liu et al. 1998) a propositional learning algorithm was proposed that is similar in structure to CSC. The main differences are that CSC is relational and is used here in an iterative way.

The methodology proposed here is one of a number of possible hybrid approaches combining cases and rules. The main motivations found in the literature for this combination are efficiency improvement and accuracy improvement. For example, Golding and Rosenbloom (1996) use a set of approximately correct rules to obtain a preliminary answer for a given problem. Cases are used to handle exceptions to the rules. Rules are also used for case retrieval and case adaptation. This approach yielded good accuracy results in the task of name pronunciation. Domingos (1996) proposes a propositional framework (and the learning system RISE) for the unification of cases and rules by viewing cases as most specific rules. The class of a new example is given by the nearest rule (or case) according to a given distance function. Rules are constructed by generalizing examples and other rules.
Only generalizations that improve global accuracy are maintained. Our approach differs from RISE in some aspects. First, ours is a relational approach that can use background knowledge. Second, contrary to what happens in RISE, when an explanation is generated it does not replace the cases being generalized. We believe that this use of redundancy is important for a difficult set of examples like the ones treated in the last iteration of the inductive process. Another difference is that we use rules (in the first iterations) only while these have satisfactory quality, and cases when the rule language is exhausted.

10

Conclusion

In an inductive process such as the iterative strategy, where the most visible patterns (represented as rules) are identified first, we typically end up with a set of residual examples that cannot be reliably captured by the initial bias. As can be seen in Figure 1, the effectiveness (coverage) of the bias decreases from iteration to iteration. The paradigm shift in the last iteration, when using a case-based approach with explanations (RC2), significantly improves the accuracy both in that iteration and overall. Case explanation was able to explore particularities of the cases not explored by the language bias in the previous rule-based inductive process.

Generating all explanations could be intractable for large corpora. However, the iterative approach used leaves only a relatively small set of cases for the last iteration. The methodology proposed here has also explored and formalized some concepts such as case explanation, context, similarity assessment considering semantic aspects, as well as the use of background knowledge to understand and retrieve cases.

Although the iterative learning process is described here as starting from scratch, previously acquired tagging knowledge could have been used before learning. Likewise, we may have some words tagged before using the induced theories or the constructed explanation-base for tagging. Since we are using a first order setting, richer background knowledge can also be used in the learning process. However, this would probably motivate some more elaboration of the explanation matching concept.

Acknowledgements The authors would like to thank the support of project Sol-Eu-Net IST 1999 - 11495, project ECO under Praxis XXI, FEDER, and Programa de Financiamento Plurianual de Unidades de I&D. The first author would also like to thank the support of CNPq Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil. The Lusa corpus was kindly provided by Gabriel Pereira Lopes and his NLP group.

References

1. Aamodt, A.; Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, Vol. 7, Nr. 1 (1994), 39-59.
2. Cussens, J.; Dzeroski, S.; Erjavec, T.: Morphosyntactic Tagging of Slovene Using Progol. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds.), LNAI 1634, 1999.
3. Cussens, J.: Part of Speech Tagging Using Progol. In Inductive Logic Programming: Proceedings of the 7th Int. Workshop (ILP-97). LNAI 1297, 1997.
4. Domingos, P.: Unifying Instance-Based and Rule-Based Induction. Machine Learning 24 (1996), 141-168.
5. Golding, A. R.; Rosenbloom, P. S.: Improving Accuracy by Combining Rule-based and Case-based Reasoning. Artificial Intelligence 87 (1996), 215-254.
6. Horváth, T.; Alexin, Z.; Gyimóthy, T.; Wrobel, S.: Application of Different Learning Methods to Hungarian Part-of-Speech Tagging. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds.), LNAI 1634, 1999.
7. Jorge, A.; Brazdil, P.: Architecture for Iterative Learning of Recursive Definitions. In Advances in Inductive Logic Programming, De Raedt, L. (Ed.), IOS Press, 1996.
8. Jorge, A.; Lopes, A.: Iterative Part-of-Speech Tagging. Learning Language in Logic (LLL) Workshop, Cussens, J. (Ed.), 1999.
9. Jorge, A.: Iterative Induction of Logic Programs: an approach to logic program synthesis from incomplete specifications. Ph.D. thesis, University of Porto, 1998.
10. Lindberg, N.; Eineborg, M.: Improving Part-of-Speech Disambiguation Rules by Adding Linguistic Knowledge. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds.), LNAI 1634, 1999.
11. Liu, B.; Hsu, W.; Ma, Y.: Integrating Classification and Association Rule Mining. Proceedings of KDD 1998, pp. 80-86, 1998.