Constructing a Decision Tree Classifier using Lexical and Syntactic Features

Eamonn Newman, John Dunnion and Joe Carthy
School of Computer Science and Informatics
University College Dublin
Dublin, Ireland
[email protected]

Abstract

Our systems (ndsc and ndsc+) use linguistic features as training data for a decision tree classifier. These features are derived from the text–hypothesis pairs under examination. The classifier uses the features to decide whether the given hypothesis can be entailed from the text. This decision has an associated confidence value, which is a function of the confidences of the different branches of the tree.

1 Introduction

The Second PASCAL RTE Challenge presented participants with a corpus which was very similar to that of the previous challenge. As a preliminary approach to the new challenge, we decided to augment our original system in an attempt to improve its performance. To gauge the improvement, we use our original system as a benchmark. This paper describes the new features added to the system, discusses the difference in performance between the two systems, and analyses the reasons for this difference. In the paper, we refer to our original system as ndsc and to the augmented system as ndsc+. In Section 2, we give an overview of our system, with emphasis on the new features added to improve the performance of ndsc. This is followed in Section 3 by an analysis of the performance, in which we present a few examples which show the different behaviour of ndsc+ and ndsc. We discuss these examples and the corpus in general in Section 4, looking in particular at certain features of the test set and classifiers. In Section 5, we present our plans for further experimental work and future research.

2 System Description

We submitted two systems to the Challenge. Both systems are decision–tree–based classifiers, produced using C5. The classifiers use lexico–linguistic features of the text–hypothesis pairs to inform their decisions. Ndsc uses the same features as the system which we submitted to the 1st PASCAL Challenge (Newman et al., 2005). This system provides a benchmark against which we can measure our second system, and also allows us to make a comparison between the 1st and 2nd RTE Challenge corpora.

The system uses a decision tree classifier to detect an entailment relationship between pairs of sentences that are represented using a number of different features, such as lexical, semantic and grammatical attributes of nouns, verbs and adjectives. This entailment classifier was generated from the RTE training data using the C5 machine learning algorithm (Quinlan, 2002). The features used to train and test the classifier were calculated using the following similarity measures (a small illustrative sketch of the simplest of these is given after the list):

• The ROUGE (Recall–Oriented Understudy for Gisting Evaluation) n-gram overlap metrics (Lin and Hovy, 2004), which have been used as a means of evaluating summary quality at the DUC summarisation workshops. The ROUGE package provides measurement options such as uni-gram, bi-gram, tri-gram and 4-gram term overlap, and a weighted and an unweighted longest common subsequence overlap measure.

• The Cosine Similarity metric, which calculates the cosine of the angle between the respective term vectors of the sentence pair.

• The Task feature, which informs the classifier from which portion of the corpus the current sentence pair comes.

• The Hirst–St-Onge WordNet–based measure (Budanitsky and Hirst, 2001), an edge–counting metric that estimates the semantic distance between words by counting the number of relational links between them in the WordNet taxonomy (Fellbaum, 1998). This metric also defines constraints on the length of the path and the types of transitive relationships that are allowed between concepts (nodes) in the taxonomy. These constraints are important because, unlike other WordNet–based semantic relatedness measures (which only consider IS–A relationships), the Hirst–St-Onge metric searches for paths that traverse the IS–A and HAS–A hierarchies in the noun taxonomy. Hence, this metric provides better coverage, at the increased risk of detecting spurious relationships that would arise if unrestricted paths were allowed between concepts. This feature was implemented using the Perl WordNet::Similarity modules developed by Patwardhan et al. (2002).

• A verb–specific semantic overlap metric, which uses the VerbOcean semantic network (Chklovski and Pantel, 2004; Chklovski and Pantel, 2005) to identify instances of antonymy and near-synonymy between verbs. The relationships between verb–pairs in VerbOcean were gleaned from the web using lexico–syntactic patterns. Although WordNet provides a verb taxonomy, the VerbOcean data was used because we found it provides better coverage of the antonymy relationships needed for detecting entailment.

• A Latent Semantic Indexing (LSI) measure (Deerwester et al., 1990) which, like the WordNet measure, attempts to calculate similarity beyond vocabulary overlap by identifying latent relationships between words through the analysis of co-occurrence statistics in an auxiliary news corpus.

• The final similarity measure is based on a more thorough examination of verb semantics. This measure finds the longest common subsequence in the sentence–pair, and then detects evidence of contradiction or entailment in the subsequence (such as verb negation, synonymy, near-synonymy and antonymy) using the VerbOcean taxonomy.
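The following is a minimal sketch, in Python and purely for illustration, of how the two simplest lexical measures above, unigram overlap and cosine similarity, can be computed for a text–hypothesis pair. It is not the implementation used in our system, which relies on the ROUGE package, WordNet::Similarity, VerbOcean and an LSI model; all function names here are our own.

from collections import Counter
from math import sqrt

def tokens(sentence):
    # Crude whitespace tokenisation and lower-casing.
    return sentence.lower().split()

def unigram_overlap(text, hypothesis):
    # Fraction of hypothesis unigrams that also occur in the text
    # (a ROUGE-1-recall-style score).
    t, h = Counter(tokens(text)), Counter(tokens(hypothesis))
    matched = sum(min(count, t[word]) for word, count in h.items())
    return matched / sum(h.values()) if h else 0.0

def cosine_similarity(text, hypothesis):
    # Cosine of the angle between the term-frequency vectors of the pair.
    t, h = Counter(tokens(text)), Counter(tokens(hypothesis))
    dot = sum(t[word] * count for word, count in h.items())
    norm = sqrt(sum(v * v for v in t.values())) * sqrt(sum(v * v for v in h.values()))
    return dot / norm if norm else 0.0

Note, for instance, that the pair shown in Figure 1 receives a high unigram-overlap score despite being a negative entailment, which is precisely the kind of case that motivated the grammatical features added in ndsc+ and described below.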

A more detailed description of these features can be found in (Newman et al., 2005).

Our second system, ndsc+, is an augmented version of ndsc. It uses all the features available to ndsc, along with some new features. These new features are grammatical in nature, extracted from a MINIPAR parse of the sentences. Using MINIPAR (Lin, 1998) to identify the constituent parts of each sentence, we extracted the subject, object and verb of each sentence. We have three input features based on this: minipar verb, minipar object and minipar subject. These are binary indicators which are set to 1 if there is an exact string match between the lemmatized forms of the verbs, objects or subjects of the text and hypothesis sentences.

Our motivation for implementing these features was based on our examination of the errors made by ndsc on the 2005 corpus. A number of the misclassifications made by the system were due to ignorance of the grammatical components of the sentences and an over–reliance on lexical (i.e. word–level) indicators. For example, the entailment–pair in Figure 1 from the 2005 corpus was misclassified by ndsc, but with the addition of the minipar features, ndsc+ correctly identified it as negative entailment. Ndsc+ achieved this by identifying that the objects of the text and hypothesis differ.
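As an illustration of how these binary indicators can be computed, the sketch below derives the three minipar features from pre-extracted, lemmatized subject–verb–object triples. The input structures are assumed for the purposes of the example; the extraction of the triples from the MINIPAR dependency output itself is not shown.

def minipar_features(text_triple, hyp_triple):
    # text_triple / hyp_triple: dicts of lemmatized head words, e.g.
    #   {"subject": "ulysses", "verb": "launch", "object": None}
    # An indicator is 1 only when both sentences fill the slot and the
    # lemmas match exactly, mirroring the exact-string-match criterion above.
    features = {}
    for slot in ("subject", "verb", "object"):
        t, h = text_triple.get(slot), hyp_triple.get(slot)
        features["minipar_" + slot] = int(t is not None and t == h)
    return features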

3 Performance

The training data required for both systems was extracted from the development corpus using a series of Perl scripts. The data was formatted appropriately and submitted to C5 in order to generate the classifiers (ndsc and ndsc+).
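As a rough sketch of this formatting step, the snippet below writes feature vectors in a C5-style .names/.data layout. The attribute names and exact file conventions shown here are illustrative and should be checked against the C5.0 documentation; our actual scripts were written in Perl and are not reproduced here.

def write_c5_files(stem, rows):
    # rows: list of (feature_dict, label) pairs, label being "YES" or "NO".
    # Writes stem.names (class and attribute declarations) and stem.data
    # (one case per line, class value last).
    attrs = sorted(rows[0][0].keys())
    with open(stem + ".names", "w") as names:
        names.write("YES, NO.\n")                  # the two entailment classes
        for attr in attrs:
            names.write(attr + ": continuous.\n")  # every feature treated as numeric here
    with open(stem + ".data", "w") as data:
        for feats, label in rows:
            values = [str(feats[a]) for a in attrs]
            data.write(",".join(values + [label]) + "\n")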

id=2028; entailment=NO; task=QA
Text: Besancon is the capital of France’s watch and clockmaking industry and of high precision engineering.
Hypothesis: Besancon is the capital of France.

Figure 1: Misclassified by ndsc; correctly classified by ndsc+

The test data was then submitted to ndsc and ndsc+ and classifications were returned. The results achieved by our systems are shown in Table 1. On average, there is a 2% improvement in performance using ndsc+. There is a drop in performance on the Summarisation portion of the test set; some analysis of this is given in the next section. These results are broadly comparable to those we achieved last year. Though there is a slight drop in the absolute average scores, it is important to remember that the scores from last year were heavily influenced by the CD portion of the corpus (in which we scored approximately 75% accuracy). When we compare the scores on the tasks that were present in both corpora (i.e. IE, IR and QA), we see that the respective scores are not significantly different. The average performance of ndsc on these three tasks was 48.3% in 2005 and 49.5% in 2006; the average score of ndsc+ on these tasks is 53%. Manual examination of the annotated system runs showed a large number of discrepancies between the classifications of ndsc and ndsc+. Investigation shows that the two systems disagreed on 225 of the 800 cases (28.13%). This is a very large difference, yet the relative difference in the accuracy of the systems is only 2.5%. A full breakdown of the disagreement is given in Table 2. Further analysis is provided in the following section.

4 Analysis

In this section, we present some observations from a manual analysis of the agreements and disagreements between the two systems. The analysis breaks down into the four categories used in Table 2, corresponding to the cases where the systems agree and are correct, where only ndsc is correct, where only ndsc+ is correct, and where both systems make the same misclassification.

4.1 Consensus on Correct Classification

There is a 75% agreement between ndsc and ndsc+ on cases that were marked correctly. This is a large proportion, and much higher than the agreement we would expect between two random classifiers; yet since the set of features used by ndsc is a subset of those used by ndsc+, and the two systems are otherwise largely the same, we expected an even greater level of agreement. Examination of the C5 output shows that the presence of the new minipar features in the training data had a radical effect on the tree structure produced at the training stage. When we examine the decision trees generated for the two classifiers, we note that, while a good deal of similarity remains, some of the important early branches are fundamentally different. Consequently, many of the classifications have changed in light of the new information available. Such a change shows that the syntactic information available to ndsc+ can be very important in some cases. Conversely, since there is only a small change in overall performance, we must conclude that the augmented decision tree of ndsc+ may be subject to some overfitting, with many of the disputed cases being "borderline" and the addition of the extra features upsetting the balance.
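The agreement figures quoted here, and the breakdown given in Table 2, can be tabulated from two annotated runs along the following lines. This is an illustrative sketch only; the input format is assumed rather than taken from our actual evaluation scripts.

def agreement_breakdown(gold, run_a, run_b):
    # gold, run_a, run_b: parallel lists of "YES"/"NO" decisions per pair.
    counts = {"both correct": 0, "only a correct": 0,
              "only b correct": 0, "neither correct": 0}
    for g, a, b in zip(gold, run_a, run_b):
        if a == g and b == g:
            counts["both correct"] += 1
        elif a == g:
            counts["only a correct"] += 1
        elif b == g:
            counts["only b correct"] += 1
        else:
            counts["neither correct"] += 1
    # Raw agreement between the two runs, regardless of correctness.
    agreement = sum(a == b for a, b in zip(run_a, run_b)) / len(gold)
    return counts, agreement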

4.2 Only Ndsc Correct

We see from the analysis that ndsc+ does not perform as well in summarisation as it does in the other tasks. We attribute this to the fact that the summarisation task tends to have more examples which contain rephrasing and synonyms, as opposed to the QA, IR and IE tasks, in which the hypothesis is generally extracted in some manner from the text. However, there are still many cases in each task in which ndsc+ fails to make a correct classification. In the example shown in Figure 2, ndsc is able to make the correct judgment since the lexical overlap metrics (e.g. ROUGE and longest common subsequence) contribute significantly to its decision. On the other hand, ndsc+ makes an incorrect classification due to the influence of the new minipar features. The pertinent part of the text is contained within an auxiliary clause, so there is low correspondence between the verbs, subjects and objects of the text and hypothesis. Because ndsc+ can match only the subjects of the two sentences, it deems that there is negative entailment: the minipar features have led to some overfitting, overriding the information from the other features and diminishing their influence, which results in an incorrect classification. Ndsc correctly classifies this pair by appealing to the word-overlap measures and synonym matching.

         ndsc    ndsc+   ndsc (2005)
Average  52.50   54.37   NA
IE       47.5    51.5    51.67
IR       51.5    52.5    43.3
QA       49.5    55.0    50.0
SUM      61.5    58.5    NA

Table 1: Our Results

                        IE   IR   QA   SUM   All Tasks
Both systems correct    58   86   69   102   315
Only ndsc correct       37   17   30    21   105
Only ndsc+ correct      45   19   41    15   120
Neither system correct  60   78   60    62   260

Table 2: Discrepancies between the two systems

Pair: 363; entailment=YES; task=QA
Text: Launched in 1990 and still going strong, Ulysses is the first spacecraft ever to pass over the polar regions of the Sun.
Hypothesis: The Ulysses spacecraft was launched in 1990.

Figure 2: Correctly classified by ndsc only

id: 793; entailment=YES; task=QA
Text: The FARC is an 18,000-strong guerrilla force that controls large territories in Colombia. Its demands revolve around issues of social welfare, economic development, agrarian and judiciary reform and reorganization of the military.
Hypothesis: The FARC is a guerrilla force.

Figure 3: Correctly classified by ndsc+ only

4.3 Only Ndsc+ Correct

Any improvements of ndsc+ over ndsc are directly attributable to the presence of the subject–object–verb features. The advantages of using the new features are shown in Figure 3. In this example, ndsc classified the pair as negative entailment because the word overlap was not high enough as a proportion of the length of the text. Ndsc+, however, classified it as positive entailment: the new features showed that the subject and object of the two sentences were the same. By identifying the subject, object and verb in the sentences, ndsc+ is likewise capable of correctly classifying a pair as negatively entailing when the verbs do not match. With reference to the performance results, we see that there are 4% and 5.5% improvements in performance by ndsc+ on the IE and QA tasks, respectively. The minipar features are able to exploit the characteristics of the IE and QA instances better than those of the IR and SUM tasks.

4.4 Neither System Correct

There are still many cases (32.5% of the corpus) where both systems fail to make the correct classification. Some preliminary scrutiny of these cases shows a range of different types of semantic equivalence and entailment. In some cases, an appeal to external knowledge bases such as a gazetteer or Named Entity recogniser is necessary, as shown in Figure 4 (where the system needs to know that "N. J." and "New Jersey" refer to the same entity). While it is difficult to suggest a general means of solving these cases, it is possible to envisage situations where rules or knowledge bases can be adjusted to cater for a particular domain. In other words, classifiers could be developed specifically to suit certain tasks or applications, or at least hypotheses could be formed which can be tested later on, or used in future cases.

id=43; entailment=YES; task=IE
Text: At the other side of the country, Linden, N.J. is part of an industrial corridor of chemical plants and oil refineries that some federal officials refer to as “the most dangerous two miles in America.”
Hypothesis: Chemical plants and oil refineries are located in New Jersey.

Figure 4: Knowledge base required
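A minimal sketch of the kind of lookup this would involve is given below. The alias table is invented purely for illustration; a real system would populate it from a gazetteer or from the output of a named-entity recogniser.

# Hypothetical alias table; entries here are examples, not system data.
ALIASES = {
    "n. j.": "new jersey",
    "n.j.": "new jersey",
}

def normalise_aliases(sentence):
    # Replace known abbreviations with canonical forms before the
    # overlap features are computed.
    text = sentence.lower()
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text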

5 Conclusion

Between the two systems, the majority of entailment–pairs are classified correctly. Unfortunately, thus far, we have no way of determining which classifier is likely to get which type of sentence correct. This is similar to the situation in which we found ourselves for the 1st RTE Challenge, where we had one classifier which performed well for certain specific tasks, and another general classifier which had no recourse to the task feature, but performed uniformly across all tasks. Ideally, then, to maximise the accuracy of our system, we would like to build some sort of voting mechanism in which the different classifiers all contribute to a single overall classification. This has been done with some success in some areas of machine learning, where it was found that certain classifiers performed better on certain ranges or types of data (Lillis et al., 2006). Future work includes further lexico–linguistic feature development, but as we have seen this will not be sufficient to tackle the problem fully. Some of the research in our group currently uses sentence reformulation in the process of addressing the question–answering task. Reformulation of the sentences may allow us to extract certain features with more accuracy, improving our performance. This can be combined with our continuing work with parse trees, especially if we can develop general rules for dealing with certain types of sentences or clauses.
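A simple version of such a voting mechanism, assuming each classifier returns a label together with the confidence value C5 attaches to its decision, might look as follows. This is an illustrative sketch, not an implemented component of our system.

def combine_votes(decisions):
    # decisions: list of (label, confidence) pairs, label "YES" or "NO";
    # each classifier's vote is weighted by its confidence in the decision.
    score = {"YES": 0.0, "NO": 0.0}
    for label, confidence in decisions:
        score[label] += confidence
    return max(score, key=score.get)

# For example, combine_votes([("YES", 0.80), ("NO", 0.55)]) returns "YES".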

A deeper analysis of the summarisation portion of the corpus is also required. Since the original system was designed and developed using the 1st PASCAL Challenge dataset, which did not include any summarisation examples, it is reasonable to surmise that the three tasks common to both datasets are the best indicators of improvement over our previous system. Indeed, 58.5% remains the best score ndsc+ achieved on any task, even though it represents a drop from the corresponding ndsc score. Thus, it seems that for IE, IR and QA, identification of the principal subjects, objects and verbs in the sentences can aid in the recognition of textual entailment. However, this approach may not be suitable for a task such as summarisation, which is more likely to have paraphrases and semantically equivalent yet lexically different phrases appearing in the entailment–pairs. Increasingly, it seems that while lexico–linguistic features can yield small increments in performance, we require a radically new solution before performance approaches anything usable in a real–world situation.

Acknowledgements

The authors gratefully acknowledge the financial support given by Enterprise Ireland for our research as part of the project “CASIA — Combined Approaches to Summarisation for Incident Analysis” (Project number: SC/2003/0255). We also wish to thank our reviewer for an informative and perceptive analysis.

References

Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Tim Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-04).

Tim Chklovski and Patrick Pantel. 2005. Global Path-based Refinement of Noisy Graphs Applied to Verb Semantics. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-05).

Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.

Christiane Fellbaum (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press.

David Lillis, Fergus Toolan, Angel Mur, Liu Peng, Rem Collier, and John Dunnion. 2006. Probability-Based Fusion of Information Retrieval Result Sets. Artificial Intelligence Review, Special Edition on Artificial Intelligence and Cognitive Science (AICS-05), in press.

Chin-Yew Lin and Ed Hovy. 2004. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the Document Understanding Conference (DUC), National Institute of Standards and Technology.

Dekang Lin. 1998. Dependency-based Evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC.

Eamonn Newman, Nicola Stokes, John Dunnion, and Joe Carthy. 2005. UCD IIRG Approach to the Textual Entailment Challenge. In Proceedings of the PASCAL Recognising Textual Entailment Challenge Workshop.

Siddharth Patwardhan, Jason Michelizzi, Satanjeev Banerjee, and Ted Pedersen. 2002. WordNet::Similarity Perl module. Available at http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm

J. Ross Quinlan. 2002. C5.0 Machine Learning Software. Available at http://www.rulequest.com
