Tree Edit Distance for Recognizing Textual Entailment: Estimating the Cost of Insertion

Milen Kouylekov (1,2) and Bernardo Magnini (1)
(1) ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
(2) University of Trento
38050, Povo, Trento, Italy
[email protected], [email protected]
Abstract

The focus of our participation in PASCAL RTE2 was estimating the cost of the information in the hypothesis which is missing from the text and cannot be matched with entailment rules. We have tested different system settings for calculating the importance of the words of the hypothesis and investigated the possibility of combining them with a machine learning algorithm.
1 Introduction

For our participation in the first edition of the PASCAL Recognizing Textual Entailment Challenge (PASCAL RTE1) (Kouleykov and Magnini 2005) we implemented an approach based on a Tree Edit Distance (TED) algorithm, applied to the dependency trees of the text (T) and the hypothesis (H), for recognizing textual entailment. We assumed that the probability of an entailment relation between T and H is related to the ability to show that the whole content of H can be mapped into the content of T. We investigated resources for entailment rules, defined in (Dagan and Glickman 2004) as language expressions with syntactic analysis and optional variables replacing subparts of the structure. We experimented with the TED approach using three linguistic resources: (i) a non-annotated document collection, from which we estimated the relevance of words; (ii) a database of similarity relations among words estimated over a corpus of dependency trees; (iii) WordNet, from which we extracted entailment rules based on lexical relations. The experiments we carried out show that such resources, coupled with the edit distance algorithm, can be used to successfully recognize textual entailment.

This year our focus was estimating the cost of the information in H which is missing from T and cannot be matched with entailment rules. We have tested different system settings for calculating the importance of the words of the hypothesis and investigated the possibility of combining them with a machine learning algorithm. Our hypothesis was that different approaches to calculating the edit cost can be complementary.

The paper is organized as follows. In Section 2 we review some of the relevant approaches proposed by groups participating in the PASCAL-RTE challenge. Section 3 presents the Tree Edit Distance algorithm we have adopted and its application to dependency trees. Section 4 describes the architecture of the system. Section 5 presents the experimental settings and the results we have obtained, while Section 6 contains a general discussion and describes some directions for future work.
2 Relevant Approaches

The most basic inference technique used by participants at PASCAL-RTE is the degree of overlap between T and H. Such overlap is computed using a number of different approaches, ranging from statistical measures like idf to deep syntactic processing and semantic reasoning. The difficulty of the task explains the poor performance of all the systems, which achieved accuracies between 50% and 60%. In the rest of this section we briefly mention some of the systems which are relevant to the approach we describe in this paper.

An approach similar to ours is implemented in a system participating in PASCAL-RTE (Herrera et al. 2005), which relies on dependency parsing and extracts lexical rules from WordNet. A decision tree based algorithm is used to separate the positive from the negative examples. In (Bayer et al. 2005) the authors describe two systems for recognizing textual entailment. The first is based on deep syntactic processing: both T and H are parsed and converted into a logical form, and an event-oriented statistical inference engine is used to separate the TRUE from the FALSE pairs. The second system is based on statistical machine translation models. A method for recognizing textual entailment based on graph matching is described in (Raina et al. 2005). To handle language variability problems the system uses a maximum entropy coreference classifier and calculates term similarities using WordNet.
3 Tree Edit Distance on Dependency Trees

We adopted a tree edit distance algorithm applied to the syntactic representations (i.e. dependency trees) of both T and H. A similar use of tree edit distance was presented by (Punyakanok et al. 2004) for a Question Answering system, showing that the technique outperforms a simple bag-of-words approach. While the cost function they presented is quite simple, for the RTE challenge we tried to elaborate more complex and task-specific measures.

According to our approach, T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold. The underlying assumption is that pairs that exhibit an entailment relation have a low cost of transformation. The kinds of transformations we can apply (i.e. deletion, insertion and substitution) are determined by a set of predefined entailment rules, which also determine a cost for each edit operation. We have implemented the tree edit distance algorithm described in (Zhang and Shasha 1990) and apply it to the dependency trees derived from T and H. Edit operations are defined at the level of single nodes of the dependency tree (i.e. transformations on subtrees are not allowed in the current implementation).

Since the (Zhang and Shasha 1990) algorithm does not consider labels on edges, while dependency trees provide them, each dependency relation R from a node A to a node B has been rewritten as a complex label B-R, concatenating the name of the destination node and the name of the relation. All nodes except the root of the tree are relabeled in this way. The algorithm is directional: we aim to find the best (i.e. least costly) sequence of edit operations that transforms T (the source) into H (the target). According to the constraints described above, the following transformations are allowed:

• Insertion: insert a node from the dependency tree of H into the dependency tree of T. When a node is inserted it is attached with the dependency relation of the source label.

• Deletion: delete a node N from the dependency tree of T. When N is deleted all its children are attached to the parent of N. It is not required to explicitly delete the children of N, as they are going to be either deleted or substituted in a following step.

• Substitution: change the label of a node N1 in the source tree (the dependency tree of T) into the label of a node N2 of the target tree (the dependency tree of H). Substitution is allowed only if the two nodes share the same part of speech. In case of substitution the relation attached to the substituted node is changed with the relation of the new node.
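To make the relabeling step concrete, here is a minimal Python sketch; the DepNode class and relabel function are illustrative stand-ins we introduce for exposition, not the system's actual code.

```python
# Minimal sketch of the edge-label rewriting step described above.
# DepNode and relabel are illustrative stand-ins, not the authors' code.

class DepNode:
    def __init__(self, word, relation=None, children=None):
        self.word = word          # surface form of the node
        self.relation = relation  # dependency relation to the parent (None for the root)
        self.children = children if children is not None else []
        self.label = word         # label seen by the tree edit distance algorithm

def relabel(node):
    """Rewrite each relation R from a node A to a child B as the complex
    label B-R, so that an edge-blind TED algorithm such as Zhang-Shasha
    can still distinguish identical words attached by different relations."""
    for child in node.children:
        child.label = "%s-%s" % (child.word, child.relation)
        relabel(child)
    return node  # the root keeps its plain label

# Toy example: "abdicated" with subject "Edward" and modifier "December".
root = relabel(DepNode("abdicated", children=[DepNode("Edward", "s"),
                                              DepNode("December", "mod")]))
print([c.label for c in root.children])  # ['Edward-s', 'December-mod']
```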
4 System Architecture

The system is composed of the following modules, shown in Figure 1: (i) a text processing module, for the preprocessing of the input T/H pair; (ii) a matching module, which performs the mapping between T and H; (iii) a cost module, which computes the cost of the edit operations.

4.1 Text Processing Module

The text processing module creates a syntactic representation of a T/H pair and relies on a sentence splitter and a syntactic parser. For parsing we used Minipar, a principle-based English parser (Lin 1998) which has high processing speed and good precision.

Figure 1: System Architecture
4.2 Matching Module
The matching module implements the edit distance algorithm described in Section 3 and finds the best sequence (i.e. the sequence with the lowest cost) of edit operations between the dependency trees obtained from T and H. The entailment score of a given pair is calculated in the following way:

score(T, H) = \frac{ed(T, H)}{ed(\emptyset, H)}    (1)

where ed(T, H) is the function that calculates the edit distance cost between T and H, and ed(\emptyset, H) is the cost of inserting the entire tree of H.
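A minimal sketch of this normalization follows, assuming an ed(source, target) function that returns the minimal edit cost between two trees; the names and the threshold value are illustrative, not the paper's.

```python
# Sketch of the entailment decision based on Equation (1); ed() is
# assumed to return the minimal edit cost between two dependency trees.

EMPTY_TREE = None  # placeholder for the empty source tree

def entailment_score(ed, t_tree, h_tree):
    # Normalize the T -> H edit cost by the cost of building H from
    # scratch, i.e. the cost of inserting the entire tree of H.
    return ed(t_tree, h_tree) / ed(EMPTY_TREE, h_tree)

def entails(ed, t_tree, h_tree, threshold=0.5):
    # Pairs whose normalized transformation cost stays below the
    # empirically estimated threshold are labeled as entailment;
    # 0.5 is an illustrative value, not the one used in the paper.
    return entailment_score(ed, t_tree, h_tree) < threshold
```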
4.3 Cost Module
The matching module makes requests to the cost module in order to receive the cost of the single edit operations needed to transform T into H. We have different cost strategies for the three edit operations.

Insertion. The intuition underlying insertion is that its cost is proportional to the relevance of the word w to be inserted (i.e. inserting an informative word has a higher cost than inserting a less informative word). More precisely:

Cost[ed(\emptyset, w)] = Rel(w)    (2)

where Rel(w), in the current version of the system, is computed on a document collection as the inverse document frequency (idf) of w, a measure commonly used in Information Retrieval. If N is the number of documents in a text collection and N_w is the number of documents of the collection that contain w, then the idf of w is given by the formula:

idf(w) = \log \frac{N}{N_w}    (3)

The most frequent words (e.g. stop words) have a zero cost of insertion.
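As a rough sketch (assuming document frequencies precomputed on a corpus; the stop word list and the counts below are illustrative stand-ins, not the authors' 4.5-million-document collection):

```python
import math

# Illustrative idf-based insertion cost (Equations (2)-(3)).

STOP_WORDS = {"the", "a", "an", "of", "in", "and"}

def insertion_cost(word, doc_freq, n_docs):
    if word.lower() in STOP_WORDS:
        return 0.0  # the most frequent words are inserted for free
    n_w = doc_freq.get(word, 1)  # smoothing for unseen words (our assumption)
    return math.log(n_docs / n_w)  # idf(w) = log(N / N_w)

# e.g. a rare word in a 4.5M-document collection gets a high cost:
print(insertion_cost("abdicated", {"abdicated": 1200}, 4500000))
```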
We have also considered measures that calculate the relevance of a word according to its position in the dependency tree of the hypothesis: words with a higher position in the tree (i.e. closer to the root), or with more children, are considered more relevant to the meaning expressed by a given phrase. Accordingly, two alternative measures for calculating the cost of an insertion are:

Rel(w) = \#children(w)    (4)

Rel(w) = 10 - \#parents(w)    (5)

where \#children(w) is the number of children of w and \#parents(w) is the number of parents of w in the dependency tree of the hypothesis. The maximum depth of a dependency tree estimated on the development set is 10.
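A short sketch of these two measures, reusing the illustrative DepNode structure from the Section 3 sketch:

```python
# Illustrative implementation of Equations (4) and (5); the depth
# argument is our own convention, and MAX_DEPTH = 10 is the maximum
# tree depth estimated on the development set.

MAX_DEPTH = 10

def rel_children(node):
    # Equation (4): nodes governing more material are more relevant.
    return len(node.children)

def rel_parents(depth):
    # Equation (5): depth is the number of parents of the node, so
    # nodes closer to the root get a higher relevance.
    return MAX_DEPTH - depth
```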
Substitution. The cost of substituting a word w1 with a word w2 can be estimated by considering the semantic entailment between the words: the more strongly the two words are entailed, the lower the cost of substituting one word with the other. We have used the following formula:

Cost[ed(w_1, w_2)] = Ins(w_2) \times (1 - Ent(w_1, w_2))    (6)

where Ins(w_2) is calculated using (4) and Ent(w_1, w_2) can be approximated with a variety of relatedness functions between w_1 and w_2. There are two crucial issues for the definition of an effective function for lexical entailment: first, it is necessary to have a database of entailment relations with enough coverage; second, we have to estimate a quantitative measure for such relations. We have defined a set of entailment rules over the WordNet relations among synsets, with their respective probabilities. If A and B are synsets in WordNet 2.0, then we derived an entailment rule in the following cases: A is a hypernym of B; A is a synonym of B; A entails B; A pertains to B. For all the relations between the synsets of two words, the probability of entailment is estimated with the following formula:

Ent_{wordnet}(w_1, w_2) = \frac{1}{S_{w_1}} \times \frac{1}{S_{w_2}}    (7)

where S_{w_i} is the number of senses of w_i, so that 1/S_{w_i} is the probability that w_i is in the sense which participates in the relation, and Ent_{wordnet}(w_1, w_2) is the joint probability. The proposed formula is simplistic and does not take into account the frequency of senses or the length of the relation chain between the synsets.
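A sketch of this estimate using NLTK's WordNet interface (NLTK ships WordNet 3.0 rather than the 2.0 used in the paper, pertainymy is omitted for brevity, and the direction of the hypernymy test is our reading of the rules):

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

def ent_wordnet(w1, w2):
    """Equation (7): (1/S_w1) * (1/S_w2) if some pair of senses of w1
    and w2 is connected by one of the entailment rules, 0 otherwise."""
    synsets1, synsets2 = wn.synsets(w1), wn.synsets(w2)
    if not synsets1 or not synsets2:
        return 0.0
    for a in synsets1:
        for b in synsets2:
            if (a == b                         # synonymy (shared synset)
                    or b in a.hypernyms()      # hypernymy (direction assumed)
                    or b in a.entailments()):  # verb entailment
                # Each word contributes 1/S_w, the probability that it
                # is used in the sense participating in the relation.
                return (1.0 / len(synsets1)) * (1.0 / len(synsets2))
    return 0.0

print(ent_wordnet("abdicate", "leave"))  # non-zero only if a rule connects them
```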
Deletion. In the PASCAL-RTE2 dataset H is typically shorter than T. As a consequence, we expect that many more deletions than insertions or substitutions are necessary to transform T into H. Given this bias toward deletion, in the current version of the system we set the cost of deletion to 0. Deleted words influence the meaning of already matched words; this requires that the evaluation of the cost of a deleted word be done after the matching is finished. In the future we plan to implement a module that calculates the cost of deletion separately from the matching module.

An example of mapping between two dependency trees is depicted in Figure 2. The tree on the left is the text: Edward VIII became King in January of 1936 and abdicated in December. The tree on the right corresponds to the hypothesis: King Edward VIII abdicated in December 1936. The algorithm finds as the best mapping the subtree with root abdicated. The verb became is substituted by the verb abdicated because an entailment rule between them, extracted from one of the resources, exists. Lines connect the nodes that are exactly matched and the substituted nodes (became-abdicated) for which an entailment rule is used; they represent the minimal cost match. Nodes in the text that do not participate in the mapping are removed. The lexical modifier 1936 of the noun December is inserted.

Figure 2: Example
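Putting the three strategies together, here is a sketch of the cost interface the matching module could query; the function signatures are our own, not the authors' interface, and ins and ent stand for the insertion cost and lexical entailment estimate defined above.

```python
# Illustrative cost module combining the three edit operation strategies.

INFINITE = float("inf")

def deletion_cost(word):
    # H is much shorter than T, so deletions dominate; they are free
    # in the current version of the system.
    return 0.0

def substitution_cost(w1, pos1, w2, pos2, ins, ent):
    if pos1 != pos2:
        return INFINITE  # substitution only between nodes with the same POS
    if w1 == w2:
        return 0.0       # identical words substitute for free
    return ins(w2) * (1.0 - ent(w1, w2))  # Equation (6)
```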
5 Experiments and Results

In this section we report on the dataset, the experiments and the results we have obtained.

5.1 Experiments
We ran 6 systems with different settings. In all the system variants we tested we used the following settings for substitution and deletion:

Deletion: always 0.

Substitution: 0 if w1 = w2; the WordNet-based rule score (when the score is > 0.2); infinite in all other cases.

These settings correspond to the substitution and deletion functions of the best system reported in (Kouleykov and Magnini 2005). We made experiments with the following system settings:

System 1: Insertion as idf. In this configuration, considered as a baseline for the Tree Edit Distance approach, the cost of an insertion is set to the idf of the word to be inserted. In this configuration the system needs a non-annotated corpus. The corpus we used contains 4.5 million news documents from the CLEF-QA (Cross Language Evaluation Forum) and TREC (Text Retrieval Conference) collections.

System 2: Fixed insertion cost. In this configuration we fixed the insertion cost in order to compare the system performance against the baseline strategy based on idf calculated on a local corpus. The cost was fixed to 200.

System 3: Number of parents. In this configuration we used the number-of-parents formula described in Section 4 for calculating the insertion cost.

System 4: Number of children. In this configuration we used the number-of-children formula described in Section 4 for calculating the insertion cost.

System 5: Number of children + number of parents. In this configuration we used the sum of the number-of-children and number-of-parents formulas described in Section 4 for calculating the insertion cost.

For systems 1-5 an entailment relation is assigned to a T-H pair if the overall cost of the transformation is below a certain threshold, empirically estimated on the training data for each task of the training set. Such estimation is a simple learning algorithm with two features: the task of the example and the calculated distance.

System 6: Combined. In this configuration we used the distances calculated by all the previous systems as features for the sequential minimal optimization (SMO) algorithm, described in (Smola and Schölkopf 1998) and implemented in (Witten and Frank 2005), for training a support vector classifier. We use this run to test whether different approaches to calculating the edit cost can perform in a complementary manner. The feature vector for a T-H pair contains the distances calculated by each system and the task to which the pair belongs. An entailment relation is assigned to a T-H pair if the example is classified as positive (see the sketch below).
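As a sketch of this combination (using scikit-learn's SVC in place of the Weka SMO implementation; the feature values and task encoding below are illustrative placeholders, not data from the paper):

```python
from sklearn.svm import SVC  # stands in for the Weka SMO implementation

# Each vector: [dist_sys1, ..., dist_sys5, task] with IE=0, IR=1, QA=2, SUM=3.
X_train = [
    [0.42, 0.58, 0.31, 0.44, 0.37, 3],  # a SUM pair, entailment
    [0.91, 0.77, 0.85, 0.80, 0.83, 0],  # an IE pair, no entailment
]
y_train = [1, 0]

clf = SVC(kernel="linear")  # trained via sequential minimal optimization
clf.fit(X_train, y_train)

# A new pair is labeled as entailment if the classifier says positive.
print(clf.predict([[0.40, 0.55, 0.30, 0.45, 0.35, 3]]))
```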
5.2 Results
Table 1 reports the accuracy calculated on the development and test sets using only the distance calculated by each separate system. The results of the baseline system on the test set represent the first submitted run; the combined system results are those of the second submitted run. The results show that the combined run performs better than the other systems: combining different approaches for estimating the edit operation cost improves the overall performance of the system. The different systems perform in a complementary manner, and some T-H pairs are correctly assigned TRUE or FALSE because the majority of the systems classify them as such. The small difference in performance is due to the comparable performance of the systems used. In order to obtain optimal results, the system must run with a different set of cost functions on the different tasks of the dataset.

It is important to notice that System 3, which uses the number of parents for the insertion cost, outperforms the baseline System 1, which uses a corpus for estimating idf as the cost of insertion. This shows that using idf for estimating the cost of the insertion operation is not necessary to obtain good results. The results also show that some of the systems overfit the training set. The distance calculated by System 2 depends on the average number of inserted words; thus, its lower performance on the training set is explained by the different value of this number for the two sets. The baseline system produces the most stable (i.e. least overfitting) performance on the development and training sets.

Table 2 reports the results obtained by the two submitted runs. Our system performs well on the Summarization task: traditional summarization systems generate reports using words from the text they process, and because of that it was easy to distinguish the positive from the negative examples in the development and test sets. The main problem for the systems is represented by the Information Extraction task. Traditional IE systems approach the problem in a linear manner, in contrast to our parser-based approach. In contrast to the other three tasks, recognizing entailment for IE requires a large resource of complex entailment rules; the simple lexical entailment rules used in this version of the system cannot address the problem sufficiently. Although the combined run performs better than the baseline system, it has lower precision. This is due to the different algorithms used to calculate the overall score. A more careful combination of systems with respect to each task could improve the results.
System    Development   Ten-fold cross-validation   Test
System 1  0.581         0.578                       0.572
System 2  0.591         0.560                       0.570
System 3  0.600         0.590                       0.582
System 4  0.579         0.579                       0.541
System 5  0.598         0.590                       0.571
System 6  0.637         0.613                       0.605

Table 1: Accuracy of the different systems on the development set, in ten-fold cross-validation, and on the test set
Task    run1 (Baseline)          run2 (Combined)
        accuracy   precision     accuracy   precision
IE      0.5050     0.5095        0.5200     0.4978
IR      0.5500     0.4658        0.6000     0.5352
QA      0.5650     0.4658        0.6000     0.5352
SUM     0.6700     0.7067        0.7000     0.5240
Total   0.5725     0.5249        0.6050     0.5046

Table 2: System performance per task for the two submitted runs
6 Discussion and Future Work

We have presented an approach for recognizing textual entailment based on tree edit distance applied to the dependency trees of T and H. We have also investigated different ways of calculating the cost functions for the edit distance algorithm. In the future we plan to extend the usage of WordNet as an entailment resource. Entailment rules found in entailment and paraphrasing resources can also be used.

A drawback of the tree edit distance approach presented here is that it is not able to observe the whole tree, but only the subtree of the node being processed. For example, the cost of inserting a subtree in H could be smaller if the same subtree is deleted from T at an earlier or later stage. A context-sensitive extension of the insertion and deletion modules would increase the performance of the system. In this direction, the negative examples (examples that do not have an entailment relation) in the development set for which the system reports a small distance can be used for extracting context-dependent rules that estimate the cost of the deletion operation. In the future we also plan to develop an evolutionary algorithm to combine the different functions for calculating the insertion and deletion costs.
References

Samuel Bayer, John Burger, Lisa Ferro, John Henderson and Alexander Yeh. MITRE's Submissions to the EU Pascal RTE Challenge. In Proceedings of the PASCAL Workshop on Recognizing Textual Entailment, Southampton, UK, 2005.

Ido Dagan and Oren Glickman. Generic Applied Modeling of Language Variability. In Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining, Grenoble, 2004.

Jesús Herrera, Anselmo Peñas and Felisa Verdejo. Textual Entailment Recognition Based on Dependency Analysis and WordNet. In Proceedings of the PASCAL Workshop on Recognizing Textual Entailment, Southampton, UK, 2005.

Milen Kouleykov and Bernardo Magnini. Combining Lexical Resources with Tree Edit Distance for Recognizing Textual Entailment. In Proceedings of the First PASCAL Recognizing Textual Entailment Workshop, LNAI, Springer, 2005.

Dekang Lin. Dependency-based Evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC-98, Granada, Spain, 1998.

Vasin Punyakanok, Dan Roth and Wen-tau Yih. Mapping Dependencies Trees: An Application to Question Answering. In Proceedings of AI & Math, 2004.

Rajat Raina, Aria Haghighi, Christopher Cox, Jenny Finkel, Jeff Michels, Kristina Toutanova, Bill MacCartney, Marie-Catherine de Marneffe, Christopher D. Manning and Andrew Y. Ng. Robust Textual Inference using Diverse Knowledge Sources. In Proceedings of the PASCAL Workshop on Recognizing Textual Entailment, Southampton, UK, 2005.

Alex J. Smola and Bernhard Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series NC2-TR-1998-030, 1998.

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.

Kaizhong Zhang and Dennis Shasha. Fast Algorithm for the Unit Cost Editing Distance Between Trees. Journal of Algorithms, vol. 11, p. 1245-1262, December 1990.