Coping with semantic uncertainty with VENSES

Rodolfo Delmonte, Antonella Bristot, Marco Aldo Piccolino Boniforti, Sara Tonelli
Department of Language Sciences
Università Ca’ Foscari – Ca’ Bembo
30120, Venezia, Italy
[email protected]

Abstract

As in the previous RTE Challenge, we present a linguistically-based approach to semantic inference which is built around a neat division of labour between two main components: a grammatically-driven subsystem, which is responsible for the level of predicate-argument well-formedness and works on the output of a deep parser that produces augmented head-dependency structures, and a second subsystem, which attempts allowed logical and lexical inferences on the basis of different types of structural transformations intended to produce a semantically valid meaning correspondence. Grammatical relations and semantic roles are used to generate a weighted score. In the current challenge, a number of additional modules have been added to cope with fine-grained inferential triggers which were not present in the previous dataset. Different levels of argumenthood have been devised in order to cope with the semantic uncertainty generated by nearly-inferable Text/Hypothesis pairs whose interpretation requires reasoning.

1 Introduction

The system for semantic evaluation VENSES (Venice Semantic Evaluation System) is organized as a pipeline of two subsystems: the first is a reduced version of GETARUN, our system for Text Understanding [4]; the second is the semantic evaluator, which was previously created for Summary and Question evaluation [3] and has now been thoroughly revised for the new, more comprehensive RTE task. The reduced GETARUN is composed of the usual sequence of submodules common in Information Extraction systems, i.e. a tokenizer, a multiword and NE recognition module, and a PoS tagger based on finite-state automata, followed by a multilayered cascaded RTN-based parser which is equipped with an interpretation module that

uses subcategorization information and semantic-role processing. Finally, the system has a pronominal binding module that works at text/hypothesis level separately for lexical personal, possessive and reflexive pronouns, which are substituted by the heads of their antecedents, if available. The output of the system is a flat list of fully indexed augmented head-dependent structures (AHDS) with Grammatical Relation (GR) and Semantic Role (SR) labels. A notable addition to the usual formalism is the presence of a distinguished Negation relation; we also mark modals and progressive mood (for similar approaches see [1, 2]). The evaluation system uses a cost model with rewards/penalties for T/H pairs, where text entailment is interpreted in terms of semantic similarity: the closer the T/H pair is in semantic terms, the more probable the entailment. Rewards in terms of scores are assigned for each "similar" semantic element; penalties, on the contrary, can either be expressed in terms of scores or determine a local failure and a consequent FALSE decision – more on scoring below. As shown in Fig. 1 below, the evaluation system accesses the output of GETARUN, which sits on files, and is totally independent of it. It is made up of two main Modules: the first is a sequence of linguistic rule-based submodules; the second is a quantitatively based measurement of the input structures. The latter is basically a count of heads, dependents, GRs and SRs, scoring only similar elements in the H/T pair. Similarity may range from identical linguistic items, to synonymous or just morphologically derivable ones. GRs and SRs are scored higher when they belong to the subset of core relations and roles, i.e. obligatory arguments, and lower when they are adjuncts. Both Modules go through General Consistency checks, which are targeted at high-level semantic attributes like the presence of modality, negation, and opacity

operators; the latter are expressed either by the presence of discourse markers of conditionality, or by a secondary-level relation intervening between the main predicate and a governing higher predicate belonging to the class of non-factual verbs (but see below). Linguistic rule-based submodules are organized into a sequence of syntactic-semantic transformation rules, going from those containing axiomatic-like paraphrase HDSs, which are ranked higher, to rules stating conditions for similarity according to a scale of argumentality (more below), which are ranked lower. All rules address HDSs, GRs and SRs. Modifying the scoring function may thus vary the final result dramatically: it may contribute more True decisions if relaxed, so it needs fine tuning.

Fig. 1 Preprocessing and Semantic Evaluation
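To make the scoring idea concrete, the following minimal Python sketch shows how similar heads, dependents, GRs and SRs could be rewarded, with core relations weighted more than adjuncts. This is an illustration only, not the VENSES implementation: the weights, the fact layout and the names (CORE_RELATIONS, score_pair) are assumptions made for the example.

# Minimal sketch of the quantitative scoring idea: reward every head,
# dependent, GR and SR of the Hypothesis that also appears in the Text,
# giving core (obligatory) relations more weight than adjuncts.
# Weights, names and the fact layout are illustrative assumptions.

CORE_RELATIONS = {"subj", "obj", "iobj", "arg_mod", "xcomp"}

def score_pair(text_facts, hyp_facts, core_weight=2.0, adjunct_weight=1.0):
    """Each fact is (head, grammatical_relation, semantic_role, dependent)."""
    score = 0.0
    for head, gr, sr, dep in hyp_facts:
        for t_head, t_gr, t_sr, t_dep in text_facts:
            if head == t_head and dep == t_dep:          # same head-dependent pair
                weight = core_weight if gr in CORE_RELATIONS else adjunct_weight
                if gr == t_gr:
                    score += weight                       # same grammatical relation
                if sr == t_sr:
                    score += weight                       # same semantic role
                break
    return score

# Toy usage with pair 5 ("tea protects ..."):
text = [("protect", "subj", "agent", "tea"),
        ("protect", "iobj", "goal", "heart_disease")]
hyp = [("protect", "subj", "agent", "tea"),
       ("protect", "iobj", "goal", "disease")]
print(score_pair(text, hyp))   # 4.0: only the identical subj-agent pair scores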

2 The Task of Semantic Inference Evaluation

Differently from the previous Challenge, this year’s Challenge is characterized by the presence of a relatively high number (we counted more than 100 True and another 100 False pairs, i.e. 25%) of T/H pairs which require two particularly hard NLP processes to set up:
- reasoning
- paraphrase (reformulation)
Setting up rules for paraphrase evaluation requires the system to actually use the lemmas to be matched together in axiomatic-like structures: in other words, in these cases the actual linguistic expressions play a determining role in the task of deriving a semantic inference. Of course, in order for these rules to be applied by the SE it is important to be able to address a combination of Syntactic and Semantic Inference, where Lexical Inference also plays a role: the parser in this case plays a fundamental role in recovering all possible pronominal and unexpressed relations. It is important to remark that, besides pronominal-antecedent relations, our system also recovers all those cases defined as Control in LFG theory, where the unexpressed subject of an untensed proposition (infinitival, participial, gerundive), either lexically, syntactically or structurally controlled, is bound to some argument of the governing predicate.

2.1 Defining the task of the Semantic Evaluator (SE)

As the examples discussed below will make clear, the task of the SE is in no way definable on a purely theoretical semantic basis. One particularly revealing case is the one constituted by reasoning and paraphrase match, which we present below by discussing examples taken from the Test dataset. Consider the following entailed pair, 468, where KILL, the governing predicate of the Hypothesis, should match with the nominal expression "the death toll".
T/H pair n. 468
T: The death toll of a fire that roared through a packed Buenos Aires nightclub climbed on Friday to at least 175, with more than 700 injured as young revellers stampeded to reach locked exit doors.
H: At least 174 had been killed and more than 410 people injured.

In the example below, n. 266, reasoning is required by the need to match "associated with decreased risk" in the T snippet with PROTECT in the H snippet:
T/H pair n. 266
T: Green tea consumption is associated with decreased risk of breast, pancreatic, colon, oesophageal, and lung cancers in humans.
H: Tea protects from some diseases.

Reasoning again is required in the following pair, where some numerical computation has to be performed on the T snippet in order to match it with the H snippet: T/H pair n. 389

T: In Rwanda there were on average 8,000 victims per day for about 100 days.
H: There were 800,000 victims of the massacres in Rwanda.
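The numerical reasoning needed for pair 389 amounts to multiplying the per-day figure by the number of days and comparing the result with the figure given in H. The tiny sketch below illustrates this; the number-normalisation helper and the tolerance are assumptions made for the example, not part of the system.

# Minimal sketch of the numerical reasoning needed for pair 389:
# 8,000 victims per day for about 100 days should match 800,000 victims.
# The parsing of "8,000" / "800,000" and the tolerance are illustrative choices.
def parse_number(token):
    return float(token.replace(",", ""))

per_day = parse_number("8,000")
days = parse_number("100")
claimed_total = parse_number("800,000")

# "about 100 days" licenses an approximate rather than exact comparison
computed = per_day * days
entailed = abs(computed - claimed_total) / claimed_total < 0.05
print(computed, entailed)   # 800000.0 True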

Here is an example which is not entailed, where the computation is based on non-intersectivity:
T/H pair n. 260
T: These folk art traditions have been preserved for hundreds of years.
H: Indigenous folk art is preserved.

The presence of "indigenous" in the H snippet is computed as a case of additional information which is not contained in the T sentence: thus entailment does not hold. Here below we show some AHDS starting from T/H pair 260:
AHDS for T snippet of 260 T/H pair:
folk_art(sn4, mod, these).
tradition(sn1, ncmod-theme, folk_art-sn4).
hundreds(sn2, ncmod-specif, of, year-sn3).
preserve(cl1, ncmod-theme, for, hundreds-sn2).
preserve(cl1, subj-predicate, tradition-sn1, obj).
preserve(aux, be).
preserve(aux, have).

AHDS for H snippet of 260 T/H pair:
folk_art(sn1, mod, 'Indigenous').
preserve(cl2, subj-predicate, folk_art-sn1, obj).
preserve(aux, be).
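The non-intersectivity check on pair 260 can be sketched as follows: collect the modifiers attached to the shared nominal head in T and in H, and reject entailment when H introduces a modifier that has no counterpart in T. The fact layout and helper names below are assumptions made for illustration, not the system's actual data structures (values are lowercased for the comparison).

# Sketch of the non-intersectivity check for pair 260: the H snippet adds the
# modifier "indigenous" to "folk_art", which has no counterpart in T, so
# entailment is rejected. Fact layout and names are illustrative assumptions.

t_facts = [("folk_art", "mod", "these"),
           ("tradition", "ncmod-theme", "folk_art"),
           ("preserve", "subj-predicate", "tradition")]

h_facts = [("folk_art", "mod", "indigenous"),
           ("preserve", "subj-predicate", "folk_art")]

def modifiers_of(head, facts):
    return {dep for h, rel, dep in facts if h == head and "mod" in rel}

def extra_hypothesis_modifiers(head, t_facts, h_facts):
    """Modifiers the Hypothesis adds to a head but the Text does not support."""
    return modifiers_of(head, h_facts) - modifiers_of(head, t_facts)

extra = extra_hypothesis_modifiers("folk_art", t_facts, h_facts)
print(extra)                  # {'indigenous'}
entailment_holds = not extra  # False: H carries unsupported information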

These representations can contain wrong relations, in particular modifying ones, as is the case here with "these", which modifies "folk_art" rather than "traditions". Another more complex example is T/H pair 687, which evaluates to true and which we report below, followed by its AHDS structures:
T/H pair n. 687
T: It takes 560 years to complete one trip around the Sun (versus 250 years for Pluto).
H: Pluto's trip around the Sun lasts 250 years.

AHDS for T snippet of 687 T/H pair:
take(expl, it).
year(sn6, ncmod-specif, for, pluto-sn7).
complete(cl2, ncmod-theme, versus, year-sn6).
year(sn6, det, 250).
complete(cl2, iobj-theme, around, sun-sn5).
sun(sn5, det, the).
complete(subj-agent, years-sn2).
complete(cl2, obj-theme, trip-sn3).
trip(sn4, det, one).
take(cl1, xcomp-prop, nil, complete-cl2).
take(cl1, obj-theme_aff, year-sn2).
year(sn2, det, 560).
take(cl1, subj-agent, it-sn1).

AHDS for H snippet of 687 T/H pair:

trip(sn1, ncmod-locative, around, sun-sn4).
trip(sn1, ncmod-specif, 'Pluto-s_').
last(cl3, obj-theme, year-sn2).
year(sn2, det, 250).
last(cl3, subj-theme_nonaff, trip-sn1).

The system needs to match "last(trip,Pluto's)" with "takes(complete,trip) & complete(trip,for,Pluto)", and this is done by the paraphrase module – indexing does the rest by addressing the right "year" relation. Besides, "complete" serves many relations. Relations encoded in the AHDS of the following example allow semantic entailment to be computed smoothly.
T/H pair n. 5
T: Scientists have discovered that drinking tea protects against heart disease by improving the function of the artery walls.
H: Tea protects from some diseases.

AHDS for T snippet of 5 T/H pair:
discover(ccomp-prop, that, protect).
function(sn5, ncmod-specif, of, wall-sn6).
wall(sn6, det, the).
wall(sn6, ncmod-specif, artery).
function(sn5, det, the).
improve(subj-agent, 'Scientists'-sn4).
improve(cl3, obj-theme_aff, function-sn5).
discover(cl2, xcomp-prop, nil, improve-cl3).
discover(cl2, subj-actor, scientist-sn4).
tea(sn1, ncmod-specif, drinking).
protect(cl1, iobj-goal, against, heart_disease-sn2).
protect(cl1, subj-agent, tea-sn1).
discover(aux, have).

AHDS for H snippet of 5 T/H pair:
protect(cl4, iobj-goal, from, disease-sn2).
disease(sn2, det, some).
protect(cl4, subj-agent, tea-sn1).
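For pair 5 the core-argument match is straightforward: both AHDSs share the predicate protect with the same subject-agent tea, and the goal arguments heart_disease / disease are lexically related. The sketch below illustrates this; the tiny hypernym table stands in for a WordNet lookup, and all names are assumptions made for the example.

# Sketch of the core-argument match that makes pair 5 entailed: both AHDSs
# share the predicate "protect" with subject-agent "tea", and the goal
# arguments "heart_disease" / "disease" are related lexically.
# The tiny hypernym table stands in for a WordNet lookup and is an assumption.

HYPERNYMS = {"heart_disease": {"disease"}}   # toy lexical chain

def lexically_related(a, b):
    return a == b or b in HYPERNYMS.get(a, set()) or a in HYPERNYMS.get(b, set())

t_args = {"subj-agent": "tea", "iobj-goal": "heart_disease"}
h_args = {"subj-agent": "tea", "iobj-goal": "disease"}

same_predicate = True   # both main predicates are "protect"
core_match = all(
    role in t_args and lexically_related(t_args[role], filler)
    for role, filler in h_args.items()
)
print(same_predicate and core_match)   # True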

3. Implementing the Semantic Evaluator (SE)

As said above, the SE is organized into two main modules: a quantitatively based module, and a sequence of linguistic rules where scoring is also taken into account when needed, to increase confidence in the decision process. The two modules must then undergo General Consistency Checks, whose task is to ascertain the presence of possible mismatches at the semantic level. In particular, these checks take care of the following semantic items:
- presence of spatiotemporal locations relative to the same governing predicate, or to a similar one as computed by previous modules;
- presence of opacity operators, like discourse markers of conditionality, having scope over the governing predicate under analysis;

- presence of quantifiers and other referentiality-related determiners attached to the same nominal head in the T/H pair under analysis and chosen as the relevant one by previous computation;
- presence of antonyms in the T/H pair at the level of governing predicates;
- presence of predicates belonging to the class of "doubt"-expressing verbs, governing the relevant predicate shared by the T/H pair.

3.1 The Linguistic Rule-Based Module

This Module is organized as a sequence of submodules which go from exceptional cases down to default cases. Exceptional cases of Semantic Inference are those constituted by Paraphrase and Reformulation rules, where axiomatic structures are addressed; then Lexical Inference follows, in snippets where the linguistic elements checked should be identical, synonymous, morphologically derivable or share a congruent number of semantic features; Syntactic and Semantic inference rules follow, in which different structures are transformed and checked on the basis of GRs and SRs first, and then lexical chains are attempted; finally, Hybrid (partially heuristic) rules are tried, in which both structures and lexical items are matched. Each rule operates at propositional level within the main clause to find matches: the level of the main clause is then switched to that of a dependent clause when needed. Dependent clauses may be very important in determining the outcome of an inference: they may be tensed (head label CCOMP) or untensed (head label XCOMP). The governing head predicate is responsible for the factivity of the dependent. To this aim, important elements checked at propositional level are opacity, modality and negation. Opacity is determined by the type of governing predicate, basically those belonging to the class of nonfactive predicates. Modality is revealed by the presence of modal verbs at this level of computation. Modality could also be instantiated at sentence level by adverbials, and be verified by the General Consistency Checks. Finally, negation may be expressed locally as an adjunct of the verb, but also as a negative conjunction or negative adverbial – see examples above. It may also be present in the determiner of the nominal head and checked separately when comparing the referring expressions considered in the inference. Negation may also be incorporated

lexically in the verb, in the class of the so-called "doubt" verbs.

3.1.1 The Paraphrase Rule SubModule

This submodule addresses definition-like H sentences, or simple paraphrases of the meaning expressed by the main predicate of the T text. Generally speaking, every time one such rule is fired, the T/H pair contains a conceptually complex lexical predicate and its paraphrase in conceptually simple components. Examples of such cases are constituted by pairs like the following:
a. a person killed --> claimed one life
b. hand reins over to --> give starting job to
c. same-sex marriage --> gay nuptials
d. cast ballots in the election --> vote
e. dominant firm --> monopoly power
f. death toll --> kill
g. try to kill --> attack
h. lost their lives --> were killed
i. left people dead --> people were killed
none of which were actually present in WordNet. Definitions and paraphrases are looked up first in the glosses made available by WordNet. In case of failure, a list of some 100 manually built axiomatic rules is accessed. Each such rule addresses the main predicates in the T/H pair, together with the presence of a semantically relevant dependent if needed, whenever the concept expressed by the lexically complex predicate requires it. Together with the predicates, the rules select the relevant GRs and SRs when needed. In addition, further restrictions are introduced on additional arguments or adjuncts. Finally, as is the case with all the rules, penalties are explored in terms of semantic operators of the main predicate, like negation, modality and opacity-inducing verbs, which must either be absent or be identical in the T/H pair.
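An axiomatic paraphrase rule of this kind can be sketched as a small table that maps a conceptually complex predicate (or head-dependent pair) in one snippet to its simple counterpart in the other. The table entries and the matching function below are illustrative assumptions, not the system's actual rule base.

# Sketch of an axiomatic paraphrase rule in the spirit of 3.1.1: a hand-made
# table maps a conceptually complex predicate (or head-dependent pair) in one
# snippet to its simple counterpart in the other. The table entries and the
# matching function are illustrative assumptions, not the system's rule base.

PARAPHRASE_AXIOMS = {
    ("death", "toll"): "kill",          # "the death toll ..."  ->  "... were killed"
    ("lose", "life"): "kill",           # "lost their lives"    ->  "were killed"
    ("cast", "ballot"): "vote",         # "cast ballots"        ->  "vote"
}

def paraphrase_match(t_head, t_dependent, h_predicate):
    """Fire when the T head-dependent pair is an axiomatic paraphrase of H's predicate."""
    return PARAPHRASE_AXIOMS.get((t_head, t_dependent)) == h_predicate

# Pair 468: T contains the nominal expression "the death toll", H's main
# predicate is KILL, so the rule fires.
print(paraphrase_match("death", "toll", "kill"))   # True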

3.1.2 The Syntactic-Semantic Rule SubModule

The syntactic-semantic rule submodule is organized into a sequence of subcalls where the T/H pairs are checked for semantic similarity, starting from sameness of main predicates down to approximate semantic match. The first subcall requires the presence of the same HDs as main predicates with core arguments, i.e. the ones which have been computed as subject, object, indirect object, arg_mod (passive "by" agent adjunct), or xcomp. Nonconflicting SRs are checked in all subcalls: i.e. subject-agent is

allowed to match with arg_mod-agent, and subject-theme_affected with object-theme_affected, but not vice versa. These matches take care of what are usually referred to as lexical alternations for verb subcategorization frames, and lexical rules in LFG terms, which encompass such syntactic phenomena as passive, intransitivization, ergativization, dative shift, etc. The following calls, on the contrary, check predicate-argument structures for the soundness of Semantic Roles and Grammatical Relations. Scores are then produced, which are summed up and computed as a Weight. This is then finally evaluated at a higher level together with the high-level General Consistency Checks (see Fig. 2 below).

Fig. 2 Semantic Evaluation Phase 2
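The nonconflicting-role check of the first subcall can be sketched as a small table of admissible (GR, SR) correspondences; the entries are one-directional, so subject-agent may match arg_mod-agent (the passive "by" agent) but not the reverse. The table below is a partial, illustrative assumption, not the system's actual inventory.

# Sketch of the nonconflicting-role check: a (GR, SR) pair observed in the T
# member may only be matched by the correspondences listed here, so passives,
# ergatives etc. are covered without admitting the inverse mapping.
# The table is a partial, illustrative assumption.

ALLOWED_ALTERNATIONS = {
    ("subj", "agent"): {("subj", "agent"), ("arg_mod", "agent")},      # passive "by" agent
    ("subj", "theme_affected"): {("subj", "theme_affected"),
                                 ("obj", "theme_affected")},           # ergative/intransitive
}

def roles_compatible(t_slot, h_slot):
    """True when the H slot is an admissible realisation of the T slot."""
    return h_slot in ALLOWED_ALTERNATIONS.get(t_slot, {t_slot})

print(roles_compatible(("subj", "agent"), ("arg_mod", "agent")))   # True
print(roles_compatible(("arg_mod", "agent"), ("subj", "agent")))   # False ("not vice versa")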

The second subcall requires the presence of semantically similar HDs as a combination of main head and main dependent, plus at least another identical HD structure within the core argument subset. Other subcalls in this group check nominalization (derivational) relations intervening between the main predicates of T and H, which in one case are checked with edit distance measures. A number of additional rules check for semantic similarity, which can range from synonymy down to morphological derivation.
The third subcall takes as input a list of "light verbs" in semantic terms, i.e. verbs including "be", "have", "appear", and other similar copulative and locational verbs – like "live", "hold", "take_place", "participate", "work_for", etc. – which are used to either make a definition, assert a property of the subject, individuate a location of the subject, etc. These verbs are matched against main predicates and core arguments of the T portion, which must be identical to H. Quantitative measures are added to confirm the choice. Notable exceptions are sentences containing a "be_born" predication, which require specific constructions on the other member of the T/H pair.
The fourth subcall looks for different main predicates with core arguments, which however must be synonyms, non-antonyms and of non-negative polarity. In addition, there must be at least another important identical non-argument HD structure shared. Quantitative measures are added to confirm the choice. These cases must be treated appropriately to distinguish them from real opposite-meaning snippets, where the SE considers SRs which must also be opposite.
Here below we show the AHDS of an interesting case of a false pair, where negation must be recovered from a governing predicate, "look_to":
T/H pair n. 59
T: Although many herbs have been grown for centuries for medicinal use, do not look to herbal teas to remedy illness or disease.
H: Tea protects from illness.

AHDS for T snippet of 59 T/H pair:
use(sn10, mod, medicinal).
grow(cl4, iobj-goal, for, use-sn10).
grow(cl4, iobj-goal, for, century-sn9).
grow(cl4, subj-experiencer, herb-sn8, obj).
herb(sn8, det, many).
grow(cl3, cmod-subord, although, look_to).
use(sn7, mod, medicinal).
grow(cl3, iobj-goal, for, use-sn7).
grow(cl3, iobj-goal, for, century-sn6).
grow(cl3, subj-experiencer, herb-sn5, obj).
herb(sn5, det, many).
remedy(subj-agent, herbal_teas-sn1).
remedy(cl2, obj-theme_aff, illness-sn3).
remedy(cl2, obj-theme_aff, disease-sn2).
remedy(cl2, obj-theme_aff, disease_or_illness-sn4).
look_to(cl1, xcomp-prop, nil, remedy-cl2).
look_to(cl1, subj-agent, s-sn1).
look_to(neg, not).
look_to(aux, do).
grow(aux, be).
grow(aux, have).

AHDS for H snippet of 59 T/H pair:
protect(cl5, iobj-goal, from, illness-sn2).
protect(cl5, subj-agent, tea-sn1).
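For pair 59 the General Consistency Check has to notice that the relevant T predicate ("remedy illness") is governed by "look_to", which carries an explicit negation (look_to(neg, not)), whereas H carries none. A minimal sketch follows; the fact layout and helper names are assumptions made for the example.

# Sketch of the consistency check that makes pair 59 FALSE: the relevant T
# predicate ("remedy illness") sits under the governing predicate "look_to",
# which carries an explicit negation, while H has no negation.
# Fact layout and helper names are illustrative assumptions.

t_facts = [("look_to", "xcomp-prop", "remedy"),
           ("look_to", "neg", "not"),
           ("remedy", "obj-theme_aff", "illness")]

def governor_of(pred, facts):
    for head, rel, dep in facts:
        if dep == pred and rel.startswith("xcomp"):
            return head
    return None

def negated(pred, facts):
    return any(head == pred and rel == "neg" for head, rel, dep in facts)

relevant_t_predicate = "remedy"
gov = governor_of(relevant_t_predicate, t_facts)          # "look_to"
t_negated = negated(relevant_t_predicate, t_facts) or (gov and negated(gov, t_facts))
h_negated = False                                         # "Tea protects from illness"
print(t_negated != h_negated)   # True -> polarity mismatch, entailment rejected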

3.3 The Quantitative Module

In this module all Heads, Dependents, GRs and SRs are collected for each member of the T/H pair and then passed to a scoring function that takes care of identical or similar members by assigning a certain score to every hit. A threshold is then set at a value which should encode the presence of a comparatively high number of identical/similar linguistic items. As said above, higher scores are assigned to core GRs, and there is also a scale for SRs, where Agent receives the highest score. As with the previous subcalls, at the end of the computation semantic consistency and integrity are checked by collecting and comparing semantic operators, as well as searching for possible governing "doubt" verbs. Generally speaking, we also treat short utterances differently from long ones. A stricter check is performed whenever an utterance has 3 or fewer HD structures, the reason being that for such structures some of the above-mentioned subcalls would fail due to the insufficient information available.
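The final decision step of the Quantitative Module can be sketched as a threshold comparison with a stricter requirement for short utterances. The sketch below is illustrative only: both threshold values and the function name are assumptions, not the tuned parameters of the system.

# Sketch of the final decision step of the Quantitative Module: the summed
# similarity score is compared with a threshold, and short utterances
# (3 or fewer head-dependent structures) get a stricter requirement because
# several rule subcalls cannot fire on them. Both thresholds are assumptions.

def decide(score, n_hd_structures, threshold=4.0, short_utterance_factor=1.5):
    effective = threshold
    if n_hd_structures <= 3:               # short snippet: be stricter
        effective *= short_utterance_factor
    return "TRUE" if score >= effective else "FALSE"

print(decide(5.0, n_hd_structures=8))   # TRUE  (normal threshold)
print(decide(5.0, n_hd_structures=3))   # FALSE (stricter threshold for short H)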

4 Results and Discussion

The RTE task is a hard task: this may be partly due to the way in which it has been formulated – half of the snippets are TRUE, the other half are FALSE. It is usually the case that 10-15% of the mistakes are ascribable to the parser or to other analysis tools; another 5-10% of the mistakes will certainly come from insufficient semantic information.

Table 1. Results for First Run

                    IE      IR      QA      SUMM    TOTAL
Accuracy            0.5350  0.5350  0.5300  0.5900  0.5475
Average Precision   0.5627  0.5385  0.5237  0.6096  0.5495

The system correctly classified 278/400 False-annotated snippets and 176/400 True-annotated snippets. As a result, the system misclassified 346 pairs out of the 800 test-set examples. The majority of mistakes – 202, i.e. 58.38% – were false negatives, i.e. T/H pairs which the system classified as false but were annotated as true. The number of false positives, which the system wrongly classified as true, is thus 144, i.e. 41.61%. If we compute the internal consistency of the algorithm's classification, we come up with the following data: 144 false positives over an overall number of 320 pairs classified as true by the system corresponds to 55% precision for true snippets; 202 false negatives over a total of 480 pairs classified as false by the system corresponds to 57.92% precision for false snippets. In a second run we modified some parameters to allow more true decisions to take place: as can be seen below, the overall results did not change.

Table 2. Results for Second Run

                    IE      IR      QA      SUMM    TOTAL
Accuracy            0.5350  0.5350  0.5300  0.5900  0.5475
Average Precision   0.5627  0.5385  0.5237  0.6096  0.5495

We looked into our mistakes to evaluate the impact of the parser on the final recall, and we found out that 15 snippets out of 100 TRUE ones have a wrong parse which can be regarded as the main cause of the mistake. In other words, only 15% of the wrong results can be ascribed to bad parses. The remaining 25% is due to insufficient semantic information. In turn, this may be classified as follows:
- 80% is due to lack of paraphrases and definitions;
- 10% is due to wrong Semantic Role assignment;
- 10% is due to lack of synonym/antonym relations.

References

1. Bos, J., Clark, S., Steedman, M., Curran, J., Hockenmaier, J.: Wide-coverage semantic representations from a CCG parser. In: Proc. of the 20th International Conference on Computational Linguistics, Geneva, Switzerland (2004)
2. Raina, R., et al.: Robust Textual Inference using Diverse Knowledge Sources. In: Proc. of the 1st PASCAL Recognising Textual Entailment Challenge Workshop, Southampton, U.K. (2005) 57-60
3. Delmonte, R.: Text Understanding with GETARUNS for Q/A and Summarization. In: Proc. ACL 2004 - 2nd Workshop on Text Meaning & Interpretation, Barcelona, Columbia University (2004) 97-104
4. Delmonte, R.: Semantic Parsing with an LFG-based Lexicon and Conceptual Representations. Computers & the Humanities, 5-6 (1990) 461-488