A Distributed Shallow Approach to Reference Resolution in the Context of Information Extraction

Thierry Declerck and Günter Neumann
DFKI GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany

fdeclerck,[email protected]

Paper ID: 79
Keywords: anaphora resolution, information extraction
Contact Author: Thierry Declerck
Under consideration for other conferences (specify)?

Abstract

In this paper we present the design of a reference resolution module (RRM) within an Information Extraction (IE) system for German free text. We show in some detail how such a module can be distributed over the sub-components of a shallow NL processing chain, defining a modular approach to reference resolution. We also briefly discuss a recent extension of this work in the context of an EU-funded project, in which the German IE system is being integrated into a multilingual and multimedia environment and in which the contribution of multiple sources to improving the quality of IE is under investigation.


Introduction

Information Extraction (IE) systems identify, collect and normalize relevant information from natural language (NL) text. The information of interest is typically prespecified in the form of uninstantiated, domain- and application-specific frame-like structures called templates. The major task of an IE system is thus the identification of the relevant parts of the (free) text that can be used to fill a template's slots. Filled templates can further be combined and/or merged in order to put the information into a coherent succession of events (for example, in the case of management succession scenarios).

Reference resolution is necessary in order to ensure and improve both the template filling and the template combining/merging tasks. In the first case it is important, for example, to know when a pronoun refers to a domain-relevant entity, so that the pronoun (with a pointer to its related referential expression) can be selected for filling the appropriate template slot. In the second case, it is useful to know when distinct linguistic realizations of domain-relevant entities are coreferent, so that referential continuity can be established through a set of filled templates. It is also important to handle ellipsis resolution, since otherwise, due to the lack of an overt syntactic realization, some domain-relevant entities, relations and events would not be detected.

For all those reasons we have been defining and implementing a general approach to reference resolution, dealing with anaphora, coreference [1] and nominal ellipsis resolution [2]. Following a generally recommended strategy for handling coreferentiality in IE [3], we designed a separate module for reference resolution, which we call RRM.

Since IE systems have to deal with large collections of unrestricted texts, the linguistic processing component of such systems has to be robust and efficient, and most IE systems therefore use shallow parsing techniques. It is in such a (cascaded) shallow processing architecture that we integrated RRM. But contrary to most of the IE systems known to us [4], RRM does not just apply to the result of the whole shallow NL processing chain; it is interleaved with the subsystems of the NLP component and defines resolution rules at the various levels of linguistic processing, thus proposing a modular approach to reference resolution [5].

In the rest of the paper we briefly present the shallow text processing core engine in which RRM is integrated, then describe the phenomena for which we have been implementing the reference resolution algorithm, and give an informal description of the strategy applied. Finally we describe new strategies under consideration for the integration of the IE system into a multilingual and multimedia environment, where we also investigate the use of distinct types of sources for improving the performance of RRM and of the IE task.

[1] We adopt the definition of "coreference" as a symmetrical and transitive relation between referential NPs, and thus distinguish coreference from anaphora resolution, since the latter involves at least one non-referential element (the pronoun) and is not inherently symmetric. We stress this distinction here because the word "coreference" is often used in the literature in the more general sense of two items sharing a reference.

[2] In our current implementation, only "nominal" ellipses are handled: we look for missing heads in nominal sequences and generate (reconstruct) the corresponding NP. (Lappin and Shih, 1996) propose an algorithm for ellipsis resolution that takes verbal constructs into account, which is the common view on ellipsis.

[3] For example, (Appelt, 1999) notes that anaphoric elements are so constrained in the way they can match their possible antecedents that a template unification algorithm alone could not solve this task in a satisfying way, and therefore stresses the need for a separate coreference module coming into play between the output of the NLP components and the domain-specific analysis.

1 Overview of the shallow core text processor

In this section we introduce the architecture of the (cascaded) shallow text processor in which the reference resolution module has been embedded. The shallow text processor has been designed specifically for efficient and robust processing of German text documents (see (Piskorski and Neumann, 2000)). German is a language with free word order, where morpho-syntactic information is encoded via a rich inflectional system (e.g., different case and tense forms). Additionally, German has productive ways of creating words by means of compounding, including proper names ("Siemensgeneralvertreter", the general representative of Siemens) and hyphen coordination ("An- und Verkauf", purchase and sale). With respect to syntactic analysis, German allows for nearly arbitrary ordering of phrases and for the splitting of verb groups into separate parts, between which other phrases may be inserted.

[4] See for example (MUC, 1995), (MUC, 1998) or the general description of IE technology in (Appelt, 1999).

[5] The design and the coverage of the reference resolution module have been directed by a manual corpus analysis and are therefore text oriented, but various theoretical insights from distinct binding theories (GB and HPSG) and discourse theories (Discourse Representation Theory (DRT) and Dynamic Predicate Logic (DPL)) have been taken into account. We also benefited from work on shallow approaches to anaphora resolution for English, for example (Kennedy and Boguraev, 1996) and (Mitkov, 1998b), the latter also proposing solutions for Arabic and Polish.

1.1 Architecture

The architecture consists of two major components: a) the Linguistic Knowledge Pool (lkp), which is not discussed in this paper, and b) stp, the shallow text processor itself, which processes an NL text through a chain of modules (see (Piskorski and Neumann, 2000) for more details). Two primary levels of processing are distinguished, the word level and the phrase level.

Word level processing is subdivided into several components. First, a text tokenizer maps sequences of characters into larger units, usually called tokens. Each token that is identified as a potential word form is then lexically processed by a subsequent lexical processor. Lexical processing includes morphological analysis and on-line recognition of compounds and hyphen coordination. Afterwards, control is passed to the final step of processing on the word level, a POS filter, which performs word-based disambiguation.

The overall task of phrase level processing is to construct a hierarchical structure over the words of a sentence. This level in stp is divided into three modules. First, a named entity finder identifies specialized expressions for time, date and several proper name expressions. Then a phrase recognizer identifies general nominal phrases, prepositional phrases and verb groups. In the final step, a clause recognizer analyzes the dependency-based structure of the fragments of each sentence, following a divide-and-conquer strategy as described in (Neumann et al., 2000). All those modules are based on finite-state devices. The result of each component, as well as of the whole processing, is available in the form of XML-annotated text.

It is important to stress here stp's high modularity: each component can be used in isolation. This capability was also one of the motivations for implementing a distributed algorithm for reference resolution: adding available coreference information to an NLP component helps, on the one hand, to improve the quality of the applications building on this component and, on the other hand, supports the subsequent processing components.
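The cascaded organization of section 1.1 can be pictured with a minimal sketch. It is not the stp implementation: the `Document` type, the `make_stage` helper and the stage names are hypothetical, and each stage is a placeholder that only records that it ran. The sketch only illustrates that stages are chained and that any sub-chain can be run in isolation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Text plus the names of the annotation layers added so far (hypothetical type)."""
    text: str
    layers: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(doc: Document) -> Document:
        return Document(doc.text, doc.layers + [name])
    return stage


# Stage names follow the description in section 1.1; the bodies are stand-ins.
WORD_LEVEL = [make_stage(n) for n in ("tokenizer", "lexical_processor", "pos_filter")]
PHRASE_LEVEL = [make_stage(n) for n in
                ("named_entity_finder", "phrase_recognizer", "clause_recognizer")]


def run(doc: Document, stages: List[Stage]) -> Document:
    """Apply the stages in order, each one consuming the previous output."""
    for stage in stages:
        doc = stage(doc)
    return doc


full = run(Document("An- und Verkauf"), WORD_LEVEL + PHRASE_LEVEL)
word_only = run(Document("An- und Verkauf"), WORD_LEVEL)  # a sub-chain used in isolation
```

In such a layout, reference resolution rules could be attached to individual stages rather than to the end of the chain, which is the distribution the paper argues for.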

2 The Reference Resolution Module (RRM)

2.1 A Multi-Layered and Modular Approach to Reference Resolution

RRM implements a general reference resolution strategy, dealing with (nominal) ellipsis, pronoun and coreference resolution tasks at various processing levels of the generic (cascaded) linguistic components of the IE system [6], thus defining a multi-layered and modular strategy for reference resolution. This strategy also allows us to verify for German some of the results achieved by studies on anaphora resolution at lower levels of NL processing, at a pre-parser level (Kennedy and Boguraev, 1996) or even at a level involving still less linguistic knowledge (Mitkov, 1998b). Our approach can therefore also be seen in the context of work aiming at robust reference resolution. But the algorithms implemented in RRM will not always force a decision when there are multiple candidates for a resolution, since in the cascaded chunk approach some decisions can be left open for a later processing step [7].

[6] Excluding for the time being the clause recognizer module.

[7] The last processing step is not necessarily the parser (with detection of grammatical functions, GFs). In IE systems one can delay some decisions until the template filling/merging task (including domain analysis), thus deferring decisions that involve semantics or domain knowledge to the level where this information is fully available and can be processed by the corresponding techniques. (Gaizauskas and Humphreys, 1996) describe interesting strategies for resolving references at the level of (template-based) discourse modeling.

2.2 Phenomena handled by RRM

Local and non-local phenomena are dealt with, locality being defined by, among others, syntactic criteria such as the c-command criterion. The non-local phenomena can be intra- or inter-sentential (anaphora), or even go beyond paragraph boundaries (coreference). RRM handles a variety of pronouns (personal, demonstrative, reflexive and relative), some types of nominal ellipsis for the local and non-local cases, and also some types of referential expressions for detecting coreference relations. Not all possible instances of those phenomena are covered yet, but we assume that the solutions proposed for some instances of a phenomenon can be applied to other instances. A list of (instances of) phenomena is given below:

1. local ellipsis resolution at the level of compound analysis: "Rohstoff- und Handelsunternehmen" => "Rohstoffunternehmen und Handelsunternehmen" (Raw material and trade enterprises)

2. non-local ellipsis at the level of fragment recognition: "Hatten die Hersteller noch vor der Wende 318 000 Mitarbeiter, waren es zu Jahresmitte 1992 nur noch 50 000." => "Hatten die Hersteller noch vor der Wende 318 000 Mitarbeiter, waren es zu Jahresmitte 1992 nur noch 50 000 Mitarbeiter." (The producers had 318,000 workers right before the new era, and by mid-1992 only 50,000 were left.)

3. resolution of reflexive pronouns: "Da flüchten [sich] [die einen] ins Ausland"; "[Herbert Henzler], Chef von McKinsey in Deutschland, muss [sich] ..." ([Some] have been searching for [themselves] a solution abroad; [Herbert Henzler], Chief of McKinsey in Germany, sees [himself] ...)

4. resolution of relative pronouns: "[Abhängige Gesellschaften], [die] das Größenkriterium erfüllen ..."; "Die Zuordnung der Unternehmen [zu den 14 Branchen], ..." ([Dependent companies] [that] comply with the size criterion ...; The association of the enterprises with [the 14 branches] [whose] rankings ...)

5. resolution of personal and demonstrative pronouns: "[Henzler], der ... seinen ehemaligen [Direktoren Hartmut Emans und Axel Eckhardt] fristlos kündigte, als [er] erfahren hatte, dass [diese] sich ..." ([Henzler], who ... fired [his directors] without notice ..., when [he] heard that [those] ...)

6. detection of coreference: "[Der Münchner Strickwarenhersteller] [Marz GmbH]"; "... [Mannesmann Mobilfunk GmbH] ([MMO])"; "[Bundeskanzler] [Helmut Kohl]" ([The knitted goods producer] [Marz Inc.] from Munich; [Mannesmann Mobilfunk Inc.] ([MMO]); [The federal chancellor] [Helmut Kohl])

7. detection of relationships: "[Peter Mihatsch], Vorsitzender der Geschäftsführung der [Mannesmann Mobilfunk GmbH]" ([Peter Mihatsch], {president of the conduct of business} of [Mannesmann Mobilfunk GmbH])

In the case of 1 & 2 the reader can recognize the resolved ellipsis in the italic form, the information coming from the term written in bold face. In the cases of pronominal resolution (3, 4 & 5), both the pronoun and the corresponding referential expression are enclosed in square brackets. The reader can see that in some cases (3) the algorithm also has to search to the right in order to find the referential element for a pronoun. In the case of relative pronouns (4) one can see that the relative (possessive) pronoun "deren" is part of an NP structure, so that the algorithm also has to go into syntactic structures to detect anchors for referential relations. In the example given in 5 two anaphoric relations are present, involving a personal pronoun (marked with bold face font) and a demonstrative pronoun (marked with italic font).

The elements of a coreference relation (6) are detected within appositive structures and are marked with square brackets. Coreference is detected between named entities and their corresponding acronyms, and between designators acting as definite descriptions and person names. Concerning the detection of relationships between entities (7, also within an appositive structure), the entities are marked with square brackets whereas the type of relation is printed in italic.

The detection of relationships between named entities is not really a case of reference resolution, but it helps in detecting such cases. For example, in "Der MMO-Chef lässt derzeit ausloten", "MMO-Chef" can be resolved as (possibly) being "Peter Mihatsch", since we know that "MMO" is coreferent with "Mannesmann Mobilfunk GmbH" and that a person called "Peter Mihatsch" is the Vorsitzender der Geschäftsführung of this company, which is semantically very close to "Chef" [8]. The approach presented here thus allows, to a certain extent, inferences for coreference detection to be proposed on the basis of accurate named entity recognition.
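The "MMO-Chef" inference described in section 2.2 can be sketched as a lookup over two small tables. This is an illustration under stated assumptions, not the RRM code: the tables `COREFERENT` and `RELATIONS`, the set `CHEF_LIKE_ROLES` (standing in for a semantic resource that says which roles are close to "Chef") and the function name are all hypothetical. The function returns candidates only, in line with the paper's policy of not forcing a decision at this level.

```python
from typing import List, Tuple

# Acronym -> full name, as detected in "[Mannesmann Mobilfunk GmbH] ([MMO])".
COREFERENT = {"MMO": "Mannesmann Mobilfunk GmbH"}

# (person, relation, company), as detected in the appositive structure of example 7.
RELATIONS: List[Tuple[str, str, str]] = [
    ("Peter Mihatsch", "Vorsitzender der Geschäftsführung", "Mannesmann Mobilfunk GmbH"),
]

# Assumed stand-in for a semantic net: roles considered close to "Chef".
CHEF_LIKE_ROLES = {"Vorsitzender der Geschäftsführung"}


def candidates_for(compound: str) -> List[str]:
    """Return candidate persons for a compound such as 'MMO-Chef'."""
    org_part, sep, role = compound.partition("-")
    if not sep or role != "Chef":
        return []
    company = COREFERENT.get(org_part, org_part)  # expand the acronym if known
    return [person for person, relation, org in RELATIONS
            if org == company and relation in CHEF_LIKE_ROLES]


mmo_chef = candidates_for("MMO-Chef")
```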

2.3 The Processing Levels on which RRM applies

RRM currently applies at three distinct levels of NL processing:

2.3.1 Compound Analysis

As we saw in section 2.2, some reference resolution is already proposed at the level of morphological compound analysis. The traditional compound analysis has been extended to the so-called "hyphen coordination" of words, where a substring common to the conjuncts is realized only in the last conjunct of the coordination, whereas in the other conjuncts a hyphen marks the elision: "Obst-, Gemüse- und Milcherzeugnispreise sind..." (The prices of fruits, vegetables and milk products are...). The successful compound analysis of the last conjunct is extended to the other conjuncts, substituting the hyphen mark with the detected compound suffix: "Obstpreise, Gemüsepreise und Milcherzeugnispreise sind...". The whole (resolved) sequence is marked by the algorithm as a "hyphen coordination".

[8] At this level of processing we just consider "Peter Mihatsch" as a candidate for the interpretation of "MMO-Chef". The final decision can be taken after consulting other resources, such as semantic nets or domain knowledge. What is important here is that the algorithm can provide information for subsequent inferential calculus.
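The hyphen-coordination expansion of section 2.3.1 can be sketched in a few lines. This is a minimal illustration, not the authors' finite-state implementation: the function name is hypothetical, and the compound suffix (here "preise") is assumed to have already been delivered by the compound analysis of the last conjunct.

```python
import re


def expand_hyphen_coordination(phrase: str, suffix: str) -> str:
    """Replace the elision hyphen of each truncated conjunct with `suffix`.

    A truncated conjunct is a word ending in '-' that is followed by a comma,
    whitespace or the end of the string. `suffix` is assumed to come from the
    compound analysis of the last conjunct.
    """
    return re.sub(r"(\w+)-(?=[,\s]|$)", lambda m: m.group(1) + suffix, phrase)


resolved = expand_hyphen_coordination("Obst-, Gemüse- und Milcherzeugnispreise", "preise")
```

Hyphens inside a word (as in "MMO-Chef") are left untouched, because the pattern only matches a hyphen at the end of a token.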

2.3.2 Named Entities Recognition

At this level we look at all kinds of coreference involving named entities. For example, in the case of the word sequence "Martin Marietta Corp." occurring in our corpus, the whole string is detected as a company name (because of the designator "Corp."), and if the string "Martin Marietta" (or "Marietta" alone) occurs later in the text, this string is marked as coreferent with the first (complete) string introducing the named entity. This feature is also used to establish a coreference relation between all occurrences of the substrings of the detected NE. If one of the substrings occurring alone is also encoded as a normal word in the generic lexicon [9], an ambiguity is generated (to be resolved at subsequent processing levels) and the expression is considered to act as a candidate NE.
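The substring-based coreference of section 2.3.2 can be sketched as follows. The sketch rests on assumptions: `DESIGNATORS` and `GENERIC_LEXICON` are tiny hypothetical stand-ins for the system's resources, and the function names and the three status labels are invented for illustration.

```python
from typing import Dict, List, Set

DESIGNATORS = {"Corp.", "GmbH", "Inc."}  # assumed sample of company designators
GENERIC_LEXICON = {"Kohl"}               # assumed sample: names that are also normal words


def partial_names(full_name: str) -> Set[str]:
    """All contiguous token sequences of a full name, designators excluded."""
    tokens = [t for t in full_name.split() if t not in DESIGNATORS]
    return {" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)}


def resolve_mention(mention: str, entities: List[str]) -> Dict[str, object]:
    """Link a later mention to previously introduced named entities."""
    antecedents = [e for e in entities if mention in partial_names(e)]
    if not antecedents:
        status = "unresolved"
    elif mention in GENERIC_LEXICON:
        status = "candidate"  # ambiguous with a normal word; left to later levels
    else:
        status = "coreferent"
    return {"mention": mention, "antecedents": antecedents, "status": status}


marietta = resolve_mention("Marietta", ["Martin Marietta Corp."])
kohl = resolve_mention("Kohl", ["Helmut Kohl"])
```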

2.3.3 Linguistic Fragment Recognition

At this level we handle non-local (nominal) ellipsis resolution: we try to find, in the preceding text, larger structures that provide information for the syntactic reconstruction of the elided structure, and for this we need the structural information delivered by the fragment recognizer. We also handle all kinds of pronominal resolution at this level [10], using agreement information as a filter for the detection of antecedent candidates. We prefer to solve this task at this level rather than just looking at the items tagged as nouns by the POS filter, since computing the agreement information of a phrase helps to exclude many of the morphological ambiguities attached to a lexical item and thus avoids considering certain candidates at the word level (this remark being especially valid for morphologically rich languages like German). Recognized NPs also carry de facto information about the definiteness of a nominal expression, which is important for determining the binding scope of NPs (as has been shown by various semantic discourse theories). A further advantage of performing anaphora resolution at this level of processing is that attaching a pronoun to a whole fragment also binds the pronoun to the modifiers of the head noun, in case the fragment is an NP [11]. In the case of multiple candidates for the antecedent relation, we have established certain preferences depending on the distance and the type of pronoun involved, but we take no final decisions at this level. The preference rules defined at this level of the analysis can be overruled by stronger rules defined at a subsequent level of processing.

[9] A famous example is the former chancellor of Germany, Helmut Kohl, "Kohl" also meaning "cabbage".

[10] At least, RRM establishes a list of candidates and proposes some preferences, since in some cases the final decision cannot be taken at this level. For example, when agreement and distance do not suffice to determine an antecedent uniquely, the detection of GFs can often help in taking a decision.
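The use of agreement as a filter and of distance as a preference, as described in section 2.3.3, can be sketched as below. This is a simplified illustration, not the RRM rules: the `Fragment` type and its feature values are hypothetical, only preceding NPs are considered (the paper also searches to the right for some reflexives), and gender is assumed to be irrelevant in the plural. The function returns a ranked candidate list and takes no final decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    """A recognized fragment with phrase-level agreement features (hypothetical type)."""
    text: str
    position: int  # order of appearance in the text
    gender: str    # "masc", "fem", "neut", or "" when not distinguished
    number: str    # "sg" or "pl"


def agrees(pronoun: Fragment, np: Fragment) -> bool:
    """Number must match; gender is only checked in the singular (simplifying assumption)."""
    if pronoun.number != np.number:
        return False
    return pronoun.number == "pl" or pronoun.gender == np.gender


def antecedent_candidates(pronoun: Fragment, nps: List[Fragment]) -> List[Fragment]:
    """Preceding NPs that agree with the pronoun, nearest first."""
    agreeing = [np for np in nps if np.position < pronoun.position and agrees(pronoun, np)]
    return sorted(agreeing, key=lambda np: pronoun.position - np.position)


# Example 5 of section 2.2, reduced to its bracketed elements.
henzler = Fragment("Henzler", 0, "masc", "sg")
direktoren = Fragment("Direktoren Hartmut Emans und Axel Eckhardt", 1, "masc", "pl")
er = Fragment("er", 2, "masc", "sg")
diese = Fragment("diese", 3, "", "pl")

er_candidates = antecedent_candidates(er, [henzler, direktoren])
diese_candidates = antecedent_candidates(diese, [henzler, direktoren])
```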

3 Evaluation

The evaluation data we currently offer are probably not fully representative, since we still do not have a large amount of annotated data for testing reference resolution. But we assume that the few evaluation data available for the time being indicate a trend regarding the feasibility of the task. Around 3000 words of our corpus [12] have been annotated with respect to pronoun resolution. The evaluation has been done using the recall/precision metrics. For the time being we only check whether the right candidates have been found; we are not evaluating the preference marking, since we have not done enough work on this topic (we did not consider, for example, the attribution of weights). We are also not counting the reference chains in this small evaluation study. Concerning anaphora resolution, 72 links have been annotated [13] and the system achieved a recall of 62.5% (45 links correctly detected) and a precision of 88.2% (51 links detected). These figures show that our approach is quite promising, also taking into consideration that we have been evaluating the very first implementation of the anaphora resolution, guided only by a small corpus analysis, and that improvements can be expected. We still have to provide a real overall evaluation, which will be done in the context of a new project involving the IE system (see below).

[11] The fragment is an NP in most cases, but sometimes pronouns can also refer to a whole event, which is normally expressed by a complete clause.

[12] The corpus consists of articles taken from a German weekly business newspaper, the "Wirtschaftswoche".
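The recall and precision figures of section 3 follow directly from the three counts reported there (72 annotated links, 51 detected links, 45 correctly detected links). The short check below only reproduces that arithmetic; the variable names are ours.

```python
annotated_links = 72  # links in the manually annotated sample
detected_links = 51   # links proposed by the system
correct_links = 45    # proposed links that match the annotation

recall = correct_links / annotated_links    # share of annotated links that were found
precision = correct_links / detected_links  # share of proposed links that are correct

recall_pct = round(recall * 100, 1)
precision_pct = round(precision * 100, 1)
```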

4 First Conclusions

We have presented an integrated modular approach to reference resolution in the context of shallow NL analysis. The RRM module has been successfully implemented and already covers most of the phenomena necessary for high-level IE. We have shown that the combination of a parameterizable search strategy for referential elements with look-up in (sophisticated) dynamic lexicons makes it possible to achieve a dynamically enriched treatment of referential information already at a low level of linguistic processing, confirming and further developing hypotheses made by (Kennedy and Boguraev, 1996) and (Mitkov, 1998b). In general, the design and implementation of RRM showed that reference resolution can provide very rich information at the level of shallow processing, thus easing the task of further syntactic or semantic processing of the text and improving the quality of NLP applications that build on the output of an integrated shallow processing.

In future work we will extend this strategy to the parsing component (including the detection of grammatical functions) and to the domain analysis (going here into the topic of template combination and merging). We expect these processing steps to provide some support for decisions in cases left open, either because there was no possibility of expressing a strong preference at earlier stages or because some items were merely declared as candidates. The parser and the domain analysis will thus be freed from the detection of coreferentiality and concerned only with consolidation work.

[13] The type of text plays an important role: for example, (Mitkov, 1998b) reports the presence of only 71 pronouns in a technical manual of 140 pages.

5 Actual Developments

As mentioned in the introduction, the German IE system described in this paper plays a role in an EU-funded project that recently started. A major goal of the project is to merge the formal annotations extracted (with the help of IE) from textual and audio material (including the audio part of videos) on a big soccer competition in three languages: English, German and Dutch. This gives us the opportunity to extend our strategy to multilingual IE (within the soccer domain), where we will propose further modularization and parameterization of the system (it might be, for instance, that some phenomena we handle for German at the level of fragment recognition can be successfully handled for other languages at the processing level of POS tagging). But not only is a generic multilingual reference resolution strategy being developed; the availability of various documents on one event (a soccer game) also extends the resolution task to a cross-document (and cross-lingual) task.

The set of available textual documents on soccer games can be classified into free texts (general reports about specific games), semi-formal texts (tickers, closed captions, action databases), and formal texts (formal descriptions of specific games). Since the information contained in formal texts can be considered a database of true facts, these texts play an important role for the IE systems involved. The few selected pieces of information about a game that such formal texts contain (the team-wise listing of the players involved, the final score, the goals, the substitutions, penalties, yellow and red cards) are specially processed by the IE system, which extracts selected information on entities (such as the names and ages of the players, the clubs they normally play for, etc.) and stores it in a "referential" database, which is dynamically extended while other kinds of documents are analysed. This database thus offers dynamic storage of information for performing reference resolution on more "verbose" types of texts, which in the context of soccer are quite "poetic" with respect to the naming of agents (the "Kaiser" for Beckenbauer, the "Bomber" for Mueller, etc.); such resolution would be quite difficult to achieve with only the referential information available within the boundaries of one document.

The project will also investigate the use of inferential mechanisms (in the context of template merging) for supporting reference resolution. For example, "knowing" from the formal texts the final score of a game and the names of the scorers, the following kind of formulation in a free text (in any language) can be resolved: "With his decisive goal, the 'Bomber' gave the victory to his team."

References

Doug E. Appelt. 1999. An introduction to information extraction. AI Communications, 12.
Thierry Declerck and P. Wittenburg. 2001. Mumis - a multimedia indexing and searching environment. In Proceedings of the 1st International Workshop on MultiMedia Annotation, MMA-2001, Tokyo.
Thierry Declerck. 1996. Dealing with cross-sentential anaphora resolution in ALEP. In Proceedings of the 16th International Conference on Computational Linguistics, COLING-96, pages 280-286, Copenhagen.
Robert Gaizauskas and K. Humphreys. 1996. Quantitative evaluation of coreference algorithms in an information extraction system. In Proceedings of the Discourse Anaphora and Anaphor Resolution Colloquium, DAARC-96.
Kevin Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, and Y. Wilks. 1998. University of Sheffield: Description of the LaSIE-II system as used for MUC-7. In SAIC, editor, Proceedings of the 7th Message Understanding Conference, MUC-7, http://www.muc.saic.com/. SAIC Information Extraction.
Christopher Kennedy and B. Boguraev. 1996. Anaphora for everyone: Pronominal anaphora resolution without a parser. In Proceedings of the 16th International Conference on Computational Linguistics, COLING-96, pages 113-118, Copenhagen.
Hans-Ulrich Krieger and U. Schaefer. 1994. TDL - a type description language for constraint-based grammars. In Proceedings of the 15th International Conference on Computational Linguistics, COLING-94, pages 893-899.
Shalom Lappin and H-H. Shih. 1996. A generalized algorithm for ellipsis resolution. In Proceedings of the 16th International Conference on Computational Linguistics, COLING-96, pages 687-692, Copenhagen.
Ruslan Mitkov. 1998a. The latest in anaphora resolution: going robust, knowledge-poor and multilingual. Procesamiento del Lenguaje Natural, 23:1-7.
Ruslan Mitkov. 1998b. Robust pronoun resolution with limited knowledge. In Proceedings of the 17th International Conference on Computational Linguistics, COLING-98, pages 869-875, Montreal.
MUC, editor. 1995. Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann.
MUC, editor. 1998. Seventh Message Understanding Conference (MUC-7), http://www.muc.saic.com/. SAIC Information Extraction.
Guenter Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. 1997. An information extraction core system for real world German text processing. In Proceedings of the 5th Conference on Applied Natural Language Processing, ANLP-97, pages 209-216.
Guenter Neumann, C. Braun, and J. Piskorski. 2000. A divide-and-conquer strategy for shallow parsing of German free texts. In Proceedings of the 6th Conference on Applied Natural Language Processing, ANLP-00.
Jakub Piskorski and G. Neumann. 2000. An intelligent text extraction and navigation system. In Proceedings of the 6th Conference on Recherche d'Information Assistée par Ordinateur, RIAO-2000.
