Using a Language Independent Domain Model for Multilingual Information Extraction

Saliha Azzam, Kevin Humphreys, Robert Gaizauskas and Yorick Wilks
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK

Abstract

The volume of electronic text in different languages, particularly on the World Wide Web, is growing significantly, and the problem of users who are restricted in the number of languages they read obtaining information from this text is becoming more widespread. This paper investigates some of the issues involved in achieving multilingual Information Extraction (IE), describes the approach adopted in the M-LaSIE-II IE system which addresses these problems, and presents the results of evaluating the approach against a small parallel corpus of English/French newswire texts. The approach is based on the assumption that it is possible to construct a language independent representation of concepts relevant to the domain, at least for the small well-defined domains typical of IE tasks, allowing multilingual IE to be successfully carried out without requiring full Machine Translation.
This work was supported by EC DGXIII grant LE 2238-1 (AVENTINUS) and EPSRC grant GR/K25267 (GATE/LaSIE).

INTRODUCTION

The development of Language Engineering applications, Information Extraction (IE) in particular, has demonstrated a need for the full range of NLP and AI techniques, from syntactic part-of-speech tagging through to knowledge representation and reasoning. IE is the mapping of unstructured natural language texts (such as newswire reports, newspaper and journal articles, patents, electronic mail, World Wide Web pages, etc.) into predefined, structured representations, or templates, which, when filled, represent an extract of key information from the original text. This information pertains to entities of interest in the application domain (e.g. companies or persons), or to relations between such entities, usually in the form of events in which the entities take part (e.g. company takeovers, management successions). Once extracted, the information can then be stored in databases to be queried, mined, summarised in natural language, etc.

The volume of electronic text in different languages, particularly on the World Wide Web, is growing significantly, and the problem of users who are limited in the number of languages they read accessing information in this text is becoming more widespread. The ability to extract information in a standard form from texts in a variety of languages therefore has considerable potential.

IE systems are achieving increasingly accurate and informative results, as reviewed in (Gaizauskas and Wilks, 1998). The best systems in the MUC-6 (DARPA, 1995) and MUC-7 (DARPA, 1998) evaluations achieved over 50% combined recall and precision, against human performances of around 80-85%. However, the current IE systems are largely language-specific and little work has been done so far in considering how to extend them to do multilingual IE (though see (Kameyama, 1996), which addresses these general issues and focuses on crosslinguistic named entity extraction, and (Gaizauskas et al., 1997)).

This paper investigates some of the issues involved in achieving multilingual IE, and describes the M-LaSIE-II multilingual IE system which has been built for French and English. Spanish and German are being added in the context of the AVENTINUS project (Cunningham et al., 1996), which aims to construct a multilingual IE system for drug enforcement. We also report on an evaluation of the system using the MUC-6 IE task definition and evaluation software.

We discuss several alternative approaches to constructing a multilingual IE system, describing in detail the strategy we have chosen. Our choice is based on the assumption that it is possible, when dealing with a particular domain or application, to construct a language-independent domain model, that is, a representation of the concepts relevant to the domain. A representation of language-independent universal concepts is a familiar, if uncomfortable, idea in theoretical NLP generally, but its practical application for the small well-defined domains typical of IE tasks, where nothing like full machine translation (MT) is required, does not seem to have been exploited.
IE AND MULTILINGUALITY

Consider a simple multilingual IE task where we have texts in French and English, and the user requires templates to be filled in English from texts in either language. Suppose the template definition specifies information concerning business meetings (organisations and individuals involved, location, date, etc.), so that both the following examples should give rise to the same template fill:

Fr: Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres.
En: Gianluigi Ferrero attended the annual meeting of Vercom Corp in London.

    :=
        ORGANISATION: ’Vercom Corp’
        LOCATION:     ’London’
        TYPE:         ’annual meeting’
        PRESENT:
    :=
        NAME:         ’Gianluigi Ferrero’
        ORGANISATION: UNCLEAR
Note: Recall is the number of template slot values correctly filled by the system against the total number of slot values in the manually filled templates, and Precision is the number of slot values correctly filled by the system against the number of slot values filled by the system in total.

We see three main ways of addressing the problem:
1. A full French-English MT system translates all the French texts to English. An English IE system then processes both the translated and the English texts to extract English template structures. This solution requires a separate full IE system for each language and a full MT system for each language pair.

2. Separate IE systems process the French and English texts, producing templates in the original source language. A ‘mini’ French-English MT system then translates the lexical items occurring in the French templates. This solution requires a separate full IE system for each language and a mini-MT system for each language pair.

3. A general IE system, with separate French and English front ends, uses a language-independent domain model to produce a language-independent representation of a text, called a discourse model. The discourse model will allow the extraction of the required information via mappings from ‘concepts’ to an English lexicon, to produce templates with English lexical items. This solution requires a separate syntactic/semantic analyser for each language, and the construction of mappings between the domain model and a lexicon for each language (see Figure 1).

Of the three alternative architectures we have implemented the last because, in principle, it allows new languages to be integrated independently, with no requirement to consider language pairs. The rest of the paper expands on the main issues involved in this approach.
THE M-LASIE-II SYSTEM

In this section we first give an overview of the M-LaSIE-II system, then focus on how a language-independent representation of domain-specific text content is derived and used in M-LaSIE-II. The following section describes the intermediate semantic representation language, the quasi-logical form (QLF), which each language-specific parser is required to produce. The QLF is still language-dependent, but it is the first step in moving towards a common language-independent representation. The next sections describe the process of discourse interpretation, in which the QLF representations of successive sentences are merged into a language-independent model of the whole text, and how the underlying language-independent representation is used to generate language-specific templates. Finally, we discuss some residual problems with the approach.
System Overview

We now outline the M-LaSIE-II (Multilingual Large Scale Information Extraction) system, which is based on the monolingual LaSIE-II system (Humphreys et al., 1998) used in the recent MUC-7 evaluation (DARPA, 1998),
and is a development of the prototype M-LaSIE system described in (Gaizauskas et al., 1997). LaSIE-II is an English-only, general-purpose IE research system, capable of, but not restricted to, carrying out the tasks specified in MUC-6 and MUC-7: named entity recognition (NE), coreference resolution (CO), template element filling (TE), template relation identification (TR), and scenario template filling (ST) (see (DARPA, 1995) and (DARPA, 1998) for details of the tasks). In addition, the system can generate brief natural language summaries of any scenario-specific information it has detected in a text. All of these tasks are carried out by building a single rich model of the text, the discourse model, from which the various results are read off.

The LaSIE-II system is essentially a pipeline of modules, each of which processes the entire text before the next is invoked. The following is a brief description of each of the component modules in the system:

Tokenizer: Identifies token boundaries and section boundaries (header, body, paragraphs, etc.).

Gazetteer Matcher: Looks for matches in multiple domain-specific name (locations, organisations, etc.) and keyword (company designators, person first names, etc.) lists.

Sentence Splitter: Identifies sentence boundaries in the text body.

Brill PoS Tagger (Brill, 1992): Assigns one of the 48 Penn TreeBank (Marcus et al., 1993) part-of-speech tags to each token in the text.

Morphological Analyser: Performs simple morphological stemming to identify the root form and inflectional suffix for tokens which have been tagged as noun, verb, or adjective.

Bottom-up Chart Parser: Performs two-pass chart parsing (see, e.g., (Gazdar and Mellish, 1989)), pass one with a special named entity grammar, and pass two with a general phrasal grammar. A ‘best parse’ is then selected, which may be only a partial parse, and a predicate-argument representation, or quasi-logical form (QLF), of each sentence is constructed compositionally.

Name Matcher: Matches variants of named entities across the text.

Discourse Interpreter: Adds the QLF representation to a semantic net, which encodes the system’s domain model as a hierarchy of concepts. Additional information presupposed by the input is also added to the model, then coreference resolution is performed between new and old instances, and finally information consequent upon the input is added, producing an updated discourse model.

Template Writer: Writes out the MUC ST, TR, and TE results by traversing the discourse model and extracting the required information.
The LaSIE system contrasts with many other current IE systems, against which it competed well in MUC-6 and MUC-7 (see (DARPA, 1995) and (DARPA, 1998) for detailed results), in its derivation of a ‘deeper’ representation of the text. That is, rather than mapping directly from surface patterns to template representations, LaSIE derives an intermediate ‘meaning representation’ of each sentence in a text. Templates are then filled from this meaning representation. One of the strengths of this approach is that the meaning representation, being language-independent, supports the generation of templates or summaries in languages other than that of the input texts.

Developing the multilingual M-LaSIE-II system from the English-only LaSIE-II has involved the following:

- introducing language-independent labels for concept nodes in the domain model;
- providing a mapping between concept nodes and entries in monolingual lexicons (either by explicitly including lexical items in the domain model or, more practically, via pointers to and from an external lexical database);
- defining any language-specific knowledge, such as coreference restrictions on particular pronouns, in the language’s own monolingual lexicon, which can then be imported into the domain model in the form of attributes;
- adding a syntactic/semantic analysis phase for each language, to produce a common quasi-logical form (QLF) representation of a text.
QLF

The semantic representation or quasi-logical form produced by each language-specific parser in M-LaSIE-II, while not language-independent itself, is uniform across languages and its production marks the point in the system at which a language-independent representation begins to emerge. The QLF representation used in M-LaSIE-II is cruder than that used in (Alshawi, 1992), but shares the characteristics of retaining various proximities to the surface form and of postponing some disambiguation, e.g. full analysis of quantifier scope and word sense disambiguation.

Syntactically, QLF expressions are simply conjunctions of first order logical terms. In our system the predicates in the QLF representation are derived from:

- the lexical morphological roots of head words (hence the language-dependence of the QLF);
- semantic classifications assigned by the proper name recogniser;
- a fixed stock of relation predicates which express grammatical or semantic role relations.

QLF terms are either unary semantic types or binary attributes. Attributes can record either:
- simple properties with values taken from lexical items in the text, as in the name attribute (e.g. name(e1, ’London’)), pronoun (e.g. pronoun(e2, ’ils’)), adj(ective), adv(erb), det(erminer), etc.;
- simple properties with values from a fixed set assigned by the grammar’s semantic interpretation rules, as in the gender attribute (e.g. gender(e1, masc)), tense, number, etc.; or
- relations between instances, as in lobj (e.g. lobj(e1,e2)), lsubj, qual(ifier), coord(inated), apposed, etc. This class of attributes also includes prepositional relations, where the predicate itself is a lexical item from the text, e.g. in(e1,e2), à(e3,e4).

The stages required to produce the QLF representation for each language do not necessarily need to follow the initial stages of the current LaSIE architecture. Any technique or combination of techniques that yields the QLF representation, or indeed any semantic representation that can be mapped into QLF, is acceptable.

Here is an example of the QLF as produced by the French and English parsers for the simple example from section 2, i.e.:

Fr: Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres.
En: Gianluigi Ferrero attended the annual meeting of Vercom Corp in London.

French QLF:

    assister_a(e1), tense(e1,past), voice(e1,active),
    lsubj(e1,e2), lobj(e1,e3), name(e2,’Gianluigi Ferrero’),
    reunion(e3), det(e3,’la’), gender(e3,fem), adj(e3,’annuelle’), number(e3,sing),
    name(e5,’Vercom Corp’), organisation(e5), compagnie(e5), de(e3,e5),
    name(e6,’Londres’), lieu(e6), ville(e6), a(e3,e6)

English QLF:

    attend(e1), tense(e1,past), voice(e1,active),
    lsubj(e1,e2), lobj(e1,e3), name(e2,’Gianluigi Ferrero’),
    meeting(e3), det(e3,’the’), adj(e3,’annual’), number(e3,sing),
    name(e5,’Vercom Corp’), organisation(e5), company(e5), of(e3,e5),
    name(e6,’London’), location(e6), city(e6), in(e3,e6)
Here each noun and verb phrase recognised by the grammar leads to the introduction of a unique instance constant in the QLF representation (of the form ‘eN’) which serves as an identifier for the object or event referred to in the text. For example, the word réunions (meetings) would be assigned the QLF representation reunion(e1), number(e1,plural) (the root form of réunions being used as the predicate name). Verb complements are represented via lsubj (logical subject) and lobj (logical object) relations between instance constants. Prepositional phrase attachments are represented using the preposition itself as predicate name and the instances associated with the complement of the prepositional phrase and the head of the modified noun phrase as arguments; e.g. the QLF representation for meeting . . . in London is meeting(e3), city(e6), name(e6,’London’), in(e3,e6).
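Since a QLF expression is a flat conjunction of unary and binary terms, it can be held in a very simple data structure. The following is a minimal sketch, not the M-LaSIE-II implementation: the function names `parse_qlf` and `split_terms` are hypothetical, straight quotes are used instead of the typographic ones shown above, and the parser assumes that argument values contain no commas or parentheses.

```python
import re
from typing import List, Tuple

Term = Tuple[str, Tuple[str, ...]]  # (predicate name, arguments)

# A term is a predicate name followed by a parenthesised argument list.
# Assumption: argument values contain no commas or parentheses.
_TERM = re.compile(r"(\w+)\(([^()]*)\)")


def parse_qlf(text: str) -> List[Term]:
    """Split a conjunction of QLF terms into (predicate, args) tuples."""
    terms = []
    for predicate, arg_string in _TERM.findall(text):
        args = tuple(a.strip().strip("'") for a in arg_string.split(","))
        terms.append((predicate, args))
    return terms


def split_terms(terms: List[Term]) -> Tuple[List[Term], List[Term]]:
    """Separate unary semantic types from binary attributes."""
    unary = [t for t in terms if len(t[1]) == 1]
    binary = [t for t in terms if len(t[1]) == 2]
    return unary, binary


# The English QLF from the example above.
qlf_en = (
    "attend(e1), tense(e1,past), voice(e1,active), lsubj(e1,e2), "
    "lobj(e1,e3), name(e2,'Gianluigi Ferrero'), meeting(e3), "
    "det(e3,'the'), adj(e3,'annual'), number(e3,sing), "
    "name(e5,'Vercom Corp'), organisation(e5), company(e5), of(e3,e5), "
    "name(e6,'London'), location(e6), city(e6), in(e3,e6)"
)

types, attributes = split_terms(parse_qlf(qlf_en))
```

Here `types` holds the semantic type terms (attend, meeting, organisation, company, location, city) and `attributes` holds the properties and relations, including the prepositional relations `of` and `in`.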
Discourse Interpretation

Language-specific QLF expressions for individual sentences are passed on to the discourse interpreter. The principal task of the discourse processing module in M-LaSIE-II is to integrate the semantic representations of multiple sentences into a single model of the text, from which the information required for filling a template may be derived. The representation of the text finally arrived at after discourse interpretation is language-independent.

Processing Overview

The discourse interpretation stage of M-LaSIE-II relies on an underlying domain model, a declarative knowledge base that both contains general conceptual knowledge and serves as a frame upon which a discourse model for a multi-sentence text is built. This domain model is expressed in the XI knowledge representation language (Gaizauskas, 1995) which allows straightforward definitions of cross-classification hierarchies, the association of arbitrary attributes with classes or individuals, and the inheritance of these attributes by individuals. Returning to the example of extracting English templates from French and English texts, Figure 2 shows a fragment of the domain model related to ‘meeting’ events.

Discourse processing proceeds in four sub-stages for each new QLF representation passed on from the parser:

1. each instance in the QLF together with its attributes is added to the discourse model which has been constructed so far for the text;
2. presuppositions (stored as attributes of concepts in the domain model) are expanded, leading to further information being added to or removed from the model;
3. all instances introduced by the current sentence are compared with previously existing instances to determine whether any pair can be merged into a single instance, representing a coreference in the text;
4. consequences (also stored as attributes of concepts in the domain model) are expanded, leading to further information being added to or removed from the model, specifically information pertaining to template objects and slots.
Following stage 1, the representation of the text is language-independent. However, some of the processing which takes place in subsequent stages may need to take certain features of the source language into account. For example, coreference resolution, carried out in stage 3, is sensitive to word order, and languages which have different word order from those we have considered to date (e.g. Subject-Object-Verb languages) may require discourse processing to be parameterised by word order class.
Example

In the first stage, the predicate names in the QLF will be mapped to language-independent concept nodes in the domain model, via the lexicon. The instance constants will then be added below the appropriate nodes to create a discourse-specific model. For example, e5 has the class compagnie in the QLF given earlier, which will be searched for in the source language’s lexicon (French in this case) to determine the corresponding concept node name in the domain model (n4.1 for compagnie). The instance e5 will then be added below the node n4.1, as in Figure 3, which shows the state of the discourse model after the discourse interpretation phase (bold type shows what has been added to the underlying domain model (cf. Figure 2) during discourse processing of this sentence).

Some instances may be mentioned in the QLF representation without an explicit class, but a class may be inferable from an instance’s attributes. For example, an attribute of the name attribute node in the domain model may specify that all instances with this attribute must be instances of the object class, so that object(e2) can be inferred from name(e2, ’Gianluigi Ferrero’). Further, in the example, e2 is the logical subject of e1, and an attribute of e1’s class (attend) indicates that the subject must be of the class n3 (person). This allows the placement of e2 as an instance of n3, in the absence of any explicit class in the input.

The addition of e6 is a special case. If the value of a name attribute, such as London, can be found as an entry in the lexicon, it will be possible to map the value, and therefore the instance, to a concept node in the domain model. The name attribute of e6 then becomes redundant since a realisation can be inherited from the parent node for the required target language. Other named instances, such as Gianluigi Ferrero, will not be found in the source lexicon and will retain the name attribute for use as a realisation in all target languages.
However, as the examples in (Kameyama, 1996) pertaining to Japanese and English name translation demonstrate, the simple solution we propose here will not always be acceptable.

Once a QLF representation has been added to the discourse model in this way, the subsequent stages of presupposition expansion, coreference, and consequence expansion can take place within the same representation. Details of these operations may be found in (Gaizauskas and Humphreys, 1997). In the following we give further details of the mapping between QLF predicates and the domain model.

Note: Attribute predicate names, with the exception of prepositional relations, are expressed in English in both the QLF and the domain model, but the language of these labels is not significant and they could be mapped to a language-independent notation.
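The first stage described in the example can be sketched roughly as follows. This is an illustrative sketch under stated assumptions rather than the XI-based implementation: only the node names n4.1 (compagnie) and n3 (person) come from the text, while `n_attend`, the dictionary layout, and the function name are hypothetical. Unary predicates with no lexicon entry are simply skipped here.

```python
from collections import defaultdict

# Hypothetical French lexicon-to-concept mapping. Only n4.1 is a node name
# given in the text; "n_attend" is an invented label for the attend concept.
FR_LEXICON = {"compagnie": "n4.1", "assister_a": "n_attend"}

# Assumed domain-model attribute: the logical subject of an attend event
# must be an instance of n3 (person), as in the example in the text.
SUBJECT_CLASS = {"n_attend": "n3"}


def add_to_discourse_model(terms, lexicon):
    """Place each instance below the concept node of its class (stage 1)."""
    model = defaultdict(set)   # concept node -> instances placed below it
    node_of = {}               # instance -> concept node assigned so far

    # Map unary type predicates to concept nodes via the source lexicon.
    for predicate, args in terms:
        if len(args) == 1 and predicate in lexicon:
            node = lexicon[predicate]
            model[node].add(args[0])
            node_of[args[0]] = node

    # Infer a class for logical subjects that have no explicit class.
    for predicate, args in terms:
        if predicate == "lsubj":
            event, subject = args
            required = SUBJECT_CLASS.get(node_of.get(event))
            if required is not None and subject not in node_of:
                model[required].add(subject)
                node_of[subject] = required
    return dict(model)


terms_fr = [
    ("assister_a", ("e1",)),
    ("lsubj", ("e1", "e2")),
    ("name", ("e2", "Gianluigi Ferrero")),
    ("compagnie", ("e5",)),
]
model = add_to_discourse_model(terms_fr, FR_LEXICON)
```

After this call, e5 sits below n4.1 because of its explicit class, and e2 sits below n3 purely by inference from its role as logical subject of e1.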
Mapping QLF to a Language Independent Domain Model

As noted above, predicates in the QLF may be divided into unary predicates expressing semantic type information, and binary predicates expressing attributes of instances or relations between instances.

Unary predicates expressing semantic type information are mapped into language-independent concept nodes in the domain model via a lexicon-to-concept mapping held in the source language lexicon. In principle this stage involves solving the word sense disambiguation problem (since the mapping will be from word sense to concept node), but for the limited domains common in information extraction applications we have not found this issue to be of serious practical significance for the key nominal and verbal terms of interest in a domain (see below for further discussion of this point).

Binary predicates in the QLF pose a number of problems. When attributes with a closed class of lexical items as values are added to the discourse model, any language-specific information associated with their values is imported from the monolingual lexicons. For example, the syntactic type of a pronoun (personal, reflexive, possessive, demonstrative) can be added to the discourse model from a lexicon whenever a pronoun instance (e.g. pronoun(e1,’ils’)) is encountered in the input, and the surface form of the pronoun need not be referred to again. The domain model itself does not need to represent all information about all possible pronouns in advance, importing attributes as needed from the monolingual lexicons. The same approach holds for determiners, which are also represented in the QLF as a binary relation holding between an instance and a closed class of language-specific lexical values (e.g. det(e1,’the’)).

For attributes with an open class of lexical items as values, the situation is more problematic and we do not yet have a fully satisfactory solution.
The case of the name attribute has been discussed in the preceding section, where significant named individuals in the application area may be contained in multiple language-specific lexicon-to-concept mappings, allowing a satisfactory language-independent representation of the named individual. However, the QLF also uses open class lexical items as values for the adj (adjective modification) and qual (noun modification) attributes. So, for example, annual meeting is represented in QLF as meeting(e3), adj(e3,’annual’) and computer user is represented as user(e2),qual(e2,’computer’). This is done in order to postpone the identification of the semantic relationship between the adjective or qualifier and the noun it modifies, from the parsing stage until later on. The semantic relation between annual and meeting indicates that the frequency of occurrence of the meeting is once a year; but in the syntactically parallel annual flower, the semantics need to indicate that the flower is one that lives for one year only. Similarly a computer user
is a person who uses a computer, while a computer disk is a disk for use in a computer. Ideally, the semantic relations between modifiers and head nouns in such QLF representations should be determined and translated to a language-independent conceptual representation during the initial stage of adding these attributes to the discourse model. We believe there is nothing in principle to stop this being done (which is not to say it is easy). To date, however, we have not implemented a mechanism for doing so, and as a result these attributes are added to the discourse model with language-specific lexical values. This omission does not appear to have significant impact on information extraction applications, though we intend to address it shortly. Again, the limited domain of IE applications means that a fully general mechanism for determining the semantic relations between adjectival and nominal modifiers and their heads is not needed: for key nouns in the domain, modifiers of interest will most likely fall into stable classes with predictable semantic relations between modifier and head.

Attributes representing relations between instances fall into two classes, as described above: those specifying grammatical relations between instances (e.g. lsubj(e1,e2), apposed(e3,e4)) and those indicating prepositional relations, where the attribute type is a lexical item, e.g. in(e1,e2), à(e3,e4). The former involve no language-specific information, apart from whether a particular relation exists in a particular language. The latter, however, are clearly language-specific, since the predicate names are lexical items from the text. Further, mapping these predicates onto a language-independent concept node may be non-trivial, as it raises issues of whether the preposition meaning should be represented independently or whether, at least in some cases, it should be viewed as altering the meaning of the verb to which it is attached.
As with adjectival and nominal modification, our treatment of the semantics of prepositions is limited at present. We assume that prepositions that are verb particles are identified by the tagger and parser, and that a single QLF predicate, and subsequently a single concept in the discourse model, results; for example, step down (as chairman) will become step_down. Other prepositions become binary attributes in the QLF and, where available, information about the semantic type of the complement of the prepositional phrase is used to map the preposition into a language-independent relation in the discourse model. So, for example, meeting . . . in London generates the QLF meeting(e3), city(e6), name(e6,’London’), in(e3,e6) and the fact that e6 is a location is used to map the predicate in to a language-independent attribute, e.g. loc_prep1, in the domain model. Again, this approach is appropriate for IE tasks where the relations of interest are limited and are signalled via limited types of prepositional phrase construction.

It is worth noting that a particular QLF attribute can exist in one language but not in others. If gender does not exist in a language, for example, there is no need for a parser for that language to produce an attribute for it (with, say, a nil value). If the parser for a language does produce a gender attribute, however, it must be represented in the domain model for it to be used in discourse interpretation. Thus, if gender is to be useful in coreference resolution, which indeed it can be, it must be represented in the domain model as a single-valued attribute (an attribute for which an instance may only take one value at a given time). The domain model must therefore classify all possible attributes that the current set of parsers in the system can produce; values of these attributes are then supplied, when available, from the monolingual lexicons.
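The preposition treatment described above, in which the semantic type of the complement selects a language-independent relation, might be sketched as a simple table lookup. Everything here is an assumption made for illustration: the table layout, the concept label "location", and the function name are hypothetical, and the relation label is written `loc_prep1` after the example label in the text.

```python
# Hypothetical table:
# (source language, preposition, concept of the complement) -> relation.
# The French preposition is written "a", as in the French QLF example.
PREP_RELATIONS = {
    ("en", "in", "location"): "loc_prep1",
    ("fr", "a", "location"): "loc_prep1",
}


def map_preposition(lang, preposition, complement_concepts):
    """Return a language-independent relation for a prepositional attribute.

    complement_concepts is the set of language-independent concepts that
    the complement instance belongs to (assumed to be known after stage 1).
    Returns None when no mapping applies, in which case the attribute
    would keep its language-specific predicate name.
    """
    for concept in sorted(complement_concepts):
        relation = PREP_RELATIONS.get((lang, preposition, concept))
        if relation is not None:
            return relation
    return None


# in(e3,e6) from the English text and a(e3,e6) from the French text,
# where e6 is known to be a location, map to the same relation.
english_relation = map_preposition("en", "in", {"location", "city"})
french_relation = map_preposition("fr", "a", {"location", "city"})
```

Under these assumptions both source languages yield the same relation for the meeting-location link, which is what lets later stages ignore the source language.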
Template Generation

Once a text has been fully processed and a language-independent representation of those aspects of it required for the IE task has been derived, this representation can be used to generate template structures. These template structures will include pointers back to the discourse model, e.g. e6 will occur in a location slot. This forms a language-independent representation of the templates, e.g., the left column below, for the example used in section 2:

    Slot: n4, Value: e5        ORGANISATION: ’Vercom Corp’
    Slot: n5, Value: e6        LIEU:         ’Londres’
    Slot: n6, Value: e10       REUNION:      ’reunion annuelle’
    Slot: n3, Value: e2        PERSONNE:     ’Gianluigi Ferrero’
The abstract representation will then be filled with lexical realisations for the required target language, obtained via the pointers to the lexicon, as in the right column if the target language is French. Simple natural language summarisation can also be carried out in this way, using sentence templates associated with particular lexical entries.
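The filling step can be illustrated with a small sketch. The slot and instance identifiers are those of the table above; the data layout, the concept labels `c_london` and `c_annual_meeting`, the function name, and the English slot labels MEETING and PERSON are assumptions made for the example. Treating "annual meeting" as a single realisable concept is also a simplification, since modifier attributes currently keep language-specific values.

```python
# Language-independent template: (slot concept node, instance) pairs.
SLOTS = [("n4", "e5"), ("n5", "e6"), ("n6", "e10"), ("n3", "e2")]

# Hypothetical concept-to-lexicon mappings for slot labels.
SLOT_LABELS = {
    "fr": {"n4": "ORGANISATION", "n5": "LIEU",
           "n6": "REUNION", "n3": "PERSONNE"},
    "en": {"n4": "ORGANISATION", "n5": "LOCATION",
           "n6": "MEETING", "n3": "PERSON"},
}

# Instances either retain a name attribute (used in every target language)
# or point to a concept whose realisation depends on the target language.
INSTANCES = {
    "e5": {"name": "Vercom Corp"},
    "e6": {"concept": "c_london"},
    "e10": {"concept": "c_annual_meeting"},
    "e2": {"name": "Gianluigi Ferrero"},
}

REALISATIONS = {
    "c_london": {"fr": "Londres", "en": "London"},
    "c_annual_meeting": {"fr": "reunion annuelle", "en": "annual meeting"},
}


def fill_template(slots, target):
    """Realise a language-independent template in the target language."""
    filled = []
    for node, instance in slots:
        info = INSTANCES[instance]
        if "name" in info:
            value = info["name"]
        else:
            value = REALISATIONS[info["concept"]][target]
        filled.append((SLOT_LABELS[target][node], value))
    return filled


french_template = fill_template(SLOTS, "fr")
english_template = fill_template(SLOTS, "en")
```

With the target language set to French this reproduces the right column of the table; switching the target to English changes only the slot labels and the two concept-based values.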
Issues in Lexicon-Domain Model Mapping

The approach used here relies on a robust domain model that will constitute the central bridge through which all multilingual information will circulate. The addition of a new language to the IE system will mainly consist of mapping a new monolingual lexicon to the domain model and adding a new syntactic/semantic analysis front-end, with no interaction at all with other languages in the system.

The language-independent domain model can be compared to the use of an interlingua representation in MT (see, e.g., (Hutchins, 1986)). An IE system, however, does not require full generation capabilities from the intermediate representation, and the task will be well-specified by a limited ‘domain model’ rather than a full unrestricted ‘world model’. This makes an interlingua representation feasible for IE, because it will not involve finding solutions to all the problems of such a representation, only those issues directly relevant to the current IE task. However, we discuss here a number of particular issues related to the mapping between the domain model and monolingual lexicons.
Lexical Gaps

Lexical gaps occur between languages when a concept expressed by a single word in one language has no direct match in a single word in another, but rather requires several words or a phrase to express it. For example, Sicilians have no word to translate the English word lawn. However, if, for a particular multilingual IE task, such a concept is included in a domain model, language-specific rules can be used to map phrases to single concepts where necessary.

It is tempting to presume that lexical gaps will be rare, since the domain model will generally define an area of interest which has much in common across all the languages involved, but they will occur. In the AVENTINUS drug domain, for example, the domain model may include a drug concept node for the English word crack, but pointers to the lexicons of non-English languages may not exist. To express the concept in a target language without a pointer will then require the selection of another language’s realisation, either that of the text’s original language or that of some fixed default (e.g. English).
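The fallback for a missing pointer could look like the sketch below. The text offers the source language or a fixed default as alternatives; trying them in that order is an assumption of this example, as are the concept label `n_crack`, the lexicon layout, and the function name.

```python
def realise(concept, target, source, lexicons, default="en"):
    """Pick a realisation for a concept, falling back when there is a gap.

    Tries the target language first, then the language of the source text,
    then a fixed default language. Returns (word, language used), or
    (None, None) if no lexicon has an entry for the concept.
    """
    for lang in (target, source, default):
        word = lexicons.get(lang, {}).get(concept)
        if word is not None:
            return word, lang
    return None, None


# Hypothetical concept-to-lexicon pointers: the drug sense of "crack" has
# an English realisation but no French or German one.
LEXICONS = {
    "en": {"n_crack": "crack", "n_meeting": "meeting"},
    "fr": {"n_meeting": "reunion"},
    "de": {},
}

gap_result = realise("n_crack", target="fr", source="de", lexicons=LEXICONS)
direct_result = realise("n_meeting", target="fr", source="en",
                        lexicons=LEXICONS)
```

Returning the language actually used alongside the word lets a template writer mark values that could not be realised in the requested target language.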
Word sense disambiguation

The word sense problem becomes serious if the domain model is created from a large-scale pre-existing resource, such as a machine-readable dictionary or thesaurus. If used without restriction, such resources will provide us with a large number of nodes in the hierarchy for each word (e.g. bank has 9 nominal senses in WordNet (Miller, 1990)). This problem is minimised in the current approach because words will often have only a single sense relevant to a particular domain. However, we can still consider potential solutions within the proposed architecture.

One solution is to develop a word sense tagger, the results of which could be used to determine which concept node in the domain model each referring term should map to. While word sense tagging is still an open area for research, current techniques hold promise, e.g. (Wilks and Stevenson, 1997).

Another approach is to appeal to knowledge held in the domain model hierarchy, using, e.g., selectional restrictions as a set of constraints which, when satisfied, will hopefully eliminate all but one sense for each semantically ambiguous word in the input. Other information, such as the frequency of sense occurrences, may also be brought to bear in the form of constraints. Further, the discourse model built so far can be used to define a context in which certain senses are much more likely to occur.
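The selectional restriction idea can be illustrated with a toy hierarchy. This sketch is not taken from the system: the two senses of bank, the concept names, and the functions `is_a` and `filter_senses` are all invented for the example, and a real domain model would hold the restriction as an attribute of the governing concept.

```python
# Hypothetical fragment of a concept hierarchy: concept -> parent concept.
PARENT = {
    "bank_institution": "organisation",
    "bank_river": "location",
    "organisation": "object",
    "location": "object",
}

# Hypothetical lexicon entry listing the candidate senses of a word.
SENSES = {"bank": ["bank_institution", "bank_river"]}


def is_a(concept, ancestor):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def filter_senses(word, required_class):
    """Keep only the senses that satisfy a selectional restriction."""
    return [s for s in SENSES.get(word, []) if is_a(s, required_class)]


# A verb whose subject must be an organisation leaves a single sense.
restricted = filter_senses("bank", "organisation")
# A weaker restriction (any object) leaves the word ambiguous.
unrestricted = filter_senses("bank", "object")
```

When more than one sense survives, as in the second call, further constraints such as sense frequency or the discourse context would be needed to choose among them.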
Unknown Words
In general, if the root form of a word cannot be mapped to a concept node in the domain model, the word will simply be interpreted as irrelevant to the current domain. Of course this assumes that the domain model is complete and correct, which may not always be the case. Allowing for this problem requires mechanisms for extending the domain model as new words are encountered. This is effectively a form of training to construct new domain models from a representative text corpus, and as such is beyond the scope of this paper. Suffice it to say that the issue of creating multilingual mappings between lexicons and new concept nodes, derived either manually or automatically through training, is likely to require the use of bilingual dictionaries to find, say, the French lexical entry corresponding to a node created from the English word crack, as the name of a drug.
Footnote: We are grateful to Roberto Garigliano for the lawn example.
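The treatment of unknown words can be sketched with a small lexicon-to-concept index. The English and French entries below are transcribed from the lexicons shown in Figure 2. The dictionary layout, the function map_roots and the idea of logging unmapped roots for later model extension are assumptions made for illustration.

```python
# Lexicon-to-concept index per language: root form -> concept node.
# Entries follow the English and French lexicons of the domain model fragment.
INDEX = {
    "en": {
        "person": "n3", "organization": "n4", "company": "n4.1",
        "location": "n5", "city": "n5.1", "London": "n5.1.1",
        "meeting": "n6", "go": "v3", "attend": "v3.1",
    },
    "fr": {
        "personne": "n3", "organisation": "n4", "compagnie": "n4.1",
        "lieu": "n5", "ville": "n5.1", "Londres": "n5.1.1",
        "reunion": "n6", "aller": "v3", "assister_a": "v3.1",
    },
}


def map_roots(roots, lang, unknown_log):
    """Map root forms to concept nodes; unmapped roots are treated as irrelevant.

    Unmapped roots are also appended to `unknown_log` (an assumption of this
    sketch) so that they could later be reviewed when extending the model.
    """
    mapped = []
    for root in roots:
        node = INDEX[lang].get(root)
        if node is None:
            unknown_log.append((lang, root))
        else:
            mapped.append((root, node))
    return mapped


if __name__ == "__main__":
    log = []
    print(map_roots(["person", "attend", "meeting", "yesterday"], "en", log))
    print(map_roots(["personne", "assister_a", "reunion"], "fr", log))
    print(log)
```

Because both lexicons point at the same node identifiers, the English and French inputs above yield the same sequence of concept nodes.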
M-LASIE-II PERFORMANCE
An evaluation of the multilingual system has been carried out using the scoring software provided as part of the MUC evaluations. The software compares manually filled templates (‘keys’ in MUC terminology) with system output, producing a detailed breakdown of performance on each slot category. The MUC-6 domain of management succession events (promotions, retirements, etc.) was used for the evaluation, requiring templates to be filled with details of companies, posts, and people moving into or out of those posts. Two corpora, one for training and one for evaluation, each containing 20 parallel French and English newswire texts (average 240 words per text) relevant to the domain, were obtained from Canada Newswire Ltd. (www.newswire.ca), and templates were filled manually for these texts using the Scenario Template task guidelines provided for MUC-6 (DARPA, 1995). The MUC templates include ‘string fill’ slots, i.e. where strings from the original texts are used as slot values, so parallel French and English template keys were necessary for the evaluation, rather than a single language template for each corresponding pair of texts. Most string fill slots contained person or organisation names, which were generally the same in both languages, but were translated in some cases (e.g. Ford Motor Company of Canada, Limited and Ford du Canada Limitée). Names of management posts, however, were always translated (e.g. CEO and chef de la direction). For these slot values, the lexicon-to-concept index of the source language was used in the template generation stage, rather than that of the target language (English), as used for all other slots. The French and English texts, while generally parallel, did vary slightly in the information contained in each corresponding pair of texts.
In one case two separate posts occurred in an English text (president and CEO), but only one was mentioned in the French version (président-directeur général):

    RESTON, Va., Aug. 5 /CNW/ - At its recent quarterly meeting, the Board of Directors of Lafarge Corporation, one of the leading construction materials companies in North America, appointed a new president and CEO, Mr. John M. Piccuch, effective Oct 1.

    RESTON, Virginie, 5 août /CNW/ - Au cours de sa récente réunion trimestrielle, le Conseil d’Administration de Lafarge Corporation, l’une des plus importantes sociétés de matériaux de construction en Amérique du Nord, a annoncé la nomination de Mr. John M. Piccuch au poste de nouveau président-directeur général de la société à compter du 1er octobre 1996.

Other variations include the French convention of always including an explicit person title with surnames (e.g. Mr. Piccuch), whereas the English version would use the surname alone. This causes an additional slot to be filled in the MUC template extracted from French (and also serves to reliably disambiguate person names in French). These differences in the information contained in the texts mean that results from the two languages should not be compared directly, as they could be for a truly parallel corpus. The filled templates have the following form (see (DARPA, 1995) for details), the example here corresponding to the French text above:
    :=
        DOC_NR: "c0531.txt"
        CONTENT:
    :=
        SUCCESSION_ORG:
        POST: "president-directeur general"
        IN_AND_OUT:
        VACANCY_REASON: REASSIGNMENT
    :=
        IO_PERSON:
        NEW_STATUS: IN
        ON_THE_JOB: NO
    :=
        ORG_NAME: "Lafarge Corporation"
        ORG_TYPE: COMPANY
    :=
        PER_NAME: "John M. Piccuch"
        PER_TITLE: "Mr."

For the evaluation, the M-LaSIE-II system uses the same scenario-specific parts of the domain model and template writer as the LaSIE system used in MUC-6, with only the lexicon-to-concept index used to map QLF predicates into the model requiring modification in the multilingual system. The results of the evaluation are shown in Table 1. Each corpus (training and evaluation) contained 20 parallel texts, with the evaluation texts kept blind. The results compare very well with the highest MUC-6 system scores (50% recall, 73% precision, 56.40% P&R), and demonstrate that the LaSIE domain model developed for this task through training on the English MUC-6 corpora can be easily applied in another language. Initial investigations suggest that the slightly higher recall for the French corpora is due to a consistently more ‘formal’ style across the French texts. In particular, the regular use of phrases such as le poste de président, where the English version uses simply president, allows more specific patterns in the domain model to match the French input. However, the increase in recall is offset by a greater decrease in precision, much of which can be attributed to the weaker French grammar producing more partial semantics than the English grammar, resulting in the use of the more general, and therefore less precise, domain-specific inference rules during discourse interpretation. Further evaluations on much larger corpora, preferably from different sources, are needed to provide a clearer comparison of the system’s performance in different languages.
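The relation between the recall, precision and P&R figures can be sketched as follows. This assumes that P&R denotes the balanced F-measure reported by the MUC scoring software, F = ((b^2 + 1) * P * R) / (b^2 * P + R) with b = 1. The function name f_measure is ours. Because the input percentages from Table 1 are rounded, the computed values only approximate the published P&R column.

```python
def f_measure(precision, recall, beta=1.0):
    """Combined precision/recall score; beta = 1 weights the two equally."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # (recall, precision) pairs as reported in Table 1, as fractions.
    table_1 = {
        "English (training)": (0.53, 0.79),
        "French (training)": (0.56, 0.71),
        "English (evaluation)": (0.52, 0.81),
        "French (evaluation)": (0.54, 0.71),
    }
    for corpus, (recall, precision) in table_1.items():
        print(f"{corpus}: {100 * f_measure(precision, recall):.2f}%")
```

The sketch makes the trade-off discussed above concrete: a gain of a few points in recall does not raise the combined score if precision falls by a larger margin.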
ADDING A FURTHER LANGUAGE
A further language, Spanish, has recently been integrated into the M-LaSIE architecture and required only one person month, using existing front-end Spanish modules. This demonstrates the flexibility of the system architecture for integrating a new language, one of our main motivations, as explained in the first section of this paper. The principal activities involved in integrating Spanish were:

- adding gazetteer lists for Spanish;
- building the concept-index lexicon for Spanish for the required domain;
- writing an interface between the Spanish parser and the multilingual discourse interpreter to produce the QLF predicates input to the discourse interpreter;
- adding language-specific information to the domain model, concerning Spanish pronouns for coreference.

One of the main technical difficulties that occurred during the integration of Spanish concerned the ‘QLF interface’. The system architecture ‘imposes’ such a representation as an input to the multilingual discourse interpreter, even when the existing parser in the language to be integrated is not a QLF-based parser. This is the main limitation of the system. However, in the case of the Spanish parser, which produced a non-QLF-based semantics, the set of QLF predicates and properties was easily derivable from the original parser output. The main issue, as expected, was dealing carefully with prepositions. On the other hand, an advantage of the proposed architecture is that the maintenance of the resources does not require a huge amount of work, as only the domain model and the concept-index lexicon are affected by any ‘improvement’ or extension of the resources. Adding new entries to the concept-index lexicon or domain model is straightforward, as both are declarative resources and can be easily hand-edited or automatically enhanced, given appropriate techniques.
CONCLUSION
Multilingual information extraction, as the term is used here, refers to the extraction of templates in a single language from texts in several languages. We have described the M-LaSIE-II system which implements an approach to multilingual IE avoiding the need for separate machine translation systems. As pointed out in (Kameyama, 1996), a key question in multilingual IE is how much monolingual IE infrastructure (e.g. rules and data structures) can be either reused or shared when more languages are added. The system presented here is based on a shared language-independent domain model, representing a hierarchy of concepts relevant to a particular IE task. The multilingual capabilities are provided via mappings between the concept nodes and entries in a number of monolingual lexicons, which allow the construction of a language-independent discourse model for a text, based on the initial domain model. A language-independent representation of a template can then be constructed from the discourse model and, again via the concept/lexicon mappings, the template representation can be expressed in a particular target language. The system has been implemented as part of the AVENTINUS project (Cunningham et al., 1996), based on the existing LaSIE (Gaizauskas et al., 1995) and LaSIE-II (Humphreys et al., 1998) MUC systems. An evaluation of system performance in French and English has been carried out in the MUC-6 domain of management succession events, using the MUC scoring software, with results showing little compromise in IE performance due to the addition of multilingual capabilities. The use of a language-independent representation involves a number of difficulties, such as allowing for lexical gaps and word sense ambiguity, issues which are not specific to IE. However, the well-defined application domains used in IE permit these problems to be largely avoided.
The approach taken here has the advantage that new languages can be integrated into an existing system with the minimum of interaction with other language-specific information. The main effort is the construction of an accurate and complete initial domain model, an effort which will, of course, also be necessary for any monolingual IE system.
REFERENCES
Alshawi, H. (ed.) 1992. The Core Language Engine. MIT Press. Cambridge MA.
Brill, E. 1992. A Simple Rule-Based Part-of-Speech Tagger. In Proc. Third Conference on Applied Natural Language Processing (ANLP’92).
Cunningham, H., Azzam, S., Wilks, Y., Humphreys, K. and Gaizauskas, R. 1996. AVENTINUS Domain Model Specifications (WP 4.1/T12).
Defense Advanced Research Projects Agency (DARPA). 1995. Proc. Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann.
Defense Advanced Research Projects Agency (DARPA). 1998. Proc. Seventh Message Understanding Conference (MUC-7). Morgan Kaufmann. (forthcoming)
Gaizauskas, R. 1995. XI: A Knowledge Representation Language Based on Cross-Classification and Inheritance. Technical report. Department of Computer Science, University of Sheffield.
Gaizauskas, R., Wakao, T., Humphreys, K., Cunningham, H. and Wilks, Y. 1995. Description of the LaSIE system as used for MUC-6. In (DARPA, 1995).
Gaizauskas, R. and Humphreys, K. 1997. Using a Semantic Network for Information Extraction. Journal of Natural Language Engineering, 3(2/3), 147-169.
Gaizauskas, R., Humphreys, K., Azzam, S. and Wilks, Y. 1997. Concepticons vs. Lexicons: An Architecture for Multilingual Information Extraction. In Proc. Summer School on Information Extraction (SCIE-97), ed. M.T. Pazienza. Springer-Verlag.
Gaizauskas, R. and Wilks, Y. 1998. Information Extraction: Beyond Document Retrieval. Journal of Documentation, 54(1), 70-105.
Gazdar, G. and Mellish, C. 1989. Natural Language Processing in Prolog. Addison-Wesley.
Humphreys, K., Gaizauskas, R., Azzam, S., Huyck, C., Mitchell, B., Cunningham, H. and Wilks, Y. 1998. Description of the LaSIE-II system as used for MUC-7. In (DARPA, 1998).
Hutchins, W.J. 1986. Machine Translation: past, present, future. Chichester: Ellis Horwood.
Kameyama, M. 1996. Information Extraction across Linguistic Barriers. AAAI Spring Symposium on Cross Language text and Speech Processing.
Marcus, M.P., Santorini, B. and Marcinkiewicz, M.A. 1993. Building a Large Annotated Corpus of English: The Penn TreeBank. Computational Linguistics, 19(2), 313-330.
Miller, G.A. (ed.). 1990. WordNet: An on-line Lexical Database. International Journal of Lexicography, 3(4), 235-312.
Wilks, Y. and Stevenson, M. 1997. Sense Tagging: Semantic Tagging with a Lexicon. In Proc. Applied Natural Language Processing Conference (ANLP’97): SIGLEX Workshop on Tagging Text with Lexical Semantics.
Corpus                 Recall   Precision   P&R
English (training)     53%      79%         63.29%
French (training)      56%      71%         62.91%
English (evaluation)   52%      81%         63.31%
French (evaluation)    54%      71%         61.17%

Table 1: M-LaSIE-II results
[Figure 1: Multilingual IE system architecture. Components shown: Syntactic/Semantic Analysis for Language 1, Language 2, ..., Language N; Language 1 Lexicon, Language 2 Lexicon, ..., Language N Lexicon; Discourse Interpretation; Language Independent Domain Model; Language Independent Discourse Model; Template Generation; Language Specific Template Structures.]
[Figure 2: Domain Model Fragment. A Language Independent Domain Model hierarchy with nodes entity, object (n3, n4, n4.1, n5, n5.1), event (n6, v3, v3.1) and attribute (single_val, multiple_val; number, name, lsubj, lsubj_of), carrying the annotations [animate:yes] and [lsubj_type:n3]. The lexicons point into the hierarchy as follows.]

Node      English Lexicon   French Lexicon
n3        person            personne
n4        organization      organisation
n4.1      company           compagnie
n5        location          lieu
n5.1      city              ville
n5.1.1    London            Londres
n6        meeting           reunion
v3        go                aller
v3.1      attend            assister_a
[Figure 3: Discourse Model Fragment. The domain model hierarchy of Figure 2 (entity, object, event, attribute and nodes n3, n4, n4.1, n5, n5.1, n5.1.1, n6, v3, v3.1) extended with instance nodes: e1 [tense:past] [aspect:simple] [voice:active] [lsubj:e2] [lobj:e3]; e2 [name:Gianluigi Ferrero] [lsubj_of:e1] [animate:yes]; e3 [number:sing] [lobj_of:e1]; e5 [name:Vercom Corp.]; e6. The annotation [lsubj_type:n3] also appears.]