Computational Linguistics Research Group

Albert-Ludwigs-Universität Freiburg im Breisgau, Germany

AUTOMATED KNOWLEDGE ACQUISITION MEETS METAREASONING: INCREMENTAL QUALITY ASSESSMENT OF CONCEPT HYPOTHESES DURING TEXT UNDERSTANDING

Udo Hahn, Manfred Klenner & Klemens Schnattinger

LIF REPORT 6/96, 1996

Computational Linguistics Research Group, Albert-Ludwigs-Universität Freiburg, Werthmannplatz 1, 79085 Freiburg, Germany

http://www.coling.uni-freiburg.de fhahn,klenner,[email protected]

Abstract

We introduce a methodology for automated knowledge acquisition and learning from texts that relies upon a quality-based model of terminological reasoning. Concept hypotheses which have been derived in the course of the text understanding process are assigned specific "quality labels" (indicating their significance, reliability, strength). Quality assessment of these hypotheses accounts for conceptual criteria referring to their current knowledge base context as well as linguistic indicators (grammatical constructions, discourse patterns) which led to their generation. We advocate a metareasoning approach which allows for the quality-based evaluation and a bootstrapping-style selection of alternative concept hypotheses as text understanding incrementally proceeds. We also provide a preliminary empirical evaluation, with focus on the learning rates and the learning accuracy that were achieved using this approach.

Appeared in: KAW'96 - Proc. 10th Knowledge Acquisition Workshop, 1996, pp. 58-1 to 58-20

Computational Linguistics Lab - Text Knowledge Engineering Group, Freiburg University, Platz der Alten Synagoge 1, D-79085 Freiburg, Germany
fhahn,klenner,[email protected]

INTRODUCTION

The work reported in this paper is part of a large-scale project aiming at the development of a German-language text knowledge acquisition system (Hahn et al., 1996c) for two real-world application domains: test reports on information technology products (current corpus size: approximately 100 documents with 10^5 words) and medical findings reports (current corpus size: approximately 120,000 documents with 10^7 words). The knowledge acquisition problem we face is two-fold. In the information technology domain, lexical growth occurs at dramatic rates: new products, technologies, companies and people continuously enter the scene, such that any attempt at keeping track of these lexical innovations by hand-coding is clearly precluded. Compared with these dynamics, the medical domain is lexically more stable, but the sheer size of its sublanguage (conservative estimates range about 10^6 lexical items/concepts) also cannot reasonably be coded by humans in advance. Therefore, the designers of text understanding systems for such challenging applications have to find ways to automate the lexical/concept learning phase as a prerequisite and, at the same time, as a constituent part of the text knowledge acquisition process.

Unlike the current mainstream with its focus on statistically based learning methodologies (Lewis, 1991; Resnik, 1992; Sekine et al., 1992), we advocate a symbolically rooted approach in order to break the concept acquisition bottleneck. This approach is based on expressively rich knowledge representation models of the underlying domain (Hahn et al., 1996a; 1996b; Hastings, 1996).

We consider the problem of natural language based knowledge acquisition and concept learning from a new methodological perspective, viz. one based on metareasoning about statements expressed in a terminological knowledge representation language. Reasoning either is about structural linguistic properties of phrasal patterns or discourse contexts in which unknown words occur (assuming that the type of grammatical construction exercises a particular interpretative force on the unknown lexical item), or it is about conceptual properties of particular concept hypotheses as they are generated and continuously refined by the on-going text understanding process (e.g., consistency relative to already given knowledge, independent justification from several sources). Each of these grammatical, discourse or conceptual indicators is assigned a particular "quality" label. The application of quality macro operators, taken from a "qualification calculus" (Schnattinger & Hahn, 1996), to these atomic quality labels finally determines which of several alternative hypotheses actually hold(s).

The decision for a metareasoning approach is motivated by requirements which emerged from our work in the overlapping fields of natural language parsing and learning from texts. Both tasks are characterized by the common need to evaluate alternative representation structures, either reflecting parsing ambiguities or multiple concept hypotheses. For instance, in the course of concept learning from texts, various and often conflicting concept hypotheses for a single item are formed, as the learning environment usually provides only inconclusive evidence for exactly determining the properties of the concept to be learned. Moreover, in "realistic" natural language understanding systems working with large text corpora, the underdetermination of results can often not only be attributed to incomplete knowledge provided for that concept in the data (source texts), but it may also be due to imperfect parsing results (originating from lacking lexical, grammatical, conceptual specifications, or ungrammatical input).

Therefore, competing hypotheses at different levels of validity and reliability are the rule rather than the exception and, thus, require appropriate formal treatment. Accordingly, we view the problem of choosing from among several alternatives as a quality-based decision task which can be decomposed into three constituent parts: the continuous generation of quality labels for single hypotheses (reflecting the reasons for their formation and their significance in the light of other hypotheses), the estimation of the overall credibility of single hypotheses (taking the available set of quality labels for each hypothesis into account), and the computation of a preference order for the entire set of competing hypotheses, which is based on these accumulated quality judgments.
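The three-part decomposition can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the label names, the item "X-100", and the lexicographic scoring are hypothetical and are not the operators of the qualification calculus cited above.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical label names, listed from most to least decisive.
LABEL_PRIORITY = ["consistent", "multiply_supported", "linguistic_cue"]


@dataclass
class Hypothesis:
    item: str                                   # the unknown lexical item
    concept: str                                # the concept it is hypothesized to be
    labels: Counter = field(default_factory=Counter)


def add_label(hypo: Hypothesis, label: str) -> None:
    """Part 1: attach a quality label whenever new evidence arrives."""
    hypo.labels[label] += 1


def credibility(hypo: Hypothesis) -> tuple:
    """Part 2: summarize all labels of one hypothesis.

    Assumed scoring: inconsistencies count against a hypothesis first,
    then the remaining labels are compared in priority order.
    """
    return (-hypo.labels["inconsistent"],) + tuple(
        hypo.labels[name] for name in LABEL_PRIORITY
    )


def rank(hypotheses: list) -> list:
    """Part 3: preference order over competing hypotheses, best first."""
    return sorted(hypotheses, key=credibility, reverse=True)


printer = Hypothesis("X-100", "Printer")
notebook = Hypothesis("X-100", "Notebook")
company = Hypothesis("X-100", "Company")

add_label(printer, "consistent")
add_label(printer, "multiply_supported")
add_label(notebook, "consistent")
add_label(company, "inconsistent")
add_label(company, "linguistic_cue")

ranking = [h.concept for h in rank([printer, notebook, company])]
```

Comparing tuples lexicographically keeps the sketch free of numeric weights, which the text does not specify; any other aggregation could be substituted in `credibility` without touching the other two parts.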

ARCHITECTURE FOR QUALITY-BASED KNOWLEDGE ACQUISITION

The knowledge acquisition methodology we propose is heavily based on the representation and reasoning facilities provided by terminological knowledge representation languages (for a survey, cf. Woods & Schmolze (1992)). As the representation of alternative hypotheses and their subsequent evaluation turn out to be major requirements of that approach, provisions have to be made to reflect these design decisions by an appropriate system architecture of the knowledge acquisition device (cf. Fig. 1). In particular, mechanisms should be provided for:

- Expressing quality-based assertions about propositions in a terminological language; these metastatements capture the ascription of belief to these propositions, the reasons why they came into existence, the support/weakening they may have received from other propositions, etc.

- Metareasoning in a terminological knowledge base about characteristic properties and relations between certain propositions; the corresponding second-order expressions refer to propositions (ABox elements) as well as concept and role definitions (TBox elements).

[Figure 1 diagram. Recoverable labels: reasoning system; initial context (KB kernel, text knowledge base, text parser, hypo generation, hypo integration, hypo space, i-th hypo space, (i+1)-th hypo space); metacontext (reified KB kernel, reified text knowledge base, reified hypo space, qualified reified hypo space, qualifier, qualification rules, selection criteria); translation rules connecting the two contexts.]

Figure 1: Architecture for Text Knowledge Acquisition

The notion of context we use as a formal foundation for terminological metaknowledge and metareasoning is based on McCarthy's context model (McCarthy, 1993). We here distinguish two types of contexts, viz. the initial context and the metacontext. The initial context contains the original terminological knowledge base (KB kernel) and the text knowledge base, a representation layer for the knowledge acquired from the underlying text by the text parser (Hahn et al., 1994). Knowledge in the initial context is represented without any explicit qualifications, attachments, provisos, etc. Note that in the course of text understanding, due to the working of the basic hypothesis generation rules (cf. Section "Hypothesis Generation"), a hypothesis space is created which contains alternative subspaces for each concept to be learned, each one holding different or further specialized concept hypotheses.

Various truth-preserving translation rules map the description of the initial context to the metacontext, which consists of the reified knowledge of the initial context. By reification, we mean a common reflective mechanism which splits up a predicative expression into its constituent parts and introduces a unique anchor term, the reificator, on which reasoning about this expression, e.g., the annotation by qualifying assertions, can be based. This kind of reification is close to the one underlying the FOL system (Weyhrauch, 1980; Giunchiglia & Weyhrauch, 1988). Among the reified structures in the metacontext there is a subcontext embedded, the reified hypothesis space, the elements of which carry several qualifications, e.g., reasons to believe a proposition, indications of consistency, type and strength of support, etc.

These quality labels result from incremental hypothesis evaluation and subsequent hypothesis selection and, thus, reflect the operation of several second-order qualification rules in the qualifier (quality-based classifier). The derived labels are the basis for the selection of those representation structures which are assigned a high degree of credibility; only those qualified hypotheses will be remapped to the hypothesis space of the initial context by way of (inverse) translation rules. Thus, we come full circle. In particular, at the end of each quality-based reasoning cycle the entire original i-th hypothesis space is replaced by its (i+1)-th successor in order to reflect the qualifications computed in the metacontext. The (i+1)-th hypothesis space is then the input of the next quality assessment round.
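One reasoning cycle of this architecture can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system's implementation: the hypothesis space is flattened to a plain list of role assertions (the text describes alternative subspaces), and `ALLOWED_ROLES`, `qualify`, `select`, and the item "X-100" are hypothetical stand-ins for the KB kernel, the qualification rules, and the selection criteria.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Assertion:
    """A proposition in the initial context, carrying no qualifications."""
    subject: str
    role: str
    filler: str


@dataclass
class Reified:
    """Metacontext counterpart: constituent parts plus a unique anchor term."""
    reificator: str
    subject: str
    role: str
    filler: str
    qualifications: set = field(default_factory=set)


_anchor_ids = count(1)


def reify(a: Assertion) -> Reified:
    """Translation rule: initial context -> metacontext."""
    return Reified(f"r{next(_anchor_ids)}", a.subject, a.role, a.filler)


def unreify(r: Reified) -> Assertion:
    """Inverse translation rule: metacontext -> initial context."""
    return Assertion(r.subject, r.role, r.filler)


def next_hypothesis_space(space, qualify, select):
    """One cycle: the i-th hypothesis space yields its (i+1)-th successor."""
    reified = [reify(a) for a in space]
    for r in reified:
        r.qualifications |= qualify(r, reified)      # qualifier step
    return [unreify(r) for r in reified if select(r)]


# Hypothetical KB kernel excerpt: roles each concept may carry.
ALLOWED_ROLES = {"Printer": {"has-resolution"}, "Company": {"has-employee"}}


def qualify(r, all_reified):
    """Toy qualification rule: an 'isa' hypothesis is inconsistent when the
    subject uses a role that the hypothesized concept does not allow."""
    if r.role != "isa":
        return {"consistent"}
    used = {o.role for o in all_reified
            if o.subject == r.subject and o.role != "isa"}
    allowed = ALLOWED_ROLES.get(r.filler, set())
    return {"consistent"} if used <= allowed else {"inconsistent"}


def select(r):
    """Toy selection criterion: drop anything labeled inconsistent."""
    return "inconsistent" not in r.qualifications


space_i = [
    Assertion("X-100", "isa", "Printer"),
    Assertion("X-100", "isa", "Company"),
    Assertion("X-100", "has-resolution", "300dpi"),
]
space_next = next_hypothesis_space(space_i, qualify, select)
```

The reificator gives each proposition an identity of its own, so labels attach to the record rather than to the proposition's content; dropping that anchor in `unreify` is what returns an unqualified assertion to the initial context.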

FORMAL FRAMEWORK OF QUALITY-BASED REASONING

Terminological Logic. We use a standard terminological concept description language, referred to as CDL, which has several constructors combining atomic concepts, roles and individuals to define the terminological theory of a domain (for a subset, see Table 1).

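Syntax            Semantics
$C_{atom}$        $\{ d \in C_{atom}^{\mathcal{I}} \mid C_{atom} \text{ is atomic} \}$
$C \sqcap D$      $C^{\mathcal{I}} \cap D^{\mathcal{I}}$
$C \sqcup D$      $C^{\mathcal{I}} \cup D^{\mathcal{I}}$
$\exists R.C$     $\{ d \in \Delta^{\mathcal{I}} \mid R^{\mathcal{I}}(d) \cap C^{\mathcal{I}} \neq \emptyset \}$
$\forall R.C$     $\{ d \in \Delta^{\mathcal{I}} \mid R^{\mathcal{I}}(d) \subseteq C^{\mathcal{I}} \}$
$R_{atom}$        $\{ (d, e) \in R_{atom}^{\mathcal{I}} \mid R_{atom} \text{ is atomic} \}$
$R \sqcap S$      $R^{\mathcal{I}} \cap S^{\mathcal{I}}$

Table 1: Syntax and Semantics for a Subset of CDL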

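Terminological Axioms
Axiom             Semantics
$A \doteq C$      $A^{\mathcal{I}} = C^{\mathcal{I}}$
$A \sqsubseteq C$ $A^{\mathcal{I}} \subseteq C^{\mathcal{I}}$

Assertional Axioms
Axiom             Semantics
$a : C$           $a^{\mathcal{I}} \in C^{\mathcal{I}}$
$a\,R\,b$         $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$

Table 2: CDL Axioms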
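The set-theoretic semantics of Tables 1 and 2 can be made concrete with a small evaluator over a finite interpretation. The sketch below is an illustration only: the tuple encoding of terms, the function names, and the sample interpretation (a printer "X-100" with a resolution) are assumptions, and `fillers` plays the part of the role-filler set of an individual under a role.

```python
def role_ext(role, interp):
    """Extension of a role term: a set of (d, e) pairs."""
    if role[0] == "role":                       # atomic role
        return set(interp["roles"].get(role[1], set()))
    if role[0] == "rand":                       # role conjunction
        return role_ext(role[1], interp) & role_ext(role[2], interp)
    raise ValueError(f"unknown role term: {role!r}")


def fillers(role, d, interp):
    """The set of role fillers of individual d under the given role."""
    return {e for (x, e) in role_ext(role, interp) if x == d}


def ext(term, interp):
    """Extension of a concept term (the rows of Table 1)."""
    op = term[0]
    if op == "atom":
        return set(interp["concepts"].get(term[1], set()))
    if op == "and":
        return ext(term[1], interp) & ext(term[2], interp)
    if op == "or":
        return ext(term[1], interp) | ext(term[2], interp)
    if op == "some":                            # existential restriction
        c = ext(term[2], interp)
        return {d for d in interp["domain"] if fillers(term[1], d, interp) & c}
    if op == "all":                             # value restriction
        c = ext(term[2], interp)
        return {d for d in interp["domain"] if fillers(term[1], d, interp) <= c}
    raise ValueError(f"unknown concept term: {term!r}")


def holds(axiom, interp):
    """Satisfaction of a single axiom (the rows of Table 2)."""
    kind = axiom[0]
    if kind == "def":                           # necessary and sufficient
        return ext(("atom", axiom[1]), interp) == ext(axiom[2], interp)
    if kind == "sub":                           # necessary only
        return ext(("atom", axiom[1]), interp) <= ext(axiom[2], interp)
    ind = interp["individuals"]
    if kind == "isa":                           # a : C
        return ind[axiom[1]] in ext(axiom[2], interp)
    if kind == "rel":                           # a R b
        return (ind[axiom[1]], ind[axiom[3]]) in role_ext(axiom[2], interp)
    raise ValueError(f"unknown axiom: {axiom!r}")


def is_model(interp, tbox, abox):
    """An interpretation is a model iff it satisfies every axiom."""
    return all(holds(ax, interp) for ax in list(tbox) + list(abox))


HAS_RES = ("role", "has-resolution")
interp = {
    "domain": {"p1", "c1", "r300"},
    "concepts": {"Printer": {"p1"}, "Product": {"p1"},
                 "Company": {"c1"}, "Resolution": {"r300"}},
    "roles": {"has-resolution": {("p1", "r300")}},
    "individuals": {"X-100": "p1", "300dpi": "r300"},
}
tbox = [("sub", "Printer",
         ("and", ("atom", "Product"), ("some", HAS_RES, ("atom", "Resolution"))))]
abox = [("isa", "X-100", ("atom", "Printer")),
        ("rel", "X-100", HAS_RES, "300dpi")]
```

Note that the value restriction holds vacuously for domain elements without any filler, which is why `ext(("all", HAS_RES, ("atom", "Resolution")), interp)` covers the whole domain in this sample interpretation.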

Concepts are unary predicates, roles are binary predicates over a domain $\Delta$, with individuals being the elements of $\Delta$. We assume a common set-theoretical semantics for CDL: an interpretation $\mathcal{I}$ is a function that assigns to each concept symbol (the set $\mathbf{A}$) a subset of the domain $\Delta$, $\mathcal{I} : \mathbf{A} \to 2^{\Delta}$, to each role symbol (the set $\mathbf{P}$) a binary relation of $\Delta$, $\mathcal{I} : \mathbf{P} \to 2^{\Delta \times \Delta}$, and to each individual symbol (the set $\mathbf{I}$) an element of $\Delta$, $\mathcal{I} : \mathbf{I} \to \Delta$. Concept terms and role terms are defined inductively. Table 1 contains corresponding constructors and their semantics, where $C$ and $D$ denote concept terms, while $R$ and $S$ denote roles. $R^{\mathcal{I}}(d)$ represents the set of role fillers of the individual $d$, i.e., the set of individuals $e$ with $(d, e) \in R^{\mathcal{I}}$.

By means of terminological axioms (for a subset, see Table 2) a symbolic name can be introduced for each concept. It is possible to define necessary and sufficient constraints (using $\doteq$) or only necessary constraints (using $\sqsubseteq$). A finite set of such axioms is called the terminology or TBox. Concepts and roles are associated with concrete individuals by assertional axioms (see Table 2; $a$, $b$ denote individuals). A finite set of such axioms is called the world description or ABox. An interpretation $\mathcal{I}$ is a model of an ABox with regard to a TBox, iff $\mathcal{I}$ satisfies the assertional and terminological axioms. Terminology and world description together constitute the terminological theory for a given domain.

Reification. Let us assume that any hypothesis space $H$ contains a characteristic terminological theory. In order to reason about that theory we split up the complex terminological expressions by means of reification. We here define the (bijective) reification function