LIF
Computational Linguistics Research Group
Albert-Ludwigs-Universität Freiburg im Breisgau, Germany
A QUALITY-BASED TERMINOLOGICAL REASONING MODEL FOR TEXT KNOWLEDGE ACQUISITION
Udo Hahn, Manfred Klenner & Klemens Schnattinger
LIF REPORT 2/96 (1996)
Computational Linguistics Research Group, Albert-Ludwigs-Universität Freiburg, Werthmannplatz 1, 79085 Freiburg, Germany
http://www.coling.uni-freiburg.de fhahn,klenner,
[email protected]
Abstract
We introduce a methodology for knowledge acquisition and concept learning from texts that relies upon a quality-based model of terminological reasoning. Concept hypotheses which have been derived in the course of the text understanding process are assigned specific "quality labels" (indicating their significance, reliability, strength). Quality assessment of these hypotheses accounts for conceptual criteria referring to their given knowledge base context as well as linguistic indicators (grammatical constructions, discourse patterns), which led to their generation. We advocate a metareasoning approach which allows for the quality-based evaluation and a bootstrapping-style selection of alternative concept hypotheses as text understanding incrementally proceeds.
Appeared in: N.Shadbolt, K.O'Hara, G.Schreiber (Eds.), EKAW'96 - Advances in Knowledge Acquisition. Proceedings of the 9th European Knowledge Acquisition Workshop, Nottingham, U.K., May 14 - 17, 1996, Berlin etc: Springer, 1996, pp.131-146 (LNAI 1076).
A Quality-Based Terminological Reasoning Model for Text Knowledge Acquisition
Udo Hahn, Manfred Klenner & Klemens Schnattinger
Freiburg University, Computational Linguistics Group, Europaplatz 1, D-79085 Freiburg, Germany
1 Introduction
The work reported in this paper is part of a large-scale project aiming at the development of a German-language text knowledge acquisition system for two real-world application domains: test reports on information technology products (current corpus size: approximately 100 documents with $10^5$ words) and medical findings reports (current corpus size: approximately 120,000 documents with $10^7$ words). The concept acquisition problem we face is two-fold. In the information technology domain lexical growth occurs at dramatic rates: new products, technologies, companies and people continuously enter the scene such that any attempt at keeping track of these lexical innovations by hand-coding is clearly precluded. Compared with these dynamics, the medical domain is lexically more stable but the sheer size of its sublanguage (conservative estimates range about $10^6$ lexical items/concepts) also cannot reasonably be coded by humans in advance. Therefore, the designers of text understanding systems for such challenging applications have to find ways to automate lexical/concept learning as a prerequisite and, at the same time, as a constituent part of the text knowledge acquisition process. Unlike the current mainstream with its focus on statistically based learning methodologies (Lewis, 1991; Resnik, 1992; Sekine et al., 1992), we advocate a symbolically rooted learning approach in order to break the concept acquisition bottleneck, one which is based on expressively rich (terminological) knowledge representation models of the underlying domain (Hahn et al., 1996b; Hastings, 1996).
We consider the problem of natural language based knowledge acquisition and concept learning from a new methodological perspective, viz. one based on metareasoning about statements expressed in a terminological knowledge representation language. Reasoning is about structural linguistic properties of phrasal patterns or discourse contexts in which unknown words occur (assuming that the type of grammatical construction exercises a particular interpretative force on the unknown lexical item), or it is about conceptual properties of particular concept hypotheses as they are generated and continuously refined by the ongoing text understanding process (e.g., consistency relative to already given knowledge, independent justification from several sources). Each of these grammatical, discourse or conceptual indicators is assigned a particular "quality" label. The application of quality macro operators, taken from a "qualification calculus" (Schnattinger & Hahn, 1996), to these atomic quality labels finally determines which out of several alternative hypotheses actually hold(s).
The decision for a metareasoning approach is motivated by requirements which emerged from our work in the overlapping fields of natural language parsing and learning from texts. Both tasks are characterized by the common need to evaluate alternative representation structures, either reflecting parsing ambiguities or multiple concept hypotheses. For instance, in the course of concept learning from texts, various and often conflicting concept hypotheses for a single item are formed as the learning environment usually provides only inconclusive evidence for exactly determining the properties of the concept to be learned.
Moreover, in "realistic" natural language understanding systems working with large text corpora, the underdetermination of results can often not only be attributed to incomplete knowledge provided for that concept in the data (source texts), but it may also be due to imperfect parsing results (originating from lacking lexical, grammatical, conceptual specifications, or ungrammatical input). Therefore, competing hypotheses at different levels of validity and reliability are the rule rather than the exception and, thus, require appropriate formal treatment. Accordingly, we view the problem of choosing from among several alternatives as a quality-based decision task which can be decomposed into three constituent parts: the continuous generation of quality labels for single hypotheses (reflecting the reasons for their formation and their significance in the light of other hypotheses), the estimation of the overall credibility of single hypotheses (taking the available set of quality labels for each hypothesis into account), and the computation of a preference order for the entire set of competing hypotheses, which is based on these accumulated quality judgments.
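The three-part decomposition just described can be illustrated with a minimal Python sketch. It is not the qualification calculus of Schnattinger & Hahn (1996): the label names, the integer weights in `LABEL_WEIGHTS`, the additive credibility score and the example concepts are all assumptions chosen only to make the three steps concrete.

```python
from dataclasses import dataclass, field

# Hypothetical quality labels with illustrative weights (an assumption;
# the paper's calculus is symbolic and is not reproduced here).
LABEL_WEIGHTS = {"supported": 1, "multiply_deduced": 2, "inconsistent": -3}


@dataclass
class Hypothesis:
    concept: str                      # candidate concept for the unknown item
    labels: list = field(default_factory=list)


def assign_label(hypothesis, label):
    """Step 1: attach a quality label to a single hypothesis."""
    if label not in LABEL_WEIGHTS:
        raise ValueError(f"unknown quality label: {label}")
    hypothesis.labels.append(label)


def credibility(hypothesis):
    """Step 2: estimate overall credibility from all labels of one hypothesis."""
    return sum(LABEL_WEIGHTS[label] for label in hypothesis.labels)


def preference_order(hypotheses):
    """Step 3: order the competing hypotheses, most credible first."""
    return sorted(hypotheses, key=credibility, reverse=True)


# Two competing hypotheses for one unknown lexical item (invented example).
printer = Hypothesis("Printer")
company = Hypothesis("Company")
assign_label(printer, "supported")
assign_label(printer, "multiply_deduced")
assign_label(company, "supported")
assign_label(company, "inconsistent")

ranking = [h.concept for h in preference_order([company, printer])]
```

A numeric sum is used here only because it is the shortest way to show how accumulated labels can induce a preference order; any other aggregation over the label set would fit the same three-step outline.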
2 Architecture for Quality-Based Knowledge Acquisition
The knowledge acquisition and concept learning methodology we propose is heavily based on the representation and reasoning facilities provided by terminological knowledge representation languages. As the representation of alternative hypotheses and their subsequent evaluation turn out to be major requirements of that approach, provisions have to be made to reflect these design decisions by an appropriate system architecture of the knowledge acquisition device (cf. Fig. 1). In particular, mechanisms should be provided for:
- Expressing quality-based assertions about propositions in a terminological language; these metastatements capture the ascription of belief to these propositions, the reasons why they came into existence, the support/weakening they may have received from other propositions, etc.
- Metareasoning in a terminological knowledge base about characteristic properties and relations between certain propositions; the corresponding second-order expressions refer to factual propositions (ABox elements) as well as concept and role definitions (TBox elements).

[Figure 1. Architecture for Text Knowledge Acquisition. Labels in the diagram: knowledge acquisition system; text parser; hypothesis generation; initial context (KB kernel, text knowledge base, hypothesis space, i-th and (i+1)-th hypo space); translation rules; metacontext (reified KB kernel, reified text knowledge base, reified hypothesis space, qualified reified hypo space); qualifier with qualification rules; qualified hypo space; selection criteria.]

The notion of context we use as a formal foundation for terminological metaknowledge and metareasoning is based on McCarthy's context model (McCarthy, 1993). We here distinguish two types of contexts, viz. the initial context and the metacontext. The initial context contains the original terminological knowledge base (KB kernel) and the text knowledge base, a representation layer for the knowledge acquired from the underlying text by the text parser (Hahn et al., 1994). Knowledge in the initial context is represented without any explicit qualifications, attachments, provisos, etc. Note that in the course of text understanding, due to the working of the basic hypothesis generation rules (cf. Section 4), a hypothesis space is created which contains alternative subspaces for each concept to be learned, each one holding different or further specialized concept hypotheses. Various truth-preserving translation rules map the description of the initial context to the metacontext which consists of the reified knowledge of the initial context (cf. Section 3).
By reification, we mean a common reflective mechanism (Friedman & Wand, 1984), which splits up a predicative expression into its constituent parts and introduces a unique anchor term, the reificator, on which reasoning about this expression, e.g., the annotation by qualifying assertions, can be based. Among the reified structures in the metacontext there is a subcontext embedded, the reified hypothesis space, the elements of which carry several qualifications, e.g., reasons to believe a proposition, indications of consistency, type and strength of support, etc. These quality labels result from incremental hypothesis evaluation and subsequent hypothesis selection, and, thus, reflect the operation of several second-order qualification rules in the qualifier (quality-based classifier). The derived labels are the basis for the selection of those representation structures that are assigned a high degree of credibility; only those qualified hypotheses will be remapped to the hypothesis space of the initial context by way of (inverse) translation rules. Thus, we come full circle. In particular, at the end of each quality-based reasoning cycle the entire original i-th hypothesis space is replaced by its (i+1)-th successor in order to reflect the qualifications computed in the metacontext. The (i+1)-th hypothesis space is then the input of the next quality assessment round.
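One quality-based reasoning cycle, as described above, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system's implementation: the class names `Proposition` and `Reified`, the function names, the single qualification rule and the selection criterion are all hypothetical, and plain Python objects stand in for the terminological representation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proposition:
    """An unqualified assertion in the initial context, e.g. Printer(x1)."""
    predicate: str
    args: tuple


@dataclass
class Reified:
    """The same assertion split into its parts around a unique reificator."""
    reificator: str
    predicate: str
    args: tuple
    qualifications: list = field(default_factory=list)


def translate(hypo_space):
    """Map the i-th hypothesis space to its reified counterpart."""
    return [Reified(f"r{i}", p.predicate, p.args)
            for i, p in enumerate(hypo_space, start=1)]


def inverse_translate(reified):
    """Map a reified hypothesis back to the initial context."""
    return Proposition(reified.predicate, reified.args)


def reasoning_cycle(hypo_space, qualification_rules, is_credible):
    """Reify, qualify, select, and return the (i+1)-th hypothesis space."""
    reified_space = translate(hypo_space)
    for r in reified_space:
        for rule in qualification_rules:
            label = rule(r, reified_space)
            if label is not None:
                r.qualifications.append(label)
    return [inverse_translate(r) for r in reified_space if is_credible(r)]


# Hypothetical rule: in this toy domain, companies have no parts.
def no_parts_for_companies(r, space):
    has_part = any(o.predicate == "hasPart" and o.args[0] == r.args[0]
                   for o in space)
    if r.predicate == "Company" and has_part:
        return "inconsistent"
    return None


def is_credible(r):
    return "inconsistent" not in r.qualifications


space_i = [Proposition("Printer", ("x1",)),
           Proposition("Company", ("x1",)),
           Proposition("hasPart", ("x1", "toner1"))]
space_next = reasoning_cycle(space_i, [no_parts_for_companies], is_credible)
```

The qualifications live only on the reified objects, which mirrors the separation the text draws between the unqualified initial context and the metacontext; `space_next` would be the input to the next round.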
3 Formal Framework of Quality-Based Reasoning
Description Logics. We use a standard concept description language, referred to as CDL, which has several constructors combining atomic concepts, roles and individuals to define the terminological theory of a domain (for a subset, see Table 1; Woods & Schmolze (1992) give a survey of terminological languages). Concepts are unary predicates, roles are binary predicates over a domain $\Delta$, with individuals being the elements of $\Delta$. We assume a common set-theoretical semantics for CDL: an interpretation $\mathcal{I}$ is a function that assigns to each concept symbol (the set $\mathbf{A}$) a subset of the domain $\Delta$, $\mathcal{I} : \mathbf{A} \to 2^{\Delta}$, to each role symbol (the set $\mathbf{P}$) a binary relation of $\Delta$, $\mathcal{I} : \mathbf{P} \to 2^{\Delta \times \Delta}$, and to each individual symbol (the set $\mathbf{I}$) an element of $\Delta$, $\mathcal{I} : \mathbf{I} \to \Delta$. Concept terms and role terms are defined inductively. Table 1 contains corresponding constructors and their semantics, where $C$ and $D$ denote concept terms, while $R$ and $S$ denote roles. $R^{\mathcal{I}}(d)$ represents the set of role fillers of the individual $d$, i.e., the set of individuals $e$ with $(d, e) \in R^{\mathcal{I}}$. By means of terminological axioms (for a subset, see Table 2) a symbolic name can be introduced for each concept. It is possible to define necessary and sufficient constraints (using $\doteq$) or only necessary constraints (using $\sqsubseteq$). A finite set of such axioms is called the terminology or TBox. Concepts and roles are associated with concrete individuals by assertional axioms (see Table 2; $a$, $b$ denote individuals). A finite set of such axioms is called the world description or ABox. An interpretation $\mathcal{I}$ is a model of an ABox with regard to a TBox, iff $\mathcal{I}$ satisfies the assertional and terminological axioms. Terminology and world description together constitute the terminological theory for a given domain.

Syntax              Semantics
$C_{atom}$          $\{ d \in C_{atom}^{\mathcal{I}} \mid C_{atom} \text{ is atomic} \}$
$C \sqcap D$        $C^{\mathcal{I}} \cap D^{\mathcal{I}}$
$C \sqcup D$        $C^{\mathcal{I}} \cup D^{\mathcal{I}}$
$\exists R.C$       $\{ d \in \Delta \mid R^{\mathcal{I}}(d) \cap C^{\mathcal{I}} \neq \emptyset \}$
$\forall R.C$       $\{ d \in \Delta \mid R^{\mathcal{I}}(d) \subseteq C^{\mathcal{I}} \}$
$R_{atom}$          $\{ (d, e) \in R_{atom}^{\mathcal{I}} \mid R_{atom} \text{ is atomic} \}$
$R \sqcap S$        $R^{\mathcal{I}} \cap S^{\mathcal{I}}$
Table 1. Syntax and Semantics for a Subset of CDL
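Terminological Axioms
Axiom               Semantics
$A \doteq C$        $A^{\mathcal{I}} = C^{\mathcal{I}}$
$A \sqsubseteq C$   $A^{\mathcal{I}} \subseteq C^{\mathcal{I}}$

Assertional Axioms
Axiom               Semantics
$a : C$             $a^{\mathcal{I}} \in C^{\mathcal{I}}$
$a \; R \; b$       $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$
Table 2. CDL Axioms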
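The set-theoretical semantics of Tables 1 and 2 can be made concrete over a small finite domain. The following Python sketch is an illustration only: the tuple encoding of terms, the dictionary layout of the interpretation, the function names and the toy domain (printers and cartridges) are assumptions and not part of CDL or of the system described here.

```python
def role_ext(role, interp):
    """Extension of a role term: atomic role or role conjunction."""
    kind = role[0]
    if kind == "atom":
        return interp["roles"][role[1]]
    if kind == "and":
        return role_ext(role[1], interp) & role_ext(role[2], interp)
    raise ValueError(f"unknown role constructor: {kind}")


def fillers(role, d, interp):
    """Set of role fillers of d, i.e. all e with (d, e) in the role extension."""
    return {e for (x, e) in role_ext(role, interp) if x == d}


def ext(concept, interp):
    """Extension of a concept term, following the rows of Table 1."""
    kind = concept[0]
    if kind == "atom":
        return interp["concepts"][concept[1]]
    if kind == "and":
        return ext(concept[1], interp) & ext(concept[2], interp)
    if kind == "or":
        return ext(concept[1], interp) | ext(concept[2], interp)
    if kind in ("some", "all"):
        c_ext = ext(concept[2], interp)
        if kind == "some":   # existential restriction: some filler is in C
            return {d for d in interp["domain"]
                    if fillers(concept[1], d, interp) & c_ext}
        # value restriction: every filler is in C (vacuously true if none)
        return {d for d in interp["domain"]
                if fillers(concept[1], d, interp) <= c_ext}
    raise ValueError(f"unknown concept constructor: {kind}")


def satisfies(axiom, interp):
    """Check one terminological or assertional axiom (Table 2)."""
    kind = axiom[0]
    ind = interp["individuals"]
    if kind == "def":    # A defined as C: extensions are equal
        return interp["concepts"][axiom[1]] == ext(axiom[2], interp)
    if kind == "sub":    # A subsumed by C: extension is a subset
        return interp["concepts"][axiom[1]] <= ext(axiom[2], interp)
    if kind == "inst":   # a : C
        return ind[axiom[1]] in ext(axiom[2], interp)
    if kind == "rel":    # a R b
        return (ind[axiom[1]], ind[axiom[3]]) in role_ext(axiom[2], interp)
    raise ValueError(f"unknown axiom type: {kind}")


def is_model(tbox, abox, interp):
    return all(satisfies(ax, interp) for ax in tbox + abox)


# Toy interpretation (invented for illustration).
PRINTER, CARTRIDGE = ("atom", "Printer"), ("atom", "Cartridge")
HAS_PART = ("atom", "hasPart")
interp = {
    "domain": {"p1", "c1", "x"},
    "concepts": {"Printer": {"p1"}, "Cartridge": {"c1"},
                 "Hardware": {"p1", "c1"}, "PrinterWithCartridge": {"p1"}},
    "roles": {"hasPart": {("p1", "c1")}},
    "individuals": {"laser": "p1", "toner": "c1"},
}
tbox = [("sub", "Printer", ("atom", "Hardware")),
        ("def", "PrinterWithCartridge",
         ("and", PRINTER, ("some", HAS_PART, CARTRIDGE)))]
abox = [("inst", "laser", PRINTER), ("rel", "laser", HAS_PART, "toner")]

some_part = ext(("some", HAS_PART, CARTRIDGE), interp)
all_part = ext(("all", HAS_PART, CARTRIDGE), interp)
model_ok = is_model(tbox, abox, interp)
```

In this toy interpretation the value restriction holds for every domain element, because elements without any `hasPart` filler satisfy it vacuously, whereas the existential restriction holds only for `p1`.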