DB-MAT: A NL Based Interface to Domain Knowledge
Galia Angelova, Kalina Bontcheva
Bulgarian Academy of Sciences, Linguistic Modelling Lab
Acad. G. Bonchev Str. 25A, Sofia 1113, Bulgaria
[email protected]
Abstract
Successful user-friendly interfaces will enable the application of knowledge-based techniques in systems oriented towards end users who are not specialised in computer science. This paper discusses an approach to knowledge-based Machine Aided Translation (MAT) which provides a user-friendly interface to Knowledge Bases (KB) of conceptual graphs. It is shown that each knowledge item is accessible in a simple and transparent way. The system explanations, synthesised by Natural Language (NL) generation, allow for further clarifications within the context of previous answers. An evaluation of our approach in comparison to the strategy of follow-up questions is discussed as well.
1 Introduction and rationale
KB methods from Artificial Intelligence are seldom combined with computational linguistics. There are two major difficulties for the adaptation of knowledge-based approaches to Natural Language Processing (NLP) systems: (i) a task-adequate and useful distinction between language and knowledge is often hard to define; (ii) such an architecture requires theoretically sound knowledge processing with a clear idea of how the knowledge structures underlie the surface language phenomena. On the other hand, many typical NLP tasks investigate problems where knowledge plays a crucial role. For instance, in machine translation, domain knowledge is usually encoded in semantic lexicons in order to facilitate the production of adequate and high-quality target texts (surveys show that during the translation process human translators often need at least an elementary understanding of the domain, either because the terms' meaning affects the translation of the whole context, or because there are lexical gaps between the languages and thus paraphrases are needed). The importance of domain knowledge in translation1 is reflected in some recent machine translation projects [5], although the problems of integrating domain knowledge in translation systems are far from being solved. Our approach applies knowledge-based techniques to achieve a flexible and coherent representation of terminology in a MAT system. A KB represents the conceptual grid of the domain (the domain ontology) as well as the facts necessary for the translator
Current address: University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello Str., Sheffield S1 4DP, [email protected]
1 Especially for technical texts, which are more than 95% of the total amount of translated texts.
reorganisation of the semantic entries in the terminological lexicon of MAT systems: the semantics of each term is encoded into a KB and a link to the respective knowledge item is given in the lexicon. Consequently, we designed DB-MAT2 with the following main features [12, 13, 14]: translators obtain linguistic as well as domain information within an environment containing interconnected linguistic data, terminological and domain knowledge; explanations (requested for terms) are generated from extracts of the knowledge base and verbalised in a selected natural language; the users always formulate their requests for clarification by highlighting a sequence in the source text and selecting the request type from a hierarchical menu. In this way we aim at creating a user interface which is similar to an ordinary text editor. This goal requires, in fact, a strategy that provides a user-friendly interface to the KB. In Section 2 below we comment on some related approaches for building NL based interfaces to domain knowledge which also apply NL generation techniques to produce explanations. Section 3 discusses quite briefly the DB-MAT paradigm and Conceptual Graphs (cgs), the underlying knowledge representation formalism. In Section 4 we give a more detailed account of our algorithms for knowledge extraction and verbalisation. Section 5 summarises our current experience in building NL interfaces to domain knowledge. Sections 6 and 7 contain some implementation details and the conclusion.
2 Related Work
DB-MAT can be successfully compared to other applied NL Generation (NLG) systems which generate explanations from a KB with domain knowledge, e.g., IDAS [9]. The latter generates short targeted responses that support users of complex machinery. The system is built upon a KL-ONE-like knowledge base and has a hypertext interface where the user can obtain additional information by following links and choosing follow-up questions from a context-sensitive pop-up menu. IDAS has a finite number of questions it can answer (as does DB-MAT). The IDAS explanations do not take into account previous interaction and always comprise a closed focus space, i.e., no referring expressions identifying objects from previous explanations are allowed3. A more flexible (and complex) NLG system is PEA [7], which explains the reasoning of an expert system and allows for follow-up questions to ensure the user's understanding. PEA accounts for focus, the previous dialogue and the user's knowledge when providing clarifications. The user can ask about noun phrases and clauses. When the mouse cursor passes over a text which is explainable, it becomes highlighted and a mouse click displays a menu with relevant follow-up questions. The common feature of all these systems is the direct-manipulation menu-based interface. The latter has proved to have multiple advantages over a (constrained) NL KB interface, which also requires a sophisticated NL understanding module. Probably the most important asset of the menu-based interface is its transparency to the user, i.e., it shows only those questions the system is prepared to answer and avoids problems (such as questions the system knows nothing about).
2 A German-Bulgarian Knowledge Based MAT project, funded by the Volkswagen Foundation in 1993-1995 (www.informatik.uni-hamburg.de/Arbeitsbereiche/NATS/projects/db-mat.html).
3 An object introduced in one node cannot be referred to in another unless it is reintroduced.
3 DB-MAT: a Knowledge Based MAT using Conceptual Graphs
DB-MAT is a prototype knowledge-based translation environment providing linguistic as well as domain knowledge support. The system has a user-friendly interface providing the standard functionality of text processing systems with some innovative elements: (i) figures are attached to the lexical entries; thus, the user is shown relevant graphics that illustrate the textual explanation (as in standard encyclopaedias and textbooks); (ii) the menu item for accessing domain knowledge, called Explanation, is a submenu of the main menu Information, like Grammar, Lexicon, Figures; hence, the user obtains domain clarifications in a standard and intuitive way; (iii) the translator selects the explanation language and the level of detail for the generated explanations with radio buttons. Throughout the paper we will pay special attention to the variations of the details level, which directly influences the flexible search strategies. DB-MAT uses cgs as a knowledge representation formalism due to their well-defined operations and their ability to represent contexts [10, 11]. In brief, cgs are finite, bipartite graphs. The nodes are either concepts or conceptual relations. The two kinds of nodes are connected by directed arcs. Concepts can have arcs only to conceptual relations and vice versa. All concept types form a type hierarchy (a lattice). Concepts denote basic notions in the domain and the conceptual relations show the semantic relations that hold between the connected concepts. cgs have a well-defined mechanism for inheritance and operations like projection, join, type expansion etc. Below we exemplify the use of these operations as a semantic basis for NLP.
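The bipartite structure just described can be sketched compactly. DB-MAT itself stores cgs in Prolog; the following Python sketch (all identifiers are illustrative, not taken from the system) shows concept and relation nodes plus a minimal subsumption test over the type lattice:

```python
# A conceptual graph as a bipartite structure: concept nodes and
# relation nodes, with arcs only between the two kinds of nodes
# (a concept never points directly at another concept).

class Concept:
    def __init__(self, type_, referent=None):
        self.type = type_          # e.g. "WASTE_WATER"
        self.referent = referent   # individual marker, or None for generic

class Relation:
    def __init__(self, type_, args):
        self.type = type_          # e.g. "CHAR", "CONTAIN"
        self.args = args           # ordered tuple of Concept nodes

# Fragment of graph 1: [WASTE_WATER]->(CHAR)->[DISPERSION], ...
waste_water = Concept("WASTE_WATER")
dispersion  = Concept("DISPERSION")
admixture   = Concept("ADMIXTURE")
graph1 = [
    Relation("CHAR",    (waste_water, dispersion)),
    Relation("CONTAIN", (waste_water, admixture)),
]

# A simplified type lattice: subtype -> immediate supertypes.
TYPE_HIERARCHY = {
    "OIL": ["ADMIXTURE"],
    "PARTICLE": ["ADMIXTURE"],
    "OIL_PARTICLE": ["OIL", "PARTICLE"],  # multiple inheritance
}

def subsumes(generic, specific):
    """True if `generic` is `specific` or one of its ancestors."""
    if generic == specific:
        return True
    return any(subsumes(generic, s) for s in TYPE_HIERARCHY.get(specific, []))

print(subsumes("ADMIXTURE", "OIL_PARTICLE"))  # True
```

The lattice, rather than a tree, is what allows OIL_PARTICLE to inherit from both OIL and PARTICLE, a property the examples in Section 4.2 rely on.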
4 Interfacing Domain Knowledge in DB-MAT
The user interface design was motivated by the following ideas: translators obtain domain clarification by highlighting strings in the (source) text and selecting a request from the Explanation menu; therefore the system needs a mapping between the domain-specific knowledge items and the NL lexemes; the menu items should convey the expected semantic content of the resulting explanation; different menu items should yield different system answers (although some answers may partially overlap); more experienced users should be allowed to modify the menu design and behaviour, since each translator could have 'favourite' names for the semantic relations and individual ideas regarding the expected semantic content; since the KB contains arbitrary concepts and conceptual relations, each knowledge item should be explainable if suitable requests are formulated.

4.1 Mapping the User's Request to Domain Knowledge

In our approach, we have no fixed predefined schemata mapping a user request to certain knowledge fragments. Given a highlighted term (i.e., its underlying concept) and a user request for domain knowledge, the Query Mapper (QM) searches the KB on the fly and extracts (by projection without subsumption) all relevant facts according to the conceptual relations. Depending on the details level and request type, our algorithm also retrieves all attributes and characteristics inherited from a more generic concept (see [1]).
The Explanation menu reflects the type hierarchy and the types of conceptual relations. There are six subitems in the Explanation menu: What is?, Types of..., Characteristics, More..., Examples, Want All. Our assumption is that in a system with a well-elaborated KB, the user should always get the relevant answer (moreover, we assume that the user will seldom ask all menu questions about the same highlighted lexical item). Types of... verbalises the type hierarchy, while the other queries require processing of the conceptual graphs from the KB. Section 4.2 below illustrates the extraction of temporal subgraphs which constitute the semantics of the resulting NL explanation.

Menu Item        Submenu     Corresponding Conceptual Relations       Inheritance
What is?                     Types of.../All + ATTR, CHAR, PART_OF    Yes
Types of...      All         super- + sub- + sister-concepts
                 General     all superconcepts from the hierarchy
                 Concrete    all subconcepts from the hierarchy
                 Similar     all sister-concepts from the hierarchy
Characteristics  All         attributes + who + obj + how + where     Yes
                 Attributes  ATTR, CHAR                               Yes
                 Who         AGNT
                 Object      OBJ, PTNT
                 How         INST
                 Where       LOC, DEST, FROM, IN, TO
More...                      all the remaining conceptual relations
Examples                     individual concepts
Want All                     all from above without repetitions       Yes
Table 1: Query types from the Explanation menu and their corresponding conceptual relations

The QM maintains the list of relevant conceptual relations (cf. Table 1) currently available in the KB, as well as their correspondence to the menu items. Experienced users may customise this mapping (together with the names of the menu items) according to their own view of the domain. So far, the QM has a fixed scope of extraction: for most of the conceptual relations it is "one step around" (i.e., one relation "far" from) the selected concept. Nested graphs (e.g., situations) are extracted as unbreakable knowledge fragments due to their specific semantics [11].

4.2 Extracting the Relevant Knowledge - Examples

Figure 1 presents a partial type hierarchy and cgs in the so-called linear notation: concepts like [PARTICLE] are related to conceptual relations like (CHAR). For simplicity, the internal Prolog representation of cgs, as well as the links between the lexicon and the KB items, are not discussed here. The given KB is, of course, much simpler than the real DB-MAT one: we have shown just small simple parts of graphs that yield short answers and the relevant part of the type hierarchy. All type definitions are omitted for brevity. Operations like type expansion/contraction are not discussed here. Therefore, the KB contains concepts with different granularity (for instance, [LIGHTER_THAN_WATER] is actually not represented as a basic concept, i.e., it has a corresponding type definition). A more detailed discussion of the KB, the QM and the problems of linking lexical and domain knowledge is given in [1, 2].
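Before turning to the examples, the QM's customisable menu-to-relation mapping and its "one step around" extraction scope can be sketched as follows (a simplified Python reconstruction; DB-MAT itself is written in Prolog, and all identifiers here are illustrative):

```python
# Sketch of the Query Mapper: each menu item is backed by a set of
# conceptual relations (cf. Table 1), and extraction is limited to
# relations one step around the highlighted concept.

MENU_RELATIONS = {
    "Characteristics/Attributes": {"ATTR", "CHAR"},
    "Characteristics/Who":        {"AGNT"},
    "Characteristics/Object":     {"OBJ", "PTNT"},
    "Characteristics/How":        {"INST"},
    "Characteristics/Where":      {"LOC", "DEST", "FROM", "IN", "TO"},
}

# A KB fragment as (source_concept, relation, target_concept) triples.
KB = [
    ("WASTE_WATER", "CHAR",    "DISPERSION"),
    ("WASTE_WATER", "CHAR",    "CONCENTRATION"),
    ("WASTE_WATER", "CONTAIN", "ADMIXTURE"),
]

def extract_one_step(concept, menu_item):
    """Return the triples one step around `concept` whose relation
    belongs to the menu item's relation set."""
    wanted = MENU_RELATIONS[menu_item]
    return [(s, r, t) for (s, r, t) in KB
            if r in wanted and concept in (s, t)]

print(extract_one_step("WASTE_WATER", "Characteristics/Attributes"))
# the two (CHAR) triples around WASTE_WATER
```

Because the mapping is a plain table rather than hard-wired code, an experienced user can rename menu items or reassign relations without touching the extraction logic, which is the customisation facility described above.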
Partial type hierarchy: ENTITY subsumes PHYS_OBJECT, NON_MOBILE, LIVING, ...; ADMIXTURE subsumes OIL, PARTICLE and MINERAL; OIL_PARTICLE is a common subtype of OIL and PARTICLE.
graph 1:
[WASTE_WATER] -
   -> (CHAR) -> [DISPERSION] -> (OF) -> [PARTICLE]
   -> (CHAR) -> [CONCENTRATION] -> (OF) -> [PARTICLE]
   -> (CONTAIN) -> [ADMIXTURE].

graph 2:
[PRECIPITATION] -> (OBJ) -> [SITUATION:
   [WASTE_WATER] -> (CONTAIN) -> [OIL_PARTICLE] -
      -> (ATTR) -> [LIGHTER_THAN_WATER]
      -> (ATTR) -> [SWIMMING]
      -> (ATTR) -> [ROUGHLY_DISPERSED] ].

graph 3:
[INDUSTRY] -
   -> (ATTR) -> [OIL_PROCESSING]
   -> (RESULT) -> [SITUATION: [WASTE_WATER] -> (CONTAIN) -> [OIL] ].

graph 4:
[OIL_PARTICLE] -> (CHAR) -> [DENSITY].

graph 5:
[PARTICLE] -> (CHAR) -> [DIMENSION].
Figure 1: Sample Knowledge Base

Now let us demonstrate the system's behaviour while the user (a translator) reads a source German text in the domain of admixture separation, highlights text strings (domain terms in the simple case) and formulates menu-based requests for explanations (the menu items are given in Table 1). In addition to the two parameters highlighted text string and request type, DB-MAT always takes into account the required detail level (details: less/more) and the explanation language (explanations in: german/bulgarian), both selected by the user from the system interface. Hence, each of the following examples should be considered within the dynamic context of details and language. However, here we have fixed the explanation language to German while the details level changes among the examples. If no details level is explicitly mentioned, then less is assumed. For instance, if the user highlights Abwasser (waste water) and asks What is?, the QM will search for all occurrences of the concept type [WASTE_WATER] in the context of the conceptual relations (ATTR) and (CHAR) and will extract by projection all attributes and characteristics from graph 1 into temporal graph 1 (cf. Figure 2). Using temporal graph 1 as a semantic pool, DB-MAT will generate the answer:

Abwasser ist gekennzeichnet durch Dispersion und Konzentration.

Note that given the sample KB in Figure 1, the answer will be the same for the
temporal graph 1 - extracted by projection from graph 1:
[WASTE_WATER] -
   -> (CHAR) -> [DISPERSION]
   -> (CHAR) -> [CONCENTRATION].

temporal graph 2 - extracted by projection from graph 1:
[WASTE_WATER] -> (CONTAIN) -> [ADMIXTURE].

temporal graph 3 - a whole situation - extracted by projection from graph 2:
[WASTE_WATER] -> (CONTAIN) -> [OIL_PARTICLE] -
   -> (ATTR) -> [LIGHTER_THAN_WATER]
   -> (ATTR) -> [SWIMMING]
   -> (ATTR) -> [ROUGHLY_DISPERSED].

temporal graph 4 - another situation - extracted by projection from graph 3:
[WASTE_WATER] -> (CONTAIN) -> [OIL].
Figure 2: Extracted Temporal Graphs

queries What is?, Characteristics/All and Characteristics/Attributes, since the search strategies (cf. Table 1) overlap. It is also worth mentioning that if the sample KB contained the following graph 1a instead of graph 1:

[WASTE_WATER] -
   -> (CHAR) -> [SITUATION: [DISPERSION] -> (OF) -> [PARTICLE]]
   -> (CHAR) -> [SITUATION: [CONCENTRATION] -> (OF) -> [PARTICLE]]
   -> (CONTAIN) -> [ADMIXTURE].
then the QM would extract both situations as unbreakable chunks of knowledge, thus yielding the explanation:

Abwasser ist gekennzeichnet durch Dispersion der Partikeln und Konzentration der Partikeln.

However, our sample KB represents more knowledge concerning WASTE_WATER. If the user highlights Abwasser and asks More, the QM will extract temporal graphs 2, 3 and 4 given in Figure 2. After joining these temporal graphs, DB-MAT will generate:

a) in case of details = less:
Abwasser enthält Öl und Beimischungen.

b) in case of details = more:
Abwasser enthält Beimischungen und grobdisperse und ausschwimmende Ölpartikeln, welche leichter als Wasser sind.

The QM not only extracts the relevant knowledge, but also processes the temporal graphs in order to assure more relevant semantics of the answer. For instance, equivalent temporal graphs are erased to avoid repetitions. Moreover, as evident from the last two examples, the type hierarchy is also taken into consideration. Since OIL_PARTICLE inherits from OIL (cf. Figure 1), the QM will select to verbalise either the superconcept or the subconcept, depending on the required details level (less, more). Thus, the QM presents more general knowledge for detail level less and more specific knowledge for more. Hitherto we have briefly outlined the semantic preselection in DB-MAT (for a more detailed account of the QM filtering functions see [1]). If the user highlights Beimischung (admixture) in the source German text or in the explanation window, for a What is? question the answer will be:
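The duplicate erasure and the detail-dependent choice between super- and subconcept can be sketched as follows (a hypothetical simplification in Python of the QM filtering; the actual Prolog code operates on full graphs rather than triples):

```python
# Sketch of the QM's post-filtering of extracted temporal graphs:
# for details = "less" a concept is generalised to its superconcept,
# and equivalent triples are then erased to avoid repetitions.

SUPER = {"OIL_PARTICLE": "OIL"}   # fragment of the type hierarchy

def filter_graphs(triples, details="less"):
    seen, result = set(), []
    for (s, r, t) in triples:
        if details == "less":
            t = SUPER.get(t, t)   # verbalise the more generic concept
        if (s, r, t) not in seen:
            seen.add((s, r, t))
            result.append((s, r, t))
    return result

graphs = [
    ("WASTE_WATER", "CONTAIN", "OIL_PARTICLE"),  # temporal graph 3
    ("WASTE_WATER", "CONTAIN", "OIL"),           # temporal graph 4
    ("WASTE_WATER", "CONTAIN", "ADMIXTURE"),     # temporal graph 2
]
print(filter_graphs(graphs, "less"))
# OIL_PARTICLE generalised to OIL; the duplicate CONTAIN->OIL erased,
# leaving exactly the semantics of "Abwasser enthält Öl und Beimischungen."
```

For details = more the sketch simply keeps the specific concepts; the real QM additionally drops a generic triple when a more specific equivalent survives, which is why answer b) above mentions Ölpartikeln but not Öl.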
which is taken from the type hierarchy (let us recall that the answer is given within the context of the sample KB in Figure 1). In case of the question More for Beimischung, DB-MAT generates the explanation:

Beimischungen sind enthalten in Abwasser.

which is a verbalisation of temporal graph 2 in Figure 2. As shown above, the same graph is also verbalised as an explanation for Abwasser. However, in this case the focus is another highlighted concept, Beimischung, and, therefore, the temporal graph is verbalised in passive voice (for more details about the generation algorithm see Section 4.3 and [3]). The last examples will illustrate the role of inherited knowledge and synonyms in the process of forming the internal semantics of the answer. Since Ölpartikel and Ölphase are synonyms in the lexicon, they are both connected to the same underlying concept, i.e., OIL_PARTICLE. Thus, for Ölphase and What is? in case of details = less, the DB-MAT answer is:

Ölphasen (Ölpartikel) gehören zu Partikeln. Die Ölphasen sind gekennzeichnet durch Dichte. Die grobdispersen und ausschwimmenden Ölphasen, welche leichter als Wasser sind, sind enthalten in Abwasser.

Here the first sentence of the generated explanation verbalises the type hierarchy together with the synonym available in the DB-MAT lexicon. The second sentence comes from graph 4 (see Figure 1). The third sentence is a passive verbalisation of temporal graph 3. In case of details = more, however, the above explanation will change to:

Ölphasen (Ölpartikel) gehören zu Partikeln. Die Ölphasen sind gekennzeichnet durch Dichte und Dimension. Die grobdispersen und ausschwimmenden Ölphasen, welche leichter als Wasser sind, sind enthalten in Abwasser.

The second sentence in this case is a verbalisation of graphs 4 and 5. All characteristics of higher concept types are inherited and verbalised as characteristics of the highlighted concept.
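The inheritance that yields "Dichte und Dimension" is a straightforward walk up the type hierarchy, collecting (CHAR) links at every level. A Python sketch (illustrative names, following the sample KB of Figure 1):

```python
# Collecting the characteristics of a concept together with those
# inherited from all of its superconcepts (graphs 4 and 5).

SUPERS = {"OIL_PARTICLE": ["PARTICLE"], "PARTICLE": ["ADMIXTURE"]}
CHARS  = {"OIL_PARTICLE": ["DENSITY"],        # graph 4
          "PARTICLE":     ["DIMENSION"]}      # graph 5

def inherited_chars(concept):
    """Own (CHAR) targets first, then those of each superconcept."""
    result = list(CHARS.get(concept, []))
    for sup in SUPERS.get(concept, []):
        result += inherited_chars(sup)
    return result

print(inherited_chars("OIL_PARTICLE"))  # ['DENSITY', 'DIMENSION']
```

Verbalising DENSITY before DIMENSION mirrors the generated sentence above, where the concept's own characteristic precedes the inherited one.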
4.3 Verbalising the Graphs

Since the generation component is tightly bound to the underlying semantic representation, the applied algorithms are strongly influenced by some specific cgs features. We have focused mainly on the surface realisation task, but we also tackle the problem of organising the input semantics into a coherent unit. The generator (EGEN) receives as input: (i) the relevant knowledge pool formed by the QM (temporary cgs to be verbalised in a possibly multisentential output); (ii) the explanation language; (iii) the highlighted concept(s) (the corresponding term(s) will become the global focus of the generated explanation); (iv) the query type (necessary for the selection of an appropriate text-organisation schema); (v) an iterative call flag indicating a request for further clarification of a term contained in the generated output. In case of iterative explanations EGEN preserves the previous discourse and may introduce a comparison or refer to already known terms. Please note that here we do not discuss the semantics of the generated explanation, which depends entirely on the underlying KB.
An important feature of cgs is the fact that generation may start from any concept node. Therefore, the generator may select the subject and the main predicate from a linguistic perspective, rather than being influenced by the structuring of the underlying semantics (as is the case with tree-like notations frequently used in generation). In order to assure coherence, EGEN orders the extracted cgs into a well-structured explanation by applying text-organisation schemata [3]. Even though this text-planning technique lacks some flexibility, it can be successfully adopted in restricted technical domains with established language conventions. DB-MAT supports three schemata: one for definitions, one for similarity and one for difference (rather similar to those introduced in [6]). For the surface realisation of conceptual structures, we have adopted the utterance path approach for its proven efficiency. As proposed in [10], concepts are mapped into nouns, verbs, adjectives and adverbs, while conceptual relations are mapped into "functional words" or syntactic elements. EGEN uses APSG rules implemented in Prolog. We have extended the standard utterance path approach [10] in several ways [3]:

- to process extended referents (e.g., measures, conjunctive and disjunctive sets);
- to group relevant features together (e.g., first utter all "dimension" attributes, then all "weight" attributes); that information is taken from the type hierarchy;
- to introduce relative clauses (if a concept has more than one adjacent OBJ or AGNT relation, then a relative clause is generated [15]);
- to maintain a discourse history, thus allowing for the generation of simple referring expressions (pronouns and definite noun phrases);
- to output a sentence tree instead of a word sequence, so that some postprocessing (e.g., punctuation) can be applied.
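The core utterance-path idea of mapping concepts to content words and relations to function words can be sketched as follows (a much simplified Python illustration; EGEN itself uses APSG rules in Prolog, and the German templates below are assumptions for the example, not the system's actual lexicon):

```python
# Walking the extracted triples from the focused concept and mapping
# concepts to content words, relations to "functional words".
# One clause per relation type; coordinated objects joined with "und".

RELATION_WORDS = {"CHAR": "ist gekennzeichnet durch",   # assumed template
                  "CONTAIN": "enthält"}                  # assumed template
LEXICON = {"WASTE_WATER": "Abwasser",
           "DISPERSION": "Dispersion",
           "CONCENTRATION": "Konzentration"}

def verbalise(focus, triples):
    """Render the triples around `focus` as simple German clauses."""
    clauses = []
    for rel, words in RELATION_WORDS.items():
        objs = [LEXICON[t] for (s, r, t) in triples
                if s == focus and r == rel]
        if objs:
            clauses.append(f"{LEXICON[focus]} {words} {' und '.join(objs)}.")
    return " ".join(clauses)

g = [("WASTE_WATER", "CHAR", "DISPERSION"),
     ("WASTE_WATER", "CHAR", "CONCENTRATION")]
print(verbalise("WASTE_WATER", g))
# Abwasser ist gekennzeichnet durch Dispersion und Konzentration.
```

Because generation starts from the focused concept rather than from a fixed graph root, verbalising temporal graph 2 with Beimischung as focus instead of Abwasser would naturally produce the passive variant discussed in Section 4.2.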
5 Evaluation
While designing our interface to the KB, which in any case should reflect the conceptual relations occurring in the conceptual graphs, we turned to existing approaches to mapping NL to cgs, e.g., [4]. There the authors report their experience in building a semantic network starting from 5000 (French) lexical units which were classified into 57 conceptual hierarchies based on 23 types of primitive conceptual relations. Another effort aimed at building an information retrieval system that converts a large volume of NL texts into cgs [8], and it distinguished 57 semantic relations. These numbers show that any attempt to acquire a KB, even in a restricted domain, will lead to the identification of numerous conceptual relations. Additionally, the cgs theory allows the introduction of new types with arbitrary granularity. They are maintained by type and relation definitions and the operations of type/relation expansion and contraction. Therefore, in principle, we can expect considerable variety in the number of relation names and their granularity (especially in KBs acquired by different knowledge engineers). Such variety can make the strategy of follow-up questions rather inconvenient: it is not clear in advance how many questions will be necessary, which means that the design solutions should fit interfaces with 5 (or 50) menu elements (or buttons). Then even such technical questions as the physical place of a button on the screen can become problematic. Moreover, the multiple choices in the menu could complicate the user's navigation.
In our view, the menu-based interface with a background attachment of numerous conceptual relations to a few menu items is a suitable way to present the complex knowledge structures to the end user. Thus, the system offers a simple interface with uniform access to the heterogeneous data (in the system as a whole and in the KB in particular). Another benefit of our menu-based interface is the simple customisation of the mapping between menu items (i.e., requests) and conceptual relations: the content of Table 1 can be modified through a dialogue in the user interface.
6 Implementation
The present DB-MAT demo system is implemented in LPA MacProlog by the authors and Zdravko Markov (the module for lexicon look-up). However, the successful DB-MAT completion is also due to Walther von Hahn, University of Hamburg, who worked with us from the very beginning and who finalised the current German linguistic data. Many other colleagues also contributed to the project: Nevelin Boynov and W. v. Hahn developed special lexicon acquisition tools to conceal the complex lexicon structure (these efforts are comparable to the development efforts for the DB-MAT demo); Lutz Euler participated in the design of the lexicon and the internal Prolog representation of cgs (the latter was designed together with Heike Petermann and K. Bontcheva). Svetlana Dimitrova and Georgi Arnaudov worked on the German lexicon and the German morphology, respectively. Kiril Simov and Heike Winschiers participated in the initial project phases. The current DB-MAT main interface has adopted some ideas from their earlier implementations.
7 Conclusion
DB-MAT demonstrates a possible application of KB methods to computational terminology and translation aid tools, where the system interface must be simple and user-friendly regardless of the underlying complex linguistic and semantic data. At the same time, DB-MAT outlines architectural solutions which could also prove adequate for other knowledge based NLP systems. The future elaboration of the DB-MAT system, which now enters its second stage, will be oriented towards refinement of the present system components and enhancement with new ones. For instance, we plan to help the user in identifying the 'sensitive' parts of the text/explanations, i.e., to provide highlighting or hypertext links similarly to PEA and IDAS. The DB-MAT KB also requires some further elaboration. This will enable the implementation of more sophisticated QM algorithms which will account for the explanation context when extracting relevant information for iterative requests. Additionally, we plan to enhance the QM to process similarity and difference between concepts. With respect to the NL generation module, we intend to explore more flexible planning techniques (e.g., RST) and the impact of a richer user model on the quality of the produced explanations.
Acknowledgements
We are much obliged to all colleagues who made their theoretical and/or applied contributions to DB-MAT, thus forming the basis for its successful final elaboration. In particular, we are most grateful to the DB-MAT project leader, Prof. Dr. Walther von Hahn, University of Hamburg, for his support, efforts and patience during the three years of cooperation. However, most limitations of the current system architecture and implementation are due to ourselves.
References

[1] Angelova, G., and Bontcheva, K. DB-MAT: Knowledge Acquisition, Processing and NL Generation using Conceptual Graphs. In Proceedings of the 4th Int. Conference on Conceptual Structures (ICCS'96) (Sydney, Australia, 1996). To appear.
[2] Angelova, G., and Bontcheva, K. NL Domain Explanations in Knowledge Based MAT. In Proceedings of COLING'96 (1996). Poster presentation.
[3] Bontcheva, K. Generation of Multilingual Explanations from Conceptual Graphs. In Recent Advances in Natural Language Processing 1995, R. Mitkov and N. Nikolov, Eds. John Benjamins, Amsterdam, 1996. To appear.
[4] Fargues, J., and Perrin, A. Synthesizing a Large Concept Hierarchy from French Hyperonyms. In COLING-90, Vol. 2 (1990), pp. 112-117.
[5] Goodman, K., and Nirenburg, S., Eds. The KBMT Project: A Case Study in Knowledge Based Machine Translation. Morgan Kaufmann Pub., 1991.
[6] McKeown, K. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press, 1985.
[7] Moore, J. Participating in Explanatory Dialogues. MIT Press, Cambridge, MA, 1995.
[8] Myaeng, S., Khoo, C., and Li, M. Linguistic Processing of Text for a Large-Scale Conceptual Information Retrieval System. In Proceedings of the 2nd Int. Conf. on Conceptual Structures (ICCS'94) (College Park, 1994), W. Tepfenhart, J. Dick, and J. Sowa, Eds., no. 835 in LNAI, Springer-Verlag.
[9] Reiter, E., Mellish, C., and Levine, C. Automatic Generation of On-Line Documentation in the IDAS Project. In Proceedings of the 3rd Conference on Applied NL Processing (Trento, Italy, 1992).
[10] Sowa, J. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984.
[11] Sowa, J. Conceptual Graphs Summary. In Conceptual Structures: Current Research and Practice (1992), T. Nagle, J. Nagle, L. Gerholz, and P. Eklund, Eds., Ellis Horwood.
[12] v. Hahn, W. Innovative Concepts for MAT. In Proceedings VAKKI (Vaasa, 1992), pp. 13-25.
[13] v. Hahn, W., and Angelova, G. Providing Factual Information in MAT. In Proceedings of the International Conference "Machine Translation: Ten Years On" (Cranfield, 1994).
[14] v. Hahn, W., and Angelova, G. Knowledge Based MAT. Computers and AI, Bratislava (1996). To appear.
[15] Zock, M. Sentence Generation by Pattern Matching: the Problem of Syntactic Choice. In Recent Advances in Natural Language Processing 1995, R. Mitkov and N. Nikolov, Eds. John Benjamins, 1996. To appear.