KNOWLEDGE BASE REPRESENTATION FOR AN EXPERT SYSTEM CAPABLE OF SELF-EXTENSION THROUGH NATURAL LANGUAGE TEXT ANALYSIS

Julia E. Hodges, Jose L. Cordova, Lois M. Boggess, G. Jan Wilms, Rajeev Agarwal

Department of Computer Science, Mississippi State University, Box 9637, Mississippi State, MS 39762-9637

Technical Report MSU-900215

This work is supported in part by National Science Foundation Grant No. IRI-9002135.

Abstract

An ongoing research project in the Department of Computer Science at Mississippi State University involves the development of appropriate knowledge representation tools and natural language processing tools to provide the capability for an expert system to extend its own knowledge base by processing natural language text in machine-readable form. This paper describes the knowledge representation methods being used in the development of an early prototype for such a system. The knowledge is arranged in what is called a multilevel semantic network, a collection of semantic networks arranged in logical partitions. Some of the complex issues surrounding the development of such a self-updating system - dealing with contradictory information, recognizing redundant information, and detecting the relevancy of new information - are also discussed.

Introduction

Traditionally, artificial intelligence (AI) systems such as expert systems have been limited by their lack of commonsense knowledge and by the brittleness of their knowledge bases. The result has been systems that are both limited and difficult to extend. An ongoing project in the Department of Computer Science at Mississippi State University (MSU) is addressing the issues involved in designing AI systems that have the capability to extend their own knowledge bases through the analysis and processing of natural language text in machine-readable form. In addition to technical references in their domain, such systems would have access to commonsense, everyday knowledge through dictionaries, encyclopedias, etc. An expert system with a self-extending capability could be used in a variety of problem domains.
For example, expert systems dedicated to troubleshooting components in a complex system could continually update their knowledge bases by processing the information contained in new technical reports and revised technical manuals. Expert systems that serve as diagnostic aids in the medical world could keep their knowledge up-to-date by incorporating new information from medical references. Expert systems used to aid in the prediction of social trends in social science research could incorporate information from sources such as newspaper and magazine articles.

The research project at MSU involves the use of portions of the Merck manual for veterinary medicine, a widely recognized reference manual that describes a variety of disorders and their symptoms, suggesting appropriate tests and treatments based upon the diagnosis. This information is being used in the design and development of a prototype. The project is a large one that encompasses a number of issues in various areas of AI, such as knowledge base representation, inferencing techniques, and natural language processing. The focus in this paper is on the knowledge base representation methods that are being used in the development of the prototype.

Multilevel Semantic Network

The knowledge representation approach being used in this project is one that has been, and continues to be, used successfully in other expert system projects at MSU (Cordova 1988; Hodges and Cordova 1989). Each domain entity is represented as a frame. Relationships between entities are established by filling slots with links that connect the corresponding frames. A frame may have different links, each representing a different relationship.

The knowledge base is logically partitioned into levels. Each level contains a network of frames corresponding to a portion of the knowledge base. The frames in a given level typically correspond to instances of a class of objects in the knowledge base. The arrangement of the frames in a particular network may differ from one level to another. Such a structure was chosen for reasons both theoretical and practical. Even though little is known about the precise cognitive processes performed by humans, several researchers have noted that structures like the one described seem to reflect the way in which the human mind organizes knowledge (Coombs 1984; Cordova and Hodges 1988).

Perhaps the most important characteristic required of the knowledge base is flexibility. One must be able to represent not only different types of entities from the problem domain, but also different types of relationships between those entities. For example, given that the knowledge base contains information on different diseases, symptoms, and treatment options, different types of relationships must be stored and identified in the knowledge base. While related diseases may have to be arranged in taxonomic fashion, causal relationships must be established between diseases and symptoms, and treatment options will typically be associated with a disease or group of diseases.

The multilevel semantic network also incorporates varying certainty and/or frequency measures associated with a relationship. For example, statements of the forms "may result from" and "is usually caused by" occur frequently in the veterinary medicine manual. While both statements imply a causal relationship between a disorder and a cause, the second clearly makes a stronger connection.
This information is expressed in the knowledge base using certainty factors, which are values between 0.0 and 1.0. Although the certainty factors are initially defined based strictly upon key words such as "sometimes," "frequently," and "usually," it is possible that future versions of the prototype will modify the certainty factors as more evidence regarding such connections is found in the manual.

Currently the knowledge base is arranged in five levels, as shown in Figure 1. Level 1 is a semantic network representing information about the various symptoms that may be exhibited. Level 2 is a semantic network representing information about various treatments, including both treatment procedures (such as "apply drops to eye" or "administer 250 mg three times per day") and medications. Level 3 is a semantic network containing information about various diagnostic aids (ocular examination, radiography, etc.). Level 4 is a semantic network containing information about the various diseases and disorders. Objects in the diseases/disorders network participate in has_symptom relationships with objects in the symptoms network as well as with other diseases/disorders objects; that is, a given disease or disorder may actually be a symptom of some other disease or disorder. The diseases/disorders information also has has_treatment and has_test relationships with the treatments and diagnostic aids information, respectively. Level 5 contains information that is needed only by the natural language processor, such as lexical categorization: content words may belong to classes such as animal, body part, instrument (e.g., forceps), and disease agent (e.g., bacteria and herpes virus). A directory is maintained to allow direct access to the particular level of the knowledge base needed by a particular query. This directory is implemented as a hash file using a directory class definition in C++.
Information such as certainty factors and frequency counts is attached to the links in the semantic networks. It is possible that, in the future, other types of both procedural and non-procedural information will be attached to links to reflect characteristics of the relationship between the two entities. For example, Cordova (1988) described the attachment of heuristic information to the links of a frame-based expert system as part of an effort to improve the performance of the system. Our experience in designing and experimenting with a prototype system (a working expert system designed for a research and training center for low vision) indicates that heuristic search techniques on a system so organized allow search to begin at or near the most appropriate point in the knowledge base (Hodges and Cordova 1989). In addition, a natural way of handling exceptions is to attach to any given link one or more conditions that must be satisfied before the link can be considered active. Network structures also facilitate the implementation of path-based inference mechanisms, as described by Fox et al. (1988). Path-based inference avoids the need for exhaustive searches since the information pertinent to a given object is readily available (Hodges and Cordova 1989; Fox et al. 1988).

Knowledge Representation Issues

There are a number of important issues that must be dealt with in the design of the knowledge base for a self-extending expert system. These include the handling of redundant information, the analysis and appropriate handling of conflicting information, and the determination of the relevancy of new information. The research team has defined some simple techniques for dealing with some of these issues in the early prototype currently under development. It is understood, however, that these techniques will have to become more sophisticated over time.

Knowledge which is irrelevant should be discarded. Knowledge which is only peripherally relevant may be either discarded or stored on a contingent basis, with the hope of being able to make a stronger connection later. Initially, relevancy is being handled in a fairly simple manner through the use of synonyms such as those that may be extracted from a dictionary.

The handling of redundant information presents some special problems, since it cannot be reduced to simple pattern matching.
It cannot be assumed that the new information, derived from a variety of technical sources, is in exactly the same format as the information already in the knowledge base. Currently, capabilities involving the chunking of information so that basic patterns can be detected are being investigated.

The analysis of conflicting information will be particularly important for practical reasons beyond the fact that a knowledge base should avoid inconsistency. Two valuable aspects of self-updating systems are their potential ability to "read" relevant technical literature to augment their understanding of their own areas of competence and to remain up-to-date in fields in which the best information changes from year to year. Especially in the latter case, new information is likely to conflict with old, and the appropriate handling of the "conflicting" information is one of the positive features of such a system.

Related Work

Natural language processing systems and knowledge representation systems have for a long time used very restricted domains. Large natural language texts seemed intractable, and the goal of processing large blocks of language as it is actually produced by the general writing community seemed remote and unattainable. Nevertheless, over the years some linguists have worked with very large natural language texts (for example, the Brown corpus of more than a million words of American English text (Kucera and Francis 1967)), and more recently the speech recognition research community has built simple language models of enormous bodies of text. At the same time, researchers in computational linguistics have begun characterizing the structure of, and extracting various kinds of information from, machine-readable texts such as dictionaries and thesauri (see, for example, (Boguraev and Briscoe 1987), (Jensen and Binot 1988), (Fox et al. 1988), (Chodorow et al. 1988)).
The CYC project proposes to incorporate into a knowledge base the commonsense knowledge derivable from a one-volume desk-top encyclopedia, by explicit human intervention in the extraction, structuring, and encoding processes (Lenat et al. 1986). Humphrey (1989) reports on a knowledge-based expert system designed to facilitate the work of human experts whose task is to index papers in the medical literature according to the concepts contained in the papers. Both of the latter systems are designed to assist the human experts and make suggestions, as well as to prevent over- and under-generalizations by the humans. In contrast, it is our intention to examine the feasibility of automating the actual extraction of information.

Virkar and Roach (1988) point out numerous advantages of expert systems capable of extracting information from natural language texts, and demonstrate with a prototype system which parses research paper abstracts to assimilate knowledge into an expert system. However, the semantic grammar they used was very tightly coupled with the frame system devised to capture the information. Their reported test bed is considerably smaller than our initial base, and given the tight coupling between their grammar and their frame system, it would appear that expanding their system would become increasingly difficult for the grammar writer. We hope to design a system with the opposite property.

The natural language component of our research owes much to work undertaken at several IBM research centers in Europe. Research reported by Derouault and Merialdo (1986) serves as the model for our overall approach. Their technique is to hand-label a moderate-sized body of text, build a probabilistic model of the hand-labeled information, use the model to process a much larger body of text, correct the automatically processed information by hand, and use this larger body as training data for the next iteration.
Although Derouault and Merialdo were primarily interested in determining the part of speech of words in the text, with little use for semantics, their colleagues Antonacci, Russo, Pazienza, and Velardi (1989) describe the use of a hand-encoded nucleus as the basis for automatic acquisition of "surface semantic patterns" (SSPs) from natural language texts (in this case, press releases on economic topics). The SSPs were defined using a case grammar approach, with access to conceptual categories in a concept hierarchy. It is possible that their "formatted text database," which is implemented as conceptual graphs, may actually be a potentially powerful knowledge base. In any case, their primary interest lay in the ability to answer queries about the press releases, rather than in issues of integrating the information into an existing, complex knowledge base, where matters of redundancy, relevance, and contradiction become critical.

Summary

Researchers at Mississippi State University and elsewhere have begun to address the problem of building systems to extract information from large, real-world natural language texts. It should be emphasized that the texts are technical in nature and highly structured. This paper has described several issues which must be addressed in the development of knowledge bases for such systems. One of the major issues involves the appropriate partitioning of the knowledge base into semantically relevant subunits (levels). This approach facilitates the structuring of the knowledge base so as to allow objects with similar characteristics or functions to be packaged together, and enables the search mechanism to begin at the most appropriate point in the knowledge base. Other issues include the detection of redundant and conflicting new information and the discarding of irrelevant information. A brief survey of related research has also been provided.


[Figure 1. Multilevel Knowledge Base. The figure depicts the five levels of the knowledge base: Level 1, symptoms; Level 2, treatments; Level 3, diagnostic aids; Level 4, diseases/disorders, connected to the other networks by has_symptom, has_treatment, and has_test links (and by has_symptom links within the level); Level 5, lexical categories (animals, body parts, instruments, disease agents).]

References

Antonacci, F., M. Russo, M. Pazienza, and P. Velardi. 1989. A system for text analysis and lexical knowledge acquisition. Data and Knowledge Engineering 4(1):1-20.

Boguraev, B., and T. Briscoe. 1987. Large lexicons for natural language processing: Utilising the grammar coding system of LDOCE. Computational Linguistics 13(3-4):203-18.

Chodorow, M., Y. Ravin, and H. Sachar. 1988. A tool for investigating the synonymy relation in a sense disambiguated thesaurus. In Proceedings of the Second Conference on Applied Natural Language Processing, 144-51.

Coombs, M. 1984. Developments in expert systems. London: Academic Press, Inc.

Cordova, J. 1988. The use of heuristics in a frame-based multidimensional expert system. Master's thesis, Mississippi State University.

Cordova, J., and J. Hodges. 1988. CAT-KBES: A multilevel approach to knowledge representation. In Proceedings of the ACM Southeast Regional Annual Conference, 637-42.

Derouault, A., and B. Merialdo. 1986. Natural language modeling for phoneme-to-text transcription. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6):742-9.

Fox, E., J. Nutter, T. Ahlswede, M. Evens, and J. Markowitz. 1988. Building a large thesaurus for information retrieval. In Proceedings of the Second Conference on Applied Natural Language Processing, 101-8.

Hodges, J., and J. Cordova. 1989. Improving the performance of an expert system with the use of heuristic search techniques. In Proceedings of the ACM Southeast Regional Annual Conference, 127-31.

Humphrey, S. 1989. A knowledge-based expert system for computer-assisted indexing. IEEE Expert 4(3):25-38.

Jensen, K., and J-L. Binot. 1988. Dictionary text entries as a source of knowledge for syntactic and other disambiguations. In Proceedings of the Second Conference on Applied Natural Language Processing, 152-9.

Kucera, H., and W. Francis. 1967. A computational analysis of present-day American English. Providence, Rhode Island: Brown University Press.

Lenat, D., M. Prakash, and M. Shepherd. 1986. CYC: Using commonsense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine 6(4):65-85.

Virkar, R., and J. Roach. 1988. Direct assimilation of expert-level knowledge by automatically parsing research paper abstracts. International Journal of Expert Systems 1(4):281-305.

