Partial Parsing Method Applied to Rules Acquisition for Medical Expert System
Maciej Piasecki and Jerzy Sas
Computer Science Department of Wrocław University of Technology
ul. Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
{Piasecki, Sas}@ci.pwr.wroc.pl
Abstract. The paper presents a variant of the partial parsing method (PPM) applied to the acquisition of expert rules from Polish medical texts. PPM is based on the premise that the knowledge domain has already been defined by a knowledge engineer (i.e. names for classes, attributes, values, etc.). The definitions are automatically translated from natural language into formal expressions stored partly in the knowledge base and partly in the semantic dictionary. PPM preserves the compositionality principle and is based on the sublanguage method. Subsequent sentences are scanned for occurrences of words belonging to subcategories. Parsing is used for the recognition of compound phrases.
1 Introduction
During the implementation of an expert system, most of the effort and cost is consumed by the process of knowledge acquisition. These costs could be reduced by using existing electronic texts. However, full understanding and intelligent analysis of text meaning is still impossible. In this paper a simpler solution is proposed: intelligent text scanning based on a previously given, detailed specification of the knowledge domain. The whole process is designed to be controlled by a human operator - a knowledge engineer. The specification of the domain is written in natural language and then automatically translated into a given knowledge representation language.

The work is focused on a specific area of application: a medical expert system providing disease diagnosis. The expert system is based on probabilistic knowledge (i.e. expert rules to which probability values are assigned, and a set of cases of real diagnoses). The expert rules are very often one-level and unstructured. The main task of the system is to identify in the text all sentences that may contain rule-like information and to propose draft versions of rules to the knowledge engineer.

In the following sections the knowledge representation (KR) formalism is presented together with some more details concerning the expert system. Next, the results of the linguistic analysis of the collected corpus are discussed and, finally, the applied Partial Parsing Method is presented together with its implementation.
2 Knowledge Representation
Expert knowledge is represented as a set of rules of the general form:

$\text{IF } w_k(x) \text{ THEN } J = i \text{ WITH } \langle \underline{p}_k, \overline{p}_k \rangle$

where x is a vector of attributes describing the particular case for diagnosis, $w_k(x)$ is a logical sentence whose value depends on the vector of attributes, and j is a symbol of the concluded diagnosis. We assume that the set of possible diagnoses is finite. Moreover, each rule is associated with a pair of values $\langle \underline{p}_k, \overline{p}_k \rangle$ describing the lower and upper bounds of the a posteriori probability $p(j \mid w_k(x))$. The set of rules of this form is used by the diagnostic algorithm (based on a combined rule-based and case-based approach), which evaluates the probabilities of classes for a given feature vector (diagnostic case) [2].

Rules are written in the special formal language RECLAN. The set of rules must be associated with a knowledge domain specification - an input specification. It includes the following elements:
– specification of classes of recognition (mainly the definition of a symbol for each one),
– specification of attributes and their domains (including symbols for values).

An example of a typical domain specification and a rule (based on Polish terms) is given below:

ATTRIBUTES: cisnienie_krwi WITH VALUES: obnizone, podwyzszone, w_normie
CLASSES: przewlekla_niewydolnosc_nerek
IF cisnienie_krwi = podwyzszone THEN przewlekla_niewydolnosc_nerek WITH
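The relation between a rule and its domain specification can be illustrated with a minimal Python sketch. It is not RECLAN and not the authors' implementation: the names `DomainSpec`, `Rule` and `check_rule` are hypothetical, a rule is simplified to a single attribute-value condition, and the probability bounds are left optional because the text leaves them to the knowledge engineer.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainSpec:
    """Hypothetical container for the input specification."""
    attributes: dict[str, frozenset[str]]  # attribute name -> allowed values
    classes: frozenset[str]                # symbols of possible diagnoses


@dataclass
class Rule:
    """Simplified one-level rule: IF attribute = value THEN diagnosis."""
    attribute: str
    value: str
    diagnosis: str
    p_low: Optional[float] = None   # lower probability bound, if assigned
    p_high: Optional[float] = None  # upper probability bound, if assigned


def check_rule(rule: Rule, spec: DomainSpec) -> list[str]:
    """Return a list of inconsistencies between a rule and the domain."""
    problems = []
    if rule.attribute not in spec.attributes:
        problems.append(f"unknown attribute: {rule.attribute}")
    elif rule.value not in spec.attributes[rule.attribute]:
        problems.append(f"value {rule.value} not allowed for {rule.attribute}")
    if rule.diagnosis not in spec.classes:
        problems.append(f"unknown class: {rule.diagnosis}")
    if rule.p_low is not None and rule.p_high is not None:
        if not 0.0 <= rule.p_low <= rule.p_high <= 1.0:
            problems.append("bounds must satisfy 0 <= p_low <= p_high <= 1")
    return problems


spec = DomainSpec(
    attributes={"cisnienie_krwi": frozenset({"obnizone", "podwyzszone", "w_normie"})},
    classes=frozenset({"przewlekla_niewydolnosc_nerek"}),
)
rule = Rule("cisnienie_krwi", "podwyzszone", "przewlekla_niewydolnosc_nerek")
print(check_rule(rule, spec))
```

For the example specification above, the check reports no problems; a rule using a value outside the declared domain of `cisnienie_krwi` would be flagged.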
3 General Linguistic Analysis of the Problem
The main goal of the system is to recognise sentences containing rule-like information in the input text and to generate rules written in the RECLAN language for them. Simplifying the problem a little, we assume that:
– each rule is contained in a separate sentence,
– texts delivered to the system use correct language,
– texts are relevant.

The only information conveyed by sentences that is of interest to us is possible KR rules. Starting with these assumptions, we designed an 'optimistically' working semantic parser that delivers only draft versions of rules to a knowledge engineer. Additionally, we left untouched the difficult problem of assigning a posteriori probability values; even a preliminary approach to it needs a cognitive analysis. From the very beginning we rejected the possibility of full parsing as ineffective for Polish.
The simplest possible solution to the problem of text scanning seems to be a kind of pattern matching technique used to look for names of features and values. However, applying it to Polish text is almost impossible. Due to the almost free word order of Polish and the richness of Polish morphology, the number of possible forms of an average phrase is large, e.g. the following simple phrases have the same meaning: ciśnienie krwi (eng. blood pressure), krwi ciśnienie, and ciśnieniem krwi (the same meaning but a different syntactic case). Other difficult constructions are compound phrases including attributes, values and conjunctions, e.g. [ciśnienie krwi i temperatura] podwyższone (eng. increased [blood pressure and temperature]). Next, synonyms are very often used interchangeably, including very specific synonymy relations (not existing in everyday Polish), e.g. silna gorączka (eng. strong fever), gorączka (eng. fever) and infekcja (eng. infection).

However, some characteristic features of the investigated example corpus simplify the task of Knowledge Acquisition (KA). For example, because most of the sentences have a generic character (they communicate general rules and dependencies in some reality), the problem of reference, and partially of anaphora, is of minor importance in KA.

The chosen approach is based on partial syntactic parsing and the sublanguage method [5]. Additional syntactic categories influenced by semantic considerations are defined: class name, attribute name and value name - all in some syntactic variants (see next section). Because each of them includes compound expressions, some special subcategories were defined too, e.g. element of a class name, element of an attribute name, etc. Information about the assignment of sublanguage categories to words is stored in the semantic dictionary. The sublanguage grammar is based on a subset of the general grammar of Polish [6]. The appropriate subset was chosen after an analysis of the corpus of example texts. The subset of rules used is limited mainly to noun phrases and adjective phrases because of the Partial Parsing Method used for text analysis.
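The difficulty with plain pattern matching can be illustrated by a small Python sketch. The mini-lexicon `LEMMAS` and both matching functions are hypothetical and only serve this illustration; in particular, the lemma-based test ignores adjacency and syntax, which the partial parser described in the next section takes into account.

```python
# Hypothetical mini-lexicon: surface form -> base form (lemma).
LEMMAS = {
    "ciśnienie": "ciśnienie",
    "ciśnieniem": "ciśnienie",
    "krwi": "krew",
}

TERM = "ciśnienie krwi"  # attribute name as it might be written in a pattern


def naive_match(text: str) -> bool:
    """Exact substring matching: sensitive to word order and inflection."""
    return TERM in text


def lemma_match(text: str) -> bool:
    """Order-insensitive matching on base forms (a deliberate simplification)."""
    wanted = {LEMMAS[w] for w in TERM.split()}
    found = {LEMMAS.get(w, w) for w in text.lower().split()}
    return wanted <= found


variants = ["ciśnienie krwi", "krwi ciśnienie", "ciśnieniem krwi"]
print([naive_match(v) for v in variants])  # only the first variant is found
print([lemma_match(v) for v in variants])  # all three variants are found
```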
4 Partial Parsing Method
The main task of the parser is to assign to each sentence its meaning, i.e. a sequence consisting of the identifier(s) of a class j and a predicate expression $w_k(x)$. The predicate expression describes pairs of attribute identifiers and value identifiers. The pairs are connected by different conjunctions. If the sentence does not contain rule-like information, the empty expression is assigned as its meaning. To make the parsing relatively fast, it is limited to the phrases that seem significant for the expected meaning of the sentence. Each sentence is scanned for words from the semantic dictionary. On the basis of the sublanguage category of the processed word, the appropriate grammar rules (with the appropriate sublanguage head category) are activated to reconstruct the phrase following the word. The parsing is limited to the group of words of the same sublanguage category. This process is illustrated by the following example. Let us consider a typical sentence:
U dzieci wczesnym objawem przewlekłej niewydolności nerek może być podwyższone ciśnienie krwi.¹

After the assignment of categories we obtain (X marks a word not found in the semantic dictionary and ignored by the parser):

X[u] X[dzieci] X[wczesnym] X[objawem] CN_E[przewlekłej] CN_E[niewydolności] CN_E[nerek] X[może] X[być] VN_E[podwyższone] AN_E[ciśnienie] AN_E[krwi]

where CN_E, VN_E and AN_E are sublanguage categories meaning class name element, value name element and attribute name element, respectively. Next, applying partial syntactic parsing, we obtain:

X[u] X[dzieci] X[wczesnym] X[objawem] CN_NP[przewlekłej niewydolności nerek] X[może] X[być] VN_ADJ[podwyższone] AN_NP[ciśnienie krwi]

When all words of the same category have been collected into a phrase, a meaning must be assigned to the phrase. From each phrase a unique semantic key for the semantic dictionary must be generated. Because of free word order there can be many order variants and derivation trees for the same name, e.g. cisnienie krwi and krwi cisnienie mean the same. The unique key is generated by means of a normal derivation tree. The grammar of the parser contains all the necessary rules, but from each set of similar rules one is arbitrarily chosen as the normal one. The key is produced as a concatenation (with spaces between words) of the leaves of the derivation tree read from left to right. The strings stored in the leaves are not identical to the words of the processed sentence but represent the basic morphological form of each word (together with stored information about the values of morphological attributes). For instance, for two phrases that are order variants of each other, przewlekla niewydolnosc nerek and przewlekla nerek niewydolnosc (there are more possible variants), one unique semantic key is generated: przewlekly niewydolnosc nerka. Semantic keys are identical to the names used in the domain specification.
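The three steps just described (category assignment, grouping of adjacent words of the same category, and key generation) can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary contents, the `role` labels and the fixed ordering `ROLE_ORDER` are hypothetical and stand in for the normal derivation tree, which in the described system comes from DCG grammar rules.

```python
from itertools import groupby

# Hypothetical semantic dictionary:
# surface form -> (sublanguage category, base form, role within the phrase)
SEMANTIC_DICT = {
    "przewlekłej": ("CN_E", "przewlekly", "adj"),
    "niewydolności": ("CN_E", "niewydolnosc", "head"),
    "nerek": ("CN_E", "nerka", "gen"),
    "podwyższone": ("VN_E", "podwyzszony", "adj"),
    "ciśnienie": ("AN_E", "cisnienie", "head"),
    "krwi": ("AN_E", "krew", "gen"),
}

# Assumed "normal" order of leaves: adjective, head noun, genitive modifier.
ROLE_ORDER = {"adj": 0, "head": 1, "gen": 2}


def tag(sentence):
    """Assign a sublanguage category to each word; X means 'not in dictionary'."""
    words = sentence.lower().rstrip(".").split()
    return [(w, SEMANTIC_DICT[w][0] if w in SEMANTIC_DICT else "X") for w in words]


def chunk(tagged):
    """Collect adjacent words of the same category into phrases, skipping X."""
    phrases = []
    for category, group in groupby(tagged, key=lambda pair: pair[1]):
        if category != "X":
            phrases.append((category, [word for word, _ in group]))
    return phrases


def semantic_key(words):
    """Base forms of the phrase, read in the assumed normal order."""
    entries = sorted((SEMANTIC_DICT[w] for w in words),
                     key=lambda entry: ROLE_ORDER[entry[2]])
    return " ".join(base for _, base, _ in entries)


sentence = ("U dzieci wczesnym objawem przewlekłej niewydolności nerek "
            "może być podwyższone ciśnienie krwi.")
for category, words in chunk(tag(sentence)):
    print(category, words, "->", semantic_key(words))
```

Under these assumptions both order variants of the class name, `["przewlekłej", "niewydolności", "nerek"]` and `["przewlekłej", "nerek", "niewydolności"]`, yield the same key, przewlekly niewydolnosc nerka.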
If the semantic key generated for a phrase is found in the semantic dictionary, a meaning is assigned to the phrase, i.e. a formula of Lambda Calculus (LC) including the semantic key. Semantic keys are also the basis for the recognition of synonyms. A special key-to-key translation table is established, and because synonyms are stored together with their normal derivation trees, including variables for some values of morphological attributes, it is possible to exchange phrases in the derivation trees of sentences.

Starting from the level of name phrases, the grammar used in PPM becomes compositional, i.e. there is a semantic rule for each syntactic rule. Mostly, semantic rules are just simple functional applications based on LC. For instance, continuing the last example and considering the value-attribute pair, there is a syntactic rule (written in DCG format):

VA_NP(Cs, Num,...) = VN_ADJ(Cs, Nm,...) ATN_NP(Cs, Nm,...)

where the meanings assigned on the basis of the semantic dictionary are:

AN_NP[cisnienie krwi] ⇒ λP.P(cisnienie krwi)
VN_ADJ[podwyzszone] ⇒ λN.[let(podwyzszony, N)]

and the semantic rule is just a functional application. After applying the semantic rule we obtain:

λP.P(cisnienie krwi)(λN.[let(podwyzszony, N)]) = let(podwyzszony, cisnienie krwi)

Each semantic rule includes conditions which must be fulfilled by its arguments to make the rule applicable, e.g. a value must belong to the set of possible values of the given attribute. The conditions mostly concern information stored in the domain specification, e.g. some ambiguities in conjunction constructions can be resolved on the basis of the specification of attribute domains. For example, in the phrase znieksztalcone [krwinki i temperatura] (eng. disfigured [blood corpuscle and temperature]), the association of the attribute temperatura with the value znieksztalcony can be rejected using information from the domain specification.

Finally, as the result of semantic analysis, a semantic representation of the processed sentence is generated, e.g.

let(podwyzszony, cisnienie_krwi), cl:przewlekla_niewydolnosc_nerek

and next it is transformed into a draft rule, e.g.

if (cisnienie_krwi = podwyzszony) then przewlekla_niewydolnosc_nerek probability in

Draft rules are presented together with the initial sentence to the knowledge engineer (KE) and can be accepted, modified or rejected. The KE must assign the appropriate probability values to each draft rule.

¹ Eng. In the case of children, an early symptom of chronic insufficiency of kidneys can be an increased blood pressure.
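The functional application, the domain condition and the generation of a draft rule can be sketched in Python with closures playing the role of LC terms. This is a minimal illustration, not the authors' Prolog code: the function names are hypothetical, and the value sets listed in `DOMAIN` for temperatura and krwinki are assumptions made only for the example. The probability part of the draft rule is omitted, since it is assigned by the knowledge engineer.

```python
# Hypothetical attribute domains; only the shape follows the text.
DOMAIN = {
    "cisnienie_krwi": {"obnizony", "podwyzszony", "w_normie"},
    "temperatura": {"podwyzszony", "w_normie"},
    "krwinki": {"znieksztalcony"},
}


def attribute_meaning(attribute):
    """Meaning of an attribute name phrase: λP.P(attribute)."""
    return lambda predicate: predicate(attribute)


def value_meaning(value):
    """Meaning of a value name phrase: λN.let(value, N)."""
    return lambda name: ("let", value, name)


def apply_va_rule(value_sem, attribute_sem, domain):
    """Semantic rule for a value-attribute pair: functional application
    guarded by the condition that the value belongs to the attribute domain."""
    result = attribute_sem(value_sem)
    _, value, attribute = result
    if value not in domain.get(attribute, set()):
        return None  # condition not fulfilled, the rule is not applicable
    return result


def draft_rule(let_expr, class_name):
    """Render a draft rule; probability bounds are left to the engineer."""
    _, value, attribute = let_expr
    return f"if ({attribute} = {value}) then {class_name}"


sem = apply_va_rule(value_meaning("podwyzszony"),
                    attribute_meaning("cisnienie_krwi"), DOMAIN)
print(sem)
print(draft_rule(sem, "przewlekla_niewydolnosc_nerek"))

# Conjunction example: the value is accepted for krwinki, rejected for temperatura.
for attribute in ("krwinki", "temperatura"):
    print(attribute, apply_va_rule(value_meaning("znieksztalcony"),
                                   attribute_meaning(attribute), DOMAIN))
```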
5 Implementation
The architecture of the system includes the following modules: Morphological Preanalyser, Partial Syntactic Parser (PSP), Semantic Analyser (SA) and Draft Rules Generator (DRG). The modules use the following dictionaries: General Syntactic Dictionary, Temporary Syntactic Dictionary, Semantic Dictionary and the Domain Specification stored as data in the system.

We assumed the maximum possible use of existing Polish language resources. This assumption strongly influenced the construction of the system, especially the choice of Prolog as the main implementation language (for the modules PSP, SA and DRG). Prolog was chosen because the biggest existing formal description of Polish grammar is written in DCG format [6]. That is why the partial parser, as well as the LC implementation, is based on classical methods [3].

There is no big electronic syntactic dictionary of Polish in a ready-to-use format. The only possible source is the morphological analyser SAM-95 [4], which unfortunately produces complicated output. SAM-95 was used to produce a prototype of the General Syntactic Dictionary (GSD) on the basis of the corpus. The dictionary was implemented as a finite state automaton using software prepared by Jan Daciuk and described in [1]. The efficiency of the parser was improved by morphological preanalysis² [7] and switches³ [7].

The User Interface (UI) is written in C++ and works under Windows NT. The communication between the UI and the text processing module is based on the DDE (Dynamic Data Exchange) mechanism.
6 Further Development of the System
PPM shows promising speed of processing and accuracy. Its most serious limitation is that it does not work well for compound sentences. However, the technique of templates, presently being developed, shows the possibility of extending PPM to compound sentences as well. The application of the compositionality paradigm as the basis for the parser proved to be very successful: we obtained a clear construction of the system that is easy to maintain. Still, the assignment of a posteriori probability values on the basis of the input sentence remains a big challenge.
References
1. Daciuk J., Watson B., Watson R.: Incremental Construction of Minimal Acyclic Finite State Automata and Transducers. In: Proceedings of Finite State Methods in Natural Language Processing, Bilkent University, Ankara, Turkey, 1998.
2. Huzar Z., Kurzyński M., Sas J.: Rule-Based Pattern Recognition With Learning. Wroclaw University of Tech. Press, Wroclaw, 1994.
3. Pereira F.C.N., Shieber S.M.: PROLOG and Natural-Language Analysis. CSLI, Stanford, 1987.
4. Szafran K.: Analizator morfologiczny SAM-95 opis uzytkowy. Technical Report TR 96-05 of Computer Science Institute of Warsaw University, Warsaw, May 1996.
5. Sager N., Friedman C., Lyman M.S.: Medical Language Processing, Computer Management of Narrative Data. Addison-Wesley, 1987.
6. Świdziński M.: Gramatyka formalna języka polskiego. In: Rozprawy Uniwersytetu Warszawskiego. Warsaw University Press, 1992.
7. Vetulani Z.: POLINT - system automatycznej interpretacji pytań w języku polskim i jego realizacja w PROLOGU. In: Eufonia i Logos. ed. Pogonowski J., UAM Press, Poznań, 1995.
² Before parsing, a temporary dictionary including only the forms of words found in the sentence is created.
³ Dynamic cutting of some branches of inference.