Use of Multiaxial Indexing for Information Extraction from Medical Texts
K. Denecke (a), I. Kohlhof (a), J. Bernauer (b)
(a) ID GmbH Berlin, Germany
(b) University of Applied Sciences Ulm, Germany
Abstract. This paper presents a new approach for automatically determining a semantic representation of medical narratives (essentially reports of diagnostic imaging and surgical procedures). It is based on a multiaxial nomenclature (Wingert Nomenclature, WNC) and language engineering technologies: a concept-based morpheme lexicon and an indexing algorithm for noun phrases. On this basis, sets of indices of the nomenclature are transformed into conceptual graphs (CG), which can be used for information extraction. Transformation rules for mapping syntactic information to semantic roles are used for the semantic representation of the meanings of prepositional phrases. The correctness of the representation depends on the complexity of the sentences, the accuracy of the mapping to concepts of the nomenclature, and the semantic content of the processed sentences. The project is still in progress. The strategy is currently being tested and used for information extraction from medical texts.
Keywords: Natural Language Understanding, Conceptual Graphs, Information Extraction
1. Introduction
In medicine, written text plays a significant role in documentation and communication. Automatic understanding of the content can be a substantial aid for reusing information, for clinical documentation as well as for decision-making. For this reason, possibilities for automatically locating specific information in medical documents and for processing this information are being investigated. A prerequisite for this is a structured representation of the essential facts of a text. This paper introduces a method for transforming sentences of a medical document into semantic representations. The chosen approach uses a multiaxial medical nomenclature (Wingert Nomenclature), a concept-based morpheme lexicon, a word segmentation algorithm as well as semantic transformation rules for mapping syntactic information to semantic roles. Sets of indices are mapped to conceptual structures using information from semantic categories and word position. The new approach tries to extend an available and effective language engineering technology for indexing so that it can cope with more complex text structures. It targets noun phrases enriched by prepositional phrases, using the existing technology for indexing noun phrases and a rule-based algorithm for the semantic representation of prepositions [10]. The algorithm ignores morphosyntactic information. The only syntactic information used is word order and word category (determiner, noun, preposition, adjective). Such a shallow syntactic analysis is adequate for the processed text type, namely reports of diagnostic imaging and surgical procedures in the German language, which often consist of verbless clauses or phrases without subordinate clauses.
2. Background
2.1 Existing approaches for medical language processing
Systems for generating standardised representations for medical documents are often limited to certain domains and their construction is fairly complicated. For most of the systems, a domain model or knowledge base has been built up only for this purpose. Hahn et al. are working on a natural language processor (MedSynDiKATe) for medical finding reports in German [5]. The system combines functions for information extraction, learning and text mining. Special features are the discovery of new knowledge as well as the acquisition of concepts and new words during processing. The work is based on a methodology for the segmentation of complex compounds into medical morphemes, which are at the same time used to build up a multilingual morpheme thesaurus [8] [9]. The RECIT system (REprésentation du Contenu Informationnel des Textes médicaux) [12] aims to map the content of a text to a language-independent representation. It is based on proximity processing for decomposing sentences into meaningful fragments from which a semantic representation can be gradually built up. A domain model and dictionaries are necessary. Huang et al. developed a context-sensitive indexing model for clinical radiology reports [11]. A system explicitly developed for a special domain is the Lung Pathology System [1]. It is intended to extract information from medical reports based on a shallow syntactic analysis and a mapping to a semantic representation in OWL (Web Ontology Language). Further approaches could be mentioned. A more comprehensive overview of the current uses of natural language processing in medicine is given by Spyns [2].
2.2 Semantic representation using conceptual graphs
Conceptual graphs, introduced by J.F. Sowa [4], allow the expression of linguistic meaning in a logically precise way that is machine processable. They make it possible to represent meanings in a readable, comprehensive manner.
A conceptual graph is a bipartite graph with two kinds of nodes: concepts and relations, which are connected by arcs. Each concept consists of a concept type and a referent (e.g. [morphology: fracture], where morphology is the concept type and fracture the referent). A relation between two concepts has a type and represents a role between concept referents.
3. Method
3.1 Parsing sentences
First, a preparser identifies sentence boundaries in the text. Secondly, each sentence is decomposed into segments using a shallow syntactic analysis. Head segments and segments started by a preposition (prepositional segments) are determined. A head segment corresponds to the grammatical head nominal phrase of a sentence and reaches from the beginning of a phrase to the first occurrence of a preposition. A prepositional segment starts with a preposition and reaches to the next preposition or to the end of the phrase or sentence.
Example: The nominal phrase [CT des Abdomen und des Beckens] [mit Kontrastmittel] (CT of the abdomen and pelvis with contrast medium) consists of a head segment and a prepositional segment (here displayed in brackets).
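The segmentation rule of section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the preposition list `PREPOSITIONS` and the function name `split_segments` are hypothetical, tokens are split on whitespace, and punctuation and sentence-boundary detection are ignored. The lexicon and preparser of the described system are not reproduced here.

```python
# Hypothetical, deliberately small list of German prepositions (including the
# contracted form "im"); the real system's lexicon is not given in the paper.
PREPOSITIONS = {"mit", "im", "in", "nach", "bei", "ohne", "mittels"}


def split_segments(phrase: str) -> tuple[str, list[str]]:
    """Split a phrase into its head segment and its prepositional segments.

    The head segment runs from the start of the phrase to the first
    preposition; each prepositional segment runs from a preposition to the
    next preposition or to the end of the phrase.
    """
    head: list[str] = []
    prep_segments: list[list[str]] = []
    for token in phrase.split():
        if token.lower() in PREPOSITIONS:
            prep_segments.append([token])      # a preposition opens a new segment
        elif prep_segments:
            prep_segments[-1].append(token)    # continue the current prepositional segment
        else:
            head.append(token)                 # still inside the head segment
    return " ".join(head), [" ".join(seg) for seg in prep_segments]


head, preps = split_segments("CT des Abdomen und des Beckens mit Kontrastmittel")
print(head)   # CT des Abdomen und des Beckens
print(preps)  # ['mit Kontrastmittel']
```

A phrase without any preposition yields the whole phrase as head segment and an empty list of prepositional segments.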
3.2 Identification of special expressions
After parsing, each sentence is checked for special expressions (“formula” and “formula-like” expressions) using regular expressions. Formulas include, for example, dates (e.g. October 2005), quantities (e.g. 20 mg), dosages (e.g. 1-0-1) as well as grading expressions (e.g. type II). Formula-like expressions are specific simple or complex expressions that occur very often in medical documents and convey special meanings (e.g. Ausschluss von (exclusion of), Verdacht auf (suspicion of)). All these expressions are tagged before the indexing process starts, because they often require a special interpretation. For each extracted formula, a semantic concept is created. The formula in the phrase Pantoprazol 20 1-0-1, for example, is translated into the two concepts [dosage: 1-0-1] and [quantity: 20 mg]. Each formula-like expression is subsumed under a specific semantic role, which is assigned to the corresponding segment of the sentence (e.g. Ausschluss einer Appendizitis (exclusion of appendicitis) is represented as [(NEG) morphology: Appendicitis]).
3.3 Tagging of words
To determine index proposals corresponding to an input word or phrase, the input is morphologically analysed, i.e. each word is decomposed into its morphemes (the smallest meaningful units of a word). In addition, each word’s morphological features (“lexical information”) are determined (i.e. word category, language, gender, case, number). The German word Splenektomie (splenectomy), for example, is decomposed by the word segmentation algorithm into the morphemes splen and ectom. Its lexical information is coded as NDF1234, which means that it is of word category noun (N) in the German language (D), feminine (F), and may be in the nominative (1), genitive (2), dative (3) or accusative (4) case. The subsequent indexing process uses the identified morphemes and word categories. For further details on the morphological analysis see [6], [7].
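The two tagging steps of sections 3.2 and 3.3 can be illustrated with a small Python sketch. The regular expressions below are assumptions written for this example, not the patterns of the described system, and they only recognise quantities with an explicit unit (the interpretation of a bare "20" as "20 mg" would need additional knowledge). The decoding tables contain only the code letters and digits explained in the text for NDF1234; all function and variable names are hypothetical.

```python
import re

# Illustrative patterns for "formula" expressions (assumed, not the system's own).
FORMULA_PATTERNS = {
    "dosage": re.compile(r"\b\d+-\d+-\d+\b"),                      # e.g. 1-0-1
    "quantity": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:mg|ml|g)\b"),  # e.g. 20 mg
    "date": re.compile(
        r"\b(?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
        r"September|Oktober|November|Dezember)\s+\d{4}\b"
    ),                                                             # e.g. Juni 2005
    "grading": re.compile(r"\b[Tt]ype?\s+[IVX]+\b"),               # e.g. type II
}


def tag_formulas(text: str) -> list[tuple[str, str]]:
    """Return (concept type, matched text) pairs for all formulas found."""
    found = []
    for kind, pattern in FORMULA_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group()))
    return found


# Decoding tables restricted to the values explained for the code NDF1234.
CATEGORY = {"N": "noun"}
LANGUAGE = {"D": "German"}
GENDER = {"F": "feminine"}
CASE = {"1": "nominative", "2": "genitive", "3": "dative", "4": "accusative"}


def decode_lexical_code(code: str) -> dict:
    """Decode a lexical information code such as 'NDF1234'."""
    return {
        "category": CATEGORY[code[0]],
        "language": LANGUAGE[code[1]],
        "gender": GENDER[code[2]],
        "cases": [CASE[digit] for digit in code[3:]],
    }


print(tag_formulas("Pantoprazol 20 mg 1-0-1"))
print(decode_lexical_code("NDF1234"))
```

Codes outside the small tables raise a `KeyError`, which keeps the sketch from silently guessing at values the text does not define.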
3.4 Indexing of syntactic segments
During the indexing task, every sentence, i.e. its head segment and its prepositional segments, is mapped to an optimised set of indices of the Wingert Nomenclature (WNC). The WNC [3] is a nomenclature based on the work of F. Wingert, who revised SNOMED and translated it into German in 1974. The nomenclature comprises a complete multiaxial terminology of medical terms and allows different aspects of a diagnosis or procedure to be encoded. For this purpose, ten different axes are available: topography (T), morphology (M), function (F), procedure (P), diagnosis (D), job (J), information (G), treatment (V), agent (W), and aetiology (E). The terms are grouped into synonym or equivalence classes. Different synonyms are linked to the same index (e.g. T001118 Leber (liver, Hepar, hepat(o)*, hepatisch)). An index consists of a letter expressing the corresponding axis and a six-character number for identification (e.g. the index of “inflammation” is M000562). Concepts of the WNC can be mapped to other terminological systems such as SNOMED CT or UMLS.
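The structure of a WNC index (axis letter plus six-character identifier) can be made concrete with a short Python sketch. The axis table is taken from the list above. Treating the identifier as six hexadecimal characters is an assumption made here because indices quoted in this paper, such as T001CD9, contain letters; the names `AXES`, `INDEX_PATTERN` and `parse_index` are hypothetical.

```python
import re

# The ten WNC axes as listed in the text.
AXES = {
    "T": "topography", "M": "morphology", "F": "function", "P": "procedure",
    "D": "diagnosis", "J": "job", "G": "information", "V": "treatment",
    "W": "agent", "E": "aetiology",
}

# Assumption: one axis letter followed by six hexadecimal characters.
INDEX_PATTERN = re.compile(r"^([TMFPDJGVWE])([0-9A-F]{6})$")


def parse_index(index: str) -> tuple[str, str]:
    """Split a WNC index into (axis name, identifier)."""
    match = INDEX_PATTERN.match(index)
    if match is None:
        raise ValueError(f"not a well-formed WNC index: {index!r}")
    return AXES[match.group(1)], match.group(2)


print(parse_index("M000562"))  # ('morphology', '000562')
print(parse_index("T001CD9"))  # ('topography', '001CD9')
```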
For indexing an input phrase, the tagged input (see step 3.3) is compared with the tagged terms of the WNC lexicon. Where they match, the corresponding indices are assigned to the groups of morphemes of the input (step 1 in the example below). Then, the algorithm checks whether it is possible to combine several simple indices into one single index. This process is repeated recursively until a final set of indices for the phrase is determined (step 2 in the example). For further details on the indexing algorithm see [3].
Example: Input: Nabelschnurumschlingung (cerclage of the umbilical cord)
1. V0003E0 Umschlingung (cerclage), T001497 Nabelschnur (umbilical cord)
2. M00038E Nabelschnurumschlingung (cerclage of the umbilical cord)
The indices for the single words Umschlingung (cerclage) and Nabelschnur (umbilical cord), determined in the first step, are combined into one index in the second step.
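The combination step in the example above can be sketched as follows. The lookup table `COMBINATIONS` is a hypothetical stand-in for the knowledge encoded in the WNC lexicon and holds only the one combination shown in the example; the sketch uses a loop that repeats until nothing changes instead of explicit recursion, and it is not the indexing algorithm of [3].

```python
# Hypothetical combination table: a set of simple indices that can be replaced
# by one combined index. Only the example from the text is included.
COMBINATIONS = {
    frozenset({"V0003E0", "T001497"}): "M00038E",  # Umschlingung + Nabelschnur
}


def combine_indices(indices: set[str]) -> set[str]:
    """Replace combinable subsets by their combined index until no rule applies."""
    current = set(indices)
    changed = True
    while changed:
        changed = False
        for parts, combined in COMBINATIONS.items():
            if parts <= current:                      # all parts are present
                current = (current - parts) | {combined}
                changed = True
                break                                 # restart with the reduced set
    return current


print(combine_indices({"V0003E0", "T001497"}))  # {'M00038E'}
```

Indices that take part in no combination are passed through unchanged, so a phrase that also mentions the liver (T001118) would keep that index next to the combined one.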
The indexing algorithm delivers good results for noun phrases that are not too complex. ID GmbH (http://www.id-berlin.de) successfully applies the algorithm in its products for mapping natural language phrases to concepts of the classification systems ICD-10 GM and OPS, the German adaptation of ICPM.
3.5 Semantic analysis and generation of conceptual structures
The semantic analysis employs the indices established in step 3.4. As a first step, the central information unit (leading index) and the set of modifying information (non-leading indices) are determined for each sentence. In the text corpus underlying the presented study (text reports from diagnostic imaging, procedure reports), the leading index of a head segment usually belongs to the axis diagnosis, morphology or function if the phrase concerns a morphological change. It belongs to the axis treatment or procedure if the phrase describes a procedure. For head segments as well as for prepositional segments, a heuristic priority order of semantic types is defined from which the leading index is derived. For head segments, this order is given by the sequence DMFVPTEJWG with decreasing priority. This sequence has the following interpretation: if the set of indices that results from mapping the head segment to WNC contains indices of different semantic types, the index (or indices) whose semantic type is furthest left in the priority sequence is picked as the leading index. If more than one index belongs to the furthest-left category, all these indices are considered leading indices. Example: The phrase Thorax-CT (thoracic CT) is assigned the indices V00040B for CT and T001CD9 for Thorax (thorax). Using the priority sequence for head segments, the index for CT is determined as the leading index of the phrase. The T-index for Thorax is a non-leading index.
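The selection of leading indices by priority sequence is compact enough to sketch directly. The sequence DMFVPTEJWG and the example indices are taken from the text; the function name `split_leading` and the list-based interface are assumptions made for illustration.

```python
HEAD_PRIORITY = "DMFVPTEJWG"  # priority sequence for head segments, highest first


def split_leading(indices: list[str], priority: str = HEAD_PRIORITY) -> tuple[list[str], list[str]]:
    """Split indices into (leading, non-leading) according to a priority sequence.

    The axis letter is the first character of each index. All indices on the
    axis that is furthest left in the sequence are leading indices.
    """
    if not indices:
        return [], []
    best_rank = min(priority.index(index[0]) for index in indices)
    leading = [i for i in indices if priority.index(i[0]) == best_rank]
    non_leading = [i for i in indices if priority.index(i[0]) != best_rank]
    return leading, non_leading


# Thorax-CT: V (treatment) precedes T (topography) in the sequence.
print(split_leading(["V00040B", "T001CD9"]))  # (['V00040B'], ['T001CD9'])
```

Because the priority sequence is a parameter, the same function can be reused with the sequences defined for prepositional segments.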
In prepositional segments, the leading index is determined by the preposition as well as by the context in which the prepositional segment occurs, i.e. the semantic type of the leading index of the head segment, which the prepositional segment belongs to.
Example: The German preposition mit (with) can be used in two different meanings. In the context of a procedure (a V- or P-index is the leading index of the head segment), mit has the meaning “using” (e.g. CT mit Kontrastmittel (CT with contrast medium)). The priority sequence is fixed at WEMDFTVPJG, i.e. a remedy will be the leading index of the prepositional segment. If mit follows a morphological change, it is used in the meaning of “accompanied by” (e.g. Gallenblasenkarzinom mit Tumorkachexie (carcinoma of the gall bladder with tumour cachexia)). For this case, another priority sequence is defined (MDFTVPWEJG).
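The context-dependent interpretation of mit can be written as a small rule table. The two roles and priority sequences come from the example above, and the grouping of axes into "procedure" (V, P) and "morphological change" (D, M, F) follows the description of head segments in section 3.5. The table layout and all names are hypothetical, and rules for other prepositions are not given in the text and therefore not included.

```python
# Axes of the head segment's leading index that define the context.
PROCEDURE_AXES = {"V", "P"}
CHANGE_AXES = {"D", "M", "F"}

# (preposition, context) -> (semantic role, priority sequence for the segment).
PREPOSITION_RULES = {
    ("mit", "procedure"): ("using", "WEMDFTVPJG"),
    ("mit", "morphological change"): ("accompanied by", "MDFTVPWEJG"),
}


def interpret_preposition(preposition: str, head_leading_axis: str) -> tuple[str, str]:
    """Return (semantic role, priority sequence) for a preposition in context."""
    if head_leading_axis in PROCEDURE_AXES:
        context = "procedure"
    elif head_leading_axis in CHANGE_AXES:
        context = "morphological change"
    else:
        raise ValueError(f"no context defined for axis {head_leading_axis!r}")
    return PREPOSITION_RULES[(preposition.lower(), context)]


print(interpret_preposition("mit", "V"))  # ('using', 'WEMDFTVPJG')
print(interpret_preposition("mit", "M"))  # ('accompanied by', 'MDFTVPWEJG')
```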
After a leading index and the non-leading indices have been determined for each segment and the corresponding concepts have been generated, the non-leading concepts are linked to their leading concept by relations. The type of the relation between each non-leading concept and its leading concept is established from the concept type of the non-leading concept. Finally, prepositional segments are linked to the head segment of the processed sentence using the semantic role assigned to the preposition (e.g. mittels (using), begleitet von (accompanied by)). Links are created between the leading concepts of the head segment and those of each prepositional segment. Using the described procedure, one conceptual graph per sentence is created. The steps described above are illustrated in Figure 1 using Thorax-CT mit Kontrastmittel im Juni 2005 (thoracic CT with contrast medium in June 2005) as an example.
Text: Thorax-CT mit Kontrastmittel im Juni 2005 (thoracic CT with contrast medium in June 2005)
Detection of sentences and determining segments:
  head segment: [Thorax-CT]
  prepositional segments: [mit Kontrastmittel], [im Juni 2005]
Extraction of formula and formula-like expressions:
  head segment: [Thorax-CT]
  prepositional segment: [mit Kontrastmittel]
  generated concept: [date: 01/06/2005]
Indexing of each segment:
  head segment: [Thorax-CT] V00040B, T001CD9
  prepositional segment: [mit Kontrastmittel] E0036A8
Identification of the leading index (marked with * here) for each segment:
  head segment: [Thorax-CT] *V00040B, T001CD9
  prepositional segment: [mit Kontrastmittel] *E0036A8
Mapping of indices to concepts:
  [treatment: V00040B CT] [topography: T001CD9 Thorax] [aetiology: E0036A8 Kontrastmittel] [date: 01/06/2005]
Introduction of relations between head segments and prepositional segments:
  [treatment: V00040B CT]
    => [topography: T001CD9 Thorax]
    => [using] [aetiology: E0036A8 Kontrastmittel]
    => [date: 01/06/2005]
Figure 1: Generation of a semantic representation for “Thorax-CT mit Kontrastmittel im Juni 2005” (thoracic CT with contrast medium in June 2005)
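The final step of Figure 1, linking concepts into one graph per sentence, can be sketched with two small data classes. This is a simplified illustration, not the system's data model: the class names `Concept` and `Graph` are hypothetical, the graph is reduced to a leading concept with outgoing links, and links inside a segment are left unlabelled because Figure 1 shows no relation names for them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    ctype: str      # concept type, e.g. "treatment"
    referent: str   # referent, e.g. "V00040B CT"

    def __str__(self) -> str:
        return f"[{self.ctype}: {self.referent}]"


@dataclass
class Graph:
    leading: Concept
    links: list = field(default_factory=list)  # (relation or None, target concept)

    def link(self, target: Concept, relation: Optional[str] = None) -> None:
        """Attach a concept to the leading concept, optionally with a role label."""
        self.links.append((relation, target))

    def render(self) -> str:
        """Render the graph in the linear notation used in Figure 1."""
        lines = [str(self.leading)]
        for relation, target in self.links:
            label = f"[{relation}] " if relation else ""
            lines.append(f"  => {label}{target}")
        return "\n".join(lines)


graph = Graph(Concept("treatment", "V00040B CT"))
graph.link(Concept("topography", "T001CD9 Thorax"))                        # non-leading concept
graph.link(Concept("aetiology", "E0036A8 Kontrastmittel"), relation="using")  # role of "mit"
graph.link(Concept("date", "01/06/2005"))                                  # generated from the formula
print(graph.render())
```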
4. Preliminary results
The following initial observations can be made. The strategy does a fairly good job of identifying concepts in medical narratives and linking these to each other in a manner corresponding to the meaning. The proposed approach has the advantages of not being limited to a certain medical domain and of making use of already existing components (e.g. Wingert Nomenclature, indexing algorithm). The approach is implemented for German but can easily be extended to other languages; WNC and morphological analysis already exist for English, for example. The representation of formula and formula-like expressions by semantic concepts allows these expressions to be exploited. The system has recently been extended to process more kinds of negated phrases, with promising results. The semantic representations can be used for different purposes. The implementation of the strategy is currently being tested as part of an information retrieval tool in a cooperating hospital. Most recently, the use of the semantic representation for determining corresponding classification codes of ICD-10 GM or OPS-301 has been implemented. At the same time, we are investigating the system’s capacities for information extraction. For this purpose, templates for information extraction from surgical letters have been specified. As a first application, the system is used to extract information on the hospitalisation of a patient; more precisely, the kind of and the reason for hospitalisation as well as the admission diagnosis have to be extracted. Figure 2 shows the template for the extraction task.
Information on hospitalisation
  kind of admission: first admission
  reason of admission: emergency
  admission diagnosis: appendicitis
Figure 2: Template filled with extracted information on the hospitalisation of a patient
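The template of Figure 2 can be thought of as a record with three slots. The following Python sketch fills such a record from trigger words in the text and from diagnosis concepts supplied by the caller. Everything specific in it is an assumption made for illustration: the trigger words, the slot names, the sample sentence and the function `fill_template` are hypothetical, and the extraction rules of the described system are not reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HospitalisationTemplate:
    kind_of_admission: Optional[str] = None
    reason_of_admission: Optional[str] = None
    admission_diagnosis: Optional[str] = None


# Hypothetical trigger words (German) and the slot values they produce.
TRIGGERS = {
    "kind_of_admission": {"erstaufnahme": "first admission"},
    "reason_of_admission": {"notfall": "emergency"},
}


def fill_template(text: str, diagnosis_concepts: list[str]) -> HospitalisationTemplate:
    """Fill the template from trigger words and from given diagnosis concepts."""
    template = HospitalisationTemplate()
    lowered = text.lower()
    for slot, triggers in TRIGGERS.items():
        for word, value in triggers.items():
            if word in lowered:
                setattr(template, slot, value)
                break
    if diagnosis_concepts:
        # Assumption: the first diagnosis concept of the graph fills the slot.
        template.admission_diagnosis = diagnosis_concepts[0]
    return template


# Invented sample sentence, used only to exercise the sketch.
print(fill_template("Erstaufnahme als Notfall", ["appendicitis"]))
```

Slots for which neither a trigger word nor a concept is found stay `None`, which makes incomplete extractions visible.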
During the extraction process, information is collected from the written text itself and from its semantic representation, i.e. depending on the extraction rules, the system looks for trigger words in the input document or for information in the conceptual graph created by the system. A first evaluation of the extraction process has been performed with a data set of 20 texts. For these texts, recall and precision values of 92.5 percent and 93 percent, respectively, were achieved. These results are promising. However, the evaluation is still in progress and has to be extended, in particular to a larger data set. The quality of the representation system is analysed in terms of the correctness and completeness of the semantic representations derived. Their correctness will be evaluated by comparing the automatically generated representation to a manually created one.
5. Future Directions
In this article, a promising approach for the automatic generation of semantic representations using a multiaxial nomenclature is introduced. The strategy is currently being tested as a basis for information extraction. The quality of the system is evaluated in terms of the correctness and completeness of the representations.
Among other things, a question to investigate is whether the shallow syntactic analysis is sufficient for our purposes or whether a deeper syntactic analysis is necessary. Relations or dependencies expressed in syntactic structure are lost because of the shallow syntactic analysis (e.g. in the phrase left upper arm and right knee, “left” modifies the compound “upper arm”; it does not modify the noun “knee”). At present, the analysis seems to be adequate for short sentences and the noun phrase structures mainly occurring in the processed documents, but it is not yet sufficient for processing more complex sentences.
References
[1] Stede M, Schlangen D: Feeding OWL: Extracting and Representing the Content of Pathology Reports. In: Proceedings of the 4th Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology. Barcelona, Spain, 2004
[2] Spyns P: Natural language processing in medicine: an overview. Methods of Information in Medicine, 35: 285-301, 1996
[3] Wingert F: SNOMED Manual. Springer-Verlag, Berlin, Heidelberg, 1984
[4] Sowa J F: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley Publishing Company, Reading, 211-276, 1984
[5] Hahn U et al: Wissensbasiertes Text-Mining mit SynDiKATe (Knowledge-based text mining with SynDiKATe). Künstliche Intelligenz, vol. 2/02, arendtap Verlag, Bremen, 2002
[6] Wingert F: Morphologic analysis of compound words. Methods of Information in Medicine, 24(3): 155-162, 1985
[7] Norton L, Pacak M: Morphosemantic analysis of compound word forms denoting surgical procedures. Methods of Information in Medicine, 22(1): 29-36, 1983
[8] Schulz S, Hahn U: Morpheme-based, cross-lingual indexing for medical document retrieval. International Journal of Medical Informatics, 59(3): 87-99, 2000
[9] Hahn U, Honeck M, Piotrowski M, Schulz S: Subword Segmentation - Leveling out Morphological Varieties for Medical Document Retrieval. In: Proceedings of the 2001 AMIA Annual Symposium, Washington, 229-234, 2001
[10] Romacker M, Hahn U: Empirical Data for the Semantic Interpretation of Prepositional Phrases in Medical Documents. In: Proceedings of the 2001 AMIA Annual Symposium, Washington, 563-567, 2001
[11] Huang Y, Lowe H J, Hersh W R: A Pilot Study of Contextual UMLS Indexing to Improve the Precision of Concept-based Representation in XML-structured Clinical Radiology Reports. J. Am. Med. Inform. Assoc., 10(6): 580-587, November 1, 2003
[12] Rassinoux A-M, Baud R H, Scherrer J-R: A multilingual analyser for medical texts. Geneva, http://mbi.dkfz-heidelberg.de/helios/doc/nlp/Rassinoux94b.html, 1994