Connecting Medical Informatics and Bio-Informatics R. Engelbrecht et al. (Eds.) ENMI, 2005
Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System

Kornél Markó (a,b), Stefan Schulz (a), Udo Hahn (b)
(a) Medical Informatics Department, Freiburg University Hospital, Germany
(b) Jena University Language & Information Engineering Lab, Germany
Abstract

We present a method for the automated acquisition of a multilingual medical lexicon (for Spanish and Swedish) to be used within the framework of a medical cross-language text retrieval system. We incorporate seed lexicons and parallel corpora derived from the UMLS Metathesaurus. The seed lexicons for Spanish and Swedish are automatically generated from (previously manually constructed) Portuguese, German and English sources. Lexical and semantic hypotheses are then validated making iterative use of co-occurrence patterns of hypothesized translation synonyms in the parallel corpora.

Keywords: Medical Informatics; Information Storage and Retrieval; Multilingualism; Vocabulary, Controlled
1. Introduction

Access to medical documents (medical narratives, research articles, web-based health portals, etc.) is typically characterized by a mix of natural languages. While English is the primary (though not the only) language of scientific communication in medicine, medical specialists and general practitioners use their native language(s) for medical reports of any sort. For non-native speakers of English, even those familiar with English medical terminology, this diversity makes it difficult to express their information needs properly. Therefore, automatic intra- and interlingual lexical mappings or transformations of equivalent expressions become crucial for an adequate supply of medical information.

We respond to these challenges with the MORPHOSAURUS system (footnote 1). It is centered around a new type of lexicon, in which the entries are subwords, i.e., semantically minimal, morpheme-style units [1]. Language-specific subwords are linked by intralingual as well as interlingual synonymy and grouped into concept-like equivalence classes at the layer of a language-independent interlingua. Our claim that this approach is useful for the purpose of cross-lingual text retrieval and document classification has already been experimentally supported [2].

Footnote 1: Acronym for MORPheme TheSAURUS, see http://www.morphosaurus.net
Section 10: Natural Language, Text Mining and Information Retrieval
The quality of cross-lingual indexing and retrieval crucially depends on the underlying lexicons and the thesaurus in which equivalence classes are organized. As their manual construction and maintenance are costly and error-prone, we here propose an approach to automatically acquire Spanish and Swedish lexicons, starting from already available Portuguese, English, and German lexicons. We then evaluate our approach on a Spanish-Swedish parallel medical corpus.

2. Multilingual Lexicon Acquisition

Our work starts from the assumption that neither fully inflected nor automatically stemmed words constitute the appropriate granularity level for lexicalized content description. In the medical sublanguage, we observe a high frequency of domain-specific suffixes (e.g., ‘-itis’, ‘-ectomia’) and complex word forms such as ‘pseudo◊hypo◊para◊thyroid◊ism’ (footnote 2). Morpheme-style subwords are lexical entities capable of dealing with these phenomena in a particularly adequate manner. They are assembled in a multilingual lexicon and thesaurus, which contain lexemes, their attributes, synonym classes and semantic relations between them, according to the following considerations:

• Subwords are registered in 7-bit ASCII (with few exceptions for Swedish), together with their attributes such as language and subword type (stem, prefix, suffix, invariant). Each lexicon entry is assigned a unique identifier representing one synonymy class, the MORPHOSAURUS identifier (MID).
• Synonymy classes which contain intralingual synonyms and interlingual translations of subwords are fused. Equivalence is judged within the context of medicine only.
• Semantic links between synonymy classes are added. We subscribe to a shallow approach in which semantic relations are restricted to a single paradigmatic relation has-meaning, which relates one ambiguous class to its specific readings (footnote 3), and a syntagmatic relation expands-to, which consists of predefined segmentations in the case of very short subwords (footnote 4).
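The three considerations above can be pictured as a small data structure. The following Python sketch is only an illustration of that description, not the MORPHOSAURUS implementation: the class names, the field names and the MID strings such as `#head` or `#myalg` are hypothetical, and the two example relations are modelled on the ‘head’ and ‘myalg’ examples given in the paper's footnotes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Subword:
    """One lexicon entry (hypothetical layout)."""
    form: str      # normalized surface form, e.g. "muskel"
    language: str  # e.g. "EN", "DE", "PT"
    kind: str      # "stem", "prefix", "suffix" or "invariant"
    mid: str       # identifier of the synonymy class the entry belongs to


@dataclass
class Thesaurus:
    """Semantic links between synonymy classes (hypothetical layout)."""
    # paradigmatic relation: ambiguous class -> its specific readings
    has_meaning: dict = field(default_factory=dict)
    # syntagmatic relation: short subword class -> sequence of classes
    expands_to: dict = field(default_factory=dict)

    def resolve(self, mid):
        """Return the alternative MID sequences a class stands for."""
        if mid in self.has_meaning:
            return [[reading] for reading in self.has_meaning[mid]]
        if mid in self.expands_to:
            return [list(self.expands_to[mid])]
        return [[mid]]


# Invented identifiers, for illustration only.
thesaurus = Thesaurus(
    has_meaning={"#head": ["#cephal", "#leader"]},
    expands_to={"#myalg": ["#muscle", "#pain"]},
)
entry = Subword(form="myalg", language="EN", kind="stem", mid="#myalg")
print(thesaurus.resolve(entry.mid))  # [['#muscle', '#pain']]
print(thesaurus.resolve("#head"))    # [['#cephal'], ['#leader']]
```

Keeping the two relations in separate mappings mirrors the distinction drawn in the text between one ambiguous class with several readings and one short subword that is expanded into a sequence of classes.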
Footnote 2: ‘◊’ denotes the concatenation operator.
Footnote 3: For instance, {head} → {zephal,kopf,caput,cephal,cabec,cefal} OR {leader,boss,lider,chefe}
Footnote 4: For instance, {myalg} → {muscle,muskel,muscul}◊{schmerz,pain,dor}

Figure 1 - Morpho-semantic normalization pipeline

Figure 1 depicts how source documents (top-left) are converted into an interlingual representation. First, each input word is orthographically normalized according to language-specific rules for the transcription of diacritics (top-right). Next, words are segmented into sequences of subwords (bottom-right). The segmentation results are checked for morphological plausibility using a finite-state automaton in order to reject invalid segmentations (e.g., segmentations without stems
or beginning with a suffix). Finally, each meaning-bearing subword is replaced by a language-independent semantic identifier, the MORPHOSAURUS identifier (MID) (bottom-left). In Figure 1, bold-faced MIDs co-occur in both document fragments.

The manual construction of a trilingual lexicon and the thesaurus has consumed three and a half person years. The combined subword lexicon contains 59,288 entries, with 22,041 for English, 22,385 for German and 14,862 for Portuguese. In an effort to extend the language coverage of our system to Spanish and Swedish, we wanted to reuse the already available resources for Portuguese, English, and German in order to speed up and ease the lexicon acquisition process.

For the initialization of the Spanish and Swedish subword lexicons we proceed as follows: From the PORtuguese (alternatively, ENGlish and GERman) lexicon, identical and similarly spelled SPAnish (SWEdish) subword candidates are generated. For example, the Portuguese word stem ‘estomag’ [‘stomach’] is identical with its Spanish cognate, while ‘mulher’ [‘woman’] (Portuguese) is similar to ‘mujer’ (Spanish). Similar subword candidates are generated by applying a set of string substitution rules, some of which are listed in Table 1. In total, we formulated 44 rules for the Portuguese-Spanish pair, 19 rules for German-Swedish and 6 for English-Swedish. Some of these substitution patterns cannot be applied to starting or ending sequences of characters in the source subword. This constraint is captured by a wildcard (‘+’ in Table 1).

Table 1 - Some String Substitution Rules

Rule      POR       SPA      |  Rule     GER     SWE     |  Rule   ENG        SWE
ss→s      fracass   fracas   |  ei→e     Bein    ben     |  c→k    cramp      kramp
lh→j      mulher    mujer    |  +aa→a    Saal    sal     |  ph→f   phosphor   fosfor
+ca→za    cabeca    cabeza   |  +u→ö     brust   bröst   |  ce→s   iceland    island
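To make the rule mechanism concrete, here is a minimal Python sketch of how variants might be generated from the rules in Table 1. It rests on assumptions that the paper does not spell out: a leading ‘+’ is read as "at least one character must precede the pattern", rules may be combined and applied at any matching position, and the unchanged source stem is kept as the "identical" candidate. The function names and the `limit` safety cap are hypothetical.

```python
import re


def compile_rule(rule):
    """Turn a rule such as 'lh→j' or '+ca→za' into (regex, replacement)."""
    source, target = rule.split("→")
    if source.startswith("+"):
        # Assumed reading of '+': the pattern must not start the subword.
        pattern = "(?<=.)" + re.escape(source[1:])
    else:
        pattern = re.escape(source)
    return re.compile(pattern), target


def generate_variants(stem, rules, limit=1000):
    """Return the stem plus every string reachable by applying the rules."""
    compiled = [compile_rule(rule) for rule in rules]
    seen = {stem}
    frontier = [stem]
    # 'limit' is a rough cap that guards against rule sets that keep growing.
    while frontier and len(seen) < limit:
        word = frontier.pop()
        for regex, target in compiled:
            for match in regex.finditer(word):
                variant = word[:match.start()] + target + word[match.end():]
                if variant not in seen:
                    seen.add(variant)
                    frontier.append(variant)
    return seen


POR_SPA = ["ss→s", "lh→j", "+ca→za"]  # the three sample rules from Table 1
print(sorted(generate_variants("mulher", POR_SPA)))  # ['mujer', 'mulher']
print(sorted(generate_variants("cabeca", POR_SPA)))  # ['cabeca', 'cabeza']
```

Note that such a generator deliberately over-produces: for ‘fracass’ it yields not only ‘fracas’ but also strings such as ‘frazass’, which is why the candidates still have to be filtered against target-language text.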
Based on these rules and the already available (Portuguese, English, German) subword lexicons, for each subword, all possible Spanish and Swedish variant strings were generated. All resulting subword variants were subsequently compared with target language (Spanish and Swedish) text corpora we acquired from the Web. Wherever a (subword) variant matched a word in the target corpus, the matching string was listed as a potential Spanish (Swedish) cognate of the Portuguese (alternatively, English and German) subword it originated from. Whenever several substitution alternatives for a source subword had to be considered, the alternative with the most similar lexical distribution in the corpora considered was chosen. All other candidates were discarded. As a result, we obtained a list of putative Spanish (Swedish) subwords, each linked by the associated MID to their grounding source cognate in the Portuguese (alternatively, English and German) lexicon.

Table 2 - Selected Cognates (Union in Brackets)

Language Pair        Source Lexicon   Variants   Selected Cognates   Linked MIDs
Portuguese-Spanish   14,004           123,235    8,644               6,036
German-Swedish       21,705           145,423    4,249 (6,086)       3,308 (4,157)
English-Swedish      21,501           68,803     4,140 (6,086)       3,208 (4,157)

Starting from 14,004 Portuguese, 21,705 German and 21,501 English stems (affixes were excluded), a total of 123,235 Spanish subword variants were created using the string substitution rules. For Swedish, 145,423 variants were derived from German and 68,803 from English. Matching these variants against the Spanish and Swedish corpora and allowing for a maximum of one candidate per source subword, we identified 8,644 tentative Spanish and (combining English and German evidence) 6,086 tentative Swedish cognates (cf. Table 2). Spanish candidates are linked to a total of 6,036 MIDs from their Portuguese correlates (hence, 2,608 synonym
relationships have also been hypothesized), whilst Swedish candidates are associated with 4,157 MIDs from their German and English correlates.

We then wanted to identify false friends, i.e., similar words in different languages with different meanings. In our experiments, we found, e.g., the Spanish subword candidate ‘crianz’ for the Portuguese ‘crianc’ [‘child’] (the normalized stem of ‘criança’). The correct translation of Portuguese ‘crianc’ to Spanish, however, would have been ‘nin’ (the stem of ‘niño’), whilst the Spanish ‘crianz’ refers to ‘criac’ [‘breed’] (stem of ‘criação’ in Portuguese). For eliminating such false friends, we relied on parallel corpora made available by the Unified Medical Language System (UMLS) Metathesaurus. Unfortunately, word-to-word translation occurs only in very few cases. Much more common are more or less complex noun phrases with a similarly complex semantic structure. Examples of typical English-Spanish alignments are "Cell Growth" aligned with "Crecimiento Celular", or "Heart transplant, with or without recipient cardiectomy" aligned with "Trasplante cardiaco, con o sin cardiectomia en el receptor".

We use English as the pivot language for our experiments, since it has the broadest lexical coverage in the UMLS. The size of the corpora derived from the linkages of the English UMLS to other languages amounts to 60,526 alignments for English-Spanish (only preferred entries in the UMLS MRCONSO table) and 10,953 alignments for English-Swedish. The parallel corpora of the aligned UMLS expressions were morpho-semantically processed as described in Figure 1. Whenever a MID occurred on both sides after this simultaneous bilingual processing, the Spanish (Swedish) subword entry that led to this particular MID was taken to be a valid entry. This approach is reasonable, since it is highly unlikely that a false friend occurs within the same translation context.
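The validation step just described can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the authors' code: each alignment is assumed to be already normalized into the set of MIDs found on the English side and the list of (subword, MID) pairs found on the target side, and all identifiers and sample data (`#cell`, `#child`, `#breed`, ‘celul’) are invented for the example.

```python
def validate_candidates(candidates, alignments):
    """Keep only cognate hypotheses supported by at least one alignment.

    candidates: dict mapping a target-language subword to its hypothesized MID.
    alignments: iterable of (pivot_mids, target_hits), where pivot_mids is the
        set of MIDs from the English side and target_hits is a list of
        (subword, MID) pairs produced on the Spanish or Swedish side.
    """
    supported = set()
    for pivot_mids, target_hits in alignments:
        for subword, mid in target_hits:
            # A hypothesis is supported when its MID shows up on both sides.
            if candidates.get(subword) == mid and mid in pivot_mids:
                supported.add(subword)
    return {sub: mid for sub, mid in candidates.items() if sub in supported}


# Invented toy data: 'crianz' is the false friend discussed in the text.
candidates = {"celul": "#cell", "crianz": "#child"}
alignments = [
    ({"#cell", "#grow"}, [("celul", "#cell")]),  # "Cell Growth"
    ({"#breed"}, [("crianz", "#child")]),        # hypothetical alignment
]
print(validate_candidates(candidates, alignments))  # {'celul': '#cell'}
```

A hypothesis that never co-occurs with its MID on the English side, such as the false friend here, simply receives no support and drops out.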
All translation hypotheses that never matched in this validation procedure were rejected from the candidate lexicons. As a result, 3,230 of the original Spanish hypotheses (37%) and 1,565 of the Swedish hypotheses (26%) are kept (cf. also Tables 2 and 3). These supported cognates now serve as the seed lexicons (in the following, L(0)) for acquiring additional lexicon entries, which are not cognates to elements of any of the source lexicons.

To illustrate this process, assume the Swedish subword ‘blod’ was identified as being a cognate to the English subword ‘blood’ (and, therefore, is included in L(0)). Then, the yet unknown Swedish word ‘Blodtryck’, which has the English translation ‘blood pressure’ in the UMLS Metathesaurus, is (invalidly) segmented into [ST:blod|UK:t|SF:r|UK:yck], with ST being a marker for a stem, SF for a suffix and UK for an unknown sequence. At the same time, the morpho-semantic normalization of ‘blood pressure’ leads to the sequence of MIDs [#blood #tense], whilst the normalization of ‘Blodtryck’ leads to [#blood], since ‘tryck’ is not yet part of the Swedish lexicon. Comparing these two representations, exactly one MID resulting from English cannot be found in the Swedish normalization result. In this case, the invalid segment is reconstructed (leading to ‘tryck’) by eliminating those substrings that led to a matching MID (‘blod’) in the aligned unit (‘Blodtryck’). The supernumerary MID resulting from the English normalization is then assigned to the reconstructed substring.

After processing all UMLS alignments, this new entry is incorporated in the Swedish lexicon as a stem, resulting in the lexicon L(1). Next, all UMLS alignments are processed once again. The newly derived lexicon entry may now serve for extracting, e.g., the Swedish word ‘luft’ with its identifier #aero from the UMLS entry ‘Air Pressure’ (indexed to [#aero #tense]) linked to ‘Lufttryck’ (Swedish), and so on.
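The ‘Blodtryck’ and ‘Lufttryck’ example can be replayed with a short Python sketch. It is a simplification under stated assumptions rather than the MORPHOSAURUS procedure: the English side is given as ready-made MID lists, the target side is a single word, and a greedy longest-match scan (the hypothetical `segment` function) stands in for the real segmenter and its finite-state plausibility check. The loop repeats until a full pass over the alignments yields no new entry.

```python
def segment(word, lexicon):
    """Greedy longest-match scan: returns ([(stem, mid)], [unknown substrings])."""
    matched, leftovers, unknown = [], [], ""
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                if unknown:
                    leftovers.append(unknown)
                    unknown = ""
                matched.append((word[i:j], lexicon[word[i:j]]))
                i = j
                break
        else:  # no lexicon entry starts at position i
            unknown += word[i]
            i += 1
    if unknown:
        leftovers.append(unknown)
    return matched, leftovers


def bootstrap_round(lexicon, alignments):
    """One pass over all alignments; returns newly hypothesized entries."""
    new_entries = {}
    for pivot_mids, target_word in alignments:
        matched, leftovers = segment(target_word.lower(), lexicon)
        target_mids = {mid for _, mid in matched}
        missing = [mid for mid in pivot_mids if mid not in target_mids]
        # Exactly one unexplained MID and one unexplained substring: pair them.
        if len(missing) == 1 and len(leftovers) == 1:
            new_entries.setdefault(leftovers[0], missing[0])
    return new_entries


def bootstrap(seed, alignments):
    """Grow L(0) into L(1), L(2), ... until a round adds nothing."""
    lexicon, rounds = dict(seed), 0
    while True:
        new_entries = bootstrap_round(lexicon, alignments)
        if not new_entries:
            return lexicon, rounds
        lexicon.update(new_entries)
        rounds += 1


seed = {"blod": "#blood"}  # L(0): validated cognates only
alignments = [(["#blood", "#tense"], "Blodtryck"),
              (["#aero", "#tense"], "Lufttryck")]
lexicon, rounds = bootstrap(seed, alignments)
print(rounds, lexicon)
# 2 {'blod': '#blood', 'tryck': '#tense', 'luft': '#aero'}
```

In this toy run ‘tryck’ is learned in the first round and only then makes ‘luft’ recoverable in the second, which is the dependency between rounds that the text describes.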
When no new entries can be generated in one complete pass over the UMLS-derived alignments, the algorithm stops. Table 3 depicts the growth steps of the target lexicons for the entire bootstrapping process. After 14 runs, learning comes to an end with 7,154 lexemes generated for Spanish, while after 8 runs 4,148 lexicon entries for Swedish are acquired.

For multilingual lexicon acquisition, we referred to English-Spanish and English-Swedish
corpora compiled out of the UMLS Metathesaurus. To estimate the quality of the interlingual connections between the newly derived lexicons, we now compare the results after running MORPHOSAURUS on these collections. We are aware that these results probably include overfitting phenomena. Therefore, we additionally extracted a Spanish-Swedish parallel corpus from the UMLS. This corpus has 8,993 alignments ranging, again, from word-to-word translations (e.g., ‘Pierna’ to ‘Ben’ [‘leg’]) to complex noun phrases, which sometimes correspond to a single word in the other language, e.g., the Spanish phrase ‘Enfermedad virica transmitida por artropodos, no especificada’ maps to the Swedish ‘Arbovirusinfektioner’ [‘Arbovirus Infections’] in the UMLS. However, we rely on these resources for evaluation only.

Table 3 - Lexicon Growth Steps for Spanish (left) and Swedish (right)

Lexicon   Spanish   |  Lexicon   Spanish   |  Lexicon   Swedish   |  Lexicon   Swedish
L(0)      3,230     |  L(6)      7,110     |  L(0)      1,565     |  L(6)      4,142
L(1)      6,817     |  L(7)      7,111     |  L(1)      2,324     |  L(7)      4,147
L(2)      7,001     |  L(8)      7,114     |  L(2)      3,685     |  L(8)      4,148
L(3)      7,094     |  L(9)      7,126     |  L(3)      4,013     |
L(4)      7,108     |  …         …         |  L(4)      4,119     |
L(5)      7,109     |  L(14)     7,154     |  L(5)      4,136     |

Rather than simply measuring the coverage, we wanted to estimate the quality of the generated lexicons, i.e., the validity of the interlingual synonymy relations we stipulate. For this goal, we indexed the English-Spanish and English-Swedish corpora employing the MORPHOSAURUS routines, for each lexicon level L(0)-L(14). Furthermore, the Spanish-Swedish corpus, previously unseen by the learning algorithm, was processed accordingly. For each alignment unit of the parallel corpus, we then compared the resulting MIDs. In order to determine the fit of the two representations, we used a measure of indexing consistency proposed by Hooper (1965) [3]:

$C_{AU(i)} = \frac{100 \cdot A}{A + N + M}$

The indexing consistency of one alignment unit AU(i) of the parallel corpus, $C_{AU(i)}$, depends on A, the number of MIDs that co-occur on both sides of that unit in the parallel corpus, and on N and M, the numbers of MIDs that occur only on one side or only on the other. To express the overall consistency, the arithmetic mean of $C_{AU(i)}$ over all alignment units of the corpus is calculated.

3. Results

Table 4 depicts the overall consistency values (columns 2, 5 and 8) starting from lexicon L(0) (only validated cognates) to lexicon L(5) for the Spanish and Swedish lexical acquisition (improvements after that step are only marginal, cf. Table 3). When processing the English-Spanish corpus, consistency is already about 40%, only considering cognates. Adding those entries acquired from bootstrapping the same corpus, consistency climbs to a maximum of 52%. As a reference, the processing of an English-German corpus, which is also derived from the UMLS, yields 57% consistency, keeping in mind that the English and German lexicons were generated manually and provide very good coverage [2]. For English-Swedish, consistency ranges from 27% (only cognates) to 56% (after five cycles). The processing of Spanish-Swedish is particularly interesting, since the underlying corpus was not involved at all in the lexical acquisition. With consistency starting from 21% for cognates, 46% is reached after five cycles of generating the non-English lexicons by processing parallel corpora aligned to English only.

Coverage was measured by counting those cases in which at least one MID occurs on both sides of the alignment units considered. For Spanish cognates only (L(0) in Table 4), alignments to English can be observed for 88% of the corpus. This value increases to 96% after five runs of
bootstrapping the Spanish lexicon. For English-Swedish, coverage reaches 86% and for Spanish-Swedish 84%. Again, as a reference, the processing of the English-German corpus yields 97% coverage. The proportion of cases in which both sides are indexed identically ranges from 6% to 12% for English-Spanish, from 12% to 43% for English-Swedish and from 9% to 27% for Spanish-Swedish. The reference value is 30% for English-German.

Table 4 - Indexing Consistency (C), Coverage (Cov.) of Lexicons and Number of Identical Indexes (Ident.) at each Stage of Lexicon Generation

          English-Spanish: n=60,526     |  English-Swedish: n=10,953     |  Spanish-Swedish: n=8,993
Lexicon   C      Cov.(%)   Ident.(%)    |  C      Cov.(%)   Ident.(%)    |  C      Cov.(%)   Ident.(%)
L(0)      39.6   87.6      6.1          |  27.4   60.0      11.7         |  21.4   53.8      8.9
L(1)      47.5   95.5      9.7          |  29.8   63.3      18.4         |  29.8   77.1      18.4
L(2)      51.0   95.6      11.8         |  50.7   81.8      39.9         |  40.6   80.1      23.9
L(3)      51.3   95.6      12.1         |  55.4   84.8      41.4         |  44.6   83.2      25.8
L(4)      52.0   95.6      12.3         |  56.0   85.4      42.2         |  45.6   83.7      26.7
L(5)      52.0   95.7      12.4         |  56.3   85.6      42.6         |  45.9   83.8      26.9
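The three figures reported per corpus in Table 4 follow directly from the definitions in the text, as the Python sketch below illustrates. Two points are assumptions on our part: the MIDs of each side are treated as sets (duplicates are ignored), and an alignment unit with no MIDs on either side is scored 0. The function names and the three toy alignment units are invented and do not come from the UMLS corpora.

```python
def hooper_consistency(mids_a, mids_b):
    """Hooper's indexing consistency, 100*A / (A + N + M), for one unit."""
    a, b = set(mids_a), set(mids_b)
    shared = len(a & b)   # A: MIDs occurring on both sides
    only_a = len(a - b)   # N: MIDs occurring on the first side only
    only_b = len(b - a)   # M: MIDs occurring on the second side only
    total = shared + only_a + only_b
    return 100.0 * shared / total if total else 0.0


def evaluate(corpus):
    """Mean consistency, coverage (%) and identical indexes (%) of a corpus."""
    scores = [hooper_consistency(left, right) for left, right in corpus]
    n = len(scores)
    return {
        "consistency": sum(scores) / n,
        # coverage: at least one MID occurs on both sides
        "coverage": 100.0 * sum(score > 0 for score in scores) / n,
        # identical: both sides yield exactly the same MIDs
        "identical": 100.0 * sum(score == 100.0 for score in scores) / n,
    }


# Invented toy corpus of three alignment units.
corpus = [
    (["#blood", "#tense"], ["#blood", "#tense"]),  # identical indexes
    (["#aero", "#tense"], ["#tense"]),             # partial overlap
    (["#cell"], []),                               # nothing recognized
]
print(evaluate(corpus))
```

For this toy corpus the unit scores are 100, 50 and 0, so the mean consistency is 50, two of three units count towards coverage, and one of three is indexed identically.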
4. Discussion and Conclusion

We have shown that a significant number of Portuguese, English and German subwords from the medical domain can be mapped to Spanish and Swedish cognates by simple string transformations. With these seeds, we further enlarge the Spanish and Swedish cognate lexicons by subwords which are not cognates. For the latter task, we used the UMLS Metathesaurus, and extracted those non-cognates by bootstrapping. Most alternative automatic approaches to multilingual lexical acquisition either employ heavy linguistic parsing machinery [4] or use statistical methods, such as context vector comparison [5], which require a seed lexicon of trusted translations. We derived such a seed lexicon via a generative method for cognate mapping. Déjean et al. [6] incorporate hierarchical information from MeSH for combining different evidence for lexical acquisition.

5. References

[1] Honeck M, Hahn U, Klar R, Schulz S: Text retrieval based on medical subwords. In Health Data in the Information Society. Proceedings of MIE 2002. Budapest, Hungary, August 2002; 241-245.
[2] Markó K, Hahn U, Schulz S, Daumke P, Nohama P: Interlingual indexing across different languages. In Conference Proceedings of the 7th RIAO Conference. Avignon, France, April 26-28, 2004; 82-99.
[3] Hooper RS: Indexer Consistency Tests: Origin, Measurement, Results, and Utilization. Bethesda, MD: IBM Corporation, 1965.
[4] Hersh WR, Campbell EH, Evans DA, Brownlow ND: Empirical, automated vocabulary discovery using large text corpora and advanced language processing tools. In Proceedings of the 1996 AMIA Annual Fall Symposium. Washington DC 1996; 159-163.
[5] Widdows D, Dorow B, Chan C-K: Using parallel corpora to enrich multilingual lexical resources. In Proceedings of the 3rd International Conference on Language Resources and Evaluation - LREC 2002. Las Palmas de Gran Canaria, 2002; 240-245.
[6] Déjean H, Gaussier E, Sadat F: An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In Proceedings of the 18th International Conference on Computational Linguistics - COLING 2002. 2002; 218-224.
Address for correspondence Kornél Markó,
[email protected], Medical Informatics Department, Freiburg University Hospital, Stefan-Meier-Str. 26, 79104 Freiburg, Germany