Multilingual Lexical Acquisition by Bootstrapping Cognate Seed Lexicons

Kornél Markó and Stefan Schulz
Freiburg University Hospital, Medical Informatics Department
Stefan-Meier-Strasse 26, D-79104 Freiburg, Germany
www.imbi.uni-freiburg.de/medinf

Udo Hahn
Jena University, Language & Information Engineering Lab
Fürstengraben 30, D-07743 Jena, Germany
www.coling.uni-jena.de
Abstract

We present a methodology by which multilingual dictionaries (for Spanish, French and Swedish) emerge automatically from simple seed lexicons. The seed lexicons for the target languages are automatically generated by cognate mapping from (previously manually constructed) Portuguese, German as well as English sources. Lexical and semantic hypotheses are then validated by processing parallel corpora. In a last step, we use the cleaned list of ‘approved’ cognates in order to augment, step by step, the target dictionaries by processing the parallel corpora in terms of co-occurrence patterns of hypothesized translation equivalents which are not cognates.
1 Introduction

Applications of NLP to medical language have up until now mainly focused on monolingual tasks involving document retrieval or information extraction. The reason for widening their scope to include multilingual considerations as well is fairly evident. While clinical documents are typically written in the country’s native language, searches in major bibliographic databases and the Web require sophisticated knowledge of English medical terminology. Hence, for cross-language information retrieval (CLIR) some sort of bridging between synonymous or, at least, related terms from different languages has to be done to make use of the information these sources hold.

Dictionaries for CLIR provide explicit lexical links within and between the languages involved. However, manually built lexical resources often lack coverage, since their construction and maintenance are costly and error-prone. Therefore, we propose a mechanism by which comprehensive dictionaries for CLIR can be automatically set up, relying on simple techniques and easily available resources.

In previous studies (Schulz et al. 04), we showed how lexical cognates can be identified using unrelated (i.e., non-parallel, non-aligned) corpora. Here we extend this approach by relating non-cognate lexical items from different language pairs as well. In particular, we examine a bootstrapping approach in order to acquire Spanish, French, and Swedish lexicons, starting from already available Portuguese, English, and German lexicons.
2 Subwords as Basic Indexing Units

Our work starts from the assumption that neither fully inflected nor automatically stemmed words constitute the appropriate granularity level for lexicalized content description. Especially in scientific sublanguages, we observe a high frequency of domain-specific suffixes (e.g., ‘-itis’, ‘-ectomia’ in the medical domain) and the construction of complex word forms such as ‘pseudo⊕hypo⊕para⊕thyroid⊕ism’ or ‘gluco⊕corticoid⊕s’.¹ In order to properly account for these particularities of ‘medical’ morphology, we developed the MorphoSaurus system.² It is centered around a lexicon, in which the entries are subwords, i.e., self-contained, semantically minimal units (cf. (Schulz et al. 02) for a distinction between subwords and linguistically motivated morphemes). We have found empirical evidence that subword-based document indexing improves the performance of cross-lingual document retrieval in the medical domain (Hahn et al. 04).

¹ ‘⊕’ denotes the concatenation operator.
² http://www.morphosaurus.net

Subwords are assembled in a multilingual lexicon and thesaurus, which contain their entries, special attributes and semantic relations between them, according to the following considerations:

• Subwords are listed, together with their attributes such as language (English, German, Portuguese) and subword type (stem, prefix, suffix, invariant). Each lexicon entry is assigned one MorphoSaurus identifier representing one synonymy class, the MID.
• Synonymy classes which contain intralingual synonyms and interlingual translations of subwords are fused. Intra- and interlingual semantic equivalence are judged within the context of medicine only.
• Semantic links between synonymy classes are added. We subscribe to a shallow approach in which semantic relations are restricted to a single paradigmatic relation has-meaning, which relates one ambiguous class to its specific readings³ (cf. (Markó et al. 05) for the disambiguation of subwords), and a syntagmatic relation expands-to, which consists of predefined segmentations in case of utterly short subwords.⁴

[Figure 1: Morpho-Semantic Normalization Pipeline]
  Original (top-left):
    High TSH values suggest the diagnosis of primary hypothyroidism ...
    Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose ...
  After Orthographic Normalization, using Orthographic Rules (top-right):
    high tsh values suggest the diagnosis of primary hypothyroidism ...
    erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose ...
  After the Morphosyntactic Parser, using the Lexicon (bottom-right):
    high tsh value s suggest the diagnos is of primar y hypo thyroid ism
    er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
  After Semantic Normalization, using the Thesaurus: MID-Representation (bottom-left):
    #up# tsh #value# #suggest# #diagnost# #primar# #small# #thyre#
    #up# tsh #value# #permit# #diagnost# #primar# #small# #thyre#

Figure 1 depicts how source documents (top-left) are converted into an interlingual representation by a three-step procedure. First, each input word is orthographically normalized in terms of lower case characters and according to language-specific rules for the transcription of diacritics (top-right). Next, words are segmented into sequences of subwords or left as they are when they cannot be decomposed into subwords (bottom-right). The segmentation results are checked for morphological plausibility using a finite-state automaton in order to reject invalid segmentations (e.g., segmentations without stems or beginning with a suffix). Finally, each meaning-bearing subword is replaced by a language-independent semantic identifier, its MID, thus producing the interlingual output representation of the system (bottom-left). A comparison of the original input (top-left) and the interlingual representation (bottom-left) already reveals the degree of (hidden) similarity uncovered by the overlapping MIDs.
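The three normalization steps can be illustrated with a minimal Python sketch. The lexicon entries, their subword types, and all function names below are hypothetical stand-ins chosen for this illustration; the plausibility check only mimics the two constraints mentioned above (at least one stem, no leading suffix) and is not the finite-state automaton of the actual system.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon (hypothetical entries): subword -> (subword type, MID or None).
LEXICON: Dict[str, Tuple[str, Optional[str]]] = {
    "hypo": ("prefix", "#small#"),
    "thyroid": ("stem", "#thyre#"),
    "thyre": ("stem", "#thyre#"),
    "diagnos": ("stem", "#diagnost#"),
    "ism": ("suffix", None),
    "ose": ("suffix", None),
    "is": ("suffix", None),
    "e": ("suffix", None),
}

# Assumed transcription rules for German diacritics.
DIACRITICS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize_orthography(word: str) -> str:
    """Step 1: lower-case the word and transcribe diacritics."""
    return "".join(DIACRITICS.get(ch, ch) for ch in word.lower())


def plausible(parts: List[str]) -> bool:
    """Reject segmentations without a stem or beginning with a suffix."""
    types = [LEXICON[p][0] for p in parts]
    return "stem" in types and types[0] != "suffix"


def segment(word: str) -> Optional[List[str]]:
    """Step 2: first plausible segmentation, trying longer subwords first."""
    def search(rest: str, parts: List[str]) -> Optional[List[str]]:
        if not rest:
            return parts if plausible(parts) else None
        for end in range(len(rest), 0, -1):
            if rest[:end] in LEXICON:
                found = search(rest[end:], parts + [rest[:end]])
                if found is not None:
                    return found
        return None
    return search(word, [])


def to_mids(text: str) -> List[str]:
    """Step 3: replace meaning-bearing subwords by their MIDs."""
    output: List[str] = []
    for raw in text.split():
        word = normalize_orthography(raw)
        parts = segment(word)
        if parts is None:
            output.append(word)  # no valid segmentation: keep the word as is
        else:
            output.extend(LEXICON[p][1] for p in parts if LEXICON[p][1])
    return output


print(to_mids("Hypothyroidism"))  # ['#small#', '#thyre#']
print(to_mids("Hypothyreose"))    # ['#small#', '#thyre#']
```

With this toy lexicon, the English and the German word end up with the same MID sequence, which is the kind of overlap that Figure 1 illustrates.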
³ For instance, {head} ⇒ {zephal, kopf, caput, cephal, cabec, cefal} OR {leader, boss, lider, chef}
⁴ For instance, {myalg} ⇒ {muscle, muskel, muscul} ⊕ {pain, schmerz, dor}

3 Generation of Cognate Pairs

The manual construction of a trilingual lexicon and the thesaurus has consumed four person years. The combined subword lexicon contains (as of July 2005) 57,210 entries, with 21,501 for English, 21,705 for German, and 14,004 for Portuguese. In an effort to further expand the language coverage of MorphoSaurus to Spanish, French, and Swedish, we wanted to reuse the already available resources for Portuguese, English, and German in order to speed up and to ease the lexicon acquisition process. The procedure for doing so can be divided into three separate steps. First, cognate pairs for typologically related languages such as Portuguese-Spanish are generated. Second, the generated lexical hypotheses are checked for validity considering simple corpus statistics. In a last step, we use the cleaned list of validated cognates to augment, step by step, the target lexicons by processing parallel corpora in terms of co-occurrence patterns of hypothesized translation equivalents which are not cognates.

Table 1 lists the resources we used for the generation of cognate pairs:

• Manually established PORtuguese, ENGlish and GERman subword lexicons (stems and affixes).
• Manually created lists of SPAnish, FREnch, and SWEdish affixes. They were assembled by medical linguists based on introspection and heuristic support from various dictionaries.
• Medical corpora for all languages involved, all acquired from heterogeneous WWW sources.
• Word frequency lists, which were automatically generated from these corpora.

Table 1: Resources Used for the Generation of Cognates

         Seed Lexicon         Corpus
Lang.    Stems     Affixes    Types      Tokens
POR      14,004    858        133,146    13,400,491
GER      21,705    680         17,151       161,952
ENG      21,501    540         11,349        56,317
SPA      -         824         82,431     3,979,051
FRE      -         197         43,105     2,284,646
SWE      -         633         47,823       957,904

3.1 Subword Candidates
For the initialization of the target subword lexicons we pursued the following strategy: from the Portuguese (alternatively, English and German) lexicon, identical and similarly spelled Spanish (French, Swedish) subword candidates were generated. As an example, the Portuguese word stem ‘estomag’ [‘stomach’] is identical with its Spanish cognate, while ‘mulher’ [‘woman’] (Portuguese) is similar to ‘mujer’ (Spanish). Similar subword candidates were generated by applying a set of string substitution rules, some of which are listed in Table 2. In total, we used 44 rules for Portuguese-Spanish, 26 rules for German-French, 18 rules for English-French, 19 rules for German-Swedish, and 6 rules for English-Swedish. These rules were all formulated by medical linguists based on introspection, also using various dictionaries for heuristic guidance. Some of these substitution patterns cannot be applied to starting or ending sequences of characters in the source subword. This constraint is captured by a wildcard (‘+’ in Table 2), which stands for at least one arbitrary character.

Table 2: Some String Substitution Rules

Language Pair          Rule        Source      Target
POR-SPA (44 rules)     lh → j      mulher      mujer
                       +ca → za    cabeca      cabeza
GER-FRE (26 rules)     or → eur    tumor       tumeur
                       s → z       gas         gaz
ENG-FRE (18 rules)     o → ou      movement    mouvement
                       ve → f      nerve       nerf
GER-SWE (19 rules)     ei → e      bein        ben
                       +aa+ → a    saal        sal
ENG-SWE (6 rules)      ph → f      phosphor    fosfor
                       ce → s      iceland     island

Based on these string substitution rules and the already available (Portuguese, English, German) lexicons, for each entry (excluding affixes) of these sources, all possible Spanish, French and Swedish variant strings were generated. This led, on average, to 8.8 Spanish variants per Portuguese subword (ranging from 2.7 for highly frequent four-character words to 355.2 for infrequent 17-character words). Since the rule set is much smaller for the other language pairs, their average is far lower than for Portuguese-Spanish (cf. Table 3).

Table 3: Variant Generation. For each language pair (first column), the total number of variants is given in the second column. Columns three to five show variant averages per length.

Language Pair    #Variants    4-chars    17-chars    over-all
POR-SPA          123,235      2.7        355.2       8.8
GER-FRE           68,999      2.0          9.1       3.2
ENG-FRE           46,122      1.6          5.6       2.2
GER-SWE          145,423      2.7         14.6       6.7
ENG-SWE           68,803      1.8         15.3       3.2

All generated Spanish, French, and Swedish variants were subsequently compared with the target language word frequency list previously compiled from the text corpora. Wherever a (purely formal) prefix string match (in the case of stems) or an exact match (for invariants) occurred, the matching string was listed as a potential Spanish (French, Swedish) cognate of the Portuguese (alternatively, English and German) subword it originated from.

Whenever several substitution alternatives for a source subword had to be considered, that particular alternative was chosen which had the most similar lexical distribution in the corpora considered. Similarity was measured as follows: Let S be the source lexical item, C_S the source language corpus containing n tokens, and V_1, V_2, ..., V_p the hypotheses generated from S that match the target language corpus C_T, containing m tokens. With f(x, y) denoting the frequency of a word x in a corpus y, that particular V_j (1 ≤ j ≤ p) was chosen for which

\[ \left| \frac{f(S, C_S)}{n} - \frac{f(V_j, C_T)}{m} \right| \]

was minimal. All other candidates were discarded. As a result, we obtained a list of putative Spanish (French, Swedish) subwords, each linked by the associated MID to its grounding source cognate in the Portuguese (alternatively, English and German) lexicon. We refer to these lists of cognate candidates as CC_SPA for Spanish, CC_FRE for French, and CC_SWE for Swedish.

Starting from 14,004 Portuguese, 21,705 German and 21,501 English subwords (cf. Table 1), the string substitution rules produced, for example, a total of 123,235 Spanish subword variants from the Portuguese lexicon (cf. Table 3). Matching these variants against the Spanish corpus and allowing for a maximum of one candidate per source subword, we identified 8,644 tentative Spanish cognates. Combining English and German evidence, 9,536 tentative French and 6,086 tentative Swedish cognates were found (cf. Table 4). Spanish candidates are linked to a total of 6,036 MIDs from their Portuguese correlates (hence, 2,608 synonym relationships have also been hypothesized), whilst French (Swedish) candidates are associated with 6,622 (4,157) MIDs from their German and English correlates (cf. Table 4).
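The variant generation and the frequency-based choice among competing variants can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the original system: the `Rule` representation, the encoding of the ‘+’ wildcard as context flags, the treatment of a stem’s frequency as the summed frequency of all corpus words it prefixes, and the toy frequency lists are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Rule(NamedTuple):
    lhs: str
    rhs: str
    left_context: bool = False   # '+' before the pattern: not word-initial
    right_context: bool = False  # '+' after the pattern: not word-final


# Two of the Portuguese-Spanish rules from Table 2.
RULES_POR_SPA = [Rule("lh", "j"), Rule("ca", "za", left_context=True)]


def generate_variants(subword: str, rules: List[Rule]) -> Set[str]:
    """All strings obtained by optionally applying rules, left to right."""
    results: Set[str] = set()

    def expand(pos: int, built: str) -> None:
        if pos == len(subword):
            results.add(built)
            return
        expand(pos + 1, built + subword[pos])  # copy the character unchanged
        for rule in rules:
            end = pos + len(rule.lhs)
            if not subword.startswith(rule.lhs, pos):
                continue
            if rule.left_context and pos == 0:
                continue
            if rule.right_context and end == len(subword):
                continue
            expand(end, built + rule.rhs)      # apply the rule at this position

    expand(0, "")
    return results


def relative_frequency(stem: str, freq: Dict[str, int]) -> float:
    """Share of corpus tokens whose word form starts with the stem."""
    hits = sum(count for word, count in freq.items() if word.startswith(stem))
    return hits / sum(freq.values())


def select_cognate(source: str, rules: List[Rule],
                   source_freq: Dict[str, int],
                   target_freq: Dict[str, int]) -> Optional[str]:
    """Pick the matching variant with the most similar relative frequency."""
    src = relative_frequency(source, source_freq)
    matching = sorted(v for v in generate_variants(source, rules)
                      if any(w.startswith(v) for w in target_freq))
    if not matching:
        return None
    return min(matching,
               key=lambda v: abs(src - relative_frequency(v, target_freq)))


por = {"mulher": 5, "mulheres": 3, "homem": 2}   # toy frequency lists
spa = {"mujer": 7, "mujeres": 1, "hombre": 2}
print(select_cognate("mulher", RULES_POR_SPA, por, spa))  # mujer
```

Applying rules only to the original string, position by position, keeps the enumeration finite even for rules such as o → ou, whose output would otherwise match again.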
3.2 Validation Using Parallel Corpora
We take advantage of the availability of large parallel corpora in the biomedical domain in order to identify false friends, i.e., similar words in different languages with different meanings. In our experiments, we found, e.g., the Spanish subword candidate *‘crianz’ for the Portuguese ‘crianc’ [‘child’] (the normalized stem of ‘criança’). The correct translation of Portuguese ‘crianc’ to Spanish, however, would have been ‘nin’ (the stem of ‘niño’), whilst the Spanish ‘crianz’ refers to ‘criac’ [‘breed’] (stem of ‘criação’ in Portuguese).

Table 4: Selected Cognates

Language Pair        Source Lexicon    Selected Cognates    Linked MIDs
POR-SPA              14,004            8,644                6,036
GER-FRE              21,705            6,817                5,398
ENG-FRE              21,501            7,861                6,023
Combined Evidence                      9,536                6,622
GER-SWE              21,705            4,249                3,308
ENG-SWE              21,501            4,140                3,208
Combined Evidence                      6,086                4,157

The corpora are made available by the Unified Medical Language System (UMLS 04), an umbrella system which currently combines more than one hundred heterogeneous medical terminologies (thesauri, classifications), most of them available in several languages. Entries of these different nomenclatures are linked to each other via the UMLS Metathesaurus, which makes it possible to extract parallel corpora for various languages. Unfortunately, word-to-word translation occurs only in very few cases. More often one encounters rather complex noun phrases with a similarly complex semantic structure. Examples for typical English-Spanish alignments are “Cell Growth” aligned with “Crecimiento Celular”, or “Heart transplant, with or without recipient cardiectomy” aligned with “Trasplante cardiaco, con o sin cardiectomia en el receptor”. We use English as the pivot language for our experiments, since it has the broadest coverage in the UMLS. The size of the corpora derived from the linkages of the English UMLS to other languages amounts to 60,526 alignments for English-Spanish,⁵ 17,130 for English-French, and 10,953 alignments for English-Swedish.

⁵ We only focused on the so-called preferred entries.

In order to determine the false friends in the lists of generated cognate pairs (CC_SPA, CC_FRE and CC_SWE), the parallel corpora of the aligned UMLS expressions were then morpho-semantically processed as described in Section 2. Whenever the same MID occurred on both sides after this simultaneous bilingual processing, the Spanish (French or Swedish, alternatively) subword entry that led to this particular MID was taken to be a valid entry. We think that this approach is reasonable, since it is highly unlikely that a false friend occurs within the same translation context.

Table 5: Cognates Matching the UMLS Alignments

Language Pair    Hypotheses    Valid
POR-SPA          8,644         3,230 (37.4%)
GER/ENG-FRE      9,536         3,540 (37.1%)
GER/ENG-SWE      6,086         1,565 (25.7%)

Those hypotheses which never matched in this validation procedure were rejected from the candidate lexicons. As a result (cf. Table 5), 37% of the Spanish and French as well as 26% of the Swedish hypotheses are kept. These now serve as the seed lexicons (in the following, L(0)) for acquiring additional lexical entries, which are not cognates to elements of any of the source lexicons.
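The validation criterion can be made concrete with a small sketch. It assumes, for illustration only, that the English side of each alignment has already been indexed into a set of MIDs and that a candidate stem matches a target word by simple prefix comparison; the MID names, the data, and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One alignment unit: (MIDs of the English side, words of the target side).
Alignment = Tuple[Set[str], List[str]]


def validate_candidates(candidates: Dict[str, str],
                        alignments: List[Alignment]) -> Dict[str, str]:
    """Keep a candidate if its MID also occurs on the English side of
    at least one alignment unit in which the candidate itself occurs."""
    valid: Dict[str, str] = {}
    for source_mids, target_words in alignments:
        for subword, mid in candidates.items():
            occurs = any(word.startswith(subword) for word in target_words)
            if occurs and mid in source_mids:
                valid[subword] = mid
    return valid


# Hypothetical Spanish candidates with the MIDs inherited from Portuguese.
candidates = {"celul": "#cell#", "crianz": "#child#"}
alignments = [
    ({"#cell#", "#grow#"}, ["crecimiento", "celular"]),
    ({"#breed#"}, ["crianza"]),
]
print(validate_candidates(candidates, alignments))  # {'celul': '#cell#'}
```

In this toy run the false friend ‘crianz’ is rejected because its inherited MID never co-occurs with it in an aligned unit, whereas ‘celul’ is confirmed by the “Cell Growth” alignment.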
4 Lexical Learning Using Parallel Corpora
The parallel corpora derived from the UMLS and the lexicons with validated cognates both serve as starting points for a continuation of the lexical acquisition process, as described in Algorithm 1. In order to illustrate this process, assume the Swedish subword ‘blod’ was identified as being a cognate to the English subword ‘blood’ (and, therefore, is included in L(0)). Then, the yet unknown Swedish word ‘blodtryck’, which has the English translation ‘blood pressure’ in the UMLS Metathesaurus, gets segmented into [ST:blod|UK:t|SF:r|UK:yck], with ST being a marker for a stem, SF for a suffix and UK for an unknown sequence, thus satisfying the condition in line 12 of the algorithm. At the same time, the morpho-semantic normalization of ‘blood pressure’ leads to the sequence of MIDs [#blood #tense], whilst the normalization of ‘blodtryck’ leads to [#blood], since ‘tryck’ is not yet part of the Swedish lexicon. Comparing these two representations, the condition in line 13 of the algorithm is satisfied, since there is exactly one more MID resulting from English which cannot be found in the Swedish normalization result. The invalid segment is then reconstructed (‘t⊕r⊕yck’) by eliminating those substrings that led to a matching MID (‘blod’) in the aligned unit (‘blodtryck’) (line 15). The supernumerary MID resulting from the English normalization is assigned to that remaining substring (line 17 in the algorithm). After processing all UMLS alignments, this new entry is then incorporated in the Swedish lexicon as a stem, resulting in the lexicon L(1) (line 26). In the next run, in which all UMLS alignments are processed once again, this newly derived lexicon entry may serve for extracting, e.g., the Swedish word ‘luft’ with its identifier #aero from the UMLS entry ‘air pressure’ (English, indexed to [#aero #tense]) linked to ‘lufttryck’ (Swedish). When no new entries can be generated using this method (quiescence), the algorithm stops.

Algorithm 1: Bootstrapping Algorithm for Lexical Acquisition

 1: MSI: morpho-semantic indexing procedure from Section 2 (maps sequences of words to sequences of MIDs and remainders)
 2: current ← 0
 3: quiescence ← false
 4: while not quiescence do
 5:   the lexicon for MSI is set to L(current)
 6:   the list of new entries is empty
 7:   for all AU_i, i ∈ [1,n] (UMLS alignment units) do
 8:     AU_S ← source language part of AU_i
 9:     AU_T ← target language part of AU_i
10:     MID_S ← MSI(AU_S)
11:     MID_T ← MSI(AU_T)
12:     if for exactly one word there is an invalid segmentation (checked by the FSA) in MID_T then
13:       if there is exactly one more MID in MID_S than in MID_T then
14:         mid ← supernumerary MID from MID_S
15:         entry ← restore the invalid segment and remove substrings that led to a matching MID in MID_S and MID_T;
16:         strip off potential suffixes from entry, if the remaining substring is longer than 4 (thus, avoiding too short entries);
17:         add entry together with the associated mid to new entries
18:       end if
19:     end if
20:   end for
21:   if new entries is empty then
22:     quiescence ← true
23:   else
24:     current ← current + 1
25:     copy L(current − 1) to L(current)
26:     add all entries from new entries to the lexicon L(current)
27:   end if
28: end while

Table 6 depicts the growth steps of the target lexicons for the entire bootstrapping process (new entries in comparison to each previous step are in brackets). In the first run, for Spanish, 3,587 new lexemes are added to the lexicon, which comes to a size of 6,817 including those lexemes already generated by the cognate identification routines (cf. Table 5). For French, 2,023 new lexemes were generated in the first step and for Swedish only 759. Remarkably, these Swedish entries lead to the acquisition of 1,361 new lexemes in the next step. After 14 runs, learning comes to an end with 7,154 lexemes generated for Spanish, while for French 5,734 lexicon entries are acquired after 6 runs. Finally, for Swedish, 4,148 lexemes were learned after 9 iteration steps.
Table 6: Lexicon Growth Steps (∆ in brackets)

          Spanish          French           Swedish
L(0)      3,230            3,545            1,565
L(1)      6,817 (3,587)    5,568 (2,023)    2,324 (759)
L(2)      7,001 (184)      5,720 (152)      3,685 (1,361)
L(3)      7,094 (93)       5,730 (10)       4,013 (328)
L(4)      7,108 (14)       5,733 (3)        4,119 (106)
L(5)      7,109 (1)        5,734 (1)        4,136 (17)
...       ...              ...              ...
L(14)     7,154 (45)       5,734 (0)        4,148 (12)
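The control flow of Algorithm 1 can be sketched in Python. This is a strongly simplified illustration, not the actual implementation: the MSI procedure and its finite-state check are replaced by a greedy longest-match lookup, the English side of each alignment is assumed to be given as a list of MIDs, and the suffix list and function names are hypothetical.

```python
from typing import Dict, List, Tuple

SUFFIXES = ("er", "en", "et")  # hypothetical target-language suffix list


def index_word(word: str, lexicon: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Greedy longest-match indexing: (MIDs found, unknown substrings)."""
    mids: List[str] = []
    unknown: List[str] = []
    pos, new_run = 0, True
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                mids.append(lexicon[word[pos:end]])
                pos, new_run = end, True
                break
        else:  # no lexicon entry starts here: extend the unknown substring
            if new_run:
                unknown.append("")
                new_run = False
            unknown[-1] += word[pos]
            pos += 1
    return mids, unknown


def strip_suffix(segment: str) -> str:
    """Strip a suffix only if the remainder is longer than 4 characters."""
    for suffix in SUFFIXES:
        if segment.endswith(suffix) and len(segment) - len(suffix) > 4:
            return segment[: -len(suffix)]
    return segment


def bootstrap(seed: Dict[str, str],
              alignments: List[Tuple[List[str], List[str]]]
              ) -> Tuple[Dict[str, str], int]:
    """Repeat passes over all alignments until no new entry is found."""
    lexicon, runs = dict(seed), 0
    while True:
        new_entries: Dict[str, str] = {}
        for source_mids, target_words in alignments:
            target_mids: List[str] = []
            unknown: List[str] = []
            for word in target_words:
                mids, rest = index_word(word, lexicon)
                target_mids += mids
                unknown += rest
            extra = [m for m in source_mids if m not in target_mids]
            # exactly one unknown segment and exactly one supernumerary MID
            if (len(unknown) == 1 and len(extra) == 1
                    and len(source_mids) == len(target_mids) + 1):
                new_entries[strip_suffix(unknown[0])] = extra[0]
        if not new_entries:  # quiescence
            return lexicon, runs
        lexicon.update(new_entries)
        runs += 1


seed = {"blod": "#blood"}
units = [(["#blood", "#tense"], ["blodtryck"]),
         (["#aero", "#tense"], ["lufttryck"])]
lexicon, runs = bootstrap(seed, units)
print(lexicon, runs)
```

On the two alignment units from the running example, the first pass learns ‘tryck’ from ‘blodtryck’, the second pass uses it to learn ‘luft’ from ‘lufttryck’, and the third pass finds nothing new, so the loop stops.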
5 Quality Checking of Derived Lexicons
For lexicon generation, we referred to English-Spanish, English-French, and English-Swedish corpora compiled out of the UMLS Metathesaurus. To estimate the quality of the interlingual connections between the newly derived lexicons, we now compare the results after running the morpho-semantic indexing system (the function MSI from Algorithm 1) on these collections, at each stage of the lexical acquisition. We are aware that these results probably include overfitting phenomena. Therefore, we additionally extracted Spanish-French (13,158), Spanish-Swedish (8,993) and French-Swedish (6,713) aligned entities from parallel corpora from the UMLS. The alignments range, again, from word-to-word translations (e.g., Spanish ‘pierna’ to Swedish ‘ben’ [‘leg’]) to complex noun phrases, which sometimes correspond to a single word in the other language, e.g., the Spanish phrase ‘enfermedad virica transmitida por artropodos, no especificada’ maps to the Swedish ‘arbovirusinfektioner’ [‘arbovirus infections’] in the UMLS.

Rather than only examining the coverage of the acquired lexicons, we wanted to estimate the quality of the generated lexicons (admitting that they are far from complete), i.e., the validity of the interlingual synonymy relations we stipulate. For this goal, we indexed the English-Spanish, English-French, and English-Swedish corpora on which the lexical acquisition was based, employing the MSI routines for all lexicon levels, L(0)-L(14). Furthermore, the Spanish-Swedish, Spanish-French, and French-Swedish corpora, previously unseen by the learning algorithm, were processed accordingly. For each alignment unit of the corpora, we then compared the resulting MIDs using the following measure of indexing consistency:

\[ C_{AU_i} = \frac{100 \cdot A}{A + N + M} \]

The indexing consistency of one alignment unit (AU_i) of the parallel corpus, C_{AU_i}, depends on A, the number of MIDs that co-occur on both sides of that unit in the parallel corpus, and on N and M, the numbers of MIDs that occur only on one of its sides. To express the overall consistency, the mean of C_{AU_i} over all alignment units of the corpus is calculated.

Table 7: Indexing Consistency (C), Coverage (Cov.) of Lexicons and Number of Identical Indexes (Ident.) at each Stage of Lexicon Generation. English-German Reference (n = 34,296): 56.9 Consistency, 96.9% Coverage, 29.8% Identical MIDs.

           English-Spanish (n = 60,526)    English-French (n = 17,130)     English-Swedish (n = 10,953)
Lexicon    C       Cov.(%)   Ident.(%)     C       Cov.(%)   Ident.(%)     C       Cov.(%)   Ident.(%)
L(0)       39.6    87.6       6.1          39.2    78.3      16.1          27.4    60.0      11.7
L(1)       47.5    95.5       9.7          52.5    90.5      27.3          29.8    63.3      18.4
L(2)       51.0    95.6      11.8          53.2    90.8      27.9          50.7    81.8      39.9
L(5)       52.0    95.7      12.4          53.2    90.9      27.9          56.3    85.6      42.6

           Spanish-French (n = 13,158)     Spanish-Swedish (n = 8,993)     French-Swedish (n = 6,713)
Lexicon    C       Cov.(%)   Ident.(%)     C       Cov.(%)   Ident.(%)     C       Cov.(%)   Ident.(%)
L(0)       34.9    73.6      17.2          21.4    53.8       8.9          32.4    66.7      17.9
L(1)       45.4    86.4      26.7          29.8    77.1      18.4          45.4    79.2      30.0
L(2)       45.7    86.7      27.0          40.6    80.1      23.9          45.8    79.5      30.0
L(5)       45.8    86.9      27.0          45.9    83.8      26.9          45.8    79.6      30.0

Table 7 depicts the over-all consistency values (columns 2, 5 and 8) starting from lexicon L(0) (only validated cognates) through L(1) and L(2) up to L(5) for all target languages (improvements after that step are only marginal, cf. Table 6). When processing the English-Spanish corpus with cognates only, consistency as measured by C is already about 40%. This surprisingly high value is due to the large number of overlapping medical terms in different Western European languages. Adding those entries acquired from bootstrapping the same corpus, consistency climbs to a maximum of 52%. As a reference item, the processing of an English-German corpus, which is also derived from the UMLS, yields 57% consistency, keeping in mind that the English and German lexicons were generated manually and provide very good coverage (as shown, e.g., in (Hahn et al. 04)). The processing of Spanish-French, Spanish-Swedish, and French-Swedish is particularly interesting, since the underlying corpora were not involved at all in the lexical acquisition. With consistency starting from 35% for cognates (Spanish-French), 46% is reached after 5 cycles of generating the target lexicons, for each of these language pairs.

Coverage was measured by counting those cases in which at least one MID occurs on both sides of the alignment units considered. For Spanish cognates only (L(0) in Table 7), (incomplete) alignments to English can be observed for 88% of the corpus. This value increases to 96% after 5 runs of bootstrapping the Spanish lexicon. For English-French, coverage reaches 91% (for English-Swedish 86%). For Spanish-French, Spanish-Swedish, and French-Swedish, surprisingly enough, coverage reaches 87%, 84%, and 80%, respectively. Again, as a reference, the processing of the English-German corpus yields 97% coverage. The number of cases in which both sides are indexed identically is given in Table 7, columns four, seven, and ten. The reference value for these figures is 30% for English-German.
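The three figures reported in Table 7 can be computed as in the following sketch. It assumes that each side of an alignment unit is represented as a set of MIDs (duplicates and order are ignored) and that a unit without any MID gets a consistency of 0; both choices, as well as the function names and the toy data, are assumptions made for this illustration.

```python
from typing import List, Set, Tuple

Unit = Tuple[Set[str], Set[str]]  # MIDs of the two sides of one alignment unit


def consistency(left: Set[str], right: Set[str]) -> float:
    """C = 100 * A / (A + N + M) for a single alignment unit."""
    a = len(left & right)   # MIDs on both sides
    n = len(left - right)   # MIDs only on the left side
    m = len(right - left)   # MIDs only on the right side
    total = a + n + m
    return 100.0 * a / total if total else 0.0


def evaluate(corpus: List[Unit]) -> Tuple[float, float, float]:
    """Mean consistency, coverage (%), and identically indexed units (%)."""
    size = len(corpus)
    mean_c = sum(consistency(l, r) for l, r in corpus) / size
    coverage = 100.0 * sum(1 for l, r in corpus if l & r) / size
    identical = 100.0 * sum(1 for l, r in corpus if l and l == r) / size
    return mean_c, coverage, identical


corpus = [
    ({"#blood", "#tense"}, {"#blood"}),   # C = 50
    ({"#cell"}, {"#cell"}),               # C = 100, identical
    ({"#aero"}, {"#up"}),                 # C = 0, not covered
    ({"#leg", "#pain"}, {"#leg", "#pain"}),  # C = 100, identical
]
print(evaluate(corpus))  # (62.5, 75.0, 50.0)
```

The example also shows why coverage is always at least as high as the share of identical indexes: every identically indexed unit shares at least one MID, but not vice versa.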
6 Related Work

The rise of the empirical paradigm in the field of machine translation is, to a large degree, due to the widespread availability of parallel corpora. They also constitute an important resource for the automated acquisition of translational lexicons (Turcato 98). Most approaches to multilingual lexical acquisition employ statistical methods, such as context vector comparison (Rapp 99; Widdows et al. 02; Déjean et al. 02) or mutual information (Fung 98), and require a seed lexicon of trusted translations. (Koehn & Knight 02) derived such a seed lexicon from German-English cognates which were selected by using string similarity criteria (a method also favored by (Ribeiro et al. 01)). (Barker & Sutcliffe 00) propose an alternative generative approach where Polish cognate candidates are created from an English word list using string mapping rules, an approach to cognate mapping also discussed by (MacWhinney 95) for second language acquisition of human learners.

The second issue concerns the processing of suitable corpora. Whilst (Widdows et al. 02) deal with parallel German-English corpora to enrich an existing multilingual lexicon (also taken from the UMLS Metathesaurus), (Rapp 99), (Déjean et al. 02) and (Fung 98) propose methods that require only weaker comparable corpora (cf. (Fung 98) for a linguistic distinction between both types of corpora). Furthermore, (Déjean et al. 02) incorporate hierarchical information from an external thesaurus for combining different evidence for lexical acquisition.

In contradistinction to these precursors, we propose a fully heuristic method for acquiring translations of subwords, instead of using statistics. This is made possible by the availability of relatively large and well aligned parallel corpora, as provided within the UMLS Metathesaurus. Finally, rather than acquiring bilateral word translations, our focus lies on assigning subwords to interlingual semantic identifiers.

Table 8: Overview of Selected Multilingual Resources (http://sky.fit.qut.edu.au/~middletm/cont_voc.html, last visited in January 2005)

Thesaurus                          #
Eurovoc                            13
GEMET                              19
UNESCO                             3
OECD                               4
Eurodicautom                       12
Europ. Education                   18
Europ. Schools Treasury Browser    13
AGROVOC                            6
Astronomy Thes.                    5

Subject: European Communities activities; science, politics, law, culture, economics, etc.; technical terminology; education, teaching, individual development; research, etc.; agriculture; astronomy
7 Conclusions

We have shown that a significant number of Portuguese, English and German subwords from the medical domain can be mapped to Spanish, French, and Swedish cognates by simple string transformations. With these seeds, we further enlarge the cognate lexicons by subwords which are not cognates. For the latter task, we used a specific aligned corpus, the UMLS Metathesaurus, and extracted those non-cognates in a bootstrapping way.

As for the generality of our approach, we rely on large aligned thesaurus corpora. Fortunately, large-coverage multilingual thesauri are already available for several relevant domains (cf. Table 8), both in terms of the number of languages covered and the number of alignment units available (e.g., on the order of 5 million for Eurodicautom). Hence, this approach bears further potential for lexicon acquisition tasks.
8 Acknowledgements
This work was partly supported by Deutsche Forschungsgemeinschaft (DFG), grant KL 640/5-2, and the European Network of Excellence Semantic Mining (NoE 507505).
References

(Barker & Sutcliffe 00) Gosia Barker and Richard F. E. Sutcliffe. An experiment in the semi-automatic identification of false-cognates between English and Polish. In AICS 2000 – Irish Conference on Artificial Intelligence and Cognitive Science. National University of Ireland Galway, 24-25 August, 2000.

(Déjean et al. 02) Hervé Déjean, Éric Gaussier, and Fatiha Sadat. An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In COLING 2002 – Proceedings of the 19th International Conference on Computational Linguistics, pages 218–224. Taipei, Taiwan, August 24 - September 1, 2002. Association for Computational Linguistics, 2002.

(Fung 98) Pascale Fung. A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora. In David Farwell, Laurie Gerber, and Eduard H. Hovy, editors, Machine Translation and the Information Soup. Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas – AMTA 98, volume 1529 of Lecture Notes in Computer Science, pages 1–17. Langhorne, PA, USA, October 28-31, 1998. Berlin: Springer, 1998.

(Hahn et al. 04) Udo Hahn, Kornél Markó, Michael Poprat, Stefan Schulz, Joachim Wermter, and Percy Nohama. Crossing languages in text retrieval via an interlingua. In RIAO 2004 – Conference Proceedings: Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval, pages 100–115. Avignon, France, 26-28 April 2004. Paris: Centre de Hautes Etudes Internationales d’Informatique Documentaire (CID), 2004.

(Koehn & Knight 02) Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In Unsupervised Lexical Acquisition. Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pages 9–16. Philadelphia, PA, USA, July 12, 2002. Association for Computational Linguistics, 2002.

(MacWhinney 95) Brian MacWhinney. Language-specific prediction in foreign language learning. Language Testing, 12(3):292–320, 1995.

(Markó et al. 05) Kornél Markó, Stefan Schulz, and Udo Hahn. Unsupervised multilingual word sense disambiguation via an interlingua. In AAAI’05 – Proceedings of the 20th National Conference on Artificial Intelligence & IAAI’05 – Proceedings of the 17th Innovative Applications of Artificial Intelligence Conference, pages 1075–1080. Pittsburgh, Pennsylvania, USA, July 9-13, 2005. Menlo Park, CA; Cambridge, MA: AAAI Press & MIT Press, 2005.

(Rapp 99) Reinhard Rapp. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 519–526. College Park, MD, USA, 20-26 June 1999. San Francisco, CA: Morgan Kaufmann, 1999.

(Ribeiro et al. 01) António Ribeiro, Gaël Dias, Gabriel Lopes, and João Mexia. Cognates alignment. In Proceedings of Machine Translation Summit VIII, pages 287–293. Santiago de Compostela, Spain, September 18-22, 2001.

(UMLS 04) UMLS. Unified Medical Language System. Bethesda, MD: National Library of Medicine, 2004.

(Schulz et al. 02) Stefan Schulz, Martin Honeck, and Udo Hahn. Biomedical text retrieval in languages with a complex morphology. In Stephen Johnson, editor, Proceedings of the ACL/NAACL 2002 Workshop on ‘Natural Language Processing in the Biomedical Domain’, pages 61–68. University of Pennsylvania, Philadelphia, PA, USA, July 11, 2002. New Brunswick, NJ: Association for Computational Linguistics (ACL), 2002.

(Schulz et al. 04) Stefan Schulz, Kornél Markó, Eduardo Sbrissia, Percy Nohama, and Udo Hahn. Cognate mapping: A heuristic strategy for the semi-supervised acquisition of a Spanish lexicon from a Portuguese seed lexicon. In COLING Geneva 2004 – Proceedings of the 20th International Conference on Computational Linguistics, volume 2, pages 813–819. Geneva, Switzerland, August 23-27, 2004. Association for Computational Linguistics, 2004.

(Turcato 98) Davide Turcato. Automatically creating bilingual lexicons for machine translation from bilingual text. In COLING/ACL’98 – Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics & 17th International Conference on Computational Linguistics, volume 2, pages 1299–1306. Montréal, Quebec, Canada, August 10-14, 1998. San Francisco, CA: Morgan Kaufmann, 1998.

(Widdows et al. 02) Dominic Widdows, Beate Dorow, and Chiu-Ki Chan. Using parallel corpora to enrich multilingual lexical resources. In M.G. Rodriguez and C. Paz Suarez Araujo, editors, LREC 2002 – Proceedings of the 3rd International Conference on Language Resources and Evaluation. Vol. 1, pages 240–245. Las Palmas de Gran Canaria, Spain, 29-31 May, 2002. Paris: European Language Resources Association (ELRA), 2002.