
MEDINFO 2004 M. Fieschi et al. (Eds) Amsterdam: IOS Press © 2004 IMIA. All rights reserved

Experiments in cross-language medical information retrieval using a mixing translation module

Tuan Duc Tran, Nicolas Garcelon, Anita Burgun, Pierre Le Beux
Laboratoire d’Informatique Médicale, Faculté de Médecine, Université de Rennes 1, Rennes, France

Abstract
Given the ever-increasing scale and diversity of medical literature published in English on the Internet, improving the performance of cross-language information retrieval is an urgent research objective. Cross-language medical information retrieval (CLMIR) consists of providing a query in one language and searching medical document collections in one or more different languages. Our intended users can read biomedical texts in English but have difficulty formulating English queries. This paper proposes a French/English CLMIR system built on a mixing model to support the retrieval of English medical documents. The method falls into the category of query translation: we use a hybrid machine translation that combines a pattern-based module with a rule-based translator and comprises three steps, from pre-translation to post-translation. In parallel with this hybrid machine translation, we use the multilingual UMLS Metathesaurus as a complementary translator. The results show that the mixing translation module outperforms the machine translation-based method and the thesaurus-based method used separately.

Keywords:
cross-language medical information retrieval, query translation, mixing translation module, hybrid machine translation-based approach, thesaurus-based approach.

Introduction
The world’s most widely cited medical journals are published in English. This has led to increasing research interest in cross-language medical information retrieval (CLMIR), where users present queries in one language to retrieve relevant documents written in English. Our potential users are able to read biomedical texts in English but have difficulty formulating English queries. According to Douglas W. Oard and Funda Ertunc (2002), there are two alternative approaches, query translation and document translation [1], and statistically query translation performs better than document translation [2]. The query translation approach can be subdivided into dictionary-based query translation (DQT) and machine translation-based query translation (MQT) [3]. The main problem with DQT is untranslated query terms [4]: the dictionary has a limited number of entries, so concept names such as new medical terms are not found in it. The machine translation-based approach is broadly used but, particularly in CLMIR, the relevant medical documents it retrieves are limited by inappropriate translation of medical terms. We currently target the retrieval of medical documents, and therefore the performance of our translation module depends heavily on the quality of the translation of medical terms.

With the aim of improving the quality of translated queries, and thereby retrieval performance, we propose a CLMIR method based on the query translation approach. It uses a mixing translation module composed of a hybrid machine translation, developed from our previous work on automated biomedical term translation [5], and a thesaurus-based translator derived from the multilingual UMLS Metathesaurus® (http://umlsks5.nlm.nih.gov). In this paper, we describe the functionalities of the mixing translation module, the machine translation-based approach and the thesaurus-based approach, as well as an experiment in cross-language medical information retrieval. We first give an overview of the query translation approach among current retrieval methods and describe our mixing translation module. We then present the experiment and finally comment on the results obtained.

Materials and Methods

Cross-language information retrieval (CLIR) methodologies can be classified into three principal approaches: query translation, document translation and interlingual representation [6]. The query translation approach, so far the most popular, is subdivided into three kinds of methods: dictionary-based methods, corpus-based methods and hybrid methods (bilingual aligned corpora and monolingual corpora). Among the methods that follow the query translation approach, we combine a hybrid machine translation-based approach with a thesaurus-based approach.

The hybrid machine translation-based approach combines a pattern-based module of an existing machine translation system with a rule-based translator. Starting from two monolingual corpora (French and English) used as comparable corpora on the subject of iron metabolism, we used two modules of the Xerox Terminology Suite (XTS, http://www.mkms.xerox.com/), TermOrganizer and TermFinder, to extract candidate terms first from the French corpus and then from the English corpus. The French and English candidate terms were first matched manually. We then proposed morphosyntactic conversion rules with the aim of automating the translation of French terms into English, supported by Systran® (http://www.systransoft.com/). This rule-based translator has helped improve medical terminology translation, reaching up to 70% of well-translated terms.

In parallel with the hybrid machine translation, query translation is combined with the multilingual UMLS (Unified Medical Language System) Metathesaurus® as a medical thesaurus-based query translation method. The translations are extracted from the 2003AB edition of the Metathesaurus, which has 900,551 concepts and 2.5 million concept names in its source vocabularies. It is implemented in our CLMIR system as a complementary translator. An automatic CLIR method that uses the multilingual UMLS Metathesaurus to translate Spanish and French queries into English was reported by Eichmann et al. [7], but French produced less favorable results than Spanish because of differences in linguistic features.

The translation step consists first of converting the accents into the web format, then submitting the query to Systran®, a machine translation system, and finally producing the translated expression.

The post-translation step consists of removing any remaining accents (e.g. é → e), then retranslating the words incorrectly translated by Systran® or by the pre-translation step, e.g. mutation → mutation [pre-translation] → change [translation] → mutation [post-translation]. Translated words that follow the mutation model are stored in a translation memory table, which is updated manually and progressively. This table also allows us to check for suffixal variants: for example, -ome can be translated as -oma or -ome (except for words ending in -some), so "entérotome" is translated as "enterotoma" or "enterotome". Both translated queries are then submitted to PubMed or Google, and the relevant documents are returned for the exactly translated query "enterotome". The translation is thus correctly achieved. In parallel with this translation, the program looks up a translation of the French query in the multilingual UMLS Metathesaurus. Finally, both translated queries produced by our CLMIR system are submitted to PubMed or Google to retrieve English documents. The results are displayed according to the user’s search option. Figure 1 describes our CLMIR system, including medical term translation.
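The post-translation step described above can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the function names, the `TRANSLATION_MEMORY` table and the choice to key that table on the machine translation output are assumptions made for illustration, and only the two examples quoted in the text (mutation, entérotome) are encoded.

```python
import unicodedata

# Hypothetical translation memory table (assumption): maps a word wrongly
# produced by the machine translation to the term that should be restored.
TRANSLATION_MEMORY = {"change": "mutation"}


def strip_accents(text):
    """Remove remaining accents, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suffix_variants(word):
    """Return the suffixal variants of a word: -ome gives -oma and -ome,
    except for words ending in -some."""
    if word.endswith("ome") and not word.endswith("some"):
        return [word[:-3] + "oma", word]
    return [word]


def post_translate(translated):
    """Return every candidate English query after post-translation."""
    words = [TRANSLATION_MEMORY.get(w.lower(), w)
             for w in strip_accents(translated).split()]
    candidates = [[]]
    for word in words:
        # Expand each partial candidate with every variant of the next word.
        candidates = [c + [v] for c in candidates for v in suffix_variants(word)]
    return [" ".join(c) for c in candidates]
```

Each returned candidate would then be submitted to the search engine, and the candidates that retrieve documents are kept, as the text describes for "enterotome".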

Our CLMIR system, based on a hybrid machine translation and a medical thesaurus-based query translation, is called the mixing translation module and is developed in the Perl programming language. It has a Web interface (CGI) that includes:
• a form in which to enter a French query and to choose a search option (exact expression or not, number of results per page), combined with two search engines: PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) and Google®;
• a translation module that translates the French query into an English query when the user clicks on the validation icon;
• a multilingual UMLS Metathesaurus translation module;
• a result section showing the translated query, the number of documents retrieved and the first 10 pages found on PubMed or Google.

The translation process of a French query proceeds in 3 steps: pre-translation, translation and post-translation. The pre-translation applies 5 rules (steps) to the query to be translated. The rules must be applied in the following order:
• Step 1: Replacement of accented capital letters with accented lowercase letters, e.g. É → é.
• Step 2: Translation of some special expressions anticipated by the system, e.g. “alcoolique” → alcoholic, “responsable de” → due to.
• Step 3: Application of the morphosyntactic translation rules (e.g. “ose” → osis).
• Step 4: Removal of the stop words: le, les, la, l’, du, de, des, d’, en, à, au, aux, par, un, une.
• Step 5: If exactly 2 words remain, the word order is reversed; otherwise the word order remains unchanged. If the query has more than seven words, the translation module will not function.
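The five pre-translation rules can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' Perl code: the rule tables contain only the examples quoted above, the names `pre_translate`, `SPECIAL_EXPRESSIONS`, `SUFFIX_RULES` and `ELISIONS` are hypothetical, and the seven-word limit is assumed to apply to the words remaining after stop-word removal.

```python
import re

# Rule tables limited to the examples given in the text (assumption: the
# real system holds many more entries).
SPECIAL_EXPRESSIONS = {"responsable de": "due to", "alcoolique": "alcoholic"}
SUFFIX_RULES = [("ose", "osis")]
STOP_WORDS = {"le", "les", "la", "l'", "du", "de", "des", "d'",
              "en", "à", "au", "aux", "par", "un", "une"}
ELISIONS = ("l'", "d'", "l’", "d’")
MAX_WORDS = 7


def pre_translate(query):
    """Apply the five pre-translation rules, in order, to a French query."""
    # Step 1: accented capitals become accented lowercase letters (É -> é).
    text = "".join(ch.lower() if ch.isupper() and not ch.isascii() else ch
                   for ch in query)
    # Step 2: translate the special expressions known to the system.
    for french, english in SPECIAL_EXPRESSIONS.items():
        text = re.sub(r"\b" + re.escape(french) + r"\b", english, text,
                      flags=re.IGNORECASE)
    # Step 3: morphosyntactic suffix rules (e.g. "ose" -> "osis").
    words = []
    for word in text.split():
        for suffix, replacement in SUFFIX_RULES:
            if word.lower().endswith(suffix):
                word = word[:-len(suffix)] + replacement
                break
        words.append(word)
    # Step 4: drop stop words, including elided articles glued to a word.
    kept = []
    for word in words:
        for prefix in ELISIONS:
            if word.lower().startswith(prefix):
                word = word[len(prefix):]
                break
        if word and word.lower() not in STOP_WORDS:
            kept.append(word)
    # Step 5: refuse long queries; reverse the order of two-word queries.
    if len(kept) > MAX_WORDS:
        raise ValueError("queries of more than seven words are not handled")
    if len(kept) == 2:
        kept.reverse()
    return " ".join(kept)
```

The order of the rules matters in this sketch just as in the text: "responsable de" has to be rewritten in Step 2 before Step 4 removes the stop word "de". For the query "métabolisme du fer", the sketch yields "fer métabolisme", which is then passed to the machine translation step.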

Figure 1 - The overall design of our CLMIR system

Results
From the bilingual query sets, we chose to run the French queries, which were then translated using the automatic translation module and the French-English UMLS Metathesaurus-based module. To evaluate whether the mixing translation module yields a performance gain in information retrieval, we focused our initial evaluation on translation capability. Furthermore, in the query translation approach, CLIR performance varies with the quality of translation, and we therefore avoided relevance-based assessment. Our initial efforts were limited to the French language, with a set of 1257 French queries. Figure 2 shows the effectiveness of our CLMIR system as well as the number of translated terms for each translation module among the 1257 French queries. In total, the queries retrieve 12.8 million English documents thanks to the combination of the hybrid MT-based approach and the UMLS Metathesaurus-based approach.

Footnotes: 1. http://www.systransoft.com/ 2. http://umlsks5.nlm.nih.gov/ 3. http://www.ncbi.nlm.nih.gov/pubmed/

As an example, Table 1 lists French queries translated into English by three approaches: Systran, UMLS and the hybrid translation module.

Table 1: Example illustrating that the hybrid translation module outperforms Systran and UMLS

French query              Systran                         UMLS              Hybrid translation module
------------------------  ------------------------------  ----------------  -------------------------
Hémochromatose            Hémochromatose [noise]          Hemochromatosis   hemochromatosis
Hémochromatose génétique  Hémochromatose genetic [noise]  [silent]          Genetic hemochromatosis
Surcharge en fer          overload out of iron [noise]    Iron overload     Iron overload
Mutation C282Y            Change C282Y [noise]            [silent]          C282Y mutation
Anémie aplastique         weaken aplastic [noise]         anémie, aplastic  aplastic anemia

To find the translated query that actually appears in the documents retrieved via PubMed, we compared the total number of documents retrieved with all possible translations provided respectively by the mixing translation module, the Systran machine translation and the multilingual UMLS Metathesaurus. A translated query is considered a good translation if it retrieves at least one English document that contains the translated query term. As can be seen in Figure 2, in all cases the mixing translation module enhances translation and performs better than Systran and the multilingual UMLS Metathesaurus. Overall, 65% of the query terms were translated exactly as in the manual translation, and 5% were translated similarly to the manual translation.
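The evaluation criterion used in the Results (a translated query counts as a good translation when it retrieves English documents containing it) and the pooling of candidates from the two translators can be sketched as below. Everything here is a hypothetical illustration: the thesaurus is reduced to a plain dictionary, `count_documents` stands for whatever function returns the number of hits from PubMed or Google for a query, and the `min_docs` threshold of 1 is an assumption about how the criterion is applied.

```python
def thesaurus_lookup(query, thesaurus):
    """Return the English concept name for a French query, or None when the
    thesaurus has no entry (the 'silent' case)."""
    return thesaurus.get(query.strip().lower())


def good_translations(candidates, count_documents, min_docs=1):
    """Keep the candidates that retrieve at least min_docs documents.

    count_documents is a caller-supplied function (hypothetical here) that
    returns the number of documents found for an English query.
    """
    seen = set()
    kept = []
    for candidate in candidates:
        if candidate is None:
            continue  # the translator produced nothing for this query
        key = candidate.lower()
        if key in seen:
            continue  # both translators proposed the same translation
        seen.add(key)
        if count_documents(candidate) >= min_docs:
            kept.append(candidate)
    return kept


def mix_translations(french_query, hybrid_candidates, thesaurus,
                     count_documents, min_docs=1):
    """Pool the hybrid MT candidates with the thesaurus translation and keep
    only the good translations."""
    pooled = list(hybrid_candidates)
    pooled.append(thesaurus_lookup(french_query, thesaurus))
    return good_translations(pooled, count_documents, min_docs)
```

Passing the document-counting function as a parameter keeps the sketch independent of any particular search engine interface; a caller would plug in its own PubMed or Google wrapper.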

Figure 2 - Results of cross-language medical information retrieval using a mixing translation module

The total proportion of good translations produced by the hybrid machine translation was about 70%, while that of the Systran machine translation was approximately 55%. The French queries are short and usually consist of terms 1 to 3 tokens long. Here is a typical example:
• métabolisme du fer → iron metabolism

Discussion
In our work, the French query (source language) is translated into an English query (target language). Table 1 shows example French queries translated into English by the three approaches.

Compared with the hybrid MT-based approach, the thesaurus-based approach using the multilingual UMLS Metathesaurus alone did not improve retrieval performance quantitatively: approximately 85% of the French queries could not be matched against existing concept names in the UMLS Metathesaurus because of the limitations of a controlled-vocabulary system. The main performance-limiting factor is the limited coverage of the thesaurus used in query translation. The untranslated terms included multi-word terms (e.g. genetic hemochromatosis), specific terms (e.g. HFE gene) and variants used in biomedical terminology practice. However, thanks to its high-quality translation as a widely available resource unique to medicine, the multilingual UMLS Metathesaurus seems to be the best complementary translator, because every translated query derived from it retrieved documents in which the well-translated term appears. As an automatic translator, our translation module outperformed Systran, a general-purpose machine translation system, improving translation quality by 14%. Overall, we found that thesaurus-based query translation seems to work best for short queries (single-term and two-word queries), while for long queries (long phrase terms) hybrid MT-based query translation performs better than thesaurus look-up. As an experiment, our CLMIR system currently supports only translation from French to English, as illustrated in Figure 3. In future work, the user interface will be extended to an oriental language (Vietnamese).

In the literature review of this paper we considered the main problems associated with general machine translation (e.g. improperly translated biomedical terms) and with thesaurus-based CLIR, namely untranslated terms and biomedical neologisms used in the source language. Our cross-language results suggest that the parallel implementation of hybrid machine translation as an automatic biomedical translator and the multilingual UMLS Metathesaurus can enhance translation capacity and improve effectiveness in document collections published in English. In this way, our CLMIR system (French query/English document) is useful for French-speaking students and researchers in biomedicine.

Figure 3 - Query form and search results presentation

Conclusion
From the CLMIR results, we learned that our CLMIR system, which uses a mixing translation module based on the hybrid MT-based approach and the thesaurus-based approach, significantly improves query translation results. Our automatic translation system seeks to translate French queries into English queries as an aid for human translators. The preliminary experiment shows promise as a tool whose translations are sufficiently accurate for the cross-language medical information retrieval task. We proposed using a mixing query translation module for CLMIR. This approach, which provides all possible translations, performed better than the MT-based approach and the thesaurus-based approach implemented separately. Furthermore, this CLMIR method is intended exclusively for French-speaking users who are able to read biomedical texts in English but have difficulty formulating English queries.

References
[1] Oard DW, Ertunc F. Translation-Based Indexing for Cross-Language Retrieval. In: Advances in Information Retrieval: 24th BCS-IRSG European Colloquium on IR Research, Glasgow, UK, March 25-27, 2002. Proceedings. Lecture Notes in Computer Science, Heidelberg: Springer-Verlag, 2002 Jan (2291); pp 324-333.
[2] Rosemblat G, Gemoets D, Browne AC, Tse T. Machine Translation-Supported Cross-Language Information Retrieval for a Consumer Health Resource. In: Musen M, ed. Proceedings of the 2003 AMIA Annual Symposium: Biomedical and Health Informatics: From Foundations to Applications.
[3] Oard DW. A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval. In: Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas, AMTA'98, Langhorne, PA, USA, October 1998. Proceedings. Lecture Notes in Computer Science, Heidelberg: Springer-Verlag, 1998 Jan (1529); pp 472-483.
[4] Pirkola A, Hedlund T, Keskustalo H, Järvelin K. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. In: Information Retrieval 2001: 4(3-4): 209-230.
[5] Tran DT, Burgun A, Garcelon N. Semi-automatic acquisition of bilingual terminology in molecular biology from the comparable corpora (article in French). In: Actes des Vièmes rencontres Terminologie & Intelligence Artificielle (TIA 2003). Strasbourg: LIIA-ENSAIS, 2003; pp 166-175.
[6] Fujii A, Ishikawa T. Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration. In: Computers and the Humanities, 2001 Nov: 35(4): 389-420.
[7] Eichmann D, Ruiz ME, Srinivasan P. Cross-Language Information Retrieval with the UMLS Metathesaurus. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, Melbourne, Australia, August 24-28, 1998. New York: ACM Press; 72-80.

Address for correspondence
Tuan Duc Tran and Pierre Le Beux
Laboratoire d’Informatique Médicale, Faculté de Médecine, Université de Rennes 1
Tel: (33) 299 284 215
Fax: (33) 299 284 160
Emails: [email protected] [email protected]
