Cross-Language Information Retrieval with the UMLS Metathesaurus
David Eichmann School of Library and Information Science The University of Iowa david-eichmann@uiowa.edu
Miguel E. Ruiz School of Library and Information Science The University of Iowa mruiz@cs.uiowa.edu
Padmini Srinivasan School of Library and Information Science The University of Iowa
[email protected]
Abstract

We investigate an automatic method for Cross Language Information Retrieval (CLIR) that utilizes the multilingual UMLS Metathesaurus to translate Spanish and French natural language queries into English. Two experiments are presented using OHSUMED, a subset of MEDLINE. Both experiments examine retrieval effectiveness of the translated queries. However, in the second experiment, the query translation procedure is augmented with digram based vocabulary normalization procedures. In this comparative study of retrieval effectiveness the measures used are: 11-point-average precision score (11-AvgP); average interpolated precision at recall of 0.1; and noninterpolated (i.e., exact) precision after 10 retrieved documents. Our results indicate that for Spanish the UMLS Metathesaurus based CLIR method appears equivalent to multilingual dictionary based approaches investigated in the current literature. French yields less favorable results, and our analysis suggests that linguistic differences may have caused the performance differences.
1 Introduction

Cross Language Information Retrieval (CLIR) refers to retrieval when the query and the database are in different languages. This form of retrieval is increasingly relevant as network-based resources become commonplace. There are several ways of handling CLIR. One approach that has received significant attention is to translate the query, thereby transforming the CLIR problem into a monolingual information retrieval (MLIR) problem for which there are standard solutions [2, 8]. A second approach is to translate the document [19]. A third approach receiving increasing attention is to automatically establish associations between queries and documents independent of language difference [6, 10, 21]. CLIR methods involving machine translation systems, bilingual dictionaries, parallel and comparable collections are currently being explored. Multilingual thesauri (or controlled vocabularies), however, are an underrepresented class of CLIR resources.

We present here an investigation of the UMLS (Unified Medical Language System) [20] Metathesaurus, a product of the National Library of Medicine, as a resource for free-text retrieval against a MEDLINE test database (English) given Spanish and French queries. In Oard’s hierarchical classification scheme of the CLIR methods [17], our work falls under the thesaurus based free-text CLIR category. In pure thesaurus based retrieval, documents and queries are matched through their thesaurus based representations, with document representations derived by an indexer and query representations provided by users. Extending this to CLIR is straightforward given a multilingual thesaurus. However, there are at least two problems: it can be difficult for users to think in terms of a controlled vocabulary [17]; and this retrieval method ignores the free-text portions of documents during retrieval. Our thesaurus based CLIR approach seeks to overcome both problems, allowing free-text user queries and considering the free-text portions of documents during retrieval. More generally, this research is motivated by the fact that, relative to dictionaries and collection based strategies, thesauri remain unexplored in the recent CLIR context.

2 Background and Related Work
Major approaches for CLIR include bilingual dictionaries [3, 7, 14], parallel collections [4, 6, 7, 10] and comparable collections [26] or some combination of these. Documents of a comparable collection may be aligned at the document, sentence or even word level. Comparable collections raise interesting research questions, such as alignment strategies and the measurement of ‘domain shift’ as explored, for example, by Oard [17].

Methods based on dictionaries typically begin by deriving a transfer dictionary specifying term equivalences across languages, which is then applied to query translation. Hull and Grefenstette use an online English-French dictionary to translate 50 queries from the TIPSTER collection [14]. Query words are first morphologically reduced to their root forms and then substituted by dictionary equivalents, yielding an average precision at 5, 10, 15 and 20 retrieved documents of 0.235, compared to an MLIR baseline of 0.393. As the authors and other researchers point out, this method is troubled by incomplete dictionaries and ambiguities in translation. Disambiguation strategies are typically employed to reduce translation errors. Davis [7] explores three different disambiguation strategies while using the Collins English-Spanish dictionary to translate 25 Spanish queries into English. The first uses part of speech (POS) information to constrain translation. The second uses a parallel corpus aligned at the sentence and double-sentence levels to select Spanish terms that have the most in common with the English query. The third combines POS with a corpus based refinement strategy, yielding the best performance, 73.5% of MLIR (non interpolated average precision), with the corpus based strategy adding the last 6%. In an earlier paper Davis and Dunning apply an evolutionary approach, optimizing for translated query performance using parallel collections [9].

The dictionary based CLIR work by Ballesteros and Croft [3] takes a different slant to reducing translation ambiguity, exploring the value of pre- and post-translation query expansion strategies. Their evaluation with 25 queries of a TREC database using the Collins English-Spanish dictionary indicates that expanding the query both before and after translation reduces errors and yields 68% of the corresponding MLIR baseline. Their research also supports the findings of Hull and Grefenstette [14] that phrase translations are important for CLIR.

Corpus based methods have also been investigated independent of dictionaries. Davis and Dunning replace English query terms by the 100 most frequent terms in the top 100 documents retrieved from the Spanish side of an English-Spanish parallel collection. Similarly, the CL-LSI approach [10] uses a parallel training corpus to compute a mapping from sparse term-based vectors to short and dense conceptual vectors, suppressing cross language variations. More recently the generalized vector space model has shown good potential for CLIR [6].

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. SIGIR’98, Melbourne, Australia © 1998 ACM 1-58113-015-5 8/98 $5.00.
Thirteen groups participated in the CLIR track introduced in TREC-6, with documents and queries in German, English and French, and queries in Dutch and Spanish as well. Dictionary based CLIR was explored by several groups including New Mexico State University [8], University of Massachusetts [1], and the Xerox Research Center Europe [11]. Groups such as ETH [15], and a collaboration between the University of Colorado, Duke University and Microsoft [21], investigated corpus based methods. Of particular interest to us is the ETH approach using similarity thesauri constructed from comparable documents of the SDA collection in French and German. The construction process used derives from term similarity as described in [25]. Unique angles in TREC-6 include document translation based CLIR [19] explored by the University of Maryland using the LOGOS system. According to the authors, it appears that document translation performs at least as well as query translation. As reported in [24], another interesting angle in the CLIR track is the approach taken by Cornell University, wherein they exploit the fact that there are many similar looking words between French and English, i.e., near cognates. Interestingly, this assumption yielded good results in the English-French CLIR runs. As summarized by Schäuble and Sheridan [24], the TREC-6 CLIR results appear consistent with previous results in that the performances typically range between 50 and 75% of the corresponding monolingual baselines. From our perspective, it is evident that given the nature of the TREC collections, CLIR approaches based upon multilingual thesauri remain difficult to explore. Our approach
to CLIR in MEDLINE is to exploit the UMLS Metathesaurus and its multilingual components. Soergel describes a general framework for the use of multilingual thesauri in CLIR [27], noting that a number of operational European systems employ multilingual thesauri (such as UDC and LCSH) for indexing and searching. However, except for very early work with small databases [22], there has been little empirical evaluation of multilingual thesauri (controlled vocabularies) in the context of free-text based CLIR, particularly when compared to dictionary and corpus-based methods. This may be due to the expense of constructing multilingual thesauri, but this expense is unlikely to be any more than that of creating bilingual dictionaries or even realistic parallel collections. In fact, ongoing efforts such as the EC-funded EuroWordNet project indicate that such resources can be built collaboratively and semiautomatically [12]. In EuroWordNet, the goal is to extend the WordNet thesaurus [16] to include Dutch, Italian, Spanish and English words. Multilingual thesauri can be built quite effectively by merging existing monolingual thesauri [27]; the UMLS Metathesaurus is an excellent current example.

Combining the UMLS Metathesaurus with a MEDLINE test database enables an empirical investigation of a high quality multilingual thesaurus as a resource for free-text based CLIR using two broad approaches: document translation and query translation. We investigate query translation based CLIR here. Our approach is independent of stemmers, part of speech taggers and parsers. (We wish to get baseline results first before involving these additional techniques.) Comparable approaches include those conducted using bilingual dictionaries and similarity thesauri. In general these strategies yield performance scores in the range of 50 to 75% of the corresponding monolingual baselines. Our goal is to assess the UMLS Metathesaurus based CLIR approach within this context.
The reader is referred to the technical report by Oard and Dorr for an excellent review of the CLIR literature [18].

3 Methods

3.1 OHSUMED Test Set

We utilize the OHSUMED test database, a subset of the MEDLINE database, extracted for retrieval research [13]. This database is accompanied by a collection of 106 English language queries¹. For our cross language experiments, these 106 queries are first translated into Spanish by a native Spanish speaker and into French by the Translation Laboratory at the University of Iowa. The Spanish/French versions are then translated back into English by our automatic method. The original English queries provide our baseline performance.

3.2 Retrieval System

We use SMART [23] to identify appropriate Spanish/French UMLS phrases for each query and to run the retrieval experiments. For all but 5 queries, relevant document subsets are known. We use the 233,445 document subset that contains abstracts and MeSH phrases for each document.

¹ We use the corrected versions of these queries. Please see ftp://medir.ohsu.edu/pub/ohsumed for details.
3.3 Unified Medical Language System (UMLS)
The UMLS, a vocabulary system produced by the National Library of Medicine, has four components: the Metathesaurus, Semantic Network, Information Sources Map and the SPECIALIST Lexicon [20]. We use only the Metathesaurus, an integration of more than 40 independent vocabularies in the health care domain. The Metathesaurus model involves the notions of ‘concept,’ ‘term’ and ‘string.’ Lexical variants are linked under the same term, while variations (such as case) only define independent strings, with certain strings designated as preferred forms for each concept. The 1997 Metathesaurus contains 331,756 concepts, 571,768 terms and 739,439 strings.

The Metathesaurus is multilingual. French, Spanish, Portuguese and German translations of the MeSH subset of the UMLS are linked to their Concept hierarchies. There are 23,198, 23,093, 18,429 and 18,277 MeSH concepts with Spanish, Portuguese, German and French strings respectively. We investigate both Spanish (the highest represented) and French (the least represented) languages in the UMLS. This pair will allow us to examine the effect of representation level on CLIR performance.

3.4 Transfer Dictionaries Derived from the UMLS

The remaining resource used is a transfer dictionary, which we derive from the multilingual subset of the Metathesaurus. A transfer dictionary specifies phrase equivalences tied to common concepts². The Spanish information (23,198 concepts, 32,282 unique strings and 22,891 unique words) and French information (18,277 concepts, 25,932 unique strings, and 18,179 unique words) form the foundation of our approach. Each language has an index file (mrwx.spa for Spanish and mrwx.fre for French), provided as part of the UMLS, which contains the unique Spanish/French words found in the Metathesaurus and links them to their associated Concept numbers. The indexes hence also serve as indexes for our transfer dictionaries.

The Spanish or French query arrives as a (potentially ill-formed) sentence. The Spanish/French MeSH entries of the transfer dictionaries derived from the Metathesaurus contain phrases. Hence we must first identify appropriate Spanish/French MeSH phrases for a query before translating these into English using the dictionaries. The effectiveness of the CLIR process depends upon this first non-trivial categorization step. The simplest selection strategy is to use the word based indexes (mrwx.spa and mrwx.fre) for the Spanish/French MeSH phrases to pull out all phrases that contain at least one of the query words³. However, we would like to identify the ‘set’ of Spanish/French MeSH phrases for the query as a ‘whole.’ For example, we would like more important query terms to have a greater role in the selection of MeSH terms than less important terms. We would like to weight both the query words and the MeSH phrases by their statistical features (IDF, DF etc.) and consider these weights during phrase selection. A standard index lookup procedure ignores such weights. Finally, comparable to the disambiguation strategies in dictionary based research, MeSH phrases selected in this SMART based procedure go through a pruning phase to remove irrelevant entries as described in Section 3.6.

² Researchers have remarked upon the non-trivial effort required in deriving a transfer dictionary [14] from bilingual dictionaries. In contrast, our phrase equivalences are created using straightforward UNIX shell commands such as grep. Recently Brown [4] tested a relatively straightforward method for deriving a transfer dictionary from a sentence aligned parallel corpus.
³ This method is similar to those used to determine word-by-word translations from dictionaries [2, 14].
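The simplest selection strategy of Section 3.4 (pull out every phrase that shares at least one word with the query) can be illustrated with a minimal Python sketch. It is not the UMLS file format or our actual tooling: the row layout, the function names `build_word_index` and `lookup_phrases`, and the concept identifiers are hypothetical, and only the Spanish and English phrases are taken from the running example of this paper.

```python
from collections import defaultdict

# Hypothetical toy rows: (concept_id, spanish_phrase, english_phrase).
# The concept identifiers are invented for illustration only.
ROWS = [
    ("C0000001", "cancer", "cancer"),
    ("C0000002", "estrogenos", "estrogens"),
    ("C0000003", "terapia de reemplazo de estrogeno", "estrogen replacement"),
    ("C0000004", "artroplastia de reemplazo", "joint prosthesis"),
]


def build_word_index(rows):
    """Map each Spanish word to the set of concepts whose phrases contain it."""
    index = defaultdict(set)
    for concept_id, spanish, _english in rows:
        for word in spanish.split():
            index[word].add(concept_id)
    return index


def lookup_phrases(query_words, rows, index):
    """Return every row whose concept is linked to at least one query word."""
    concepts = set()
    for word in query_words:
        concepts |= index.get(word, set())
    return sorted(row for row in rows if row[0] in concepts)


WORD_INDEX = build_word_index(ROWS)
CANDIDATES = lookup_phrases(["terapia", "reemplazo"], ROWS, WORD_INDEX)
```

Because every matching word counts equally here, a stopword such as ‘de’ would pull in almost every multiword phrase, which is one reason an unweighted lookup is only a starting point.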
3.5 Selecting Spanish/French MeSH Phrases for Queries

We first create a SMART database (see Table 1) for each language using its UMLS index file. There are a total of 22,891 unique Spanish word entries in the mrxw.spa index and hence in the database, and 18,179 records in the French index database. These are indexed by SMART without stemming following the removal of stopwords⁴, using the atc weighting scheme. Two separate index vectors are created, one for the .W field and the other for the .C field. We then retrieve database records for the free-text Spanish/French queries, indexed using the atn scheme. We retrieve by comparing the words in the Spanish/French query with the .W field of the database records and analyzing the top N records to identify the M most important concepts. It is essentially in this step that SMART offers the advantage of weights to distinguish between the concepts. We then temporarily assign the Spanish/French phrases corresponding to the selected M concepts to the query⁵. Readers familiar with previous query expansion work with MEDLINE [28] and TREC [5] may recognize the ‘retrieval feedback’ or ‘nearest neighbor’ flavor in this approach.

A sample query and the Spanish concepts identified in this step appear at the top of Table 2. (Since the procedures used are identical for both Spanish and French, we limit our examples and tables to Spanish for simplicity.) Mask numbers (i.e., word positions in the query, ignoring trivial words) appear next to non-trivial Spanish query words. For each concept, the table shows the concept#, string#, Spanish phrase, mask# values and English phrase.
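To make the top-N records / top-M concepts idea of Section 3.5 concrete, here is a minimal Python sketch. It is a simplified stand-in and not SMART's atc/atn weighting: the function name `select_concepts`, the rarity-based weight, and the summing of record weights per concept are all assumptions made for illustration. The defaults N = 10 and M = 15 are the values reported in this paper.

```python
import math
from collections import Counter


def select_concepts(query_words, records, n_records=10, m_concepts=15):
    """Pick the m_concepts highest scoring concepts for a query.

    records maps a word (the .W field) to its list of concept ids (the .C field).
    """
    if not records:
        return []
    # Assumed weight: words linked to fewer concepts are treated as more informative.
    max_fanout = max(len(concepts) for concepts in records.values())
    scored = []
    for word, count in Counter(query_words).items():
        if word in records:
            weight = count * math.log(1 + max_fanout / len(records[word]))
            scored.append((weight, word))
    # Keep the top N matching records, best first (ties broken alphabetically).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    concept_scores = Counter()
    for weight, word in scored[:n_records]:
        for concept in records[word]:
            concept_scores[concept] += weight
    ranked = sorted(concept_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:m_concepts]]
```

The point of the sketch is only that concepts supported by several informative query words rise to the top, which an unweighted index lookup cannot express.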
3.6 Refining the Selected Set of Spanish/French MeSH Phrases

Many of the phrases are irrelevant to the query, as shown in Table 2, so we next refine this set, selecting from a number of strategies. In combination strategies, each refinement step acts only upon the Spanish query that remains after the previous refinement step. We use the example of Table 2 to explain these strategies.

• Full Matches (FM): Only MeSH phrases composed entirely of query words are retained. We always carry out this refinement procedure. (The remaining optional refinement strategies are tested only following this full match criterion.)
  Selected Spanish Phrases: causa, cancer, pecho, estrogenos
  Final English Query: causation, cancer, thorax, estrogens

⁴ We use a 351 word Spanish stoplist and a 355 word French stoplist.
⁵ Given the database schema, each query word is going to retrieve at most one database record. Thus the best value for N is the number of informative query words. After examining the test queries, this was set to 10. Since we have follow up refinement steps in our CLIR approach, we set M, the number of concepts identified for each query, to 15.
Database Schema
  Field   Explanation
  .I      Record ID
  .W      Spanish word (single word) in index of UMLS
  .C      List of Concept numbers in which the Spanish word occurs

Example Database Record
  Field   Value
  .I      600
  .W      agudo
  .C      C0000727 C0032291 C0036436 C0242934

Table 1: SMART Database: Schema and Example Record
[Table 2 body: English query, Spanish query with mask numbers, and one row per selected concept with columns Concept#, String#, Spanish Phrase, Mask# and English Phrase. The cell contents are illegible in this copy.]

Table 2: Example to Illustrate Refinement Procedures.
Strategy (Baseline: 0.2431)   Spanish         French
FM                            0.1559 (64%)    0.1117 (46%)
FM+PM                         0.1597 (66%)    0.1040 (43%)
FM+D                          0.1610 (66%)    0.1276 (52%)
FM+A                          0.1673 (69%)    0.1329 (55%)
FM+PM+D                       0.1682 (69%)    0.1028 (42%)
FM+PM+A                       0.1707 (70%)    0.1090 (45%)
FM+D+A                        0.1737 (71%)    0.1493 (61%)
FM+PM+D+A                     0.1728 (71%)    0.1084 (45%)

Table 3: Experiment 1: 11-AvgP scores.
• Partial Matches (PM): Partial matching phrases that best cover the remaining portion of the query are retained. We sort phrases by mask combination and retain the shortest phrase corresponding to each unique combination, ignoring stopwords when calculating phrase length. If more than one phrase qualifies, we choose the one with the smallest String# (given by the UMLS producers). Applying this strategy after the FM strategy yields:
  Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
  Selected Spanish Phrases: terapia de reemplazo de estrogeno, artroplastia de reemplazo
  English Translations: estrogen replacement, joint prosthesis
  Final English Query: causation, cancer, thorax, estrogens, estrogen replacement, joint prosthesis

• Word Based Translation (D): A special word based translation dictionary limited to the (remaining) query words is built as follows. For a given word, identify the Spanish MeSH phrases containing it and extract the corresponding English MeSH phrases. Then list the words in these phrases and select the most frequent word as the translation. Applying this strategy after the FM strategy yields:
  Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
  Selected English Phrases: therapy, replacement
  Final English Query: causation, cancer, thorax, estrogens, therapy, replacement
  Thus the remaining query words ‘terapia’ and ‘reemplazo’ are correctly translated.

• Addition of Spanish query words (A): Any remaining Spanish query words are simply added to the final query. Applying this strategy after the FM strategy yields:
  Remaining Spanish Query: de la terapia (4) de reemplazo (5) de
  Final English Query: causation, cancer, thorax, estrogens, terapia, reemplazo

We tested the above refinement steps in several combinations, with FM included as each combination’s initial step. Note that the same refinement procedures are used for the French collection.

4 Retrieval Experiment 1

The goal is to evaluate the retrieval effectiveness of the UMLS Metathesaurus based query translation strategies. Searches are conducted against the free-text, i.e., title and abstracts, of the OHSUMED documents. For each run the baseline is defined by retrieval using the original English query that came with the OHSUMED database⁶. Indexing is done using stemming and after eliminating stopwords. Based on previous experimentation, ann weights are used on documents and atn on queries⁷.

4.1 Performance Measures

We use three performance measures. The first is 11-AvgP, the average of precision at 11 standard recall points (0.0, 0.1, 0.2, ..., 1.0). In CLIR, given the expense of translation, a user is likely to be interested in the top few retrieved documents. Thus, our second measure is average interpolated precision at 0.10 recall. However, since the actual numbers behind this level of recall can vary considerably across queries, we also compute the noninterpolated (i.e., exact) precision scores for the top ranking documents, and focus particularly on the top 10 documents.

4.2 Results and Analysis

Table 3 presents the 11-AvgP scores. Abbreviations refer to the particular refinement strategies (FM: full match; PM: partial match; D: dictionary based; and A: simple addition of Spanish/French query words). The FM+D+A refinement strategy is the best, achieving 71% and 61% of baseline for Spanish and French respectively⁸. It is not surprising that strategy A improves performance, since medical terms in Spanish, French and English often have the same Latin roots. Table 4 indicates that for the FM+D+A runs, the average interpolated precision at 0.10 recall, i.e., when 10% of the relevant documents have been retrieved, is 0.3270 (71% of baseline) and 0.2698 (59% of baseline) for Spanish and French respectively. Table 5 shows that the performance achieved within the top 10 ranks is 79% and 51% respectively. Across all measures the performance range achieved is 71% to 79% (Spanish) and 51% to 61% (French) of monolingual performance.

Recall       Baseline   Spanish     French
             (Precision)
0.00         0.5454     [missing]   0.3682
0.10         0.4600     0.3270      0.2698
0.20         0.3598     0.2578      0.2196
0.30         0.2980     0.2117      0.1735
0.40         0.2476     0.1683      0.1485
0.50         0.2182     0.1487      0.1310
0.60         0.1722     0.1184      0.0993
0.70         0.1722     0.0967      0.0873
0.80         0.1075     0.0773      0.0653
0.90         0.0777     0.0535      0.0484
1.00         0.0492     0.0341      0.0316
11-AvgP      0.2431     0.1737      0.1493
% baseline   100%       71%         61%

Table 4: Experiment 1: Precision Scores at 11 Standard Recall Points.

⁶ We recognize that this baseline ignores any effects of the translation of the original English query into Spanish and French by our native speaker.
⁷ Since document and query lengths do not vary significantly there is no need to normalize the weights.
⁸ The ‘A’ indicates that any remaining Spanish/French words were simply added to the final query. For example, there are 98 untranslated Spanish query words (81 unique words) out of a total of 538 query words (356 unique words) for the runs represented in Table 4 and Table 5.
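The 11-AvgP measure of Section 4.1 and the "% of baseline" figures in Tables 3 and 4 can be illustrated with a short Python sketch. It follows the standard definition of interpolated precision (the best precision at any recall at or above each recall point) for a single query; that this matches the evaluation code used with SMART in every detail is an assumption, and the function names are hypothetical.

```python
def eleven_point_average_precision(ranking, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("query has no relevant documents")
    # Record (hits so far, rank) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits, rank))
    total = 0.0
    for level in range(11):  # level / 10 is the recall point
        # Interpolated precision: best precision at any recall >= level / 10.
        # Integer comparison avoids floating point trouble at the boundaries.
        candidates = [h / r for h, r in points if 10 * h >= level * len(relevant)]
        total += max(candidates, default=0.0)
    return total / 11


def percent_of_baseline(score, baseline):
    """Express a score as a rounded percentage of the monolingual baseline."""
    return round(100 * score / baseline)
```

For example, `percent_of_baseline(0.1737, 0.2431)` gives 71, the Spanish FM+D+A figure in Table 3. The per-query values would then be averaged over all queries to obtain the reported 11-AvgP.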
Table 5: Experiment 1: Exact Precision Scores.

5 Further Exploration
The previous experiment did not involve any morphological normalization with stemmers in the query translation process. When we select Spanish/French MeSH terms from the Metathesaurus for the queries, exact match criteria are employed (see Section 3.5). Unfortunately, there are a number of instances where some vocabulary normalization may help. For example, the Spanish queries #25 and #30 have the word ‘aislado’. Although the Metathesaurus does not contain this word, it contains the morphological variants ‘aislada’, ‘aisladores’ and ‘aislados’. As an alternative to stemming we explore digram based vocabulary normalization methods. Digrams are determined prior to the SMART based matching process described in Section 3.5. The Spanish/French query is first modified using the digram based method and then sent into the SMART procedure to identify appropriate MeSH concept phrases. Thus query words that do not occur in the Spanish/French Metathesaurus are substituted (if possible) by the closest matching words based on the digrams method.

5.1 Digram Based Matching

Each Metathesaurus Spanish/French word and each query word is represented by its set of digrams. Similarity is computed using Dice’s Coefficient:

    Sim(Query-word, Meta-word) = 2N / (P + Q)    (1)

where P and Q are the number of digrams in each word and N is the number in common. (Note that computed similarity can be greater than one when a digram occurs more than once in either word.) We test two selection strategies, with exact matches selected in both.

• Q’: Select a single Metathesaurus word with similarity >= 0.8. If this fails, retain the original query word.
• Q”: Apply an additional length constraint, where word length equals the number of unique digrams. Select all words with similarity >= 0.8 whose word length differs by at most 1. This will select ‘aislada’ and ‘aislados’ for the query word ‘aislado’ but reject ‘aisladores’.

Table 6 shows examples of queries transformed through both alternatives. Note that stopwords are removed and the query words alphabetized in the transformed queries⁹. Words in the original query which do not appear in the Spanish/French Metathesaurus, and are therefore candidates for digram based substitution, are highlighted, as are their substitutions.

⁹ Since our retrieval tests employ a “word” based approach, alphabetization has no effect.
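The digram matching of Section 5.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our implementation: digrams are counted as unique character pairs (so the similarity here never exceeds one), Q’ is taken to pick the single most similar word, and Q” falls back to the original word when nothing qualifies. The function names are hypothetical.

```python
def digrams(word):
    """Set of unique adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(query_word, meta_word):
    """Dice's coefficient 2N / (P + Q) over unique digrams."""
    a, b = digrams(query_word), digrams(meta_word)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def select_q1(query_word, vocabulary, threshold=0.8):
    """Q': the single most similar word at or above the threshold, else the original."""
    if query_word in vocabulary:  # exact matches are always kept
        return query_word
    best = max(sorted(vocabulary), key=lambda w: dice(query_word, w), default=None)
    if best is not None and dice(query_word, best) >= threshold:
        return best
    return query_word


def select_q2(query_word, vocabulary, threshold=0.8):
    """Q'': all similar words whose unique digram count differs by at most 1."""
    if query_word in vocabulary:
        return [query_word]
    size = len(digrams(query_word))
    matches = sorted(
        w for w in vocabulary
        if dice(query_word, w) >= threshold and abs(len(digrams(w)) - size) <= 1
    )
    # Assumption: keep the original word when no substitute qualifies.
    return matches or [query_word]
```

With the vocabulary {‘aislada’, ‘aisladores’, ‘aislados’} and the query word ‘aislado’, `select_q2` keeps ‘aislada’ and ‘aislados’ and rejects ‘aisladores’ on the length constraint, as described above.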
5.2 Digram Based Results

The previous CLIR runs were repeated with the difference of including digram based vocabulary normalization procedures in the query translation process. All three performance scores for the two alternative query formats (Q’ and Q”) were computed. Similar to the previous experiment, the best runs are obtained with the FM+D+A combination of refinement strategies for both query types and languages. The digram approach without the additional length constraints, i.e., Q’, consistently gives better results compared to Q”. The best result obtained for Spanish is 0.1832 11-AvgP (75% baseline); 0.3493 average interpolated precision at 0.1 recall (76% baseline); and 0.2179 exact precision at 10 retrieved documents (81% baseline). For French, the corresponding figures are 0.1647 (68% baseline); 0.2935 (64% baseline); and 0.1547 (58% baseline) respectively. In comparison with the results of the first experiment, the best digram run offers improvements in the range of 2 to 5% for Spanish and 5 to 7% for French depending upon the measure used. Other normalization methods, such as stemming, will be explored in the future.

5.3 Comparison of Spanish and French Results
In general it is clear that the French results are inferior to the Spanish results. The question asked at this point is whether the performance differences observed are due to their different levels of representation in the UMLS, or due to important differences in the languages that we are not considering, or perhaps due to both?¹⁰ We know that of the MeSH concepts, 23,198 yield a total of 32,282 Spanish strings and 18,277 MeSH concepts yield 25,932 French strings. Interestingly, except for a single concept, all concepts with French strings also have Spanish strings¹¹. However, 4,922 of the MeSH concepts with Spanish strings do not have corresponding French strings. Thus for all practical purposes we may consider the French concepts to be a proper subset of the Spanish concepts.

In order to study the effect of the difference in representation further, we first reduced the Spanish concepts to those 18,276 concepts which were also available in French. We refer to this set as ‘Spanish-reduced’ and use ‘Spanish’ for the original set of 23,198 Spanish concepts. To our surprise, the difference between Spanish-reduced and Spanish from the viewpoint of our collection of 106 Spanish queries is minimal. The query set has 538 words (after excluding stopwords), out of which 82 do not occur in the Spanish concepts and only an additional 5 do not occur in Spanish-reduced. Thus we do not expect to see any differences in retrieval performance by moving from the set of Spanish concepts to the subset that also has French translations¹². Thus we may conclude that the difference in the level of representation in the UMLS across the two languages does not cause the difference in CLIR performance observed for this query set.

¹⁰ Of course the assumption behind this question is that the translations in both languages are of equal quality.
¹¹ The one exception is concept C0005403 with the French string ‘Reflux Biliare’.
¹² This expectation was supported when we repeated the FM+D+A run of Table 3 on Spanish-reduced. The performance increased slightly to 0.1745. The slight increase may be explained by the fact that the 5 additional terms without representation in Spanish-reduced are of low frequency and are not very informative, such as ‘adversos’.
[Table 6 body: original, Q’ and Q” versions of sample Spanish and French queries. The query text is illegible in this copy.]

Table 6: Sample Query Transformation through Digram Based Strategies
Class (difference from baseline)                       Spanish   French
Better than baseline
  Class 1 (at least 50%)                               7         12
  Class 2 (10 to 30%)                                  7         8
Equivalent to baseline
  Class 3 (absolute difference between 0 and 10%)      40        22
Worse than baseline
  Class 4 (10 to 30%)                                  10        11
  Class 5 (30 to 70%)                                  18        18
  Class 6 (more than 70%)                              24        35

Table 7: Distribution of query-by-query performance.
Table 8: Examples of query terms and performance
If we assume that the translations are of equal quality then our conclusion is that there are important differences in the two languages that our CLIR algorithm has yet to consider. This is also indicated by the fact that the addition of the heuristic ‘PM’ always degrades the French results but not the Spanish results. Future work is planned to examine these aspects in further detail.
6
Query-by-Query Analysis
For this analysis we took the best performance of the Spanish and French queries and compared it against the baseline. We obtained the precision of each of the 106 queries and computed the percentage difference with respect to the baseline. The queries were grouped into six classes, which are presented in Table 7. Each cell presents the number of queries in that class. We observe that the translation process improves several queries. Surprisingly, French queries generate 20 significantly improved translated queries, in contrast to 14 generated by the Spanish queries. We also observe that the Spanish translation generates 40 queries that perform equivalently to the baseline. Table 8 shows four of the queries from classes 1 and 6. We observe that in query 87 both translations perform significantly better than the baseline. In query 105 the Spanish translation is better than the baseline but the French translation performs very poorly. The contrary happens in query 44. In those cases where the translation performs better than the baseline, the process has introduced a new important term that was not present in the original English query. Query 61 is an example of a case where both translations perform worse than the baseline. In those cases, the translation process failed to translate an important term of the query. This type of analysis will allow us to refine our methods in future research.
7
Conclusions
We have explored a CLIR method for MEDLINE using only the multilingual Metathesaurus for query translation. No tools such as part of speech taggers, stemmers or separate corpora are involved. The approach begins by first selecting an initial set of Spanish/French MeSH phrases that is appropriate for the query as a whole. This set is then refined using alternative strategies. The best performance is achieved by: first selecting phrases that contain only query words; then translating any remaining query words on a word-by-word basis; and finally retaining the remaining Spanish/French query words.
We tested the translated queries on OHSUMED with the monolingual retrieval results as the baseline. Three versions of translated queries were tested over two retrieval experiments. The second and third query versions (Q’ and Q”) involved digram based vocabulary normalization procedures. In general, the best free-text based retrieval performance is at 75% of baseline MLIR performance in 11-AvgP; 76% of baseline in average precision at 0.1 recall and 81% of baseline in exact precision at 10 retrieved documents for Spanish. French yields less favorable results, with best scores 68%, 65% and 58% respectively. We show that these differences in performance are not caused by differences in level of representation in the UMLS. The most likely cause is a difference in linguistic features.
When compared with previous results we see that Spanish CLIR using the Metathesaurus for query translation is on the high end of the performance range (50-75% of baseline scores) observed with approaches based on dictionaries with or without information extracted from corpora [2, 3, 7, 14]. As anticipated, performance is still behind dictionary independent methods using parallel corpora [10]. It remains to be seen if the addition of tools such as stemmers and relevant parallel or comparable corpora improves performance.
In addition to the use of the UMLS Metathesaurus (an excellent example of a collaboratively built vocabulary system), this study has a number of unique features. First, it involves SMART in selecting MeSH phrases for queries, which allows us to consider weights in this phase. Another unique feature is the exploration of a new and automatic method for deriving word based transfer dictionaries from phrase based transfer dictionaries. We also show that such dictionaries contribute to CLIR performance. This is also one of very few recent studies to empirically explore the value of multilingual thesauri or controlled vocabularies for CLIR. Moreover, we investigate how a controlled vocabulary can be used to conduct free-text based CLIR. Lastly, this research contributes to the strengthening of the international impact of MEDLINE.
Future work will build on these results by exploring alternate refinement strategies and appropriate second language stemmers. We also look forward to exploring the other languages of the UMLS: Portuguese and German.
Acknowledgements
The authors thank Professor Bill Hersh for generously providing the OHSUMED test database. We also thank Professor Roccio Guillen for translating the OHSUMED queries into Spanish and Dr. Gertrud Champe of the UI Translation Laboratory for translating the queries into French. Finally we thank the reviewers of this paper for their recommendations.
References
[1] J. Allan, J. Callan, W.B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. INQUERY does battle with TREC-6. In Proceedings of the Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[2] L. Ballesteros and W.B. Croft. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International DEXA Conference on Database and Expert Systems, pages 791-801, 1996. http://ciir.cs.umass.edu/info/psfiles/irpubs/ir.html.
[3] L. Ballesteros and W.B. Croft. Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 84-91, July 1997.
[4] R.D. Brown. Automated dictionary extraction for “knowledge-free” example-based translation. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, July 1997.
[5] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In D.K. Harman, editor, The Third Text Retrieval Conference (TREC-3), pages 69-80. NIST, November 1994.
[6] J.G. Carbonell, Y. Yang, R.E. Frederking, R.D. Brown, Y. Geng, and D. Lee. Translingual information retrieval: A comparative evaluation. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, August 1997.
[7] M. Davis. New experiments in cross-language text retrieval at NMSU’s computing research lab. In The Fifth Text Retrieval Conference (TREC-5), November 1996.
[8] M. Davis. Free resources and advanced alignment for cross-language text retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[9] M. Davis and T. Dunning. Query translation using evolutionary programming for multilingual information retrieval. In Fourth Annual Conference on Evolutionary Programming, August 1995. http://crl.nmsu.edu/users/madavis/Site/Book2/evolmltrl.ps.gz.
[10] S.T. Dumais, T.A. Letsche, M.L. Littman, and T.K. Landauer. Automatic cross-language retrieval using latent semantic indexing. In D Hull and D Oard, editors, 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997. http://www.clis.umd.edu/dlrg/filter/sss/papers/dumais.ps.
[11] E. Gaussier, G. Grefenstette, D.A. Hull, and B.M. Schulze. Xerox TREC-6 site report: Cross language text retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[12] J. Gilarranz, J. Gonzalo, and F. Verdejo. An approach to conceptual text retrieval using the Eurowordnet multi-lingual semantic database. In AAAI Symposium on Cross-Language Text and Speech Retrieval, March 1997.
[13] W. Hersh, C. Buckley, T. Leone, and D. Hickam. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In B Croft and C van Rijsbergen, editors, Proceedings of the 17th International Conference on Research and Development in Information Retrieval, pages 192-200. New York: ACM, August 1994.
[14] D.A. Hull and G. Grefenstette. Querying across languages: A dictionary-based approach to multilingual information retrieval. In H-P Frei, D Harman, P Schäuble, and R Wilkinson, editors, Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 49-57. ACM, July 1996.
[15] B. Mateev, E. Munteanu, P. Sheridan, M. Wechsler, and P. Schäuble. ETH TREC-6: Routing, Chinese, cross-language and spoken document retrieval. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[16] G.A. Miller. WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), 1990.
[17] D.W. Oard. Alternative approaches for cross-language text retrieval. In D Hull and D Oard, editors, AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.
[18] D.W. Oard and B.J. Dorr. A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19 CS-TR-3615, University of Maryland, April 1996.
[19] D.W. Oard and P. Hackett. Document translation for cross-language text retrieval at the University of Maryland. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[20] National Library of Medicine. Unified Medical Language System (UMLS) Knowledge Sources, 6th experimental edition. Bethesda, MD: NLM, 1997.
[21] B. Rehder, M.L. Littman, S. Dumais, and T.K. Landauer. Automatic 3-language cross-language information retrieval with latent semantic indexing. In Proceedings of The Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[22] G. Salton. Automatic processing of foreign language documents. Journal of the American Society for Information Science, 21(3):187-194, May 1970.
[23] G. Salton, editor. The SMART Retrieval System - Experiments in Automatic Document Processing. NJ: Prentice Hall, 1971.
[24] P. Schäuble and P. Sheridan. Cross-language information retrieval (CLIR) track overview. In Proceedings of the Sixth Text Retrieval Conference (TREC-6). Gaithersburg, MD: National Institute of Standards and Technology (NIST), November 1998.
[25] P. Sheridan and J.P. Ballerini. Experiments in multilingual information retrieval using the SPIDER system. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 58-65, August 1996.
[26] P. Sheridan, M. Wechsler, and P. Schäuble. Cross-language speech retrieval. In NJ Belkin, AD Narasimhalu, and P Willett, editors, Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 99-109. New York: ACM, July 1997.
[27] D. Soergel. Multilingual thesauri in cross-language text and speech retrieval. In D Hull and D Oard, editors, AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.
[28] P. Srinivasan. Retrieval feedback in medline. Journal of the American Society for Information Science, 3(2):157-167, 1996.