TOWARDS AN INTERACTIVE MULTILINGUAL ENVIRONMENT
Éric Gaussier
Xerox Research Centre Europe
6, Chemin de Maupertuis
38240 Meylan F.
[email protected]
Abstract
We present here two simple systems developed to help people work in a multilingual environment. The first system allows one to perform searches in a multilingual document collection, whereas the second one helps in understanding documents written in foreign languages. These systems can be coupled together to provide substantial aid to people working in a multilingual environment.
1 Introduction
Cross-language information retrieval (CLIR), also known as multilingual IR or trans-lingual IR, has received much attention in recent years, from both the research community and industry. The challenge is to give a user access to the huge amount of textual data written in different languages, including, but not limited to, the user's mother tongue. Even though providing this access constitutes one of the first steps towards working in a multilingual environment, it may not be sufficient: the user may still be puzzled when looking at a relevant text written in an unknown language, or at a partial or imperfect translation of this text. That is, we have to help the user in her/his comprehension of texts written in foreign languages.
Different paradigms underlie cross-language information retrieval, depending on where the language barrier is crossed (footnote 1). There are at least three different places where one can go from one language to another: at the query level, at the document level, or at the level of a parallel or comparable corpus, each of them with advantages and disadvantages that we briefly present.
Footnote 1: This point of view issued from discussions at the SIGIR'97 Workshop on CLIR.
Query level: translating queries has the advantage of being fast, with almost no additional space needed to store translations; but the translation accuracy may be low inasmuch as the context size is small. Furthermore, such systems generally need to be supplemented with tools to help users understand documents in foreign languages.
Document level: this strategy is in a way symmetric to the preceding one. The processing time is usually high, and huge amounts of space are necessary to store the different translations of the documents. But the translation does not suffer a priori from the same weakness as in the above-mentioned systems, and no additional tool is required.
Corpus level: this strategy requires (semi-)aligned corpora, which we denote C, and is based on the following methodology: from a query in a given language L, retrieve documents in language L from C, and use their counterparts (i.e. the documents they are aligned with) in language L' to construct the query in L'. This strategy is interesting in several respects. First, the passage from one language to another relies on human translation rather than automatic translation. Secondly, the use of whole documents as queries performs a natural expansion of the original query, a technique often used in IR to improve the performance of a given system. The main drawback of this strategy resides in the restricted availability of aligned corpora, resources which either do not exist or are not usable for many domains. Furthermore, as in the first strategy, some comprehension aid has to be provided to understand documents.
Not surprisingly, one of the key issues in CLIR concerns Machine Translation, and more particularly the use of MT systems to translate documents or queries. Insofar as the representation of IR objects (queries and documents) is based on index terms, the quality of a system will be evaluated through its capacity to accurately translate index terms. That is, leaving aside the problem of comprehension, a system which performs a translation close to word-by-word, with appropriate lexical choices, is, with respect to the current state of IR, a good system for searching multilingual documents. Thus, the two main translation problems encountered in CLIR are dictionary coverage and word sense disambiguation. This explains why the first strategy has received attention from researchers. Of course, the comprehension problem still exists, and the question raised is whether or not MT systems are developed enough to tackle this problem. But this is another issue.
We want to present here the system we have developed for CLIR, based on query translation, as well as a system, called LOCOLEX, for comprehension aid. The two following sections will be devoted to these presentations. In the fourth section we will try to sketch some extensions we can dream of.
2 Cross Language Information Retrieval
Cross Language Information Retrieval (CLIR) addresses the problem of retrieving documents written in one language using queries written in another language. As document repositories grow in size and distribution (and one can consider the World Wide Web as such a repository), it is becoming more important to find solutions to this long-standing research problem [8]. We have built different Natural Language Processing components to perform this task, using SMART [9] as the indexing engine. These components characterize the transformations that a query undergoes during our translation process. For query construction, we rely on two models, weighted boolean and vector-space, the former implying a larger user involvement whereas the latter is designed to be used in a purely automatic way. In the next subsections, we describe the components used to create index terms and to translate queries, as well as the two systems we have at our disposal.
2.1 Monolingual Processing
The monolingual transformation of a query goes through the following steps:
part-of-speech tagging: our Xerox part-of-speech taggers (footnote 2) provide the user with lemmatized forms [7] of the words in the text,
noun phrase extraction and decomposition: using the tagged text we also extract entire noun phrases [10], and decompose complex noun phrases into two-word subparts (pairs),
stopword removal: stopwords are removed using both standard IR stopword lists (footnote 3) and grammatical word lists extracted from our lexicons,
stemming: individual words as well as words in noun phrases are derivationally stemmed,
pair sorting: the stemmed versions of pairs extracted from noun phrases are sorted in alphabetical order, as has been done in SMART since [2], in order to eliminate positional variation,
index terms: the individual words as well as the pairs thus obtained constitute the index terms. Pairs and words are stored separately in new fields which can be weighted differently.
This same treatment has been applied to documents for indexing. In past experiments we conducted within TREC [6], the addition of phrases derived in this way improved our average precision by 7% over baseline retrieval using simple stemmed words, for queries having many (more than four) relevant documents. For queries with four or fewer relevant documents, adding phrases improved average precision by 14%.
Footnote 2: http://www.xrce.xerox.com/research/mltt/Tools/pos.html
Footnote 3: See ftp://ftp.cs.cornell.edu/pub/smart
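The monolingual steps listed above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the Xerox implementation: the tagger output and the derivational lexicon are replaced by small hand-written tables (`TAGGED_QUERY`, `STOP_TAGS` and `STEMS` are hypothetical names), filled with the values of the sample topic treated in Section 2.2, and a noun phrase is approximated by two adjacent NOUN/ADJ words. Accented forms are assumed.

```python
# Toy stand-ins for the tagger output and the derivational lexicon
# (hypothetical data, taken from the sample topic of Section 2.2).
TAGGED_QUERY = [("l'", "DET"), ("éducation", "NOUN"), ("sexuelle", "ADJ")]
STOP_TAGS = {"DET", "PREP", "CONJ", "PRON"}            # grammatical words
STEMS = {"éducation": "éduquer", "sexuelle": "sexué"}  # derivational stems


def index_terms(tagged):
    """Return (single-word terms, pair terms) for a tagged query."""
    # Stopword removal: drop grammatical words on the basis of their tag.
    content = [(word, tag) for word, tag in tagged if tag not in STOP_TAGS]
    # Stemming: map each remaining word to its derivational stem.
    words = [STEMS.get(word, word) for word, _ in content]
    # Noun phrase decomposition (simplified): adjacent NOUN/ADJ words form a pair.
    pairs = []
    for (w1, t1), (w2, t2) in zip(content, content[1:]):
        if {t1, t2} <= {"NOUN", "ADJ"}:
            # Pair sorting: order the two stems to eliminate positional variation.
            stemmed = sorted([STEMS.get(w1, w1), STEMS.get(w2, w2)])
            pairs.append(" ".join(stemmed))
    return words, pairs


words, pairs = index_terms(TAGGED_QUERY)
print(words)  # single-word index terms
print(pairs)  # pair index terms
```

Words and pairs are returned separately, mirroring the separate fields mentioned above, so that each kind of index term can receive its own weight.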
2.2 Translating Queries
In this section we show the transformations performed on a sample query during translation. First, as in the monolingual case explained in the previous section, the query is part-of-speech tagged, noun phrases are extracted, stopwords are removed, and words and phrases are stemmed. Then, in order to produce a translated version of the query, each of the query terms is expanded (a reversal of the stemming process) to produce all its derivational variants. Each of these variants is looked up in a general-language bilingual dictionary. The translations are restemmed using a derivational stemmer for the target language. Here follows an example of this chain on a TREC topic.
Sample Treatment
Topic number 7 deals with sex education. The French title is L'éducation sexuelle. Let us imagine that this title is the entire French query and that we want to access English documents with it. The steps followed in the source language treatment of this query are the following:
Part-of-speech tag sequence | L'/DET éducation/NOUN sexuelle/ADJ
Stopword removal | éducation sexuelle
Lemmatisation (inflectional lexicon) | éducation sexuelle
Stemming (derivational lexicon) | éduquer sexué
Noun phrase extraction, stemming and alphabetical ordering | NP extracted: {éducation/NOUN sexuel/ADJ}; stemmed and ASCII ordered: sexué éduquer
If this title were the entire query, then the monolingual query would consist of the following stemmed index terms: éduquer, sexué, and the pair sexué éduquer. In order to translate this version of the query into English, the following additional steps are performed:
For each single word, reverse stemming producing the related lemmas: éduquer | éducation, éducatif, éducateur, éduquer
Translation of the lemmas (English): education, training, manners, educational, educative, educate, train, bring up
Stemming of the translated lemmas: educe, train, mannered, mannerism, bring up
Filtering of the stems obtained on the basis of their presence in the collection: educe, train, mannered, mannerism
For multiword expressions, generation of all possible combinations of the stems produced above: éduquer sexué | mannerism sex, mannered sex, sex train, educe sex, mannerism sexual, mannered sexual, sexual train, mannerism sexualiser, educe sexual, mannered sexualiser, sexualiser train, educe sexualiser
The derivational stemming algorithm used in these experiments is based on a new technique to automatically derive an approximation to derivational families using only a lexicon. Since the technique is currently under development, there are a number of problems which still need to be resolved. The algorithm is entirely automatic, meaning that like most traditional stemming algorithms, it will make a number of stemming errors. A manual correction step is planned for the future. On the other hand, this means that the algorithm is nearly language independent (for certain language families, given a lexicon), so we can develop derivational stemmers for new languages relatively easily. This is a key advantage for cross-language text retrieval.
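The translation chain of Section 2.2 (reverse stemming, dictionary lookup, restemming, filtering against the collection, and pair generation) can be sketched as below. The lexical resources are toy tables with hypothetical names (`DERIVATIONAL_FAMILY`, `BILINGUAL_DICT`, `TARGET_STEMS`, `COLLECTION_VOCAB`); the words come from the sample topic, but the way translations are split across lemmas, and the mapping from translations to English stems, are our assumptions.

```python
from itertools import product

# Toy resources (hypothetical); the vocabulary reproduces the sample topic.
DERIVATIONAL_FAMILY = {
    "éduquer": ["éducation", "éducatif", "éducateur", "éduquer"],
}
BILINGUAL_DICT = {  # assumed split of the translations across lemmas
    "éducation": ["education", "training", "manners"],
    "éducatif": ["educational", "educative"],
    "éduquer": ["educate", "train", "bring up"],
}
TARGET_STEMS = {  # assumed output of the English derivational stemmer
    "education": ["educe"], "educational": ["educe"],
    "educative": ["educe"], "educate": ["educe"],
    "training": ["train"], "train": ["train"],
    "manners": ["mannered", "mannerism"],
}
COLLECTION_VOCAB = {"educe", "train", "mannered", "mannerism",
                    "sex", "sexual", "sexualiser"}


def translate_term(stem):
    """Translate one source-language stem into target-language stems."""
    lemmas = DERIVATIONAL_FAMILY.get(stem, [stem])          # reverse stemming
    translations = [t for lemma in lemmas
                    for t in BILINGUAL_DICT.get(lemma, [])]  # dictionary lookup
    stems = []
    for translation in translations:                         # restemming
        for target_stem in TARGET_STEMS.get(translation, [translation]):
            if target_stem not in stems:
                stems.append(target_stem)
    # Keep only the stems that actually occur in the document collection.
    return [s for s in stems if s in COLLECTION_VOCAB]


def translate_pair(stems_a, stems_b):
    """Generate all combinations of two stem lists, each pair sorted."""
    return sorted({" ".join(sorted(p)) for p in product(stems_a, stems_b)})


education_stems = translate_term("éduquer")
sex_stems = ["sex", "sexual", "sexualiser"]  # taken as given from the example
pairs = translate_pair(education_stems, sex_stems)
print(education_stems)
print(len(pairs), pairs)
```

Filtering against the collection vocabulary is what removes "bring up" in this sketch, and the combination step yields the twelve pairs listed in the example.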
2.3 Possible Improvements
Translation of Non-compositional Phrases
As described above, our dictionary performs only single-word translation, modulo derivational variants. Before accessing the dictionary, we derivationally stem the word to be looked up, then generate all the other lemmatised forms of the same word, and concatenate the dictionary entries for all of these words. An equivalent strategy would be to derivationally normalize all the head words in the dictionary and conflate the translations of all the words stemming to the same form. When applied to phrases, this technique only works if the translation of the phrase is compositional. For example, éducation sexuelle can be translated word-by-word, modulo derivational variation, to sex education. But ours en peluche cannot be translated compositionally into teddy bear, since peluche only translates to plush and fluffy. One solution to this problem is to have an exhaustive list of the non-compositional phrases of a language, with their translations, and a mechanism for recognizing instances of these expressions. We plan to incorporate this in future experiments for the non-compositional expressions that are contained in our translation dictionaries. In addition to derivational variants, it would be useful to include close synonyms to palliate word choice variations. For example, we encountered a French topic dealing with air pollution and using the rare term pollution de l'atmosphère, whose derivational variants appear in the corpus one-half as frequently as the more common pollution de l'air. Similarly, organic farming was described in the French version as agriculture écologique rather than the more common, compositionally translatable agriculture biologique (footnote 4), which appears 75 times in the French newswire versus just 6 occurrences for agriculture écologique.
Footnote 4: Since one of the translations of biologique is organic and one of the translations of agriculture is farming.
Integration of a Language Guesser
Even though we do not want to detail the possible interfaces to CLIR systems, we would like to point out that some systems integrate a language guesser to guess the language used in a query, and automatically call the appropriate bilingual dictionaries. In such cases, the user does not have to select the language used. Of course, this functionality can also be used for document translation.
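The treatment of non-compositional phrases suggested above (consult a list of known phrases first, and fall back to word-by-word translation otherwise) might look like the following sketch. The tables `PHRASE_DICT`, `WORD_DICT` and `STOPWORDS` are hypothetical and contain just enough entries for the two examples discussed in the text; derivational expansion is left out for brevity.

```python
from itertools import product

# Hypothetical resources, limited to the examples discussed in the text.
PHRASE_DICT = {("ours", "en", "peluche"): ["teddy bear"]}
WORD_DICT = {
    "ours": ["bear"],
    "peluche": ["plush", "fluffy"],
    "éducation": ["education"],
    "sexuelle": ["sex", "sexual"],
}
STOPWORDS = {"en", "de", "la", "le"}


def translate_phrase(tokens, use_phrases=True):
    """Translate a phrase, preferring a listed non-compositional translation."""
    key = tuple(tokens)
    if use_phrases and key in PHRASE_DICT:
        return PHRASE_DICT[key]
    # Compositional fallback: combine the translations of the content words.
    content = [t for t in tokens if t not in STOPWORDS]
    options = [WORD_DICT.get(t, [t]) for t in content]
    return [" ".join(choice) for choice in product(*options)]


print(translate_phrase(["ours", "en", "peluche"]))
print(translate_phrase(["ours", "en", "peluche"], use_phrases=False))
print(translate_phrase(["éducation", "sexuelle"]))
```

With the phrase list disabled, the compositional fallback can only produce combinations of bear with plush or fluffy, which is the failure described above.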
2.4 Weighted Boolean vs. Vector Space
The vector space model is certainly one of the most widely used models in Information Retrieval. We will not describe it here but refer readers to [9]. This model is particularly useful when one wants to build queries from topics automatically, the only task performed by the user being to provide a topic, usually in the form of short sentences written in natural language.
On the other hand, our approach to manual query construction is based on a simplified weighted boolean model. The model assumes that each query can be divided into a number of concepts. Each concept consists of one or more terms, and the terms within a concept are combined using a weighted OR operator. The concepts are then combined using a weighted AND operator. In addition, the user is expected to assign a value of 1, 2, or 3 to each concept to indicate its importance in the query (1 = not important, 2 = important, 3 = mandatory). These values are used internally to adjust the concept weights before applying the weighted AND operator. A concept value of 3 leads to a strict boolean constraint, while lower values relax the constraint, allowing documents which do not contain any of the terms in the concept to be retrieved with a non-zero weight. A longer description of the probabilistic weighted boolean model is provided in [4]. The detailed mathematical formulation of the operators can be found in [5], which should soon be available.
Previous experiments [4] have found that the weighted boolean model is particularly effective for cross-language text retrieval, as it addresses two important problems in query translation. A primary source of error in CLIR is translation ambiguity (a source language term can have multiple unrelated target language translations). The boolean AND operator provides a natural form of disambiguation, since it is likely that correct translations will co-occur much more often in documents than incorrect translations. This approach has significant advantages when compared to other corpus-based and user-based disambiguation strategies. The search corpus itself is used for disambiguation, so domain relevance is guaranteed and no additional reference corpora are required. User knowledge is incorporated implicitly in the query construction process, so no additional user effort specifically for disambiguation is required.
The boolean model also performs automatic normalization of term importance. For example, a common source language term may have one or more rare target language translations which receive large term weights. The boolean operators make sure that rare terms and terms with many synonymous translations do not dominate during retrieval (since they may well be incorrect translations). These problems can also be addressed with other disambiguation strategies, but other methods tend to add substantial overhead to the query translation process. Table 1 shows an example of two queries manually built from the same topic, the French topic being a translation of the English one.
English: Is wine consumption/production rising or decreasing world-wide?
French: La consommation de vin augmente ou diminue-t-elle dans le monde ?
English Boolean:
3 wine wines
2 consume consumption produce production
2 increase decrease rate curve forecast future
1 world international
French Boolean:
3 vin
3 consommation production consommation vin production vin
2 augmentation diminution diminuer croissance croître décroître
2 monde pays
Table 1: English and French weighted boolean queries of the same topic.
Both English and French searchers used more or less the same concept structure for their queries, although this is often not the case. The concept weights are different, however, with the French searcher requiring that either production or consumption occur in the document for it to be retrieved.
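To make the behaviour of the concept values concrete, here is a small scoring sketch applied to the English query of Table 1. It is not the probabilistic formulation of [5], which is not reproduced in this paper: the OR operator is approximated by a maximum, the AND operator by a product, and the values in `MISS_SCORE` are arbitrary assumptions chosen only so that a concept of value 3 acts as a strict constraint while values 2 and 1 are progressively relaxed. All names are hypothetical.

```python
# Score retained by a concept when none of its terms occurs in the document.
# Illustrative assumption: 3 = mandatory (strict), 2 and 1 = relaxed.
MISS_SCORE = {3: 0.0, 2: 0.3, 1: 0.6}

# English boolean query of Table 1: (concept value, concept terms).
ENGLISH_QUERY = [
    (3, ["wine", "wines"]),
    (2, ["consume", "consumption", "produce", "production"]),
    (2, ["increase", "decrease", "rate", "curve", "forecast", "future"]),
    (1, ["world", "international"]),
]


def concept_score(terms, doc_weights):
    """Soft OR: best weight (in [0, 1]) among the concept terms in the document."""
    return max((doc_weights.get(term, 0.0) for term in terms), default=0.0)


def query_score(query, doc_weights):
    """Soft AND: product of concept scores, relaxed for less important concepts."""
    score = 1.0
    for importance, terms in query:
        floor = MISS_SCORE[importance]
        score *= floor + (1.0 - floor) * concept_score(terms, doc_weights)
    return score


# Two toy documents, given as term -> weight maps (hypothetical weights).
doc_about_wine = {"wine": 0.8, "consumption": 0.5, "increase": 0.4}
doc_without_wine = {"consumption": 0.9, "world": 0.7}

print(query_score(ENGLISH_QUERY, doc_about_wine))
print(query_score(ENGLISH_QUERY, doc_without_wine))
```

The first document lacks the "world" concept but is still retrieved with a non-zero score, since that concept has value 1; the second one scores zero because it misses the mandatory "wine" concept.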
3 Comprehension Aid
3.1 Introduction
One of the greatest impediments to the efficient understanding of foreign texts, one that affects readers at more or less all levels of language comprehension skill, is the appearance of an unfamiliar word or phrase, and the subsequent manual search in a hard-copy bilingual dictionary. Every language learner and every scholar has experienced the frustration of putting aside their primary task of understanding the text being read in order to attack the secondary task of deciphering an unknown word. Depending on the presentation of the dictionary and the ambiguity of the word, this decipherment can last for minutes as the reader finds the appropriate page and heading and then plows through subheadings. While this universal experience will remain as long as bilingual dictionaries are only available as paper books, a number of computer-based solutions to expedite this reading problem are available today for certain languages. We present here an intelligent reading aid, LOCOLEX, that has been implemented by incorporating a machine-readable bilingual dictionary and state-of-the-art linguistic technology.
3.2 Overview of LOCOLEX
LOCOLEX provides intelligent dictionary lookup through the interaction between a complete on-line dictionary and on-line text. As in the previous section, we describe this interaction and ignore the human user interface aspects of the project. Let us assume that the following sentence appears in a text, and that the user clicks on the word "bras":
Dans l'autre sens il utilise son bras atrophié pour guider la balle.
To create well-formed translations, we may need most of the detailed information usually provided by bilingual dictionaries. However, if our purpose is merely to understand a text, the following list of direct equivalents (built out of the Oxford-Hachette dictionary) is sufficient:
bras nm, inv
1: Anat arm
2: (main d'oeuvre) manpower
3: Geog (de fleuve) branch
4: Tech (de fauteuil) (d'électrophone) (ancre) arm; (de brancard) pole
5: Zool (de cheval) shoulder; (de mollusque) tentacle
Only a part of the dictionary entry is displayed, but it is a part that can be extracted from the full entry in a predictable way. Unlike small dictionaries geared exclusively towards language comprehension, which would merely delete the extra information, LOCOLEX allows the user to ask for more information about one or another of the meanings. In this case s/he could click a second time on what s/he deems to be the right choice and get a usage example that may exactly match the input sentence. As the goal is comprehension only, LOCOLEX at first does not display the usages, examples, pronunciations, or cross references that were part of the initial dictionary entry. We make the entry shorter in order to make a complete on-line dictionary easy to use.
But LOCOLEX can handle more complicated cases and provide only the appropriate translation by exploiting the context in which the word appears. Indeed, it can use the context to determine the correct part of speech from among different possibilities. A great number of word forms belong to different parts of speech. For instance, "slew" could be either a noun or the past tense of the verb "to slay". If the word "slew" appears in a text as a verb and the user does not know the infinitive form of the verb, s/he will look under the word "slew" and only get the translation for the noun interpretation. To avoid this, LOCOLEX analyses the words, returning their different base forms. But this alone would again lead to presenting the user with more information than is desirable. For example, for the French word "grille" we would now get both a verb form "griller" and a noun form "grille" when it appears in the following French sentence:
Le 21 janvier, présentant sur FR 3 la nouvelle grille de la chaîne, il a affirmé:
To avoid this, LOCOLEX disambiguates the word with respect to its possible parts of speech. If the user clicks on this word in the above phrase, then, because it appears between an adjective and a preposition, preference is given to the noun interpretation and only the part of the dictionary which is attached to the noun will be displayed.
grille nf
1: (clôture) railings; (porte) gate; (de prison) bars; (d'évier, égout) drain; (de four, réfrigérateur) shelf; (de poêle, cheminée) grate
2: (de mots croisés, d'horaires) grid
3: Admin scale
4: Electron grid
Another distinction that LOCOLEX can make by using the word's context is whether or not a word is part of a multiword expression. Consider the following French sentence:
La victime de cette offensive ne reste pas les bras croisés.
Here "bras" is part of a multiword expression (idiomatic phrase), "rester les bras croisés" ("to stand idly"). In this case, translating the word "bras" directly into its literal meaning, as shown before, is useless. Instead, the entire phrase must be taken into consideration. Therefore, if the user clicks on the word "bras" in the phrase "ne reste pas les bras croisés", LOCOLEX displays only the translation for the whole phrase and not the translation for the word "bras" itself.
bras nm, inv
rester les bras croisés: to stand idly
LOCOLEX can also deal with other kinds of multiword expressions, such as compound words. Consider the French word "juge" in the following compound:
juge de paix
When the user clicks on "juge" within the above phrase, LOCOLEX displays:
juge nm
juge de paix: Justice of the Peace
To summarize, LOCOLEX uses a word's context to look for multiword expression patterns, choose between parts of speech, and exclude irrelevant information in order to focus the user's attention on the best translation for better comprehension.
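The lookup strategy summarized above can be sketched as follows. This is an illustration under strong assumptions rather than a description of LOCOLEX internals: the sentence is supposed to be already tagged and lemmatized, multiword expressions are matched as ordered lemma sequences that tolerate intervening words (so that "ne reste pas les bras croisés" matches "rester les bras croisés"), and the tables `MWES` and `ENTRIES` are small hypothetical extracts.

```python
# Hypothetical extracts of a bilingual dictionary.
MWES = {  # lemma sequence -> translation of the whole expression
    ("rester", "le", "bras", "croisé"): "to stand idly",
    ("juge", "de", "paix"): "Justice of the Peace",
}
ENTRIES = {  # (lemma, part of speech) -> direct equivalents
    ("grille", "NOUN"): ["railings", "gate", "bars", "grid"],
    ("griller", "VERB"): ["to grill", "to toast"],
    ("bras", "NOUN"): ["arm", "manpower", "branch"],
}


def is_subsequence(pattern, lemmas):
    """True if the pattern lemmas occur in order, possibly with gaps."""
    remaining = iter(lemmas)
    return all(lemma in remaining for lemma in pattern)


def lookup(sentence, clicked):
    """Return the translations to display for the clicked token.

    `sentence` is a list of (surface form, lemma, part of speech) triples,
    i.e. the assumed output of the tagger; `clicked` is a token index.
    """
    _, lemma, pos = sentence[clicked]
    lemmas = [token_lemma for _, token_lemma, _ in sentence]
    # 1. Multiword expressions take precedence over single-word entries.
    #    (Simplification: we do not check which occurrence was matched.)
    for pattern, translation in MWES.items():
        if lemma in pattern and is_subsequence(pattern, lemmas):
            return [translation]
    # 2. Otherwise keep only the entry for the disambiguated part of speech.
    return ENTRIES.get((lemma, pos), [])


idiom = [("ne", "ne", "ADV"), ("reste", "rester", "VERB"), ("pas", "pas", "ADV"),
         ("les", "le", "DET"), ("bras", "bras", "NOUN"), ("croisés", "croisé", "ADJ")]
schedule = [("la", "le", "DET"), ("nouvelle", "nouveau", "ADJ"),
            ("grille", "grille", "NOUN"), ("de", "de", "PREP"),
            ("la", "le", "DET"), ("chaîne", "chaîne", "NOUN")]

print(lookup(idiom, 4))      # click on "bras" inside the idiom
print(lookup(schedule, 2))   # click on "grille" tagged as a noun
```

Keying the entries on the pair (lemma, part of speech) is what prevents the verb reading "griller" from being displayed once the tagger has chosen the noun.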
3.3 Possible Improvements
There is much room for improvement between the system described here and what a user might want to have. Machine Translation and Machine Aided Human Translation provide directions to explore towards a better comprehension system. The approach we have followed relies on simple but efficient and high-quality tools, and can be glossed as: "let the system do what it is good at, and let the user do what we cannot do". It is along this line that we would like to evolve, in order to provide the user with better comprehension tools.
4 What are the Possible Extensions/Applications we can Dream of?
Imagine you participate in an international meeting. Let us make the (weak) assumption that English is the official language, and the (less weak, though still probable) assumption that you have a speech recognition system coupled with LOCOLEX (note that this implies that you possess a computer, which, from my personal though small experience of international meetings, is less and less of an assumption but more and more of a fact). Now, when you do not understand particular expressions used by others, you can use your comprehension aid on the output of your speech recognition system to recompose the meaning of what was said. Such systems can be built upon available technology, and their emergence might not be far off.
Well, now dream that you can write programs in your mother tongue. In such a case you can assume that databases of such programs exist. When you want to write a piece of code with specific functionalities, you simply describe, in your mother tongue, the program you are interested in, and then let your CLIR system retrieve relevant pieces of code that you understand with the help of your comprehension aid. This is a simple (in a way, but in a way only) application of the systems described in the previous sections, but with the implication that it could mark an end to endless computer scientist disputes on the way programs should be written and commented. Wonderland!
5 Conclusion
We have presented two systems, with different user interactions, to help people work in a multilingual environment. The combination of these systems allows one to perform searches in a multilingual database, and to understand documents in different languages.
References
[1] D. Bauer, F. Segond, and A. Zaenen. LOCOLEX: Translation Rolls off Your Tongue. In Proceedings of ACH-ALLC, Santa Barbara, USA, July 1995. ACH-ALLC'95.
[2] J.L. Fagan. Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. PhD Thesis, Cornell University, September 1987.
[3] E. Gaussier, G. Grefenstette, D.A. Hull, and B.M. Schulze. Xerox TREC-6 Site Report: Cross Language Text Retrieval. To appear in The Sixth Text REtrieval Conference, 1997.
[4] D.A. Hull. Using Structured Queries for Disambiguation in Cross-Language Information Retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, 1997.
[5] D.A. Hull. A Probabilistic Model for the Approximate Matching of Boolean Constraints. Being cleared for publication, 1997.
[6] D.A. Hull, G. Grefenstette, E. Gaussier, B.M. Schulze, H. Schütze, and J. Pedersen. Xerox TREC-5 Site Report: Routing, Filtering, NLP, and Spanish Tracks. In D.K. Harman, editor, The Fifth Text REtrieval Conference (TREC-5). U.S. Department of Commerce, 1997. NIST Special Publication 500.
[7] L. Karttunen, R.M. Kaplan, and A. Zaenen. Two-level Morphology with Composition. In Proceedings COLING'92, Nantes, France, 1992.
[8] G. Salton. Automatic Processing of Foreign Language Documents. Journal of the American Society for Information Science, 21:187-194, 1970.
[9] G. Salton and M. McGill. An Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
[10] A. Schiller. Multilingual Finite-State Noun Phrase Extraction. In Workshop on Extending Finite State Models of Language, Budapest, Hungary, 1996. ECAI'96.