Document not found! Please try again

A Framework for Cross-Language Information ... - Semantic Scholar

0 downloads 0 Views 256KB Size Report
Keywords: cross-language information retrieval, probabilistic retrieval, Japanese-. English, machine translation, information access. 1. Introduction. The quantity ...
1 A Framework for Cross-Language Information Access: application to English and Japanese

Gareth Jones, Nigel Collier, Tetsuya Sakai, Kazuo Sumita and Hideki Hirakawa

Human Interface Laboratory Research and Development Center, Toshiba Corporation 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki 210-8582, Japan

Corresponding Author: Dr Gareth J. F. Jones Department of Computer Science School of Engineering and Computer Science University of Exeter Prince of Wales Road Exeter EX4 4PT United Kingdom Tel: +44 1392 264044 Fax: +44 1392 264067 email: [email protected]

ktim.tex; 22/10/1999; 11:38; p.1

1 A Framework for Cross-Language Information Access: application to English and Japanese

Gareth Jones, Nigel Collier, Tetsuya Sakai, Kazuo Sumita and Hideki Hirakawa

Human Interface Laboratory Research and Development Center, Toshiba Corporation 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki 210-8582, Japan

Keywords: cross-language information retrieval, probabilistic retrieval, Japanese-English, machine translation, information access

ktim.tex; 22/10/1999; 11:38; p.2

A Framework for Cross-Language Information Access: application to English and Japanese

Gareth Jones3, Nigel Colliery , Tetsuya Sakai, Kazuo Sumita and Hideki Hirakawa Human Interface Laboratory Research and Development Center, Toshiba Corporat ion 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki 210-8582, Japan Abstract.

Internet search engines allow access to online information from all over the world. However, there is currently a general assumption that users are uent in the languages of all documents that they might search for. This has for historical reasons usually been a choice between English and the locally supported language. Given the rapidly growing size of the Internet, it is likely that future users will need to access information in languages in which they are not uent or have no knowledge of at all. This paper shows how information retrieval and machine translation can be combined in a cross-language information access framework to help overcome the language barrier. We present encouraging preliminary experimental results using English queries to retrieve documents from the standard Japanese language BMIRJ2 retrieval test collection. We outline the scope and purpose of cross-language information access and provide an example application to suggest that technology already exists to provide e ective and potentially useful applications. cross-language information retrieval, probabilistic retrieval, JapaneseEnglish, machine translation, information access

Keywords:

1.

Introduction

The quantity of textual material available online is currently increasing very rapidly. The most dramatic example of this is the ongoing expansion in the number of documents accessible from the Internet and the World Wide Web. In principle users of the Internet and similar systems can download all online material to which they are permitted access. With the advent of query translation facilities in the front-end of retrieval systems users are now able to search for documents in languages other than that in which they wrote the query. However, a signi cant issue is that many users are currently restricted to only being able to make use of information contained in documents actually written in languages in which they have some degree of uency. This prob3 y

Current address: Department of Computer Science, University of Exeter, U.K. Current address: Department of Information Science, University of Tokyo, Japan

c 1999 Kluwer Academic Publishers.

Printed in the Netherlands.

ktim.tex; 22/10/1999; 11:38; p.3

2 lem e ectively blocks their opportunity to access information in other languages and hence limits their ability to exploit online information. The importance of this issue is demonstrated by the current evolution in people's approach to accessing information. Traditionally those requiring timely information in their work relied on material provided by various professional agencies. For example, current a airs information provided by international news agencies and nancial data from company and stock market reports. Today such people are increasingly seeking information for themselves. Users no longer wait for information to arrive on their desk, but rather with the assistance of search engines, they can look for it online themselves. Thus workers are empowered to seek and make use of all available pertinent information and not just that provided by professional services. This is not to suggest that such information providers are becoming obsolete. They still have a vital role in generating relevant summaries of current information, and may in fact be able to provide richer content to their clients by themselves making use of the increasing number of information sources. What we are seeking to explore is a complementary technology, the objective of which is to enable information providers and consumers to make use of all available information sources. This paradigm is already well developed for information retrieval in an individual language. However there are only a few sources for textual material originating in languages in which the information seeker is not uent such as the foreign correspondents and international news services. Advances in cross-language information retrieval and machine translation suggest that this problem may be eased by the development of translingual information access applications. This paper explores issues for the development of translingual (or cross-language) information access for one of the most challenging tasks: information access between European and Asian languages. In our case we take the example of English and Japanese. We analyse the importance of information retrieval and machine translation in achieving this objective and describe ongoing work which demonstrates current achievements. The remainder of this paper is organised as follows: Section 2 de nes information access and explores the challenges for translingual technology, Section 3 summarises the state-of-the-art for English-Japanese machine translation, Section 4 outlines current information retrieval procedures and how they are applied monolingually to English and Japanese, and Section 5 explores current approaches to cross-language information retrieval and access. Section 6 describes a preliminary experiment in cross-language access to Japanese news texts. Finally, Section 7 describes current conclusions and further research directions.

ktim.tex; 22/10/1999; 11:38; p.4

3 2.

2.1.

Information Access

Definitions

When using an information retrieval (IR) system a user is, in general, primarily interested in accessing information contained in documents indexed by the retrieval system. Conventionally IR is usually taken to be the location and retrieval of documents potentially relevant to a user's information need. It is assumed within this scenario that once a relevant document has been retrieved, the user will be able to identify it as such and extract the relevant information contained in the document by reading it. In this paper we extend this de nition of IR and consider its role in the complete process of knowledge acquisition which we refer to as information access (IA). In IA we view extracting information from retrieved documents as an integral part of the information seeking process. Interest in information retrieval research has expanded signi cantly in recent years, although much research is still focussed on several well established models including: the vector-space model (Salton et al., 1977) (Salton and Buckley, 1988), the probabilistic model (Robertson, 1977) (Robertson and Sparck Jones, 1976) and inference networks (Turtle and Croft, 1990). These models have been extensively researched and evaluated. Much of this e ort in recent years has concentrated on the US NIST TREC (Text REtrieval Conference) (Harman and Voorhees, 1998). Commercial online text retrieval systems, such as Alta Vista , InfoSeek and Lycos , now contain index information for millions of documents. Using these systems information seekers are able to enter search requests in natural language and receive an interactive response. Most information retrieval systems are currently restricted to single language or monolingual operation, although there is increasing interest in the ability to query between di erent languages, see for example (Carbonell et al., 1997) (Hull and Grefenstette, 1996) (Sheridan and Ballerini, 1996) (Ballesteros and Croft, 1998). In this scenario referred to as cross-language information retrieval (CLIR), queries are entered in one language and documents retrieved in one or more other languages. Of course, documents returned using a cross-language retrieval system are only useful to the user if they can acquire the information they want from the documents. Section 5 gives a brief overview of some methods for cross-language retrieval, and a more detailed review can be found in (Oard and Dorr, 1996). In the following discussion we highlight some pertinent issues for IA. An obvious method for cross-language retrieval is to use automatic machine translation (MT) to translate the documents into the query

ktim.tex; 22/10/1999; 11:38; p.5

4 user query in L1

MT

query

IR

in L2

ENGINE

ranked document list in L2

augmented MT

document list in L1 & L2

selected MT

document in L1

document collection in L2

monolingual IR cross-language IR cross-language information access

Figure 1. Flow diagram for a basic Cross-Language Information Access system.

language. Unfortunately there are practical, as well as technical, drawbacks to this approach. Various translation scenarios are possible, for example all information servers could translate and index all documents into all possible query languages, this is rather impractical since maintaining index les in multiple languages may not be possible due to their size and the index maintenance overhead would potentially be very large. It is possible that subscribers could pay for such a service in domains of special interest, however such a strategy is clearly limited. An easier alternative is to translate the query into the original document language at query time (i.e. online) and retrieve these documents. This option is much more exible since it allows the query to be translated into any desired language, subject to the availability of a suitable query translation system for this language pair. 2.2.

Cross-Language Information Access

Our scenario for cross-language information access (CLIA) extends the CLIR paradigm to incorporate various possible post-retrieval processes to enable users to access information contained in retrieved documents. Potentially useful post-retrieval IA technqiues include: full MT, text summarisation, MT for content gisting, information extraction, and graphical content-visualisation. Figure 1 shows an example CLIA process system which includes post-retrieval MT. The rst stages follow a standard CLIR path: the user enters a query in their native language, the query is translated into the desired document language, and applied to the IR engine. Current IR systems typically present the user with a list of documents ranked by retrieval matching score, and the title and often the rst sentence of each document in the list. Using this information the user selects potentially relevant documents.

ktim.tex; 22/10/1999; 11:38; p.6

5 The scenario as described so far is a standard CLIR system. For our CLIA system, to assist the user with the initial stage of relevance judgement, we could use MT to translate the title and rst sentence of each document provided in the document language into the language of the query. This additional information could be presented to the user in an augmented ranked list. This idea has previously been adopted in the NTT TITAN system (Hayashi et al., 1997) which translates the document title from English to Japanese, and in addition shows the user various metadata, such as the domain name of the server and the language of the document, to help inform their decision of whether to download a particular document. When the user selects a document it could be automatically translated into the query language. Although we still have not reached the goal of fully-automatic high quality translation, today's MT systems o er an valuable tool for gisting general language. A practical strategy to do this would have to be designed carefully since MT is in general computationally expensive, and the translation output will usually contain at least stylistic aws and disambiguation mistakes. However, it is important to remember that the user is interested in information, not perfect prose or necessarily a translation of the complete document. Fluent readers can usually read a document despite stylistic clumsiness and spot translation errors due to contextual inconsistency. One could view the MT as assisting the user in extracting the required information. Recent work reported in (Oard and Resnik, 1999) shows that users are able to perform a cross-language categorisation task using only simple dictionary lookup \gloss" translations from Japanese to English. Whether this result would extend to less distinct relevant/non-relevant decisions between more closely related documents in operational IR systems remains to be investigated. Probably the most challenging environment for CLIA is for language pairs with di erent scripting systems. For example, Europeans can often make reasonable guesses at the contents of documents in another European language, whereas they are completely unable to access information in Asian language documents. The same is often true, at least to some extent, in reverse for many Asian language speakers. Although not explored in our current work, a further method which might be employed in CLIA to assist the user in nding relevant material within a document is graphical visualisation of the content. This technique has employed for monolingual information access of text in TileBars (Hearst, 1995) and of digital broadcast news video (Brown et al., 1995). Using visualisation potentially relevant areas of documents can be indicated to the user graphically, thus allowing the user to concentrate their attention on portions of the documents. This feature

ktim.tex; 22/10/1999; 11:38; p.7

6 may be particularly useful if the cognitive load of browsing roughly translated material is found to be high. The foregoing discussions assume that the requisite component technologies are in place and that their performance levels are sucient to provide useful information access. These scenarios can only be properly explored in the laboratory if suitable test collections are available to assess CLIR, MT, and most importantly how useful non-native speakers nd a given system is for assisting in accessing information. Such collections do not currently exist, as a starting point for our work we describe a preliminary experiment using a simulated cross-language information seeking task. Ultimately even if these techniques appear to work in the laboratory, the real test of course is whether information seekers nd them helpful and make regular use of them in their information seeking activities. 3.

Machine Translation

The role of translation in CLIR is essentially to bridge the gap between surface forms of terms in the query and document collection languages. Much of the previous work in CLIR, such as (Harman and Voorhees, 1998) (Hull and Grefenstette, 1996) (Sheridan and Ballerini, 1996), has looked at CLIR for European language pairs and has avoided many of the challenges which we face in processing European-Asian language pairs. In the latter case particular diculties arise because the language pairs are not cognates, so for example, a word in English may appear as a phrase, or a word and a particle (bunsetsu) in Japanese. We also nd that the level of lexical transfer ambiguity, i.e. the number of di erent translations which a word can have, is higher in such language pairs than in say English-French or English-Spanish. The three major practical challenges which we face in CLIR are:

0 0

Coverage: providing sucient bilingual knowledge,

0

Synonym selection: how to choose between conceptually equivalent forms of a translation.

Disambiguation: how to choose between conceptually di erent forms from the set of possible translations of a query word, and

MT using deep linguistic analysis is a core-technology for providing solutions in all of these areas. The main limitations which arise in adapting MT to IR are in the coverage of the bilingual dictionaries and in the amount of context available in short IR queries, where

ktim.tex; 22/10/1999; 11:38; p.8

7 it is dicult for linguistic analysis to succeed. These problems are non-trivial and increase as the scope of language we are required to process expands. Until recently it was generally felt that MT quality is too unreliable to translate queries for CLIR (Hull and Grefenstette, 1996). However, results in this paper and elsewhere (Franz et al., 1999) (Gey et al., 1999) suggest that reasonable retrieval performance can be achieved without modi cation by using existing MT systems. 4.

Information Retrieval

To date the vast majority of information retrieval research has been carried out on English language document collections. For this reason issues surrounding the retrieval of English text are the best understood and the techniques adopted the most extensively evaluated. Much of this work has been carried out within the Text REtrieval Conference (TREC) program (Harman and Voorhees, 1998). TREC has provided large English language retrieval test collections which have enabled di erent approaches to information retrieval to be compared and contrasted. TREC has run smaller evaluation exercises or \tracks" focused on other languages such as Spanish and Chinese retrieval. Large scale explorations of Japanese language retrieval have only been conducted very recently in the NTCIR (Kando et al., 1999) and IREX (Sekine and Ishara, 1999) retrieval workshops. English and many other Western European languages have an advantage for retrieval because the basic word level indexing units are clearly de ned. For such languages retrieval generally adopts the following approach. Text is conditioned to remove standard common stop words (usually short function words) and the remaining content words are sux stripped to encourage matching between di erent word forms. The sux stripped words (search terms ) are then statistically weighted based on their distribution within each document and the overall document archive. In retrieval the search request is matched against each document and a corresponding matching score computed. The user is then presented with a document list ranked by matching score. While these techniques are well understood for English, aggreement on suitable techniques for other languages requires further research which is only now becoming possible as large collections become available. Compared to these languages many languages including Japanese and other Asians languages such as Chinese and Korean present two particular problems for information retrieval: rst there is extensive use of ideographic character sets, and second they are agglutinating languages with no spaces between the words. In order to perform retrieval

ktim.tex; 22/10/1999; 11:38; p.9

8 content-information must be extracted from the character strings contained within documents and search requests. Much of the published research work in Asian language retrieval has focussed on the development of indexing techniques (Ogawa and Iwasaki, 1995) (Chien, 1995) (Lee and Ahn, 1996) (Nie et al., 1996). Until very recently (Sekine and Ishara, 1999) all Japanese retrieval test collections had been very small, meaning that it has not been possible to draw de nite conclusions about existing research results. However, the work which has appeared suggests that weighting schemes developed for English language transfer well to Japanese (Fujii and Croft, 1993) (Jones et al., 1998a) (Ogawa and Matsuda, 1997) and indeed, at least some, other Asian languages such as Chinese (Beaulieu et al., 1997). Experiments using new collections such as IREX are ongoing and for the purposes of these paper assume the existing indicative results to be reliable. 4.0.1. Indexing Methodologies Two methods are available for extracting indexing units from Japanese. Both of these techniques are used in the Japanese language retrieval system used for the experiment described in Section 6.

0

Morphological Segmentation: the continuous string of characters is divided into word-level units using a dictionary-based morphological analyser. In operation the character string is compared against word entries in a dictionary. The morphological analyser extracts whole words, and also tends to extract component words (or morphemes ) from compound words as separate indexing units. Unfortunately, morphological analysers make mistakes in segmentation. Errors arise principally from ambiguity of word boundaries in the character string and limitations in the morphological analyser, such as the morphological analyser's inability to identify words outside its dictionary.

0

Character-based Indexing: individual characters or (usually overlapping) xed length character n-grams are automatically extracted from the character strings and used as the indexing units. In this approach no linguistic analysis is performed and possible word boundaries are ignored.

Once the indexing units have been extracted, appropriate text conditioning can be carried out. A description of the possible requirements of Japanese language text conditioning in IR and potential strategies is beyond the scope of this paper, but a good review is contained in (Fujii and Croft, 1993). In general a detailed analysis of text conditioning for Japanese language IR is an important area for future study, but

ktim.tex; 22/10/1999; 11:38; p.10

9 will have to wait until suitable experimental Japanese text retrieval collections are available. After applying text conditioning, a standard term weighting method can be applied to the indexing terms. At retrieval time a search request is processed, e.g. using morphological segmenation or character-string extraction, to produce appropriate indexing units, which are then matched against the documents. 5.

Cross-Language Retrieval Methods

Current techniques for CLIR all involve some form of translation of either queries, documents or both. The methods used by researchers can generally be divided into the following categories:

0

Dictionary Term Lookup (DTL): individual terms in the query are replaced by one or more possible translations in the document language taken from a bilingual dictionary (Hull and Grefenstette, 1996). The principal advantages of this approach are that online bilingual dictionaries are becoming increasingly common and that the translation process is computationally very cheap due to its low level of analysis. Its main disadvantage is the ambiguity which is frequently introduced in the translation process. Individual terms are replaced by several alternative terms which are sometimes semantically unrelated to the original term in its current context. Various techniques are being explored to overcome this problem, for example using relevance feedback (Ballesteros and Croft, 1997) and corpus co-occurrence information (Ballesteros and Croft, 1998).

0

Parallel-corpora based Query Translation: terms occurring in similar contexts in aligned \parallel" (more often \comparable") corpora in di erent languages are identi ed. When the user enters a query a number of related terms in the other language can be generated in a form of query expansion (Sheridan and Ballerini, 1996). The main advantage in this method is that it is less prone to ambiguity problems of dictionary based methods. Its main disadvantage is that parallel corpora are not as widely available as bilingual dictionaries, particularly outside specialised domains.

0

Machine Translation (MT): the query and/or the document are translated using full machine translation with linguistic analysis. The main attraction of this approach is that the ambiguity of terms should be greatly reduced by taking their context into account via

ktim.tex; 22/10/1999; 11:38; p.11

10 the linguistic analysis in the translation process. The main disadvantages are the computation expense of the MT process, and the inaccuracy of current translation systems when used outside speci c domains. Inaccuracy in translation is particularly a problem where there is little contextual information, which unfortunately is exactly the situation often encountered in the short search requests commonly entered into information retrieval systems. Although widely discussed in the context of CLIR, as noted in Section 3, only very recently has the use of MT been explored in the context of CLIR (Collier et al., 1998) (Gey et al., 1999) (Franz et al., 1999). In the experiment reported here we compare CLIR retrieval peformance for query translation using DTL and full MT of each query. 6.

A Preliminary Experiment in English-Japanese Cross-Language Information Access

At the start of this paper we examined scenarios in which an information seeker might be looking for generally available information in a language of which they have little or no knowledge. In this section we describe a preliminary simulation experiment exploring the scenario of an English speaking researcher who wishes to go beyond ocial English language reports of Japanese news events by investigating Japanese language news articles directly. Our researcher will need to make use of CLIR to locate potentially relevant documents and require the assistance of MT to decide whether a document is relevant and to access the information it contains. For this experiment we use the Toshiba Japanese language NEAT IR system (Kajiura et al., 1997) and Toshiba ASTRANSAC MT system (Hirakawa et al., 1991). 6.1.

The NEAT Information Retrieval System

The NEAT Information Retrieval system is being developed for the retrieval of online Japanese text articles. Documents are indexed separately using both morphological segmentation and character-based analysis. A ranked output list is formed by applying term weighting and summing the weights of terms found in both the query and each document. 6.1.1. Term Weighting In this experiment the NEAT System makes use of the BM25 probabilistic combined weight (cw ) derived by Robertson (Robertson and

ktim.tex; 22/10/1999; 11:38; p.12

11 Sparck Jones, 1997) (Robertson and Walker, 1994). The BM25 weight has been shown to be e ective not only for English text retrieval, but also where documents have been imperfectly indexed, for example in Chinese text retrieval (Beaulieu et al., 1997), and in retrieval of spoken documents (Walker et al., 1998). The BM25 cw weight is calculated as follows, (

cw i; j

)=

K

cf w (i) 2 tf (i; j ) 2 (K 1 + 1) 1 2 ((1 0 b) + (b 2 ndl(j ))) + tf (i; j )

where cw(i; j ) represents the weight of term i in document j , cf w(i) is the standard collection frequency weight (often referred to as inverse document frequency weight), tf (i; j ) is the frequency of term i in document j , and ndl(j ) is the normalised length of document j . ndl(j ) is calculated as, ( )=

ndl j

() ; for all documents dl j

Average dl

where dl(j ) is the length of j . K 1 and b are empirically selected tuning constants for a particular collection. K 1 is designed to modify the degree of e ect of tf (i; j ), while constant b modi es the e ect of document length. High values of b imply that documents are long because they are verbose, while low values imply that they are long because they are multitopic. In the experiments reported here document length is measured as the number of characters in the document. When a Japanese language request is entered it is morphologically segmented, A query-document matching score for each document is computed independently for the document term index les formed by morphological analysis and character-based indexing. These matching scores are then summed for each document. A list of articles ranked by the query-document summed matching scores is nally returned to the user. Further details of the operation of the probabilistic NEAT system are given in (Jones et al., 1998a). 6.2.

The ASTRANSAC Machine Translation System

The ASTRANSAC MT system is widely used for translating Internet pages from English to Japanese and so we feel it o ers the necessary general language coverage to succeed for a news domain. Translation is fully automatic and this frees the user to concentrate on the search selection task. The translation model in ASTRANSAC is the transfer method (for example see (Hutchins and Somers, 1992)), following the standard process of morphological analysis, syntactic analysis, semantic analysis and selection of translation words. Analysis is top-down and

ktim.tex; 22/10/1999; 11:38; p.13

12 uses ATNs (Augmented Transition Networks) on a context-free grammar. In our simulation experiment we used a 65,000 term common word bilingual dictionary and 14,000 terms from a proper noun bilingual dictionary which we consider to be relevant to news events covered in the document collection used in our experiment. For this experiment ASTRANSAC is used to automatically translate queries from English into Japanese, and also to translate individual documents for user browsing after retrieval. 6.3.

BMIR-J2 Japanese Retrieval Test Collection

Our simulation experiment uses the standard BMIR-J2 Japanese retrieval collection (Kitani et al., 1998). The BMIR-J2 collection consists of 5080 articles taken from the Mainichi Newspapers in the elds of economics and engineering, and a total of 50 main search requests1 . Each request consists of a natural language phrase describing a user's information need. The designers of BMIR-J2 identifed relevant documents for each query as follows. A broad Boolean expression was used to retrieve most possible relevant documents. The retrieved documents were manually assessed for relevance to the query and the assessment cross-checked by another assessor. The average number of relevant documents for each query is 33.6. BMIR-J2 was designed so that some search requests can be satis ed very easily, for example via simple keyword matching; while for some others it is very dicult to retrieve the relevant documents using the request, requiring syntactic or semantic analysis of the request in order for the user's information need to be fully understood. In the current investigation all queries are handled in the same way by the NEAT and ASTRANSAC systems. A breakdown of retrieval performance for the di erent query types using the probabilistic NEAT system is given in (Jones et al., 1998b). In general, as would be expected we observe much higher retrieval performance for easier queries than the more dicult ones. 6.3.1. English Language Queries For our simulation experiment the BMIR-J2 requests were translated into English by a bilingual native Japanese speaker. The objective of this translation process was to produce queries which used reason1

Data in BMIR-J2 is taken from the Mainichi Shimbun CD-ROM 1994 data collection. BMIR-J2 was constructed by the SIG-Database Systems of the Information Processing Society of Japan, in collaboration with the Real World Computing Partnership.

ktim.tex; 22/10/1999; 11:38; p.14

13 ably good native English while preserving the meaning of the original Japanese. In this experiment we assume that these requests have been generated by the English speaking information seeker hypothesised at the start of this section. 6.3.2. Example Query The original text of one of the BMIR-J2 Japanese queries is: EEOCNA6b$NCM2.. This does preserve the basic meaning of the original query, but is a little awkward. This is because, even though \=L>." is the correct translation of \reduction" in contexts such as \reduction of armament" and \reduction of the personnel," it is seldom used with \telephone rates." Inspection of the machine translated queries showed that while some were identical to the original query others were quite di erent. Some of these variations will have been introduced due to problems in the MT process, however others will be due to the inexact nature of the manual Japanese-English translation. 6.4.

Experimental Results

In our experiments we compare retrieval performance using the original Japanese queries with those generated using automatic translation. In these experiments the two translation methods use the same bilingual dictionary of possible term translations. The DTL method merely replaces each English term with all possible Japanese translations from the dictionary, while MT applies all the linguistic resources available to perform full ASTRANSAC machine translation. Retrieval performance is measured in terms of precision , the proportion of retrieved documents which are relevant to the search request, and recall , the proportion of relevant documents which have been retrieved. Table I shows BMIR-J2 retrieval performance for original monolingual Japanese requests, and for automatically translated EnglishJapanese requests generated using MT and DTL. The table shows precision at ranked list cuto levels of 5, 10, 15, and 20 documents and the average precision, which is calculated by averaging the precision values at the position of each relevant documents for each query,

ktim.tex; 22/10/1999; 11:38; p.15

14

PRECISION MONO

0.8

CL MT CL DTL

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0

RECALL 0.00

0.50

1.00

Figure 2. Recall-Precision curve for BMIR-J2 using: monolingual IR (MONO); CLIR with full query MT (CL MT); CLIR with query dictionary lookup (CL DTL).

Table I. Retrieval Precision for BMIR-J2 using: monolingual IR (MONO); CLIR with full query MT (MT); CLIR with query dictionary lookup (DTL). MONO Prec.

5 docs 10 docs 15 docs 20 docs

MT

CL

DTL

0.588 0.508 0.463 0.425

0.396 0.342 0.333 0.307

0.196 0.194 0.192 0.185

Av Precision

0.451

0.289

0.161

% change

|

-35.9%

-64.3%

ktim.tex; 22/10/1999; 11:38; p.16

15 and then taking the average across the query set. Figure 2 shows a corresponding recall-precision curve. The results in Table I and Figure 2 show that as expected retrieval performance is degraded for the automatically translated queries. For the cross-language queries MT is clearly superior to the DTL method. We realise that our result must be treated with caution due to the small size of the retrieval collection. However, it should be emphasised here that we have made no attempt to modify the translation dictionaries to the BMIR-J2 task, and thus we feel overall that the results for MT based CLIR relative to monolingual IR are quite encouraging. Further results reported in (Jones et al., 1999) show that retrieval performance can be improved in all cases by application of local feedback to the retrieved ranked document lists. It is likely that further improvement, particularly in the case of DTL, could be achieved by the application of the disambiguation techniques explored in (Ballesteros and Croft, 1998). 6.5.

Document Selection and Browsing

Of course, as we have suggested earlier, in order to select a potentially relevant document in an informed way, and to access the information it contains, MT must be used to enable the user to access the contents. Figure 3 shows the top ve ranked documents retrieved in response to the example query given previously. The document headings and their rst sentence are shown in their original Japanese and then in English as generated by the ASTRANSAC MT system. The original Japanese information is assumed to be supplied by the search engine which generated the document list. Headings and similar short statements are a challenging translation domain for MT systems since they are often written in a terse style which is not typical of the language. The translations produced by the ASTRANSAC system in this example handle headings as standard text. If we were to incorporate some special features to process headings we could expect some improvement in the quality of the translations produced. Obviously in practice there will be some computational overhead associated with generating the translations in a returned list of this type, but the amount of text involved is very small and the translation overhead should not noticeably interfere with the user's interaction with the retrieval system. >From the English language information shown in Figure 3 our information seeker is able to gist the possible subject of each document. The user is only able to gist the possible contents from this much information even in a monolingual retrieval system. The fundamental

ktim.tex; 22/10/1999; 11:38; p.17

16 RANK 1 ARTICLE: 001339 HEADING: ORIGINAL: !N%K%C%]%s@/CHqJ$K4pK\NA6b$J$I 3F

Suggest Documents