The Remarkable Search Topic-Finding Task to Share Success Stories of Cross-Language Information Retrieval

Masashi Inoue
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

ABSTRACT
The performance of cross-language information retrieval (CLIR) systems has improved to the level of practical use. The next step is to inform potential users that CLIR technologies are ready to be used. A good way of doing this is to present attractive scenarios for using multilingual information sources. For this purpose, we need more knowledge, from the utility perspective, of the occasions when CLIR is more beneficial than monolingual information retrieval. The difficulty lies in incorporating scenario building into research activities. This paper introduces a framework named the remarkable search topic-finding task to examine how we can pursue this objective as part of the CLIR evaluation framework. An example process implementing the task and some unresolved issues are discussed.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process
General Terms Design
Keywords cross-language information retrieval, information need, scenario building, data gathering
1. INTRODUCTION Although information is disseminated through documents written in a variety of different languages, most people use only a limited number of languages regularly. Therefore, it is helpful to have techniques to enhance multilingual information access for obtaining more complete information. Among various multilingual information access processes, this paper focuses on cross-language information retrieval
(CLIR), in which users input a query to a system in one language and the system returns a ranked list of documents in a different language, or possibly in multiple languages including the query language. The central question of CLIR research is what further information can be obtained by using CLIR. There is no simple answer. What we can do, however, is enumerate examples such as the following: German-speaking users can find better information for a dining-related query using CLIR functionality integrated into a Web search engine. "A search for 'bestes vegetarisches restaurant in San Francisco' brings up about 45 search results when searching German language documents, but pulls up an additional 290,000 results (footnote 1) when using the Yahoo Suche Translator tool" [10]. This seems to be a reasonable success story; however, such examples are not abundant, presumably due to the limited use of CLIR and the lack of research on its utility. Here, we define the utility of information retrieval (IR) as the degree of positive influence that the obtained information has on users' actions or decision-making. To promote the use of CLIR, Oard suggested a 'technology push' through some form of 'requirements pull': 'discussions with companies, qualitative user studies with experts from domains of interest, or rapid prototyping for requirements discovery' [11]. This paper proposes another form of 'push': the presentation of example usages to potential service providers. The difficulty in directing researchers' efforts toward discovering good example applications may arise from the lack of a formal evaluation framework. Therefore, this paper presents a new multi-stage task, named the remarkable search topic-finding task, that focuses on users' search scenarios for CLIR. The aim of this task is not to evaluate IR systems but to mine or generate interesting search scenarios for CLIR use.
For now, we regard the scenario as the search topic. This simplification enables us to set up a task consisting of the following three steps: 1) the extraction, from a given set of topics, of features that influence the utility gap between monolingual and cross-language settings; 2) the prediction of the degree of performance improvement on given topics when CLIR is used, based on the extracted features; and 3) task-oriented pseudo-scenario generation based on the knowledge obtained in the above two steps. These subtasks will be evaluated on different criteria. (Footnote 1: At present, the numbers have changed to 73 in German and 466,120 with CLIR functionality; accessed 17 May 2006.) The rest of the paper is organized as follows: First, the
current CLIR research activities are contrasted as system- and utility-oriented views. Next, we clarify the meaning of utility in CLIR in view of information scarcity in monolingual IR. Then, the proposed research scheme, named the remarkable search topic-finding task in CLIR, is detailed. Finally, issues related to the proposed scheme, including the premises, obstacles, and limitations of our research roadmap, are discussed.
2. RESEARCH ON SYSTEM PERFORMANCE AND UTILITY
We divide the research topics of CLIR into two categories: system performance issues and utility issues. Most CLIR research has focused on the first category, which deals with bringing CLIR system performance as close as possible to the level of monolingual IR systems. The performance of an IR system has many dimensions, such as efficiency, effectiveness, and scalability. Among these, effectiveness (how many relevant documents are retrieved and displayed at the top of the ranked list) is a popular subject of IR research. In the case of CLIR, effectiveness is related to the location and exploitation of language resources for translation. Language resources include dictionaries, thesauri, parallel and comparable corpora, and the Web. The relationship between the availability of these resources and the performance of CLIR systems has been studied extensively. In addition, the problem of resource scarcity in low-density languages or particular language pairs has been addressed. Regarding the effective use of resources, research has been conducted on the construction of corpora, the development of machine translation systems, and the integration of translation and retrieval into a single model. Many technical investigations of CLIR system performance are surveyed in [8]. The utility issue was raised at the 2002 SIGIR workshop 'Cross-Language Information Retrieval: A Research Roadmap' as the most notable obstacle preventing CLIR technology transfer [3]. The primary research methodology for investigating utility is user studies. However, CLIR user studies are fewer in number than monolingual IR user studies. Petrelli et al. carried out a pioneering user study using their CLIR system, 'Clarity' [12].
An interesting finding on users' attitudes toward CLIR systems is that the systems performing closest to monolingual systems are not the ones most preferred by users. This finding supports the importance of studying utility from multiple viewpoints. The gap between technology transfer and current user studies of CLIR utility may stem from the emphasis on supporting users after they begin searching, for example through user interface design or additional search assistance functionalities. In our view, as well as according to such user studies, there is a need for research that facilitates the discovery of appealing example usages in order to motivate people before they begin using CLIR systems. Compared with the rapid advancement of CLIR system performance, the exploration of motivating usage has progressed less rapidly. This difference may be partly due to the lack of a framework for evaluating such knowledge quantitatively. Thus, we consider that the utility issue should be turned into an evaluation task.
3. OVERCOMING THE SCARCITY OF INFORMATION WITH CLIR
The utility of CLIR is better understood when considered from the viewpoint of the limitations of monolingual IR, which suffers from a shortage of relevant documents written in the query language. Some information is best found in languages other than the query language. The scarcity of information in the target collection itself leads to unsatisfactory retrieval results even if the rankings are perfect. Here, the scarcity of information is confined to the number of descriptions or objects. Kando suggested a layered structure of CLIR technologies [6]. The layer we consider in this paper corresponds to the semantic layer: the mapping of concepts. Above the semantic layer lies the pragmatic layer: the differentiation of cultural and social aspects. Inarguably, one of the most significant benefits of CLIR is the access it provides to multiple viewpoints on the same topic, such as an international dispute whose coverage is divided by language. However, a methodology for evaluating information scarcity from such a pragmatic perspective has not yet been well established. At this point, we consider only straightforward semantic scarcity.
4. REMARKABLE SEARCH TOPIC-FINDING TASK
To sum up the above two sections, we need to do two things: First, we should systematically provide good hints on the basis of which service providers and users can consider the potential benefits of CLIR. Second, the creation of these hints should be evaluated quantitatively. The remarkable search topic-finding task is intended to assist the discovery of unexplored usage beyond researchers' natural insight. The task involves the following steps. In the first step, participants are assumed to be taking part in an existing CLIR evaluation campaign that uses a linguistically heterogeneous multilingual target collection. By linguistically heterogeneous, we mean that the availability of information differs significantly among the document languages depending on the subject of the information needed. In addition to the designated system performance reports of the original task, participants report the topics that performed better in the CLIR setting than in the monolingual IR setting, together with the size of the recorded difference. If we use only the number of relevant documents to measure information scarcity, we can skip this process and use known information about the test collection. For example, NTCIR has a list of topics that have fewer than three relevant documents in a particular language [7]. Then, as post-campaign analysis, the participants report their hypotheses regarding the features of topics that give their CLIR systems superiority over monolingual systems. Following this, through a system-wide comparison, we can approximately identify which determining features are system dependent and which are not. For example, the existence of proper names in queries or the length of queries may determine the usefulness of CLIR. Moreover, by considering multiple test collections, we can consider domain and media issues.
For example, cross-collection query analysis may reveal that cross-language image retrieval in the biomedical domain benefits more than other media and domains. To carry out this type of analysis, we need
to have assessments not only of topical relevance but also of other aspects of utility, such as the users' command of languages and typologies of the tasks for which CLIR is used [4]. These factors are considered further in Sections 5.3 and 5.4. The second step is the validation of the features identified in the previous step. In this step, we can set up a new track in which the utility gap between monolingual IR and CLIR is predicted for various topics based on knowledge of the influential features in topics. The participants are provided with training topics, their performance gaps in monolingual and cross-language IR, and a fixed IR system; the systems are then compared on their prediction performance for the utility gaps of unseen test topics. If the use of a certain set of features correlates well with high prediction performance, the knowledge of the features found in the first step can be considered useful. The third step involves the generation of usage scenarios themselves. In this task, participating systems generate example search topics based on combinations of the distinguishing features proven above. Since the generation of holistic stories is not an easy task for machines, scenario generation may take the form of 'fill in the blanks' (e.g. when, where, and what) in search topic description sentences. If additional factors such as domain, language, and media were considered in the first and second steps, these aspects can be added to the scenarios. Human assessors, preferably domain specialists, then judge the generated search topics in terms of relevance and impressiveness. This assessment process may be labour-intensive; thus, some form of automation should be considered. From the high-scoring search topics, example search scenarios with context can be imagined.
An example scenario in which CLIR can help users remarkably may be as follows: a Spanish-speaking doctor seeking X-ray photographs to compare with his or her patient's case uses a query containing the name of the diseased part together with the patient's reported symptoms. It should be noted that the generated scenarios do not necessarily correspond to the real tasks of potential users. The examples are sufficiently good if users can infer their own scenarios while browsing them.
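The mechanics of the first two steps can be illustrated with a small computational sketch. This is not part of the task specification, only a minimal example of what participants would compute: it assumes average precision as the effectiveness measure and a single numeric topic feature (e.g. query length) for the gap predictor; all function names and data are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list given the set of relevant doc ids."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def utility_gap(clir_run, mono_run, relevant):
    """Step 1: per-topic gap; positive values favour the CLIR setting."""
    return (average_precision(clir_run, relevant)
            - average_precision(mono_run, relevant))

def fit_gap_predictor(gaps, feature):
    """Step 2 (sketch): least-squares fit of the utility gap on one topic
    feature, e.g. query length; returns (slope, intercept)."""
    n = len(gaps)
    mx, my = sum(feature) / n, sum(gaps) / n
    sxx = sum((x - mx) ** 2 for x in feature)
    sxy = sum((x - mx) * (y - my) for x, y in zip(feature, gaps))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx
```

In the envisioned track, the predictor would be trained on topics with known gaps and judged by how well it ranks unseen test topics by their predicted gap.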
5. DISCUSSION

5.1 Availability of heterogeneous collections
One problem with the above process is that some test collections used for CLIR research are composed of parallel or comparable corpora. In such a collection, the same information obtainable in one language is also available in the other languages; thus, monolingual IR almost always outperforms CLIR in terms of the number of relevant documents. These collections are useful for evaluating system performance, but they are not suitable for assessing CLIR utility from the viewpoint of information scarcity. There is no known method for converting these collections into linguistically heterogeneous ones. Many naturally arising corpora, such as the Web, contain multilingual documents heterogeneously. The use of snapshots of these dynamic corpora as test collections in a variety of domains may be a possible solution. The Flickr photo collection used in iCLEF 2006 [1] is one such example.
5.2 Problems with the number of topics
The number of topics plays an important role in actualizing the remarkable search topic-finding task. Final scenarios are based on features selected from the topics of the initial CLIR tasks. Therefore, this framework is ineffective unless there is a diverse and sufficient number of topics. To increase the number of topics, we can consider the use of query logs. For example, KDD Cup 2005 used a search engine query log of approximately 800,000 queries [9]. Although the number of usable queries is attractive, queries from search logs lack relevance judgements, and the users' information needs behind the queries are unclear. A second method is to adopt synthetic pseudo-queries generated from a document collection with pseudo-known-item relevance judgements [5]. That is, the source document from which a pseudo-query has been generated is considered the relevant one. Similarly, if we have pairs of titles and document body texts, we can use the titles as pseudo-queries, with some form of term weighting to select query-like terms. Although these methods may enable the generation of a larger number of queries with pseudo-relevance judgements, they evaluate only some aspects of IR systems, and their validity in CLIR is not well understood. Note that the reason we need a large number of topics is that we need to analyze the topics or queries; the relationship between the number of topics and the stability of system performance evaluation is not considered here. We may not need the measures mentioned above as long as the CLIR evaluation campaigns keep growing. In the case of monolingual IR, the ten-year continuation of the TREC evaluation campaign has generated approximately 400 topics [14]. Similarly, as time advances, we may obtain a sufficient number of CLIR topics to analyze.
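The title-based pseudo-query idea above can be sketched as follows. This is a hypothetical illustration, not the method of [5]: it assumes whitespace-tokenized text and uses a simple tf-idf-style weight (our own choice) to select query-like title terms, with the source document serving as the pseudo-relevant item.

```python
import math
from collections import Counter

def pseudo_queries(docs, k=3):
    """Given (title, body) pairs, build a pseudo-query of up to k title terms
    per document, ranked by a tf-idf-style weight computed over the bodies.
    Returns (query_terms, relevant_doc_id) pairs."""
    n = len(docs)
    df = Counter()  # document frequency of each term across bodies
    for _, body in docs:
        df.update(set(body.lower().split()))
    queries = []
    for doc_id, (title, body) in enumerate(docs):
        tf = Counter(body.lower().split())  # term frequency within this body
        weighted = sorted(
            set(title.lower().split()),
            key=lambda t: tf[t] * math.log((n + 1) / (df[t] + 1)),
            reverse=True,
        )
        # the source document is the pseudo-known relevant item
        queries.append((weighted[:k], doc_id))
    return queries
```

Such pseudo-queries come with free relevance judgements, but, as noted above, they test only some aspects of a retrieval system.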
5.3 Language and media
In addition to investigating the language distribution in digital document collections, the language ability of users is an important dimension that should be taken into account. Thus, the outcomes of observational studies of CLIR users are beneficial for the development of the remarkable search topic-finding task. In regions where people are expected to speak more than one language, CLIR systems can have greater utility because there are fewer problems with the presentation of retrieval results. Accordingly, the usage scenarios will have wider applicability. Another dimension that may influence users' CLIR behaviour is the media of the documents. For example, regarding image retrieval, Clough et al. reported that their CLIR system can help users perform bilingual searches as accurately as monolingual searches [2]. They attributed this high performance to the possibility that users do not have to view image captions to judge relevance and that they are willing to view many images. This language-independent characteristic of visual media may widen the applicability of CLIR technologies compared with that of textual media. As the task matures, we can integrate the density of information among languages, the users' language abilities, and the properties of media as tunable parameters of the search topic-finding task.
5.4 Difference in genre and task
Scenarios become more attractive to users if their domains and tasks are specified, because this makes it easier for users to imagine the situation. In addition, we should consider separately those domain-specific tasks in which the criteria of utility differ substantially. One example is monitoring, or distributional filtering, where users want to retrieve information corresponding to their topics of interest in a temporally continuous manner. In some institutional or business intelligence applications, the language with less information (rather than more) may be of importance. For example, accurate detection of the first emergence of documents in a language on a certain topic, for instance a document indicating undesirable actions such as the leaking of confidential information, may be considered helpful. The problem with detailed domains and tasks is that such scenarios appeal only to the assumed group of people, and not to potential new users if the scenarios seem too specific for the latter. Also, if there are too many domain-specific tasks, it is unlikely that each will attract many researchers. Balancing generality and specificity in the task design is an open issue.
5.5 Toward the acquisition of user-generated real example topics
Thus far, we have examined the case in which CLIR systems are not fully exploited by the public or by industry. Once these systems are widely accepted, we should devise new approaches rather than continue with the remarkable search topic-finding task. In the case of established products, attempts have been made to collect usage scenarios in the form of a competition in which the user who provides the most interesting usage scenario receives an award [13]. A similar framework may be feasible within an institute or organization whose members are asked to use CLIR systems on a regular basis. User-generated topics are naturally more diverse and interesting than machine-selected topics; therefore, those scenarios may appeal to a wider range of people. By using the machine-generated pseudo-scenarios as a seed, we can expand the collection of usage examples in this manner.
6. CONCLUSION
The advancement of CLIR performance has been driven by the accumulation of language resources, most notably test collections for system evaluation, and by the development of techniques for their effective use. It is now time to consider how CLIR can be used to enhance access to information. User studies report users' positive feedback on experimental CLIR systems. However, there is a gap between the experimental setting, in which the possible use of CLIR systems is suggested to people, and the in-field setting, where there may be few cues regarding effective CLIR usage. Therefore, we proposed a scheme to determine possible scenarios in which the use of CLIR may bring a remarkable improvement in achieving users' tasks. Once people have obtained some hints regarding the effective use of CLIR, they will continue to use it, and thus CLIR technologies will spread through search services. This paper is conceptual, and many problems need to be addressed in order to realize the proposed scheme. Further, the details of the process need to be concretized. The plausibility of the concept will be examined through further discussion.
7. ACKNOWLEDGMENTS
The comments provided by the anonymous reviewers were helpful, particularly for improving the references. The author also acknowledges the advice and suggestions provided by Noriko Kando on the test collections and the concept of utility in CLIR while composing the final version of this paper.
8. REFERENCES
[1] P. Clough, J. Karlgren, and J. Gonzalo. Multilingual interactive experiments with Flickr. In Proceedings of the Workshop on New Text: Wikis and Blogs and Other Dynamic Text Sources, Trento, Italy, 2006.
[2] P. Clough and M. Sanderson. User experiments with the Eurovision cross-language image retrieval system. Journal of the American Society for Information Science and Technology, 57(5):697–708, 2006.
[3] F. C. Gey, N. Kando, and C. Peters. Cross-language information retrieval: the way ahead. Inf. Process. Manage., 41(3):415–431, 2005.
[4] P. Hansen and J. Karlgren. Effects of foreign language and task scenario on relevance assessment. Journal of Documentation, 61(5):623–639, Oct. 2005.
[5] M. Inoue and N. Ueda. Retrieving lightly annotated images using image similarities. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 1031–1037, March 2005.
[6] N. Kando. Evaluation – the way ahead: A case of the NTCIR. In Cross-Language Information Retrieval: A Research Roadmap, Workshop at SIGIR-2002, pages 72–77, Tampere, Finland, Aug. 15, 2002.
[7] N. Kando. CLIR at NTCIR workshop 3: Cross-language and cross-genre retrieval. In Lecture Notes in Computer Science, volume 2785, pages 485–504. Springer, Sept. 2003.
[8] K. Kishida. Technical issues of cross-language information retrieval: a review. Inf. Process. Manage., 41(3):433–455, 2005.
[9] Y. Li, Z. Zheng, and H. Dai. KDD Cup-2005 report: Facing a great challenge. SIGKDD Explorations, 7(2):91–99, Dec. 2005.
[10] E. Mills. Parlez vous deutsch. CNET News (http://news.com.com/2100-1038_3-5787824.html), July 14, 2005.
[11] D. W. Oard. When you come to a fork in the road, take it: Multiple futures for CLIR research. In Cross-Language Information Retrieval: A Research Roadmap, Workshop at SIGIR-2002, 2002.
[12] D. Petrelli, S. Levin, M. Beaulieu, and M. Sanderson. Which user interaction for cross-language information retrieval? Design issues and reflections. Journal of the American Society for Information Science and Technology, 57(5):709–722, 2006.
[13] F. T. Piller and D. Walcher. Toolkits for idea competitions: a novel method to integrate users in new product development. R&D Management, 36(3):307–318, 2006.
[14] E. M. Voorhees and D. K. Harman, editors. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Sept. 2005.