indicator of the document's contents; the indexing assumption. ..... by examining fewer search results is a powerful demonstration of I-SPY's enhanced search.
QUERY-BASED INDEXING IN COLLAORATIVE SEARCH Jill Freyne & Barry Smyth Computer Science Department, University College Dublin, Belfield Dublin 4 {firstname.lastname}@ucd.ie
ABSTRACT The success of Web search is often limited by a variety of factors. Typical queries are vague and imprecise. At the same time, the Web is a dynamic and unmoderated collection and term-based indexing techniques often fail to produce high quality indices. As a result there is often a significant gap between the query-space and the document-space and the terms used to query a document may vary from those that have been used to index it. In this paper we describe the I-SPY collaborative search engine. It uses information gleaned from the query-space as a supplementary index with which to retrieve documents. We show how prior user selection histories can be used to estimate the relevance of documents to queries, and how this can be used to re-rank documents during search. The results of a live-user trial are presented to demonstrate the potential benefits of I-SPY when compared to more traditional search techniques. KEYWORDS Web intelligence, search, context, relevance, personalization.
1. INTRODUCTION Most information retrieval systems generally assume that the terms used in a document are a good indicator of the document's contents; the indexing assumption. A second common assumption, which might be called the correspondence assumption, assumes that there is a close correspondence between the terms used in documents and the terms used in the queries designed to retrieve these documents. When these assumptions hold, information retrieval techniques provide for an efficient way to access relevant documents. However, when either of these assumptions breaks down, retrieval success can no longer be guaranteed. We believe that these assumptions are often strained in one of the most popular and important applications of information retrieval, namely Web search. For certain, the indexing assumption is limited in the context of the non-textual document sources that are becoming commonplace on the Web (e.g. graphics, photos, audio and video etc.) and specialized techniques are now being deployed to provide for new approaches to indexing and retrieving multimedia content. At the same time the correspondence assumption is also more brittle than in traditional information retrieval scenarios. This is due to the dynamic and unmoderated nature of the Web, and the limited retrieval expertise of a typical Web searcher – poorly indexed documents, changing trends and fashions, and under-specified queries are all significant obstacles in the way of successful Web search. Conceptually we can view this as a lack of correspondence between the document-space and the query-space (Bollmann-Sdorra and Raghavan, 1993) and in this paper we argue the need for Web search engines to attempt to bridge the gap that might exist between these spaces. We propose that the familiar static document-index should be supplemented by a more dynamic and flexible query-index. We describe and evaluate the I-SPY meta-search engine which uses a technique called collaborative search to automatically construct such a query-index by mining the past search behaviour of a community of users. This index records the query terms used and the selections that resulted and can be leveraged to estimate the relevance of a document to a query, based on its selection frequency for that query. Moreover, I-SPY is capable of maintaining different query-indices for different communities of users so that different underlying document-query space mappings can be learned. For example, the query ‘jaguar’ from a computer science related site is likely to refer to the Apple operating system. The same query from a wildlife site is likely to refer to the wildcat, and the same query from a motoring site will probably refer to the brand of car. The
essential point is that by developing a community-specific query-index, I-SPY is better able to cope with the type of vague queries that are commonplace on the Web by providing an implicit context for these queries.
2. BACKGROUND Our approach to Web search is motivated by the work of a number of researchers, which highlight the potential gap that exists between query and index terms (Cui et al., 2002, Furnas et al., 1987, BollmannSdorra and Raghavan, 1993). For example, the work of (Furnas et al., 1987) highlights the surprising degree of variety with which people refer to the same thing and leads to a worrying inference for search engines: that query terms may not reliably be found as index terms for relevant documents. As (Bollmann-Sdorra and Raghavan, 1993) argues, the structure of the query-space can be so different from the structure of the document-space that it makes sense to consider documents and queries to be elements from different term spaces. As a result a number of researchers have begun to explore the possibility of leveraging knowledge about the query-space to improve search performance. For instance, (Raghavan and Sever, 1995) propose the reuse of past optimal queries, to either respond to new queries or to help formulate optimal queries. A related idea has been subsequently exploited in Web search. For instance, (Fitzpatrick and Dent, 1997) describes how the use of past queries can improve automatic query expansion by using automatic feedback from the top documents returned in a result-list. (Glance, 2001) describes a related community search assistant, which enables communities of searchers to search in a collaborative fashion by using a form of query recommendation. (Wen et al., 2002) describe how query logs can be mined to discover relevancy information that can be used during query clustering. Our approach to Web search, collaborative search, also attempts to leverage knowledge about the queryspace. By monitoring user selections for a query we build a model of query-page relevance based on the probability that a given page pj will be selected by a user when returned as a result for query qi. In addition, we observe that specialised search engines attract communities of users with similar information needs and so serve as a useful way to limit variations in search context. For example, a search field on an AI Web site is likely to attract queries with a computer-related theme, and queries such as ‘cbr’ are more likely to relate to Case-Based Reasoning than to the Central Bank of Russia. The collaborative search approach combines both of these ideas in the form of a meta-search engine that analyses the patterns of queries and user selections from a given search interface serving a well-defined search context. The resulting selection patterns can be used to capture the relationship between queries and documents in a given search context and so serves to bridge any gap that might exist between the query and document space.
3. THE I-SPY SYSTEM ARCHITECTURE The I-SPY collaborative search architecture is presented in Figure 1. It presents a meta-search framework in which each user query, qT, is submitted to base-level search engines (S1-Sn) after formatting qT for each Si using the appropriate adapter. Similarly, the result set, Ri, returned by a particular Si is re-formatted for use by I-SPY to produce R'i, which can then be combined and re-ranked by I-SPY, just like a traditional meta-search engine. I-SPY's key innovation involves the capture of search histories and their use in ranking metrics that reflect user behaviour. The unique feature of I-SPY is its ability to personalize, search results for communities of users without relying on content-analysis techniques (Bradley et al, 2000). I-SPY borrows ideas from collaborative filtering research to profile the search experiences of users. These methods exploit a graded mapping between users and items and I-SPY exploits a similar relationship between queries and result pages (Web pages, images, audio files etc.). This relationship is captured as a hit matrix (see Figure 1). Each element of the hit matrix, Hij, contains a value vij (that is, Hij = vij) to indicate that vij users have found page pj relevant for query qi. In other words, each time a user selects a page pj for a query term qi, I-SPY updates the hit matrix accordingly.
Figure 1: I-SPY system Architecture
3.1 Collaborative Ranking I-SPY's key innovation is its ability to exploit the hit matrix as a direct source of relevancy information --after all, its entries reflect concrete relevancy judgments by users with respect to query-page mappings---that helps to bridge the gap between the query-space and the document-space. Most search engines, on the other hand, rely on indirect relevancy judgments based on overlaps between query and page terms, but I-SPY has access to the fact that, historically, vij users have selected page pj, when it is retrieved for query qi. I-SPY uses this information in many ways, but in particular the relevancy of a page pj to query qi is estimated by the probability that pj will be selected for query qi (see Equation 1)
relevance = ( p j , qi ) =
H ij
∑
∀j
H ij
(1)
It is worth noting that the above allows I-SPY to compute relevancy values for pages that have been selected for previous occurrences of a query. Obviously this limits I-SPY's collaborative ranking to new queries that are duplicates of those that have occurred previously. We will return to this point later and argue that while duplicate queries may be rare in generic Web search (