Using Query Context Models to Construct Topical Search Engines

Parikshit Sondhi
Univ. of Illinois at Urbana-Champaign, Department of Computer Science, 201 N. Goodwin Ave., Urbana, IL
[email protected]

Raman Chandrasekar
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
[email protected]

Robert Rounthwaite
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
[email protected]

ABSTRACT Today, if a website owner or blogger wants to provide a search interface on their web site, they have essentially two options: web search or site search. Site search is often too narrow and web search often too broad. We propose a context-specific alternative: the use of ‘topical search engines’ (TopS) providing results focused on a specific topic determined by the site owner. For example a photography blog could offer a search interface focused on photography. In this paper, we describe a promising new approach to easily create such topical search engines with minimal manual effort. In our approach, whenever we have enough contextual information, we alter ambiguous topic related queries issued to a generic search engine by adding contextual keywords derived from (topic-specific) query logs; the altered queries help focus the search engine’s results to the specific topic of interest. Our solution is deployed as a query wrapper, requiring no change in the underlying search engine. We present techniques to automatically extract queries related to a topic from a web click graph, identify suitable query contexts from these topical queries, and use these contexts to alter queries that are ambiguous or under-specified. We present statistics on three topical search engine prototypes we created. We then describe an evaluation study with the prototypes we developed in the areas of photography and automobiles. We conducted three tests comparing these prototypes to baseline engines with and without fixed query refinements. In each test, we obtained preference judgments from over a hundred participants. Users showed a strong preference for TopS prototypes in all three tests, with statistically significant preference differences ranging from 16% to 42%.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval] Information Search and Retrieval – information filtering, query formulation, search process

General Terms: Algorithms, Measurement, Experimentation.

Keywords: topic focused search, subwebs, query context, lexical generality, click graph

1. INTRODUCTION
Consider the following scenario: Bill owns a digital photography website, offering reviews of digital cameras, accessories and equipment, tips & tricks, etc. He feels that adding a specialized, photography-specific search engine will attract more visitors to his site. Today, he has to choose between offering either a general web search, or a custom site search restricted to a selected set of sites. When a query is matched against the plethora of documents on the internet, results from a general search engine may be too broad and may be irrelevant to the user's intent. On the other hand, custom site search services require users to compile and maintain lengthy URL lists. Given the dynamic nature of the web, it is impractical for most users to maintain such comprehensive, up-to-date lists. The alternate approach of creating a separate topic-specific document collection is expensive both in terms of resources and time, and is not a viable option for most users.

In the example above, what Bill really wants is a topic-specific search engine, which can provide general web results focused on a specific topic. Thus, for example, when a user searches for the query [Evolt], she should only get web results about Olympus Evolt (cameras) and not about other interpretations such as the Evolt web development community. We term such specialized search engines "topical search engines" or TopS. The applications of such engines are likely to be numerous. Website and blog owners can build their own search engines for niche areas, based on generic search engines. Each such topical search engine becomes an entry point for the generic search engine. Thus end users, web site owners and search engines are all likely to benefit from them.

In this paper we describe a promising approach to easily create such light-weight topical search engines, using an underlying web search infrastructure with minimal overhead. Our approach involves disambiguating topic-related queries by adding suitable contextual keywords, without changing query intent. The research questions we investigate are:


1. Given a generic web search engine, is it feasible to automatically create topical search engines over it, with minimal manual input?

2. Does our proposed method of biasing queries towards a particular topic by adding contextual keywords make search results better than those obtained from a generic search engine?

The paper is divided into six sections. In the next section we provide an overview of current solutions to the problem and their limitations. In Section 3, we outline the intuition behind our approach and describe the overall TopS framework. In Section 4 we discuss our method in detail. In Section 5, we present statistics from the development of our topical search engine prototypes. In addition, we describe an evaluation study conducted using three TopS prototypes, two in the area of Photography and one on Automobiles. Results show a strong preference for our prototypes with preference differences ranging from 16% to 42%. We conclude in Section 6 with a discussion of our results and a look at future work.


2. RELATED WORK

Personalized search engines (see for example [22]) aim at identifying results especially relevant to an individual, while web search tries to satisfy all (or most) people. Topical search is somewhere in-between the two, as it tries to be relevant to people interested in a topic.

There has been a lot of interest in creating web search engines that are specific to a topic or a set of topics. These include engines restricted to searching one or more sites, engines where web communities are discovered using link analysis, engines that search pages identified as being topic-relevant, and those that re-rank results based on the potential relevance of URLs to a topic. We present a sample of such work here.

'Site search' is typically realized using enterprise/custom search engines or by restricting web search results to one or more sites. You can get this behavior using the "site:" query operator on major search engines, as in [flash site:dpreview.com], which gets all web pages from dpreview.com that match the term [flash]. Site search results tend to be high precision but low recall. Engines such as the Google Custom Search Engine (CSE) [10] or Rollyo [19] permit users to specify a set of relevant web sites, and filter the result sets to return only pages from these sites. CSE also allows fixed keywords to be added to query terms before a search is issued. As discussed already, it is impractical for most users to compile and maintain comprehensive URL lists.

There has been work to improve search by identifying relevant web communities [8] using link analysis [14]. Bharat and Henzinger [2] improve upon this using topics generated from single queries. Chakrabarti et al. [3] use focused crawling to identify pages relevant to a set of topics, and restrict search to these pages. Glover et al. [9] have used query expansion and structural methods to learn about specific topic domains, while Chang et al. [5] use relevance feedback to create authority lists.

I-SPY [21] uses the notion that queries are repeated and users tend to select similar responses to queries. They describe a collaborative approach to search that uses search histories from communities of users interested in the same topics, and use these histories to re-rank search results. With some use, the ranking here can adapt to be relevant to these communities. The process takes time and works better when these communities use similar queries.

Some engines have their own specialized crawlers and information extraction components, as in DEADLINER [16] for information about conferences and workshops and CiteSeer [17] for scientific literature search. Haveliwala [12] has proposed topic-specific PageRank, where each page is assigned multiple importance scores, one for each topic; at query time, these scores are used with other ranking scores to produce a topic-biased rank. Kraft et al. [15] use context in search, including the web page or file the user is currently on. We expect topical search to be less dependent on specific context.

Chandrasekar et al [4] propose a method to specialize search by reranking results obtained from a generic search engine. They describe a method based on semi-automatically created ‘subwebs’. Subwebs are lists of URLs relevant to the topic, along with a weight indicating the relative importance of the URL within that topic. Reranking is based on using the rank on the result page, along with the subweb weights of the URLs present in the result list. Their system works well for queries which have some ambiguity. But because the subweb is of a limited size, their reranking may not work well on extremely specific queries.

The methods listed here are not suited for our problem setting, because they are limited in some way, are far too specific or require extensive (manual) effort to create topical search engines.

3. INTUITION BEHIND OUR APPROACH
A topical query performs badly on general search engines in large part due to ambiguity. Since general search engines are optimized for result diversity, for any ambiguous query the top results will likely span multiple topics. This reduces the number of results that are useful to the user. Suppose a user submits the query [EOS], hoping to get results for Canon EOS cameras. When we try this query on the top three general search engines in the US today, we get only one or two useful results on Canon EOS in the top ten. The rest are related to other senses of the word, such as Eos the goddess, NASA EOS, Volkswagen EOS, etc.

In order to realize topic-focused search, we need to identify and disambiguate such queries. However, it is not trivial to decide whether a query is ambiguous. Song et al. [20] define ambiguity and propose a supervised learning approach to identify ambiguous queries, but their method requires a labeled training set not available in our setting. An alternate approach may be to predict query performance and only alter queries likely to perform badly. Cronen-Townsend et al. showed that the clarity score [7] of a query is directly correlated with its performance (specifically, mean average precision). Amati et al. introduced the notion of query difficulty [1]. However, these are post-retrieval methods and require relevance scores to be calculated before performance prediction can be performed. This is impractical for large web-scale document collections. He and Ounis [13] proposed a set of pre-retrieval predictors which can be evaluated based solely on collection statistics and do not require calculating relevance scores. Hauff et al. [11] give a good overview of several such systems. Other recent works include Larson and Rijke [18] and Zhao et al. [27]. A problem with such predictors is that while their utility is measured by estimating correlation with statistical performance measures such as mean average precision, they are not always indicative of user performance. Turpin and Hersh [25] found that clarity scores are largely uncorrelated with user performance. Studies by Turpin and Scholer [26] and Turpin and Hersh [24] have also shown that statistical evaluation measures themselves may not correlate with user performance.

Hence, given our problem setting where the focus is on a specific topic, it is not clear whether query performance predictors would be useful. So we pursue a different approach that obviates the problem of identifying ambiguity altogether. Specifically, we check whether a query is topic-related, rather than whether it is ambiguous. For any topic-related query we find, we focus it towards documents of a specific topic by adding appropriate context keywords, so that the results are restricted to the topic of interest. The goal is to perform context addition only when we are confident that it will not change the original query intent. This is the intuition behind our approach.

For example, in our approach, on adding the keyword "Canon" to [EOS], all top ten results become photography related, without altering the original query intent (Figure 1). For a different query, something else may need to be added (e.g. "Sony" for the query [alpha]). As our results will show, adding generic keywords like "camera" or "photography" to every query does not work as well – not every site related to photography includes these words, and adding them might actually decrease the relevance of the results returned.

Figure 2 gives a general idea of our TopS approach to topical search. The end user issues a query Q, hoping to get results related to topic T. The query is altered to Q' = Q + C by the topical wrapper for T, which adds a set of contextual keywords C; the altered query is sent to a general search engine, which then returns the results. The goal is to achieve an alteration Q → Q' such that the results retrieved for Q' are highly relevant to Q in the context of the topic T.

Query Q may either be ambiguous and topic-related, or unambiguous and topic-related. If Q is ambiguous and topic-related, it gets disambiguated by context addition. If Q is unambiguous and topic-related, then the addition of context does not change its intent. If we do not have enough contextual information, we do not alter the query. We call this our "Do no harm" principle, and it keeps us from hurting the performance of queries (more details on this in Section 4.4). Thus we augment ambiguous queries without actually having to classify them as ambiguous or unambiguous. Also, owing to our "Do no harm" principle, non-topic-related queries remain unaltered.

Figure 1. Sample result page. Results from our baseline search engine (a general search engine) are on the right, and results from the topical search engine on the left.

Figure 2: The TopS approach to topical search. The end user's query Q goes to the topical wrapper for T (part of the topical search infrastructure), which sends the altered query Q' to the generic search engine; the generic results for Q' are returned to the user as topical results for Q.
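To make the wrapper flow in Figure 2 concrete, a minimal Python sketch is given below. The context-list structure, the max_context cap and the issue_query callback are illustrative assumptions for this sketch, not a description of the deployed system.

# Minimal sketch of the topical wrapper of Figure 2 (illustrative only).
# context_list maps query n-grams to {context keyword: confidence}, as in Figure 3;
# issue_query stands in for whatever generic search API is available.

def topical_wrapper(query, context_list, issue_query, max_context=4):
    contexts = context_list.get(query.lower(), {})
    if not contexts:                      # "Do no harm": no contextual information, no change
        return issue_query(query)
    chosen = sorted(contexts, key=contexts.get, reverse=True)[:max_context]
    altered = query + " " + " ".join(chosen)          # Q' = Q + C
    return issue_query(altered)

# Example with the Figure 3 contexts:
#   topical_wrapper("evolt", {"evolt": {"olympus": 0.9, "camera": 0.7}}, print)
#   issues the altered query "evolt olympus camera".

The remainder of this section describes how the context list that drives such a wrapper is built.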

4. CONTEXTUAL QUERY ALTERATIONS Query alterations are performed with the help of a context list which stores potential context keywords for a large number of topic relevant query n-grams. In this section we will describe how this list is generated for any given topic. In the following discussion, we use the term keyword to refer to a unigram.


Figure 3 shows a sample set of contexts for photography. Here, for example, "olympus" is a potential context for "evolt" with confidence 0.9, and "camera" with 0.7. "Alpha" is disambiguated with "Sony". Note that some terms which are reasonably unambiguous (like "nikkor") still have a context defined; this does not change the intent of the keyword.

Evolt      Olympus:0.9, camera:0.7, digital:0.5
Alpha      Sony:0.95
Focus      camera:0.85
Powershot  Canon:0.9
Nikkor     Nikon:0.7
Sandisk    memory:0.9, card:0.8

Figure 3: Sample contexts for Photography

The generation of the context list is an offline operation performed once per topic, when the topical search engine is created. It includes the following steps:

1. A large set of topic-related queries and URLs, termed the subweb, is identified.

2. For each query n-gram (unigram/bigram/trigram) in the subweb, co-occurring keywords are identified.

3. For each n-gram, keywords that satisfy certain constraints are chosen as contexts.

The context list may need to be updated periodically, but this requires little or no manual effort. Also, since domain related keywords and their relationships are relatively stable, the updates need not be as frequent as for a URL list. We now present the details of the process of context list generation and query alteration, after defining some terms we will use.



4.1 Definitions

Subweb: Subwebs were first defined in [4]. We extend that definition so that a subweb is now considered a collection of domain-specific URLs and queries, with each unique URL and query having a corresponding domain relevance weight.

Click Graph: A click graph is a bipartite graph of users' queries and the URLs they clicked on, represented as a set of triples <q, u, c>. This triple is interpreted to mean that URL u was clicked c times by users when they issued a query q. A high value of c is indicative of stronger relevance of u to q. We used a click graph aggregated from 18 months of logs from a web search engine, where each URL had at least five clicks. The links between queries and URLs are used to create the subweb, and the click values are used to create the context list from the subweb.

Note that we do not expect each topical search engine creator to have direct access to the click log of a general web search engine. Ideally, in a production setting, the relevant portions of this data will be made available to search engine creators through a web service or similar mechanism, with appropriate safeguards.
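As an illustration of how such triples might be held in memory for the steps that follow, consider the Python sketch below; the tab-separated file layout and the per-pair click filter are assumptions made purely for the example.

# Sketch: read <query, url, clicks> triples into the two maps used by later steps.
from collections import defaultdict

def load_click_graph(path, min_clicks=5):
    urls_for_query = defaultdict(dict)    # query -> {url: clicks}
    queries_for_url = defaultdict(dict)   # url   -> {query: clicks}
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, url, clicks = line.rstrip("\n").split("\t")
            # crude filter; the paper's graph kept URLs with at least five clicks
            if int(clicks) >= min_clicks:
                urls_for_query[query][url] = int(clicks)
                queries_for_url[url][query] = int(clicks)
    return urls_for_query, queries_for_url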

4.2 Building a Subweb
In this first step, we identify URLs and queries relevant to the topic of interest. The algorithm requires the creator of a topical search engine (e.g. a blog owner) to provide a small set of highly domain-relevant 'seed' websites (typically 5-10). The pseudocode to construct the subweb from this seed set is given below:

1. The topical search engine creator provides a small initial set of authoritative and highly topic-relevant websites S. It is critical to ensure that the seed websites are highly topic-relevant, to avoid including off-topic queries and URLs.

2. We extract all query-URL pairs from the click-graph where the URL is one of the sites in S, or where the URL is a subsite of one of the sites in S. Let Q0 be the set of queries extracted thus.

3. For each query q in Q0, we assign it an initial weight equal to the number of sites in S where q occurs in the click-graph.

4. We now alternately find URLs linked to the queries we identified and queries linked to the URLs we identified. In each step, we use a weight threshold to take only the best URLs and queries. Repeat for iteration i = 1…N:
   • Find URLs for all queries Q' that have not been examined before, and select ones that are above a weight threshold ϕ. If there are no such queries, exit. Add Q' to SubwebQueries.
   • Update the weights of each of the new URLs identified with the sum (over all queries) of the average weights of the queries. The average weight of a query is its weight divided by the number of selected URLs linked to it.
   • Similarly, find queries for all URLs U' that have not been examined before, and which have a weight greater than θ. Exit if there are no such URLs. Add all the URLs U' to SubwebUrls.
   • Now update the weight of the new queries identified with the sum (over all URLs) of the average weights of the URLs. The average weight of a URL is its weight divided by the number of selected queries linked to it.

5. Finally, select all URLs from SubwebUrls with weight greater than a threshold α and their corresponding queries from SubwebQueries. This set of queries and URLs, along with their weights, forms the subweb.

At the end of this process, the URLs identified in the subweb form a large but not comprehensive list of web pages for the given topic. However, our method does not require the subweb to be complete. Instead, it only needs the subweb to contain a sizeable proportion of topic-related URLs and queries. This use of the click graph is similar in spirit to work by Craswell and Szummer [6].

The two parameters ϕ and θ control the precision and recall of the subweb being constructed. Low values are likely to bring in non-domain query/URL pairs, while high values are likely to miss some of the domain query/URL pairs. We leave these values low (both empirically set at 0.01) for all our experiments, to allow for more accurate weighting of queries/URLs; once the subweb has been constructed, we use a high cut-off (parameter α) to select the final URLs and queries, depending on the requirements of the task.
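The Python sketch below approximates steps 1-5 under the assumptions just noted (prefix matching as a crude "subsite of a seed" test, and the two click-graph maps from Section 4.1); the paper's exact bookkeeping may differ in detail, so treat this as an approximation rather than the reference implementation.

# Sketch of subweb construction (Section 4.2).
from collections import defaultdict

def build_subweb(seeds, urls_for_query, queries_for_url,
                 phi=0.01, theta=0.01, alpha=1.0, max_iters=5):
    # Steps 1-3: seed queries, weighted by the number of seed sites they click into.
    query_seeds = defaultdict(set)
    for url, qs in queries_for_url.items():
        seed = next((s for s in seeds if url.startswith(s)), None)   # crude "subsite" test
        if seed:
            for q in qs:
                query_seeds[q].add(seed)
    query_w = {q: len(s) for q, s in query_seeds.items()}
    url_w, explored_q, explored_u = {}, set(), set()
    unexplored_q = set(query_w)

    # Step 4: alternately expand URLs from queries and queries from URLs.
    for _ in range(max_iters):
        frontier_q = {q for q in unexplored_q if query_w[q] > phi}
        if not frontier_q:
            break
        explored_q |= frontier_q
        for q in frontier_q:                      # spread each query's weight over its URLs
            avg = query_w[q] / max(len(urls_for_query.get(q, {})), 1)
            for u in urls_for_query.get(q, {}):
                url_w[u] = url_w.get(u, 0.0) + avg
        frontier_u = {u for u, w in url_w.items() if u not in explored_u and w > theta}
        if not frontier_u:
            break
        explored_u |= frontier_u
        unexplored_q = set()
        for u in frontier_u:                      # spread each URL's weight over its queries
            avg = url_w[u] / max(len(queries_for_url.get(u, {})), 1)
            for q in queries_for_url.get(u, {}):
                query_w[q] = query_w.get(q, 0.0) + avg
                if q not in explored_q:
                    unexplored_q.add(q)

    # Step 5: keep URLs above alpha, together with their queries and weights.
    subweb_urls = {u: w for u, w in url_w.items() if w > alpha}
    return subweb_urls, query_w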

4.3 Deriving Keyword Co-occurrence Scores
A sample hypothetical subweb for photography with only two URLs is shown in Figure 4. For each URL, we have a set of queries, and for each query-URL pair we have the number of clicks observed. To calculate the co-occurrence scores between keywords, we use an idea similar to the bag-of-words approach used for text documents. We treat each URL as a pseudo-document and all the corresponding query keywords as words in the pseudo-document. The frequency of an n-gram is considered equal to the sum of all the clicks it has received. For example, as can be seen from Figure 4, the frequency of the unigram "Olympus" in URL1 is 10+10+5=25. For ease of reference we will use the terms URL and pseudo-document interchangeably.

An n-gram n1 and a related keyword c1 are said to co-occur if they appear together in the same pseudo-document. The frequency of co-occurrence between n1 and c1 in a pseudo-document d is given by the lesser of the two frequencies:

Freq_d(n1 ∩ c1) = min(Freq_d(n1), Freq_d(c1))

The total frequency over the subweb is calculated by summing the individual frequencies over all documents:

Freq_Subweb(n1 ∩ c1) = Σ_{d ∈ all subweb URLs} Freq_d(n1 ∩ c1)

For instance, in our sample subweb, the co-occurrence frequency of the two unigrams 'E300' and 'Olympus' is given by:

Freq_Subweb(E300 ∩ Olympus) = Freq_URL1(E300 ∩ Olympus) + Freq_URL2(E300 ∩ Olympus)
= min(Freq_URL1(E300), Freq_URL1(Olympus)) + min(Freq_URL2(E300), Freq_URL2(Olympus))
= min(10, 25) + min(0, 15) = 10

URL1: http://www.dpreview.com/olympus300
  Olympus 300            10
  E300                   10
  Olympus Camera          5
  Olympus models review  10

URL2: http://www.dpreview.com/olympus
  Olympus Reviews        10
  Olympus models review   5

Figure 4. Sample photography subweb

We use pseudo-document co-occurrence instead of query co-occurrence as it helps in finding relationships between keywords that do not appear with high enough frequency in the same query. Thus, as in the above example, even if "E300" and "Olympus" never appear in the same query, they still have a non-zero co-occurrence score. The same approach can be applied to n-grams of higher order. For example:

Freq_URL1(E300 ∩ Olympus Camera) = min(10, 5) = 5

Since n-gram frequencies vary a lot within the subweb, we measure the strength of the relationship between n1 and c1, CScore(c1, n1), by normalizing the co-occurrence frequency by the n-gram frequency:

CScore(c1, n1) = Freq_Subweb(c1 ∩ n1) / Freq_Subweb(n1)

Intuitively this means that c1 would be strongly related to n1 if it appears with a large percentage of n1’s occurrences. Alternatively CScore(c1,n1) can be viewed as the maximum likelihood estimate of the conditional probability P(c1|n1). That is, it measures the probability of the keyword c1 co-occurring in a pseudo-document given the observation of n1 in the pseudo-document.
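A small Python sketch of this computation is given below, assuming each subweb URL has already been reduced to an n-gram frequency map (clicks summed as in Figure 4); the function names and the pseudo_docs layout are ours, chosen for illustration.

# Sketch of Section 4.3: pseudo-document co-occurrence and CScore.
# pseudo_docs maps each subweb URL to {ngram: click-weighted frequency},
# e.g. URL1 -> {"olympus": 25, "e300": 10, "olympus camera": 5, ...}.

def cooccurrence(pseudo_docs, ngram, keyword):
    # Freq_d(n ∩ c) = min(Freq_d(n), Freq_d(c)), summed over all pseudo-documents.
    return sum(min(doc.get(ngram, 0), doc.get(keyword, 0)) for doc in pseudo_docs.values())

def ngram_frequency(pseudo_docs, ngram):
    return sum(doc.get(ngram, 0) for doc in pseudo_docs.values())

def cscore(pseudo_docs, keyword, ngram):
    # CScore(c, n) = Freq_Subweb(c ∩ n) / Freq_Subweb(n), an estimate of P(c | n).
    freq_n = ngram_frequency(pseudo_docs, ngram)
    return cooccurrence(pseudo_docs, ngram, keyword) / freq_n if freq_n else 0.0

# With the Figure 4 data, cooccurrence(docs, "e300", "olympus") = min(10,25) + min(0,15) = 10.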

4.4 Constructing the Context List
The next step is to create the context list. Recall that a context list stores potential contexts for a large number of topic-relevant n-grams. To do this, we first define the concept of lexical generality. Given a set of (topical) queries, we define the lexical generality (LG) of a keyword as the number of other unique keywords it appears with in queries. The lexical generality of an n-gram is then considered equal to the lexical generality score of its most general keyword. For instance, for the set of three queries ['olympus model review', 'olympus review', 'olympus camera model'], LG(olympus) = 3, LG(model) = 3, LG(camera) = 2, LG(review) = 2, and LG(olympus model review) = LG(olympus) = 3.

If the chosen query set is topic-specific, the lexical generality scores enforce a partial order on all the keywords based on their 'importance' in the domain. This property is critical in selecting the context keywords. Figure 5 shows some keywords from the photography subweb with the highest lexical generality scores.

Camera        1609
Digital       1453
Canon         1163
Sony           913
Best           843
Nikon          776
Buy            691
Review         682
Photo          401
Panasonic      331
Olympus        328
Lens           316
Kodak          304
Photography    292

Figure 5. A sample of photography keywords along with their lexical generality scores

An alternate approach to defining lexical generality could be to use the subweb frequency (Freq_Subweb) of a keyword instead of the definition above. However, empirical evidence suggests that subweb frequency is more representative of the 'popularity' rather than the 'generality' of a keyword. For example, users query a lot more for keywords like 'sony' or 'canon' than for the keyword 'camera'. Thus an ordering based on subweb frequency would make 'sony' and 'canon' more general than 'camera', which is clearly undesirable.

Given the simple case of a single-keyword query q and a set of context keywords C, we define some constraints on C. These constraints essentially encode our "Do no harm" principle and tell us when we can confidently add context without hurting query intent.

1. For every context keyword c ∈ C, LG(c) > LG(q).
2. |C| ≤ MAXC, the maximum number of context keywords to add.
3. One of the following holds:
   a. Every c ∈ C has CScore(c, q) ≥ HighC.
   b. LG(q) < HighG, every c ∈ C has HighC ≥ CScore(c, q) ≥ LowC, and |C| = MAXC.

The first constraint prevents a query from being over-specified. For example, "canon" and "powershot" are both highly correlated, and LG("canon") > LG("powershot"). Thus we can add "canon" as context to "powershot" without changing its intent, but we should not add "powershot" to "canon".

The second constraint prevents us from adding too much context, since this will eventually hurt result relevance.

The third constraint ensures that the query is shifted towards a large chunk of topic-related documents. Recall from our definition of co-occurrence scores in Section 4.3 that a high score between a query keyword q and a context keyword c means that c occurs in many of the topic-related documents in which q occurs. Constraint 3a asserts that even a single high-scoring context keyword is enough to shift the query sufficiently; in this case, we do not need to add any moderately scoring keywords. On the other hand, Constraint 3b says that if no high-scoring keywords are available, and q does not have a high lexical generality score, then we either add MAXC moderately related keywords or we do not add any keywords at all. This is required because for keywords with low lexical generality, a single moderately related keyword will not be sufficient to shift the query towards a sizeable percentage of all its topic-related documents; we would need to add multiple moderately related keywords. We fix this number at exactly MAXC, which is typically set to 4. Keywords with a high generality score are likely related to a large number of documents in the subweb; hence if a highly correlated context is absent, in line with our 'Do no harm' principle, we do not add any additional context.

We use the above constraints to find potential contexts for all unigrams, bigrams and trigrams in the subweb, and these constitute our context list (recall LG(k1 k2 k3) = max(LG(k1), LG(k2), LG(k3))). The values of the bounds for high and low co-occurrence, HighC and LowC, and the bound for a high generality score, HighG, can be set based on application requirements; we set them to 0.75, 0.10 and 50 respectively for all our experiments, regardless of the topic. This means that for a keyword 'c' to be added as context to any keyword 'q', 'c' must co-occur in at least 75% of the documents in which 'q' occurs. Alternatively, if 'q' has a generality score lower than 50 and there is no 'c' with co-occurrence over 75%, we add multiple contexts with at least 10% co-occurrence. These parameters essentially allow us to decide when we have sufficient information to confidently add contextual keywords. Since constructing a labeled training set for multiple topics is expensive both in terms of time and labor, it was not possible for us to explore the impact of these parameters on performance, and we had to set them empirically. We plan to explore this aspect in future work.
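A Python sketch of how the lexical generality scores and the constraints above might be combined to pick contexts for a single keyword follows; the function signatures and the candidate ordering are illustrative choices under the stated parameter values, not the paper's exact procedure.

# Sketch of Section 4.4: lexical generality and constraint-based context selection.
from collections import defaultdict

def lexical_generality(queries):
    # LG(w) = number of other unique keywords w appears with across the query set.
    cooccurring = defaultdict(set)
    for q in queries:
        words = q.lower().split()
        for w in words:
            cooccurring[w].update(x for x in words if x != w)
    return {w: len(others) for w, others in cooccurring.items()}

def contexts_for(q, candidates, lg, cscore, MAXC=4, HighC=0.75, LowC=0.10, HighG=50):
    # Constraint 1: only keywords more general than q may serve as context.
    pool = [c for c in candidates if lg.get(c, 0) > lg.get(q, 0)]
    high = [c for c in pool if cscore(c, q) >= HighC]
    if high:                                   # Constraint 3a: strong contexts suffice
        return sorted(high, key=lambda c: cscore(c, q), reverse=True)[:MAXC]
    if lg.get(q, 0) < HighG:                   # Constraint 3b: exactly MAXC moderate contexts
        moderate = sorted((c for c in pool if LowC <= cscore(c, q) < HighC),
                          key=lambda c: cscore(c, q), reverse=True)
        if len(moderate) >= MAXC:
            return moderate[:MAXC]
    return []                                  # "Do no harm": otherwise add nothing

# With the three example queries above, lexical_generality(...) reproduces
# LG(olympus) = 3, LG(model) = 3, LG(camera) = 2, LG(review) = 2.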

4.5 Query Alteration: Adding Context
Query alteration is the only operation performed online by the topical wrapper (see Figure 2). This step uses the context list created in the previous section. The alteration algorithm for choosing a context C for a query Q using context list CL is described in Figure 6. We start by looking for the entire query in our context list, and if it is present, we return its context. Otherwise, we add one highly related context keyword for each of the query keywords in turn (in round-robin fashion). This addition for each keyword in turn ensures that the query shifts towards topical documents related to all query keywords. Since the maximum context size MAXC is limited, we would only be able to choose one or two contexts per query keyword. We therefore do not use moderately related contexts while doing round-robin addition.

Once the context keywords are generated, we add them to the query. The exact syntax we use to modify queries and the way modified queries are evaluated depend on the underlying search platform. In our implementation, we use a special query operator that signals the search system to (effectively and efficiently) identify all documents matching the query Q and then rank the matching documents using both Q and C. Alternatively, for a TF-IDF based ranker, assigning lower weights to the contextual keywords C may be sufficient. In a Boolean setting, we can AND the query Q with an OR'ed list of contextual keywords C.

1. Given Q = q1 q2 ... qn
2. If CL(Q) ≠ null, set C = CL(Q), exit.
3. Set ContextCount = 0, C = {}
4. While (ContextCount < MAXC) {
     For (i = 1 to n) {
       If (∃ c ∈ CL(qi) s.t. c ∉ C and CScore(c, qi) ≥ HighC) {
         Add c to C. ContextCount++.
       }
     }
   }

Figure 6. The query alteration algorithm
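For readers who prefer running code, the following is one executable Python rendering of the Figure 6 algorithm, together with one possible way of forming the final Boolean query; the explicit termination flag and the CL dictionary layout are assumptions added to make the sketch runnable.

# Sketch of the Figure 6 alteration: round-robin addition of highly related contexts.
# CL maps n-grams to {context: CScore}; HighC and MAXC as in Section 4.4.

def alter_query(Q, CL, MAXC=4, HighC=0.75):
    if Q in CL:                                   # whole query seen before: use its contexts
        return list(CL[Q])
    words, C = Q.split(), []
    added = True
    while len(C) < MAXC and added:                # round robin over the query keywords
        added = False
        for qi in words:
            options = [c for c, s in CL.get(qi, {}).items() if c not in C and s >= HighC]
            if options and len(C) < MAXC:
                C.append(max(options, key=lambda c: CL[qi][c]))
                added = True
    return C

def boolean_query(Q, C):
    # One possible rendering for a Boolean engine: Q AND (c1 OR c2 OR ...).
    return f"{Q} AND ({' OR '.join(C)})" if C else Q

# boolean_query("powershot", alter_query("powershot", {"powershot": {"canon": 0.9}}))
#   -> "powershot AND (canon)"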

5. EVALUATION
We used our approach to construct three different topical search engines, on Photography, Automobiles and Home Repair. In this section we first provide some statistics on the subwebs and context lists constructed for these domains, and then present the details of our user study.

5.1 Subweb Construction Statistics
Figure 7 shows the cumulative number of URLs identified per iteration while constructing a photography subweb with 10 seed sites. Figure 8 shows a plot of new URLs identified per iteration for subwebs constructed with different numbers of initial sites for the photography domain. Figure 9 plots the number of new topic URLs identified for different domains. In all these cases, the number of URLs identified is fairly large even with a small number of initial sites. The number of new URLs drops off dramatically beyond the fourth iteration, and the process usually terminates in fewer than 10 iterations. We can therefore reasonably fix the number of iterations at 5.

Figure 7. Cumulative number of URLs identified per iteration for 10 initial photography sites

Figure 8. New URLs identified per iteration for subwebs with three different initial sets of photography sites

Figure 9. New URLs captured per iteration for subwebs in three different domains – Photography, Home Repair and Automobiles

5.2 Context List Statistics
For each of our topics, the total size of the final context list file is given below:

• Photography: 2.63 MB
• Automobiles: 27.3 MB
• Home Repair: 8.88 MB

Their small size allowed the entire context list to be loaded as a dictionary in memory. Thus, to perform an alteration, we merely needed to perform a dictionary lookup for a keyword to obtain its context. As a result, the alteration overhead was minimal. The time taken to create the context list (including that of the subweb) was of the order of a few hours, making it possible to reconstruct it periodically, say once a day. However, such a high frequency of updates may not be necessary in practice.

5.3 Evaluating TopS Results
Web search engines are usually evaluated by getting human judges to create a gold standard of queries and (ordered) results, and comparing the results from a search engine with the gold standard. Typically, the gold standard consists of thousands of queries and many thousands of result sites, with one or more judgments for each query-URL pair. There is a lot of effort and cost involved in creating this gold standard. It was not viable to create such gold standard sets for multiple topic domains to evaluate the engines we created.

Another option is to use click logs (e.g. toolbar logs). That is, for any query, assume that the result URLs are ranked by the number of clicks on these URLs. The problem here is that the click graphs are built from generic web search; as such, the click counts do not apply to topical search.

We therefore decided to evaluate our search engines through a user study, using a tool which let users try out their own queries. The tool shows the results from the baseline search engine side by side with results from the topical search engine. The user can then compare the results from the two search engines and indicate which engine's results they prefer, very much in the spirit of Thomas and Hawking [23]. In this setup, users are primarily comparing result sets; they are not comparing individual results.

5.3.1 Study Details
In our study, we did three tests, comparing three pairs of search engines:

1. Photo1: Here a baseline generic search engine was compared to the TopS engine tuned to Photography.

2. Photo2: In this test, the baseline generic search engine was compared to the Photography TopS engine; however, each query in the baseline engine was altered to include an OR of the terms camera, photography and digital – that is, query Q becomes something like [Q AND (camera OR photography OR digital)].

3. Autos: Here the baseline generic search engine was compared to the TopS engine tuned to Automobiles.

To recruit participants, we sent messages to two email distribution lists within a large technology company. One was sent to a list for photography enthusiasts: people with email addresses starting with A-M were directed to Photo1, while those with email addresses starting with N-Z were sent to Photo2. We requested people on a mailing list about automobiles to try out the search engines in Autos. In each case, participants were told they were testing two experimental topical search engines. The engines were anonymized, and participants were not told what the baseline was.

Each participant was requested to try 15 or more topic-related queries, but several did fewer than 15 queries each. They provided a rating on the results, to tell us whether they preferred the results on the left or the results on the right. It was optional for participants to send us rating feedback on the result sets, and in fact we got ratings for only between 23% (Autos) and about 30% (the photography tests) of queries. Participants were encouraged to provide free-form text feedback, and in several cases they did provide this. At any point in the study, participants could optionally fill in a survey. Many also responded to this survey.

Table 2. Query alteration statistics for the three search engine pairs

                                                  Photo1        Photo2       Autos
Total Queries                                     391           162          224
Altered Queries (percentage)                      212 (54.2%)   79 (48.8%)   140 (62.5%)
1-word queries (percentage of total queries)      174 (44.5%)   78 (48.1%)   109 (48.4%)
Altered 1-word Queries (percentage of 1-word)     140 (80.5%)   64 (82%)     91 (83.5%)
2-word queries (percentage of total queries)      136 (34.8%)   49 (30.2%)   63 (28%)
Altered 2-word Queries (percentage of 2-word)     49 (36%)      12 (24.5%)   40 (63.5%)
3-word queries (percentage of total queries)      59 (15.1%)    24 (14.8%)   23 (10.2%)
Altered 3-word Queries (percentage of 3-word)     23 (39%)      3 (12.5%)    9 (39.1%)


We did not tell participants about our "Do no harm" principle – as a result, they did not expect results to be the same in cases where the query was specific enough or very general (according to our algorithm).

As a reward for their participation, we offered each participant who had tried at least 15 queries a chance to win one of three prizes: a $60 gift card and two $30 gift cards.

Table 1 shows the number of participants, number of queries, number of ratings and number of surveys obtained for each pair of search engines.

Table 1. Statistics about participation in the tests

Topic     #Users   #Queries Tried   #Ratings   #Surveys
Photo1    184      1229             391        48
Photo2    109      551              162        26
Autos     203      960              225        42

To estimate the potential impact of our method, we measured the percentage of altered queries among all the queries submitted to the three systems. Table 2 shows the statistics on query alterations. We see that the system altered a significant percentage of queries in all three cases: between 48.8% and 62.5% of queries by volume. Also, the algorithm altered shorter queries more often than longer ones. For example, for the 'Autos' topic, the system altered 83.5% of one-word queries, 63.5% of two-word queries and 39.1% of three-word queries. Since shorter queries tend to be more ambiguous, this is a desirable effect. In all three cases, no queries with four or more words were altered.

5.3.2 Evaluation Results: Results Set Comparisons
In this section we discuss the results from the comparison tool for the three tests we conducted in our study. Overall, the comparisons between TopS engines and the baseline engines were very favorable to TopS. In each case, we asked users to indicate if the search engine on the left was much better or better than the one on the right, or if they were equally good, or if the engine on the right was better or much better. We converted these user judgments directly (with our knowledge of which engine was the TopS engine and which was the baseline) into the following buckets:

• TopS Results Are Much Better
• TopS Has Better Results
• Both Results Equally Good
• Baseline is Better
• Baseline is Much Better

For our TopS engines to be shown to be better, we should see users preferring the first two categories over the last two categories. In the charts below, this translates to seeing more mass on the left than on the right.

Figure 10 charts the evaluation for the Photo1 test. Of the 391 ratings we received, for 247 (63%) of them users thought both results were equally good. Of the rest, for 28% users preferred the TopS engine: for 14% they thought it was much better and for another 14% they thought it was better. Only for about 8% did they think the baseline engine was better, and for less than 1% did they think the baseline was much better. To compute statistical significance, we combined the first two and the last two buckets above; these results are statistically significant at level p < 0.001.

Figure 10 also shows the frequency distribution of ratings for altered queries. Of the 212 queries that we altered, for 104 (49.1%) queries the alterations improved performance. These were usually ambiguous queries for which the general search engine returned a mixed set of results, while TopS returned only topic-related results. For another 76 queries (35.8%), the alterations did not hurt performance. These were mostly queries that were already fairly specific, but were altered nevertheless. Although it is hard to verify directly, the observation that performance did not degrade for 180 (84.9%) altered queries gives an indication that the query intent may not have been changed in these cases. Performance degraded in only 32 (15.1%) cases.

The Photo2 evaluation is a harder test for the TopS engine, but the trend is the same. In this case, recall that every query in the baseline was altered to include an OR of the terms camera, photography and digital, to make results for these modified queries more tuned to photography. So there were far fewer queries in Photo2 where the results were identical. Not surprisingly, of the 162 ratings received, for fewer queries (28%) users thought both engines were equally good. But for more (44%), users preferred the TopS engine, for about 9% of queries liking it much better and for 35% liking it better than the baseline. In contrast, for 28% of queries users preferred the baseline engine with the query modification (for about 22% liking this better than the TopS engine and for about 6% liking it much better). Overall, the Photo2 numbers in Figure 11 show a preference difference of 16% in favor of the TopS engine. These results are statistically significant at level p < 0.005, using the same combination as for Photo1.
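The paper does not state which statistical test was applied to the combined buckets; one common choice for such paired preference counts is a two-sided sign test, sketched below in Python with rough Photo1 numbers purely as an illustration.

# Hedged sketch: a two-sided sign test over preference judgments, ignoring ties
# ("equally good"). The choice of test is an assumption; the text only states that
# the first two and last two buckets were combined before testing significance.
from math import comb

def sign_test(prefer_tops, prefer_baseline):
    n = prefer_tops + prefer_baseline
    k = max(prefer_tops, prefer_baseline)
    # two-sided p-value under the null hypothesis of no preference (p = 0.5)
    return min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n)

# Photo1: roughly 28% of 391 ratings preferred TopS and about 9% preferred the baseline;
# sign_test(110, 35) is far below 0.001, consistent with the reported significance level.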

Figure 10. Photo1 evaluation results

An analysis of the alteration statistics also reveals some interesting observations. Of the 79 queries we altered, users clearly preferred our alterations to the alterations performed by the baseline engine (38 (48.1%) better, 26 (32.9%) same, 15 (19%) worse). Of the remaining 83 queries that were not altered, TopS results were either as good as or better than the baseline for 52 (62.7%) queries, suggesting that not altering these queries was indeed a reasonable choice. This observation also shows that blindly adding the same set of topic-relevant keywords to every query does not necessarily improve performance.

The Autos test (Figure 12) is similar to the Photo1 test. Of the 224 ratings received, 48% preferred the TopS engine (32% thought TopS was much better and 16% thought TopS was better). Only 6% preferred the baseline engine (about 4.5% thought the baseline was better and about 1.5% thought it was much better). The preference difference is very large here – 42% in favor of the TopS engine. One reason why the difference is so high may lie in the nature of the queries: model names of cars are often standard English words (Forester, Neon, etc.) that are ambiguous in the baseline engine, leading to results less tuned to the automobiles topic. These results are statistically significant at level p < 0.001, using the same combination of the first two and the last two buckets. The observations for altered queries are also similar to those of Photo1, except that here the preference for TopS is more pronounced.

To gain some insight into the reasons for good and bad performance, we present some example query alterations in Table 3. The system works well when the query keywords, such as 'ef' or 'matrix reviews', have been seen before in the subweb with reasonable frequency, and as a result their relationships with other topic-related keywords have been modeled well. On the other hand, performance is not as good for cases like 'robert runyon issaquah' or 'girl with flower', for which the query keywords are either rare or completely absent from the subweb. For such rare keywords it is not possible to accurately model the relationships from the subweb, and hence irrelevant contexts may get added. This becomes an issue for queries such as newly released product names, which may have been absent from the old subweb used for building the context list. To overcome this problem, a new context list must be constructed periodically using recent query log data.
Figure 11. Photo2 evaluation results

Figure 12. Autos evaluation results

Table 3. Examples of good and bad query alterations

Query                    Topic   Added Context                     Improvement
a4                       Auto    'audi'                            Yes
highlander               Auto    'toyota'                          Yes
matrix reviews           Auto    'toyota'                          Yes
ef                       Photo   'canon'                           Yes
black and white          Photo   'digital', 'camera', 'cameras'    Yes
girl with flower         Photo   'wrestling'                       No
robert runyon Issaquah   Photo   'raw'                             No

6. CONCLUSIONS
In this paper, we proposed the use of topical search engines (TopS) to obtain results focused on specific topics. We described an approach to easily create such topical search engines using context derived from click graphs. Where we have enough contextual information, we alter ambiguous topic-related queries by automatically adding contextual keywords, so that the results of these queries are focused on the topic of interest without changing the query intent. All we need is a few topical URLs to build a subweb, extract context, evaluate lexical generality, and identify and use alterations. In the user studies we conducted on our prototypes, participants showed a strong preference for our TopS prototypes over the baseline engines, with preference differences ranging from 16% to 42%.

It is useful to highlight four aspects of our work:

• Our approach is mostly automatic, and does not require any time-consuming manual editing, annotation or compilation of websites, or maintenance of domain-specific crawlers.

• The domain-specific keywords and relationships we extract are stable over much longer periods of time than URLs, so context updates do not have to be very frequent.

• The system is deployed as a low-overhead query wrapper. It does not need search indexes to be changed, for example to store and use category information. It also does not require any change in ranking strategies, or any post-processing of results.

• Our method avoids altering queries when there is not enough contextual information. Thus, where possible, the system makes results more topic-specific; where there is insufficient information, no changes are made and no harm is done.

We see topical searches as being ideal for web site owners and bloggers who want more than site searches or vanilla web search entry points on their site.

Our next steps are to refine our approach. The values for the various parameters in our system were empirically chosen. Since our evaluation was based on user studies, it was difficult to analyze the impact of these parameters. We hope to work on an evaluation method that will let us vary these parameters and set their values in a principled manner. Our method also has applications in ontology extraction and query segmentation, and we hope to work on these aspects as well.

7. REFERENCES
[1] G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness, and selective application of query expansion. In Advances in Information Retrieval, Proceedings of the 26th European Conference on IR Research, ECIR 2004, pages 127-137, Sunderland, UK, 2004.
[2] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of SIGIR-98 (Melbourne, AU, 1998), ACM Press, 104-111.
[3] S. Chakrabarti, M. van den Berg and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. In Proceedings of WWW 1999.
[4] R. Chandrasekar, H. Chen, S. Corston-Oliver, and E. Brill. Subwebs for specialized search. In Proceedings of SIGIR 2004 (Sheffield, UK, July 2004), ACM Press, 480-481.
[5] H. Chang, D. Cohn, and A. K. McCallum. Learning to create customized authority lists. In Proc. 17th ICML (Stanford, CA, 2000), Morgan Kaufmann, San Francisco, CA, 127-134.
[6] N. Craswell and M. Szummer. 2007. Random walks on the click graph. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Amsterdam, The Netherlands, July 23-27, 2007). SIGIR '07. ACM, New York, NY, 239-246.
[7] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. 2002. Predicting query performance. In Proceedings of SIGIR 2002 (Tampere, Finland, August 11-15, 2002). ACM, New York, NY, 299-306.
[8] D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In UK Conference on Hypertext, pages 225-234, 1998.
[9] E. Glover, G. Flake, S. Lawrence, W. P. Birmingham, A. Kruger, C. L. Giles, and D. Pennock. Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet, SAINT (San Diego, CA, Jan 2001), IEEE, 23-31.
[10] Google Custom Search Engine. http://www.google.com/cse/
[11] C. Hauff, V. Murdock, and R. Baeza-Yates. 2008. Improved query difficulty prediction for the web. In Proceedings of CIKM '08 (Napa Valley, California, USA, October 26-30, 2008). ACM, New York, NY, 439-448.
[12] T. H. Haveliwala. 2002. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. In Proceedings of WWW 2002 (Honolulu, Hawaii, USA, May 7-11, 2002). ACM, New York, NY, 517-526.
[13] B. He and I. Ounis. 2006. Query performance prediction. Inf. Syst. 31, 7 (Nov. 2006), 585-594.
[14] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[15] R. Kraft, C. C. Chang, F. Maghoul, and R. Kumar. 2006. Searching with context. In Proceedings of WWW '06 (Edinburgh, Scotland, May 23-26, 2006). ACM, New York, NY, 477-486.
[16] A. Kruger, C. Lee Giles, F. Coetzee, E. J. Glover, G. W. Flake, S. Lawrence, and C. W. Omlin. DEADLINER: building a new niche search engine. In Proc. CIKM 2000, pp. 272-281.
[17] S. Lawrence, K. Bollacker and C. Lee Giles. Indexing and retrieval of scientific literature. In Proc. CIKM 99, pp. 139-146.
[18] J. H. Larson and M. D. Rijke. 2008. Using coherence-based measures to predict query difficulty. ECIR 2008.
[19] Rollyo site. http://rollyo.com
[20] R. Song, Z. Luo, J. Nie, Y. Yu, and H. Hon. 2009. Identification of ambiguous queries in web search. Inf. Process. Manage. 45, 2 (Mar. 2009), 216-229.
[21] B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. 2005. Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction 14, 5 (Jan. 2005), 383-423.
[22] J. Teevan, S. T. Dumais and E. Horvitz. 2005. Personalizing search via automated analysis of interests and activities. In Proceedings of SIGIR 2005.
[23] P. Thomas and D. Hawking. 2006. Evaluation by comparing result sets in context. In Proceedings of CIKM '06. ACM, New York, NY, 94-101.
[24] A. Turpin and W. Hersh. 2001. Why batch and user evaluations do not give the same results. In Proceedings of SIGIR '01.
[25] A. Turpin and W. Hersh. 2004. Do clarity scores for queries correlate with user performance? In Proceedings of the 15th Australasian Database Conference - Volume 27 (Dunedin, New Zealand). K. Schewe and H. Williams, Eds. ACM International Conference Proceeding Series, vol. 52, pp. 85-91.
[26] A. Turpin and F. Scholer. 2006. User performance versus precision measures for simple web search tasks. In Proceedings of SIGIR '06.
[27] Y. Zhao, F. Scholer and Y. Tsegay. 2008. Effective pre-retrieval query performance prediction using similarity and variability evidence. ECIR 2008.
