Clustering Queries for Better Document Ranking

Yi Liu¹, Liangjie Zhang¹, Ruihua Song¹, Jian-Yun Nie², Ji-Rong Wen¹
¹ Microsoft Research Asia, Beijing, 100190, China
² DIRO, University of Montreal, CP. 6128, succursale Centre-ville, Montreal, H3C 3J7, Quebec, Canada
{lewisliu,liazha,rsong,jrwen}@microsoft.com, [email protected]

ABSTRACT

Different queries require different ranking methods. It is, however, challenging to determine which queries are similar and how to rank documents for them. In this paper, we propose a new method to cluster queries according to a similarity determined from the URLs in their answers. We then train a specific ranking model for each query cluster. In addition, a cluster-specific measure of authority is defined to favor documents from websites that are authoritative on the corresponding topics. The proposed approach is tested using data from a search engine. Our topic-dependent models significantly improve the search results for eight of the most popular categories of queries.

Categories and Subject Descriptors: I.5.3 [Clustering]: Similarity measures; H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms: Algorithms, Performance, Experimentation

Keywords: Query similarity, clustering, topic-sensitive ranking, website authority
1. INTRODUCTION
Many users prefer finding information on more authoritative websites. This is particularly the case for popular search topics, such as those about movies, software, and celebrities, which represent a significant percentage of the whole query traffic on the Web. Therefore, it is crucial to determine which websites are authoritative for a given query. Many efforts have been devoted to improving document ranking for such popular queries. For example, HITS [7] and PageRank [8] are two typical approaches that try to rank more authoritative or popular web pages higher. However, HITS requires a considerable amount of online calculation, which makes it less applicable in practice, and PageRank is query-independent. In this paper, we assume that users with similar intents consider similar websites to be authoritative. Our goal is to
[email protected]
develop new methods to determine clusters of similar queries for which the same websites are considered authoritative. An authority score for each website is calculated per cluster of queries and is then used as an additional feature to train a specific ranking model for that cluster.

The estimation of query similarity is key to the success of this topic-dependent ranking. If query similarity is estimated in a way that reflects a common perception of authority, the authority feature can correctly boost authoritative websites in the ranking. Two families of approaches to similarity estimation have been proposed in the literature: using content words and using click-through data. We intend to determine similar queries beyond those that share common words. This is particularly important for popular search queries, which are often titles of games, movies, and software that share no words. However, we cannot rely on the click-through approaches proposed in previous studies because of data sparsity: each URL is considered a single token, while users click on only a few documents in practice. For example, the queries "Jackie Chan" and "Jet Li" result in two different sets of clicked URLs, even though they are similar in the sense that both intend to find documents about movies or movie stars. What we can observe from this example is that queries on similar topics are likely to retrieve documents from the same websites specialized in entertainment or movies. So, by considering the similarity between the retrieved URLs, one can define a better similarity measure between the queries.

Experimental results show that our proposed approach improves retrieval effectiveness by around 20% for eight categories of queries. As the topics involved are among the most frequently searched on the Web, the queries concerned account for as much as 60% of the query traffic.

The rest of this paper is organized as follows. First, we review related work in Section 2. In Section 3, our proposed approach is described. In Section 4, experimental results are reported. Section 5 concludes the work.
2. RELATED WORK
Query similarity has been the object of a large number of studies. It has been estimated according to query content words [9], top search results/snippets [10], or query logs [2, 4, 11, 12]. The studies in [11, 12] consider two queries to be similar if they lead to clicks on the same documents. Their experiments show that this similarity measure can help boost the content-based similarity measure.
Baeza-Yates et al. [1] use terms in the clicked URLs to determine similar queries. They store the clusters and the k most popular URLs for each cluster. The above approaches can only establish similarity relationships between a small number of queries. To deal with this problem, our approach does not rely on click-through data; rather, we extend the idea to the top retrieved URLs. Another limitation of the above approaches is that they use a URL as a unit: similar queries are required to lead to clicks on the same URLs. As mentioned earlier, this rarely happens, because different URLs are retrieved for similar but different queries. In our approach, we go further and consider a URL as a set of tokens.

Authority is a factor that affects the ranking of documents. PageRank [8] is a widely used method to determine an authority measure for web pages. However, it computes a single authority measure regardless of the query. In reality, the degree of authority of a website or web page depends on the topic. To account for this, Haveliwala [5] proposed extending the original PageRank algorithm to topic-biased PageRank vectors. Different from [5], we consider a common authority measure for a group of similar queries, rather than for individual queries. This allows us to better account for the fact that similar queries share the same perception of authority.

Figure 1: Search results returned for "Jackie Chan" and "Jet Li" and their common URL tokens.
"Jackie Chan": ent.sina.com.cn/s/h/f/clong.html, baike.baidu.com/view/3539.htm, datalib.ent.qq.com/star/104/index.shtml, blog.sina.com.cn/JackieChan, data.ent.163.com/ent/star/id=jc.html
"Jet Li": ent.tom.com/Archive/2002/5/9-87406.html, ent.sina.com.cn/s/h/f/llj.html, www.mtime.com/person/898883, blog.sina.com.cn/lilianjie, datalib.ent.qq.com/star/113/index.shtml
Common URL tokens: star, blog, sina, ent, datalib, qq
3. OUR APPROACH

In this section, we describe three key aspects of our approach.

3.1 Query Similarity and Clustering

The basic idea we exploit in this section is the following: queries are similar if they retrieve similar URLs. Let us examine two typical queries, "Jackie Chan" and "Jet Li", who are both Kung Fu movie stars. These queries are related and can, to some extent, be considered similar. The top five URLs returned for their Chinese names by a search engine in China are shown in Figure 1. We can observe that the top URLs for the two queries may come from the same website or contain common tokens such as "star" and "ent". These tokens are meaningful: they often correspond to a specific category that webmasters use to organize documents. From these tokens, one can guess that the documents are about entertainment or stars, and so are the two given queries. From this fact, one can deduce that the two queries are related. Therefore, we propose the following method to calculate a URL-based query similarity.

Given a query qi, a search system returns a list of results, each of which is represented by a title, a snippet, and a URL. In this paper, we focus on the URL part. We extract the top k URLs ui1, ui2, ..., uik and consider them as a representation of the query:

    E(qi) = (ui1, ui2, ..., uik)    (1)

Each URL uj is then broken into tokens. First, we split the URL by pre-defined separators, such as slashes (/) and dots (.). Second, numbers (except those in website domains) and URL stopwords (e.g., "com" and "index") are removed. Finally, the URL is represented by l tokens:

    Φ(uj) = (tj1, tj2, ..., tjl)    (2)

Each query qi is then represented by a vector of tokens:

    qi = <wi1, wi2, ..., win>    (3)

where wij is the weight of token tj in qi, calculated in a way similar to traditional TF*IDF measures; the difference is that the document frequency (DF) is counted over a large corpus of URLs. The similarity between two queries is then computed with the cosine measure. Based on this similarity, we apply an agglomerative hierarchical clustering algorithm that iteratively merges the most similar queries and clusters until the desired number of clusters is obtained.
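To make the representation and clustering steps concrete, the following Python sketch implements Equations (1)-(3) and the agglomerative grouping in a minimal way. The separator set, the URL-stopword list, the handling of numeric tokens, and the average-linkage merging strategy are illustrative assumptions, not details taken from the paper.

import math
import re
from collections import Counter

# Assumed, illustrative URL stopword list and separator set; the paper does not
# enumerate them.
URL_STOPWORDS = {"com", "cn", "www", "html", "htm", "shtml", "index", "php"}
SEPARATORS = re.compile(r"[./\-_?=&]+")

def tokenize_url(url):
    """Break a URL into tokens (Eq. 2): split on separators, drop purely numeric
    tokens (a simplification of the paper's rule) and URL stopwords."""
    tokens = [t.lower() for t in SEPARATORS.split(url) if t]
    return [t for t in tokens if t not in URL_STOPWORDS and not t.isdigit()]

def query_vector(top_urls, df, n_urls):
    """Represent a query by TF*IDF weights over the tokens of its top-k URLs
    (Eq. 1 and 3); df holds token document frequencies counted on a URL corpus
    of size n_urls."""
    tf = Counter()
    for url in top_urls:
        tf.update(tokenize_url(url))
    return {t: f * math.log(n_urls / (1.0 + df.get(t, 0))) for t, f in tf.items()}

def cosine(v1, v2):
    """Cosine similarity between two sparse token-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cluster_queries(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two most similar clusters
    (average linkage, an assumed choice) until n_clusters remain."""
    clusters = [[q] for q in vectors]
    while len(clusters) > n_clusters:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                sim /= len(clusters[i]) * len(clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

For instance, feeding the two URL lists of Figure 1 through tokenize_url yields vectors that share the tokens ent, sina, star, blog, datalib, and qq, so the two queries obtain a non-zero cosine similarity even though they share no query words.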
3.2 Topic-Sensitive Authority Measure

In this section, we assume that similar queries have similar authoritative websites; authoritative websites are recognized within each topic area. Once we have created clusters of queries, it is possible to determine such authoritative websites for the whole query set of a cluster. We propose using the clicks issued by users for the queries in a cluster to calculate a topic-sensitive authority measure for websites, instead of pages. Given a cluster, the most frequently clicked websites are most likely the authoritative ones. Suppose Qc is the set of queries of a cluster c and si is one of the websites that users have clicked. We sum the clicks on the website si over all the queries in Qc as Countc(si), i.e.

    Countc(si) = Σ_{q ∈ Qc} Click(si, q)    (4)

Then, we calculate the authority score of si by normalizing this count:

    Authorityc(si) = Countc(si) / maxj Countc(sj)    (5)

To a document d in category c, we attribute the authority score of its website, i.e.

    SiteImportance(d) = Authorityc(site(d))    (6)
Take the celebrity-related queries as an example. The site ent.sina.com.cn is considered one of the most popular websites on celebrities in China, and its Authorityc is as high as 1. Thus, given a celebrity-related query, the pages from ent.sina.com.cn will be assigned the highest SiteImportance value, i.e., 1.
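Equations (4)-(6) are straightforward to compute from a click log. The minimal sketch below assumes a log of (cluster id, query, clicked URL) records and equates a website with the URL's host name; both are illustrative assumptions rather than details from the paper.

from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    """Map a URL to its website; here simply the host name (an assumed definition)."""
    parsed = urlparse(url if "//" in url else "http://" + url)
    return parsed.netloc

def authority_scores(click_log):
    """click_log: iterable of (cluster_id, query, clicked_url) records (assumed format).
    Returns {cluster_id: {site: Authority_c(site)}} following Eq. (4) and (5)."""
    counts = defaultdict(lambda: defaultdict(int))
    for cluster, _query, url in click_log:
        counts[cluster][site_of(url)] += 1          # Count_c(s_i), Eq. (4)
    return {cluster: {s: c / max(site_counts.values()) for s, c in site_counts.items()}
            for cluster, site_counts in counts.items()}          # Eq. (5)

def site_importance(doc_url, cluster, authority):
    """SiteImportance(d) = Authority_c(site(d)), Eq. (6); sites never clicked in the
    cluster default to 0 in this sketch."""
    return authority.get(cluster, {}).get(site_of(doc_url), 0.0)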
3.3 Topic-Dependent Ranking Method
The basic idea behind the topic-dependent model is the following: similar queries require similar ranking strategies. Different categories of queries may call for different search strategies. If we train a specific ranking model for each cluster of queries, the model learned for a specific topic may fit that category of queries better than a general model trained on all queries mixed together. In addition, adding our proposed topic-sensitive authority measure will further improve the topic-dependent ranking model.

Once similar queries have been clustered, we learn topic-dependent ranking models as follows. First, we select topics that have enough data for learning models offline and that influence as many queries as possible online. Then, for these topics, we learn query classifiers that identify queries on the same topic, so that topic-dependent models can be learned in the training phase and applied in the testing phase. Next, the classifier of a query category is used to identify the training/validation/testing queries of that category and form the datasets for learning a topic-dependent model. Any learning-to-rank algorithm can be used to train a ranking model for each category of queries; in this paper, we use RankNet [3]. For each query-document pair, we extract a vector of M general features, from which a general model can be trained if all queries are used; if only the queries of a particular category are used, the resulting model is a topic-dependent model that is applied to that category at testing time. Finally, we add the topic-sensitive SiteImportance feature to the M general features. Based on the M+1 features, we learn a topic-dependent model with topic-sensitive features by applying RankNet. According to our assumptions, this model should be the most powerful.
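As a concrete illustration of this training setup, the sketch below assembles per-cluster training sets in which the M general features are extended with SiteImportance, and then trains one model per cluster. The functions classify_query, general_features, and train_ranker are hypothetical placeholders for the query classifier, the feature extractor, and the learning-to-rank algorithm (RankNet in the paper); none of them is specified by the paper.

from collections import defaultdict

def build_training_sets(labeled_pairs, classify_query, general_features,
                        authority, site_importance):
    """labeled_pairs: iterable of (query, doc_url, relevance_label) examples.
    Groups examples by predicted query cluster and appends the SiteImportance
    feature, producing M+1 features per query-document pair."""
    per_cluster = defaultdict(list)
    for query, doc_url, label in labeled_pairs:
        cluster = classify_query(query)                      # topic of the query
        features = list(general_features(query, doc_url))    # the M general features
        features.append(site_importance(doc_url, cluster, authority))  # extra topic-sensitive feature
        per_cluster[cluster].append((features, label))
    return per_cluster

def train_topic_models(per_cluster, train_ranker):
    """Train one ranking model per cluster; train_ranker is any learning-to-rank
    routine (RankNet in the paper, not reproduced here)."""
    return {cluster: train_ranker(examples) for cluster, examples in per_cluster.items()}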
4. EXPERIMENTS

Table 1: Clusters in the China Market in Dec. 2007
Cluster        Queries
TV             Prison Break, Five Fortune Mouse, ...
Celebrities    F.I.R, jolin, Li Ning, Liu xiang, ...
Software       MSN, qq download, photoshop, vista, ...
Movies         Lust Caution, Bolt, Sex and the City, ...
Music          Going Home, Yesterday Once More, ...
Online Games   World of Warcraft, Audition Online, ...
Mini Games     Mini game, Fantastic Wonderland, ...
Finance        stock, exchange rate, Bank of China, ...
Literature     romantic fiction, Happy Together, ...
Education      Tsinghua University, Peking University, ...
The clustering algorithm is described in Section 3.1. We cluster queries into 400 clusters but select only the ten largest clusters for our ranking experiments, for two reasons: 1) learning a ranking model for a specific cluster requires enough training data, and only large clusters meet this requirement; 2) the more popular the clusters are, the more impact the specific models have on user experience. The ten largest clusters are shown in Table 1. In the ranking experiments, a classifier trained on the ten clusters is used to select the topic-dependent model for each query.

We compare the retrieval performance of three methods: 1) the general model (Baseline), which uses a single ranking model trained on all the queries; 2) the topic-dependent models (T-Model), trained for each query cluster with the general features; 3) the topic-dependent models with the topic-sensitive SiteImportance score (T-M+SI) as an additional feature. The last method integrates all the aspects proposed in this paper.

The detailed results on the validation dataset are shown in Table 2. We have conducted t-tests to validate whether the improvements are statistically significant. In the table, * means the improvement is significant (with p-value