An Optimization Framework for Propagation of Query–Document Features by Query Similarity Functions

Maxim Zhukovskiy, Tsimafei Khatkevich, Gleb Gusev, Pavel Serdyukov
Yandex, 16 Leo Tolstoy St., Moscow, 119021, Russia
{zhukmax, tsimkha, gleb57, pavser}@yandex-team.ru

ABSTRACT
Many query–document features significantly improve the quality of ranking for popular queries but provide no benefit for new or rare queries, since there is typically not enough data associated with those queries to reliably compute the feature values. It is common practice to propagate the values of such features from popular to tail queries when the queries are similar according to some predefined query similarity function. In this paper, we propose new algorithms that facilitate this propagation and increase its effectiveness. Given a query similarity function and a query–document relevance feature, we introduce two approaches (a linear weighting approach and a tree-based approach) to learn a function of the similarity values and the feature values of the similar queries w.r.t. the given document. The propagated value of the feature equals the value of the learned function for the given query–document pair. To find the most effective method of propagating query–document features, we measure the effectiveness of the different propagation approaches in experiments on a large dataset of labeled queries.

1. INTRODUCTION

Modern web search engines use a large number of relevance signals that are typically combined in a complex fashion via learning a ranking function. Using these features, popular commercial search engines can generally respond to the most frequently asked queries (head queries) with excellent results. However, providing answers of similarly high quality for tail queries, infrequently issued or previously unseen, is more difficult due to the fact that the quality of some features [1] highly depends on the amount of data available for their calculation, which in turn is influenced by query popularity. Previous studies [36] even reported that the quality of web page ranking for head queries is nearly the same across different search engines, while tail queries contribute the major differences among them. Therefore, improving the performance on these queries by overcoming the sparseness of the data associated with them is one of the most critical challenges faced by search engines.

This problem is usually solved by propagating statistics between query–document pairs sharing the same document on the basis of query similarity, which can be determined via a similarity function [2, 5, 13, 21, 32, 34, 35, 38]. Query similarity can be defined in more than one way: based on characteristics of user behavior [13, 21, 26, 28, 32, 35], on semantic and syntactic properties of the queries [5, 17, 34, 38], or on the values of different features for different documents with respect to the considered queries [32]. Our motivation for this work is to understand how different similarities and other properties of queries can be incorporated most effectively to solve the problem of propagating query–document features. For this purpose, we propose a general framework for learning a web page ranking function based on propagation of features via query similarities. The learning part of our framework allows us to compare different propagation methods by evaluating the ranking quality of the trained rankers. Given two propagation methods, a set of regular query–document features, a set of query–document features computed by propagation, and an algorithm for learning a ranking function, we learn two different ranking functions with the same learning algorithm. The two learning processes differ only in the sets of propagated features, which are computed with the two different propagation methods (both feature sets also contain the considered regular features). The first propagation method outperforms the second one if the first ranking function outperforms the second one in terms of the considered measure of retrieval quality. The effectiveness of propagating query–document features, as we demonstrate in this paper, depends on a variety of aspects that need to be considered by a propagation method but were overlooked by the state-of-the-art methods.

Assume we are provided with a collection of queries Q, values of a feature f for query–document pairs (q̃, d) over all q̃ from Q, and a collection of query similarity functions. The common approach to obtain the propagated value of the feature f for a target query–document pair (q, d) is the following [2, 13, 35]. The propagated feature value f̃(q, d) equals

    \tilde f(q, d) = \sum_{\tilde q \in Q_q} w(\tilde q, q) f(\tilde q, d),    (1)

where the sum is taken over the queries q̃ that are similar to query q. The weights w(q̃, q) and the set Q_q of similar queries are defined in some way on the basis of one of the existing query similarity functions s(q̃, q). So far, the values of the query similarity functions (sometimes normalized) were used directly as the weights in Equation 1, without regard to their different nature and scale. Besides, there has been no comparison of the weighted mean with any other function that could aggregate the feature values of similar queries. Also, the state-of-the-art propagation method described by Equation 1 was not previously compared [2, 13, 35] with methods utilizing further statistics of the distribution of the feature values over similar queries besides its linearly weighted mean.

The major contributions of this paper are the following. First, we investigate different approaches to learn a function of similarity values that is used to compute the weights in the linear combination of the feature values in Equation 1. Moreover, we consider another approach to propagate (aggregate) query–document feature values, based on decision trees instead of the weighted sum, and compare the performance of the two approaches. Second, when using the existing propagation methods, some important data associated with the similar queries q̃ is neglected. We show that certain properties of similar queries can help the ranker use those queries more effectively for propagation. Our learning framework allows utilizing any query features to increase the contribution of propagated features to the quality of the ranking function. To the best of our knowledge, we are the first to address the above-mentioned aspects of propagating query–document relevance features of a ranking model via a collection of query similarities of different types. Third, we conduct experiments on a large dataset of labeled queries, compare the performance of the state-of-the-art and the proposed approaches, and identify the combination that gives the best quality and advances the state of the art. Moreover, our experimental results show that the best propagation method significantly outperforms the state of the art even when feature values are propagated for queries of high or moderate frequency; the relative gain of the final ranking functions, however, is much greater for tail queries.

The remainder of the paper is organized as follows. Section 2 reviews methods of propagating query–document features and ways of defining query similarity functions. In Section 3, we propose a classification of similarity functions. We describe our framework for propagating the features in Section 4. In Section 5, we introduce a method of exploiting statistics of the distribution of the feature values over similar queries to make the propagation algorithms of Section 4 use similar queries more effectively. The experimental results are reported in Section 6. In Section 7, we summarize the outcomes of our study and discuss its potential applications and directions of future work.
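To make the baseline concrete, here is a minimal Python sketch of the propagation scheme of Equation 1. Normalizing the similarity values into weights is one common instantiation of w from the literature; the queries, similarity scores, and CTR values below are invented for illustration.

```python
def propagate_feature(q, d, Q_q, s, f):
    """Equation 1: weighted sum of the feature values of similar queries.

    Q_q -- the set of queries similar to q
    s   -- a query similarity function s(q_tilde, q)
    f   -- the raw query-document feature f(q_tilde, d)
    Here the weights are similarities normalized to sum to one.
    """
    total = sum(s(qt, q) for qt in Q_q)
    if total == 0.0:
        return 0.0
    return sum((s(qt, q) / total) * f(qt, d) for qt in Q_q)

# Toy usage: propagate CTR to a tail query from two similar head queries.
sim = {"cheap flights": 0.8, "budget airlines": 0.2}
ctr = {"cheap flights": 0.30, "budget airlines": 0.10}
value = propagate_feature("cheap flight tickets", "doc1",
                          ["cheap flights", "budget airlines"],
                          lambda qt, q: sim[qt],
                          lambda qt, d: ctr[qt])
print(value)  # 0.8 * 0.30 + 0.2 * 0.10 = 0.26
```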
2. RELATED WORK

The problem of query–document feature (especially click-through data) sparseness has been reported in many studies [1, 11, 13]. The lack of implicit feedback for tail (new or rare) queries affects the quality of web search and is known [36] to affect the quality of results for tail queries most negatively. The common approach to this problem is to propagate features from head to tail queries. It is ordinarily done by considering a query similarity function [2, 13], which, in the general case, can be defined in different ways [5, 21, 32, 34, 35, 38]. Evaluating a high-quality query similarity function is itself an important task in web search, because it is helpful in such problems as query suggestion [20, 21], query reformulation [19, 31], and query expansion [30, 33].

We distinguish the following four types of similarity functions: text similarity functions (semantic and syntactic similarities), query–document features-based similarity functions, query-level user-based similarity functions, and session-level user-based similarity functions. Semantic similarity functions are typically computed based on the following intuition: each query is associated with its n-gram vector, and the closer two query n-gram vectors are to each other in the n-gram space (according to some metric), the more similar the queries are. This intuition was used to define a query similarity in [34, 38]. Syntactic similarity functions measure the distance between queries as term sequences; the most common syntactic function is based on the Levenshtein distance [5, 17, 18]. Query–document features-based similarities measure the distance between the vectors of query–document feature values of a pair of queries [32]. A query-level user-based similarity is computed on a click graph [13, 21, 32, 35], usually via a few steps of a random walk [13, 21, 26, 35]. A session-level user-based similarity is computed by using information about the co-occurrence of different queries in user search sessions [5]. In [26, 28], a random walk on a query-flow graph is exploited for similarity computation. Once one or several similarity functions are defined, query–document feature values can be propagated among queries (usually, but not always, from head to tail queries).

Despite the obvious practical importance of research on the propagation of query–document relevance features by similarity functions, its history is rather short. Given a similarity function, the most common approach is to define propagated feature values by Equation 1 (see [2, 13, 35]). For instance, Aktolga and Allan [2] use a query similarity function for the purpose of propagating click-through data. The resulting ranking score of a document with respect to a query q in their framework equals a linear combination of the default ranking score, the original value of the click-through feature, and the propagated value of the feature. The propagated value is the weighted sum of the values of the feature over all queries similar to q, as in Equation 1 (the weights are equal to the respective outcomes of the similarity function). In their experiments, they consider only a small set of 2400 popular queries, among which they propagate click-through data (a query is considered popular if it has at least 500 clicks). In [13], the authors also propagate click-through feature values. A query similarity function is used to expand the click-through streams of a document, and the propagated feature values are computed on the basis of these expanded click-through streams. The authors constrain the set of similar queries utilized to propagate click-through features by tuning a threshold on the value of the similarity function: for the purpose of propagation, they use only those similar queries whose similarity value w.r.t. the target query exceeds the threshold. For any propagated feature f̃ from that paper, there exists a query–document feature f such that, for any query–document pair (q, d) and a click-through stream w of the document d, the feature value f̃(q, d) equals the sum of f(q̃, d) over all q̃ in w. Therefore, the propagated values in this method equal a linear combination of feature values with binary weights, and so are essentially defined by Equation 1. Unfortunately, neither paper proposes any method of optimizing the weights in the linear combinations of the feature values given the ad-hoc similarity functions that define those weights. Our work further elaborates the idea of query–document feature value propagation via query similarity functions proposed in these papers. We experiment with several implementations of this idea and, most importantly, advance it by solving the problems posed in Section 1, thus aiming to make this approach worth taking more seriously for search engine development.
3. SIMILARITY FUNCTIONS

In this section, we introduce our classification of the query similarity functions investigated in previous papers. Essentially, such a function should be computed on the basis of the following intuition: similar queries (in terms of this function) represent similar information needs. The similarity of these needs can be evaluated using statistics of different nature. In particular, the computation of similarity might utilize a priori data (lexical properties of the queries) or a posteriori data (shows of documents in response to the queries, or clicks on documents shown in response to the queries). Therefore, our classification is based on the type of data utilized for the computation of the similarity functions. Specifically, we distinguish the following four classes: query–document features-based similarity (depending on the feature type, it utilizes either a priori or a posteriori data), query-level user-based similarity (a posteriori data), session-level user-based similarity (a posteriori data), and text similarity (semantic and syntactic similarity; a priori data). We list several similarity functions of each class that were commonly used in previous studies. In what follows, we consider a set of queries Q and define all the similarities for a pair of queries from this set.

3.1. Features-based similarity. The computation of a query–document features-based similarity s of queries q1 and q2 [32] may utilize any type of data that is used for measuring the relevance of documents with respect to queries. Let f be any nonnegative feature of a query–document pair, and let D = {d_j}_{j=1}^N be a collection of documents. Consider the two vectors f_{q_i} = (f(q_i, d_1), ..., f(q_i, d_N)), i ∈ {1, 2}, of the feature values of all documents with respect to queries q1 and q2 respectively. Consider a function g: R_+^N × R_+^N → R_+ chosen in such a way that, if two queries q1 and q2 are naturally similar, then the output value g(f_{q1}, f_{q2}) is likely to be high. The query–document features-based similarity s between q1 and q2 equals s(q1, q2) = g(f_{q1}, f_{q2}). We choose g from a set of two functions, scalar product and cosine, which provide the following two similarity functions for a given feature f:

    SFS_f(q1, q2) = ⟨f_{q1}, f_{q2}⟩,    CFS_f(q1, q2) = cos(f_{q1}, f_{q2}).
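Both features-based similarities are straightforward to compute once the feature vectors f_{q_i} are materialized. Below is a minimal self-contained sketch; the toy vectors stand in for feature values over a small document collection.

```python
import math

def sfs(f_q1, f_q2):
    """Scalar-product features-based similarity SFS_f."""
    return sum(a * b for a, b in zip(f_q1, f_q2))

def cfs(f_q1, f_q2):
    """Cosine features-based similarity CFS_f."""
    n1, n2 = math.sqrt(sfs(f_q1, f_q1)), math.sqrt(sfs(f_q2, f_q2))
    return sfs(f_q1, f_q2) / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

# f_qi = (f(q_i, d_1), ..., f(q_i, d_N)) for a nonnegative feature f
# such as BM25 or CTR over a fixed document collection d_1, ..., d_N.
f_q1 = [0.3, 0.0, 0.5]
f_q2 = [0.2, 0.1, 0.4]
print(sfs(f_q1, f_q2), cfs(f_q1, f_q2))
```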
Popular search engines store anonymized information describing user behavior (visited pages, times of visits, submitted queries, etc.) in their query logs. In the next two sections, we consider query similarities mined from the user search behavior data in the query logs.

3.2. Query-level user-based similarity. The similarity of the second type is based on properties of one of the bipartite graphs constructed from user search behavior data, described in what follows. It is common practice to compute such similarities (called query-level user-based similarities) using random walks on these graphs (see [13, 21, 27, 35]), because random walks model user behavior. We consider two graphs: a click graph and a graph of shows. The click graph is a bipartite undirected graph G(Q, D, E), where Q is a set of queries, D is a set of documents, and an edge {q, d} ∈ E is weighted by the number cl_{qd} of clicks on document d shown in response to query q. The graph of shows is defined in a similar way; the weight of an edge {q, d} in the graph of shows equals the number sh_{qd} of shows of document d in response to query q. Consider a random walk on a bipartite graph G(Q, D, E) (either the click graph or the graph of shows), where the probability of a transition along an edge {q, d} is proportional to its weight w_{qd} (either w_{qd} = cl_{qd} or w_{qd} = sh_{qd}). Let q1, q2 ∈ Q and k ∈ N. Usually, the query-level user-based similarities QUS_k^{cl}(q1, q2) and QUS_k^{sh}(q1, q2) [13] are defined as the probabilities of arriving at query q2 after 2k steps of the random walks on the click graph and on the graph of shows respectively, started from query q1.
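As a concrete illustration of QUS_k, the sketch below propagates the random-walk distribution exactly over a small click graph (a production implementation would use sparse matrices or sampling over the full graph); the click counts are invented.

```python
from collections import defaultdict

def qus(q1, q2, clicks, k):
    """QUS_k on a click graph: the probability of arriving at query q2
    after 2k random-walk steps started from q1; a transition along an
    edge is proportional to its weight.
    clicks[(q, d)] = number of clicks on d shown in response to q."""
    out_q, out_d = defaultdict(dict), defaultdict(dict)
    for (q, d), w in clicks.items():
        out_q[q][d] = w   # query side -> document side
        out_d[d][q] = w   # document side -> query side

    def step(dist, adj):
        nxt = defaultdict(float)
        for node, p in dist.items():
            total = sum(adj[node].values())
            for nb, w in adj[node].items():
                nxt[nb] += p * w / total
        return nxt

    dist = {q1: 1.0}
    for _ in range(k):            # query -> document -> query = 2 steps
        dist = step(dist, out_q)
        dist = step(dist, out_d)
    return dist.get(q2, 0.0)

clicks = {("a", "d1"): 3, ("b", "d1"): 1, ("b", "d2"): 2}
print(qus("a", "b", clicks, k=1))  # a -> d1 (prob 1) -> b (prob 1/4) = 0.25
```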
3.3. Session-level user-based similarity. User search sessions extracted from the query logs can serve as another source of information about similar queries. Indeed, users often reformulate queries within the same search session, and such reformulations often represent similar information needs. We call a similarity function a session-level user-based similarity if the input data for computing this function is extracted from the session data of the query log. In particular, we consider the similarity functions proposed in [5]. Using sessions extracted from query logs, the authors obtain the estimates P(q) and P(q1, q2) of the probability of a query q to appear in an arbitrary search session and the probability of queries q1 and q2 to appear in one search session respectively, which are utilized for computing four similarity functions: PMI, PMIJ, PMIS, and PMIG. Apart from these functions, we consider a simple similarity function that estimates the probability of the query q1 to appear in a session under the condition that the query q2 appears in that session: CP(q1, q2) = P(q1, q2)/P(q2).

3.4. Text similarities (semantic and syntactic). For new or rare queries, there may be too little data to compute similarities that exploit noise-prone query–document features or user search behavior. Instead, the texts of such queries can be used for the estimation of their similarity. Both semantic and syntactic similarities are computed by utilizing the texts of the queries only; therefore, we do not divide this class into two different subclasses. Each query q can be represented by a vector v_q in the n-gram vector space [10, 34, 38], where each dimension corresponds to an n-gram (a sequence of n consecutive terms of a query from Q). To define v_q, one first defines nonnegative weights of the n-grams occurring in the query. These weights are the components of v_q that correspond to the n-grams of the query; the other components of the vector are zero. Similarly to the features-based approach, we consider two functions, scalar product and cosine. Note that these functions are nonnegative-valued, because their arguments are nonnegative. Thereby, two semantic similarities,

    SSS_n(q1, q2) = ⟨v_{q1}, v_{q2}⟩,    CSS_n(q1, q2) = cos(v_{q1}, v_{q2}),

are considered for a given value of n chosen for the construction of the n-gram space. As in previous studies [34, 38], for any n-gram, the term frequency tf or the feature tf×idf (see [6]; we compute these features on the corpus of queries) is taken as the weight w of the n-gram.

For any two queries q1 and q2, a syntactic similarity measures the distance between q1 and q2 as sequences of terms. The most common syntactic similarity function is the Levenshtein similarity s(q1, q2), defined by using the Levenshtein distance dis(q1, q2) [18] with deletion, insertion, and substitution costs equal to 1. The Levenshtein similarity equals 1 − dis(q1, q2)/max{|q1|, |q2|}, where |q| is the number of terms in q. In [5], the deletion and insertion costs equal 1, and the substitution cost of term x1 for term x2 equals the normalized original Levenshtein distance dis(x1, x2)/max{|x1|, |x2|} between the terms x1 and x2 considered as sequences of characters. We call the similarity function between queries based on this modified Levenshtein distance the weighted Levenshtein similarity (WLS). The value of this function depends on the order of the terms in the queries. To avoid this effect, the authors of [5] additionally sort the terms of each query in alphabetical order; the weighted Levenshtein similarity computed for the reordered term sequences is called the ordered weighted Levenshtein similarity (OWLS).
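The term-level Levenshtein similarities admit a compact dynamic-programming implementation. Below is a minimal sketch of the plain Levenshtein similarity and of WLS (OWLS is obtained by sorting each query's terms alphabetically before the call); the example queries are invented.

```python
def levenshtein(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Edit distance between two sequences with unit insertion and
    deletion costs and a pluggable substitution cost."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                      # deletion
                         cur[j - 1] + 1,                   # insertion
                         prev[j - 1] + sub_cost(a[i - 1], b[j - 1]))
        prev = cur
    return prev[n]

def levenshtein_similarity(q1, q2):
    """1 - dis(q1, q2) / max(|q1|, |q2|) over term sequences."""
    t1, t2 = q1.split(), q2.split()
    return 1.0 - levenshtein(t1, t2) / max(len(t1), len(t2))

def wls(q1, q2):
    """WLS [5]: the substitution cost of term x1 for x2 is their
    character-level Levenshtein distance normalized by max term length."""
    char_cost = lambda x1, x2: levenshtein(x1, x2) / max(len(x1), len(x2))
    t1, t2 = q1.split(), q2.split()
    return 1.0 - levenshtein(t1, t2, sub_cost=char_cost) / max(len(t1), len(t2))

print(levenshtein_similarity("cheap flight tickets", "cheap flights"))
print(wls("cheap flight tickets", "cheap flights"))  # OWLS: sort terms first
```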
4. METHODS OF PROPAGATION

In this section, we introduce the state-of-the-art method and a number of novel methods of propagating query–document features via query similarity functions. We consider two different propagation principles: linear weighting and tree-based. The first is a generalization of the state-of-the-art propagation method proposed in previous studies [2, 13, 35]. Let Q and D be a set of queries and a set of documents respectively. We consider a query–document feature f: Q × D → R and similarity functions s_1, ..., s_m: Q × Q → R_+. We assume that, for any query q and any j ∈ {1, ..., m}, a set Q_q = Q_q^j of queries similar to q can be obtained on the basis of the similarity function s = s_j(·, q). Sections 4.1–4.3 focus on propagating the feature f to queries q based on Q_q and s.

4.1 State-of-the-art propagation and the problem statement

We start with a description of the standard way to propagate feature values across similar queries [2, 35]. It is based on the equation

    \tilde f_w(q, d) := \sum_{\tilde q \in Q_q} w(\tilde q, q) f(\tilde q, d),    (2)

which defines the propagated feature value f̃(q, d) as a weighted average of the feature values over the similar queries q̃ ∈ Q_q. The weights w(q̃, q) reflect the importance of the queries q̃ for the query q and are defined on the basis of the query similarities s(q̃, q). In the state-of-the-art approach [2, 35], the weight function w is defined by

    w(\tilde q, q) = s^{\sim}(\tilde q, q) := \frac{s(\tilde q, q)}{\sum_{\hat q \in Q_q} s(\hat q, q)},    (3)

which is theoretically unfounded and may be suboptimal in practice. Intuitively, we assume that the actual purpose of Equation 2 is to obtain a (more reliable) estimate

    f(q, d) \sim \tilde f_w(q, d)    (4)

of the unreliable or noisy feature value f(q, d). In contrast to Equation 2, which, in fact, defines f̃_w(q, d) in an ad-hoc unsupervised way, we propose to learn the weight function w = w(q̃, q) in order to optimize the quality of estimation, i.e., we directly minimize the mean squared deviation of the right-hand side of Equation 4 from its left-hand side:

    \min_w E(f(q, d) - \tilde f_w(q, d))^2.    (5)

Appendix A provides a theoretical foundation of Equation 4 and optimization Problem 5 in more precise and detailed terms. In Section 4.2, we propose novel methods of learning the weights w(q̃, q) in the described optimization setting. In Section 4.3, we consider another approach to propagating query–document features as an alternative to the linear weighting approach defined by Equation 2.

4.2 Linear weighting

In this section, we introduce a method of learning the weighting function w(q̃, q) used in Equation 2. We look for a solution w of Problem 5 of the form

    w(\tilde q, q) = g(s^{\sim}(\tilde q, q)),    (6)

where g belongs to a parameterized class of nonnegative-valued functions G = {g_γ: R → R, γ ∈ Γ}, and Γ denotes the parameter space (in our experiments, we consider Γ ⊂ R^N). We consider two different approaches to define the class G further in this paper: 1) choose a particular family of functions; 2) divide the segment [0, 1] into N disjoint subsets D_1, ..., D_N and let G be the space of step functions that are constant on each D_i, i.e., g_γ(x) = \sum_{i=1}^{N} γ_i I_{D_i}(x), where N is a fixed natural number, γ = (γ_1, ..., γ_N) ∈ R^N is the vector of parameters, and I_{D_i} is the indicator function of D_i. In the described setting, the optimization Problem 5 can be expressed as follows:

    \sum_{(q,d) \in D} h\Big( f(q, d),\ \sum_{\tilde q \in Q_q} g_γ(s^{\sim}(\tilde q, q)) f(\tilde q, d) \Big) \to \min_γ,    (7)

where D is a training set of query–document pairs and h = h(x, y) is a loss function (in particular, we consider the squared loss h(x, y) = (x − y)^2).

4.2.1. Parametric family of functions. Let G be any class of functions g_γ parameterized by γ (a vector of parameters). For many classes of functions g_γ (e.g., G = {γ^x, γ ∈ R}), Problem 7 does not have a closed-form solution. Consider the alternative optimization problem

    \sum_{(q,d) \in D} h\Big( f(q, d),\ g_γ\Big( \sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f(\tilde q, d) \Big) \Big) \to \min_γ,    (8)

which is obtained by interchanging the order of applying g_γ and summing over Q_q in the optimization Problem 7. It is easy to see that, for some classes G, Problem 8 has a closed-form solution (e.g., a solution for a polynomial regression model with G = {a_k x^k + ... + a_1 x + a_0, γ = (a_0, a_1, ..., a_k) ∈ R^{k+1}} and h = (x − y)^2). Obviously, if g_γ(·) is not a linear homogeneous function, then Problems 7 and 8 are not equivalent and do not necessarily have similar solutions, since the values of g_1^*(q) = g_γ(\sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f(\tilde q, d)) and g_2^*(q) = \sum_{\tilde q \in Q_q} g_γ(s^{\sim}(\tilde q, q)) f(\tilde q, d) are not the same. However, if, for any (q, d) ∈ D, we replace the values f(q̃, d) with their normalized modifications f_q^{\sim}(\tilde q, d) := f(\tilde q, d) / \sum_{\hat q \in Q_q} f(\hat q, d) in g_1^*(q) and g_2^*(q), then the obtained values (we denote them by g_1^{\sim}(q) and g_2^{\sim}(q)) are close, and hence we can arrive at the solution of the modification of Problem 7

    \sum_{(q,d) \in D} h\Big( f_q^{\sim}(q, d),\ \sum_{\tilde q \in Q_q} g_γ(s^{\sim}(\tilde q, q)) f_q^{\sim}(\tilde q, d) \Big) \to \min_γ    (9)

by solving the modification of Problem 8

    \sum_{(q,d) \in D} h\Big( f_q^{\sim}(q, d),\ g_γ\Big( \sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f_q^{\sim}(\tilde q, d) \Big) \Big) \to \min_γ.    (10)

Indeed, for linear functions g_{(a_0,a_1)}(x) = a_0 + a_1 x, the exact equality g_1^{\sim}(q) = g_2^{\sim}(q) holds. For a convex function g_γ, g_1^{\sim}(q) is slightly smaller than g_2^{\sim}(q); for a concave function g_γ, g_1^{\sim}(q) is slightly larger than g_2^{\sim}(q). If g_γ is a differentiable function that is neither convex nor concave, then, intuitively, the difference between g_1^{\sim}(q) and g_2^{\sim}(q) is even smaller. Moreover, in Appendix B, for two classes G, polynomial functions and power functions (the two classes we consider in our experiments, see Section 6.3), and Q_q large enough, we prove that

    g_1^{\sim}(q) \sim g_2^{\sim}(q).    (11)

Therefore, in order to consider modifications of Problems 7 and 8 that have similar solutions, we first normalize the values of f and consider the optimization Problem 9 instead of Problem 7. Under our assumptions, Problem 9 reduces to Problem 10. Moreover, in Section 6.3, we show that the propagation method based on solving Problem 10 outperforms the state-of-the-art propagation method. As mentioned above, in Section 6.3 we consider a class of polynomial functions of degree 3 and a class of power functions. In the first case, Problem 10 with h(x, y) = (x − y)^2 is solved by ordinary least squares estimation for a polynomial regression model E(y | x), where \sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f_q^{\sim}(\tilde q, d) is the explanatory variable x and f_q^{\sim}(q, d) is the response variable y. In the second case, we change the loss function h, because, for the class of power functions, Problem 10 cannot be solved by ordinary least squares estimation (the iterative Gauss–Newton algorithm for non-linear least squares [14] could be used, but we avoid it because of its computational complexity). To motivate the choice of the function h in this case, we study the dependency of f_q^{\sim}(q, d) on \sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f_q^{\sim}(\tilde q, d) in Section 6.3. As we show there, the dependency of log f_q^{\sim}(q, d) on log \sum_{\tilde q \in Q_q} s^{\sim}(\tilde q, q) f_q^{\sim}(\tilde q, d) is approximately linear. Therefore, in this particular case we choose h(x, y) = (log x − log y)^2.
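For the power-function class, the loss h(x, y) = (log x − log y)^2 turns Problem 10 into ordinary linear regression in log-log space: fitting g_γ(x) = ax^b reduces to fitting log y = log a + b log x. Below is a minimal sketch under this reduction; the sample points are invented, and points with non-positive coordinates, where the logarithm is undefined, are skipped.

```python
import math

def fit_power_weighting(pairs):
    """Fit g(x) = a * x**b by least squares in log-log space, i.e.
    minimize sum (log y - b*log x - log a)^2, matching the loss
    h(x, y) = (log x - log y)^2 for the power-function class.
    pairs -- (x, y) samples with x the similarity-weighted sum of
    normalized feature values and y the normalized target value."""
    pts = [(math.log(x), math.log(y)) for x, y in pairs if x > 0 and y > 0]
    n = len(pts)
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    b = sum((px - mx) * (py - my) for px, py in pts) / \
        sum((px - mx) ** 2 for px, _ in pts)
    a = math.exp(my - b * mx)
    return a, b

a, b = fit_power_weighting([(0.1, 0.05), (0.2, 0.11), (0.4, 0.19), (0.8, 0.42)])
print(a, b)  # learned parameters of g(x) = a * x**b
```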
4.2.2. Step weighting functions. Assumption 11 is well-founded for far from every class of functions G and every distribution (f_q^{\sim}(\tilde q, d), q̃ ∈ Q_q). Fortunately, we do not need it in the case of the second approach to defining G, where the g_γ ∈ G are step functions

    g_γ(x) = \sum_{i=1}^{N} γ_i I_{D_i}(x)    (12)

and D_1, ..., D_N ⊂ [0, 1] are disjoint subsets. We learn the parameters γ = (γ_1, ..., γ_N) ∈ R^N by minimizing the mean squared error (MSE) (Problem 7 with h(x, y) = (x − y)^2) using the linear regression algorithm:

    γ^T = (F^T F)^{-1} F^T f,    (13)

where F = (F_{(q,d),i})_{(q,d) \in D,\ i \in \{1, ..., N\}} with

    F_{(q,d),i} = \sum_{\tilde q \in Q_q} I_{D_i}(s^{\sim}(\tilde q, q)) f(\tilde q, d),

and f = (f(q, d), (q, d) ∈ D)^T is the vector of the values of the feature f, ordered in the same way as the elements of each column of the matrix F.
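Since Equation 13 is the standard least-squares solution, the step-function weights can be learned with a few lines of linear algebra. A minimal sketch, assuming the training data is already grouped per query–document pair; all names and numbers are illustrative.

```python
import numpy as np

def learn_step_weights(train, bins):
    """Learn the step weighting function of Equation 12 via the
    closed-form least-squares solution of Equation 13.

    train -- list of (f_target, neighbors), with f_target = f(q, d) and
             neighbors = [(s_sim, f_val), ...] over q~ in Q_q
    bins  -- right endpoints d_1 < ... < d_N of the partition of [0, 1]
    Returns gamma (one weight per interval D_i) and the row builder."""
    N = len(bins)

    def features(neighbors):
        # F_{(q,d),i} = sum of f(q~, d) over q~ with s~(q~, q) in D_i
        row = np.zeros(N)
        for s_sim, f_val in neighbors:
            i = np.searchsorted(bins, s_sim)  # index of the interval D_i
            row[min(i, N - 1)] += f_val
        return row

    F = np.array([features(nb) for _, nb in train])
    f = np.array([t for t, _ in train])
    gamma, *_ = np.linalg.lstsq(F, f, rcond=None)  # (F^T F)^{-1} F^T f
    return gamma, features

train = [(0.30, [(0.9, 0.28), (0.3, 0.10)]),
         (0.05, [(0.2, 0.40), (0.1, 0.02)]),
         (0.20, [(0.7, 0.25)])]
gamma, features = learn_step_weights(train, bins=[0.25, 0.5, 0.75, 1.0])
# Propagated value for a new pair: dot product with its feature row.
print(gamma @ features([(0.8, 0.22), (0.4, 0.05)]))
```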
4.3 Decision trees

In Sections 4.1 and 4.2, for a given query–document pair (q, d), we define the propagated feature value f̃(q, d) by Equation 2. It defines a function that depends linearly on the values f(q̃, d), with coefficients w(q̃, q) calculated on the basis of the similarities s^{\sim}(q̃, q). Certainly, this linear prediction method might not be the best performing among all prediction principles, as linear models are known to be suboptimal in a variety of learning tasks. In the general case, a predicted feature value f̃(q, d) equals the value of some function of the feature values f(q̃, d) and the similarity values s^{\sim}(q̃, q), q̃ ∈ Q_q. For this reason, we compare the linear approach described in Section 4.2 with a "tree-based approach", where a function of the same arguments f(q̃, d), s^{\sim}(q̃, q) is learned using gradient boosted decision trees [12, 23, 29]. In this section, we consider this alternative approach to propagating query–document feature values. For a query–document pair (q, d), the input of the tree-based model is a feature vector X_1(q, d), ..., X_N(q, d) of dimension N, which is a parameter of the algorithm. We consider two ways to define the features X_1, ..., X_N:

1. We simply take (f(q̃, d), s^{\sim}(q̃, q) | q̃ ∈ Q'_q), where Q'_q ⊂ Q_q is the subset of queries q̃ ∈ Q_q with the N/2 largest values of s^{\sim}(q̃, q). The value of N is even in this case.

2. Given a partition D_1, ..., D_N of [0, 1], we set X_i(q, d) = F_{(q,d),i} (see Equation 13). This approach is motivated by the second optimization algorithm from Section 4.2.2, where we learn, in fact, a linear model predicting f̃(q, d) from the same features F_{(q,d),1}, ..., F_{(q,d),N}.

The components of the feature vectors (f(q̃, d), s^{\sim}(q̃, q) | q̃ ∈ Q_q) in the first approach should be ordered on the basis of some principle common across different pairs (q, d). We use the following ordering:

    f(q_1, d), ..., f(q_{N/2}, d), s^{\sim}(q_1, q), ..., s^{\sim}(q_{N/2}, q),

where Q_q = {q_1, q_2, ...} is ordered in such a way that s(q_1, q) > s(q_2, q) > .... Here we set f(q_i, d) = s^{\sim}(q_i, q) = 0 for i ∈ {|Q_q| + 1, |Q_q| + 2, ..., N/2} if |Q_q| < N/2. Our goal is to learn a function g: R^N → R such that the predicted values

    \tilde f(q, d) = g(X_1(q, d), ..., X_N(q, d))    (14)

are close to the target values f(q, d) for (q, d) ∈ D, where D is a training set of query–document pairs (as in the previous section). We solve this problem by minimizing the squared loss \sum_{(q,d) \in D} (\tilde f(q, d) - f(q, d))^2 using gradient boosted decision trees [12]. We use the obtained function g to compute the propagated feature value f̃(q, d) defined by Equation 14.
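As an illustration of the first input construction, the sketch below assembles the N-dimensional input vector for one query–document pair; any gradient boosting implementation minimizing squared loss could then be trained on such vectors to obtain g in Equation 14. Names and toy values are invented.

```python
def tree_input_features(q, d, Q_q, s_norm, f, N):
    """Build the N-dimensional input of the tree-based model (first
    construction of Section 4.3): feature values and similarities of
    the N/2 most similar queries, ordered by decreasing similarity and
    zero-padded when |Q_q| < N/2."""
    top = sorted(Q_q, key=lambda qt: s_norm(qt, q), reverse=True)[: N // 2]
    f_part = [f(qt, d) for qt in top]
    s_part = [s_norm(qt, q) for qt in top]
    pad = N // 2 - len(top)
    return f_part + [0.0] * pad + s_part + [0.0] * pad

sim = {"a": 0.6, "b": 0.9, "c": 0.2}
ctr = {"a": 0.12, "b": 0.31, "c": 0.05}
x = tree_input_features("q", "d", ["a", "b", "c"],
                        lambda qt, q: sim[qt], lambda qt, d: ctr[qt], N=4)
print(x)  # [0.31, 0.12, 0.9, 0.6]
```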
5. QUANTILES

In Section 1, we asked whether some statistics of the distributions of the feature values over similar queries, besides the average value, can help the ranker use those similar queries more effectively. To answer this question, we use quantiles of four different distributions: the distribution of query frequencies, the distribution of the numbers of characters in queries (character lengths of queries), the distribution of the numbers of terms in queries (term lengths of queries), and the distribution of the values of the similarity function, in the following way. For each similarity s and query q ∈ Q, we divide the set Q_q of queries similar to q (obtained on the basis of the similarity function s) into 5 subsets Q_q(1), ..., Q_q(5) (see Equation 15) with respect to the values of the quantiles, and calculate the average value of the query–document feature CTR (which we exploit in our experiments, see Section 6.2) over the queries q̃ ∈ Q_q(i) and all pairs (q, d). Figure 1 shows the dependency of the obtained values on the quantile number i for 4 different similarities, one from each similarity type (see Section 3), and for two types of distributions whose quantiles define Q_q(i): the distribution of query frequencies and the distribution of the values of the similarity function.

[Figure 1: The dependency of the average value of the query–document feature CTR over queries from Q_q(i) and all pairs (q, d) on the quantile number i, for 4 similarity functions (CFS_CTR, WLS, CP, QUS_2^cl); frequency distribution (left), similarity distribution (right).]

For most of these similarities, the dependencies are monotone. This means that, on average, the values of the feature for similar queries are smaller for greater quantiles of query frequency (of similarity to the initial query q). This observation suggests that the average values of the feature over the queries from each Q_q(i) may carry more information than the average value of the feature over all queries from Q_q alone. In our experiments, we show that these values provide useful information for improving the quality of ranking (see Section 6.3).

In what follows, we describe in detail how we use the quantiles. As in Section 4, we consider a query–document feature f, a similarity function s := s_j, j ∈ {1, ..., m}, and, for each q ∈ Q, a set Q_q := Q_q^j of queries in Q that are similar to q. Let P be a distribution on R_+. Consider a sample X_1, ..., X_n from the distribution P and an increasing sequence of numbers α_1, ..., α_r ∈ [0, 1]. Let µ_1, ..., µ_r be the empirical α_1, ..., α_r-quantiles of the distribution P respectively. We construct the sample X_1, ..., X_n in two different ways: either it depends on q or it does not. The first way is to build the sample by using all the queries from Q_q for the frequency distribution and the distributions of lengths (i.e., (X_1, ..., X_n) = (y(q̃), q̃ ∈ Q_q), where y(q̃) is either the frequency or a length of q̃) and all the pairs (q̃, q), where q̃ ∈ Q_q, for the similarity distribution (i.e., (X_1, ..., X_n) = (s(q̃, q), q̃ ∈ Q_q)); in this case, the quantiles depend on q ∈ Q. The second way is to take all the queries from Q for the frequency distribution and the distributions of lengths and all the pairs from Q × Q for the similarity distribution (then the quantiles are the same for all queries q ∈ Q). We use the quantiles µ_1, ..., µ_r for dividing the queries into disjoint subsets and propagating the values of the feature f for each subset in the following way. For each i ∈ {1, ..., r}, we define the propagated feature value f̂_i(q, d) for the pair (q, d) ∈ Q × D either by Equation 2 in the linear weighting approach or by Equation 14 in the tree-based approach, using

    Q_q(i) := \{ \tilde q \in Q_q : X_{l(\tilde q)} \in (µ_{i-1}, µ_i] \}    (15)

instead of Q_q for learning the function g, where µ_0 = −∞ and X_{l(q̃)} is the element of the sample that corresponds either to the query q̃ (for the frequency distribution and the distributions of lengths) or to the pair (q̃, q) (for the similarity distribution).
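Below is a minimal sketch of the subdivision of Equation 15, assuming a simple empirical-quantile estimate; value() plays the role of the statistic X_{l(q̃)}, and the toy frequencies are invented.

```python
def quantile_subsets(Q_q, value, alphas, sample):
    """Split Q_q into subsets Q_q(1), ..., Q_q(r) per Equation 15.

    value(q~) -- the statistic: frequency, length, or similarity to q
    alphas    -- increasing quantile levels, e.g. [0.2, 0.4, 0.6, 0.8, 1.0]
    sample    -- the sample defining the empirical quantiles mu_1..mu_r
                 (built either per query or globally, see above)"""
    xs = sorted(sample)
    mus = [xs[min(int(a * len(xs)), len(xs) - 1)] for a in alphas]
    subsets = [[] for _ in alphas]
    for qt in Q_q:
        v = value(qt)
        for i, mu in enumerate(mus):
            lo = mus[i - 1] if i > 0 else float("-inf")
            if lo < v <= mu:       # v in (mu_{i-1}, mu_i]
                subsets[i].append(qt)
                break
    return subsets

freq = {"a": 3, "b": 40, "c": 700, "d": 9}
parts = quantile_subsets(["a", "b", "c", "d"], freq.get,
                         [0.25, 0.5, 0.75, 1.0], sample=list(freq.values()))
print(parts)  # [['a', 'd'], ['b'], ['c'], []]
```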
6. EXPERIMENTAL RESULTS

We compare the performance of the different propagation techniques described in Section 4 by learning ranking functions that use the propagated query–document features. Below, we describe the details of the learning-to-rank framework. In Section 6.1 and Section 6.2, we describe the dataset and the similarity functions we exploit in our experiments. Finally, we compare the performance of the different propagation methods in Sections 6.3 and 6.4. First, we compute the average values of ranking quality metrics on a test set of queries. Second, as we expect our methods to be especially useful for tail queries, Section 6.4 presents a closer look at the methods' performance on a subset of infrequent queries. We consider the classical problem of ranking documents according to their relevance to a query [8]. It was shown that regression with gradient boosted decision trees, while computationally more efficient, is only marginally worse than the most advanced list-wise learning-to-rank methods [8]; so, given the amount of computation in our experiments and the fact that our methods are agnostic w.r.t. the specific ranking method, we solve the ranking problem by minimizing MSE using a proprietary implementation of gradient boosted decision trees [12].
6.1 Data

For ranking evaluation, a random sample of queries Q_0 issued by the users of the European country under study was selected from the query logs of one of the most popular commercial web search engines. We selected 66.6% of the queries as our training set Q_1, 16.7% as the validation set Q_2, and the remaining 16.7% as the test set Q_3. The set Q_1 was divided into two subsets Q_1^1 and Q_1^2 of equal size: the first one is for learning to propagate query–document features, the second one is for learning to rank (see Table 1). For each query q ∈ Q_0 \ Q_1^1 (we do not need relevance labels for learning to propagate features), the relevance of each of the top ranked documents was judged by professional assessors hired by the search engine (we denote the set of these documents by D(q)). In our experiments, we consider only documents from the union D := ∪_{q ∈ Q_0 \ Q_1^1} D(q) of judged documents over all queries in the dataset. The relevance labels were assigned using a popular graded relevance scale with labels perfect, excellent, good, fair, and bad. The data we use contains ≈ 1.9M query–document pairs and ≈ 90K unique queries (around 21 judged documents per query). We compare the methods on the test set in terms of the average NDCG@3 and NDCG@5 metrics (see [7]), which were the primary measures in [2, 13].

Table 1: Sets of queries.

    Q (all queries)
      Q̃ (propagation set): 52.6M
      Q_0 (ranking evaluation set):
        Q_1 (train): Q_1^1 (predict., 30K) and Q_1^2 (ranking, 30K)
        Q_2 (validation): 15K
        Q_3 (test): 15K

While our framework is not specific to a particular type of query–document features, in this paper we experiment with click-through features for the sake of consistency with the previous papers [2, 11, 13]. One of the most widely used features of this kind is CTR; another one is nCTR, proposed in [2]. As the reliability of click-through feature values for a query–document pair (q, p) depends on the number of shows of p in response to q and on the number of clicks on p shown in response to q, we construct a set Q = Q_0 ∪ Q̃ of queries (see Table 1), from which we propagate query–document features, in the following way. We add a query q to Q̃ if either the total number of shows of documents in response to q recorded in the query log (all the records were made from 1 January 2014 to 31 March 2014) is greater than 70 or the number of clicks on documents shown in response to q is greater than 6. For each labeled document d from the dataset, we create the top-1000 list of queries from Q_0 ∪ Q̃ that have the largest numbers of shows of document d in response to them. We remove a query from Q̃ if it is not in such a top-1000 list for at least one document. The final set Q contains ≈ 52.7M queries. For the purpose of evaluating features-based and query-level user-based similarities and for propagating click-through features, we consider all query–document pairs (≈ 254.2M pairs) from the query logs where the query is from Q and the document is from D.
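Our evaluation relies on NDCG@k. For reference, below is a standard instantiation with a log2 position discount; the exact gain mapping of the graded labels used in our experiments is not specified here, so the numbers are purely illustrative.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain at cutoff k for one ranked list."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """NDCG@k for one query: gains are graded relevance values of the
    documents in ranked order (e.g. perfect=4 ... bad=0)."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# A ranking that puts a 'good' document above a 'perfect' one.
print(ndcg([2, 4, 0, 1], k=3))
```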
6.2 Similarities and similar queries

In our experiments, we consider 30 query similarity functions. We exploit 6 features-based similarities (see Section 3.1): SFS_f and CFS_f, where f is one of the features BM25, CTR, and nCTR (normalized CTR [2]); 9 query-level user-based similarities (see Section 3.2): QUS_1^cl, ..., QUS_6^cl and QUS_1^sh, QUS_2^sh, QUS_3^sh; 5 session-level user-based similarities (see Section 3.3): PMI, PMIJ, PMIS, PMIG, and CP; 8 semantic similarities (see Section 3.4): SSS_2, CSS_2, SSS_3, CSS_3, where the n-gram weight w is either tf or tf×idf; and 2 syntactic similarities (see Section 3.4): WLS and OWLS.

For each query q from the set Q_0, each similarity type y ∈ {1, ..., 4} (enumerated as in Section 3), and each similarity function s of type y, we define a set of similar queries Q_q(s) ⊂ Q by removing from the set Q the queries q̃ with small values of similarity s(q̃, q) (to avoid overly outlying feature values of these queries q̃ w.r.t. a given document being used for propagation), in the following way. First, independently of q, we define a threshold t(s) on the similarity values such that 10% of the values of s are less than t(s) (10% being the maximal portion of queries that can be removed without a significant change in the ranking quality), and, for the query q, we remove from Q all queries q̃ such that s(q̃, q) ≤ t(s). Second, we define a threshold n(y) on the size of the set of top queries in Q_q(s) ranked by their similarity to q. The value of n(y) was chosen from the segment [50, 2000] with step 50 by maximizing NDCG@3 on the part Q_1^2 of the training set (we exploit 5 regular query–document relevance features, PageRank, BM25, Query Frequency, CTR, and nCTR, apart from the features obtained via propagation, plus 60 propagated features: CTR and nCTR propagated by the linear weighting approach with each of the considered similarity functions, where the weights for propagation are defined by Equation 3). This was also compared with the quality of the ranking function learned without the threshold n(y): the quality of the function learned with the threshold was better for all the similarity types. From this point on, we consider only the top n(y) queries from Q_q(s) for each similarity type y.

Finally, we filter out those similarity functions from the initial set that do not influence the effectiveness of the propagation significantly (which is expected, as many of them strongly correlate with each other). Let y ∈ {1, 2, 3, 4} be a similarity type and s_{y,1}, ..., s_{y,m(y)} be all the considered similarities of type y (we denote by m(y) the number of similarity functions of type y). First, we train the ranking functions on the set Q_1^2 with the 5 regular features and two sets of features f̃_1, ..., f̃_{m(y)} defined by Equations 2 and 3, where f is either CTR or nCTR. Further, for each i ∈ {1, ..., m(y)}, we remove the two features f̃_i (one for CTR and one for nCTR) at a time and train the ranking function. If the quality of the ranking function does not change statistically significantly (according to a paired t-test with p < 0.05) on the validation set, we no longer exploit the similarity s_{y,i} in our experiments. After this procedure, the following 15 similarity functions remain: all 6 features-based similarities; 3 query-level user-based similarities: QUS_2^cl, QUS_6^cl, and QUS_1^sh; 2 session-level user-based similarities: PMIJ and CP; 2 semantic similarities: SSS_3 and CSS_3 with w = tf×idf; and 2 syntactic similarities: WLS and OWLS.
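The construction of Q_q(s) combines a global similarity threshold with a per-type top-n cut. Below is a minimal sketch of both steps under the stated assumptions; the candidate queries and scores are toy values.

```python
def similarity_threshold(all_values, fraction=0.1):
    """t(s): chosen so that ~`fraction` of the observed similarity
    values fall below it (a simple empirical quantile)."""
    xs = sorted(all_values)
    return xs[int(fraction * len(xs))]

def similar_set(q, candidates, s, t, n):
    """Q_q(s): drop candidates with s(q~, q) <= t, keep top n by similarity."""
    kept = sorted((qt for qt in candidates if s(qt, q) > t),
                  key=lambda qt: s(qt, q), reverse=True)
    return kept[:n]

scores = {"a": 0.05, "b": 0.4, "c": 0.9, "d": 0.7, "e": 0.1}
t = similarity_threshold(scores.values(), fraction=0.1)
print(similar_set("q", scores, lambda qt, q: scores[qt], t, n=3))
# ['c', 'd', 'b']
```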
6.3 Propagation

All the ranking functions in this section were trained on the set Q_1. The meta-parameters of each ranking function were tuned by maximizing the respective quality measures on the set Q_2. We trained a ranking function exploiting the set of 5 regular features; moreover, we learned 8 further ranking functions exploiting different sets of features, described in what follows (see Table 2).

Table 2: Performances of the ranking functions for different sets of features on the test set.

    Ranking function                      | Features                                                              | NDCG@3  | NDCG@5
    Regular                               | Regular features only                                                 | 0.61637 | 0.62757
    State-of-the-art                      | Regular + 30 prop. features (Eq. 2, 3)                                | 0.63086 | 0.6463
    Power                                 | Regular + 30 prop. features (Eq. 2, 6; g a power function)            | 0.63257 | 0.6478
    Polynomial                            | Regular + 30 prop. features (Eq. 2, 6; g a polynomial)                | 0.63223 | 0.64749
    Uniformly discrete                    | Regular + 30 prop. features (Eq. 2, 6, 12; uniform partition)         | 0.63293 | 0.64851
    Equivalently discrete                 | Regular + 30 prop. features (Eq. 2, 6, 12; equal card. of sets)       | 0.63491 | 0.64978
    Point-based tree-based                | Regular + 30 prop. features (Eq. 14; features 1), Sec. 4.3)           | 0.6288  | 0.6422
    Interval-based tree-based             | Regular + 30 prop. features (Eq. 14; features 2), Sec. 4.3)           | 0.63254 | 0.647
    Equivalently discrete with quantiles  | Regular + 150 prop. feat. (equal card. of sets, similarity distrib.)  | 0.63749 | 0.6521

For learning the functions g used by the propagation methods of Section 4.2 (Equation 6) and Section 4.3 (Equation 12) with each of the considered similarity functions, we set Q = Q_1^1. We construct the set D in the following way: a pair (q, d) is in D if and only if the number of shows sh_{qd} of the document d in response to query q is greater than 25 and the number of shows of all documents in response to the query q is greater than 500.

Let s be any of the considered similarity functions. First, we assume that the continuous function g for the linear weighting propagation method (see Section 4.2.1) is a power function and study the dependency of f_q^{\sim}(q, d) on f̂_q^{\sim}(q, d) = \sum_{\tilde q \in Q_q^j} s_j^{\sim}(\tilde q, q) f_q^{\sim}(\tilde q, d) for all pairs (q, d) such that q ∈ Q_1^1. To justify the choice of the function h (see Section 4.2.1), we interpret this dependency in the following way: for each value of f_q^{\sim}(q, d), we find the mean value Ef̂_q^{\sim}(q, d) of f̂_{q_0}^{\sim}(q_0, d_0) over all pairs (q_0, d_0) such that f̂_{q_0}^{\sim}(q_0, d_0) ≈ f̂_q^{\sim}(q, d). Figure 2 shows the dependency of f_q^{\sim}(q, d) on Ef̂_q^{\sim}(q, d) and the approximation f^{\sim} = b·Ef̂^{\sim} + a fitted by linear regression on the log-log scale plot (red line); for illustration purposes, we consider only two similarities here. As the dependency is approximately linear, we suppose that G = {ax^b, γ = (a, b) ∈ Γ}, Γ = R^2, and h(x, y) = (log x − log y)^2. We find the optimal γ in Problem 10 using the linear regression algorithm in the log-log scale for each similarity function (Figure 2, blue line) and thereby obtain 30 propagated features. Finally, we train the ranking function on the set Q_1^2 with the 5 baseline features and the obtained 30 propagated features (see Table 2).

[Figure 2: The dependency of f^{\sim} on Ef̂^{\sim} and the approximation of this dependency using linear regression on the log-log scale plot (red line) for two similarities, CFS_CTR and PMIJ. For comparison, the blue line shows the fit with the learned parameters (a, b).]

Second, we suppose that g is a polynomial function of degree 3, i.e., G = {a_3 x^3 + a_2 x^2 + a_1 x + a_0, γ = (a_3, a_2, a_1, a_0) ∈ Γ}, Γ = R^4, and h(x, y) = (x − y)^2. We find the optimal γ in Problem 10 using the linear regression algorithm for each similarity function, obtain 30 propagated features, and train the ranking function on the set Q_1^2 with the 5 baseline features and the obtained 30 propagated features (see Table 2).

For learning the discrete function g, we set N = 100 (the value of N was chosen from {10, 100, 1000} by maximizing NDCG@3 on Q_1^2, see Section 6.1). We divide the segment [0, 1] into N intervals D_1 = [0, d_1), D_2 = [d_1, d_2), ..., D_100 = [d_99, 1] in two different ways. First, we exploit the uniform division d_i = i/100, i ∈ {0, ..., 99}, learn the parameters γ_1, ..., γ_100 (see Section 4.2), and obtain 30 propagated features (see Table 2). Second, we choose a random sample Q̂_1^1 of 20% of the queries from Q_1^1 and let d_1, ..., d_99 be the empirical 0.01, ..., 0.99-quantiles of the distribution of s when the second argument is in Q_1^1. For learning the parameters γ_1, ..., γ_100, we remove the sample Q̂_1^1 from the set Q_1^1. As a result, we again obtain 30 propagated features (see Table 2).

For each of the two ways of constructing the sets of features in the tree-based approach (see Section 4.3), we consider three choices of the parameter M (M = 100, M = 200, and M = 400). The optimal values of M were chosen by maximizing NDCG@3 on Q_1^2 (see Section 6.1); for the point-based tree-based ranking function we chose M = 200, and for the interval-based tree-based ranking function we chose M = 100.

Table 2 compares the effectiveness of all the propagation methods on the test set within each class of propagation methods: continuous weights-based, discrete weights-based, and tree-based (we consider 2 ranking functions from each class). The best methods are Power in the continuous weighting class, Equivalently discrete in the discrete weights-based class, and Interval-based tree-based in the tree-based class. The best propagation method overall is Equivalently discrete. The NDCG@3 and NDCG@5 gains of this method in comparison with State-of-the-art equal ≈ 0.6% and ≈ 0.5% respectively; its gains in comparison with Regular equal ≈ 3% and ≈ 3.5% respectively. All the differences in metric values are significant according to the Paired Bootstrap Hypothesis Test [25] with p < 0.01.

For the best propagation method (Equivalently discrete), we use the quantiles (see Section 5) for dividing similarity scores into intervals and propagating the features for each interval. We consider 5 quantiles, α_1 = 0.2, α_2 = 0.4, α_3 = 0.6, α_4 = 0.8, α_5 = 1 (see Section 5), for 4 different distributions (the distribution of frequencies, the distribution of character lengths, the distribution of term lengths, and the distribution of the values of the respective similarity function). Moreover, we evaluate the quantiles by constructing the samples in two ways (either a sample is constructed for each query, or one sample is constructed for all the pairs of queries; see Section 5). We compare the performance of the obtained 8 ranking functions. Quantiles of the similarity distribution with one sample constructed for all the pairs of queries outperform all the other propagation methods on the validation set in terms of both NDCG@3 and NDCG@5. The NDCG@3 and NDCG@5 gains of the Equivalently discrete ranking function with quantiles in comparison with the simple Equivalently discrete ranking function equal ≈ 0.4% for both ranking metrics on the test data (see Table 2). The gains of this ranking function in comparison with State-of-the-art equal ≈ 1.1% and ≈ 0.9% respectively, and in comparison with Regular ≈ 3.4% and ≈ 3.9% respectively. All the differences in quality are significant according to the Paired Bootstrap Hypothesis Test with p < 0.01.
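Significance above is assessed with the paired bootstrap test [25]. Below is a minimal sketch of one common variant that resamples per-query metric differences; it is illustrative and not necessarily the exact procedure of [25].

```python
import random

def paired_bootstrap_pvalue(metric_a, metric_b, trials=10000, seed=0):
    """Paired bootstrap significance test: resample queries with
    replacement and count how often the mean per-query difference of
    the metric (e.g. NDCG@3) is not positive.
    metric_a, metric_b -- per-query values of the two systems, aligned."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    worse = 0
    for _ in range(trials):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            worse += 1
    return worse / trials

# Toy usage with invented per-query NDCG values.
print(paired_bootstrap_pvalue([0.64, 0.70, 0.58, 0.66],
                              [0.61, 0.69, 0.55, 0.60]))
```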
6.4 Performance on tail queries

We test our algorithm on the set of tail queries (see Figure 3). For the queries infrequently issued by users (with frequencies of at most 5), the NDCG@3 and NDCG@5 gains of the Equivalently discrete ranking function with quantiles in comparison with State-of-the-art (Regular) equal ≈ 3.9% (≈ 13.7%) and ≈ 3.9% (≈ 13.8%) respectively. These gains decrease to ≈ 1.9% (≈ 6.2%) and ≈ 2.2% (≈ 7.4%) respectively as the frequencies grow up to 30.

[Figure 3: Comparison of our algorithm (Equivalently discrete with quantiles), the State-of-the-art, and the Regular ranking functions for the tail queries: NDCG@3 (left) and NDCG@5 (right) as functions of query frequency.]

7. CONCLUSIONS

In this paper, we consider different methods of propagating query–document relevance features for the purpose of improving the quality of web page ranking. Given a similarity function and a query–document feature, we investigate different approaches to learn a function of the similarity values and the values of the feature for similar queries w.r.t. a given document. This function is then used to compute the propagated values of the features. We also consider statistics associated with the distributions of character and term lengths of queries, query frequencies, and query similarities, and show that they help the ranker use similar queries more effectively. Moreover, we choose the most effective type of distribution and the most effective way of exploiting it. Our experimental results show that the proposed propagation algorithms significantly outperform the state-of-the-art propagation algorithm. Moreover, as expected, the effectiveness of propagation by our method in comparison with the state-of-the-art approach depends heavily on the frequencies of the queries that are supplemented with the features obtained via propagation: we observe that our methods are especially useful for rare (tail) queries. It would be interesting to further study the properties and applicability of the propagation methods. We plan to determine in more detail which properties of queries define whether they would benefit from propagation by different methods. Moreover, we are going to apply the propagation methods to different tasks besides web ranking.
8. REFERENCES

[1] E. Agichtein, E. Brill, S. Dumais, Improving web search ranking by incorporating user behavior information, Proc. SIGIR'06, pp. 19–26, 2006.
[2] E. Aktolga, J. Allan, Reranking search results for sparse queries, Proc. CIKM'11, pp. 173–182, 2011.
[3] R. M. Bell, Y. Koren, Scalable collaborative filtering with jointly derived neighborhood interpolation weights, Proc. ICDM'07, pp. 43–52, 2007.
[4] M. Bilenko, R. W. White, Mining the search trails of surfing crowds: identifying relevant websites from user activity, Proc. WWW'08, pp. 51–60, 2008.
[5] F. De Bona, S. Riezler, K. Hall, M. Ciaramita, A. Herdağdelen, M. Holmqvist, Learning dense models of query similarity from user click logs, Proc. NAACL-HLT'10, pp. 474–482, 2010.
[6] C. Buckley, G. Salton, Term-weighting approaches in automatic text retrieval, Information Processing and Management, Elsevier, Vol. 24, No. 5, pp. 513–523, 1988.
[7] C. Calauzènes, N. Usunier, P. Gallinari, On the (non-)existence of convex, calibrated surrogate losses for ranking, NIPS'12, pp. 197–205, 2012.
[8] O. Chapelle, Y. Chang, Yahoo! Learning to Rank Challenge overview, JMLR: Workshop and Conference Proceedings 14, pp. 1–24, 2011.
[9] O. Chapelle, D. Metzler, Expected reciprocal rank for graded relevance, CIKM'09, pp. 621–630, 2009.
[10] S. F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, Technical Report TR-10-98, Computer Science Group, Harvard University, 1998.
[11] N. Craswell, M. Szummer, Random walks on the click graph, Proc. SIGIR'07, pp. 239–246, 2007.
[12] J. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Statist., 29, pp. 1189–1232, 2001.
[13] J. Gao, W. Yuan, X. Li, K. Deng, J.-Y. Nie, Smoothing clickthrough data for web search ranking, Proc. SIGIR'09, pp. 355–362, 2009.
[14] W. H. Greene, Econometric Analysis, Prentice Hall, 4th ed., 1999.
[15] J. Guo, X. Cheng, G. Xu, X. Zhu, Intent-aware query similarity, Proc. CIKM'11, pp. 259–268, 2011.
[16] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2nd ed., 2009.
[17] R. Jones, B. Rey, O. Madani, W. Greiner, Generating query substitutions, Proc. WWW'06, pp. 387–396, 2006.
[18] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady, 10(8), pp. 707–710, 1966.
[19] M. Li, Y. Zhang, M. Zhu, M. Zhou, Exploring distributional similarity based models for query spelling correction, ACL-44, pp. 1025–1032, 2006.
[20] H. Ma, H. Yang, I. King, M. R. Lyu, Learning latent semantic relations from clickthrough data for query suggestion, Proc. CIKM'08, pp. 709–718, 2008.
[21] Q. Mei, D. Zhou, K. Church, Query suggestion using hitting time, Proc. CIKM'08, pp. 469–478, 2008.
[22] D. Metzler, S. Dumais, C. Meek, Similarity measures for short segments of text, Proc. ECIR'07, pp. 16–27, 2007.
[23] A. Mohan, Zh. Cheng, K. Weinberger, Web-search ranking with initialized gradient boosted regression trees, In Yahoo! Learning to Rank Challenge, pp. 77–89, 2011.
[24] M. Sahami, T. D. Heilman, A web-based kernel function for measuring the similarity of short text snippets, Proc. WWW'06, pp. 377–386, 2006.
[25] T. Sakai, Evaluating evaluation metrics based on the bootstrap, Proc. SIGIR'06, pp. 525–532, 2006.
[26] D. Sheldon, M. Shokouhi, M. Szummer, N. Craswell, LambdaMerge: merging the results of query reformulations, WSDM'11, pp. 795–804, 2011.
[27] Y. Song, L. He, Optimal rare query suggestion with implicit user feedback, Proc. WWW'10, pp. 901–910, 2010.
[28] I. Szpektor, A. Gionis, Y. Maarek, Improving recommendation for long-tail queries via templates, Proc. WWW'11, pp. 47–56, 2011.
[29] S. Tyree, K. Q. Weinberger, K. Agrawal, Parallel boosted regression trees for web search ranking, WWW'11, pp. 387–396, 2011.
[30] E. M. Voorhees, Query expansion using lexical-semantic relations, Proc. SIGIR'94, pp. 61–69, 1994.
[31] X. Wang, C. Zhai, Mining term association patterns from search logs for effective query reformulation, Proc. CIKM'08, pp. 479–488, 2008.
[32] W. Wu, H. Li, J. Xu, Learning query and document similarities from click-through bipartite graph with metadata, Proc. WSDM'13, pp. 687–696, 2013.
[33] J. Xu, W. B. Croft, Query expansion using local and global document analysis, Proc. SIGIR'96, pp. 4–11, 1996.
[34] J. Xu, G. Xu, Learning similarity function for rare queries, WSDM'11, pp. 615–624, 2011.
[35] X. Yi, J. Allan, Discovering missing click-through query language information for web search, Proc. CIKM'11, pp. 153–162, 2011.
[36] H. Zaragoza, B. B. Cambazoglu, R. Baeza-Yates, Web search solved?: all result rankings the same?, Proc. CIKM'10, p. 529, 2010.
[37] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems (TOIS), Vol. 22, No. 2, pp. 179–214, 2004.
[38] Z. Zhang, O. Nasraoui, Mining search engine query logs for query recommendation, Proc. WWW'06, pp. 1039–1040, 2006.

APPENDIX

A. PROBABILISTIC INTERPRETATION OF PROPAGATION APPROACHES

The optimization Problem 5 defined in Section 4.1 can be theoretically justified under the following assumptions. Let d be a fixed document. Consider a query as a random event q and a feature as a random variable f. We assume that Q is a random sample from the distribution of q, and that the feature value f(q, d) available in the dataset is an observation of the conditional distribution P_{f|q}. We assume that the conditional expectation E(f | q = q) carries more information on the relevance of d to q than the noisy observation f(q, d). Since E(f | q = q) is not directly observed in the data, our task is to provide its estimate based on the observed values f(q̃, d) for different q̃ ∈ Q_q. For example, consider f = CTR (click-through rate). In this case, the observed value f(q, d) = CTR(q, d) is the number of clicks on d divided by the number of its shows in response to q, and can be viewed as the observed success rate in a Bernoulli process where a trial corresponds to a show. The value of CTR(q, d) is an estimate of the success probability, i.e., the probability of a click on the document d shown in response to query q. When the number of shows is low, CTR can deviate from the true value of this probability significantly, thus providing an inaccurate estimate.

Our propagation approach is based on the intuition that the conditional distribution P_{f|q=q̃} is similar to the conditional distribution P_{f|q=q} if q̃ is similar to q. Relying on this, we consider the values f(q̃, d), q̃ ∈ Q_q, as independent observations of P_{f|q=q}. We substitute the conditional distribution P_{f|q=q} by the discrete empirical distribution of the values f(q̃, d), q̃ ∈ Q_q, taken with probabilities w(q̃, q), which decrease as the similarity of q̃ to q decreases. We use the expectation f̃_w(q, d) of this surrogate distribution (see Equation 2) as a propagated estimate of E(f | q = q). This approach is similar to the technique of Nadaraya–Watson kernel-weighted estimation of conditional expectation [16, Section 2.8.2]. The main difference is that our observed events are queries, not quantitative variables; therefore, we cannot rely on the Euclidean distance used by conventional kernel functions. Instead, we define the weighting function w(q̃, q) on the basis of the similarity function s(q̃, q). We further argue that the weights w(q̃, q) should be tuned to directly minimize the mean squared deviation of the right-hand side of Equation 4 from its left-hand side. In fact, we have

    E[(E(f | q) − \tilde f_w(q, d))^2 | q = q]
      = E[((f − E(f | q)) + (E(f | q) − \tilde f_w(q, d)))^2 | q = q] − E[(f − E(f | q))^2 | q = q]
      = E[(f − \tilde f_w(q, d))^2 | q = q] − E[(f − E(f | q))^2 | q = q].

Therefore, the optimization problem \min_w E(E(f | q) − \tilde f_w(q, d))^2 is equivalent to the optimization problem \min_w [E(f − \tilde f_w(q, d))^2 − E(f − E(f | q))^2], which is equivalent to Problem 5.

B. PROOF OF EQUATION 11

We rely on the hypothesis that the f(q̃, d) are independent observations of P_{f|q=q}, and on the additional hypothesis that the s(q̃, q) are independent observations of some probability distribution P_{s|q=q} as well. Let ξ_1, ..., ξ_n be independent, nonnegative, identically distributed random variables such that for all t ∈ Z the inequality Eξ_1^t < ∞ holds. By the law of large numbers and Markov's inequality, for any δ, ϵ, ε > 0 and n large enough,

    P(∃ i ∈ {1, ..., n}: ξ_i (ξ_1 + ... + ξ_n)^{-1} > n^{ϵ-1}) < n P(ξ_1 > n^{ϵ} (Eξ_1 + ε)) + P(ξ_1 + ... + ξ_n > (Eξ_1 + ε) n) < δ,
    P(∀ i ∈ {1, ..., n}: ξ_i (ξ_1 + ... + ξ_n)^{-1} > n^{-ϵ-1}) > (P(ξ_1 > n^{-ϵ} (Eξ_1 − ε), ξ_1 + ... + ξ_n > (Eξ_1 − ε) n))^n > 1 − δ.

So, if k is the exponent of a power function g_γ, then g_i^{\sim}(q) = |Q_q|^{-k(1+o(1))}, i ∈ {1, 2}. If g_γ(x) = a_0 + ... + a_k x^k with a_0 ≠ 0, then g_i^{\sim}(q) = a_0 (1 + |Q_q|^{-1+o(1)}), i ∈ {1, 2}. If g_γ(x) = a_s x^s + ... + a_k x^k, then g_i^{\sim}(q) = |Q_q|^{-s(1+o(1))}, i ∈ {1, 2}.