Learning to Rank Using Smoothing Methods for Language Modeling
Yuan Lin, Hongfei Lin, Kan Xu, and Xiaoling Sun School of Computer Science and Technology, Dalian University of Technology, 2 Linggong Road, Ganjingzi, Dalian, 116023 China. E-mail: {yuanlin, xlsun}@mail.dlut.edu.cn, {hflin, xukan}@dlut.edu.cn
The central issue in language model estimation is smoothing, a technique for avoiding the zero-probability problem and overcoming data sparsity. There are three representative smoothing methods: the Jelinek-Mercer (JM) method, Bayesian smoothing using Dirichlet priors (Dir), and absolute discounting (Dis), whose parameters are usually estimated empirically. Previous research in information retrieval (IR) on smoothing parameter estimation tends to select a single value from the optional values for the whole collection, but that value may not be appropriate for every query. The effectiveness of all the optional values should be considered to improve ranking performance. Recently, learning to rank has become an effective approach to optimizing ranking accuracy by merging existing retrieval methods. In this article, the smoothing methods for language modeling in information retrieval (LMIR) with different parameters are treated as different retrieval methods, and a learning to rank approach is presented that learns a ranking model based on the features extracted by these smoothing methods. In the process of learning, the effectiveness of all the optional smoothing parameters is taken into account for all queries. The experimental results on the Learning to Rank for Information Retrieval (LETOR) LETOR3.0 and LETOR4.0 data sets show that our approach is effective in improving the performance of LMIR.
Introduction and Motivation

The language modeling approach (Ponte & Croft, 1998) is one of the most important and widely used information retrieval (IR) approaches because it connects the problem of retrieval with language model estimation. It has served as a strong baseline in the IR community because even very simple language modeling retrieval methods have performed
quite well. The basic idea of these approaches is to estimate a language model for each document and to rank documents by the likelihood of the query according to the estimated language model. This simple framework is promising because of its foundations in statistical theory. An important issue in language model estimation is smoothing, which aims to adjust the maximum likelihood estimator to compensate for data sparseness. There are several representative smoothing methods: Jelinek-Mercer (JM), Bayesian smoothing using Dirichlet priors (Dir), and absolute discounting (Dis). However, previous work on smoothing parameter estimation chose only a single fixed parameter for all the documents, the one that achieves the best retrieval performance on average. It is usually estimated empirically, for example by grid search. An interesting question, however, is whether parameters other than the single optimal one also have useful effects on the results. It is known that the best parameters for different queries may differ; for example, the smoothing parameter of the JM method tends to be larger for long queries and smaller for short queries (Zhai & Lafferty, 2004). However, for a real search engine, it is difficult to predict the length of a user's query. Therefore, it is necessary to propose an approach that can utilize different optional parameters to improve the performance of the language model.

Learning to rank has become an important research field in IR and machine learning. It aims to learn a ranking function automatically from a training set, and the learned function is used to sort documents by relevance. The key issue of learning to rank is how to construct such a model. The ranking task is defined as follows: The training data (referred to as D) consists of a set of records of the form 〈q, d, r〉, where q is a query, d is a document (represented as a list of features f1, f2, . . . , fm), and r is the relevance judgment of d to q. The relevance draws its values from a distinct set of possibilities (e.g., 0, 1). The test set (referred to as T) consists of records 〈q, d, ?〉 in which only the query
q and the document d are known, whereas the relevance judgment of d to q is unknown. The training data is used to construct a model that relates features of the documents to their corresponding relevance, and the model is used to predict the relevance of the documents of the test data for getting a ranking list. The features of the documents are extracted by the traditional retrieval methods such as TF-IDF, BM25 and PageRank, among others. Thus, learning to rank approaches can also be considered as a framework that learns the ranking model by merging different retrieval methods such that all existing retrieval methods can be taken into account together. From this point of view, it is very flexible and could also include the influences of different smoothing parameters. The research in this article is based on two issues: Using the learning to rank method to merge language modeling smoothing approaches with different parameters, and examining the feasibility of using different parameters language model smoothing approaches as ranking features to improve the retrieval performance of existing methods of learning to rank. The main contributions of this article are three-fold: First, we propose a learning to rank–based framework that can take into account rich features for boosting language model in information retrieval (LMIR) performance. Second, we explore different features that may affect ranking performance, which are extracted from different parameters and smoothing methods of LMIR, and different fields of documents. Finally, we conduct extensive and systematic experiments over three standard IR test collections. The experimental results show that our proposed approach can significantly improve retrieval performance. Moreover, by expanding the feature space of the Learning to Rank for Information Retrieval (LETOR) LETOR3.0 data set based on our proposed approach, the existing learning to rank approach can also be enhanced. The rest of this article is organized as follows: Related Work introduces the related work about language model and learning to rank. Methodology gives a description on the approach we used to merge the smoothing methods with different parameters for language model. Experiments presents the experimental results and analysis. Finally, we conclude this work and point out some directions for future research in the Conclusions. Related Work There have been many approaches, ranging from simple to complex, to document retrieval. Language modeling approaches for IR are based on statistical theory, which incorporates terms frequencies both in documents and collections. It is one of the most widely used retrieval methods that can be effective in many fields for IR. Recently, much research has been based on language modeling approaches (Dang & Croft, 2010; Lease, 2009; Lease, Allan, & Croft, 2009). Although these works achieve good performance on query reformulation by using language modeling approaches, there is little discussion
about the language model itself, especially for the smoothing parameters estimation, which may lead to a better result by being taken into account. A challenge in using retrieval ranking approaches is parameter tuning. Some research has attempted to apply learning methods to tune the retrieval model parameters. Umemura and Church (2000) proposed an empirical method for estimating term weights by TF-IDF, directly from relevance judgments using regression method. Svore and Burges (2009) developed a machine learning approach to BM25-style retrieval that learns, using LambdaRank (Burges, Ragno, & Le, 2006), from the input attributes of BM25. These all achieved good performance; however, the efforts made with language modeling approaches did not achieve the same success. Metzler (2007) also used the RankNet (Burges, Shaked, Renshaw, Deeds, Hamilton, & Hullender, 2005) cost function to optimize two-stage (Zhai & Lafferty, 2004) smoothing parameters estimation of language modeling, but the results indicate that RankNet is never significantly better than direct search for estimating the two-stage language modeling parameters. It seems that it is difficult to learn an appropriate parameter for language modeling directly by the machine learning method. Hence, in this article, we drop the idea of learning a single optimal parameter by machine learning method, and present an approach to consider the influences of all optional parameters to improve the performance of language modeling retrieval approaches for ranking. Ranking is a core issue in IR because the quality of a retrieval system is mainly evaluated by the relevance of its ranking results. Learning to rank achieves great success in improving the ranking performance, so it has become an important research field. Many machine learning–based approaches have been proposed. They are usually grouped into three approaches (Liu, 2009): The pointwise approach, the pairwise approach, and the listwise approach. Recent research suggests that the listwise approach seems to be better than the other two approaches, especially for the algorithm of ListNet (Cao, Qin, Liu, Tsai, & Li, 2007), which achieves satisfactory performance on LETOR3.0 data sets. In this article, we choose ListNet as the learning approach to merge the language model approaches because it has been shown that ListNet (Cao et al., 2007) is empirically optimal for Mean average Precision (MAP) and other IR evaluation measures. Although learning to rank is effective to improve the ranking accuracy, the related research mainly focuses on the development of new approaches (Burges, Shaked, Renshaw, Deeds, Hamilton, & Hullender, 2005; Burges, Ragno, & Le, 2006; Freund, Iyer, Schapire, & Singer, 2003; Herbrich, Obermayer, Qin, & Graepel, 1999; Nallapati, 2004; Wang, Lin, & Metzler, 2011; Xu & Li, 2007; Zhou, Xue, Zha, & Yu, 2008), or on the improvement of the existing ranking approaches (Bian, Liu, Qin, & Zha, 2010; Chen, Liu, & Lan, 2009; Ganjisaffar, Caruana, & Lopes, 2011; Szummer & Yilmaz, 2011; Xia, Liu, & Li, 2009), and so on. Few works exist for the feature selection methods (Duh & Kirchhoff, 2008; Geng, Liu, Qin, & Li,
2007). Feature selection methods for LETOR are introduced in Geng et al. (2007), which is a foundation of feature selection for learning to rank. Many studies are based on features extracted from different retrieval methods, whereas the fact that the same approach with different parameters can also be regarded as a source of features is ignored. Duh and Kirchhoff (2008) applied kernel-based principal component analysis to the test set to model principal component patterns, which are then used for extracting new features from the training set. It achieves a good ranking result by adding these new features to the feature space of LETOR. The approach generates new features from the training set and the test set; however, the basis of these features still comes from LETOR. In this article, we present a new feature extraction method that uses the smoothing methods with different parameters for the language model to extract ranking features. In our previous work (Lin et al., 2009), we conducted some preliminary experiments to boost the performance of the language model by a machine learning approach. We applied RankBoost (Freund et al., 2003) to learn a ranking model from a feature space constructed by a single smoothing method of the language model over multiple text content fields. However, the features extracted in that way are too limited to significantly improve the ranking performance. In this article, we explore other ways to expand the feature space for further ranking accuracy, with two contributions: training a ranking model based on multiple parameters of multiple smoothing methods in multiple fields to improve the performance of LMIR by ListNet, and expanding the original feature space with features from the smoothing methods of LMIR to improve the effectiveness of an existing learning to rank approach.
Methodology

In this section, we briefly introduce the three representative smoothing approaches for the language model. In addition, we review the state-of-the-art ranking algorithm called ListNet. Finally, we investigate how to integrate the smoothing language model retrieval approaches with ListNet.

Smoothing Methods

The basic idea of the language model approach (Zhai & Lafferty, 2004) can be explained as follows: A query q is generated by a probabilistic model based on a document d. Given a query q = q1, q2, . . . , qn and a document d = d1 d2 . . . dm, it is important to estimate the conditional probability p(d | q), which is the probability that d generates the observed q. After applying Bayes's formula and dropping a document-independent constant (because we are interested only in ranking documents), we have

p(d | q) ∝ p(q | d) p(d)    (1)

In this article, we are interested in the smoothing approaches that give the relevance scores of documents. The term smoothing refers to the adjustment of the maximum likelihood estimator of a language model so that it will be more accurate. The main idea of smoothing is to assign a nonzero probability to unseen words by using information from the document collection. Based on this idea, there are three representative methods (Zhai & Lafferty, 2004):

JM: This method involves a linear interpolation of the maximum likelihood model with the collection model, using a coefficient λ to control the influence of each model:

p_λ(w | d) = (1 − λ) p_ml(w | d) + λ p(w | C)    (2)

where p(w | C) denotes the probability that the document collection C generates the observed word w, and λ is the parameter that needs to be tuned. The maximum likelihood estimate p_ml(w | d) is

p_ml(w | d) = c(w; d) / Σ_w c(w; d)    (3)

where c(w; d) is the count of the word w in the document d.

Dir: A language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution; the model is given by

p_μ(w | d) = [c(w; d) + μ p(w | C)] / [Σ_w c(w; d) + μ]    (4)

where the parameter μ needs to be tuned.

Dis: The idea of the Dis method is to reduce the probability of seen words by subtracting a discount constant δ from their counts. The model is given by

p_σ(w | d) = max(c(w; d) − δ, 0) / Σ_w c(w; d) + σ p(w | C)    (5)

where σ = δ |d|_u / |d|; |d|_u is the number of unique terms in document d, and |d| is the total count of words in the document.

For every word in the query, we compute the value of p(w | d) for the document; after accumulating their logarithms, we obtain the final relevance score. The smoothing methods, which bring in global collection information, are helpful in dealing with unseen query words, especially for content fields with few words. These smoothing methods are the basis of our research.
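To make the three smoothing formulas concrete, the following is a minimal Python sketch (not the authors' implementation) of query-likelihood scoring with JM, Dirichlet, and absolute discounting smoothing. The term-count dictionaries and parameter values in the usage example are illustrative assumptions.

import math
from collections import Counter

def collection_prob(word, collection_counts, collection_len):
    """p(w | C): background probability of a word in the collection."""
    return collection_counts.get(word, 0) / collection_len

def smoothed_prob(word, doc_counts, doc_len, p_wc, method, param):
    """p(w | d) under one of the three smoothing methods."""
    c_wd = doc_counts.get(word, 0)
    if method == "jm":                       # Eq. (2), param = lambda
        p_ml = c_wd / doc_len
        return (1 - param) * p_ml + param * p_wc
    if method == "dir":                      # Eq. (4), param = mu
        return (c_wd + param * p_wc) / (doc_len + param)
    if method == "dis":                      # Eq. (5), param = delta
        unique = sum(1 for w, c in doc_counts.items() if c > 0)
        sigma = param * unique / doc_len
        return max(c_wd - param, 0) / doc_len + sigma * p_wc
    raise ValueError(method)

def lm_score(query_terms, doc_counts, collection_counts, method, param):
    """Log query likelihood: sum of log p(q_i | d) over query terms."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())
    score = 0.0
    for w in query_terms:
        p_wc = collection_prob(w, collection_counts, coll_len)
        p_wd = smoothed_prob(w, doc_counts, doc_len, p_wc, method, param)
        if p_wd > 0:                         # skip terms unseen in the whole collection
            score += math.log(p_wd)
    return score

# Toy usage with made-up counts
doc = Counter("language model smoothing for retrieval model".split())
coll = Counter("language model smoothing retrieval ranking data model query".split())
print(lm_score(["language", "ranking"], doc, coll, "dir", param=50))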
The tie between the smoothing methods and unseen query words is the smoothing parameter, and for each smoothing method there is a parameter to be tuned. One of the most conventional approaches is to search for the optimum value over a fixed parameter space, for example, by grid search. The parameter λ of the JM smoothing method ranges from 0 to 1. The optimal λ differs with the type of query (Zhai & Lafferty, 2004): short queries tend to have smaller optimal parameters, whereas long queries have larger optimal parameters. The same is true of the other two smoothing methods. However, the length of a query is not predictable when users submit it to a real search engine. Therefore, the common practice is to choose one smoothing parameter empirically.

In summary, the parameter used for all queries usually comes from an empirical conclusion, and it may not work equally well for every query. Intuitively, the other optional parameters, which are not the optimum value for the whole collection, can also contribute to retrieval performance. For one smoothing method, each optional parameter determines a relevance score for a document; the emphasis of this article is on merging these different scores into a single estimate of query-document relevance. It is easy to assign weights to the scores from different parameters according to empirical performance, but it is difficult to calculate such weights accurately by empirical means. One feasible approach is to treat these scores as relevance features of a document with respect to a query, and to apply machine learning approaches to learn a model that combines them.

ListNet Algorithm

Learning to rank focuses on finding a ranking model that takes the effectiveness of all the ranking features into account in order to improve retrieval results. The ranking features may be IR methods or anything else that can be related to the relevance of the documents with respect to the query. ListNet is a listwise approach that is quite effective in ranking. It is a feature-based ranking algorithm that learns a ranking model by minimizing a probabilistic listwise loss function, namely, the cross entropy loss (Cao et al., 2007), defined over permutation probabilities in the Luce model; that is, it uses document lists as instances in the learning procedure. The learning approach is based on a neural network with a linear ranking function, and the learned ranking model is used to predict the relevance score of a document. In this article, we use ListNet to learn a ranking model because of its superior performance on LETOR3.0. Algorithm 1 shows the learning algorithm of ListNet.

Algorithm 1. ListNet algorithm
Input: training set {(X, Y) | (x1, y1), (x2, y2), . . . , (xm, ym)}
Parameters: number of iterations T, learning rate η
Initialize parameter ω
For t = 1 to T do
  Input X of the data set to the neural network and compute the score list z(f_ω) with the current ω
  Compute gradient Δω = ∂L(y, z(f_ω)) / ∂ω
  Update ω = ω − η · Δω
End For
Output: neural network model ω
In Algorithm 1, X is the input space with elements as sets of objects to be ranked, Y is the output space with elements
as permutations of objects, and (xn, yn) is a combination of the documents and their relevance judgments with respect to the nth query, denoted as qn. The ranking function fω is based on the neural network model. Given a feature vector xd for qn, fω(xd) assigns a score to it. Given query qn, the ranking function fω can generate a score list zn(fω) = (fω(x1), fω(x2), . . . , fω(xk)). The loss function is calculated as
L(y_n, z_n(f_ω)) = − Σ_{g∈G} P_{y_n}(g) log(P_{z_n(f_ω)}(g))    (6)
where G is the set of all possible permutations of k documents, and P_s(g) is the permutation probability of g given a score list s. The performance of a retrieval system is evaluated by IR measures such as MAP and normalized discounted cumulative gain (NDCG) (Järvelin & Kekäläinen, 2002). Learning to rank approaches define a ranking loss function, such as the cross entropy loss, according to the relevance judgments; by minimizing the loss, they learn a ranking model that directly improves ranking performance. The aim of tuning the smoothing parameter in the language model is also to improve ranking performance. Therefore, learning to rank can be used to learn a model for language modeling approaches. In this article, we mainly use the LMIR methods to extract features for the ranking model. We expect to improve ranking accuracy by learning a model from the features extracted by the smoothing methods for LMIR.
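As an illustration of how such a listwise loss can be minimized, below is a minimal Python/NumPy sketch of ListNet training with the commonly used top-one approximation of the permutation probabilities and a linear scoring function. It is a simplified reading of Algorithm 1 under our own assumptions (feature matrices, label scaling, learning rate), not the authors' code.

import numpy as np

def top_one_prob(scores):
    """Top-one probabilities of the Luce model (a softmax over list scores)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def listnet_train(query_lists, n_features, T=100, eta=0.01):
    """query_lists: list of (X, y) pairs, X of shape (n_docs, n_features),
    y a vector of relevance judgments for the documents of one query."""
    w = np.zeros(n_features)
    for _ in range(T):
        for X, y in query_lists:
            z = X @ w                       # linear neural network: scores f_w(x)
            p_y = top_one_prob(y.astype(float))
            p_z = top_one_prob(z)
            # gradient of the cross entropy loss L(y, z) with respect to w
            grad = X.T @ (p_z - p_y)
            w -= eta * grad
    return w

# Toy usage: two queries with random 5-dimensional feature vectors
rng = np.random.default_rng(0)
data = [(rng.normal(size=(4, 5)), np.array([2, 1, 0, 0])),
        (rng.normal(size=(3, 5)), np.array([1, 0, 0]))]
w = listnet_train(data, n_features=5)
print("learned weights:", w)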
Ranking Features for Ranking Model

LETOR (Liu, Xu, Qin, Xiong, & Li, 2007; Qin, Liu, Xu, & Li, 2010) is the benchmark data set for learning to rank research. It develops a general framework to extract features from the OHSUMED collection and the .Gov collection. There are 45 features for OHSUMED and 64 features for .Gov in the LETOR3.0 data set. It covers many classic features, including term frequency, inverse document frequency, document length, and their combinations (Baeza-Yates & Ribeiro-Neto, 1999), as well as some classical retrieval methods: BM25 (Robertson, 1997), PageRank (Page, Brin, & Winograd, 1998), and language model approaches for IR (Zhai & Lafferty, 2004), among others. Therefore, we can use learning to rank approaches to merge different retrieval methods regarded as ranking features. The same smoothing approaches for the language model with different parameters can also be used to extract ranking features for learning. In this way we can use the ListNet algorithm to examine whether it is feasible to take all the possible smoothing parameters into account to improve the performance of language model approaches for retrieval. The main difference between LETOR2.0 and LETOR3.0 is the number of features. For example, there are 25 features for the OHSUMED collection in LETOR2.0, whereas the number of features for the OHSUMED collection in LETOR3.0 is expanded to 45. The performance of learning to rank approaches improves as features are added. Therefore, it is inspiring that
we improve the approaches by increasing the number of effective features for ranking. In this article, we study the effective forms of the language model in three ways: different parameters for smoothing, different smoothing methods for language modeling, and different content fields for extracting relevance scores by LMIR methods. They are all effective ways of extending the feature space for learning to rank.

Parameter-based ranking features. The optimal parameter for a language model method is a single value, usually selected from a collection of optional parameters, and it is often effective only on average over the training data. This assumption usually does not hold, because the training collection is far more complex than a single document or a single query; indeed, the collection usually consists of a mixture of documents and queries, and a single optimal parameter cannot be effective for all of them. Treating the language model with multiple parameters, instead of a single value, is more reasonable. The retrieval results are sensitive to which parameter is used for a given query, and it is difficult to predict which parameter is more effective; in some way, the scores calculated with different parameters all reflect the relevance of a document to a given query. The key issue is how to use the results from the different parameters of the language model. The different parameters of the same smoothing method can thus create features for ranking, and we can develop an effective method to merge the results from different parameters to enhance ranking performance. Learning to rank is a feasible method for combining the parameter-based ranking features.

Method-based ranking features. The feature space of LETOR is constructed from retrieval methods, which inspires us to combine different retrieval methods for learning to rank. These diverse methods are the base of the ranking model; they rest on substantial theory or reasonable heuristic rules, which can provide distinguishing information for the model to rank documents by relevance. Different smoothing methods for the language model can also be seen as an important way to study the performance of LMIR methods, because smoothing is a critical step in estimating a language model for each document.

Field-based ranking features. Field information in a document can also improve the performance of the ranking model, helping it rank the documents with respect to a query more accurately. BM25F (Robertson, Zaragoza, & Taylor, 2004) is an extension of BM25 (Robertson & Walker, 1994) that prescribes how to combine more than one field of a document; it has achieved good results in retrieval tracks at the Text REtrieval Conference (TREC). It is obvious that the information in different content fields has different importance; the weight of a word occurring in a title is far different from that of the same word occurring in the body text. As the basis of feature extraction, the fields therefore also play an important role in the retrieval methods, and field-based features are another important way to enrich the feature space of the language model.

In this article, for the LMIR methods, assuming that there are M optional parameters for a smoothing method, N smoothing methods, and K content fields, the number of relevance scores calculated by the LMIR methods is M × N × K. The experiments focus on the effectiveness of these different forms of LMIR for learning a ranking model.
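The following short Python sketch illustrates the M × N × K feature layout described above: every (parameter, method, field) combination contributes one LMIR relevance score to the feature vector of a query-document pair. The parameter grids and field names are illustrative assumptions, and lm_score stands for a query-likelihood scorer such as the one sketched in the Smoothing Methods section.

from itertools import product

# Illustrative grids; the actual values would follow the experimental setup.
PARAMS = {
    "jm":  [round(0.1 * i, 1) for i in range(1, 11)],   # lambda values in (0, 1]
    "dir": [10 * i for i in range(1, 11)],              # mu values
    "dis": [round(0.1 * i, 1) for i in range(1, 11)],   # delta values in (0, 1]
}
FIELDS = ["title", "abstract", "whole_document"]        # K = 3 for OHSUMED

def lmir_features(query_terms, doc_fields, collection_fields, lm_score):
    """Build the M x N x K feature vector for one query-document pair.

    doc_fields / collection_fields map a field name to its term counts;
    lm_score(query, doc_counts, coll_counts, method, param) returns a score.
    """
    features = []
    for field, method in product(FIELDS, PARAMS):
        for param in PARAMS[method]:
            features.append(
                lm_score(query_terms,
                         doc_fields[field],
                         collection_fields[field],
                         method, param))
    return features   # length = K * N * M = 3 * 3 * 10 = 90 for OHSUMED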
Experiments

Data Set

We evaluate our methods on the LETOR3.0 data set (Qin et al., 2010) released by Microsoft Research Asia. LETOR3.0 is based on many query-document features and the corresponding relevance judgments. The features include TF-IDF, BM25, HITS, PageRank, and LMIR (Zhai & Lafferty, 2004). The data set also provides the meta information of documents and queries for mining further ranking features, which is important to our research. The meta information is the basis for calculating feature values under the different smoothing parameters. The meta file contains the document number, the field number and description, and the (unique) term number in each field. The average document length of each field can be determined from this information. For each individual query, there is a separate XML file containing its raw information as well as its associated documents. Meta information can be used to derive some strong features for the ranking model.

This data set contains two collections: the OHSUMED and .Gov collections. The OHSUMED collection is derived from a medicine retrieval task, whereas the .Gov collection is from TREC tasks. The subsets that we use for our research are TD2003 and TD2004 from .Gov, and the OHSUMED collection. For the .Gov collection, there are five content fields: title, body, anchor, URL, and the whole document as a field. OHSUMED has three content fields: title, abstract, and the whole document. We extract features from these different fields by language modeling approaches.

The OHSUMED collection is derived from the MEDLINE data set, which is popular in the IR community. There are 106 queries in this collection, and the total number of query-document pairs is 16,140. Each query-document pair is represented by a 45-dimensional feature vector. The documents are manually labeled with absolute relevance judgments in the collection. There are three relevance levels: 2 (definitely relevant), 1 (possibly relevant), and 0 (irrelevant).

The TD2003 collection is extracted from the topic distillation task of TREC2003. The goal of the topic distillation task is to find good websites about the query topic. There are 50 queries in this collection, and the total number of query-document pairs is 49,171. Each query-document pair is represented by a 64-dimensional feature vector. There are two levels of relevance: 1 (relevant) and 0 (irrelevant). TD2004
is very similar to TD2003, which is extracted from the data set of the topic distillation task of TREC2004. It contains 75 queries and 74,170 documents with 64 features.

TABLE 1. Features for experiments to improve the performance of the language model.

Method                           No. of parameters   No. of fields   No. of methods   No. of features
Stage one: multiple parameters   10                  1               1                10
Stage two: multiple fields       10                  3/5             1                30/50
Stage three: multiple methods    10                  3/5             3                90/150
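As an aside on the data layout, LETOR releases distribute each query-document pair as one line in an SVMlight-style text format (relevance label, query id, indexed feature values, and a comment with the document id). The small Python parser below is illustrative; the field positions follow the public LETOR convention rather than anything stated in this article.

def parse_letor_line(line):
    """Parse one LETOR-style line, e.g.
    '2 qid:10032 1:0.056 2:0.64 ... #docid = E12-345'."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for tok in tokens[2:]:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return label, qid, features, comment.strip()

label, qid, feats, doc = parse_letor_line(
    "2 qid:10032 1:0.056 2:0.64 3:0.18 #docid = E12-345")
print(label, qid, feats, doc)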
TABLE 2. Description for proposed methods to improve the performance of the language model.

Method     Description
OH_JM      JM-based ranking method on OHSUMED
OH_R       RankBoost approach on OHSUMED
OH_L       ListNet approach on OHSUMED
SP         Single-parameter estimation provided by LETOR
MP         ListNet model from the multiple smoothing parameters
SPF        The most effective feature from multiple fields using one smoothing method
MPF        ListNet using one smoothing method with multiple parameters in multiple fields
Multiple   ListNet using three smoothing methods with multiple parameters in multiple fields
Experimental Setup

In our experimental studies, we present a three-stage strategy to examine whether smoothing approaches with multiple parameters, used for learning a ranking model, can improve the ranking performance of language modeling. At the first stage, we use only one smoothing method with 10 optional parameters to extract features from the whole-document field. The aim is to study whether the ranking model learned from multiple parameters can achieve better performance than the single parameter used for smoothing. At the second stage, we introduce other fields in the data set to extract more features for training; there are three fields in OHSUMED and five fields in .Gov. At this stage, we still use only one smoothing method for language modeling. At the third stage, we apply multiple smoothing approaches in the multiple fields to extract features, as LETOR does, in the expectation of further improvement. These three stages are used to examine whether learning to rank is effective in improving the ranking performance of the language model. The features used at the three stages are listed in Table 1. There are three content fields in the OHSUMED collection and five fields in the .Gov collection, so the number of features for OHSUMED differs from that for the .Gov collection according to the Field-Based Ranking Features subsection, as shown in Table 1. Table 2 lists the main methods used to improve the ranking performance of the language model; for the smoothing method, we take the JM method as an example, and for the data set, we take the OHSUMED collection as an example in describing the proposed methods. Finally, we add the features extracted at the earlier stages to the original feature space of LETOR, and our goal is to examine whether the features extracted in this way are effective in improving the ranking results. Direct optimization of an underlying retrieval metric ties our work to learning to rank approaches for IR.

Experimental Results
To evaluate the performance of the proposed approach, we adopt MAP and NDCG as evaluation measures. The average precision (AP) of a query is the average of the precision scores obtained after each relevant document is retrieved. AP takes the positions of relevant documents in the ranking list into account when scoring the list for one query. MAP is then the mean of AP over a set of queries, and is calculated as

MAP = (1/n_Q) Σ_{q∈Q} (1/n_q^r) Σ_{n=1}^{N} (P@n · rel(n))    (7)

where n_Q is the number of test queries, n_q^r is the number of relevant documents with respect to the query q, n is the rank position, N is the number of retrieved documents, and rel(·) is a binary function of the relevance of a given rank.

The NDCG score at position n is calculated as

NDCG(n) = (1/n_Q) Σ_{q∈Q} Z_n Σ_{j=1}^{n} (2^{r(j)} − 1) / log(1 + j)    (8)

where j is the position in the document list and r(j) is the relevance label of the jth document in the list. Z_n is a normalization factor chosen so that a perfect list gets an NDCG score of 1, and NDCG(n) is averaged over the set of queries.

There are three data subsets chosen from LETOR3.0: OHSUMED, TD2003, and TD2004. The results are obtained by five-fold cross-validation.
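For reference, here is a small Python sketch of the two evaluation measures defined in Equations (7) and (8). It is an illustrative implementation under the usual conventions (binary relevance for AP, a log(1 + j) discount for NDCG), not the official LETOR evaluation script.

import math

def average_precision(relevances):
    """AP for one ranked list; relevances[i] is 1 if the i-th result is relevant."""
    hits, precision_sum = 0, 0.0
    for n, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / n        # P@n counted only at relevant ranks
    return precision_sum / hits if hits else 0.0

def ndcg_at(relevances, n):
    """NDCG@n for one ranked list of graded relevance labels."""
    def dcg(labels):
        return sum((2 ** r - 1) / math.log(1 + j) for j, r in enumerate(labels, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / ideal if ideal > 0 else 0.0

# MAP and mean NDCG@10 over a toy set of two ranked lists
runs = [[1, 0, 1, 0], [0, 1, 1]]
print(sum(average_precision(r) for r in runs) / len(runs))
print(sum(ndcg_at(r, 10) for r in runs) / len(runs))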
Effectiveness of multiple parameters. We first seek to determine which single parameter is the most effective for ranking relevant documents using the smoothing approaches for language modeling. The parameters of the smoothing approaches, with the whole document chosen as a single field, are tuned to optimize MAP on our training set using the parameter estimation provided by LETOR; this also serves as a baseline approach (SP for short). Finally, we compare the single optimum parameter on the test set with the multiple-parameter ranking model learned from the training set. Table 3 lists the parameters for the smoothing approaches.

TABLE 3. Parameters for smoothing approaches.

Collection   Smoothing   LETOR
OHSUMED      OH_JM       0.5
OHSUMED      OH_DIR      50
OHSUMED      OH_DIS      0.5
TD2003       TD3_JM      0.7
TD2003       TD3_DIR     2000
TD2003       TD3_DIS     0.1
TD2004       TD4_JM      0.7
TD2004       TD4_DIR     2000
TD2004       TD4_DIS     0.1

For each smoothing method, we set 10 optional parameters when searching for the optimum parameter. For the JM and Dis smoothing methods the interval ranges from 0 to 1, and we select the optional parameters in this interval, such as 0.1, 0.2, 0.3, . . . , 1. For the Dir smoothing approach, we choose the optional parameters around the LETOR settings; for example, the Dir parameter for OHSUMED is set to 50, so we select the optional parameters 10, 20, 30, . . . , 100. After selecting the optional parameters, we obtain 10 ranking features for each query-document pair for learning a ranking model. Our optimum parameters are different from those of LETOR, which may be caused by optimizing different IR evaluation measures; in this experiment, we use MAP as the evaluation measure. We compare the results of the ListNet model trained on the multiple smoothing parameters (MP) with the baseline, the LETOR parameter settings. From this experiment, we obtain the performance of the baseline and the predictions of the ListNet model, and we can examine whether it is effective to use a ranking model trained on the ranking features from multiple smoothing parameters.

Table 4 lists the MAP values of ListNet compared with the baseline approaches on the three collections.

TABLE 4. Smoothing approaches in the whole document with 10 parameters.

10-Feature   SP       MP       MAP gain (%)
OH_JM        0.4336   0.4340   0.09
OH_Dir       0.4302   0.4330   0.65
OH_Dis       0.4360   0.4379   0.44
TD3_JM       0.1176   0.1300   10.54
TD3_Dir      0.0668   0.1047   56.74
TD3_Dis      0.1230   0.1305   6.10
TD4_JM       0.1346   0.1361   1.11
TD4_Dir      0.0853   0.1283   50.41
TD4_Dis      0.1415   0.1429   0.99
Note. Statistically significant gains (p < .05) are highlighted in bold.

We observe that the ListNet approach with multiple parameters outperforms the single optimum parameter for each smoothing approach. This shows that the ranking model with multiple parameters for smoothing achieves better performance than the single optimum parameter in most cases. Our method achieves a significant improvement over the baseline method on the TD2003 collection, and the improvement is especially marked for the Dir method. The results reveal that exploiting multiple parameters can improve ranking accuracy. The baseline also indicates the performance of a single smoothing method; it may represent the general level of a single smoothing method across different parameters. The higher the baseline is, the better the performance a single smoothing method can achieve. Table 4 shows that a ranking model with better features will achieve better performance, and that
the quality of the original ranking methods also influences the performance of the ranking model. Although ranking performance can be improved by exploiting multiple parameters, some results, such as JM on OHSUMED, are still quite marginal. The possible reason is that the features for training are too limited, so in the next step we increase the size of the feature space, which may yield a better result. Meanwhile, from Table 4 we can also conclude that an increased number of features is helpful in improving ranking accuracy; the key issue is how to select more effective and informative features for the ranking model.

Effectiveness of introducing multiple fields. There may be many fields in a web document, such as title, URL, body, abstract, and anchor text, among others. In this article, we choose language modeling as the retrieval method for obtaining ranking scores, so we use content text fields as the fields on which to apply the language model approaches. The content fields of a document used in this article include title text, URL text, body text, abstract text, and anchor text. The title field contains the title of the document. The URL field contains the text of the page's web address. The body field consists of the HTML content of the page. The abstract contains the main idea of a document. Anchor text is the text associated with a link in a source document, which is assumed to describe the target document. All the information from these fields may be related to relevance, so we extract ranking features based on these fields using the language model method.

In our previous work (Lin, Lin, Ye, & Su, 2009), we examined the effectiveness of introducing multiple fields into a single smoothing method of the language model by RankBoost. Here we compare the performance of RankBoost and ListNet, evaluated by MAP; Table 5 lists the results. In Table 5, OH_R is the RankBoost approach on the OHSUMED collection and OH_L is the ListNet approach on OHSUMED; TD3_R and TD4_R are RankBoost approaches on TD2003 and TD2004, and TD3_L and TD4_L are ListNet approaches on TD2003 and TD2004.
TABLE 5. Introducing multiple fields to one smoothing approach in LETOR.

Method   JM       Dir      Dis
OH_R     0.4451   0.4430   0.4444
OH_L     0.4488   0.4513   0.4501
TD3_R    0.1292   0.1348   0.1624
TD3_L    0.1418   0.1653   0.1706
TD4_R    0.1337   0.1446   0.1420
TD4_L    0.1398   0.1522   0.15778

TABLE 6. Introducing multiple fields to one smoothing approach in the OHSUMED collection based on multiple parameters.

30-Feature   SPF      MP       MPF
OH_JM        0.4395   0.4340   0.4521*†
OH_Dir       0.4439   0.4330   0.4540*†
OH_Dis       0.4436   0.4379   0.4520†
Note. *Significant improvement to SPF of our approach MPF (p < .05). †Significant improvement to MP of our approach MPF (p < .05).

TABLE 7. Introducing multiple fields to one smoothing approach in the .Gov collection based on multiple parameters.

50-Feature   SPF      MP       MPF
TD3_JM       0.1217   0.1300   0.1547*†
TD3_Dir      0.1783   0.1047   0.1844†
TD3_Dis      0.1574   0.1305   0.1778*†
TD4_JM       0.1431   0.1361   0.1450†
TD4_Dir      0.1666   0.1283   0.1817*†
TD4_Dis      0.1346   0.1429   0.1794*†
Note. *Significant improvement to SPF of our approach MPF (p < .05). †Significant improvement to MP of our approach MPF (p < .05).
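The article reports significance at p < .05 but does not state which statistical test was used. As a purely illustrative sketch (our own assumption, not the authors' procedure), a paired t-test over per-query average precision scores could be run as follows.

from scipy import stats

def compare_runs(ap_baseline, ap_system, alpha=0.05):
    """Paired t-test on per-query AP scores of two runs over the same queries."""
    t_stat, p_value = stats.ttest_rel(ap_system, ap_baseline)
    significant = p_value < alpha and sum(ap_system) > sum(ap_baseline)
    return t_stat, p_value, significant

# Toy per-query AP values for a baseline run (e.g., SPF) and our run (e.g., MPF)
baseline = [0.42, 0.38, 0.51, 0.33, 0.47]
system = [0.45, 0.41, 0.55, 0.36, 0.50]
print(compare_runs(baseline, system))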
The results show that the performance of ListNet is better than that of RankBoost for merging multiple-field ranking methods based on one smoothing approach. This is because RankBoost is a pairwise approach whose learning objective is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents as ListNet does. Therefore, it is more appropriate to apply ListNet to learn a ranking model for the smoothing methods. The previous work (Lin et al., 2009) did not take into account the effect of multiple parameters on multiple-field ranking performance; the following experiments were conducted to study the effects of multiple parameters.

There are three fields for each document in the OHSUMED collection: title, abstract, and the whole document. For each field, we can get 10 ranking features by using one smoothing approach, so there are 30 features in total for learning. We introduce the multiple fields to test whether they are effective in improving the ranking performance for one smoothing approach. In the .Gov collection, there are five fields in the documents: title, anchor, URL, body, and the whole document. Therefore, the number of features for the TD2003 and TD2004 collections is 50 after introducing the multiple fields; Table 7 shows the results of the experiment on the .Gov collection. In Tables 6 and 7, the baseline SPF is the best performance obtained by the most effective feature extracted from the multiple fields using one smoothing method in LETOR3.0. MP is another baseline, used for examining whether the features from multiple fields are effective in improving the ranking model. Finally, we also apply ListNet to learn a ranking model for multiple smoothing parameters in the multiple fields (MPF), and MAP is the evaluation measure for ranking accuracy.

In Table 6, we present our results of introducing the multiple fields to the single smoothing method. MPF significantly
outperforms the baselines. On the OHSUMED collection, the improvement in ranking performance of MPF compared with MP is very significant, which shows that extracting features from multiple fields is better than extracting them from a single field. Compared with SPF, the MPF approach still achieves better performance. Although SPF produces the best result among the single fields of a document, it loses the information from the other fields; the scores from the individual fields can each determine the relevance of the document independently, but because of the limited information a single field contains, a single ranking list based on one field cannot represent the actual relevance perfectly. These experimental results indicate that exploiting multiple fields can also lead to better ranking performance.

In Table 7, introducing multiple fields to the single method also achieves significant improvement for the .Gov collection. First, we can observe that the SPF baseline is better than the MP method; this is because the features from multiple fields are more effective in estimating the relevance of documents than those from only a single field, even the field that contains all the content of the documents. Second, we also conduct a significance test between the MPF approach and the two baseline approaches. It reveals that multiple-field information is also effective in improving the ranking model on TD2003 and TD2004, which makes the MPF approach outperform the baseline approaches. In summary, introducing multiple fields into our method shows that learning to rank can be an effective way to construct a ranking model for the language model method on multiple fields.

Effectiveness of multiple smoothing methods. For a document described by a single field or multiple fields, ListNet exhibits reasonable accuracy using the features from a single smoothing method. However, describing the whole document as a single field easily loses the weighting information of the different fields. The scores from the different fields can each determine the relevance of the document independently, but because of the limited information a single field contains, a single ranking list based on one field cannot represent the actual relevance perfectly, so it is necessary to find a model based on these multiple-field ranking features to determine the final ranking.
The superior accuracy may be the result of both multiple optional parameters and multiple fields. In this section, we also seek to determine a more effective combination of the multiple smoothing methods used to extract the features from multiple fields. Tables 8 through 10 list the results of using multiple smoothing methods to extract more features from the various fields in LETOR3.0. The ranking accuracies are evaluated by both MAP and NDCG. In Tables 8 to 10, MAX_SPF denotes the best performance (evaluated by MAP) achieved by a single parameter in a single field, MAX_MPF is the best performance achieved by ranking with multiple parameters in multiple fields, and Multiple stands for the performance achieved by ListNet using three smoothing methods to extract the features. There are three fields in the OHSUMED collection; for each field we extract 30 features by using three smoothing methods, so there are 90 features in total for a document in this collection. In the same way, we extract 150 features for documents in the .Gov collection, which has five fields per document.

TABLE 8. Multiple smoothing approaches in the OHSUMED collection.

OHSUMED    MAP       N@1       N@3       N@10
MAX_SPF    0.4439    0.4856    0.4619    0.4457
MAX_MPF    0.4540    0.5078    0.4780    0.4374
Multiple   0.4575*   0.5539*†  0.5096*†  0.4591†
Note. *Significant improvement to MAX_SPF of our approach Multiple (p < .05). †Significant improvement to MAX_MPF of our approach Multiple (p < .05).

TABLE 9. Multiple smoothing approaches in the TD2003 collection.

TD2003     MAP       N@1       N@3       N@10
MAX_SPF    0.1783    0.2444    0.2098    0.2278
MAX_MPF    0.1844    0.2622    0.2296    0.2592
Multiple   0.2163*†  0.2822*†  0.2983*†  0.2994*†
Note. *Significant improvement to MAX_SPF of our approach Multiple (p < .05). †Significant improvement to MAX_MPF of our approach Multiple (p < .05).

TABLE 10. Multiple smoothing approaches in the TD2004 collection.

TD2004     MAP       N@1       N@3       N@10
MAX_SPF    0.1666    0.2933    0.2717    0.2434
MAX_MPF    0.1817    0.3000    0.2899    0.2545
Multiple   0.2001*†  0.3200*†  0.3325*†  0.2878*†
Note. *Significant improvement to MAX_SPF of our approach Multiple (p < .05). †Significant improvement to MAX_MPF of our approach Multiple (p < .05).

We find that using multiple smoothing methods in multiple fields is superior to using a single method in multiple fields. It also demonstrates that the ranking model, trained on the features of language model approaches with multiple parameters in multiple fields, can achieve better performance than choosing only one optimal parameter in one field or in multiple fields. The learning to rank approach can improve the performance of language model approaches significantly. Moreover, in this way it is not necessary for the language model approaches to seek the optimal parameters or fields; we can obtain good ranking results from the optional parameters and the fields of the documents alone, without further optimization.

Effectiveness of incorporating features of LETOR. All the smoothing methods for LMIR are the same type of approach for ranking the documents with respect to the query. Unlike the feature creation of LETOR, which usually tends to use different types of ranking features to learn a ranking model, we focus on features that come from the same type of retrieval method, and the experiments show that they can also improve the performance of the ranking model. Here we examine whether it is feasible to combine the smoothing features and the features of LETOR to further improve the ListNet ranking model. Tables 11 to 13 present the performance of combining the two kinds of features; the ranking accuracies are evaluated by both MAP and NDCG. Multiple represents the performance using smoothing features only, LETOR represents the performance using the features of LETOR, and Combination gives the results of combining the two kinds of features. The ranking model is learned by ListNet.

TABLE 11. Multiple smoothing approaches on the OHSUMED collection.

OHSUMED       MAP       N@1       N@3       N@10
Multiple      0.4575    0.5539    0.5096    0.4591
LETOR         0.4457    0.5326    0.4732    0.4410
Combination   0.4580    0.5827*†  0.5295*†  0.4680*†
Note. *Significant improvement to Multiple of our approach Combination (p < .05). †Significant improvement to LETOR of our approach Combination (p < .05).

TABLE 12. Multiple smoothing approaches on the TD2003 collection.

TD2003        MAP       N@1       N@3       N@10
Multiple      0.2163    0.2822    0.2983    0.2994
LETOR         0.2753    0.4000    0.3365    0.3484
Combination   0.2915*†  0.4267*†  0.3651*†  0.3654*†
Note. *Significant improvement to Multiple of our approach Combination (p < .05). †Significant improvement to LETOR of our approach Combination (p < .05).
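To make the feature-space expansion concrete, here is a small hypothetical Python sketch of how the combined representation could be assembled: the LETOR feature vector of a query-document pair is concatenated with the LMIR smoothing features, and duplicated columns are dropped (the removal of duplicated features is described in the next paragraph). The column names and the duplicate-detection rule are illustrative assumptions, not the authors' exact procedure.

def combine_features(letor_vec, letor_names, smoothing_vec, smoothing_names):
    """Concatenate LETOR and smoothing features, dropping duplicated columns.

    letor_vec / smoothing_vec: lists of feature values for one query-document pair.
    letor_names / smoothing_names: parallel lists of feature identifiers.
    """
    combined_vals = list(letor_vec)
    combined_names = list(letor_names)
    for name, value in zip(smoothing_names, smoothing_vec):
        if name not in combined_names:      # skip features LETOR already contains
            combined_names.append(name)
            combined_vals.append(value)
    return combined_names, combined_vals

# Toy usage: 3 LETOR features plus 4 smoothing features, one of them duplicated
names, vals = combine_features(
    [0.3, 1.2, 0.8], ["bm25", "pagerank", "lmir_dir_whole_mu50"],
    [0.8, 0.5, 0.6, 0.4],
    ["lmir_dir_whole_mu50", "lmir_jm_title_l0.1", "lmir_dir_title_mu50", "lmir_dis_body_d0.2"])
print(names, vals)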
For LETOR, there are 45 and 64 features in the OHSUMED and .Gov collections, respectively, and there are 90 and 150 smoothing features in the OHSUMED and .Gov collections, respectively. After combining the two groups of features and removing some duplicated features, there are 126 and 199 features for the documents in these two collections, respectively. The experimental results show that Combination improves on both Multiple and LETOR. However, for the OHSUMED collection, Multiple is better than LETOR and is close to Combination. This may be explained as follows: the features for OHSUMED are all based on words, such as TF, IDF, BM25, and the language model approaches, and they depend on the content of the document itself. Because this feature set is smaller than the smoothing features, which are also based on word information, the performance of LETOR is a little worse than that of Multiple; it also shows that adding similar features may not improve the ranking performance significantly once their number is already large enough. For the same reason, after adding the LETOR features to the smoothing features from the language model, the performance improves only a little. However, the 64 features in the .Gov collection include PageRank and HITS, among others, which do not depend solely on word information. Therefore, after combining these two groups of features, TD2003 and TD2004 achieve a larger improvement.

We also apply the MQ2007 and MQ2008 data sets from LETOR4.0 to test the effectiveness of the smoothing-method-based features. The LETOR4.0 data set also has five content fields, like the .Gov data set, and there are 46 features for a query-document pair. After adding the smoothing-method-based features to the LETOR4.0 feature space, there are 181 features in the feature space. The experimental results in Tables 14 and 15 show that our method is also effective in improving the performance of ListNet on the LETOR4.0 data sets.

TABLE 13. Multiple smoothing approaches on the TD2004 collection.

TD2004        MAP       N@1       N@3       N@10
Multiple      0.2001    0.3200    0.3325    0.2878
LETOR         0.2231    0.3600    0.3573    0.3175
Combination   0.2368*†  0.4933*†  0.3750*†  0.3288*
Note. *Significant improvement to Multiple of our approach Combination (p < .05). †Significant improvement to LETOR of our approach Combination (p < .05).

TABLE 14. Multiple smoothing approaches on the MQ2007 collection.

MQ2007        MAP       N@1       N@3       N@10
Multiple      0.4322    0.3810    0.3980    0.4081
LETOR         0.4652    0.4002    0.4091    0.4440
Combination   0.4801*†  0.4344*†  0.4250*†  0.4119
Note. *Significant improvement to Multiple of our approach Combination (p < .05). †Significant improvement to LETOR of our approach Combination (p < .05).

TABLE 15. Multiple smoothing approaches on the MQ2008 collection.

MQ2008        MAP       N@1       N@3       N@10
Multiple      0.4576    0.3611    0.4025    0.2568
LETOR         0.4775    0.3754    0.4324    0.2303
Combination   0.4862*†  0.4453*†  0.4368*   0.2968*†
Note. *Significant improvement to Multiple of our approach Combination (p < .05). †Significant improvement to LETOR of our approach Combination (p < .05).

The results from Tables 11 to 15 show that the features from multiple fields extracted by language model
approaches with multiple parameters can improve the ListNet ranking model effectively. The language model approaches can thus generate features for learning to rank approaches, expanding the feature space for further gains in ranking accuracy.

Conclusions

In this article, we present a flexible approach that uses different smoothing parameters to improve ranking performance. Our experiments using multiple parameters with ListNet demonstrate significant improvements on LETOR3.0 evaluated by MAP. In the experiments, we introduced multiple fields for extracting features, which improved the performance further. The experiments revealed that it is feasible to use learning to rank approaches with multiple smoothing parameters to improve the performance of language modeling approaches to IR. In the Effectiveness of Incorporating Features of LETOR subsection, we added the smoothing features to the LETOR feature space, which showed that the smoothing features can also improve existing ranking methods such as ListNet. In future work, we plan to investigate the following issues: we will introduce other ranking methods, such as AdaRank (Xu & Li, 2007), to determine whether their performance can also be significantly improved by features extracted with multiple parameters of the smoothing methods for the language model; we will also study other IR approaches, such as BM25, obtaining features for ranking methods by setting different parameters and examining whether this is effective in improving these retrieval approaches. In this article, the source for smoothing the language model is the collections themselves, but we will also bring in other smoothing sources, from users and from the web, to extract more features; and we will introduce other smoothing or parameter training methods to extract features for further ranking accuracy.

Acknowledgments

This work is partially supported by grants from the Natural Science Foundation of China (No. 60673039, 60973068, 61277370), the National High Tech Research and Development Plan of China (No. 2006AA01Z151), the Natural Science Foundation of Liaoning Province, China (No. 201202031), the State Education Ministry and the Research
Fund for the Doctoral Program of Higher Education (No.20090041110002, 20110041110034), and the Fundamental Research Funds for the Central Universities.
References Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Essex, UK: Addison Wesley. Bian, J., Liu, T.Y., Qin, T., & Zha, H.Y. (2010). Ranking with querydependent loss for web search (pp. 141–150). In Proceedings of WSDM. New York, NY: ACM Press. Burges, C.J.C., Ragno, R., & Le, Q. (2006). Learning to rank with nonsmooth cost functions (pp. 93–200). In Proceedings of NIPS. Cambridge, MA: MIT Press. Burges, C.J.C., Shaked, T., Renshaw, A.L.E., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent (pp. 89–96). In Proceedings of ICML. New York, NY: ACM Press. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach (pp. 129–136). In Proceedings of ICML. New York, NY: ACM Press. Chen, W., Liu, T.Y., & Lan, Y.Y. (2009). Ranking measures and loss functions in learning to rank (pp. 315–323). In Proceedings of NIPS. Shapleigh, ME: Curran Associates, Inc. Dang, V., & Croft, B. (2010). Query reformulation using anchor text (pp. 41–50). In Proceedings of WSDM. New York, NY: ACM Press. Duh, K., & Kirchhoff, K. (2008). Learning to rank with partially-labeled data (pp. 251–258). In Proceedings of SIGIR. New York, NY: ACM Press. Freund, Y., Iyer, R., Schapire, R.E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969. Ganjisaffar, Y., Caruana, R., & Lopes, C.V. (2001). Bagging gradientboosted trees for high precision, low variance ranking models (pp. 85–94). In Proceedings of SIGIR. New York, NY: ACM Press. Geng, X.B., Liu, T.Y., Qin, T., & Li, H. (2007). Feature selection for ranking (pp. 407–414). In Proceedings of SIGIR. New York, NY: ACM Press. Herbrich, R., Obermayer, K., & Graepel, T. (1999). Large margin rank boundaries for ordinal regression (pp. 115–132). In Advances in Large Margin Classifiers. Cambridge, MA: MIT Press. Järvelin, K., & Kekäläinen, J. (2002). Ir evaluation methods for retrieving highly relevant documents. ACM Transaction of Information Systems, 20(4):41–48. Lease, M. (2009). An improved markov random field model for supporting verbose queries (pp. 476–483). In Proceedings of SIGIR. New York, NY: ACM Press. Lease, M., Allan, J., & Croft, B. (2009). Regression rank: Learning to meet opportunity of descriptive queries (pp. 90–101). In Proceedings of ECIR. Berlin/Heidelberg, Germany: Springer Verlag. Lin, Y., Lin, H.F., Ye, Z., & Su, S. (2009). A machine learning approach to improve language model retrieval on multiple content fields. Journal of Computational Information Systems, 5(6), 1643–1651.
Liu, T.Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331. Liu, T.Y., Xu, J., Qin, T., Xiong, W., & Li, H. (2007). LETOR: Benchmark data set for research on learning to rank for information retrieval (pp. 3–10). In Proceedings of the Learning to Rank Workshop in SIGIR. New York, NY: ACM Press. Metzler, D. (2007). Using gradient descent to optimize language modeling smoothing parameters (pp. 687–688). In Proceedings of SIGIR. New York, NY: ACM Press. Nallapati, R. (2004). Discriminative models for information retrieval (pp. 64–71). In Proceedings of SIGIR. New York, NY: ACM Press. Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web, technical report, Stanford University, Stanford InfoLab, Stanford, California, USA. Ponte, J., & Croft, W.B. (1998). A language modeling approach to information retrieval (pp. 275–281). In Proceedings of SIGIR. New York, NY: ACM Press. Qin, T., Liu, T.Y., Xu, J., & Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4), 346–374. Robertson, S.E., & Walker, S. (1994). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. 345–354. In Proceedings of SIGIR. New York, NY: ACM Press. Robertson, S.E., Zaragoza, H., & Taylor, M. (2004). Simple bm25 extension to multiple weighted fields. 42–49. In Proceedings of CIKM. New York, NY: ACM Press. Robertson, S.E. (1997). Overview of the okapi projects. Journal of Documentation, 53(1), 3–7. Svore, K.M., & Burges, C.J.C. (2009). A machine learning approach for improved bm25 retrieval (pp. 1811–1814). In Proceedings of CIKM. New York, NY: ACM Press. Szummer, M., & Yilmaz, E. (2011). Semi-supervised learning to rank with preference regularization (pp. 269–278). In Proceedings of CIKM. New York, NY: ACM Press. Umemura, K., & Church, K.W. (2000). Empirical term weighting and expansion frequency (pp. 117–123). In Proceedings of SIGDAT. Stroundsburg, PA: ACL. Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval (pp. 105–114). Proceedings of SIGIR. New York, NY: ACM. Xia, F., Liu, T.Y. , & Li, H. (2009). Statistical consistency of top-k ranking (pp. 2098–2106). In Proceedings of NIPS. Shapleigh, ME: Curran Associates, Inc. Xu, J., & Li, H. (2007). Adarank: A boosting algorithm for information retrieval (pp. 391–398). In Proceedings of SIGIR. New York, NY: ACM Press. Zhai, C.X., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing, 22(2), 179– 214. Zhou, K., Xue, G.R., Zha, H.Y., & Yu, Y. (2008). Learning to rank with ties (pp. 275–282). In Proceedings of SIGIR. New York, NY: ACM Press.