Learning to Search Web Pages with Query-Level Loss Functions

Tao Qin(2), Tie-Yan Liu(1), Ming-Feng Tsai(3), Xu-Dong Zhang(2), Hang Li(1)

(1) Microsoft Research Asia, No.49 Zhichun Road, Haidian District, Beijing 100080, P.R. China
(2) Dept. of Electronic Engineering, Tsinghua University, Beijing 100084, P.R. China
(3) Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan 106, ROC

(1) {tyliu, hangli}@microsoft.com
(2) {qinshitao99@mails., zhangxd@}tsinghua.edu.cn
(3) [email protected]

Abstract
With the rapid development of information retrieval and Web search, ranking has become a new branch of supervised learning. Many existing machine learning technologies, such as Support Vector Machines, Boosting, and Neural Networks, have been applied to this new problem and achieved some success. However, since these algorithms were not originally proposed for information retrieval, their loss functions are not quite in accordance with the widely-used evaluation criteria for information retrieval, which include mean average precision (MAP), mean precision at n (P@n), and normalized discounted cumulative gain (NDCG). While these criteria are averaged on the query level, the loss functions of the aforementioned machine learning algorithms are defined on the level of documents or document pairs. Therefore, it is not clear whether minimizing these loss functions can effectively improve retrieval performance. To solve this problem, we propose using query-level loss functions when learning the ranking function. In particular, we discuss what properties a good query-level loss function should have, and propose a query-level loss function based on the cosine similarity between a ranking list and the corresponding ground truth with respect to a given query. Meanwhile, we also discuss whether the loss functions of existing ranking algorithms can be upgraded to the query level. Experiments on a TREC web track dataset, the OHSUMED collection, and a large collection of web pages from a commercial Web search engine all show that our proposed query-level loss function can improve search accuracy significantly, and that it is non-trivial to upgrade traditional document-level loss functions to the query level.

Keywords Information Retrieval, Learning to Rank, Query-level Loss Function, RankCosine

1. Introduction
While search engines have been changing people's lives, how to improve the accuracy of Web search has attracted more and more research attention. Classical approaches [1] use empirical methods to rank web pages, such as BM25-like content-based methods [20] and PageRank-like link-based methods [18]. However, there are many factors that can impact the ranking results, such as link information, query word distribution in different sections of a Web page, and many others. In this case, empirically tuning model parameters becomes very difficult. To tackle this problem, machine learning methods have been applied in recent years to combine these factors. Typical work includes RankBoost [8], RankSVM [12][16], and RankNet [3], which are based on Boosting, Support Vector Machines, and Neural Networks respectively. To some extent, automatically learning from ground truth labels to generate a sorted list of documents according to their relevance to a given query has become a new branch of supervised learning, in addition to classification, regression, and density estimation [26].
However, it should be noted that the aforementioned machine learning algorithms were not initially proposed for information retrieval (IR), and therefore their loss functions are not quite in accordance with the widely-used evaluation criteria for IR. These criteria mainly include mean average precision (MAP) [1], mean precision at n (P@n) [1], and normalized discounted cumulative gain (NDCG) [13][14]. All these criteria are averaged on the query level: no matter how different the numbers of documents retrieved for two queries are, the two queries contribute equally to the final performance evaluation of the ranking algorithm. However, the loss functions of the aforementioned machine learning algorithms are defined on the level of documents [17] or document pairs [3][8][12][16]. Therefore, it is not clear whether minimizing these loss functions can effectively improve retrieval performance, since queries with more documents or document pairs may bias the training process.
In order to solve this problem, we propose using query-level loss functions when learning the ranking function for IR. In particular, we first discuss what properties a good query-level loss function should have, and then propose a query-level loss function as an example, which is based on the cosine similarity between a ranking list and the corresponding ground truth with respect to a given query. With this new loss function, we derive an algorithm based on the generalized additive model to learn the ranking function. Moreover, we also discuss whether the document-level and pair-wise loss functions of the aforementioned ranking algorithms can be reasonably upgraded to the query level, and how they may perform after that. We used two public datasets and a commercial dataset to evaluate our methods. Experimental results show that the proposed query-level loss is very effective for information retrieval, while it is in general non-trivial to upgrade existing document-level loss functions to the query level.

In summary, the contributions of our work are as follows:
1) We point out that learning to rank for IR should consider query-level loss functions;
2) We enumerate fundamental properties that a good query-level loss function should have;
3) We provide an example of a query-level loss function and demonstrate how to derive an efficient algorithm to minimize it;
4) We discuss how to upgrade the loss functions of previous algorithms (e.g. RankSVM, RankBoost, and RankNet) to the query level, and investigate their performance empirically.

The remaining part of this paper is organized as follows. In Section 2, we briefly review related work on machine learning algorithms for ranking. In Section 3, we justify the necessity of using query-level loss functions for IR and discuss some properties that a good query-level loss function should have. We then give an example, the cosine loss, of a query-level loss function and derive an efficient algorithm to minimize it in Section 4. Experiment settings are described in Section 5, and results are reported in Sections 6 to 8. In Section 9, we discuss how to upgrade the loss functions of some previous algorithms to the query level, as well as limitations of the cosine loss. Conclusions are given in the last section.

2. Related works

As mentioned in the introduction, many machine learning technologies [3][5][7][8][12][16][17] have been applied to the problem of ranking for information retrieval in recent years. Some early works simply tackled this problem from a binary classification point of view [17]: they directly judge whether a document is relevant to the query or not, and rank relevant documents higher than irrelevant ones. However, in real-world scenarios, the relevance of a document to a query can hardly be classified simply as "relevant" or "irrelevant". Actually, some documents may be highly relevant, some are partially relevant, and others are definitely irrelevant. In a word, binary classification cannot capture the multi-grade nature of relevance judgments for IR. To solve this problem, many other works were proposed, such as RankSVM, RankBoost, and RankNet.

Thorsten Joachims [16] took an SVM approach to learning the ranking function and proposed the RankSVM algorithm. The basic idea of RankSVM is the same as that of a general SVM: minimizing the sum of the structure loss (which can be expressed by the L-2 norm of the hyperplane parameter \omega) and the empirical loss. The difference is that the constraints in RankSVM are partial-order preferences over document pairs. The optimization formulation of RankSVM is as follows (for details, please refer to [16]):

\min_{\omega, \xi} \; V(\omega, \xi) = \tfrac{1}{2}\,\omega^T \omega + C \sum_{i,j,q} \xi_{i,j,q}
\text{s.t.} \;\; \forall (d_i, d_j) \in r_1^*: \; \omega^T \phi(q_1, d_i) \ge \omega^T \phi(q_1, d_j) + 1 - \xi_{i,j,1}
\;\;\;\;\;\;\; \cdots
\;\;\;\;\;\;\; \forall (d_i, d_j) \in r_n^*: \; \omega^T \phi(q_n, d_i) \ge \omega^T \phi(q_n, d_j) + 1 - \xi_{i,j,n}
\;\;\;\;\;\;\; \xi_{i,j,q} \ge 0 \;\; \forall i, j, q    (1)

where C is a parameter that controls the trade-off between the structure loss and the empirical loss, \phi(q, d_i) is the feature vector calculated from document d_i with respect to query q, and the constraint \omega^T \phi(q, d_i) > \omega^T \phi(q, d_j) means that the user prefers document d_i to d_j for query q. From the theoretical aspect, RankSVM is well founded in the framework of structural risk minimization, and various experiments have verified its effectiveness. Note that such a pair-wise strategy can handle not only binary relevance judgments but also multi-grade relevance judgments. There is some follow-up work [4] on RankSVM.

Freund et al. [8] adopted the Boosting approach and proposed the RankBoost algorithm for ranking. Similar to RankSVM, RankBoost operates on document pairs. Suppose d_i \succ_q d_j means that document d_i should be ranked higher than d_j for query q, and consider a model f, where f(\phi(q, d_i)) > f(\phi(q, d_j)) means that the model asserts d_i \succ_q d_j. Then the loss for a document pair in RankBoost is defined as

L(d_i \succ_q d_j) = e^{f(\phi(q, d_j)) - f(\phi(q, d_i))}    (2)

Consequently, the loss function of RankBoost is the sum of the losses over all document pairs:

L = \sum_q \sum_{d_i \succ_q d_j} L(d_i \succ_q d_j)    (3)

One advantage of RankBoost is that it is easy to implement and is amenable to parallel computation. Its ranking performance is also promising according to empirical studies.

Neural networks have also been applied to ranking in recent years. With a probabilistic cost function, C.J.C. Burges [3] proposed the RankNet algorithm, in which neural networks are used to learn the underlying ranking function. As with RankSVM and RankBoost, the training samples of RankNet are a set of document pairs. Denote the modeled posterior P(d_i \succ_q d_j) by P_{ij}, let \bar{P}_{ij} be the desired target value of this posterior, and let o_{q,i,j} = f(\phi(q, d_i)) - f(\phi(q, d_j)), where f and \phi are defined as in the discussions of RankSVM and RankBoost. Then the loss for a document pair in RankNet can be represented as follows:

L_{q,i,j} \equiv L(o_{q,i,j}) = -\bar{P}_{ij}\log P_{ij} - (1 - \bar{P}_{ij})\log(1 - P_{ij}) = -\bar{P}_{ij}\, o_{q,i,j} + \log(1 + e^{o_{q,i,j}})    (4)

Similarly, the total loss of RankNet is the sum of all pair-wise losses:

L = \sum_q \sum_{i,j} L_{q,i,j}    (5)

RankNet is very successful and has been integrated into a commercial Web search engine [29].

Having introduced RankSVM, RankBoost, and RankNet, we must point out that they still have some problems, although their successes have been demonstrated in the literature. One major problem is that the loss functions used in these algorithms are not in accordance with IR evaluation. We will elaborate on this in more detail in the next section.
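To make the pair-wise losses above concrete, the following is a minimal sketch (ours, not the authors' code) that computes the RankBoost loss of Eq. (2) and the RankNet loss of Eq. (4) for the document pairs of a single query, given model scores f(phi(q,d)); the score values and the target posterior value of 1 are illustrative assumptions.

import math

def rankboost_pair_loss(s_i, s_j):
    # Eq. (2): exponential loss for a pair where d_i should rank above d_j
    return math.exp(s_j - s_i)

def ranknet_pair_loss(s_i, s_j, p_bar=1.0):
    # Eq. (4): o = f(d_i) - f(d_j); cross-entropy form -P_bar*o + log(1 + e^o)
    o = s_i - s_j
    return -p_bar * o + math.log(1.0 + math.exp(o))

# Illustrative scores for one query; a pair (i, j) means d_i should rank above d_j.
scores = {"d1": 2.1, "d2": 0.4, "d3": -0.3}
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

total_rb = sum(rankboost_pair_loss(scores[i], scores[j]) for i, j in pairs)  # Eq. (3)
total_rn = sum(ranknet_pair_loss(scores[i], scores[j]) for i, j in pairs)    # Eq. (5)
print(total_rb, total_rn)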

3. Query-level loss functions for information retrieval
Before introducing query-level loss functions, we use Table 1 to summarize the loss functions of the existing algorithms described in Section 2. In the classification approach [17], the loss function is defined on the document level; it considers only a single document and does not consider the relationships of relative relevance among documents, which are essential to the ranking problem. The loss functions of RankSVM, RankBoost, and RankNet are all defined on the document-pair level. They can model partial-order preferences between documents, but they still cannot model the total-order preference over all the documents for a particular query. In this regard, these loss functions are not yet in accordance with the evaluation criteria for IR, such as MAP and NDCG, which are all averaged on the query level. This motivated us to propose the concept of the query-level loss function, which is discussed in the following subsections.

Table 1 Loss functions for web search
Loss function     Algorithms
Document-level    Binary classification [17]
Pair-wise         RankSVM [4][12][16], RankBoost [8], RankNet [3]
Query-level       Our work

3.1 Why query-level loss for IR
As mentioned above, the loss functions in RankSVM, RankBoost and RankNet are not in accordance with the widely-used evaluation criteria for IR. This may seriously affect the effectiveness of the learning algorithms. Consider a simple example. Suppose there are two queries q1 and q2 with 40 and 5 training documents respectively. In the extreme case with complete partial-order document pairs for training, we get 40*39/2 = 780 pairs for q1 and only 10 pairs for q2. If a learning algorithm can rank all pairs correctly, there is no problem. However, suppose it can only rank 780 out of the total 790 pairs correctly. With a pair-wise loss function, the loss is the same whether the algorithm correctly ranks all pairs of q2 and 770 pairs of q1 (case 1), or correctly ranks all pairs of q1 and no pair of q2 (case 2). However, for these two cases, the resultant performance under a query-level evaluation criterion is quite different: as clearly shown in Table 2, case 1 is much better than case 2. This example tells us that using a document-pair loss function is not suitable for IR. Actually, only if all queries had the same number of document pairs for training would the document-pair loss function and the query-level loss lead to the same result, which is almost impossible in real-world scenarios. Therefore it is better to define the loss function on the query level while learning ranking functions.

Table 2 Document-pair loss vs. query-level loss
                                          Case 1    Case 2
Document pairs of q1   correctly ranked   770       780
                       wrongly ranked     10        0
                       Accuracy           98.72%    100%
Document pairs of q2   correctly ranked   10        0
                       wrongly ranked     0         10
                       Accuracy           100%      0%
Overall accuracy       document-pair level  98.73%  98.73%
                       query level          99.36%  50%
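The numbers in Table 2 can be reproduced with a few lines of code; the sketch below (an illustration, not part of the original paper) contrasts the pair-level accuracy, which pools all pairs, with the query-level accuracy, which averages per-query accuracies.

def pair_level_accuracy(correct_per_query, total_per_query):
    # Pool all pairs across queries, then divide.
    return sum(correct_per_query) / sum(total_per_query)

def query_level_accuracy(correct_per_query, total_per_query):
    # Average the per-query accuracies, so each query contributes equally.
    accs = [c / t for c, t in zip(correct_per_query, total_per_query)]
    return sum(accs) / len(accs)

totals = [780, 10]   # q1 has 780 pairs, q2 has 10 pairs
case1 = [770, 10]    # all of q2 correct, 770 of q1 correct
case2 = [780, 0]     # all of q1 correct, none of q2 correct

print(pair_level_accuracy(case1, totals), query_level_accuracy(case1, totals))  # 0.9873..., 0.9936...
print(pair_level_accuracy(case2, totals), query_level_accuracy(case2, totals))  # 0.9873..., 0.5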

3.2 What is a good loss function for IR
A natural question, then, is what properties a good query-level loss function should have. We list some properties below, and discuss why they are important for learning-based information retrieval.
1) Insensitive to the number of documents and document pairs per query. This is clear if one considers the example shown in Table 2. In this regard, the loss functions of RankSVM, RankBoost and RankNet are not good query-level loss functions.
2) Emphasizes top-ranked documents in the ranked list. Considering the gap between the explosion of information and the user's capability of absorbing it, precision has become much more important than recall for information retrieval applications. As a result, many IR evaluation criteria emphasize the error rate among top-ranked documents: a ranking mistake within the top 5 documents is more serious than a ranking mistake at the 95th position of the retrieved list. A good loss function should reflect this property. The loss functions of RankSVM, RankBoost and RankNet can be more or less regarded as emphasizing top-ranked documents, since a definitely irrelevant document that is ranked high disobeys many partial-order constraints, whereas the same document ranked in the middle of the list disobeys far fewer.
3) Has a finite upper bound. A query-level loss function should not be easily biased by a few difficult queries. A natural requirement for this is that the loss for each query have an upper bound; otherwise, a query with extremely large loss will dominate the training process. The loss functions of RankBoost, RankSVM and RankNet do not have an upper bound. RankBoost uses the exponential loss, so in some cases the corresponding loss value can be extremely large; a similar conclusion holds for the hinge loss of RankSVM; and for RankNet, according to the analysis in [3], its loss function does not have an upper bound either.

4. RankCosine
After discussing the properties of a good query-level loss function for information retrieval, in this section we propose a new loss function which satisfies all the aforementioned conditions.

4.1 The cosine loss
First of all, we give some necessary notations. Suppose there are n(q) documents for query q, and the ground-truth ranking list for this query is g(q), where g(q) is an n(q)-dimension vector whose k-th element is the human-labeled rating (multi-grade relevance score) of the k-th document. The absolute value of the score for each rating is not very important, provided that the score of a more relevant document is higher than that of a less relevant document. Denote the output of a learning machine for query q by H(q). Like g(q), H(q) is also an n(q)-dimension vector, and its k-th element is the output score for the k-th document given by the learning machine.
With the above notations, we define the ranking loss for a query q as follows:

J(g(q), H(q)) = \frac{1}{2}\left(1 - \cos(g(q), H(q))\right) = \frac{1}{2}\left(1 - \frac{g(q)^T H(q)}{\|g(q)\|\,\|H(q)\|}\right)    (6)

where ||.|| is the L-2 norm of a vector. Since we use the cosine similarity in (6), this loss function is named the cosine loss. Even though we apply a very simple metric to measure the difference between a ranking list and its ground truth, it can easily be verified that the loss function defined by (6) has all three properties listed in the previous subsection. Firstly, the cosine loss is insensitive to the number of documents for a query because of the normalization term 1/(||g(q)|| ||H(q)||). Secondly, we can put more emphasis on mistakes in top-ranked documents by setting appropriate ground truth labels. Thirdly, this loss function has explicit lower and upper bounds:

0 \le J(g(q), H(q)) \le 1    (7)

Based on (6), what we eventually minimize is the macro sum over all training queries:

J(H) = \sum_{q \in Q} J(g(q), H(q))    (8)

4.2 The learning algorithm
In this subsection, we demonstrate how to optimize the cosine loss. In order to make the algorithm efficient for large-scale implementation, we use a generalized additive model to define the final ranking function:

H(q) = \sum_{t=1}^{T} \alpha_t h_t(q)    (9)

where \alpha_t is a combination coefficient and h_t(q) is a weak learner which maps an input matrix (each row of this matrix is the feature vector of a document, and d is the feature dimension of a document) to an output vector:

h_t(q): R^{n(q) \times d} \rightarrow R^{n(q)}

With the additive model and the cosine loss function, we can derive the learning process as follows. For simplicity of derivation, we assume the ground truth for each query has already been normalized:

g(q) \leftarrow \frac{g(q)}{\|g(q)\|}    (10)

Then we re-write (6) as

J(g(q), H(q)) = \frac{1}{2}\left(1 - \frac{g(q)^T H(q)}{\|H(q)\|}\right)

Similar to Boosting algorithms [9], we use a stage-wise greedy search strategy to get the parameters of the ranking function. Denote

H_k(q) = \sum_{t=1}^{k} \alpha_t h_t(q)

where h_t(q) is the weak learner at the t-th step. Then the total loss of H_k(q) over all queries is

J(H_k) = \frac{1}{2}\sum_q \left(1 - \frac{g(q)^T H_k(q)}{\|H_k(q)\|}\right)    (11)

Given H_{k-1}(q) and h_k(q), (11) can be re-written as

J(H_k) = \frac{1}{2}\sum_q \left(1 - \frac{g^T(q)\left(H_{k-1}(q) + \alpha_k h_k(q)\right)}{\sqrt{\left(H_{k-1}(q) + \alpha_k h_k(q)\right)^T \left(H_{k-1}(q) + \alpha_k h_k(q)\right)}}\right)    (12)

Setting the derivative of J(H_k) with respect to \alpha_k to zero, we obtain the optimal value of \alpha_k, after some derivation and relaxation, as

\alpha_k = \frac{\sum_q W_{1,k}^T(q)\, h_k(q)}{\sum_q W_{2,k}^T(q)\left(h_k(q)\, g^T(q) h_k(q) - g(q)\, h_k^T(q) h_k(q)\right)}    (13)

where W_{1,k}(q) and W_{2,k}(q) are two n(q)-dimension weight vectors with the following definitions:

W_{1,k}(q) = \frac{g^T(q) H_{k-1}(q)\, H_{k-1}(q) - H_{k-1}^T(q) H_{k-1}(q)\, g(q)}{\left(H_{k-1}^T(q) H_{k-1}(q)\right)^{3/2}}    (14)

W_{2,k}(q) = \frac{H_{k-1}(q)}{\left(H_{k-1}^T(q) H_{k-1}(q)\right)^{3/2}}    (15)

Here one can have many different ways to define or select the weak learners; typically, we can take the same approach as in RankBoost.
With (13) and (12), we can calculate the optimal weight \alpha_k and evaluate the cosine loss for each weak learner candidate, in order to select the one with the smallest loss as the k-th weak learner. Using this stage-wise greedy search strategy, we obtain a sequence of weak learners step by step, together with their combination coefficients, and thus the final ranking function. We summarize the corresponding details in Fig. 1. Note that e_{n(q)} is an n(q)-dimension vector with all elements equal to 1.

Algorithm: RankCosine
Given: ground truth g(q) over Q, and weak learner candidates h_i(q), i = 1, 2, ...
Initialize: W_{1,1}(q) = W_{2,1}(q) = e_{n(q)} / n(q)
For t = 1, 2, ..., T
  (a) For each weak learner candidate h_i(q)
      (a.1) Compute the optimal \alpha_{t,i} by (13)
      (a.2) Compute the cosine loss \epsilon_{t,i} by (12)
  (b) Choose the weak learner h_{t,i}(q) with minimal loss as h_t(q)
  (c) Choose the corresponding \alpha_{t,i} as \alpha_t
  (d) Update the query weight vectors W_{1,t+1}(q) and W_{2,t+1}(q) by (14) and (15)
Output the final ranking function H(q) = \sum_{t=1}^{T} \alpha_t h_t(q)

Fig. 1 The RankCosine algorithm
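For concreteness, below is a minimal numerical sketch (ours, not the authors' implementation) of one greedy step of RankCosine with NumPy: it evaluates the cosine loss of Eqs. (6)/(11) and the closed-form step size of Eq. (13) for a pool of candidate weak-learner outputs on a toy two-query training set. The toy labels, previous-model outputs and candidate scores are assumptions for illustration only.

import numpy as np

def cosine_loss(g_q, H_q):
    # Eq. (6)/(11) for one query: 1/2 * (1 - cos(g, H)); g_q is already L2-normalized (Eq. (10)).
    return 0.5 * (1.0 - g_q @ H_q / np.linalg.norm(H_q))

def optimal_alpha(g, H_prev, h):
    # Eq. (13), with the weight vectors of Eqs. (14) and (15), accumulated over queries.
    num, den = 0.0, 0.0
    for g_q, H_q, h_q in zip(g, H_prev, h):
        norm3 = (H_q @ H_q) ** 1.5
        W1 = ((g_q @ H_q) * H_q - (H_q @ H_q) * g_q) / norm3   # Eq. (14)
        W2 = H_q / norm3                                        # Eq. (15)
        num += W1 @ h_q
        den += W2 @ (h_q * (g_q @ h_q) - g_q * (h_q @ h_q))
    return num / den

def step_loss(g, H_prev, h):
    # Cosine loss of H_{k-1} + alpha_k * h_k over all queries, with alpha_k from Eq. (13).
    a = optimal_alpha(g, H_prev, h)
    return sum(cosine_loss(g_q, H_q + a * h_q) for g_q, H_q, h_q in zip(g, H_prev, h))

# Toy training set: two queries with 3 and 2 documents; labels are normalized per query.
g = [np.array([2.0, 1.0, 0.0]), np.array([1.0, 0.0])]
g = [v / np.linalg.norm(v) for v in g]
H_prev = [np.array([0.5, 0.4, 0.1]), np.array([0.3, 0.2])]     # output of the earlier weak learners
candidates = [
    [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0])],         # candidate weak-learner outputs
    [np.array([0.0, 1.0, 1.0]), np.array([0.0, 1.0])],
]
best = min(candidates, key=lambda h: step_loss(g, H_prev, h))  # step (b) of Fig. 1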

5. Experiment settings
To verify the benefit of using query-level loss functions, we conducted experiments on three corpora. Before getting into the details of our experimental results, we describe the experiment settings in this section.

5.1 Evaluation criteria
To evaluate the search accuracy, we adopted three widely used criteria in our experiments: mean precision at n (P@n) [1], mean average precision (MAP) [1], and normalized discounted cumulative gain (NDCG) [2][3][14][25].

5.1.1 Mean precision at n
Precision at n measures the accuracy of the top n results of the returned list for a query:

P@n = \frac{\#\text{ relevant docs in top } n \text{ results}}{n}    (16)

For example, if the top 10 docs of a query are {relevant, irrelevant, irrelevant, relevant, relevant, relevant, irrelevant, irrelevant, relevant, relevant}, then the P@1 to P@10 values of this query are {1, 1/2, 1/3, 2/4, 3/5, 4/6, 4/7, 4/8, 5/9, 6/10} respectively. For a set of queries, we average the P@n values of all queries to get the mean P@n value. Since most of today's Web search engines (such as Google [30], Yahoo! Search [31], and MSN Search [32]) display 10 retrieved documents on each result page and users often only browse the results listed on the first page, we used P@1 to P@10 to evaluate the ranking algorithms.

5.1.2 Mean average precision (MAP)
For a single query, the average precision is the average of the precision values obtained after each relevant document is retrieved:

AP = \frac{\sum_{n=1}^{N} P(n)\, rel(n)}{\#\text{ total relevant docs for this query}}    (17)

where N is the number of docs retrieved, P(n) is the P@n value as described in the previous subsection, and rel(n) is a binary function on the relevance of the n-th doc: rel(n) = 1 if the n-th doc in the hit list is relevant, and 0 otherwise. Similar to mean P@n, over a set of queries we get MAP by averaging the AP values of all the queries.

5.1.3 Normalized discounted cumulative gain (NDCG)
Note that P@n and MAP can only handle cases with binary judgments: "relevant" or "irrelevant". Recently, a new evaluation metric named normalized discounted cumulative gain (NDCG) was proposed, which can handle multiple levels of relevance judgment in the evaluation process. Considering that this new metric has been widely used in many recent works [2][3][14][25], we also use it in our experiments. While evaluating a ranked document list, NDCG follows two rules:
1) Highly relevant documents are more valuable than marginally relevant documents;
2) The greater the ranked position of a relevant document (of any relevance level), the less valuable it is for the user, because it is less likely to be examined by the user.
According to the above rules, there are four steps to compute the NDCG value for a ranked document list: 1) compute the gain of each document; 2) discount the gain of each document by its ranked position; 3) cumulate the discounted gains of the list; 4) normalize the discounted cumulative gain of the list.
In our experiments, we compute the NDCG value of a ranked document list at position i as follows:

N(i) = n_i \sum_{j=1}^{i} \frac{2^{r(j)} - 1}{\log(1 + j)}    (18)

where r(j) is the rating of the j-th document in the list, and the normalization constant n_i is chosen so that a perfect list gets an NDCG score of 1. From Equation (18), we can see that 2^{r(j)} - 1 is the gain (G) of the j-th document, (2^{r(j)} - 1)/\log(1 + j) is the discounted gain (DG) of the j-th document, \sum_{j=1}^{i} (2^{r(j)} - 1)/\log(1 + j) is the discounted cumulative gain (DCG) at position i of the list, and finally n_i \sum_{j=1}^{i} (2^{r(j)} - 1)/\log(1 + j) is the normalized discounted cumulative gain (NDCG) at position i of the list, which is called NDCG@i for short. Similar to P@n, we only used NDCG@1 to NDCG@10 to evaluate the ranking algorithms in our experiments.
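The three criteria are easy to compute; the sketch below (an illustration, not from the paper) implements P@n, AP, and NDCG@i as defined in Eqs. (16)-(18) and reproduces the P@n example above. The base of the logarithm in Eq. (18) is assumed to be 2, as is common for NDCG.

import math

def precision_at_n(rels, n):
    # Eq. (16): rels is a list of 0/1 relevance flags in ranked order.
    return sum(rels[:n]) / n

def average_precision(rels):
    # Eq. (17): sum P(n) at the positions of relevant documents, divide by total relevant.
    total_relevant = sum(rels)
    s = sum(precision_at_n(rels, n) for n in range(1, len(rels) + 1) if rels[n - 1])
    return s / total_relevant

def dcg_at_i(ratings, i):
    # Discounted cumulative gain with gain 2^r - 1 and discount 1/log2(1 + j).
    return sum((2 ** r - 1) / math.log2(1 + j) for j, r in enumerate(ratings[:i], start=1))

def ndcg_at_i(ratings, i):
    # Eq. (18): normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_i(sorted(ratings, reverse=True), i)
    return dcg_at_i(ratings, i) / ideal if ideal > 0 else 0.0

rels = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # the example list from Section 5.1.1
print([round(precision_at_n(rels, n), 3) for n in range(1, 11)])
print(round(average_precision(rels), 3), round(ndcg_at_i(rels, 10), 3))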

5.2 Comparison algorithms
In our experiments, we selected three machine learning algorithms for comparison: RankBoost, RankNet, and RankSVM. For RankBoost, the weak learners have binary outputs of 0 or 1. For RankNet, we used a linear neural net and a two-layer net [3], referred to as Linear-RankNet and TwoLayer-RankNet respectively. For RankSVM, we directly used the binary code of SVMlight [15]. For our proposed RankCosine, the weak learners take continuous values from the interval [0, 1].
In order to compare with traditional IR approaches, we also chose a widely used non-learning ranking algorithm, BM25 [20], as the baseline. BM25 computes the relevance score of a web page as follows:

\text{relevance} = \sum_{T \in Q} \omega\, \frac{(k_1 + 1)\, tf}{K + tf}\, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}    (19)

where Q is a query consisting of terms T, tf is the occurrence frequency of the term T within the web page, qtf is the frequency of the term T within the topic from which Q was derived, and \omega is the Robertson/Sparck Jones weight [22] of T in Q. K is calculated by

K = k_1\left((1 - b) + b\, \frac{dl}{avdl}\right)    (20)

where dl and avdl denote the page length and the average page length.
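As an illustration of Eqs. (19) and (20) (a sketch under our own assumptions, not the exact configuration used in the paper), the snippet below scores a page for a query given term statistics; the parameter values k1 = 1.2, k3 = 1000, b = 0.75 and the precomputed Robertson/Sparck Jones weights are assumptions for the example.

def bm25_score(query_terms, tf, qtf, rsj_weight, dl, avdl, k1=1.2, k3=1000.0, b=0.75):
    # Eq. (20): document-length normalization factor.
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t in query_terms:
        # Eq. (19): tf and qtf saturation, weighted by the Robertson/Sparck Jones weight.
        score += (rsj_weight[t]
                  * ((k1 + 1) * tf.get(t, 0)) / (K + tf.get(t, 0))
                  * ((k3 + 1) * qtf.get(t, 1)) / (k3 + qtf.get(t, 1)))
    return score

# Hypothetical statistics for a two-term query on one page.
print(bm25_score(["web", "ranking"],
                 tf={"web": 3, "ranking": 1}, qtf={"web": 1, "ranking": 1},
                 rsj_weight={"web": 1.8, "ranking": 2.5}, dl=520, avdl=430))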

6. Experiments on TREC web track collection
Because of the great impact of the Text REtrieval Conference (TREC) [27] on the information retrieval community, we first tested the performance of our RankCosine algorithm on one of the TREC web track collections.

6.1 The data corpus
This corpus was crawled from the .gov domain in early 2002 and has been used as the data collection of the Web Track since TREC 2002. It contains 1,053,110 pages with 11,164,829 hyperlinks. When conducting experiments on this corpus, we used the topic distillation task in the Web Track of TREC 2003 [6] as our query set (50 queries in total). The ground truths of this task are provided by the TREC committee with binary judgments: relevant or irrelevant. The number of relevant pages per query spans from 1 to 86.
In the previous section, we mentioned that queries may have large diversity in their numbers of documents (pairs). This is well validated by this dataset. For instance, from Figure 2 we can see that the number of relevant documents per query has a diverse distribution: about two thirds of the queries have less than 10 relevant documents. If we use a document-level loss function, these two thirds of the queries will not get enough attention, because they have fewer relevant documents and can only construct much fewer pairs than the other queries. In other words, about two thirds of the queries will not contribute to the learning process as much as they should.

Fig. 2 Histogram of number of relevant pages per query

For the learning algorithms, we extracted 14 features for every document. The details of these features are listed in the following table:

Table 3 Extracted features for TREC data
Feature Name                 Feature Number
BM25 [20]                    1
MSRA1000 [24]                1
PageRank [18]                1
HostRank [28]                1
Relevance Propagation [19]   10

6.2 The results
We conducted 4-fold cross validation for the learning algorithms. We tuned the parameters for BM25 on one of the trials and applied them to the other trials directly. The results reported in Figure 3 are averaged over the four trials.
From Figure 3(a), we can see that our RankCosine algorithm outperforms all other algorithms in terms of MAP, while the other learning algorithms get similar results. This may imply that the three state-of-the-art algorithms (RankSVM, RankBoost, and RankNet) have similar learning ability for information retrieval. Note that all learning algorithms outperform BM25. The MAP value of BM25 is about 0.13. RankBoost gets the lowest MAP among the learning algorithms, about 0.18, which is still much higher than BM25 (about a 40% improvement). The best algorithm, RankCosine, gets a MAP value of 0.21, an improvement of about 70% over BM25. On the one hand, this is partly because BM25 was used as one of the features of the learning algorithms; on the other hand, it suggests that learning to rank is a promising direction for IR.
From Figure 3(b), we can see that our RankCosine algorithm outperforms all other algorithms from P@1 to P@7. Although RankCosine and TwoLayer-RankNet interweave at P@8 to P@10, RankCosine is still much better than the other algorithms. An interesting phenomenon is that RankCosine gets more improvement at the top of the ranked list compared with the second best algorithm: more than 4 precision points (we define a precision point as a precision score of 0.01) improvement for P@1, about 2 precision points improvement for P@5, and similar performance for P@10. This benefits the user experience of a search engine, since the top results are made much better. Again, Figure 3(c) shows that RankCosine improves search accuracy as compared with the other algorithms, and this figure clearly verifies that learning algorithms can achieve better performance with non-learning algorithms as their features.

Fig. 3 Ranking accuracy on TREC web track data. (a) MAP, (b) P@n, (c) NDCG@n

7. Experiments with OHSUMED data
We also conducted experiments with the OHSUMED collection [11], which has been used in many IR experiments [4][12][21].


7.1 The data corpus


OHSUMED is a collection of documents and queries on medicine, consisting of 348,566 references and 106 queries. There are a total of 16,140 query-document pairs upon which relevance judgments were made. Different from the web track collection, the relevance judgments in this corpus have three levels: "definitely relevant", "possibly relevant", and "not relevant".
We adopted features similar to those used in [17]. Table 4 shows some of these features, for example tf (term frequency), idf (inverse document frequency), dl (document length), and their combinations. The BM25 score is another feature, calculated using the ranking method of BM25 [20]. We took the log of the feature values in order to reduce the effect of large numbers; based on our preliminary experiments, this does not change the tendencies of the results. Stop words were removed and stemming was conducted in indexing and retrieval. We got 30 features in total. In the MAP calculation, we define the category "definitely relevant" as positive (relevant) and the other two categories as negative (irrelevant).

Table 4 Some features. c(w, d) represents the count of word w in document d; C represents the entire collection; n denotes the number of terms in the query; |.| denotes the size function; and idf(.) denotes the inverse document frequency.


7.2 The results

Similar to the web track collection, we conducted 4-fold cross validation for the learning algorithms, and tuned the parameters for BM25 on one of the trials and applied them to the others directly. The results reported in Figure 4 are averaged over the four trials. From this figure, we can see that our RankCosine algorithm outperforms all other algorithms from NDCG@1 to NDCG@10, and the other three learning methods (RankNet, RankSVM, and RankBoost) get similar results, a little worse than RankCosine. In fact, we reach almost the same conclusions on this collection as on the previous one.
Comparing Fig. 3(b) with 3(c), and Fig. 4(b) with 4(c), we can find some differences between the two evaluation criteria. P@n only considers the number of relevant documents in the top n results, and ignores the positions of these relevant documents. For example, the following two ranked lists get the same P@4 value:
A: {irrelevant, irrelevant, relevant, relevant, ...}
B: {relevant, relevant, irrelevant, irrelevant, ...}
Different from P@n, NDCG@n emphasizes the positions of relevant documents in the top n results, as described in Section 5.1.3. For example, the NDCG@4 value of list B is much higher than that of list A.

Fig. 4 Ranking accuracy on OHSUMED data. (a) MAP, (b) P@n, (c) NDCG@n
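To make the contrast concrete, here is a worked NDCG@4 computation for lists A and B (our illustration), treating "relevant" as rating 1 and "irrelevant" as rating 0, and taking the logarithm in Eq. (18) to base 2:

DCG@4(A) = \frac{2^0-1}{\log_2(1+1)} + \frac{2^0-1}{\log_2(1+2)} + \frac{2^1-1}{\log_2(1+3)} + \frac{2^1-1}{\log_2(1+4)} \approx 0 + 0 + 0.50 + 0.43 = 0.93
DCG@4(B) = \frac{2^1-1}{\log_2(1+1)} + \frac{2^1-1}{\log_2(1+2)} + 0 + 0 \approx 1.00 + 0.63 = 1.63

The ideal ordering places the two relevant documents first, so the normalization constant is 1/1.63, giving NDCG@4(A) \approx 0.57 and NDCG@4(B) = 1.00.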

8. Experiments with web search data
To verify the performance of our algorithm in a real-world search application, we also conducted experiments with a collection from a commercial web search engine.


8.1 The data corpus


This data corpus contains over 2,000 queries with incomplete human-labeled ground truths. The queries were randomly sampled from the query log of the search engine. Human labelers were asked to assign ratings (from 1, meaning 'irrelevant', to 5, meaning 'definitely relevant') to the top-ranked pages for each query. Each top-ranked page was rated by five labelers, and the final rating was generated by majority voting. If voting failed (e.g., the five labelers gave ratings of 1, 2, 3, 4, and 5 respectively to a page), a meta labeler was asked to give the final judgment. Note that not all documents were labeled, due to limited labeling resources.
In order to conduct experiments, we randomly divided this dataset into a training set and a test set. There are more than 1,300 queries in the training set and about 1,000 queries in the test set. Figure 5 shows the query diversity in the training set: about one third of the queries have less than 200 pairs, more than half of the queries have less than 400 pairs, and the number of pairs for the other queries varies from 400 to over 1,400. If we use a pair-wise loss function, more than half of the queries will not get enough attention because they construct much fewer pairs than the other queries.
For this collection, we extracted two kinds of features to represent a web page: query-dependent features and query-independent features. Query-dependent features include term frequencies in anchor text, URL, title, body text and so on, while query-independent features include page quality, number of hyperlinks and so on. In total, we have 334 features for each web page.

Fig. 5 Histogram of number of document pairs per query

Since the judgments have 5 grades and P@n and MAP are mainly used for binary judgments, we only evaluated the search accuracy on this collection with NDCG.

8.2 Comparison of search accuracy
The search accuracies of the learning algorithms are plotted in Fig. 6. From this figure, we can see that our RankCosine algorithm got the best results for all NDCG scores, outperforming the other algorithms by 2~6 NDCG points (an NDCG point denotes an NDCG score of 0.01), which corresponds to about 4% to 13% relative improvement. TwoLayer-RankNet took second place, while RankSVM, RankBoost and Linear-RankNet got similar search performance. These results tell us that RankCosine (and its query-level loss function) is much more suitable for the web search application.

Fig. 6 Performance comparison on the testing set

To further verify the above findings, we conducted significance tests. Since TwoLayer-RankNet performed the second best among the algorithms under investigation, we only compared RankCosine with TwoLayer-RankNet by t-test, to see whether the improvement of RankCosine was statistically significant. The p-values for NDCG@1 to NDCG@10, with respect to a confidence level of 98%, are shown in Table 5. As can be seen, all p-values are very small, indicating that the improvement of the RankCosine algorithm is statistically significant.

Table 5 P-values of t-tests
NDCG@n    1         2         3         4         5
p-value   7.51E-03  6.16E-03  2.02E-04  6.58E-06  4.36E-06
NDCG@n    6         7         8         9         10
p-value   2.04E-06  2.43E-06  1.14E-06  4.19E-06  8.27E-08
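The significance test above is a standard paired t-test over per-query scores; a minimal sketch with SciPy is shown below (our illustration; the per-query NDCG arrays are placeholders, not the paper's data).

from scipy.stats import ttest_rel

# Hypothetical per-query NDCG@1 values for the two systems on the same test queries.
ndcg_rankcosine = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
ndcg_twolayer_ranknet = [0.58, 0.52, 0.69, 0.41, 0.60, 0.57]

t_stat, p_value = ttest_rel(ndcg_rankcosine, ndcg_twolayer_ranknet)
print(t_stat, p_value)  # a small p_value indicates a statistically significant difference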

8.3 Further study: the impact of query variance

As mentioned above, there may be large variance among different queries in terms of the number of documents (pairs). In this subsection, we investigate the impact of this variance on the performance of the ranking algorithms. For this purpose, we randomly sampled five datasets, each with 500 queries, from the original training set for training, and then tested the ranking performance on the test set in the same way as in the previous subsection. Since these five new training sets were generated by random sampling, we can observe some variance among them. Taking datasets 1 and 2 as an example (see Fig. 7(a)), we can see that the distributions of pair numbers in these two datasets are very different. Therefore, these new training sets are suitable for testing whether a ranking algorithm is robust to query variance.
The performance of each algorithm with respect to the five training sets is shown in Fig. 7(b). As can be seen from the figure, the results of RankBoost on the five datasets have large variance: it got its highest accuracy of 0.576 on the third dataset and its worst of 0.557 on the fifth. The results of RankSVM also change dramatically over different training sets. These observations indicate that both RankBoost and RankSVM are not very robust to query variance. TwoLayer-RankNet got a best result similar to those of RankBoost and RankSVM, but its performance is more stable. Comparatively, RankCosine got the highest ranking accuracy, as high as 0.597, and its performance varied only a little over the different training sets. This shows that query-level loss functions can handle query variance better than document (document pair) level loss functions.


Fig. 7 (a) Query variances in different training sets, (b) Comparison of NDCG@10 with respect to different training sets


9. Discussions
9.1 Upgrade of pair-wise loss functions
As shown in the experimental sections, the query-level loss function performs better than loss functions defined on the document or document-pair level. One may therefore prefer to upgrade those document (pair) level loss functions to the query level by introducing query-level normalization. In this section, we discuss this problem in detail. In particular, we introduce query-level normalization into the loss functions of RankSVM, RankBoost, and RankNet, and investigate their corresponding ranking performance.
For RankSVM, we can revise its loss function (1) to the query level as below:

\text{RankSVM:} \quad V(\omega, \xi) = \frac{1}{2}\,\omega^T \omega + C \sum_q \frac{1}{\#q} \sum_{i,j} \xi_{i,j,q}    (21)

where #q denotes the number of document pairs for query q. Similarly, we can revise the loss function of RankBoost as

\text{RankBoost:} \quad L = \sum_q \frac{1}{\#q} \sum_{d_i \succ_q d_j} L(d_i \succ_q d_j)    (22)

and revise the loss function of RankNet (Eq. (5)) as

\text{RankNet:} \quad L = \sum_q \frac{1}{\#q} \sum_{i,j} L_{q,i,j}    (23)

Note that the above revisions to the loss functions are still compatible with the optimization procedures of the original loss functions. Therefore, we do not describe the corresponding details of the learning algorithms, but directly report their ranking performance in Fig. 8. The performance is evaluated using the same methodology as in Section 8.3, and we also list the results of the pair-level algorithms in the figure for comparison.

Fig. 8 Comparison of ranking algorithms with query-level and pair-level loss functions

From Fig. 8, we find the following:
(1) Compared with their pair-wise versions, query-level RankBoost and query-level Linear-RankNet got higher performance on some datasets and lower performance on others. That is, it is difficult to judge whether the query-level versions of these two algorithms are better or worse than their pair-wise versions. However, we can at least conclude that query-level RankBoost and query-level Linear-RankNet are sensitive to query variance.
(2) The results of query-level TwoLayer-RankNet were worse than those of its pair-wise version on all five datasets, indicating that the query-level extension of TwoLayer-RankNet is not successful.
(3) Query-level RankSVM got better results on all five datasets than its pair-wise version. This is somewhat consistent with the conclusion of [4], which showed that adapting RankSVM with query-level information can improve its performance. However, as can be seen, the performance of query-level RankSVM is still not better than that of RankCosine.
From the above experiments, we conclude that it is non-trivial to upgrade existing ranking algorithms to the query level by simple normalization, and that further investigation of query-level loss functions is worthwhile.
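A sketch of the query-level normalization in Eqs. (21)-(23) is shown below (our illustration, reusing the toy pair losses from Section 2): each query's pair losses are averaged before summing over queries, so queries with many pairs no longer dominate the total loss.

import math

def pairwise_exp_loss(score_hi, score_lo):
    # RankBoost-style pair loss, Eq. (2).
    return math.exp(score_lo - score_hi)

def pair_level_total(queries):
    # Eq. (3): plain sum over all pairs of all queries.
    return sum(pairwise_exp_loss(hi, lo) for pairs in queries.values() for hi, lo in pairs)

def query_level_total(queries):
    # Eq. (22): each query's pair losses are divided by #q before summing.
    return sum(sum(pairwise_exp_loss(hi, lo) for hi, lo in pairs) / len(pairs)
               for pairs in queries.values())

# Hypothetical (higher-score, lower-score) pairs: q1 has many pairs, q2 has only one.
queries = {
    "q1": [(1.2, 0.7), (1.2, 0.1), (0.7, 0.1), (1.2, -0.4), (0.7, -0.4), (0.1, -0.4)],
    "q2": [(0.3, 0.9)],   # a badly ranked pair of the small query
}
print(pair_level_total(queries), query_level_total(queries))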

9.2 Limitations of existing loss functions
Note that all the loss functions tested in this work are approximations of the ideal loss. One reason for these approximations is computational convenience [23], since the ideal loss is usually discontinuous and difficult to minimize [10]; we show an example in Figure 9. One limitation shared by all these approximate loss functions is that the loss is not always zero even if the ranking function ranks a query correctly. Take the pair-wise loss of RankBoost, shown in Figure 9, as an example. Consider a document pair (x, y) in which document x should be ranked higher than y. Suppose the ranking function is f, and the relevance scores of x and y are f(x) and f(y) respectively. The solid curve in Figure 9 shows the ideal 0-1 loss: if f(x) - f(y) > 0, the loss is zero; otherwise the loss is one. In contrast, the exponential loss of RankBoost is non-zero even when f(x) - f(y) > 0, although this curve is smooth, continuous, and thus easy to minimize. It is easy to verify that the pair-wise losses of RankNet and RankSVM have the same problem.

Fig. 9 Curves of the ideal loss and the loss of RankBoost

Actually, the same limitation also exists in the proposed query-level loss function. That is, although the proposed cosine loss solves some of the problems of previous work, as discussed in the previous sections, it is still not perfect and has its own limitations. How to define a better query-level loss function with fewer limitations is worth investigating.
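The point is easy to see numerically; the short sketch below (our illustration) evaluates the ideal 0-1 pair loss and the RankBoost exponential loss for a correctly ranked and an incorrectly ranked pair.

import math

def ideal_pair_loss(fx, fy):
    # 0-1 loss: zero only when x is scored above y.
    return 0.0 if fx - fy > 0 else 1.0

def exp_pair_loss(fx, fy):
    # RankBoost's smooth surrogate, Eq. (2): positive even for correctly ranked pairs.
    return math.exp(fy - fx)

print(ideal_pair_loss(1.0, 0.2), exp_pair_loss(1.0, 0.2))   # 0.0 vs about 0.45
print(ideal_pair_loss(0.2, 1.0), exp_pair_loss(0.2, 1.0))   # 1.0 vs about 2.23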

10. Conclusions
Ranking for information retrieval has become an important machine learning problem. In this paper, we investigated how to improve search accuracy with machine learning techniques by considering the special characteristics of information retrieval. In particular, we focused on defining a more effective loss function. Our contributions include:
1) We pointed out that query-level loss functions are more suitable for information retrieval than document (pair) level loss functions, and discussed the necessary properties of a good query-level loss function;
2) We defined the cosine loss as an example of a query-level loss function, and derived the RankCosine algorithm, based on the generalized additive model, to minimize it;
3) By empirical study, we showed that it is non-trivial to upgrade the loss functions of some previous algorithms (e.g. RankSVM, RankBoost, RankNet) to the query level.
Experiments on a TREC web track dataset, the OHSUMED collection, and a large collection of web pages from a commercial Web search engine all showed that our proposed query-level loss function can improve search accuracy significantly.

11. References
[1] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.
[2] Borlund, P. The concept of relevance in IR. Journal of the American Society for Information Science and Technology, 54(10): 913-925, 2003.
[3] Burges, C.J.C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, 2005.
[4] Cao, Y., Xu, J., Liu, T.Y., Li, H., Huang, Y., and Hon, H.W. Adapting ranking SVM to document retrieval. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, 2006.
[5] Crammer, K. and Singer, Y. Pranking with ranking. In NIPS 14, 2002.
[6] Craswell, N., Hawking, D., Wilkinson, R., and Wu, M. Overview of the TREC 2003 Web Track. In NIST Special Publication 500-255: The Twelfth Text REtrieval Conference (TREC 2003), pages 78-92, 2003.
[7] Dekel, O., Manning, C., and Singer, Y. Log-linear models for label ranking. In NIPS 16, 2004.
[8] Freund, Y., Iyer, R., Schapire, R., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 2003.
[9] Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Technical report, Dept. of Statistics, Stanford University, 1998.
[10] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, New York, 2001.
[11] Hersh, W.R., Buckley, C., Leone, T.J., and Hickam, D.H. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference, pages 192-201, 1994.
[12] Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, MIT Press, pages 115-132, 2000.
[13] Jarvelin, K. and Kekalainen, J. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd ACM SIGIR Conference, 2000.
[14] Jarvelin, K. and Kekalainen, J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 2002.
[15] Joachims, T. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
[16] Joachims, T. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
[17] Nallapati, R. Discriminative models for information retrieval. In SIGIR 2004, pages 64-71, 2004.
[18] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford University, Stanford, CA, 1998.
[19] Qin, T., Liu, T.Y., Zhang, X.D., Chen, Z., and Ma, W.Y. A study of relevance propagation for web search. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 408-415, New York, NY, USA, 2005.
[20] Robertson, S.E. Overview of the Okapi projects. Journal of Documentation, 53(1): 3-7, 1997.
[21] Robertson, S. and Hull, D.A. The TREC-9 filtering track final report. In Proceedings of the 9th Text REtrieval Conference, pages 25-40, 2000.
[22] Robertson, S.E. and Walker, S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 345-354, 1994.
[23] Scholkopf, B. and Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, 2001.
[24] Song, R., Wen, J.R., Shi, S.M., Xin, G.M., Liu, T.Y., Qin, T., Zheng, X., Zhang, J.Y., Xue, G.R., and Ma, W.Y. Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.
[25] Sormunen, E. Liberal relevance criteria of TREC: counting on negligible documents? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[26] Vapnik, V. Statistical Learning Theory. Wiley, 1998.
[27] Voorhees, E.M. and Harman, D.K. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[28] Xue, G.R., Yang, Q., Zeng, H.J., Yu, Y., and Chen, Z. Exploiting the hierarchical structure for link analysis. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 2005.
[29] http://blog.searchenginewatch.com/blog/050622-082709
[30] Google, http://www.google.com
[31] Yahoo! Search, http://www.yahoo.com
[32] MSN Search, http://www.msn.com