Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011
Learning to Rank Relational Objects Based on the Listwise Approach

Yuxin Ding, Di Zhou, Min Xiao, Li Dong
Abstract—In recent years machine learning technologies have been applied to ranking, and a new research branch named "learning to rank" has emerged. Three types of learning-to-rank methods have been proposed: pointwise, pairwise and listwise approaches. This paper is concerned with the listwise approach. Currently the structural support vector machine (SVM) and the linear neural network have been utilized in listwise approaches, but these methods only consider the content relevance of an object with respect to queries; they all ignore the relationships between objects. In this paper we study how to use the relationships between objects to improve the performance of a ranking model. A novel ranking function is proposed, which combines the content relevance of documents with respect to queries and the relation information between documents. Two types of loss functions are constructed as the targets for optimization. We then use a neural network and a gradient descent algorithm as the model and training algorithm to build the ranking model. In the experiments we compare the proposed methods with two conventional listwise approaches. Experimental results on the OHSUMED dataset show that the proposed methods outperform the conventional methods.
I. INTRODUCTION
Ranking is the central issue of many applications, such as information retrieval (IR) and natural language processing. In recent years, machine learning technologies have been applied to this field, and a new research branch named "learning to rank" has emerged. Many learning-to-rank algorithms have been proposed in the recent literature, and they can be categorized into three types: pointwise, pairwise, and listwise approaches. The pointwise and pairwise approaches transform the ranking problem into regression or classification on single objects and object pairs respectively; they take documents or document pairs as instances in learning. Many methods of this kind have been proposed, such as SVM Ranking [1], RankBoost [2], RankNet [3], LambdaRank [4], Multiple Nested Ranker [5] and FRank [6]. Unfortunately, both the pointwise and the pairwise approaches ignore the fact that ranking is a prediction task on a list of objects. To address this, the listwise approach was proposed in 2007 by [7]. In the listwise rank learning approach, document lists instead of document pairs are used as instances in learning, and the major task is to construct a listwise loss function representing the difference between the ranking list output by a ranking model and the ranking list given as ground truth.
Manuscript received February 10, 2011. This work was supported in part by the Key Laboratory of Network Oriented Intelligent Computation (Shenzhen). The authors are with the Dept. of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, China (corresponding author Yuxin Ding, phone: 86-755-26032193; fax: 86-755-26032461; e-mail: yxding@hitsz.edu.cn).
Experimental results show that the listwise approach usually outperforms the pointwise and pairwise approaches. Representative listwise ranking algorithms include ListNet [7], ListMLE [8] and RankCosine [9].

The listwise approaches mentioned above employ only the content relevance between documents and queries for ranking, without considering the relationships between documents; they are based on the assumption that there is no relation information between the objects that should be used in ranking. However, this is not the case in practice. For example, PageRank [10] and HITS [11], the most popular algorithms for computing the importance of web pages, employ hyperlink (relation) information among web pages. Similarity between documents (a relation between documents) is used for search ranking as well: the paper [12] uses similarity between documents to find documents that cover as many subtopics as possible, and the paper [13] uses similarity between documents to boost the rank of documents related to a query. Motivated by these studies, we investigate whether relation information between documents can be used to improve the performance of learning-to-rank algorithms. In this paper the relation information between documents refers to the similarity between documents.

The algorithm proposed in paper [14] also used the relationships between documents to learn a ranking model; however, it is a pairwise approach. One problem of pairwise approaches is that the number of document pairs varies with the number of documents [7], leading to a bias toward queries with more document pairs when training a model; this bias caused by document pairs does not exist in the listwise approach. Therefore, we study the listwise learning-to-rank approach and focus on how to use relation information between objects to improve its performance. In this paper a novel ranking function is presented, which combines the content relevance of documents with respect to queries and the relation information between documents. The relation information between documents (the similarity between documents) is represented as a matrix (the similarity matrix). Two types of loss functions, the likelihood loss and the cross entropy loss, are defined as surrogate loss functions, and a linear neural network and the stochastic gradient descent algorithm are employed as the ranking model and the training algorithm.

The remaining part of this paper is organized as follows. Section II gives a general description of the listwise approach using the relationships between documents. Section III discusses how to calculate and express the relation information between documents. The ranking function design is introduced in Section IV. Section V describes the construction of the two types of loss functions and the gradient descent algorithm for training the ranking model. Section VI describes the experiment setting and the experimental results. Conclusions and future work are given in the last section.
II. LISTWISE APPROACH USING RELATION INFORMATION

In this section we first define several symbols, then give a general description of the conventional listwise approach and of the listwise approach proposed in this paper, and finally briefly analyze the similarities and differences between the two approaches.

During the training phase, a set of queries Q = {q1, q2, …, qn} is given. Each query qi is associated with a set of documents Di = {di1, di2, …, dim}, where m denotes the number of documents in Di. Each document dij in Di has a feature vector $\mathbf{x}_{ij} = \Phi(q_i, d_{ij})$, which contains not only conventional features, such as term frequency, but also newer features, such as HostRank; all the features are defined in [15]. Besides, each document set Di is associated with a set of judgments Li = {li1, li2, …, lim}, where lij is the relevance judgment of document dij with respect to query qi. For example, lij can denote the position of dij in the ranking list, or the relevance grade of dij with respect to qi. Therefore, each query qi corresponds to a set of documents Di, a set of feature vectors Xi = {xi1, xi2, …, xim} and a set of judgments Li [7].

Without considering the relationships among documents, a ranking function f(.) can be represented as F(Xi), where the input is the set of feature vectors corresponding to the document set Di and the output is a vector gi = {f(xi1), f(xi2), …, f(xim)}. Each dimension f(xij) of gi indicates a relevance score of document dij in terms of query qi. Then a loss function can be defined as (1),

$$\sum_{i=1}^{n} L(\mathbf{g}_i, L_i) \qquad (1)$$

where L(.) is a listwise loss function. Obviously the goal of learning is to select the best function $\hat{f}$ that minimizes the total loss:

$$\hat{f} = \arg\min_f \sum_{i=1}^{n} L(\mathbf{g}_i, L_i) \qquad (2)$$

$$= \arg\min_f \sum_{i=1}^{n} L(\{f(\mathbf{x}_{i1}), f(\mathbf{x}_{i2}), \ldots, f(\mathbf{x}_{im})\}, L_i) \qquad (3)$$

When the relationships among documents are taken into consideration, the ranking function F(.) can be defined as F(Xi, Ri), in which Ri denotes the relation information between documents (an m × m similarity matrix). In matrix Ri, Ri(q,j) and Ri(j,q) are equal (Ri(q,j) represents the element at the q-th row and j-th column), denoting the relationship between documents dij and diq. The output of f(Xi, Ri) is then a vector $\mathbf{g}'_i = \{f(\mathbf{x}_{i1}, R_i), f(\mathbf{x}_{i2}, R_i), \ldots, f(\mathbf{x}_{im}, R_i)\}$. In this case both the content information (the relationship between documents and queries) and the relation information (the relationship between documents) are utilized. Similarly, the objective of learning in this case can be defined as

$$\hat{f} = \arg\min_f \sum_{i=1}^{n} L(\mathbf{g}'_i, L_i) = \arg\min_f \sum_{i=1}^{n} L(\{f(\mathbf{x}_{i1}, R_i), \ldots, f(\mathbf{x}_{im}, R_i)\}, L_i)$$
We can see that the traditional listwise approach described by (2) is just a special case of the proposed listwise approach, which takes the relationships among documents into consideration. Therefore, the algorithm proposed in this paper extends conventional listwise learning-to-rank approaches (such as ListMLE [8] and ListNet [7]) by utilizing not only the content information but also the relation information. In this paper we compare the proposed listwise approach with ListMLE and ListNet.

III. RELATION INFORMATION AMONG DOCUMENTS

In this paper the vector space model (VSM) is used to represent documents. Let $\mathbf{v}_{ij} = \{v_{ij}^{(1)}, v_{ij}^{(2)}, \ldots, v_{ij}^{(k)}\}$ denote the features of document dij, where k is determined by the size of the vocabulary of the document collection Di after preprocessing (such as removing stop words and stemming). The relation information between documents is defined as the similarity between documents; the relation score between documents dij and diq can be calculated by (4),
$$\text{sim}(d_{ij}, d_{iq}) = \cos(\mathbf{v}_{ij}, \mathbf{v}_{iq}) = \frac{\mathbf{v}_{ij} \cdot \mathbf{v}_{iq}}{\|\mathbf{v}_{ij}\| \cdot \|\mathbf{v}_{iq}\|} \qquad (4)$$
where $\mathbf{v}_{ij} = \psi(d_{ij})$. Then a similarity matrix, denoted Ri, is constructed to represent the relations among objects, where Ri(q,j) and Ri(j,q) are equal to sim(dij, diq). The similarity matrix Ri is an m×m matrix, which can be described as an undirected graph: each node represents a document, and each link is assigned a weight indicating the relation score between the corresponding documents. In the remainder of this section we describe the details of the function ψ(.), whose input is a document dij and whose output is the feature vector vij of dij.

A. TF-IDF Based Document Feature Vector

tf-idf weighting is one of the most common weighting schemes used in retrieval models; its many variants are all based on term frequency and inverse document frequency. In this paper, tf-idf weighting is used to represent the document feature vector. As described in the previous section, dij denotes a document. To extract vij from dij, we use a tf-idf based method to assign weights to the words occurring in dij. The weights are calculated by (5),
$$v_{ij}^{(t)} = \frac{TF(t, d_{ij}) \cdot \log\!\left(\frac{n_i}{DF(t)} + 0.01\right)}{\sqrt{\sum_{t' \in V} TF^2(t', d_{ij}) \cdot \log^2\!\left(\frac{n_i}{DF(t')} + 0.01\right)}} \qquad (5)$$
where $v_{ij}^{(t)}$ indicates the weight assigned to term t, TF(t, dij) is the term frequency weight of term t in document dij, ni denotes the number of documents in the collection Di, and DF(t) is the number of documents in which term t occurs. Note that the denominator is used for normalization, to remove the effect of document length on the weights. Before constructing features for each document, stop words such as "a", "the", "of" and so forth are removed, and then stemming (for instance the Porter stemmer) is used to reduce the number of lexical items. Finally, a document set associated with query qi and including ni documents can be represented as a matrix of term weights (a document-term frequency matrix), in which each row corresponds to a document in Di, each column corresponds to a term occurring in Di, and $v_{ij}^{(t)}$ is the weight assigned to the corresponding term for a particular document.
$$\begin{array}{c|cccc}
 & t_1 & t_2 & \cdots & t_k \\ \hline
d_{i1} & v_{i1}^{(1)} & v_{i1}^{(2)} & \cdots & v_{i1}^{(k)} \\
d_{i2} & v_{i2}^{(1)} & v_{i2}^{(2)} & \cdots & v_{i2}^{(k)} \\
\vdots & \vdots & \vdots & & \vdots \\
d_{in_i} & v_{in_i}^{(1)} & v_{in_i}^{(2)} & \cdots & v_{in_i}^{(k)}
\end{array}$$

Fig.1. Document-term frequency matrix
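To make (4) and (5) concrete, the following is a minimal sketch, not the authors' implementation; the helper names (tfidf_vectors, similarity_matrix) and the toy documents are hypothetical, and stop-word removal and stemming are assumed to have been applied already.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Eq. (5): length-normalized tf-idf weights for tokenized documents in D_i."""
    n_i = len(docs)                                    # number of documents in D_i
    df = Counter(t for d in docs for t in set(d))      # DF(t): docs containing term t
    vocab = sorted(df)
    vecs = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * math.log(n_i / df[t] + 0.01) for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0   # denominator of (5)
        vecs.append([x / norm for x in raw])
    return vecs

def similarity_matrix(vecs):
    """Eq. (4): R_i[j][q] = cosine similarity of v_ij and v_iq (symmetric)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(u, v) for v in vecs] for u in vecs]

# Toy example (already stop-word-filtered and stemmed):
docs = [["heart", "attack", "risk"], ["heart", "rate"], ["diabet", "risk"]]
R = similarity_matrix(tfidf_vectors(docs))
```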
IV. RANKING FUNCTION DESIGN

In this section we first describe the ranking function design in detail, and then analyze the bias problem of the ranking function. As defined in Section II, f(xij, Ri) denotes the ranking function of document dij with respect to query qi. f(xij, Ri) is constructed as below:
$$f(\mathbf{x}_{ij}, R_i \mid \tau, \zeta) = h(\mathbf{x}_{ij}, \mathbf{w}) + \tau \sum_{q=1, q \neq j}^{n_i} h(\mathbf{x}_{iq}, \mathbf{w}) \, \bar{R}_i^{(j,q)} R_i^{(j,q)} \sigma(R_i^{(j,q)} \mid \zeta) \qquad (6)$$

$$\sigma(R_i^{(j,q)} \mid \zeta) = \begin{cases} 1, & \text{if } R_i^{(j,q)} \geq \zeta \\ 0, & \text{if } R_i^{(j,q)} < \zeta \end{cases} \qquad (7)$$

$$h(\mathbf{x}_{ij}, \mathbf{w}) = \langle \mathbf{x}_{ij}, \mathbf{w} \rangle = \mathbf{x}_{ij} \cdot \mathbf{w} \qquad (8)$$

where ni denotes the number of documents in the collection Di, and the feature vector xij encodes the content relevance of dij with respect to query qi. The vector w in h(xij, w) is unknown; it is exactly what we want to learn, and h(xij, w) as shown in (8) is the content relevance of dij with respect to qi. In this paper h(xij, w) is defined as a linear function, that is, h(.) is the inner product of the vectors xij and w. Ri(j,q) denotes the relationship between documents dij and diq as defined in Section III, and $\bar{R}_i^{(j,q)}$ is a normalized value of Ri(j,q), which is described in the next subsection. Besides, ζ is a constant; equation (7) is a threshold function that prevents documents that have little relevance to document dij from affecting f(xij, Ri). The second term of (6) represents the relation information between documents; it can be interpreted as follows: if the relevance score between diq and query qi is high and diq is very similar to dij, then the ranking score of dij for query qi will be increased significantly, and vice versa. The similarity Ri(j,q) between documents dij and diq is calculated by cosine similarity and is a real number between 0 and 1; we therefore consider that if Ri(j,q) is smaller than 0.3, dij has little relevance to diq. From (6) we can see that Ri(j,q) is included in the ranking function, that is, the value of the ranking function for document dij depends not only on the content of dij itself, but also on the relationships between documents.

A. Similarity Matrix $R_i$ and Normalized Matrix $\bar{R}_i$

In this subsection, the matrix $\bar{R}_i^{(j,q)}$ is defined first; then the roles of the similarity matrix Ri(j,q) and the normalized matrix $\bar{R}_i^{(j,q)}$ are discussed. $\bar{R}_i^{(j,q)}$ is calculated according to (9); it denotes the fraction of the similarity Ri(j,q) between documents dij and diq over the sum of the similarities between dij and all the other documents:

$$\bar{R}_i^{(j,q)} = \frac{R_i^{(j,q)}}{\sum_{r=1, r \neq j}^{n_i} R_i^{(j,r)}} \qquad (9)$$
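As an illustration of (6)-(9), here is a minimal numpy sketch; it is not the authors' code, the function names are hypothetical, and the diagonal of the normalized matrix is irrelevant because the sum in (6) excludes q = j. X holds the query-document feature vectors xij as rows, and R is the similarity matrix of Section III.

```python
import numpy as np

def normalize_similarity(R):
    """Eq. (9): Rbar[j, q] = R[j, q] / sum_{r != j} R[j, r].
    Assumes every row has at least one nonzero off-diagonal similarity."""
    off_diag = R.sum(axis=1) - np.diag(R)             # sum over r != j, per row
    return R / off_diag[:, None]

def ranking_scores(X, R, w, tau=0.5, zeta=0.3):
    """Eq. (6): f(x_ij) = h(x_ij, w) + tau * sum_{q != j} h(x_iq, w) Rbar R sigma."""
    h = X @ w                                         # eq. (8): content relevance
    sigma = (R >= zeta).astype(float)                 # eq. (7): threshold function
    M = tau * normalize_similarity(R) * R * sigma     # elementwise relation weights
    np.fill_diagonal(M, 0.0)                          # exclude q = j
    return h + M @ h

# Toy example: 5 documents, 3 features each, with the Fig. 2 similarity matrix.
X = np.random.rand(5, 3)
w = np.random.rand(3)
R = np.array([[1.0, 0.8, 0.8, 0.8, 0.8],
              [0.8, 1.0, 0.0, 0.1, 0.0],
              [0.8, 0.0, 1.0, 0.2, 0.0],
              [0.8, 0.1, 0.2, 1.0, 0.0],
              [0.8, 0.0, 0.0, 0.0, 1.0]])
print(ranking_scores(X, R, w))
```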
$$R_i = \begin{array}{c|ccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \hline
d_1 & 1 & 0.8 & 0.8 & 0.8 & 0.8 \\
d_2 & 0.8 & 1 & 0 & 0.1 & 0 \\
d_3 & 0.8 & 0 & 1 & 0.2 & 0 \\
d_4 & 0.8 & 0.1 & 0.2 & 1 & 0 \\
d_5 & 0.8 & 0 & 0 & 0 & 1
\end{array}$$

Fig.2. A similarity matrix

$$\bar{R}_i = \begin{array}{c|ccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \hline
d_1 & 1 & 1/4 & 1/4 & 1/4 & 1/4 \\
d_2 & 8/9 & 1 & 0 & 1/9 & 0 \\
d_3 & 4/5 & 0 & 1 & 1/5 & 0 \\
d_4 & 8/11 & 1/11 & 2/11 & 1 & 0 \\
d_5 & 1 & 0 & 0 & 0 & 1
\end{array}$$

Fig.3. A normalized matrix

The role of the normalized matrix $\bar{R}_i^{(j,q)}$ is to remove the bias caused by objects with more relevant documents. To illustrate the problem, we give an example in Fig. 2 and Fig. 3: Fig. 2 is a similarity matrix and Fig. 3 is the corresponding normalized matrix. If we only used the similarity matrix to calculate the value of the ranking function, document d1, which has four documents with relevance value 0.8, would be treated very differently from document d5, which has only one relevant document with the same relevance value 0.8; in this situation the ranking function would tend to assign an overly large ranking value to d1. To overcome this bias, the normalized matrix (see Fig. 3) is added to the ranking function. The normalized matrix restricts the sum of the relevance values to no more than 1, no matter how many relevant documents d1 has. For example, in this case the sum of the relevance values with respect to d1 is 0.8, which is equal to the sum of the relevance values with respect to d5.

$$R_i = \begin{array}{c|cccc}
 & d_1 & d_2 & d_3 & d_4 \\ \hline
d_1 & 1 & 0 & 0.1 & 0.2 \\
d_2 & 0 & 1 & 0.4 & 0.8 \\
d_3 & 0.1 & 0.4 & 1 & 0 \\
d_4 & 0.2 & 0.8 & 0 & 1
\end{array}$$

Fig.4. A similarity matrix

$$\bar{R}_i = \begin{array}{c|cccc}
 & d_1 & d_2 & d_3 & d_4 \\ \hline
d_1 & 1 & 0 & 1/3 & 2/3 \\
d_2 & 0 & 1 & 1/3 & 2/3 \\
d_3 & 1/5 & 4/5 & 1 & 0 \\
d_4 & 1/5 & 4/5 & 0 & 1
\end{array}$$

Fig.5. A normalized matrix

The role of the similarity matrix Ri is also important. To illustrate it, another example is given in Fig. 4 and Fig. 5. From Fig. 5 we can see that the normalized relevance values of d1 with respect to d3 and d4 are the same as those of d2; but in fact the real similarity between d1 and d3 is 0.1, while the real similarity between d2 and d3 is 0.4, which are quite different. Therefore, in order to preserve this difference, the similarity value Ri(j,q) must also be included in the second term of (6). From the above analysis we can see that in the ranking function $R_i$ and $\bar{R}_i$ restrict and complement each other; together they leverage the relation values between objects.

B. Coefficient τ

The role of the coefficient τ is to adjust the effect of the relation values between objects (the second term of (6)) on the overall ranking value. Consider again the example in Fig. 2 and Fig. 3: the relation value between d1 and every other document is 0.8, a relatively large value, so the second term would take the dominant position and the effect of the first term in the ranking function would become weak. In general the ranking value of a document should be determined mainly by the first term. Thus we can restrict the relation values by setting the coefficient τ to a small value. Usually τ is smaller than 1; in our experiments we set it to 0.5.

V. LISTWISE LOSS FUNCTION

In this section we propose two listwise approaches for learning to rank: MleNet and ListEntropy. MleNet uses the likelihood loss as the loss function for optimization, and ListEntropy uses the cross entropy loss.

A. MleNet

One listwise approach proposed in this paper is called MleNet. In MleNet, the likelihood loss proposed by [8] is utilized as the surrogate loss function; in other words, MleNet combines the ranking function (6) with the likelihood loss function. Compared with ListMLE [8], MleNet employs both the content information and the relation information to build the ranking model, so MleNet can be seen as an extension of ListMLE. Let the loss function for query qi be $L(\{f(\mathbf{x}_{i1}, R_i), f(\mathbf{x}_{i2}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\}, L_i)$; it is defined by (10) and (11):

$$L(\{f(\mathbf{x}_{i1}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\}, L_i) = -\log P(\{f(\mathbf{x}_{i1}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\} \mid \mathbf{x}_i, L_i, R_i) \qquad (10)$$

$$P(\{f(\mathbf{x}_{i1}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\} \mid \mathbf{x}_i, L_i, R_i) = \prod_{j=1}^{n_i} \frac{\exp(f(\mathbf{x}_{iL_j^i}, R_i))}{\sum_{k=j}^{n_i} \exp(f(\mathbf{x}_{iL_k^i}, R_i))} \qquad (11)$$

where the ranking function f(.) is defined by (6), ni denotes the number of documents in Di, $L_k^i$ denotes the k-th position in the ranking list, and therefore $\mathbf{x}_{iL_k^i}$ represents the document that occurs at the k-th position in the ranking list. From (10) we can see that the closer the document ranking list produced by the ranking function (6) is to the ranking list Li, the smaller L(.) is, and vice versa. For all the queries, the total expected loss R(f) is defined by (12), where n represents the number of queries:

$$R(f) = \sum_{i=1}^{n} L(\{f(\mathbf{x}_{i1}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\}, L_i) \qquad (12)$$

Our goal is to find the weight vector w of the ranking function f(.) that minimizes R(f). In fact, what we learn is the function h(.) defined in (8). Since h(.) is a linear function, the simplest and most direct way is to represent h(.) as a linear neural network, with w as the connection weights of the network. Therefore, we choose the linear neural network as the ranking model and employ stochastic gradient descent to train it; the paper [7] learned its listwise ranking function in the same way. As the loss function L(.) defined in (10) is a function of the weight vector w, we write it as $L(f(X_i, R_i)_\mathbf{w}, L_i)$.

The gradient of $L(f(X_i, R_i)_\mathbf{w}, L_i)$ with respect to wj can be derived as follows:
$$\Delta w_j = \frac{\partial L(f(X_i, R_i)_\mathbf{w}, L_i)}{\partial w_j} = -\sum_{k=1}^{n_i} \frac{1}{\ln 10} \left\{ \frac{\partial f(\mathbf{x}_{iL_k^i}, R_i)}{\partial w_j} - \frac{\sum_{p=k}^{n_i} \exp(f(\mathbf{x}_{iL_p^i}, R_i)) \cdot \frac{\partial f(\mathbf{x}_{iL_p^i}, R_i)}{\partial w_j}}{\sum_{p=k}^{n_i} \exp(f(\mathbf{x}_{iL_p^i}, R_i))} \right\} \qquad (13)$$

where

$$\frac{\partial f(\mathbf{x}_{ik}, R_i)}{\partial w_j} = x_{ik}^{(j)} + \tau \sum_{p=1, p \neq k}^{n_i} x_{ip}^{(j)} \bar{R}_i^{(k,p)} R_i^{(k,p)} \sigma(R_i^{(k,p)} \mid \zeta)$$

and $x_{ik}^{(j)}$ is the j-th element of $\mathbf{x}_{ik}$.

B. ListEntropy

The other listwise approach proposed in this paper is called ListEntropy. In ListEntropy, the cross entropy loss proposed by [7] is utilized as the surrogate loss function; ListEntropy combines the ranking function (6) with the cross entropy loss function, so it can be seen as an extension of ListNet [7]. The cross entropy loss function is defined by (14):
$$L(\{f(\mathbf{x}_{i1}, R_i), f(\mathbf{x}_{i2}, R_i), \ldots, f(\mathbf{x}_{in_i}, R_i)\}, L_i) = -\sum_{k=1}^{n_i} P_{L_i}(\mathbf{x}_{ik}) \log(P_f(\mathbf{x}_{ik})) \qquad (14)$$

Assume $L_i^k$ is the ranking score of the k-th document of query qi in the ground-truth ranking list Li. Then $P_{L_i}(\mathbf{x}_{ik})$ and $P_f(\mathbf{x}_{ik})$ can be defined as follows:

$$P_{L_i}(\mathbf{x}_{ik}) = \frac{\exp(L_i^k)}{\sum_{p=1}^{n_i} \exp(L_i^p)} \qquad (15)$$

$$P_f(\mathbf{x}_{ik}) = \frac{\exp(f(\mathbf{x}_{ik}))}{\sum_{p=1}^{n_i} \exp(f(\mathbf{x}_{ip}))} \qquad (16)$$
Similar to the definition of $L(f(X_i, R_i)_\mathbf{w}, L_i)$ in MleNet, the gradient of $L(f(X_i, R_i)_\mathbf{w}, L_i)$ in ListEntropy with respect to wj can be derived as follows:

$$\Delta w_j = \frac{\partial L(f(X_i, R_i)_\mathbf{w}, L_i)}{\partial w_j} = -\sum_{k=1}^{n_i} \left[ P_{L_i}(\mathbf{x}_{ik}) \cdot \frac{\partial f(\mathbf{x}_{ik}, R_i)}{\partial w_j} \right] + \frac{\sum_{k=1}^{n_i} \left[ \exp(f(\mathbf{x}_{ik}, R_i)) \cdot \frac{\partial f(\mathbf{x}_{ik}, R_i)}{\partial w_j} \right]}{\sum_{k=1}^{n_i} \exp(f(\mathbf{x}_{ik}, R_i))} \qquad (17)$$

where

$$\frac{\partial f(\mathbf{x}_{ik}, R_i)}{\partial w_j} = x_{ik}^{(j)} + \tau \sum_{p=1, p \neq k}^{n_i} x_{ip}^{(j)} \bar{R}_i^{(k,p)} R_i^{(k,p)} \sigma(R_i^{(k,p)} \mid \zeta)$$
and $x_{ik}^{(j)}$ is the j-th element of $\mathbf{x}_{ik}$.

C. Learning Algorithm

In this paper the linear neural network is used as the optimization model, and stochastic gradient descent is used as the algorithm for minimizing the loss function. Algorithm 1 shows the learning algorithm for MleNet and ListEntropy; MleNet uses (13) to calculate Δw, while ListEntropy uses (17).

Algorithm 1 Stochastic Gradient Descent
Input: training data {{X1, L1, R1}, {X2, L2, R2}, …, {Xn, Ln, Rn}}
Parameter: learning rate η, number of iterations T
Initialize parameter w
For t = 1 to T do
  For i = 1 to n do
    Input {Xi, Li, Ri} to the neural network
    Compute the gradient Δw with the current w
    Update w = w − η × Δw
  End for
End for
Output: neural network model w
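As a concrete companion to Algorithm 1, the following is a minimal numpy sketch of one ListEntropy update, i.e. (15)-(17) together with the shared derivative of f; it is our reading of the equations rather than the authors' code, and normalize_similarity / ranking_scores refer to the hypothetical helpers sketched in Section IV.

```python
import numpy as np

def grad_f(X, R, tau=0.5, zeta=0.3):
    """Rows are df(x_ik)/dw = x_ik + tau * sum_{p != k} x_ip Rbar R sigma,
    the 'where' clause shared by (13) and (17)."""
    sigma = (R >= zeta).astype(float)
    M = tau * normalize_similarity(R) * R * sigma     # helper from Section IV sketch
    np.fill_diagonal(M, 0.0)
    return X + M @ X

def listentropy_step(w, X, L, R, lr=0.01, tau=0.5, zeta=0.3):
    """One stochastic update of Algorithm 1 with the gradient (17)."""
    f = ranking_scores(X, R, w, tau, zeta)            # eq. (6), Section IV sketch
    P_L = np.exp(L - L.max()); P_L /= P_L.sum()       # eq. (15), shift-stabilized
    P_f = np.exp(f - f.max()); P_f /= P_f.sum()       # eq. (16), shift-stabilized
    G = grad_f(X, R, tau, zeta)
    dw = G.T @ (P_f - P_L)                            # eq. (17) in vector form
    return w - lr * dw                                # update w = w - eta * dw
```

The vector form follows because (17) equals the sum over k of (P_f(x_ik) − P_L(x_ik)) times the gradient of f(x_ik); an analogous step for MleNet would substitute the gradient (13).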
VI. EXPERIMENT

In our experiments we employed the OHSUMED and LETOR datasets. To evaluate the performance of the proposed listwise approaches, we compared MleNet with ListMLE [8] and ListEntropy with ListNet [7].

A. Data Set

We used two data sets in the experiments: OHSUMED, a benchmark dataset for document retrieval; and LETOR, a package of benchmark datasets for research on learning to rank. LETOR is derived from two sources, the OHSUMED collection and the ".gov" collection used in the topic distillation tasks of TREC 2003 and 2004. For the details of LETOR, please refer to [15]. In the experiments we only use the part of LETOR derived from OHSUMED. The OHSUMED dataset is a collection of queries on medicine, consisting of 348,566 documents and 106 queries. There are in total 16,140 query-document pairs upon which relevance judgments are made; the judgments are "highly relevant", "partially relevant" or "irrelevant". We use this dataset to generate the feature vector vij for each document dij with respect to query qi, and then use vij to calculate the similarity matrix Ri for each document collection Di.

The LETOR dataset contains standard features and relevance judgments. The features extracted from OHSUMED in LETOR include both low-level features (term frequency, inverse document frequency, document length and their combinations) and high-level features (BM25, LMIR); in total 45 features were extracted. For the details of these features, please refer to [15]. In our experiments, the feature vector xij for each document dij consists of all these 45 features; note that xij is used in the function h(.). There are 106 document collections corresponding to the 106 queries in LETOR. In order to conduct five-fold cross validation, we partitioned LETOR into five sets, each including 21 document collections corresponding to 21 queries. For each fold, four parts were used for training and the remaining part for testing. For performance evaluation we adopted two IR evaluation measures: normalized discounted cumulative gain (NDCG) [16] and mean average precision (MAP) [15]. Table I shows the data partition in the experiments (Qid is the query id), and the dataset partition for five-fold cross-validation is shown in Table II.
TABLE I
THE PARTITION OF THE DATASET OHSUMED

Subset   Queries
S1       Qid = 1 – 21
S2       Qid = 22 – 42
S3       Qid = 43 – 63
S4       Qid = 64 – 84
S5       Qid = 85 – 105

Note that the listwise approach needs a total document ranking list for training, but only three grades of relevance judgments are given. In this paper we randomly selected one perfect ranking among the possible perfect rankings for each query as the ground-truth ranking list.

TABLE II
THE PARTITION OF THE DATASET OHSUMED FOR CROSS-VALIDATION

Fold ID   Training set         Testing set
Fold 1    {S2, S3, S4, S5}     {S1}
Fold 2    {S1, S3, S4, S5}     {S2}
Fold 3    {S1, S2, S4, S5}     {S3}
Fold 4    {S1, S2, S3, S5}     {S4}
Fold 5    {S1, S2, S3, S4}     {S5}

B. Experimental Results

In the experiments, MleNet is compared with ListMLE [8], and ListEntropy is compared with ListNet [7]. The NDCG and MAP values reported for each approach are the average test values over the cross-validation folds. The estimated values of NDCG and MAP are shown in Fig. 6 and Table III respectively; in Fig. 6 the y-axis is the value of NDCG.

Fig.6. Ranking performance NDCG@n on OHSUMED

TABLE III
RANKING PERFORMANCE (MAP) ON OHSUMED

       ListEntropy   ListNet   MleNet   ListMLE
MAP    0.354         0.332     0.307    0.289

From Fig. 6 and Table III it can be seen that, when both use the likelihood loss function, MleNet outperforms ListMLE in terms of the NDCG and MAP measures: on average the NDCG values of MleNet are about 3 points higher than those of ListMLE (about a 5% relative improvement), and the MAP value of MleNet is about 2 points higher than that of ListMLE (about a 6% relative improvement). When both use the cross entropy loss, ListEntropy likewise outperforms ListNet: on average the NDCG values of ListEntropy are about 4 points higher than those of ListNet (about an 8% relative improvement), and the MAP value of ListEntropy is about 2 points higher than that of ListNet (about a 7% relative improvement). From Fig. 6 and Table III we can also see that ListEntropy and ListNet, which employ the cross entropy loss, outperform MleNet and ListMLE, which employ the likelihood loss.

The only difference between the proposed approaches and the approaches in [8][7] is that in the proposed approaches the relation information between documents is used to adjust the document ranking values. This shows that relation information between documents has a positive effect on estimating the ranking scores of documents. The interactions between documents act like a voting mechanism: a document gives a larger positive adjustment to a target document when the two are very similar, and vice versa. This kind of inner adjustment among documents is important for sorting documents.
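For reference, the NDCG@n measure used above can be sketched as follows. This is a minimal sketch under a common formulation of [16], with gain 2^rel − 1 and a log2 position discount; the exact gain and discount used by the LETOR evaluation scripts may differ in detail, so this is illustrative rather than the official evaluation code.

```python
import numpy as np

def ndcg_at_n(rels, n):
    """NDCG@n for relevance grades listed in model-ranked order (e.g. 2, 1, 0)."""
    rels = np.asarray(rels, dtype=float)
    def dcg(r):
        r = r[:n]
        return np.sum((2.0 ** r - 1.0) / np.log2(np.arange(2, r.size + 2)))
    ideal = dcg(np.sort(rels)[::-1])                  # DCG of the perfect ordering
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Example: a ranking that places a partially relevant document first.
print(ndcg_at_n([1, 2, 0, 0], n=4))
```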
VII. CONCLUSIONS AND FUTURE WORK

In this paper, listwise approaches that use both the content information of objects and the relation information between objects for learning to rank are proposed. Our main contribution is extending the traditional ranking function by introducing relation information between objects. We also give a detailed illustration of the ranking function design; in particular, for the bias problem caused by the ranking function we propose an effective solution. According to the different loss function designs, two listwise approaches are constructed, and their performance is compared with that of the traditional listwise approaches. The experimental results show that our methods outperform the traditional methods and that relation information between documents is useful for sorting documents. One direction for future work is to try different ranking functions that combine the content information and the relation information to further improve ranking accuracy.

ACKNOWLEDGMENT

This study is supported by the Scientific Research Fund of Shenzhen (2011) and the Scientific Research Fund of Harbin Institute of Technology (2011).

REFERENCES
[1] K. Obermayer, R. Herbrich, and T. Graepel. Support vector learning for ordinal regression. Proceedings of the International Conference on Artificial Neural Networks, Edinburgh, UK: ENNS, 1999: 97-102
[2] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. of Machine Learning Research, 2003, 4: 933-969
[3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany: ACM, 2005: 89-96
[4] P. Donmez, K. M. Svore, and C. J. C. Burges. On the local optimality of LambdaRank. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), Boston, USA: ACM, 2009: 460-467
[5] I. Matveeva, C. Burges, T. Burkard, and L. Wong. High accuracy retrieval with multiple nested ranker. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), Seattle, USA: ACM, 2006: 437-444
[6] M.-F. Tsai, T.-Y. Liu, H.-H. Chen, and W.-Y. Ma. FRank: a ranking method with fidelity loss. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), Amsterdam: ACM, 2007: 383-390
[7] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning, Oregon, USA: ACM, 2007: 129-136
[8] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: theory and algorithm. Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland: ACM, 2008: 1192-1199
[9] T. Qin, X.-D. Zhang, M.-F. Tsai, D.-S. Wang, T.-Y. Liu, and H. Li. Query-level loss functions for information retrieval. Information Processing and Management, 2008, 44(2): 838-855
[10] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998
[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 1999, 46(5): 604-632
[12] C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada: ACM, 2003: 10-17
[13] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. SIGIR '06, Seattle, USA: ACM, 2006: 162-169
[14] T. Qin, T.-Y. Liu, and X.-D. Zhang. Learning to rank relational objects and its application to web search. Proceedings of the 17th International World Wide Web Conference, Beijing, China: IW3C2, 2008: 407-416
[15] J. Xu, T.-Y. Liu, and T. Qin. LETOR: benchmark dataset for research on learning to rank for information retrieval. SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, Amsterdam: ACM, 2007: 201-206
[16] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece: ACM, 2000: 41-48