Directly Optimizing Evaluation Measures in Learning to Rank Based on the Clonal Selection Algorithm

Qiang He
School of Computer Science and Technology, Shandong University, Jinan, China
[email protected]

Jun Ma
School of Computer Science and Technology, Shandong University, Jinan, China
[email protected]

Shuaiqiang Wang
Department of Computer Science, Texas State University, San Marcos, TX, USA
[email protected]
ABSTRACT
One fundamental issue in learning to rank is the choice of the loss function to be optimized. Although the evaluation measures used in Information Retrieval (IR) are ideal candidates, in many cases they cannot be used directly because they do not satisfy the smoothness properties required by conventional machine learning algorithms. In this paper a new method named RankCSA is proposed, which uses IR evaluation measures directly. It employs the clonal selection algorithm to learn an effective ranking function by combining various evidences in IR. Experimental results on the LETOR benchmark datasets demonstrate that RankCSA outperforms the baseline methods in terms of P@n, MAP and NDCG@n.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval models
General Terms: Algorithms, Experimentation, Theory.
Keywords: Clonal Selection Algorithm, Information Retrieval, Machine Learning, Learning to Rank, Ranking function

1. INTRODUCTION

In recent years, learning to rank has become widely used for the ranking problem of IR. It aims to automatically learn a ranking function from a set of training data with relevance labels. One fundamental issue in learning to rank is the choice of the loss function to be optimized. Since ranking results in IR are generally evaluated with IR evaluation measures, these measures are the ideal loss functions to optimize: high accuracy in training then promises high performance in evaluation. However, optimizing them directly is usually difficult because of the requirements that conventional machine learning techniques place on loss functions; most optimization algorithms need smooth, or even convex, loss functions, while IR measures are rank-dependent, and thus non-continuous and non-differentiable [8].

Many learning to rank algorithms transform the ranking problem into binary classification on pairs constructed between documents [3, 4]. These methods typically minimize a loss function only loosely related to IR evaluation measures. Recently, several methods have managed to directly optimize performance in terms of IR measures [9]. Mostly, these methods address the non-smooth optimization through surrogate loss functions, which bound or approximate IR measures. The effectiveness of such direct optimization methods has been verified empirically and theoretically [7, 9], but in many cases the relationships between the surrogate functions and the IR measures are not clear [7].

In this paper, we aim to develop a general learning approach that innately and directly optimizes the performance measures used in IR. To address this challenge, we propose a learning to rank method called RankCSA, which employs the clonal selection algorithm to learn an effective ranking function by combining various evidences in IR. The main contributions of this work are as follows. First, this paper employs the clonal selection algorithm for learning ranking functions; we define the representations of antigen and antibody, together with a shape-space model, for the ranking problem of IR. Second, we directly use an IR measure, Mean Average Precision (MAP), as the affinity function in the evolution process; using the IR measure itself rather than a surrogate function helps RankCSA outperform other methods. Finally, the effectiveness of the proposed method is verified on the LETOR collections in comparison with state-of-the-art methods. The results show that RankCSA achieves consistent improvements over the compared ranking algorithms in most cases.
2. THE PROPOSED LEARNING METHOD

2.1 Formulation

In training, a query set Q = {q1, q2, ..., q|Q|} and a document set D = {d1, d2, ..., d|D|} are given. Each query qi is associated with a list of documents di = {di1, di2, ..., di|di|} and a list of labels yi = (yi1, yi2, ..., yi|di|), where dij ∈ D denotes the j-th document in the list and yij is a relevance judgment indicating the relative similarity of dij to qi. Each query-document pair (qi, dij) can be expressed as φ(qi, dij), where φ : Q × D → R^N is a feature mapping function from a query-document pair to a feature vector, i.e., a point in an N-dimensional space. In the learning to rank task for IR, an antigen, a notion derived from biological immune systems, can be denoted as a 3-tuple:
Agi = (qi, di, yi)    (1)
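To make this representation concrete, the following is a minimal Python sketch of an antigen; the class and field names are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Antigen:
    """One antigen Ag_i = (q_i, d_i, y_i): a query, its candidate documents
    (each given as an N-dimensional feature vector phi(q_i, d_ij)),
    and the corresponding relevance labels y_ij."""
    query_id: str
    features: List[List[float]]  # one feature vector per document d_ij
    labels: List[int]            # relevance judgment y_ij per document

# Toy example with two queries, each with two candidate documents.
training_set = [
    Antigen("q1", features=[[0.3, 1.2], [0.8, 0.1]], labels=[2, 0]),
    Antigen("q2", features=[[0.5, 0.9], [0.2, 0.4]], labels=[1, 1]),
]
```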
Each antigen is thus composed of a query, a list of related documents and the corresponding list of labels. In this way, our method enjoys the advantages of the listwise approach. The training set is created as a set of antigens:

T = {Ag1, Ag2, ..., Ag|T|}    (2)

An antibody Ab represents a potential ranking function. It is defined as a functional expression built from three components Sf, Sc and Sop: Sf is a set of symbols referring to features in the feature vector φ(q, d), Sc is a set of real numbers ranging from 0.0 to 10, and Sop is a set of arithmetic operators. Thus an antibody can be denoted as Equation 3 shows:

Ab = (Sf, Sc, Sop)    (3)

where Sf, Sc and Sop are given by Equation 4:

Sf = {Fi | Fi ∈ φ(q, d)},
Sc = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10},    (4)
Sop = {+, −, ∗, /}.

In this paper, we choose a tree-based framework as the computing architecture for antibodies. Specifically, Ab is represented as a complete binary tree structure: each internal node is an arithmetic operator and each leaf node is either a feature or a constant. The maximum number of available nodes of an antibody is determined by the depth of the tree, which is a parameter of the learning method.

A repertoire is a set of antibodies, each member of which represents a candidate solution, and can be denoted as Equation 5 shows:

R = {Ab1, Ab2, ..., Ab|R|}    (5)

For each document dij in Agi, Ab outputs a score according to φ(qi, dij). A ranking πi = (π(di1), π(di2), ..., π(di|di|)) is produced by sorting the scores, where π(dij) is the position at which dij appears in πi. That is, πi is the predicted ranking of the documents with respect to qi given by the antibody Ab.
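As an illustration of this tree encoding, here is a minimal sketch of an antibody node and its scoring function in Python; the names and the convention that integer symbols denote feature indices while floats denote constants are assumptions of the sketch, as is the protected division.

```python
import operator

# Arithmetic operators S_op; division is protected against zero divisors,
# a common convention in tree-based function evolution (an assumption here).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": lambda a, b: a / b if b != 0 else 1.0}

class Node:
    """A node of the antibody's binary expression tree. Internal nodes hold
    an operator symbol from S_op; leaves hold either a feature index into
    phi(q, d) (an int) or a constant from S_c (a float)."""
    def __init__(self, symbol, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

    def score(self, features):
        if self.left is None:                 # leaf node
            if isinstance(self.symbol, int):  # feature index F_i
                return features[self.symbol]
            return self.symbol                # constant from S_c
        return OPS[self.symbol](self.left.score(features),
                                self.right.score(features))

# Example antibody encoding the ranking function (F_0 * 2.0) + F_3.
# Sorting a query's documents by ab.score(...) in descending order
# yields the predicted ranking pi_i.
ab = Node("+", Node("*", Node(0), Node(2.0)), Node(3))
```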
2.2 The Shape-Space Model and the Affinity Function

Mathematically, the generalized shape of a molecule, either an antibody or an antigen, can be represented as an L-dimensional attribute string m. The attribute string contains the values of the individual coordinates of the shape in the shape-space S, i.e., m = <m1, m2, ..., mL> ∈ S^L. In this paper, a shape-space model for the ranking problem of IR is defined to characterize the antigens and antibodies and to quantitatively describe their interactions. For a given antigen and antibody, let mAgi = yi, mAb = πi and L = |di|, respectively. The Ag-Ab representation determines the affinity measure. Based on the definitions above, the affinity function is defined as follows:

AF(Agi, Ab) = E(yi, πi)    (6)

where E is an evaluation measure in IR. The objective of learning is formalized as the selection of a best antibody that maximizes the mean affinity with respect to all antigens in the training data:

MAF(Ab, T) = (1/|T|) Σ_{i=1}^{|T|} AF(Agi, Ab)    (7)
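For instance, with E instantiated as Average Precision (so that MAF is MAP over the training queries), the affinity computation could look like the following sketch, reusing the Antigen and Node names from the earlier sketches; binarizing graded labels with "label > 0" is an assumption made here for brevity.

```python
def average_precision(labels, ranking):
    """AP of one ranking: `ranking` lists document indices in ranked order,
    `labels` gives the relevance grade per document index (binarized here)."""
    hits, precision_sum = 0, 0.0
    for pos, doc in enumerate(ranking, start=1):
        if labels[doc] > 0:
            hits += 1
            precision_sum += hits / pos
    total_relevant = sum(1 for y in labels if y > 0)
    return precision_sum / total_relevant if total_relevant else 0.0

def affinity(antigen, antibody):
    """AF(Ag_i, Ab) = E(y_i, pi_i): score the documents with the antibody,
    sort them, and evaluate the induced ranking against the labels (Eq. 6)."""
    scores = [antibody.score(f) for f in antigen.features]
    ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
    return average_precision(antigen.labels, ranking)

def mean_affinity(antibody, antigens):
    """MAF(Ab, T): mean affinity over all antigens in the training set (Eq. 7)."""
    return sum(affinity(ag, antibody) for ag in antigens) / len(antigens)
```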
2.3 The Learning Method: RankCSA

The proposed learning method, RankCSA, is described in Algorithm 1.

Algorithm 1 RankCSA
Input: training set T, validation set V, and parameters N (the fixed antibody repertoire size), n (the number of antibodies to select for cloning), Gen (the stop condition) and d (the number of antibodies to select for replacing)
Output: the best antibody Abbest
Learning Procedure:
(1) Randomly initialize N antibodies as the initial repertoire R.
(2) Present the training set T to each of the N antibodies in R and determine their affinities by Equation 7.
(3) Select n high-affinity antibodies from R, composing a new set Rn of high-affinity antibodies.
(4) Clone (reproduce) the n selected antibodies in Rn, generating a repertoire RC of clones.
(5) Submit all antibodies in RC to an affinity maturation process inversely proportional to their antigenic affinities, generating a population RC∗ of matured clones: the higher the affinity, the smaller the mutation rate.
(6) Determine the affinities of the matured clones.
(7) From this set of matured clones, re-select the N − n high-affinity antibodies that are not in Rn. These antibodies and Rn together form a repertoire RE of elites.
(8) Replace the d lowest-affinity antibodies in RE with newly generated random individuals to re-compose the antibody repertoire R.
(9) Repeat Steps (2)-(8) until the number of iterations reaches Gen.
(10) Select the best antibody Abbest in R.

Following are the explanations for Algorithm 1.

2.3.1 The selection for high affinity antibodies

In Steps (3) and (7), antibodies are not selected in a sequential, greedy manner, because that would run counter to the principles of the immune system. Instead, we introduce the concept of a "deme" into our approach, which originates from an important technique of GP. An antibody Abr is selected at random, and together with its |Deme| − 1 neighbors it comprises a deme as follows:

Deme = (Ab1, Ab2, ..., Abr, ..., Ab|Deme|)    (8)

The antibody with the highest affinity in the deme is selected.
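A sketch of this deme-based selection follows; the paper does not specify the neighborhood relation, so adjacency of repertoire indices is assumed here. Repeating this selection n times (skipping duplicates) yields the set Rn of Step (3).

```python
import random

def deme_select(repertoire, deme_size, affinity_of):
    """Pick an antibody Ab_r at random; it and its deme_size - 1 neighbors
    form a deme (Eq. 8), and the deme's highest-affinity member is selected."""
    r = random.randrange(len(repertoire))
    half = deme_size // 2
    deme = [repertoire[(r + k) % len(repertoire)]  # wrap around the ends
            for k in range(-half, deme_size - half)]
    return max(deme, key=affinity_of)
```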
2.3.2 Cloning, hyper-mutation and replacement

In addition to selection, the iterative process in RankCSA involves three other operations: cloning, hyper-mutation and replacement. In Step (4), the number of clones generated for each of the n antibodies is given by round(β ∗ n ∗ MAF(Ab, T)), where β is a multiplying factor and the round(·) function rounds a number to the nearest integer. The total number of clones, namely the population size of RC, can therefore be denoted as Equation 9 shows:

NRC = Σ_{Ab∈Rn} round(β ∗ n ∗ MAF(Ab, T))    (9)
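Under the same assumptions as the sketches above, the cloning step could be transcribed as:

```python
import copy

def clone_step(selected, beta, n, train_set):
    """Step (4): each antibody Ab in R_n contributes
    round(beta * n * MAF(Ab, T)) copies to the clone repertoire RC (Eq. 9),
    so higher-affinity antibodies are cloned more often."""
    rc = []
    for ab in selected:
        n_clones = round(beta * n * mean_affinity(ab, train_set))
        rc.extend(copy.deepcopy(ab) for _ in range(n_clones))
    return rc
```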
Hyper-mutation provides the algorithm with the ability to introduce new material into the repertoire and expands the searched solution space. The inverse proportionality of hyper-mutation ensures that high-affinity antibodies are disturbed only slightly, while low-affinity ones are modified to a greater extent [6]. Two kinds of operations are used in hyper-mutation to diversify the clones: single-mutation and multi-mutation. Single-mutation changes only a single node of the antibody. For multi-mutation, a mutant is created by randomly choosing an internal node and replacing its whole sub-tree with a randomly generated tree. The choice between the two operations depends on a dynamic parameter Pm: if all the antibodies in the repertoire have similar affinities, RankCSA emphasizes multi-mutation by increasing its probability of occurrence; otherwise the rate is kept at its initial value.

It is important to remark that antibodies with higher affinity must be preserved as high-quality candidate solutions, and shall only be replaced by improved candidates, based on statistical evidence. In order to retain the best antibody of each clone during evolution, in Step (7) we keep one original antibody of each clone un-mutated during the maturation phase. Duplicate antibodies are not allowed.

Another basic mechanism for diversifying the repertoire of antibodies is replacement. In Step (8), newcomers are added to the repertoire and low-affinity antibodies are eliminated to stop their further alteration. Replacement also offers the ability to escape from local optima on an affinity landscape and yields a broader search for the global optimum of ranking functions.
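The two mutation operators could be sketched as follows, with tree Nodes as defined earlier; the tree-growing helper, the sampled subset of S_c, and the choice of replacement symbols are assumptions of the sketch (the Pm-based choice between the operators is omitted).

```python
import copy
import random

def all_nodes(node):
    """Collect the nodes of a binary expression tree."""
    if node is None:
        return []
    return [node] + all_nodes(node.left) + all_nodes(node.right)

def random_tree(n_features, depth):
    """Assumed helper: grow a random expression sub-tree of the given depth."""
    if depth <= 1:
        leaf = random.choice([random.randrange(n_features),      # feature F_i
                              random.choice([0.0, 0.5, 1.0, 2.0])])  # from S_c
        return Node(leaf)
    return Node(random.choice(["+", "-", "*", "/"]),
                random_tree(n_features, depth - 1),
                random_tree(n_features, depth - 1))

def single_mutation(ab, n_features):
    """Replace the symbol of one randomly chosen node, keeping the tree shape."""
    mutant = copy.deepcopy(ab)
    node = random.choice(all_nodes(mutant))
    if node.left is None:   # leaf: new feature index or constant
        node.symbol = random.choice([random.randrange(n_features),
                                     random.choice([0.0, 0.5, 1.0, 2.0])])
    else:                   # internal node: new operator from S_op
        node.symbol = random.choice(["+", "-", "*", "/"])
    return mutant

def multi_mutation(ab, n_features, max_depth):
    """Replace the whole sub-tree under a randomly chosen internal node."""
    mutant = copy.deepcopy(ab)
    internal = [nd for nd in all_nodes(mutant) if nd.left is not None]
    if internal:
        node = random.choice(internal)
        sub = random_tree(n_features, max_depth)
        node.symbol, node.left, node.right = sub.symbol, sub.left, sub.right
    return mutant
```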
2.3.3 The selection for the best antibody

In the last step of the algorithm, we use a validation set V to help choose from R the best antibody that is not over-specialized for the antigens in the training set. To assess an antibody's generalization ability, we consider the average performance of the antibody over the training and validation sets minus the standard deviation, which is called the AVGσ method in [2]. Formally, among all the antibodies Ab in R, the best one is selected by Equation 10:

AVGσ : argmax_{Ab∈R} ( (MAF(Ab, T) + MAF(Ab, V)) / 2 − σ )    (10)
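A direct transcription of this criterion, assuming (following the AVGσ method of [2]) that σ is the standard deviation of the two MAF values and reusing the mean_affinity sketch from above:

```python
import statistics

def avg_sigma(antibody, train_set, valid_set):
    """AVG_sigma criterion (Eq. 10): mean of the training and validation
    affinities, penalized by their standard deviation."""
    t = mean_affinity(antibody, train_set)
    v = mean_affinity(antibody, valid_set)
    return (t + v) / 2 - statistics.stdev([t, v])

# The best antibody maximizes the criterion over the repertoire R:
# ab_best = max(repertoire, key=lambda ab: avg_sigma(ab, T, V))
```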
3. EXPERIMENTS
3.1 Experimental Settings

LETOR [5] is a package of benchmark datasets for research on learning to rank released by Microsoft Research Asia, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. We conducted experiments to test the performance of RankCSA on two benchmark datasets: OHSUMED and MQ2007. For each dataset, a 5-fold cross-validation strategy is adopted. We use four learning methods, Ranking SVM [4], ListNet [1], AdaRank [9] and RankBoost [3], as baselines, and the evaluation measures P@1-10, MAP and NDCG@1-10 to compare them with the proposed RankCSA method.

For RankCSA, most parameters are set empirically. The type of ranking function that RankCSA targets is assumed to be non-linear. We set Gen = 200, N = 500, n = 50, |Deme| = 9, d = 100 (corresponding to a rate of 20% newcomers) and β = 10. Pm is initialized to 0.5 and increased in steps of 0.05 when necessary. The most widely used IR measure, MAP, is utilized as the affinity function. The tree depth is set to 8 so as to at least cover the case in which the leaf nodes contain all features and the same number of constants. As in [10], we run RankCSA 10 times to reduce the effect of the random process; the results reported in the next section are averages over the 10 runs.

3.2 Experiment Results

Table 1 gives the MAP results on both datasets. It shows that RankCSA beats all the baseline methods in all cases. However, there is no significant difference among these methods in terms of the MAP measure. This is because MAP emphasizes the overall ranking performance, i.e., it takes the ranking precision at all positions into consideration.

Figure 1 shows the results on OHSUMED. From Figure 1(a), we observe that all methods perform similarly after n = 6. But if we focus on P@1-5, RankCSA evidently outperforms all baseline methods. In particular, for P@1, RankCSA achieves more than 11% relative improvement over RankSVM. A similar result is observed in Figure 1(b): NDCG@1 is improved by 14.6% over RankSVM and by 4% over the second best method, ListNet. However, RankCSA is beaten by AdaRank-NDCG in terms of NDCG@5 and NDCG@6 by a narrow margin.

We also conducted experiments to observe the learning curve of RankCSA. We recorded the affinity of the best antibody in the repertoire every 10 generations, averaged over the 10 runs. Figure 1(c) shows the resulting average affinity variation curve. We observe that the affinity changes sharply before 100 generations and converges after that. Furthermore, the curve suggests that RankCSA still has potential beyond 200 generations; this issue is discussed in Section 3.3.

Figure 2 shows the results on MQ2007. RankCSA outperforms all baseline methods in terms of all measures. In Figure 2(a), RankCSA has a clear advantage over the four baselines with respect to P@1-6. Although RankCSA is merely competitive with RankBoost in terms of P@n for n ≥ 7, it performs much better than the other three methods. Figure 2(b) shows that RankCSA achieves significant improvements; in particular, NDCG@1 and NDCG@2 are improved by 11.6% and 10.8% over AdaRank-MAP, respectively.
Table 1: MAP on the two benchmark datasets

MAP       RankSVM   ListNet   AdaRank-MAP   AdaRank-NDCG   RankBoost   RankCSA
OHSUMED   0.44      0.4492    0.4485        0.4501         0.4514      0.4718
MQ2007    0.4647    0.4659    0.4597        0.4606         0.4671      0.4841
[Figure 1: Results on OHSUMED — (a) P@1-10, (b) NDCG@1-10, (c) the average affinity variation curve over 200 generations. Line plots comparing RankSVM, ListNet, AdaRank-MAP, AdaRank-NDCG, RankBoost and RankCSA.]

[Figure 2: Results on MQ2007 — (a) P@1-10, (b) NDCG@1-10. Line plots comparing the same six methods.]
3.3 Discussion

Based on the comparison and analysis on both datasets, we conclude that the proposed RankCSA is good at ranking relevant documents at the very top positions; for example, it shows a distinct advantage in terms of P@1 and NDCG@1. The experimental results demonstrate that RankCSA consistently improves upon the pairwise methods Ranking SVM and RankBoost. In addition, AdaRank optimizes an exponential loss function based on MAP, whereas RankCSA uses MAP as the affinity function without any relaxation or approximation; the experimental results indicate that RankCSA is more effective than AdaRank.

From the experiments, we conjecture that the results of RankCSA might be further improved given more generations and larger populations. However, the computational cost of RankCSA is already high, so there is a trade-off between performance and time complexity.
4. CONCLUSIONS

We have proposed a learning method called RankCSA for learning to rank in IR. RankCSA employs the clonal selection algorithm to learn an effective ranking function by combining various evidences in IR. Our method avoids optimizing surrogate loss functions of IR measures; instead, it directly optimizes the IR measures themselves. We conducted experiments on the LETOR benchmark datasets, comparing RankCSA with four state-of-the-art learning methods: RankSVM, ListNet, AdaRank and RankBoost. The results show that RankCSA yields consistent improvements over the baseline methods in most cases.

5. ACKNOWLEDGEMENTS

This work is supported by the Natural Science Foundation of China (60970047), the Natural Science Foundation of Shandong Province (Y2008G19) and the Key Science-Technology Project of Shandong Province (2007GG10001002, 2008GG10001026).

6. REFERENCES

[1] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of ICML '07. ACM, 2007.
[2] H. de Almeida, M. Gonçalves, M. Cristo, and P. Calado. A combined component approach for finding collection-adapted ranking functions based on genetic programming. In Proceedings of SIGIR '07. ACM, 2007.
[3] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933-969, 2003.
[4] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA, 2000.
[5] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the SIGIR '07 Workshop on Learning to Rank for Information Retrieval, 2007.
[6] P. Musilek, A. Lau, M. Reformat, and L. Wyard-Scott. Immune programming. Information Sciences, 176(8):972-1002, 2006.
[7] T. Qin, T. Liu, and H. Li. A general approximation framework for direct optimization of information retrieval measures. Information Retrieval, 13(4):375-397, 2010.
[8] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In Proceedings of WSDM '08. ACM, 2008.
[9] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In Proceedings of SIGIR '07. ACM, 2007.
[10] J. Yeh, J. Lin, H. Ke, and W. Yang. Learning to rank for information retrieval using genetic programming. In Proceedings of the SIGIR '07 Workshop on Learning to Rank for Information Retrieval, 2007.