GPQ: Directly Optimizing Q-measure based on Genetic Programming
Yuan Lin, Hongfei Lin, Ping Zhang, Bo Xu
Information Retrieval Laboratory of DLUT, School of Computer Science and Technology, Dalian University of Technology, No. 2 LingGong Road, GanJingZi District, Dalian, China
[email protected],
[email protected], {pingzhang, xubo2011}@mail.dlut.edu.cn
ABSTRACT
Ranking plays an important role in information retrieval systems. In recent years, a line of research known as 'learning to rank', which applies machine learning techniques to ranking problems, has become increasingly popular, and many learning-to-rank models have been proposed, such as Regression, RankNet, and ListNet. Inspired by this, we propose a novel learning-to-rank algorithm named GPQ, in which genetic programming is employed to directly optimize the Q-measure evaluation metric. Experimental results on the OHSUMED benchmark dataset indicate that GPQ is competitive with Ranking SVM, SVMMAP and ListNet, and improves ranking accuracy.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models
General Terms
Algorithms, Experimentation, Theory
Keywords
Information Retrieval; Learning to Rank; Q-measure; Genetic Programming
1. INTRODUCTION
Learning to rank, which employs machine learning techniques to solve ranking problems in information retrieval, has become a popular research area in the past decade, and many learning-to-rank algorithms have been proposed. These methods can be grouped into three categories: the pointwise approach, the pairwise approach and the listwise approach. Experimental results on LETOR 3.0 indicate that the listwise approach generally performs better than the pointwise and pairwise approaches [1]. The input space of the listwise approach is the entire group of documents associated with a query, which is closer to the real ranking task. Listwise methods can be further divided into two categories. The first category measures the difference between
the permutation produced by the ranking model and the ground-truth permutation; representative algorithms include ListNet [2] and ListMLE [3]. The output space of the second category contains the relevance degrees of all the documents for a query, and its loss function is defined based on a bound of an information retrieval evaluation measure; an example is SVMMAP [4]. In addition, RankGP [5] utilizes genetic programming to optimize the Mean Average Precision metric. Similar to RankGP, we propose a novel ranking algorithm, GPQ, in which genetic programming is employed to directly optimize the Q-measure [6] evaluation metric. In our method, we regard the objective ranking function as an individual in the genetic programming process, and a group of evolution operations is then executed iteratively. A population is initialized first; during each iteration, three operations, selection, crossover and mutation, are applied to the population to generate new individuals. We use the Q-measure evaluation metric to evaluate the fitness of each individual, keeping strong individuals and discarding weak ones, so that we eventually obtain the individual with the best fitness. Unlike RankGP, we employ an artificial neural network to store the ranking model and take its encoded weights as the individual. Moreover, we use genetic programming to optimize Q-measure, a graded-relevance version of Average Precision.
The rest of this paper is organized as follows. Section 2 gives a detailed introduction to the Q-measure evaluation metric, which is optimized in our GPQ method. Section 3 describes the proposed learning-to-rank method GPQ. Experimental results and analysis are presented in Section 4. Section 5 concludes the paper and presents avenues for future work.
2. Evaluation Measures
2.1 Q-measure
The Q-measure [6] evaluation metric is a graded-relevance version of Average Precision: it inherits both the reliability of Average Precision (AP) and the multigrade relevance capability of Average Weighted Precision (AWP) [7]. To introduce Q-measure, we therefore first review these two metrics. The definition of Average Precision is well known, so we describe only AWP, which was designed for evaluation based on multigrade relevance. Let J(r) denote the relevance degree of the document at position r (with three relevance levels, we use 2 to denote relevant, 1 to denote partially relevant and 0
to denote irrelevant). If the document at position r is relevant, a gain g(r) = gain(J(r)) is generated; otherwise g(r) = 0. The cumulative gain at position r in the result list is defined as:

$cg(r) = \begin{cases} g(r) + cg(r-1), & r > 1 \\ g(1), & r = 1 \end{cases}$   (1)
Let cg*(r) denote the cumulative gain at position r in an ideal ranked list. AWP is then defined as:

$AWP = \frac{1}{R} \sum_{r=1}^{l} \mathbb{1}[J(r) > 0] \cdot \frac{cg(r)}{cg^*(r)}$   (2)

where R is the number of relevant documents, l is the size of the ranked list, and $\mathbb{1}[J(r) > 0]$ equals 1 if the document at position r is relevant and 0 otherwise.
The AWP evaluation metric looks like a natural extension of Average Precision, but it suffers from a serious problem: after rank R (the number of relevant documents), the cumulative gain cg*(r) of the ideal list becomes a constant. Consequently, AWP cannot distinguish between a system A that has a relevant document at position r (r > R) and a system B that has one at a later position r' (r' > r > R), although system A is better in terms of ranking performance. To tackle this problem, Sakai [6] proposed the Q-measure, which introduces the notion of bonused gain at position r: bg(r) = g(r) + 1 if g(r) > 0, and bg(r) = 0 otherwise. The cumulative bonused gain at position r is then:

$cbg(r) = \begin{cases} bg(r) + cbg(r-1), & r > 1 \\ bg(1), & r = 1 \end{cases}$   (3)
Hence cbg(r) = cg(r) + C(r), where C(r) is the number of relevant documents from position 1 to position r in the ranked list:

$C(r) = \sum_{k=1}^{r} \mathbb{1}[J(k) > 0]$   (4)
Then, the value of the Q-measure evaluation metric at cutoff l is:

$Q@l = \frac{1}{\min(l, R)} \sum_{r=1}^{l} \mathbb{1}[J(r) > 0] \cdot \frac{C(r) + \beta\, cg(r)}{r + \beta\, cg^*(r)}$   (5)
The value of Q-measure equals 1 when the ranked list is ideal. The parameter β is a persistence parameter: if β is set to zero, Q-measure reduces to Average Precision. Here we set β to 1. Since the Q-measure metric has these attractive properties, we employ genetic programming to optimize it directly. The next section gives a detailed description of our method GPQ.
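To make equations (1)-(5) concrete, here is a minimal Python sketch that computes Q@l from a ranked list of relevance grades. It is an illustration only: the gain mapping gains = {2: 3, 1: 2, 0: 0} is our assumption, as the paper does not state which gain values it uses.

def q_measure(judged, ideal, beta=1.0, l=None):
    # judged: grades J(r) in system ranking order
    # ideal: the same grades sorted in descending order (ideal ranking)
    gains = {2: 3, 1: 2, 0: 0}            # assumed gain(J(r)) mapping
    R = sum(1 for j in ideal if j > 0)    # number of relevant documents
    if R == 0:
        return 0.0
    l = len(judged) if l is None else l
    cg = cg_star = C = score = 0.0
    for r, (j, j_star) in enumerate(zip(judged, ideal), start=1):
        if r > l:
            break
        cg += gains[j]                    # cumulative gain, eq. (1)
        cg_star += gains[j_star]          # ideal cumulative gain
        if j > 0:                         # only relevant positions contribute
            C += 1                        # count of relevant docs, eq. (4)
            score += (C + beta * cg) / (r + beta * cg_star)   # eq. (5) summand
    return score / min(l, R)

With beta = 0 each summand degenerates to C(r)/r, i.e. precision at a relevant position, so the sum reduces to Average Precision, matching the description above.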
3. Learning Method: GPQ
Genetic Programming (GP) is an evolutionary methodology, inspired by biological evolution, for finding computer programs that perform a user-defined task. In genetic programming, a population consisting of a group of individuals is initialized first; the whole evolution process is then a series of iterative operations. At each iteration, three genetic operations are applied to individuals in the population: selection, crossover and mutation. Fitness is evaluated by a user-defined metric and measures how good a potential solution is, and the process eventually produces an individual with the best fitness. Genetic programming has been applied to many areas, such as combinatorial optimization, machine learning and image processing. In this paper, we propose a novel ranking algorithm, GPQ, in which genetic programming is employed to directly optimize the Q-measure evaluation metric. Figure 1 summarizes the procedure of GPQ; its key steps are population initialization, fitness calculation, crossover, mutation and selection, which we describe in the following subsections.
GPQ: Employing Genetic Programming to directly optimize Q-measure
Input 1: Training set S, validation set V
Input 2: Artificial neural network parameters: InputNum, HidNum, Ihthreshold, Hothreshold
Input 3: Genetic programming parameters: Generations, Psize, Pcrossover, Pmutation
(1) Initialize population P with a set of Psize randomly generated individuals and calculate the fitness of each individual
(2) Execute crossover with probability Pcrossover and mutation with probability Pmutation, and calculate the fitness of the new individuals
(3) Sort the individuals by fitness in descending order, keep the first Psize individuals in population P and remove the others
(4) Repeat steps (2)-(3) until the number of iterations equals Generations
(5) Calculate the fitness of each individual on the validation set V, and compute the final fitness of each individual (fitness on S plus fitness on V)
(6) Sort the individuals in descending order and return the first one (the individual with the best fitness)
Output: The ranking model (an artificial neural network) with the best fitness
Fig. 1. The workflow of GPQ
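To make Figure 1's steps (1)-(6) concrete, here is a minimal Python sketch of the driver loop. The helper functions random_individual, crossover, mutate and fitness are placeholders for the components described in Sections 3.1-3.3 (sketched there); the default parameter values follow Table 1.

import random

def gpq(S, V, psize=300, generations=100, p_cross=0.95, p_mut=0.05):
    population = [random_individual() for _ in range(psize)]      # step (1)
    for _ in range(generations):                                  # step (4)
        offspring = []
        for a, b in zip(population[::2], population[1::2]):       # step (2)
            if random.random() < p_cross:
                offspring.extend(crossover(a, b))
        offspring += [mutate(ind) for ind in population
                      if random.random() < p_mut]
        candidates = population + offspring                       # step (3)
        # a real implementation would cache fitness values instead of recomputing
        candidates.sort(key=lambda ind: fitness(ind, S), reverse=True)
        population = candidates[:psize]
    # steps (5)-(6): final selection uses fitness on S plus fitness on V, eq. (7)
    return max(population, key=lambda ind: fitness(ind, S) + fitness(ind, V))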
The training set S and the validation set V are standard learning-to-rank datasets, each containing a set of queries; each query is associated with a group of documents and their relevance judgments. In GPQ, a potential ranking model (an individual) corresponds to an artificial neural network (ANN). For convenient computation, an individual is represented as an array of real numbers, each element of which is one weight of the ANN, so an encoding and a decoding step are needed between the two representations. We use a three-layer ANN, and the length of an individual is determined by the numbers of input and hidden units: length(Individual) = InputNum*HidNum + HidNum.

3.1 Population Initialization
In GPQ, we first randomly initialize a set of individuals. To generate an individual, we actually initialize an artificial neural network: the weights between the input layer and the hidden layer are drawn uniformly from (-Ihthreshold, +Ihthreshold), and the weights between the hidden layer and the output layer from (-Hothreshold, +Hothreshold). Each ANN is then encoded into an individual, and we ensure that the population P contains Psize individuals.
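A minimal sketch of this encoding and initialization under the Table 1 settings (45 input features, HidNum = InputNum/5; the single output unit is implied by the length formula above):

import random

INPUT_NUM = 45                     # feature dimension on OHSUMED (Table 1)
HID_NUM = INPUT_NUM // 5           # HidNum = 1/5 * InputNum
IH_T, HO_T = 0.01, 0.5             # Ihthreshold, Hothreshold

def random_individual():
    # flat encoding: InputNum*HidNum input->hidden weights,
    # followed by HidNum hidden->output weights
    ih = [random.uniform(-IH_T, IH_T) for _ in range(INPUT_NUM * HID_NUM)]
    ho = [random.uniform(-HO_T, HO_T) for _ in range(HID_NUM)]
    return ih + ho                 # length = InputNum*HidNum + HidNum

population = [random_individual() for _ in range(300)]   # Psize = 300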
3.2 Fitness Calculation
The fitness of an individual is based on a user-defined metric and is used to evaluate the ranking performance of the corresponding ranking model. The Q-measure evaluation metric inherits both the reliability of
Average Precision and the multigrade relevance capability of Average Weighted Precision, and it supports evaluation with multi-level relevance degrees; in the rest of this paper we use the letter 'Q' to denote it. Hence, we take Q-measure as the fitness function in our GPQ method. When a fitness calculation begins, we first decode the individual into the weights of the ANN. Then the training data (or validation data) are fed to the ANN, which outputs a score for each document. Finally, the value of Q-measure, averaged over all queries, is taken as the fitness.
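Continuing the sketch, fitness decodes an individual back into the ANN and averages Q over queries. The query data layout ({qid: (docs, grades)}) and the linear activations are our assumptions; the paper does not specify the ANN's activation function.

def score_documents(individual, docs):
    # decode the flat weight list into the ANN and score each document
    ih, ho = individual[:INPUT_NUM * HID_NUM], individual[INPUT_NUM * HID_NUM:]
    scores = []
    for x in docs:                         # x: list of 45 feature values
        hidden = [sum(w * xi for w, xi in
                      zip(ih[h * INPUT_NUM:(h + 1) * INPUT_NUM], x))
                  for h in range(HID_NUM)]
        scores.append(sum(w * v for w, v in zip(ho, hidden)))
    return scores

def fitness(individual, queries):
    # mean Q-measure over all queries; queries: {qid: (docs, grades)}
    total = 0.0
    for docs, grades in queries.values():
        ranked = sorted(zip(score_documents(individual, docs), grades),
                        reverse=True)      # rank documents by ANN score
        total += q_measure([g for _, g in ranked], sorted(grades, reverse=True))
    return total / len(queries)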
3.3 Evolution Operations
GPQ uses three kinds of evolution operations: crossover, mutation and selection. In the crossover process, every pair of individuals undergoes crossover with probability Pcrossover, using arithmetic crossover to generate new individuals. For original individuals A and B, the new individuals C and D are:

$C = \alpha A + (1 - \alpha) B$, $D = (1 - \alpha) A + \alpha B$   (6)

where the parameter α is a real number between 0.0 and 1.0. The new individuals generated by crossover are added to the original population. In the mutation process, each original individual undergoes mutation with probability Pmutation; if an individual meets the mutation condition, a new random weight is written to a random position of the individual. The selection operation, by contrast, is very simple: the first Psize individuals are kept. When selecting the best individual over the training set S and the validation set V, besides calculating the fitness of all individuals on S, we also calculate it on V, and we use the simple sum of the two Q-measure values:

$FinalScore(I) = Q\text{-}measure(S) + Q\text{-}measure(V)$   (7)
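A sketch of these operations follows; that mutation redraws the chosen weight within its layer's original threshold range is our reading, since the paper only says a new weight is set.

def crossover(a, b):
    # arithmetic crossover, eq. (6), with a random alpha in [0, 1)
    alpha = random.random()
    c = [alpha * wa + (1 - alpha) * wb for wa, wb in zip(a, b)]
    d = [(1 - alpha) * wa + alpha * wb for wa, wb in zip(a, b)]
    return c, d

def mutate(ind):
    # overwrite one randomly chosen weight with a fresh random value
    child = list(ind)
    pos = random.randrange(len(child))
    bound = IH_T if pos < INPUT_NUM * HID_NUM else HO_T   # respect layer thresholds
    child[pos] = random.uniform(-bound, bound)
    return child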
4. Experiments
4.1 Experiment Settings
We evaluate GPQ on OHSUMED, a benchmark learning-to-rank dataset in LETOR 3.0, released by Microsoft Research Asia [8]. OHSUMED is a subset of MEDLINE, a database of medical publications, consisting of 348,566 records from 270 medical journals over the years 1987-1991. The query set contains 106 queries, with 11,303 irrelevant and 4,837 relevant documents in total, and each query-document pair is represented by 45 features. The relevance degree of a document with respect to a query falls into one of three levels: definitely relevant, partially relevant or irrelevant. Table 1 lists the parameters of GPQ, which were set empirically. We compare the experimental results with state-of-the-art learning-to-rank algorithms: Ranking SVM, ListNet and SVMMAP.

Table 1. The parameters for the proposed algorithm GPQ

Name          Value             Name          Value
InputNum      45                Ihthreshold   0.01
HidNum        1/5*(InputNum)    Hothreshold   0.5
Generations   100               Pcrossover    0.95
Psize         300               Pmutation     0.05
4.2 Experimental Results
We ran GPQ 10 times on the OHSUMED dataset and obtained two groups of results. The first, denoted GPQ-Avg, averages the results over the 10 runs; the second, denoted GPQ-Bst, reports the results of the best of the 10 runs. Since population initialization in GPQ is a random process, we perform 10 runs to reduce the influence of this randomness on the experimental results.

Table 2. Comparison with other methods by MAP

Algorithms    Fold1   Fold2   Fold3   Fold4   Fold5   AvgMAP
Ranking SVM   0.3038  0.4468  0.4648  0.4990  0.4528  0.4334
ListNet       0.3464  0.4499  0.4606  0.5106  0.4611  0.4457
SVMMAP        0.3423  0.4543  0.4618  0.5179  0.4500  0.4453
GPQ-Avg       0.3415  0.4579  0.4579  0.5100  0.4609  0.4456
GPQ-Bst       0.3516  0.4797  0.4619  0.5138  0.4642  0.4542

Table 2 compares the MAP values of GPQ with the baseline algorithms. In Fold2, GPQ-Avg outperformed all the baselines with a MAP of 0.4579, and GPQ-Bst achieved the best performance on Fold1, Fold2 and Fold5. In terms of average MAP, GPQ-Avg performed better than Ranking SVM and SVMMAP and was comparable to ListNet, while GPQ-Bst achieved the best value among all the algorithms listed in Table 2. These results indicate that the proposed algorithm GPQ can improve ranking accuracy. Figure 2 shows the P@K results: GPQ-Bst obtained the best precision at almost every position, while GPQ-Avg performed comparably to ListNet and better than SVMMAP and Ranking SVM.

Fig. 2. The Average P@k Curve on OHSUMED dataset

Figure 3 presents the NDCG@K results against the baseline algorithms and shows a picture similar to Figure 2: GPQ-Bst outperformed the others and obtained the best NDCG values from position 1 to position 10, and GPQ-Avg also performed better than all the baseline algorithms, especially at NDCG@3 through NDCG@6.

Fig. 3. The Average NDCG@k Curve on OHSUMED dataset

Table 3. The t-test results on OHSUMED in terms of P@K (P@1-P@10)

p-value    Ranking SVM   ListNet   SVMMAP
GPQ-Avg    0.0002        -0.3978   0.0096
GPQ-Bst    0.0002        0.0016    0.00001

From Table 3, both GPQ-Avg and GPQ-Bst significantly improved the ranking performance compared to Ranking SVM. Compared to ListNet, GPQ-Bst obtained a significant improvement, whereas GPQ-Avg did not. In addition, GPQ-Avg improved over SVMMAP (p-value = 0.0096), and GPQ-Bst also made a significant improvement over SVMMAP. Table 4 gives a similar picture: in terms of the NDCG measure, both GPQ-Avg and GPQ-Bst significantly improved the ranking performance compared to the baseline algorithms.

Table 4. The t-test results on OHSUMED in terms of NDCG

p-value    Ranking SVM   ListNet   SVMMAP
GPQ-Avg
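The paper does not detail its significance-test setup; a plausible reading is a two-sided paired t-test over the matched P@K (or NDCG@K) values, for example with SciPy:

from scipy import stats

def significance(metric_ours, metric_baseline):
    # two-sided paired t-test over matched evaluation points (e.g. P@1-P@10)
    t_stat, p_value = stats.ttest_rel(metric_ours, metric_baseline)
    return p_value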
The main time cost of GPQ comes from the evolution operations and the calculation of fitness. Besides, the population size and the number of generations also affect the time cost: more individuals and more generations generally produce a better ranking model, so a real search system should weigh effectiveness against efficiency.
5. Conclusion
In this paper, we proposed a novel ranking algorithm, GPQ, in which genetic programming is employed to directly optimize the Q-measure evaluation metric. In GPQ, a potential ranking model corresponds to an individual in the GP algorithm, and an artificial neural network is used to store the ranking model. The evolution operations are performed iteratively, and GPQ finally yields the individual with the best fitness (the best Q-measure value); decoding this individual back into an ANN produces the final ranking model. Experimental results on the OHSUMED dataset indicate that GPQ improves ranking performance compared to state-of-the-art ranking algorithms such as Ranking SVM, ListNet and SVMMAP. Our future work will focus on directly optimizing multiple evaluation metrics using genetic programming.