Instance Ranking with Multiple Linear Regression: Pointwise vs. Listwise Approaches

João Brito
INESC TEC L.A., Porto, Portugal
[email protected]

João Mendes-Moreira
Dept. Eng. Informática, Fac. Engenharia, U.Porto; LIAAD-INESC TEC L.A., Porto, Portugal
[email protected]
Abstract—This paper presents a comparison between listwise and pointwise approaches to instance ranking using Multiple Linear Regression models. A theoretical review of both approaches, including their evaluation methods, is performed. Experiments on seven datasets from four different problems show that the pointwise approach is slightly better than, or similar to, the listwise approach. However, the models obtained with the listwise approach are more interpretable because they have, on average, fewer features than the models obtained with the pointwise approach. These results are relevant for problems where interpretable ranking models are necessary.

Keywords—instance ranking, Multiple Linear Regression, rank regression, evaluation functions.
I. INTRODUCTION
Instance ranking aims to compute the order of a given set of instances according to their features. It can be used in different kinds of problems, such as ranking web pages according to their significance for a given subject, or ranking stock titles according to their expected profitability. Three different approaches can be used for instance ranking: pointwise, pairwise and listwise. Here we discuss only the pointwise and listwise approaches. Pointwise approaches first predict the score associated with each instance and afterwards rank the instances based on those predictions. Listwise approaches predict the ranking order of the instances directly, i.e., in a single step. There are several use cases employing different algorithms, such as Neural Networks [2] and Support Vector Machines [7], among others. However, these algorithms are not interpretable, which is an important drawback for problems where interpretability is necessary. An example of such problems is maize ear contests, where the ears should be ranked according to their expected productivity [5]. In such problems the obtained model should be fully interpretable, both in giving a full explanation of how the ear characteristics affect productivity and in the simplicity of the resulting equation; otherwise farmers will not use the information given by the ranking equation to improve the way they select ears for the next season. More details about this kind of problem can be found in [5]. The motivation of this work is to compare the pointwise and listwise approaches using a fully interpretable model, namely Multiple Linear Regression (MLR), a simple yet powerful approach for many different problems where the target variable is quantitative.
The next section describes the instance ranking approaches, namely the pointwise and listwise ones, followed by a description of multiple linear regression and of the ranking evaluation measures. Then the experimental setup, the results and their discussion are presented, before the conclusions.

II. APPROACHES
A. Pointwise Approach

The general purpose of the pointwise approach is to make predictions for all instances and then sort them according to the predicted values. Consider as an example a vector E = {e_1, e_2, e_3, ..., e_n} having associated with each position another vector EF = {ef_1, ef_2, ef_3, ..., ef_n}, which contains the list of m features for each of the n elements. Since this approach ranks instances by predicted values, first the values for all instances are computed using a previously built model; then the ranking function sorts the instances according to the predicted values. In this approach the goal is to minimize the prediction error. To accomplish this, the model chosen to make predictions is the one that achieves the best error under the evaluation function presented later for this approach.

B. Listwise Approach

In order to perform instance ranking on lists, some context definitions are needed. In [3] the listwise setting is described using document retrieval as an example of instance ranking. The idea is to use a query set and its associated features. For that purpose a query set with m queries is defined as Q = {q^(1), q^(2), ..., q^(m)}. Each entry q^(i) in Q has n^(i) associated documents d^(i) = {d_1^(i), d_2^(i), ..., d_{n^(i)}^(i)}. For each document there is an associated feature vector and score, denoted x_j^(i) and y_j^(i) respectively (Figure 1).
A ranking function f is used to score the feature vectors x_j^(i) of each query q^(i), resulting in a score list z^(i) over the feature vectors. The goal is to minimize a given evaluation measure, such as the Normalized Discounted Cumulated Gain (presented in the following section).
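The pointwise predict-then-sort idea above can be sketched in a few lines. This is an illustrative Python sketch (the paper's experiments were run in R); the `SumModel` class and function names are ours, standing in for any fitted prediction model:

```python
import numpy as np

def pointwise_rank(model, X):
    """Pointwise ranking: predict a score for every instance with a
    previously built model, then sort the instances by predicted score."""
    scores = model.predict(X)
    order = np.argsort(-scores)   # indices in descending order of score
    return order, scores[order]

class SumModel:
    """Toy stand-in for a fitted MLR model: score = sum of features."""
    def predict(self, X):
        return X.sum(axis=1)

X = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]])
order, ranked_scores = pointwise_rank(SumModel(), X)
print(order)   # [1 0 2]: instance 1 has the highest predicted score
```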
III. ARCHITECTURE
This section presents an overview of the predictive model, Multiple Linear Regression (MLR), and the evaluation measures considered to study the two approaches described in section II.
CISTI 2014 | 1040
MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²   (5)

The error is given by the average quadratic distance between the predicted values (Ŷ_i) and the real values (Y_i), for all i = 1, ..., n. This measure is frequently used as an evaluation error in regression problems.

2) Normalized Discounted Cumulated Gain: let us first consider the definitions of Cumulated Gain (CG) and Discounted Cumulated Gain (DCG) [6] (equations 6 and 7). Let G be a vector with gain values and b the base of the logarithm:

CG[i] = G[1],             if i = 1
CG[i] = CG[i−1] + G[i],   otherwise   (6)

DCG[i] = CG[i],                        if i < b
DCG[i] = DCG[i−1] + G[i] / log_b(i),   if i ≥ b   (7)
Figure 1. The listwise approach.
A. Prediction model with Multiple Linear Regression

Simple linear regression [4] fits a straight line to the input data. Since it is a linear model, the result is a linear equation with one response Y and one predictor variable X, where α, β and ε represent, respectively, the value of Y when X = 0, the slope of the line, and the random error (equation 1). These values are unknown and must be estimated from the data: the estimates of α and β are α̂ and β̂, respectively. ε is the random error and, consequently, cannot be estimated.

Y = α + βX + ε   (1)

The estimation is usually done with the least squares method. This method estimates the coefficients α̂ and β̂ by minimizing the global error from the training data to the line, applying equations 2 and 3 to the data points (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_s, y_s). In these equations x̄ = AVG(x) and ȳ = AVG(y):

β̂ = Σ_{i=1}^{s} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{s} (x_i − x̄)²   (2)

α̂ = ȳ − β̂ x̄   (3)

Multiple Linear Regression [4] uses several predictor variables (equation 4) instead of only one, as in simple linear regression. Consequently, the number of β coefficients equals the number of predictor variables (n). To estimate the β̂_i (i = 1, ..., n) values, the same least squares method can be used.

Y = α + β_1 X_1 + ... + β_n X_n + ε   (4)
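Equations 2 and 3, and the least-squares fit of equation 4, can be checked numerically. Below is an illustrative Python sketch (not the authors' R code); the function names are ours:

```python
import numpy as np

def simple_lr(x, y):
    """Least-squares estimates for Y = alpha + beta*X + eps (eqs 2 and 3)."""
    x_bar, y_bar = x.mean(), y.mean()
    beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def multiple_lr(X, y):
    """Least-squares estimates for Y = alpha + beta_1*X_1 + ... + beta_n*X_n + eps
    (eq 4), solved via np.linalg.lstsq with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[0], coeffs[1:]   # (alpha_hat, array of beta_hats)

# Noise-free sanity check: y = 2 + 3*x should recover alpha = 2, beta = 3
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x
print(simple_lr(x, y))   # (2.0, 3.0) up to floating-point error
```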
B. Evaluation Measures

For the approaches presented in Section II, two evaluation measures were used: (1) mean squared error, for the pointwise approach; and (2) Normalized Discounted Cumulated Gain (NDCG), for both the listwise and the pointwise approaches.

1) Mean squared error: this evaluation measure is used to minimize the prediction error in the pointwise approach (equation 5).
The best possible solution for the listwise approach is when the predicted values are in the correct order, i.e., when they agree with the ideal CG and DCG vectors (denoted CG_I and DCG_I, respectively). In order to compare results between different problems, the DCG can be normalized using the ideal values, as shown in equation 8:

NDCG[i] = DCG[i] / DCG_I[i]   (8)
As an example, consider the logarithm base b = 2 and the vector V = {5; 3; 4; 1; 6; 2}. By definition:

1) CG(V) = {5; 8; 12; 13; 19; 21}
2) CG_I(V) = {6; 11; 15; 18; 20; 21}
3) DCG(V) = {5; 8; 10.52; 11.02; 13.60; 14.38}
4) DCG_I(V) = {6; 11; 13.52; 15.02; 15.88; 16.27}
5) NDCG(V) = {0.83; 0.72; 0.77; 0.73; 0.85; 0.88}

The value used in the listwise approach is the average of the NDCG (average(NDCG) = 0.80 in this example).
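Equations 6–8 and the worked example can be reproduced numerically. The following is an illustrative Python sketch (the paper's experiments were run in R); the function names are ours:

```python
import math

def cg(gains):
    """Cumulated Gain (eq 6): running sum of the gain vector."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains, b=2):
    """Discounted Cumulated Gain (eq 7): gains at 1-based positions
    i >= b are discounted by log_b(i)."""
    out = []
    for i, g in enumerate(gains, start=1):
        if i < b:
            out.append(cg(gains)[i - 1])      # DCG[i] = CG[i] for i < b
        else:
            out.append(out[-1] + g / math.log(i, b))
    return out

def ndcg(gains, b=2):
    """NDCG (eq 8): DCG normalized by the DCG of the ideally ordered gains."""
    ideal = sorted(gains, reverse=True)
    return [d / di for d, di in zip(dcg(gains, b), dcg(ideal, b))]

V = [5, 3, 4, 1, 6, 2]
# Per-position NDCG; the example in the text truncates instead of rounding
print([round(v, 2) for v in ndcg(V)])
print(round(sum(ndcg(V)) / len(V), 2))   # average NDCG = 0.8, as in the example
```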
IV. EXPERIMENTAL SETUP
This section describes the data and the procedures used to test the pointwise and listwise approaches.

A. Data Sets

Table 1 presents the datasets used in the experiments. Only public datasets were used. Since the "Energy Efficiency" and "Parkinson Telemonitoring" datasets have two response variables, each was used twice, once per response variable. The "Wine Quality" dataset was divided into two categories, (1) red and (2) white wines, and was consequently also used twice in the experiments. The LETOR dataset is already partitioned into 5 folds, each with train, validation and test sets. Due to memory and time limitations and performance issues in R (no parallelization of the algorithm), only 1/100 of the data of each of the 5 folds was used.
TABLE 1. DATASET DESCRIPTION

Name                               | Number of instances  | Number of attributes | Source
Energy Efficiency                  | 768                  | 8                    | [1]
Parkinson Telemonitoring           | 5875                 | 26                   | [1]
Wine quality                       | 4898                 | 12                   | [1]
Microsoft Learning to Rank (LETOR) | Approximately 30 000 | 135                  | see footnote 1
B. Implementation overview

The procedure to compare the listwise and pointwise approaches (Figure 2) can be split into four steps: (1) folder creation; (2) folder selection; (3) model creation; and (4) model testing.
Figure 3. 10-fold cross validation with a validation and a test set.
Figure 2. Experimental setup.
1) Folder Creation: in this step 10-fold Cross-Validation (CV) is used, i.e., the input dataset is randomly divided into ten disjoint folders, all of approximately the same size.

2) Folder Selection: the folder selection step is repeated for each iteration of the CV process (Figure 3). In each iteration, one folder is selected for test, another is randomly selected for validation, and the remaining eight folders are used for model training. The roles of the validation and test folders are explained in the next sections.

3) Model Creation: to compare the listwise and pointwise approaches, two models are created in each iteration of the CV process: (1) one using the pointwise approach and (2) the other using the listwise approach. The model creation process is the same for both approaches with one exception: the error evaluation function. To select the best attributes for the model, the forward-selection process uses a different evaluation function according to the approach. For PW, the forward-selection process selects the attribute that minimizes the MSE; for LW, the chosen attribute is the one that maximizes the global gain in NDCG. In this step the error is calculated on the validation folder. The process stops when there are no more attributes to select or when none of the remaining attributes improves the evaluation value.
Figure 4. Pseudo-code for Model Creation.
The forward-selection process is repeated for all 10 folders; consequently, 20 models are obtained for each dataset at the model creation step (10 for the LW and 10 for the PW approach). Once again the LETOR dataset is an exception: as explained in Section IV.A, this dataset is already partitioned into 5 folds, so for this dataset only 10 models are obtained (5 for PW and 5 for LW) and steps 1 and 2 (folder creation and selection) are not performed.
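The greedy forward-selection loop described above can be sketched as follows. This is an illustrative Python sketch of the PW variant (validation MSE as the criterion; the LW variant would maximize NDCG instead); the helper names and toy data are ours, not the authors' R implementation:

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_va):
    """Fit an MLR by least squares on the selected columns and predict
    the validation set."""
    A_tr = np.column_stack([np.ones(len(y_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_va = np.column_stack([np.ones(X_va.shape[0]), X_va])
    return A_va @ coef

def forward_selection(X_tr, y_tr, X_va, y_va):
    """At each step add the attribute that most reduces the validation MSE;
    stop when no remaining attribute improves it."""
    selected, best_err = [], np.inf
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        errs = {j: np.mean((y_va - fit_predict(X_tr[:, selected + [j]], y_tr,
                                               X_va[:, selected + [j]])) ** 2)
                for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:   # no improvement -> stop
            break
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_err

# Toy data: y depends only on feature 0; features 1 and 2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.01, size=100)
sel, err = forward_selection(X[:60], y[:60], X[60:], y[60:])
print(sel)   # feature 0 is selected first
```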
1 This dataset was downloaded from the Microsoft Research website: http://research.microsoft.com/en-us/projects/mslr/
C. Model Test

To make a comparison between the LW and PW approaches possible, the final evaluation of the models has to be performed with the same error evaluation function. The function used in this step is therefore NDCG, since the objective of the comparison is to know how accurate the global ranking is. With this purpose, in this step the NDCG is calculated for each created model (PW and LW) on the values of the corresponding test set.

V. RESULTS AND DISCUSSION
Table 2 shows the results for both approaches (listwise and pointwise) using the datasets presented in Table 1. The results in this table were obtained over five runs per dataset. For each run, the train, validation and test folders are created with new values of the dataset. In order to get more results to compare, a transformation was also applied to the datasets: for each dataset, the values of the response variable were replaced by their corresponding ranking values. For example, if the original response variable values were V = {10; 20.15; 3.2; 5}, the transformation result is V' = {3; 4; 1; 2}. In Tables 2 and 3 the results for the original values are presented in column LW RV (listwise, real values) and the ranking ones in column LW RANK (listwise, rank).

TABLE 2. RESULTS

Dataset              | LW RV: ndcg train / test | LW RANK: ndcg train / test | PW: mse train / ndcg test
Wine (white)         | 0.911 / 0.908            | 0.821 / 0.931              | 0.120 / 0.935
Wine (red)           | 0.938 / 0.926            | 0.849 / 0.939              | 0.113 / 0.947
Parkinson (motor)    | 0.828 / 0.783            | 0.757 / 0.781              | 0.328 / 0.798
Parkinson (total)    | 0.828 / 0.783            | 0.757 / 0.781              | 0.328 / 0.798
Energy Efficiency Y1 | 0.939 / 0.930            | 0.919 / 0.918              | 0.114 / 0.973
Energy Efficiency Y2 | 0.939 / 0.930            | 0.919 / 0.918              | 0.114 / 0.973
LETOR                | 0.589 / 0.538            | 0.808 / 0.512              | 1.686 / 0.568
Averages             | 0.853 / 0.828            | 0.833 / 0.826              | 0.400 / 0.856

The fewer variables a Multiple Linear Regression model has, the easier its interpretation is. Table 3 shows that the listwise approach uses, on average, fewer variables than the pointwise approach, which facilitates the interpretation of the listwise models. In order to better understand how the listwise and pointwise approaches behave comparatively, additional experiments were done. The left panels of Figures 5–16 show the results of the pointwise approach on the validation set, while the right panels show the results of the listwise approach; the same validation set is used in the left and right panels of each figure. The odd-numbered figures show the results when the feature selection process follows the listwise approach (the right panels), while in the even-numbered figures the feature selection process follows the pointwise approach (the left panels). The figures show that both approaches converge fast. However, while the listwise approach finishes early, the pointwise approach reaches a plateau where adding new variables decreases the error only slowly; in these situations the addition of variables continues without meaningful gains. This is easily observable in the left panels of the even-numbered figures (Figures 5–16). The right panels of the same figures show that the listwise approach has a somewhat erratic behavior. The odd-numbered figures are less informative because, when the listwise approach finishes the variable selection process, the pointwise approach still shows meaningful gains. The pointwise approach could certainly benefit from a threshold for stopping the variable selection process sooner; listwise approaches do not need one.

TABLE 3. NUMBER OF SELECTED VARIABLES

Dataset              | LW RV | LW RANK | PW
Wine (white)         | 2.86  | 2.80    | 7.22
Wine (red)           | 2.86  | 1.56    | 7.72
Parkinson (motor)    | 1.82  | 1.76    | 8.64
Parkinson (total)    | 1.28  | 1.78    | 7.94
Energy Efficiency Y1 | 2.98  | 3.24    | 5.50
Energy Efficiency Y2 | 1.56  | 2.72    | 5.74
LETOR                | 10.8  | 1       | 36.2
Average              | 2.227 | 2.310   | 7.127
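The rank transformation applied to the response variable (replacing each value by its ascending rank) can be sketched as follows; this is an illustrative Python snippet, not the authors' R code:

```python
import numpy as np

def to_ranks(values):
    """Replace response values by their ascending ranks (1 = smallest),
    as in the LW RANK transformation: {10; 20.15; 3.2; 5} -> {3; 4; 1; 2}."""
    order = np.argsort(values)              # indices that sort ascending
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

print(to_ranks(np.array([10.0, 20.15, 3.2, 5.0])))   # [3 4 1 2]
```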
As explained in Section IV.B.3, each approach has a distinct error evaluation function, so the values presented in the "training" columns of Table 2 depend on the approach used. For LW RV and LW RANK, the values are the average of the NDCG evaluations on the validation set. For the pointwise approach (PW column), the coefficient of variation (CoV), calculated according to equations 9 and 10, is presented:

AVG = (1/n) Σ_{i=1}^{n} X_i   (9)

CoV = sqrt(MSE) / AVG   (10)

In the "ndcg test" columns, the results of the test phase are presented. Here the NDCG evaluation method is used for every approach (LW RV, LW RANK and PW), using the test folder values to evaluate the previously created models.
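Equations 9 and 10 amount to a single scale-free quantity; a minimal Python sketch (our function name, not the authors' code):

```python
import math

def cov(mse, y):
    """Coefficient of variation (eqs 9 and 10): sqrt(MSE) divided by the
    average of the response values, making PW training errors scale-free."""
    avg = sum(y) / len(y)
    return math.sqrt(mse) / avg

print(round(cov(0.25, [1.0, 2.0, 3.0]), 2))   # sqrt(0.25) / 2 = 0.25
```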
VI. CONCLUSIONS
A comparison between the listwise and pointwise approaches for instance ranking was performed using Multiple Linear Regression with forward search for variable selection. The pointwise approach obtains similar or better results than the listwise approach on the seven datasets. However, as shown in Table 3, the listwise approach obtains models with fewer variables, which is meaningful in terms of interpretability. Interpretable instance ranking models are necessary in problems where the ranking is obtained from predictions and where the ranking criteria should be known.
Figure 5 – Wine white LW DP
Figure 6 – Wine white PW
Figure 7 – Wine red LW DP
Figure 8 – Wine red PW
Figure 9 – Parkinson motor LW DP
Figure 10 – Parkinson motor PW
Figure 11 – Parkinson total LW DP
Figure 12 – Parkinson total PW
Figure 13 – Energy efficiency Y1 LW DP
Figure 14 – Energy efficiency Y1 PW
Figure 15 – Energy efficiency Y2 LW DP
Figure 16 – Energy efficiency Y2 PW
Figure 17 – LETOR LW DP
Figure 18 – LETOR PW
ACKNOWLEDGMENT

This work is part-funded by project "NORTE-07-0124-FEDER-000059", a project financed by the North Portugal Regional Operational Programme (ON.2 – O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT).

REFERENCES
[1] Kevin Bache and Moshe Lichman. UCI Machine Learning Repository, 2013.
[2] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM, 2005.
[3] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM, 2007.
[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2001.
[5] Pedro M. R. Mendes-Moreira, João Mendes-Moreira, António Fernandes, Eugénio Andrade, Arnel R. Hallauer, Silas E. Pêgo and M. C. Vaz Patto. Is ear value an effective indicator for maize yield evaluation? Field Crops Research, 2014. DOI: http://dx.doi.org/10.1016/j.fcr.2014.02.015.
[6] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
[7] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA, 2000.