An Optimization Framework for Remapping and Reweighting Noisy Relevance Labels

Yury Ustinovskiy, Valentina Fedorova, Gleb Gusev, Pavel Serdyukov
Yandex, Leo Tolstoy st. 16, Moscow, Russia
{yuraust, valya17, gleb57, pavser}@yandex-team.ru

ABSTRACT
Relevance labels are an essential part of any learning to rank framework. The rapid development of crowdsourcing platforms has led to a significant reduction in the cost of manual labeling, which makes it possible to collect very large sets of labeled documents to train a ranking algorithm. However, relevance labels acquired via crowdsourcing are typically coarse and noisy, so certain consensus models are used to measure the quality of labels and to reduce the noise. This noise is likely to affect a ranker trained on such labels, and, since none of the existing consensus models directly optimizes ranking quality, one has to apply heuristics to utilize the output of a consensus model in a ranking algorithm, e.g., to use majority voting among workers to obtain consensus labels. The major goal of this paper is to unify existing approaches to consensus modeling and noise reduction within a learning to rank framework. Namely, we present a machine learning algorithm aimed at improving the performance of a ranker trained on a crowdsourced dataset by proper remapping of labels and reweighting of samples. In the experimental part, we use several characteristics of workers/labels extracted via various consensus models in order to learn the remapping and reweighting functions. Our experiments on a large-scale dataset demonstrate that we can significantly improve state-of-the-art machine-learning algorithms by incorporating our framework.

Keywords
Learning to Rank; IR theory; Consensus Models

1. INTRODUCTION

The fundamental problem faced by many information retrieval systems is learning to rank. The aim of a ranking algorithm is to provide an ordered set of results, e.g., documents or recommendations, in response to a given query. In the last decade, various machine-learning techniques have been applied to the construction of such rankers.

In the context of Web search, ranking algorithms are typically trained with supervised methods [1], i.e., methods employing a labeled set of query-document pairs. This makes labeling such a set an indispensable step in any learning to rank framework. In the classical setting, training and test datasets are manually labeled by professional assessors, who determine the relevance of each result retrieved in response to a given query. To achieve consistency of the labels, professional assessors are usually taught to follow sophisticated and elaborate instructions. Although this approach provides high-quality training datasets and leads to an accurate evaluation of ranking algorithms, it does not scale well and turns out to be comparatively expensive.

The recent rapid take-up of crowdsourcing marketplaces, e.g., Amazon MTurk (http://www.mturk.com/), provides an alternative way of collecting relevance labels. On these marketplaces, employers post labeling tasks to be completed by a large number of workers for a monetary payment. Human expertise in this scheme turns out to be much cheaper than in the scheme with hired professional assessors. However, labels collected via crowdsourcing have a number of serious shortcomings: (1) crowd workers are usually not provided with detailed instructions like those compiled for professional assessors, since the majority of them would either refuse or fail to follow complicated guidelines; (2) partly due to this, individual workers vary greatly in the quality of their assessments; (3) a large number of workers are spammers, answering randomly or using simple quality-agnostic heuristics, see [22].

Label noise is likely to have a negative impact on machine learning algorithms. Traditionally, it is dealt with by various noise reduction techniques [6, 18]; common approaches include cleansing and weighting. Noise cleansing techniques are similar to outlier detection and amount to filtering out samples which appear to be mislabeled for some reason. With the weighting approach, none of the samples is completely discarded; instead, their impact on a machine learning algorithm is controlled by weights representing our confidence in a particular label. In both approaches, one has to use certain heuristics or assume some underlying model of noise generation.

In the setting of crowdsourced labeling, one can modify the labeling process in order to gain some evidence for each label being correct. Namely, the employers often: (1) provide facile labeling instructions, much simpler than in the case of professional assessors; (2) place 'honeypot' tasks, i.e., tasks with a known true label; (3) assign each task to multiple workers in order to evaluate and aggregate their answers.
The presence of honeypots and of multiple labels for each query-document pair in the dataset allows one to use crowd consensus models, e.g., [5, 12, 23]. These models infer a single consensus label for each task, providing more accurate labels than those generated by individual crowd workers. Consensus models make additional assumptions on the distributions of errors among labels and workers and derive quantities that estimate the probabilities of labels being correct. The simplest examples of consensus models are 'majority vote' and 'average score', which assign the most frequent/average score to each query-document pair.

In practice, crowd consensus models can be used to purify learning to rank datasets by substituting consensus labels for crowd labels or by discarding particular labels with low confidence in their quality. However, this approach is ad hoc in nature: the objective of a consensus model is the accuracy of the output labels, and by optimizing the accuracy of labels one does not necessarily optimize the quality of a ranker trained on the dataset purified by the consensus model. In fact, our experiments in Section 6 demonstrate that a straightforward utilization of consensus labels within a learning to rank algorithm results in suboptimal rankers.

There is another aspect which is usually not captured by existing consensus models. Often, assessor instructions are simplified (e.g., a 5-grade scheme is reduced to a 2-grade scheme) to make it easier to attract non-professional assessors from crowdsourcing platforms. Unfortunately, while such simplification allows one to hire and quickly instruct more workers, it also introduces a bias into their judgements, as they become much less precise and expressive. For instance, some workers are more conservative than others, so their positive labels might imply higher relevance than the positive labels of workers who assign them less reservedly. We argue that in such cases it becomes important to have an additional pre-processing mechanism for putting crowd labels on the same, more fine-grained scale.

To sum up, with the current state of consensus modeling within a learning to rank problem, many questions remain unsettled: (1) Which of the many consensus models should we use for a particular dataset? (2) How do we utilize several aspects of the 'quality' of crowd workers (e.g., experience, agreement rate with other workers, the parameters of their expertise inferred by consensus models, etc.) at once? (3) How do we take into account variance in the workers' interpretation of the assessment instructions and put their labels on the same scale?

To address these questions, we propose a learning framework that automatically assigns to each learning sample (1) its relevance value and (2) its weight, which captures the confidence in this value. These two quantities are modeled as functions of label features, which may include the outputs of various consensus models, statistics on a given task, the crowd label itself, etc. Our framework trains both functions (one for the relevance value and one for the weight). We assume that there is a background learning to rank algorithm which uses the assigned relevance values and the weights of samples. The proposed framework directly optimizes the ranking quality achieved by this background algorithm on the validation set. We refer to the two training steps as label remapping and sample (re)weighting.

With this paper we make the following contributions:

1. Utilize various approaches to consensus modeling and noise reduction in a single supervised framework;
2. Adapt reweighting techniques to the problem of learning to rank with a crowdsourced dataset;
3. Introduce the label remapping step, which assigns a relevance value to each label.

To evaluate our framework, we use two datasets. The first one is provided by Yahoo! within the Learning to Rank Challenge (http://research.microsoft.com/en-us/um/beijing/projects/letor/yahoodata.aspx), LTRCD for short. LTRCD is one of the common datasets for the evaluation of learning to rank algorithms [3]. Since all samples in LTRCD are labeled by professional assessors, we were forced to adapt this dataset to our problem: we simulate errors by randomly corrupting some labels, see Section 6 for details. The second dataset (referred to as YD) is shared with us by the commercial search engine Yandex and precisely conforms to our problem statement. Experimental results demonstrate that our framework significantly outperforms various competitive baselines.

The rest of the paper is organized as follows. Section 2 reviews prior work related to our approach. The details of the YD dataset are provided in Section 3. Section 4 gives an overview of our framework and provides a rigorous formulation of the optimization problem we are solving. Section 5 describes the solution to this optimization problem in detail and discusses possible extensions to a more general setup. In Section 6 we report the results of various experiments, showing that our approach significantly improves over our baselines. Finally, in Section 7 we discuss some features of our framework and make recommendations for directions of future research.

2. RELATED WORK

Reweighting of samples in a training dataset is a well-established method, applied in various contexts in machine learning. It is one of the classical approaches to machine learning in the presence of noise and outliers, see [18, §3]. Sample reweighting is also extensively used in transfer learning tasks, where one tries to use a labeled dataset from a source knowledge domain in order to predict/classify/rank instances in a target domain. The samples in the source and target datasets are often differently distributed, so reweighting techniques are used to reduce this difference.

In [8], Garcke and Vanck study inductive transfer with a dataset shift. They suggest supervised (DITL) and unsupervised (KLITL) instance reweighting algorithms. The idea of the reweighting optimization step in our framework is somewhat similar to the supervised algorithm DITL, since it also tunes weights in order to improve the quality on the target dataset. However, as opposed to our approach, DITL does not address the problem of noise reduction and does not use any label features. In [4], Chen et al. use feature- and instance-transfer learning for cross-domain learning to rank. They propose a heuristic method for reweighting samples in the source set. Namely, the authors train a ranking function on a small portion of the target dataset and apply it to the source dataset; a query in the source dataset is assigned a weight proportional to the number of correctly ranked pairs of documents. This work also does not deal with label noise and label features. Unlike the transfer learning task, in our case samples in the source and target dataset are drawn from the same distribution and are represented by the same set of features.
While our framework solves a completely different problem, on the technical level one of its optimization parts resembles the weight optimization method proposed by Ustinovskiy et al. in [20]. That work reweights the clicks of search engine users in order to train a personalized ranker. Apart from tackling a different problem, our framework differs in two important respects. Firstly, besides reweighting, our framework also includes a crucial label remapping step, which significantly improves the resulting quality. Secondly, the method from [20] has a serious shortcoming: it is targeted solely at simple linear background ranking algorithms, whereas our framework is extended to a much wider class of background algorithms, including ensemble methods and neural networks.

As we are going to use labels collected via crowdsourcing, we give an overview of the literature on models for crowd consensus. A usual setting in crowdsourcing is the following: each sample is labeled by multiple workers (and each worker labels several samples); these labels are called noisy labels, and the goal is to infer the best single label for each sample, which is referred to as the consensus label. Two popular consensus models are the Dawid and Skene model (DS) [5] and the Generative model of Labels, Abilities, and Difficulties (GLAD) [23]. Dawid and Skene used latent confusion matrices for workers to model their class-specific mistakes, while in GLAD the probability that a sample is labeled correctly depends on both the worker's expertise and the sample's difficulty. We describe these two models in more detail in Section 3.

There have been a number of interesting extensions of these two models. [11] imposed additional priors on confusion matrices and studied a Bayesian version of the DS model; later, [21] extended it to model communities of workers with similar confusion matrices, which is useful when each worker provides only a few labels. The works [24, 25] extended the DS idea by introducing confusion vectors to encode sample difficulty and proposed the minimax entropy principle for crowdsourcing. [16] describes a family of flexible consensus models (including DS and GLAD as special cases); models of this family can make use of additional features of samples and workers to model commonality among workers and samples. Another work [15] addresses the problem of learning a classifier when consensus labels are unknown: each sample is represented by a set of features and has multiple noisy labels, and the paper presents an approach to the joint estimation of workers' parameters and consensus labels within a classification problem. Unlike our approach, this method is unsupervised and relies on a fixed noise model for the workers. A similar set-up was used in [13], where the authors aim at estimating class-conditional label noise via unsupervised techniques. For further details on crowdsourcing see recent surveys, e.g., [17, 14].

To the best of our knowledge, this paper is the first work that directly employs the outputs of various consensus models within a learning to rank framework. In this framework, a noisy label is associated with a set of features, e.g., the likelihoods of the noisy label under various consensus models. In our empirical study, features for noisy labels are generated by the original DS and GLAD models, but any other consensus models can be used similarly.
3. DATA

Prior to formulating the problem considered in this paper, we describe the experimental data in the real-world YD dataset. The structure of this data and its features reveal the motivation for the particular problem we are solving and clarify the main ideas behind our approach.

The first dataset consists of 132K query-document pairs. Each query-document pair is assigned three binary relevance labels by three crowd workers at the crowdsourcing platform Yandex.Toloka (http://toloka.yandex.com/). The use of binary relevance labels allows us to simplify the assessment instructions and engage a larger number of workers. From these query-document pairs and crowd labels a dataset X^source consisting of 132K × 3 = 396K samples is formed. One sample in X^source is a query-document pair together with one label assigned by one crowd worker. In particular, the same query-document pair may occur in X^source with different labels. Besides these query-document pairs, there are 1900 honeypot tasks completed by the same workers (each task is completed by several workers). These are query-document pairs labeled by professional assessors following the same instructions. Usually, honeypots are used to detect and penalize spammer workers. Query-document pairs corresponding to honeypots do not get into X^source.

The second dataset T is collected in a standard manner for a learning to rank task. Namely, every query-document pair in T, which consists of 90K samples, is judged once by one professional assessor and is endowed with a label corresponding to one of the relevance classes {Perfect, Excellent, Good, Fair, Bad}. This is the dataset intended for evaluation and for supervised tuning of our framework.

Every query-document pair in both datasets T and X^source is represented by a feature vector x ∈ R^N. These are standard ranking features, including text and link relevance, query characteristics, document quality, user behavior features, etc. Our framework is quite general and any ranking features could be utilized within it; hence, the particular choice of these features is out of the scope of the paper.

Besides ranking features, the samples in the crowdsourced dataset X^source are endowed with a number of label features. These features comprise numerical information about the worker who assigned the label, about the task, and about the label itself. Our framework is applicable to any set of label features; however, we describe these features in detail, since their employment within a learning to rank algorithm is one of the main contributions of the paper. Intuitively, the purpose of label features is to provide evidence of the crowd label being correct. To generate features for labels we utilize two classical consensus models. We describe these models assuming binary labeling tasks. For a sample in X^source, let us denote by w the worker who labeled this sample and by cl^(w) ∈ {0, 1} the crowd label itself.

Dawid and Skene model (DS) [5]. It is assumed that the i-th sample has a hidden consensus label t_i, generated by a multinomial distribution with unknown parameters p = (p_1, p_2): Pr(t_i = c | p) = p_c. Each worker w has a 2 × 2 confusion matrix π^(w) whose rows describe the distributions generating noisy labels for a given value of the consensus label. A noisy label cl_i^(w) assigned by worker w to sample i is then generated by the multinomial distribution with parameters π^(w)(t_i, ·): Pr(cl_i^(w) = k | t_i = c) = π^(w)(c, k). Assuming that noisy labels assigned by different workers are independent given the value of the consensus label, and that samples are independently and identically distributed, we can define the joint likelihood of the observed noisy labels and the hidden consensus labels as a product over all samples i and the workers w who labeled them (bold characters cl, t, p, π correspond to {cl_i^(w)}, {t_i}, etc.):

   Pr(cl, t | p, π) = Π_i p_{t_i} Π_w π^(w)(t_i, cl_i^(w)).

To find the maximum likelihood estimates of p and π and the consensus labels t, we apply the Expectation-Maximization (EM) algorithm. After this, we are able to evaluate the likelihood of each noisy label:

   Pr(cl_i^(w) | p, π, t) = p_{t_i} π^(w)(t_i, cl_i^(w)).

This consensus model serves as the first source of label features.
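To make the DS machinery above concrete, the following is a compact NumPy sketch of the EM procedure for the binary Dawid-Skene model. It is our own illustration (function and variable names are ours, priors initialized from vote shares), not the implementation used to produce the label features in this paper.

```python
import numpy as np

def dawid_skene(labels, n_workers, n_iter=50):
    """EM for the binary Dawid-Skene model (illustrative sketch).

    labels : list of (sample_id, worker_id, label) triples with label in {0, 1}.
    Returns class priors p, per-worker confusion matrices pi,
    and the posterior Pr(t_i = c) for every sample.
    """
    n_samples = max(s for s, _, _ in labels) + 1
    # Initialise the posterior over consensus labels from per-sample vote shares.
    counts = np.zeros((n_samples, 2))
    for s, _, l in labels:
        counts[s, l] += 1
    post = (counts + 1e-9) / (counts.sum(1, keepdims=True) + 2e-9)

    for _ in range(n_iter):
        # M-step: class priors and confusion matrices from expected counts.
        p = post.mean(axis=0)
        pi = np.full((n_workers, 2, 2), 1e-6)
        for s, w, l in labels:
            pi[w, :, l] += post[s]
        pi /= pi.sum(axis=2, keepdims=True)
        # E-step: recompute Pr(t_i = c | observed labels) up to normalisation.
        log_post = np.tile(np.log(p), (n_samples, 1))
        for s, w, l in labels:
            log_post[s] += np.log(pi[w, :, l])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return p, pi, post
```

Once p, pi and post are fitted, the per-label likelihood above and the DS-based entries of Table 1 follow by direct lookup.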
Generative model of Labels, Abilities, and Difficulties (GLAD) [23]. Again, there is a hidden consensus label t_i of sample i generated from the multinomial distribution with parameters p. Each worker w has a parameter a_w, a real number representing the worker's expertise, and each sample i has a parameter d_i ranging in (0, +∞) and representing the sample's difficulty. If worker w labels sample i, the probability that the noisy label cl_i^(w) is correct is modeled as Pr(cl_i^(w) = t_i | a_w, d_i) = 1 / (1 + exp(-a_w d_i)). By imposing the same set of assumptions on the labeling process as in the Dawid and Skene model and applying the EM algorithm, we obtain estimates for p, a, d and the consensus labels t. Under this model the likelihood of each noisy label is computed as

   Pr(cl_i^(w) | p, a, d, t) = p_{t_i} / (1 + exp(-a_w d_i))   if cl_i^(w) = t_i,
   Pr(cl_i^(w) | p, a, d, t) = p_{t_i} / (1 + exp(a_w d_i))    otherwise.

This is the source of the second set of features that we use for noisy labels.
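For illustration, the GLAD quantities used above can be evaluated as in the following sketch, assuming the parameters a_w, d_i and p have already been fitted (the EM fitting itself is omitted for brevity; all helper names are ours).

```python
import numpy as np

def glad_label_likelihood(cl, t, a_w, d_i, p):
    """Likelihood of one noisy label under GLAD, given fitted parameters.

    cl  : observed crowd label in {0, 1}
    t   : consensus label in {0, 1}
    a_w : worker expertise, d_i : sample difficulty (> 0)
    p   : class priors (p[0], p[1])
    """
    correct = 1.0 / (1.0 + np.exp(-a_w * d_i))          # Pr(cl = t | a_w, d_i)
    return p[t] * (correct if cl == t else 1.0 - correct)

def glad_posterior(labels_for_sample, d_i, a, p):
    """Posterior Pr(t_i = c) for one sample given all its crowd labels,
    with worker abilities a[w], difficulty d_i and priors p held fixed."""
    post = np.array(p, dtype=float)
    for w, cl in labels_for_sample:
        correct = 1.0 / (1.0 + np.exp(-a[w] * d_i))
        # Probability of observing cl under each candidate consensus label c in {0, 1}.
        post *= np.where(np.arange(2) == cl, correct, 1.0 - correct)
    return post / post.sum()
```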
Besides the outputs of these two models, we use several simple statistics on tasks and workers. For a worker w, let n^(w) be the number of completed tasks and n_0^(w) the number of zero-labeled documents. The complete set of raw label features is listed in Table 1. Given any raw feature f (except cl^(w)), we form two label features: f_i · I(cl^(w) = 1) and f_i · I(cl^(w) = 0), where I(·) is the indicator function. In total, M = 1 + 2 × 14 = 29 label features are constructed. Some important statistics on the dataset are provided in Table 2.

Table 1: List of raw label features.
  1      cl^(w)              the crowd label assigned by the worker
  2-5    π^(w)               confusion matrix of the worker in the DS model
  6      a_w                 worker's parameter in the GLAD model
  7-8    Pr(cl^(w) = t_i)    probability that the crowd label is correct under the DS/GLAD model
  9-10   Pr(cl^(w) = 1)      probability of a positive label under the DS/GLAD model
  11-12  Pr(cl^(w) = 0)      probability of a negative label under the DS/GLAD model
  13     log(n^(w))          logarithm of the number of tasks completed by the worker
  14     n_0^(w)/n^(w)       fraction of negative labels assigned by the worker
  15     f_hp^(w)            fraction of correctly labeled honeypots of the worker

Table 2: Statistics on the YD dataset.
  X^source:
    # of queries                          7200
    # of query-document pairs             132K
    # of samples                          396K
    # of workers                          1720
    average # of tasks per worker         233
    # of unique honeypot tasks            1900
    # of completed honeypot tasks         43300
  T:
    # of queries                          6600
    # of query-document pairs (samples)   90K
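As a concrete illustration of how the M = 29 label features are assembled from the 14 raw statistics of Table 1 and the crowd label, consider the following sketch. The numerical values are purely hypothetical; only the gating scheme f_i · I(cl^(w) = 1) and f_i · I(cl^(w) = 0) follows the construction described above.

```python
import numpy as np

def build_label_features(raw, crowd_label):
    """Expand 14 raw per-label statistics (Table 1, rows 2-15) into the
    29-dimensional label-feature vector: the crowd label itself plus each raw
    feature gated by I(cl = 1) and by I(cl = 0)."""
    pos = raw * (crowd_label == 1)   # f_i * I(cl^(w) = 1)
    neg = raw * (crowd_label == 0)   # f_i * I(cl^(w) = 0)
    return np.concatenate(([crowd_label], pos, neg))

# Hypothetical worker statistics: confusion matrix, GLAD ability, per-model
# probabilities, log #tasks, fraction of negative labels, honeypot accuracy.
raw = np.array([0.9, 0.1, 0.2, 0.8,
                1.3,
                0.85, 0.80,
                0.60, 0.55,
                0.40, 0.45,
                np.log(120), 0.4, 0.9])
y = build_label_features(raw, crowd_label=1)
assert y.shape == (29,)
```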
4. PROBLEM FORMULATION

In this section, we start with a general description of the problem we are solving. After this, we restrict the scope to a certain class of machine learning algorithms and provide a rigorous formulation within this specific class of algorithms.

4.1 General setting

As we discussed in the introduction, the subject of this paper lies at the intersection of (1) learning to rank and (2) crowd consensus modeling. The ultimate goal is to train, using the crowdsourced source dataset X^source, a ranking function F of query-document features. The overall performance of a ranking function F on the dataset T is evaluated according to some fixed ranking performance measure M (e.g., DCG, ERR). This performance measure is calculated with the use of the labels assigned by professional assessors following a different (more complex) grading scheme. Slightly abusing notation, we will write simply M(F) for the average of the metric M over all queries in T, with documents ranked according to the ranking model F.

Remark. Of course, one could use datasets labeled on a crowdsourcing platform both for training and for the evaluation of a ranking model. However, in real-life applications, it is crucial to have an evaluation methodology which is as precise and comprehensive as possible. In particular, we want to use a single evaluation scheme for all possible rankers. It is known [10] that evaluation based on graded relevance is more realistic and accurate; therefore, the employment of high-quality fine-grained assessor labels is essential. This potential difference between X^source and T is particularly evident in the experimental part, where we have only coarse binary relevance labels in X^source, as opposed to the standard 5-graded relevance labels in T.

The aim of our framework is to utilize the label features (not to be confused with the ranking features) while training a ranker F, so as to directly maximize a selected quality measure M(F). The particular choice of label features is not essential for our learning framework, so we do not discuss any specifics of crowd consensus models here.

Typically, before training a learning to rank algorithm on a crowd-labeled source set, one processes it in advance to reduce noise among the labels. This processing step may include filtration of samples with unreliable labels, aggregation of labels based on majority voting among workers, reweighting of samples, etc. To distinguish the initial source dataset X^source from the one actually used by the background learning to rank algorithm, we denote the processed dataset by X^train. The outline of the global learning scheme is depicted in Figure 1.

[Figure 1: General framework for learning to rank with a crowdsourced dataset. Diagram: X^source -> (1) -> X^train -> (2) -> ranker F -> (3) -> M(F), where step (3) uses the target dataset T and arrow (4) feeds the evaluation back into step (1). Legend: (1) source dataset processing, P (our focus); (2) background learning to rank algorithm, A; (3) evaluation of the ranker on the target dataset; (4) feedback for the optimization of step (1) (our focus).]

We focus on step (1); namely, given a fixed learning to rank algorithm at step (2) and a fixed evaluation metric at step (3), e.g., DCG, we aim at direct maximization of the ranking metric by proper processing of the source dataset X^source. More formally, let A be the learning to rank algorithm at the second step and let us denote by F = F_A(X^train) the ranker trained on the dataset X^train via the algorithm A. Finally, let P be a processing algorithm with input X^source and output X^train. With these notations, the optimization problem we are solving is

   P̂ = argmax_P M(F_A(X^train)),    (1)

where X^train = P(X^source). We treat this problem as a supervised machine learning task. Namely, the dataset T with its editorial labels is used to fit P by maximizing M(F_A(X^train)) on T. To make sense of this optimization problem, we have to specify the learning to rank algorithm A and the class of possible processing algorithms {P} within which we solve (1). We discuss them in the next section.

4.2 Our approach

Let R^N be the space of query-document features, meaning that every sample in X^source and T is characterized by a vector x ∈ R^N. These are standard ranking features, including text and link relevance, query characteristics, document quality, etc. In order to solve optimization problem (1) directly by gradient descent, we need to compute the partial derivatives of the output values of the ranking algorithm F_A(X^train) with respect to the parameters of the processing algorithm P. These derivatives do not necessarily exist, and even their finite-difference approximation might be computationally expensive, since it requires additional trainings of the algorithm F_A. However, if there is a smooth closed-form expression for the ranker F_A(X^train) in terms of the samples and features in the training set, the partial derivatives can be computed efficiently. We are aware of such a closed-form expression only for a linear, pointwise machine-learning algorithm, so from now on we assume that:

- the i-th sample in the training set X^train is equipped with a relevance value l_i and a weight w_i (as described below, the relevance value and the weight are the outputs of our data processing framework);

- the ranker F is a linear function of the query-document features: F(x_i) = x_i · b, where b ∈ R^N is a column vector of coefficients and '·' is the dot product;

- the ranker F_A = F_WMSE, trained via the algorithm A, minimizes the L2-regularized weighted mean squared error (WMSE for short), where µ is the L2-regularization parameter:

   F_WMSE = argmin_{F: F(x) = x·b}  Σ_{i∈X^train} w_i (x_i · b - l_i)^2 + µ ||b||^2.    (2)

At first glance, the minimization problem (2) appears non-conventional, since each query-document pair appears in the sum several times with presumably different labels. However, it in fact includes various standard baseline algorithms as special cases. For example, if we set each weight to 1 and each label l_i to the corresponding noisy crowd label cl^(w), this minimization problem becomes equivalent to the 'average label' baseline, which predicts the average of the labels assigned by different workers to a given query-document pair. Setting some weights to zero is equivalent to discarding the corresponding crowd labels, i.e., to data filtration. Similarly, one can implement the 'majority vote' baseline and various weighting techniques within this minimization problem.

Remark. It is well known (see [3]) that linear models are significantly outperformed by more sophisticated algorithms, e.g., neural networks, ensembles of trees, etc. In the next section, we show how our framework can be extended to cover these classes of learning models as well.
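The following minimal sketch illustrates the weighted least-squares step of Equation (2) and the 'average label' special case discussed above. The feature values are toy numbers and the helper name train_wmse is ours; it is a sketch, not the paper's implementation.

```python
import numpy as np

def train_wmse(X, labels, weights, mu):
    """Closed-form minimiser of the weighted, L2-regularised squared error (2):
    b = argmin_b sum_i w_i (x_i . b - l_i)^2 + mu * ||b||^2."""
    W = np.diag(weights)
    N = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + mu * np.eye(N), X.T @ W @ labels)

# With unit weights and the raw crowd labels, repeating each query-document pair
# once per crowd label, this reduces to the 'average label' baseline: the three
# identical rows with labels {1, 1, 0} pull the fit towards their mean 2/3.
X = np.array([[1.0, 0.2]] * 3 + [[0.3, 1.0]] * 3)   # two pairs, three labels each (toy values)
cl = np.array([1, 1, 0, 0, 0, 1], dtype=float)
b = train_wmse(X, cl, weights=np.ones(6), mu=1e-3)
print(X[[0, 3]] @ b)   # approximately the per-pair average labels [0.67, 0.33]
```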
Let us now discuss two particular problems the processing algorithm P : X^source -> X^train should solve. First, as we have discussed, the quality of crowd labeling varies significantly across workers and tasks. Sometimes our confidence in a particular crowd label in X^source is low, e.g., the worker who labeled this sample makes many mistakes on honeypots, or the label contradicts the two labels from the other workers, etc. In this case, we want it to have less impact on the trained ranker F. In the WMSE algorithm this impact is controlled by the weight w_i, so the larger our confidence in a label, the larger its weight should be. Second, some workers turn out to be more conservative than others. For instance, imagine workers w' and w'' such that w' assigns a positive label cl^(w') only to 'perfect' query-document pairs, while w'' assigns a positive label to every query-document pair unless it is completely irrelevant. Clearly, in this case, the positive label of the first worker should correspond to a greater relevance value than that of the second. The relevance value of a sample for a ranker F minimizing WMSE is reflected by l_i.

Recall that, by the initial assumptions on the source dataset X^source, every labeled sample is equipped with an M-dimensional vector y ∈ R^M of label features, see Table 1. Motivated by the above two issues, we consider processing algorithms P assigning a weight w_i and a relevance value l_i to every sample i in X^source, where w_i and l_i are both functions of the label features y_i ∈ R^M (not to be confused with the ranking features x_i ∈ R^N). Namely, let us put w_i and l_i to be sigmoid transforms of y_i:

   w_i = σ(y_i · α),    l_i = σ(y_i · β),    (3)

where α = (α_1, ..., α_M)^T and β = (β_1, ..., β_M)^T are the parameter vectors of the weight function w and the label function l, and σ(x) = 1/(1 + e^{-x}) is the sigmoid transform, which ensures that all weights and labels fall into the unit interval [0, 1]. The computations of the weights w_i and of the relevance values l_i are referred to as the reweighting and remapping steps, respectively. The class of processing algorithms under consideration forms a 2M-dimensional family with parameter vectors (α, β). We stress several important properties of this class of processing algorithms:

1. The sets of samples in X^source and X^train are essentially the same, i.e., we do not need to explicitly filter out any samples;
2. The impact of the i-th sample on the trained ranker F_A is controlled by the weight w_i; in particular, by setting w_i to zero we completely discard the i-th sample;
3. The labels cl^(w) assigned to samples in X^source by crowd workers may differ from the relevance values l_i used in the training dataset X^train.

Now let us reformulate problem (1) as an optimization problem for the parameters α and β of the processing step P:

   (α̂, β̂) = argmax_{(α,β)} M(F_A(X^train)),    (4)

where X^train = P(X^source). Since M is a locally constant function of the values of the ranker F, this is a non-convex and non-differentiable optimization problem. In the next section we adopt well-established smoothing techniques (see [1, 19]) to reduce (4) to a differentiable optimization problem and solve it using gradient descent.

5. ALGORITHM

5.1 Reweighting and label remapping learning framework

We treat optimization problem (4) as a supervised learning task. To solve it directly via gradient descent, we need to compute the gradients ∂M/∂α_j and ∂M/∂β_j. Via the chain rule one has:

   ∂M/∂α_j = Σ_{i∈T} (∂s_i/∂α_j) (∂M/∂s_i) = Σ_{i∈T} (x_i · ∂b/∂α_j) (∂M/∂s_i),
   ∂M/∂β_j = Σ_{i∈T} (∂s_i/∂β_j) (∂M/∂s_i) = Σ_{i∈T} (x_i · ∂b/∂β_j) (∂M/∂s_i),    (5)

where s_i = F(x_i) = x_i · b is the score assigned by the linear ranker F to a sample i ∈ T.

To compute the derivatives ∂M/∂s_i we, following the state-of-the-art approach to learning to rank, transform the objective metric M into a smooth one by a smoothing procedure. We have experimented with LambdaRank smoothing by Burges et al. [1] and SoftRank by Taylor et al. [19]. Our experiments have shown that these techniques result in similar quality, while LambdaRank is a much faster algorithm, so, in what follows, we use only Lambda gradients. To alleviate notation, we will write simply ∂M/∂s_i, implying the LambdaRank gradient.
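For completeness, here is a simplified sketch of how per-document lambda-gradients of DCG@k can be formed. It conveys the general idea of LambdaRank-style smoothing rather than the exact procedure of [1], and all names and constants are ours.

```python
import numpy as np

def dcg_lambda_gradients(scores, gains, k=10):
    """Rough estimate of dM/ds_i for one query with M = DCG@k: every pair (i, j)
    with gains[i] > gains[j] contributes a force proportional to the DCG change
    from swapping the two documents, damped by a logistic factor of the score gap."""
    scores = np.asarray(scores, dtype=float)
    gains = np.asarray(gains, dtype=float)
    order = np.argsort(-scores)                       # current ranking
    rank = np.empty_like(order)
    rank[order] = np.arange(len(scores))
    disc = np.where(rank < k, 1.0 / np.log2(rank + 2.0), 0.0)
    lam = np.zeros_like(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if gains[i] <= gains[j]:
                continue
            swap_gain = abs((gains[i] - gains[j]) * (disc[i] - disc[j]))
            rho = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))
            lam[i] += rho * swap_gain                 # push the better document up...
            lam[j] -= rho * swap_gain                 # ...and the worse one down
    return lam
```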
To find the derivatives ∂b/∂α_j and ∂b/∂β_j we need a bit of linear algebra. Let S be the number of samples in the source dataset X^source and let X be the S × N matrix whose i-th row x_i represents the query-document features of the i-th sample. Similarly, let Y be the S × M matrix whose i-th row y_i represents the label features of the i-th sample in X^source. Finally, let l be the column vector of labels {l_i = σ(y_i · β)} in X^train and let W = diag({w_i}_{i=1}^S) be the diagonal S × S matrix of weights. Note that, by definition, the datasets X^source and X^train have the same feature matrix X. We summarize the notation in Table 3.

Table 3: List of notations.
  X^source   dataset labeled via a crowdsourcing platform
  T          dataset labeled by professional assessors
  N          number of query-document features for samples in X^source and T
  M          number of label features for samples in X^source
  S          number of samples in X^source
  x_i        row vector of ranking (query-document) features of a sample i ∈ X^source or i ∈ T
  y_i        row vector of label features of a sample i ∈ X^source
  P          algorithm processing the raw source dataset X^source into X^train
  A          background learning to rank algorithm, used at step (2) in Figure 1
  µ          L2-regularization parameter of the WMSE algorithm
  X^train    output of the processing step P, used by the learning to rank algorithm A
  F_A        ranker trained via A
  α          M-dimensional column vector of weight parameters of P
  β          M-dimensional column vector of label parameters of P
  w_i        weight of a sample i ∈ X^train
  l_i        remapped label of a sample i ∈ X^train
  X          S × N matrix of ranking features of X^source
  Y          S × M matrix of label features of X^source
  W          S × S diagonal matrix with weights w_i
  l          S-dimensional column vector of labels l_i
  b          N-dimensional column vector of parameters of a linear ranker

The minimizer of Equation (2) has the closed-form expression F_A(x) = x · b, where

   b = (X^T W X + µ I_N)^{-1} X^T W l.    (6)

Now set Z := X^T W X + µ I_N and let l̂ := X · b be the column vector of values of the ranker F_A on X^train. Differentiating the equality X^T W l = Z b with respect to α_j, one gets:

   X^T (∂W/∂α_j) l = ∂(Zb)/∂α_j = (∂Z/∂α_j) b + Z ∂b/∂α_j = X^T (∂W/∂α_j) X b + Z ∂b/∂α_j = X^T (∂W/∂α_j) l̂ + Z ∂b/∂α_j.    (7)

Finally, a little manipulation with Equation (7) yields:

   ∂b/∂α_j = Z^{-1} X^T (∂W/∂α_j) (l - l̂) = Z^{-1} X^T diag({y_{ij} σ'(y_i · α)}_{i=1}^S) (l - l̂).    (8)

The computation of the derivatives of b with respect to β from Equation (6) is a bit simpler, since only the factor l depends on β:

   ∂b/∂β_j = Z^{-1} X^T W (∂l/∂β_j),    (9)

where ∂l/∂β_j = {y_{ij} σ'(y_i · β)}_{i=1}^S. Plugging these expressions into Equation (5), we get the derivatives of the objective function M with respect to the parameters of the processing step, α = (α_1, ..., α_M)^T (and, similarly, β = (β_1, ..., β_M)^T). The algorithm for training the processing step P is summarized in pseudocode in Algorithm 1. The initial value α^0 = (0, ..., 0)^T corresponds to all weights being equal (W = 1/2 I_S), and the initial value β^0 = (1, 0, ..., 0)^T distinguishes the label feature cl^(w).

Algorithm 1: Steepest descent optimization of the processing algorithm P.
  Input: crowdsourced dataset X^source and target dataset T labeled by professional assessors
  Parameters: number of iterations J, step size ε, L2-regularization parameter µ
  Initialization: form matrices X, Y; set α = α^0, β = β^0, l = {σ(y_i · β^0)}_{i=1}^S, W = 1/2 I_S
  for j = 0 to J do
     α^{j+1} = α^j + ε ∂M/∂α (see (5));
     β^{j+1} = β^j + ε ∂M/∂β;
     update W, l, b (see (3), (6));
  end
  for i ∈ X^source do
     w_i = σ(y_i · α^J);
     l_i = σ(y_i · β^J);
  end
  Output: X^train
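The following condenses one iteration of Algorithm 1 into NumPy under the linear WMSE assumption. It is a sketch with hypothetical helper names (e.g., dM_ds stands for any smoothed metric gradient on T, such as lambda-gradients aggregated over its queries), not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def processing_step(X, Y, Xt, gains_t, query_t, alpha, beta, mu, eps, dM_ds):
    """One gradient-ascent iteration of Algorithm 1 (illustrative sketch).

    X, Y      : ranking / label feature matrices of X^source (S x N, S x M)
    Xt, gains_t, query_t : ranking features, gains and query ids of T
    alpha, beta : current parameters of the weighting / remapping functions
    """
    w = sigmoid(Y @ alpha)                      # weights, Eq. (3)
    l = sigmoid(Y @ beta)                       # remapped labels, Eq. (3)
    Z = X.T @ (w[:, None] * X) + mu * np.eye(X.shape[1])
    b = np.linalg.solve(Z, X.T @ (w * l))       # closed-form ranker, Eq. (6)
    l_hat = X @ b
    grad_s = dM_ds(Xt @ b, gains_t, query_t)    # smoothed dM/ds_i on the target set T

    # Chain rule (5) with the closed-form derivatives (8) and (9).
    dw = Y * (w * (1.0 - w))[:, None]           # dw_i/dalpha_j = y_ij * sigma'(y_i . alpha)
    dl = Y * (l * (1.0 - l))[:, None]
    db_dalpha = np.linalg.solve(Z, X.T @ (dw * (l - l_hat)[:, None]))   # Eq. (8), N x M
    db_dbeta = np.linalg.solve(Z, X.T @ (w[:, None] * dl))              # Eq. (9), N x M
    grad_alpha = db_dalpha.T @ (Xt.T @ grad_s)
    grad_beta = db_dbeta.T @ (Xt.T @ grad_s)
    return alpha + eps * grad_alpha, beta + eps * grad_beta
```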
5.2 Extension to other background learning-to-rank algorithms

So far, the background algorithm considered in our approach (Equation (2)) was assumed to be linear. This is a very strong and unrealistic assumption, since in the vast majority of regression/classification/ranking problems linear algorithms are outperformed by more sophisticated models, e.g., decision trees, ensembles of trees, neural networks, etc. We explain how to modify the learning scheme fixed in Section 4 in order to relax the linearity assumption and incorporate some of these more complicated models. One of the main drawbacks of linear algorithms is their failure to leverage complex dependencies between individual features. To overcome it, we suggest improving algorithm (2) by transforming the feature space in order to make it more tractable for a linear model. To be more precise, let us consider any algorithm A which trains an ensemble of weak learners:

   F_A(x) = Σ_{j=0}^{T} f_j^weak(x),    (10)

where each weak learner f_j^weak is a function on the feature space R^N. Note that the class of models in Equation (10) includes boosted decision trees [7], polynomial models and neural networks as special cases. The model F_A(x) is trivially a linear transform of the T-dimensional vector (f_1^weak(x), ..., f_T^weak(x)). Let us substitute the initial feature vector x ∈ R^N with the new one x^new = (x_1, ..., x_N, f_1^weak(x), ..., f_T^weak(x)) ∈ R^{N+T}. In what follows, this operation is referred to as feature extension.

Proposition 1. Let F_A be the ranker in Equation (10). Define the ranker F_WMSE (respectively, F^new_WMSE) as the solution to the WMSE minimization problem (2) over the initial feature space R^N (respectively, over the new, extended feature space R^{N+T}) with L2-regularization parameter µ = 0. For a ranker F, let E(F) be the weighted error on the training dataset:

   E(F) := Σ_{i∈X^train} w_i (F(x_i) - l_i)^2.

Then one has the following bound:

   E(F^new_WMSE) ≤ min(E(F_A), E(F_WMSE)).

Proof. By its definition, F^new_WMSE minimizes the error E(F) among all linear models of the form x^new · b^new. Since the rankers F_A and F_WMSE both have this form, the required bound holds.

This proposition demonstrates that, in terms of E(·), the linear ranker F^new_WMSE trained over the new, extended feature space is at least as good as F_WMSE or F_A. Although in the general scheme (see Figure 1) we are interested in a different quality measure on a different dataset, this proposition can be seen as evidence of a possible benefit from the feature extension. In the experimental part we show that feature extension indeed significantly improves the quality of the trained ranker evaluated according to M on T.
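A possible realization of the feature extension, assuming scikit-learn's gradient boosting as the tree ensemble (the paper itself uses a proprietary GBDT implementation, and the exact fitting target is not specified, so both are assumptions here), might look as follows.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def extend_features(X_source, targets_source, X_target, n_trees=300):
    """Feature extension in the spirit of Section 5.2: fit a gradient-boosted
    ensemble on the source data and append the output of each individual tree
    as an extra ranking feature, so that a linear WMSE ranker can act on
    (x_1, ..., x_N, f_1(x), ..., f_T(x))."""
    gbdt = GradientBoostingRegressor(n_estimators=n_trees, max_depth=3, learning_rate=0.1)
    gbdt.fit(X_source, targets_source)   # targets could be, e.g., averaged crowd labels

    def extend(X):
        # estimators_ has shape (n_trees, 1) for a single-output regressor.
        tree_outputs = np.column_stack([t[0].predict(X) for t in gbdt.estimators_])
        return np.hstack([X, tree_outputs])

    return extend(X_source), extend(X_target)
```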
6. EXPERIMENTS

We conduct two sets of experiments. The first series of experiments uses the Yahoo! Learning to Rank Challenge dataset (LTRCD). This dataset is labeled by professional assessors and has neither noisy crowd labels nor any label features. Hence, LTRCD does not suit our setup as is, so we are forced to simulate label noise and derive label features. We use this simulated noise to demonstrate the actual weights and relevance values our algorithm is able to learn. There are no large-scale publicly available crowdsourced learning to rank datasets, so for the second set of experiments we use the proprietary dataset YD. This is the dataset described in Section 3, and it is used directly to evaluate various baselines and compare them with our processing framework. In both cases, we use the discounted cumulative gain measure at positions k = 1, 5, 10, denoted by DCG@k [9], as the ranking quality metric. For both datasets, graded relevance labels are mapped into the standard numerical gains {15, 7, 3, 1, 0}.
6.1 Simulated experiments

In Section 4.2, we motivated the introduction of the remapping and weighting steps in our framework by highlighting two specific shortcomings of crowd labels: the varying quality of crowd workers and the inherent ambiguity of any labeling instructions. In the real world, both phenomena are difficult to observe, and it is difficult to explicitly analyze their influence on the weights and the relevance values learned by our algorithm. For this reason, to illustrate the intuition behind our remapping and weighting steps, we design two simulated experiments with known worker quality and known interpretation of the instructions. With this setup, we demonstrate the particular label relevance values and weights our processing step learns.

Data. We use the training and the validation datasets from Set1 of LTRCD (see [3] for details) as the bases for our datasets X^source and T. Namely, the samples and features in X^source / T are the same as in the training/validation dataset of LTRCD; the labels in T also coincide with the ones in the validation dataset, while the labels in X^source are intentionally modified. Originally, samples in LTRCD are endowed with 5-graded relevance labels {4, 3, 2, 1, 0} (corresponding to the gains {15, 7, 3, 1, 0} in DCG@k, respectively). In the simulated experiments, we treat each sample in X^source as one task completed by one virtual worker.

In the first simulated experiment, we assume that each worker w assigns a binary crowd label to each sample. This label depends on two quantities: the actual quality of the document, i.e., its label in the initial dataset l ∈ {4, 3, 2, 1, 0}, and the worker's rigor r^(w) ∈ {3, 2, 1, 0}. The worker assigns the crowd label

   cl^(w) := 1 if l > r^(w), and cl^(w) := 0 otherwise.    (11)

In simple words, the most conservative workers with r^(w) = 3 assign the positive label only to perfect results, the less rigorous workers with r^(w) = 2 assign positive labels to perfect and good documents, etc. This labeling scheme models variable interpretation of the labeling instructions by crowd workers. Intuitively, one expects that the more conservative the worker is, the higher the actual relevance value her positive labels correspond to. To modify the initial dataset X^source of LTRCD, we assign each worker a uniformly random rigor r^(w) and compute each crowd label according to Equation (11). Moreover, in this experiment, we endow each sample with 4 binary label features:

   y := {I(cl^(w) = 1 & r^(w) = r) for r ∈ {3, 2, 1, 0}}.

The idea is that these features will help us to distinguish positive labels of different workers and remap them according to the worker's rigor.
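A sketch of the label-generation scheme of Simulated Experiment 1, with toy editorial grades; the random seed and array layout are our own choices, and each sample acts as its own virtual worker.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rigor_labels(grades):
    """Assign each sample a uniformly random rigor r in {0, 1, 2, 3}, set the
    crowd label cl = 1 iff the editorial grade l in {0, ..., 4} exceeds r
    (Equation (11)), and build 4 binary label features I(cl = 1 & r = r0)."""
    grades = np.asarray(grades)
    rigor = rng.integers(0, 4, size=len(grades))          # r^(w) per virtual worker
    cl = (grades > rigor).astype(int)                      # Equation (11)
    label_features = np.stack([(cl == 1) & (rigor == r) for r in range(4)],
                              axis=1).astype(float)
    return cl, label_features

cl, y = simulate_rigor_labels([4, 3, 2, 1, 0, 3, 0])       # toy editorial labels
```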
In the second simulated experiment, we model the variable quality of workers. Namely, each worker is assigned a random value q^(w) ∈ {0.0, 0.5, 0.75, 1.0} reflecting her 'quality'. Further, we assume that each worker w makes a mistake with probability 1 - q^(w). We expect that the higher the quality of the worker, the higher our confidence in her labels and, therefore, the larger the weight our framework should assign to her samples. In this simulated experiment, we again endow each sample with 4 binary label features reflecting the quality of the worker:

   y := {I(q^(w) = q) for q ∈ {0.0, 0.5, 0.75, 1.0}}.

Results. Simulated Experiment 1. For the first simulated experiment we learn only the relevance values l_i = σ(β · y_i), see Equation (3), i.e., in Algorithm 1 only the parameters β are tuned, while the vector α is fixed. The resulting relevance values for the samples with positive crowd labels are depicted in Figure 2. As expected, the relevance value grows monotonically with the rigor r^(w) of the worker.

[Figure 2: The relevance values learned with our algorithm in Simulated Experiment 1; relevance value (on [0, 1]) as a function of the rigor r^(w) ∈ {0, 1, 2, 3}.]

Simulated Experiment 2. In the second simulated experiment we, on the contrary, learn only the weights w_i = σ(α · y_i), with β in Algorithm 1 fixed. Figure 3 shows the resulting weights as a function of the worker's quality q^(w). Again, the weights w_i grow with the worker's quality. Surprisingly, the weight of workers with quality q^(w) = 0.5 is negligible, implying that the use of the corresponding samples within the background machine learning algorithm does not improve the ranking quality.

[Figure 3: The weights learned with our algorithm in Simulated Experiment 2; weight (on [0, 1]) as a function of the quality q^(w) ∈ {0.0, 0.5, 0.75, 1.0}.]

6.2 Experiments on the real-world data

Experiment design. This set of experiments is conducted on the dataset described in Section 3. For a ranker F and a dataset X, let us define DCG@k(F, X) to be the average of the DCG@k metric over all queries in X, ranked according to the values of F. To evaluate our algorithm along with a number of baselines we perform 5-fold cross-validation. Each time, the target dataset T is randomly split into 3 parts, T^train, T^validate and T^test, on the basis of the unique query id. The part T^train is used to train Algorithm 1, specifically, to compute the LambdaRank gradients ∂DCG@k(F, T^train)/∂s_i. The part T^validate is used to tune the various hyperparameters of the background machine learning algorithm A and of the processing step, i.e., the L2-regularization parameter µ, the number of iterations J and the step size ε of our processing algorithm, and the number of extended features T. The output of Algorithm 1 is the dataset X^train with samples corresponding to the elements of X^source and weights/labels given by (3). Finally, the third part T^test is used to evaluate the ranker trained via the background algorithm A on X^train (our approach) and the baseline rankers. Due to the proprietary nature of the data, we do not disclose the actual values of the DCG@k metrics on the test dataset T^test. Instead, we report the relative improvement over the 'majority vote' baseline, described below; we denote this relative improvement by ∆DCG@k.
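For reference, the evaluation quantities used below can be computed as in the following sketch: gains {15, 7, 3, 1, 0}, per-query averaging, and the relative improvement ∆DCG@k over the majority-vote baseline. This is our illustration, not the proprietary evaluation code.

```python
import numpy as np

GAIN = {4: 15, 3: 7, 2: 3, 1: 1, 0: 0}   # graded labels -> gains, as in Section 6

def dcg_at_k(scores, grades, k=10):
    """DCG@k of one query: documents sorted by the ranker's scores,
    gains discounted by log2 of the position."""
    order = np.argsort(-np.asarray(scores))[:k]
    gains = np.array([GAIN[g] for g in np.asarray(grades)[order]])
    return float(np.sum(gains / np.log2(np.arange(2, len(order) + 2))))

def mean_dcg(per_query, k=10):
    """DCG@k(F, X): average of per-query DCG@k; per_query is a list of (scores, grades)."""
    return float(np.mean([dcg_at_k(s, g, k) for s, g in per_query]))

def delta_dcg(per_query_model, per_query_mv, k=10):
    """Relative improvement over the majority-vote baseline, as reported in Table 4."""
    base = mean_dcg(per_query_mv, k)
    return 100.0 * (mean_dcg(per_query_model, k) - base) / base
```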
Baselines. Since the specific problem we formulate in this paper is novel, there are no baselines intended specifically for our setting. For this reason, we adopt various common approaches to noise reduction and consensus modeling. All of them are used at the processing step (see Figure 1). As the background algorithm for the baselines, we use a proprietary implementation of Friedman's gradient-boosted ensemble of decision trees [7]. We experiment with the standard pointwise regression regime and with the state-of-the-art listwise algorithm LambdaMART [2], both trained on X^train. We use the following processing algorithms as the baselines:

1. MV, assigning the majority vote to every query-document pair;
2. Av, assigning the average of the 3 crowd labels to every query-document pair;
3-4. DS/GLAD, assigning the most likely label to each task according to the DS/GLAD model.

Besides these baselines, we experiment with a classical noise reduction technique, reweighting by deviations [6, §3.3.2], denoted by RD. This reweighting method does not take into account any label features; instead, it assigns to the i-th sample the weight min(1/δ^2, 1/∆_i^2), where ∆_i = |l_i - l̂_i| is the absolute error of the background learning algorithm on the initial dataset X^source and δ ≥ 0 is a parameter.
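A minimal sketch of the MV aggregation and of the RD weighting rule min(1/δ^2, 1/∆_i^2) described above; the guard against zero error is our own addition.

```python
import numpy as np

def majority_vote(labels_per_pair):
    """MV baseline: the most frequent of the (three) crowd labels of a pair."""
    return int(np.bincount(labels_per_pair).argmax())

def rd_weights(labels, predictions, delta):
    """Reweighting by deviations (RD): weight min(1/delta^2, 1/Delta_i^2),
    where Delta_i is the absolute error of the background learner on X^source."""
    abs_err = np.abs(np.asarray(labels, dtype=float) - np.asarray(predictions, dtype=float))
    return np.minimum(1.0 / delta**2, 1.0 / np.maximum(abs_err, 1e-12)**2)
```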
Results. One of the contributions of this paper is the introduction of the feature extension step (see Section 5.2). We first study the impact of the number of additional features T in the extended feature space R^{N+T}. To construct the additional features, an ensemble of gradient-boosted decision trees is trained on X^source. Each individual tree in this ensemble is treated as a single extended ranking feature. We run remapping and weighting over the extended feature space with T ∈ {0, 100, 200, ..., 400}, with T = 0 corresponding to no feature extension. All the resulting models are evaluated on T^validate, see Figure 4. In accordance with the evidence provided in Section 5.2, the feature extension improves the quality of the linear background algorithm. However, as the number of added weak learners grows, the background algorithm starts to overfit to the source dataset. Figure 4 suggests that the optimal number of trees is T = 300.

[Figure 4: Performance of the feature extension step for different choices of T; ∆DCG@10 on T^validate (roughly 3.5% to 6.5%) as a function of the number of added features T ∈ [0, 400].]

Next, using the optimal parameter T = 300 tuned on T^validate, we evaluate all the baselines together with our approach on T^test. We consider three versions of our framework: (1) the reweighting approach Rw, which tunes only the parameters α in Algorithm 1; (2) the remapping approach Rm, which tunes only the parameters β; (3) their combination Rw+Rm, which tunes both α and β. The relative improvements of all these methods over the majority vote baseline (MV) are shown in Table 4. The remapping approach Rm alone turns out to be significantly better than the reweighting approach Rw, and both are outperformed by their combination. Note that, according to all three evaluation measures, Rw+Rm statistically significantly (with p < 0.05) outperforms the best-performing baseline, namely, LambdaMART trained on DS consensus labels.

Table 4: Comparison of various baselines with our framework. * denotes a significant improvement over the majority vote baseline with p < 0.05; † denotes a significant improvement over the best baseline with p < 0.05.

            Baselines (Regression)                     Baselines (LambdaMART)               Our approach
  Metric    MV    Av      DS      GLAD    RD           MV       Av       DS       GLAD      Rw      Rm       Rw+Rm
  ∆DCG@1    0%    2.49%*  2.58%*  -1.49%  -1.63%       11.61%*  13.03%*  13.56%*  9.53%*    9.76%*  12.46%*  15.35%†*
  ∆DCG@5    0%    1.93%*  1.66%*  -1.03%  -0.74%       5.38%*   6.47%*   6.49%*   4.5%*     4.99%*  6.06%*   7.30%†*
  ∆DCG@10   0%    1.58%*  1.38%*  -0.08%  -0.84%       4.27%*   5.16%*   5.32%*   3.64%*    4.05%*  4.80%*   5.97%†*

The performance results of the baselines suggest that the DS consensus model suits our dataset better than the GLAD model. We confirm this observation by analyzing the contribution of the various types of label features to the model.

Table 5: Feature sets and their performance (∆DCG@10 on T^test).
  Approach   Basic + GLAD   Basic + DS   All
  Rw         2.12%          3.29%        4.05%
  Rm         3.20%          3.81%        4.80%
  Rm+Rw      4.01%          5.38%        5.97%
Let us divide all the features into three groups: (1) the features based on the DS model; (2) the features based on the GLAD model; (3) the remaining basic features (corresponding to features 1 and 13-15 in Table 1). Feature importance is analyzed through feature ablation, i.e., we train the whole framework on certain subsets of the full feature set and evaluate the resulting ranker according to the ∆DCG@10 metric on T^test, see Table 5. Similarly to our previous observations, on all the feature sets remapping Rm outperforms reweighting Rw, and for all the algorithms the GLAD features are outperformed by the DS features.

7. DISCUSSION AND FUTURE WORK

In this paper we address the problem of learning to rank with the use of noisy labels acquired via crowdsourcing and aggregated with different consensus labeling models. Unlike the existing approaches to consensus modeling and noise reduction, the framework described in this paper leverages the label features (e.g., outputs of various consensus models, features of a worker and a task, etc.) while directly optimizing the quality of the ranker it produces. Its success prompts one to collect as much information about the worker (assessor), the task and the label as possible, e.g., to track the actions of the worker/assessor in the browser, the time spent on a given task, the ranking features of the query-document pair, etc. By transforming this information into a large label-feature vector y, we are able to learn more refined weights and relevance labels. These, in turn, being fed to a background machine learning algorithm, will most likely lead to a better ranking function.

We would also like to make a remark about the role of the target dataset T in the framework. As we have discussed above, the samples in the target set are labeled by professional assessors and, therefore, are expensive and rare. Nevertheless, our framework is supervised and requires these high-quality labels. Since commercial search engines have to update their ranking algorithms often, it might be expensive to renew the target dataset every time. Luckily, it suffices to tune the weighting and remapping functions only once. After that, they can be used directly within a background machine learning algorithm with new datasets X^source to retrain the ranker F.

In the future we are planning to apply the remapping framework to other problems. One of the promising directions is learning from click-through data. Besides that, we intend to develop our algorithm on the technical level by extending it to deal with pairwise preferences.

8. REFERENCES

[1] C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems 19, pages 193-200, 2007.
[2] C. Burges, K. Svore, P. Bennett, A. Pastusiak, Q. Wu, O. Chapelle, Y. Chang, and T.-Y. Liu. Learning to rank using an ensemble of lambda-gradient models. JMLR, pages 25-35, 2011.
[3] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. In Yahoo! Learning to Rank Challenge, pages 1-24, 2011.
[4] D. Chen, Y. Xiong, J. Yan, G.-R. Xue, G. Wang, and Z. Chen. Knowledge transfer for cross domain learning to rank. Information Retrieval, 13(3):236-253, 2010.
[5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[6] B. Frénay and M. Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845-869, 2014.
[7] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[8] J. Garcke and T. Vanck. Importance weighted inductive transfer learning for regression. In Machine Learning and Knowledge Discovery in Databases, pages 466-481. Springer, 2014.
[9] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422-446, 2002.
[10] J. Kekäläinen. Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems. Information Processing & Management, 41(5):1019-1033, 2005.
[11] H.-C. Kim and Z. Ghahramani. Bayesian classifier combination. In International Conference on Artificial Intelligence and Statistics, pages 619-627, 2012.
[12] K. Lee, J. Caverlee, and S. Webb. The social honeypot project: Protecting online communities from spammers. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 1139-1140, New York, NY, USA, 2010. ACM.
[13] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447-461, 2016.
[14] J. Muhammadi, H. R. Rabiee, and A. Hosseini. Crowd labeling: a survey. arXiv preprint arXiv:1301.2774, 2013.
[15] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.
[16] P. Ruvolo, J. Whitehill, and J. R. Movellan. Exploiting commonality and interaction effects in crowdsourcing tasks using latent factor models.
[17] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[18] T. Strutz. Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond. Vieweg+Teubner, 2010.
[19] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: Optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08, pages 77-86, New York, NY, USA, 2008. ACM.
[20] Y. Ustinovskiy, G. Gusev, and P. Serdyukov. An optimization framework for weighting implicit relevance labels for personalized web search. WWW '15, pages 1144-1154, 2015.
[21] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. WWW '14, pages 155-164, 2014.
[22] J. Vuurens, A. P. de Vries, and C. Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. CIR '11, pages 21-26, 2011.
[23] J. Whitehill, T.-F. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. NIPS '09, pages 2035-2043, 2009.
[24] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. NIPS '12, pages 2195-2203, 2012.
[25] D. Zhou, Q. Liu, J. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. ICML '14, pages 262-270, 2014.