Regularized Minimax Conditional Entropy for Crowdsourcing

Dengyong Zhou∗



Qiang Liu†



John C. Platt‡



Christopher Meek§


Nihar B. Shah¶

Abstract

There is rapidly increasing interest in crowdsourcing for data labeling. Through crowdsourcing, a large number of labels can often be gathered quickly and at low cost. However, the labels provided by crowdsourcing workers are usually not of high quality. In this paper, we propose a minimax conditional entropy principle to infer ground truth from noisy crowdsourced labels. Under this principle, we derive a unique probabilistic labeling model jointly parameterized by worker ability and item difficulty. We also propose an objective measurement principle, and show that our method is the only method which satisfies this objective measurement principle. We validate our method on a variety of real crowdsourcing datasets with binary, multiclass, or ordinal labels.

Keywords: crowdsourcing, human computation, minimax conditional entropy



∗ Microsoft Research, Redmond, WA 98052. Email: [email protected].
† Department of Computer Science, University of California at Irvine, Irvine, CA 92637. Email: [email protected].
‡ Microsoft Research, Redmond, WA 98052. Email: [email protected].
§ Microsoft Research, Redmond, WA 98052. Email: [email protected].
¶ Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA 94720. Email: [email protected].


1 Introduction

In many real-world applications, the quality of a machine learning system is governed by the number of labeled training examples, but the labor for data labeling is usually costly. There has been considerable machine learning research on learning when there are only a few labeled examples, such as semi-supervised learning and active learning. In recent years, with the emergence of crowdsourcing (or human computation) services like Amazon Mechanical Turk1, the costs associated with collecting labeled data in many domains have dropped dramatically, enabling the collection of large amounts of labeled data at low cost. However, the labels provided by the workers are often not of high quality, in part due to misaligned incentives and a lack of domain expertise among the workers. To overcome this quality issue, in general, the items are redundantly labeled by several different workers, and then the workers' labels are aggregated in some manner, for example, by majority voting.

The assumption underlying majority voting is that all workers are equally good, so each has an equal vote. Obviously, such an assumption does not reflect reality. It is easy to imagine that one worker is more capable than another in some labeling task. More subtly, the skill level of a worker may vary significantly from one labeling category to another. To address these issues, Dawid and Skene (1979) propose a model which assumes that each worker has a latent probabilistic confusion matrix for generating her labels. The off-diagonal elements of the matrix represent the probabilities that the worker mislabels an item from one class as another, while the diagonal elements correspond to her accuracy in each class. The true labels of the items and the confusion matrices of the workers can be jointly estimated by maximizing the likelihood of the workers' labels.

In the Dawid-Skene method, the performance of a worker, characterized by her confusion matrix, stays the same across all items in the same class. That is not true in many labeling tasks, where some items are more difficult to label than others, and a worker is more likely to mislabel a difficult item than an easy one. Moreover, an item may be easily mislabeled as some particular class, rather than others, by whoever labels it. To address these issues, we develop a minimax conditional entropy principle for crowdsourcing. Under this principle, we derive a unique probabilistic model which takes both worker ability and item difficulty into account. When item difficulty is ignored, our model seamlessly reduces to the classical Dawid-Skene model. We also propose a natural objective measurement principle, and show that our method is the only method which satisfies this objective measurement principle. This work is an extension of the earlier results presented in (Zhou et al., 2012, 2014).

We organize the paper as follows. In Section 2, we propose the minimax conditional entropy principle for aggregating multiclass labels collected from a crowd and derive its dual form. In Section 3, we develop regularized minimax conditional entropy for preventing overfitting and generating probabilistic labels. In Section 4, we propose the objective measurement principle, which also leads to the probabilistic model derived from the minimax conditional entropy principle. In Section 5, we extend our minimax conditional entropy method to ordinal labels, where we need to introduce a new assumption called adjacency confusability. In Section 6, we present a simple yet efficient coordinate ascent method to solve the minimax program through its dual form, and also a method for model selection. Related work is discussed in Section 7. Empirical results on real crowdsourcing data with binary, multiclass, or ordinal labels are reported in Section 8, and conclusions are presented in Section 9.

1 https://www.mturk.com


            worker 1   worker 2   ···   worker m
item 1      x_11       x_21       ···   x_m1
item 2      x_12       x_22       ···   x_m2
···         ···        ···        ···   ···
item n      x_1n       x_2n       ···   x_mn

            worker 1   worker 2   ···   worker m
item 1      π_11       π_21       ···   π_m1
item 2      π_12       π_22       ···   π_m2
···         ···        ···        ···   ···
item n      π_1n       π_2n       ···   π_mn

Figure 1: Top table: observed labels x_ij provided by worker i for item j. Bottom table: underlying distributions π_ij of worker i for generating a label for item j. In our approach, the rows and columns of the unobserved bottom table are constrained to match the rows and columns of the observed top table.

2 Minimax Conditional Entropy Principle

In this section, we present the minimax conditional entropy principle for aggregating crowdsourced multiclass labels in both its primal and dual forms. We also show that minimax conditional entropy is equivalent to minimizing Kullback-Leibler (KL) divergence.

2.1 Notation and Problem Setting

Assume that there is a group of workers indexed by i, a set of items indexed by j, and a number of classes indexed by k or c. Let x_{ij} be the observed label that worker i assigns to item j, and X_{ij} be the corresponding random variable. Denote by Q(Y_j = c) the unobserved true probability that item j belongs to class c. A special case is that Q(Y_j = c) = 1 and Q(Y_j = k) = 0 for any other class k ≠ c; that is, the labels are deterministic. Denote by P(X_{ij} = k | Y_j = c) the probability that worker i labels item j as class k while the true label is c. Our goal is to estimate the unobserved true labels from the noisy workers' labels.
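To make the notation concrete, the following sketch (an illustration of our own, not part of the paper; the array shapes and the use of NumPy are assumptions, and the toy labels reproduce the example of Figure 2 in Section 2.2 with classes coded 0, ..., K-1) stores the observed labels x_{ij}, the true label distribution Q, and the worker labeling model P as arrays.

import numpy as np

m, n, K = 3, 6, 3                       # workers i, items j, classes k or c
# observed labels x_ij (row i = worker, column j = item), classes coded 0..K-1;
# these are the labels of the toy example in Figure 2
x = np.array([[0, 1, 1, 0, 2, 1],
              [1, 0, 1, 1, 0, 2],
              [0, 0, 0, 1, 1, 2]])
# Q[j, c]: probability that item j belongs to class c (each row sums to one)
Q = np.full((n, K), 1.0 / K)
# P[i, j, c, k]: probability that worker i labels item j as class k when the true class is c
P = np.full((m, n, K, K), 1.0 / K)
assert np.allclose(Q.sum(axis=1), 1.0) and np.allclose(P.sum(axis=3), 1.0)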

2.2 Primal Form

Our approach is built upon two four-dimensional tensors with the four dimensions corresponding to workers i, items j, observed labels k, and true labels c. The first tensor is referred to as the empirical confusion tensor, of which each element is given by

\hat{φ}_{ij}(c, k) = Q(Y_j = c) I(x_{ij} = k)

to represent an observed confusion from class c to class k by worker i on item j. The other tensor is referred to as the expected confusion tensor, of which each element is given by

φ_{ij}(c, k) = Q(Y_j = c) P(X_{ij} = k | Y_j = c)

to represent an expected confusion from class c to class k by worker i on item j. We assume that the labels of the items are independent. Thus, the entropy of the observed workers' labels conditioned on the true labels can be written as

H(X|Y) = - \sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k | Y_j = c) log P(X_{ij} = k | Y_j = c).


            worker 1   worker 2   worker 3
item 1      1          2          1
item 2      2          1          1
item 3      2          2          1
item 4      1          2          2
item 5      3          1          2
item 6      2          3          3

              [1 1 0]                 [1 1 0]                 [2 0 0]
\hat{φ}_1  =  [1 1 0],   \hat{φ}_2 =  [0 2 0],   \hat{φ}_3 =  [1 1 0]
              [0 1 1]                 [1 0 1]                 [0 1 1]

Figure 2: An illustration of the empirical confusion tensors. The table contains three workers' labels over six items. These items are assumed to have deterministic true labels as follows: class 1 = {item 1, item 2}, class 2 = {item 3, item 4}, and class 3 = {item 5, item 6}. The (c, k)-th entry of matrix \hat{φ}_i is the number of items labeled as class k by worker i given that their true label is class c.

Both the distributions P and Q are unknown here. To attack this problem, we first consider a simpler problem: estimating P when Q is given. Then, we proceed to jointly estimating P and Q when both are unknown. Given the true label distribution Q, we propose to estimate the distribution P which generates the workers' labels by

\max_P  H(X|Y),   (1)

subject to the worker and item constraints (Figure 1)

\sum_j [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ] = 0,  ∀ i, k, c,   (2a)

\sum_i [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ] = 0,  ∀ j, k, c,   (2b)

plus the probability constraints

\sum_k P(X_{ij} = k | Y_j = c) = 1,  ∀ i, j, c,   (3a)

\sum_c Q(Y_j = c) = 1,  ∀ j,   (3b)

Q(Y_j = c) ≥ 0,  ∀ j, c.   (3c)

The constraints in Equation (2a) enforce the expected confusion counts in the worker dimension to match their empirical counterparts. Symmetrically, the constraints in Equation (2b) enforce the expected confusion counts in the item dimension to match their empirical counterparts. An illustration of empirical confusion tensors is shown in Figure 2.

When both the distributions P and Q are unknown, we propose to jointly estimate them by

\min_Q \max_P  H(X|Y),   (4)

subject to the constraints in Equations (2) and (3). Intuitively, entropy can be understood as a measure of uncertainty. Thus, minimizing the maximum conditional entropy means that, given the true labels, the workers' labels are the least random. Theoretically, minimizing the maximum conditional entropy can be connected to maximum likelihood. In what follows, we show how the connection is established.
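As a concrete illustration (a sketch of our own, not from the paper; it reuses the toy labels of Figure 2 with classes coded 0, 1, 2), the empirical confusion tensor \hat{φ}_{ij}(c, k) = Q(Y_j = c) I(x_{ij} = k) can be computed directly, and summing it over items recovers the per-worker matrices \hat{φ}_i shown in Figure 2; the worker constraints (2a) require the model's expected counts to match these sums.

import numpy as np

x = np.array([[0, 1, 1, 0, 2, 1],        # worker 1's labels for items 1..6
              [1, 0, 1, 1, 0, 2],        # worker 2
              [0, 0, 0, 1, 1, 2]])       # worker 3
y = np.array([0, 0, 1, 1, 2, 2])         # deterministic true classes of the six items
m, n, K = x.shape[0], x.shape[1], 3

Q = np.zeros((n, K))
Q[np.arange(n), y] = 1.0                 # deterministic Q(Y_j = c)

phi_hat = np.zeros((m, n, K, K))         # empirical confusion tensor
for i in range(m):
    for j in range(n):
        for c in range(K):
            phi_hat[i, j, c, x[i, j]] = Q[j, c]    # Q(Y_j = c) * I(x_ij = k)

# summing over items recovers the matrices of Figure 2, e.g. for worker 1:
print(phi_hat.sum(axis=1)[0])            # [[1. 1. 0.] [1. 1. 0.] [0. 1. 1.]]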

2.3 Dual Form

The Lagrangian of the maximization problem in (4) can be written as

L = H(X|Y) + L_σ + L_τ + L_λ,

with

L_σ = \sum_{i,c,k} σ_i(c, k) \sum_j [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ],

L_τ = \sum_{j,c,k} τ_j(c, k) \sum_i [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ],   (5)

L_λ = \sum_{i,j,c} λ_{ijc} [ \sum_k P(X_{ij} = k | Y_j = c) - 1 ],

where σ_i(c, k), τ_j(c, k) and λ_{ijc} are introduced as the Lagrange multipliers. By the Karush-Kuhn-Tucker (KKT) conditions (Boyd and Vandenberghe, 2004),

∂L / ∂P(X_{ij} = k | Y_j = c) = 0,

which implies

log P(X_{ij} = k | Y_j = c) = λ_{ijc} - 1 + σ_i(c, k) + τ_j(c, k).

Combining the above equation and the probability constraints in (3a) eliminates λ and yields

P(X_{ij} = k | Y_j = c) = (1 / Z_{ij}) exp[ σ_i(c, k) + τ_j(c, k) ],   (6)

where Z_{ij} is the normalization factor given by

Z_{ij} = \sum_k exp[ σ_i(c, k) + τ_j(c, k) ].

Although the matrices [σ_i(c, k)] and [τ_j(c, k)] in Equation (6) come out as the mathematical consequence of minimax conditional entropy, they can be understood intuitively. We can consider the matrix [σ_i(c, k)] as a measure of the intrinsic ability of worker i. The (c, k)-th entry measures how likely worker i is to label a randomly chosen item in class c as class k. Similarly, we can consider the matrix [τ_j(c, k)] as a measure of the intrinsic difficulty of item j. The (c, k)-th entry measures how likely item j in class c is to be labeled as class k by a randomly chosen worker. In the following, we refer to the [σ_i(c, k)] as worker confusion matrices and the [τ_j(c, k)] as item confusion matrices.

Substituting the labeling model in Equation (6) into the Lagrangian in Equation (5), we can obtain the dual form of the minimax problem (4) as (see Appendix A)

\max_{σ,τ,Q} \sum_{j,c} Q(Y_j = c) \sum_i log P(X_{ij} = x_{ij} | Y_j = c).   (7)


It is obvious that, to be optimal, the true label distribution has to be deterministic. Thus, the dual Lagrangian can be equivalently expressed as the complete log-likelihood

log \prod_j \sum_c [ Q(Y_j = c) \prod_i P(X_{ij} = x_{ij} | Y_j = c) ].

In Section 3, we show how to regularize the objective function in (4) to generate probabilistic labels.
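To illustrate Equations (6) and (7) (a sketch of our own; the randomly drawn σ and τ are placeholders rather than fitted values), the labeling model is a row-wise softmax of σ_i(c, ·) + τ_j(c, ·), and the dual objective is the Q-weighted log-likelihood of the observed labels.

import numpy as np

rng = np.random.default_rng(0)
m, n, K = 3, 6, 3
sigma = rng.normal(size=(m, K, K))       # worker confusion matrices [σ_i(c, k)]
tau = rng.normal(size=(n, K, K))         # item confusion matrices [τ_j(c, k)]

def label_model(sigma, tau):
    # Equation (6): P[i, j, c, k] = exp(σ_i(c,k) + τ_j(c,k)) / Z, normalized over k for each c
    s = sigma[:, None, :, :] + tau[None, :, :, :]
    s = s - s.max(axis=3, keepdims=True)               # numerical stability
    P = np.exp(s)
    return P / P.sum(axis=3, keepdims=True)

def dual_objective(P, Q, x):
    # Equation (7): sum_{j,c} Q(Y_j = c) sum_i log P(X_ij = x_ij | Y_j = c)
    m, n = x.shape
    L = np.zeros_like(Q)                                # L[j, c] = sum_i log P(x_ij | c)
    for i in range(m):
        for j in range(n):
            L[j] += np.log(P[i, j, :, x[i, j]])
    return (Q * L).sum()

x = rng.integers(0, K, size=(m, n))                     # placeholder observed labels
Q = np.full((n, K), 1.0 / K)
print(dual_objective(label_model(sigma, tau), Q, x))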

2.4 Minimizing KL Divergence

Let us extend the two distributions P and Q to the product space X × Y. We extend the distribution Q by defining Q(X_{ij} = x_{ij}) = 1, while Q(Y) stays the same. We extend the distribution P with P(X, Y) = \prod_{ij} P(X_{ij} | Y_j) P(Y_j), where P(X_{ij} | Y_j) is given by Equation (6), and P(Y) is a uniform distribution over all possible classes. Then, we have

Theorem 2.1 When the true labels are deterministic, minimizing the KL divergence from Q to P, that is,

\min_{P,Q} D_{KL}(Q ∥ P) = \sum_{X,Y} Q(X, Y) log [ Q(X, Y) / P(X, Y) ],   (8)

is equivalent to the minimax problem in (4).

The proof is presented in Appendix B. A sketch of the proof is as follows. We show that

D_{KL}(Q ∥ P) = - \sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k | Y_j = c) log P(X_{ij} = k | Y_j = c) + \sum_Y Q(Y) log Q(Y) - log P(Y).

By the definition of P(X, Y), P(Y) is a constant. Moreover, when the true labels are deterministic, we have

\sum_Y Q(Y) log Q(Y) = 0.

This concludes the proof of this theorem.
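As a small numerical illustration of the argument (a sketch of our own with placeholder values), when Q is deterministic the term \sum_Y Q(Y) log Q(Y) vanishes and, with P(Y) uniform, D_{KL}(Q ∥ P) differs from the negative complete log-likelihood only by the constant -log P(Y).

import numpy as np

rng = np.random.default_rng(1)
m, n, K = 3, 6, 3
x = rng.integers(0, K, size=(m, n))                     # placeholder observed labels
y = rng.integers(0, K, size=n)                          # deterministic true labels
P_cond = rng.dirichlet(np.ones(K), size=(m, n, K))      # placeholder P(X_ij = k | Y_j = c)

# With Q deterministic at y, Q(X, Y) puts all its mass on the single configuration (x, y),
# so sum_Y Q(Y) log Q(Y) = 0 and D_KL(Q || P) = -log P(X = x, Y = y).
complete_loglik = sum(np.log(P_cond[i, j, y[j], x[i, j]])
                      for i in range(m) for j in range(n))
log_p_Y = n * np.log(1.0 / K)                           # P(Y) uniform over the K classes of each item
kl = -(complete_loglik + log_p_Y)
print(np.isclose(kl, -complete_loglik + n * np.log(K))) # True: the gap is the constant -log P(Y)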

3 Regularized Minimax Conditional Entropy

In this section, we regularize our minimax conditional entropy method to address two practical issues:

• Preventing overfitting. While crowdsourcing is cheap, collecting many redundant labels may be more expensive than hiring experts. Typically, the number of labels collected for each item is limited to a small number. In this case, the empirical counts in Equation (2) may not match their expected values. It is likely that they fluctuate around their expected values, although these fluctuations are not large.


• Generating probabilistic labels. Our minimax conditional entropy method can only generate deterministic labels (see Section 2.3). In practice, probabilistic labels are usually more useful than deterministic labels. When the estimated label distribution for an item is close to uniform over several classes, we can either ask the crowd for more labels for the item or forward the item to an external expert.

To address the issue of overfitting, we replace exact matching with approximate matching while penalizing large fluctuations. To generate probabilistic labels, we consider an entropy regularization over the unknown true label distribution. This is motivated by the analysis in Section 2.4. Formally, we regularize our minimax conditional entropy method as follows. Let us denote the entropy of the true label distribution by

H(Y) = - \sum_{j,c} Q(Y_j = c) log Q(Y_j = c).

To estimate the true labels, we consider

\min_Q \max_P  H(X|Y) - H(Y) - (1/α) Ω(ξ) - (1/β) Ψ(ζ)   (9)

subject to the relaxed worker and item constraints

\sum_j [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ] = ξ_i(c, k),  ∀ i, c, k,   (10a)

\sum_i [ φ_{ij}(c, k) - \hat{φ}_{ij}(c, k) ] = ζ_j(c, k),  ∀ j, c, k,   (10b)

plus the probability constraints in Equation (3). The regularization functions Ω and Ψ are chosen as

Ω(ξ) = (1/2) \sum_i \sum_{c,k} [ ξ_i(c, k) ]^2,   (11a)

Ψ(ζ) = (1/2) \sum_j \sum_{c,k} [ ζ_j(c, k) ]^2.   (11b)

The new slack variables ξ_i(c, k), ζ_j(c, k) in Equation (10) model the possible fluctuations. Note that these slack variables are not restricted to be positive. When there are a sufficiently large number of observations, the fluctuations should be approximately normally distributed, due to the central limit theorem. This observation motivates the choice of the regularization functions in (11) to penalize large fluctuations. The entropy term H(Y) in the objective function, which is introduced for generating probabilistic labels, can be regarded as penalizing a large deviation from the uniform distribution.

Substituting the labeling model from Equation (6) into the Lagrangian of (9), we obtain the dual form (see Appendix C)

\max_{σ,τ,Q} \sum_{j,c} Q(Y_j = c) \sum_i log P(X_{ij} = x_{ij} | Y_j = c) + H(Y) - α Ω^*(σ) - β Ψ^*(τ),   (12)


where

Ω^*(σ) = (1/2) \sum_i \sum_{c,k} [ σ_i(c, k) ]^2,   (13)

Ψ^*(τ) = (1/2) \sum_j \sum_{c,k} [ τ_j(c, k) ]^2.   (14)

When α = 0 and β = 0, the objective function in (12) turns out to be a lower bound of the log marginal likelihood:

log \prod_j \sum_c \prod_i P(X_{ij} = x_{ij} | Y_j = c)
  = log \prod_j \sum_c Q(Y_j = c) [ \prod_i P(X_{ij} = x_{ij} | Y_j = c) ] / Q(Y_j = c)
  ≥ \sum_{j,c} Q(Y_j = c) \sum_i log P(X_{ij} = x_{ij} | Y_j = c) + H(Y).

The last step is based on Jensen's inequality. Maximizing the marginal likelihood is more appropriate than maximizing the complete likelihood since only the observed data matters in our inference.

Finally, we introduce a variant of our regularized minimax conditional entropy. It is obtained by restricting the feasible region of the slack variables through

\sum_c ξ_i(c, c) = 0,  ∀ i.   (15)

This is equivalent to

\sum_{j,c} [ φ_{ij}(c, c) - \hat{φ}_{ij}(c, c) ] = 0,  ∀ i.

It says that the empirical count of the correct answers from each worker is equal to its expectation. According to the law of large numbers, this assumption is approximately correct when a worker has a sufficiently large number of correct answers. Note that this does not mean that the percentage of correct answers from the worker has to be large. Let K denote the number of classes. Under the additional constraints in Equation (15), the dual problem can still be expressed by (12), except that (see Appendix C)

Ω^*(σ) = (1/2) \sum_{i,c} { [ σ_i(c, c) - \bar{σ}_i(c, c) ]^2 + \sum_{k ≠ c} [ σ_i(c, k) - \bar{σ}_i(c, k) ]^2 },   (16)

where

\bar{σ}_i(c, c) = (1/K) \sum_c σ_i(c, c),    \bar{σ}_i(c, k) = (1/(K(K-1))) \sum_c \sum_{k ≠ c} σ_i(c, k).

From our empirical evaluations, this variant is somewhat worse than the original version on most datasets. We include it here only for theoretical interest.
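To make the regularized dual concrete, here is a minimal sketch (our own; α, β and the random initialization are placeholder choices, and the simple closed-form Q-update below is only a stand-in for the coordinate ascent procedure described in Section 6) that evaluates the objective (12) and updates Q for fixed σ and τ as a per-item softmax of the accumulated log-likelihoods.

import numpy as np

def label_model(sigma, tau):
    # Equation (6): P[i, j, c, k] proportional to exp(σ_i(c,k) + τ_j(c,k)), normalized over k
    s = sigma[:, None, :, :] + tau[None, :, :, :]
    s = s - s.max(axis=3, keepdims=True)
    P = np.exp(s)
    return P / P.sum(axis=3, keepdims=True)

def item_loglik(sigma, tau, x):
    # L[j, c] = sum_i log P(X_ij = x_ij | Y_j = c)
    P = label_model(sigma, tau)
    m, n = x.shape
    L = np.zeros((n, P.shape[2]))
    for i in range(m):
        for j in range(n):
            L[j] += np.log(P[i, j, :, x[i, j]])
    return L

def regularized_dual(sigma, tau, Q, x, alpha, beta):
    # Equation (12): expected log-likelihood + H(Y) - α Ω*(σ) - β Ψ*(τ)
    L = item_loglik(sigma, tau, x)
    H_Y = -(Q * np.log(np.clip(Q, 1e-12, 1.0))).sum()
    return (Q * L).sum() + H_Y - 0.5 * alpha * (sigma ** 2).sum() - 0.5 * beta * (tau ** 2).sum()

def update_Q(sigma, tau, x):
    # for fixed σ and τ, the maximizer of (12) over Q is a per-item softmax of L[j, c]
    L = item_loglik(sigma, tau, x)
    L = L - L.max(axis=1, keepdims=True)
    Q = np.exp(L)
    return Q / Q.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
m, n, K = 3, 6, 3
x = rng.integers(0, K, size=(m, n))                  # placeholder observed labels
sigma, tau = 0.1 * rng.normal(size=(m, K, K)), 0.1 * rng.normal(size=(n, K, K))
Q = update_Q(sigma, tau, x)
print(regularized_dual(sigma, tau, Q, x, alpha=1.0, beta=1.0))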


4 Objective Measurement Principle

In this section, we introduce a natural objective measurement principle, and show that the probabilistic labeling model in Equation (6) is a consequence of this principle. Intuitively, the objective measurement principle can be described as follows:

1. A comparison of labeling difficulty between two items should be independent of which particular workers were involved in the comparison; and it should also be independent of which other items might also be compared.

2. Symmetrically, a comparison of labeling ability between two workers should be independent of which particular items were involved in the comparison; and it should also be independent of which other workers might also be compared.

Next we mathematically define the objective measurement principle. Assume that worker i has labeled items j and j′ in class c. Denote by E the event that one of these two items is labeled as k, and the other is labeled as c. Formally,

E = { I(X_{ij} = k) + I(X_{ij′} = k) = 1,  I(X_{ij} = c) + I(X_{ij′} = c) = 1 }.

Denote by A the event that item j is labeled as k and item j′ is labeled as c. Formally,

A = { X_{ij} = k,  X_{ij′} = c }.

It is obvious that A ⊂ E. Now we formulate the requirement (1) in the objective measurement principle as follows: P(A|E) is independent of worker i. Note that

P(A|E) = P(X_{ij} = k | Y_j = c) P(X_{ij′} = c | Y_{j′} = c) / [ P(X_{ij} = k | Y_j = c) P(X_{ij′} = c | Y_{j′} = c) + P(X_{ij} = c | Y_j = c) P(X_{ij′} = k | Y_{j′} = c) ].

Hence, P(A|E) is independent of worker i if and only if

P(X_{ij} = k | Y_j = c) P(X_{ij′} = c | Y_{j′} = c) / [ P(X_{ij} = c | Y_j = c) P(X_{ij′} = k | Y_{j′} = c) ]

is independent of worker i. In other words, given another arbitrary worker i′, we should have

P(X_{ij} = k | Y_j = c) P(X_{ij′} = c | Y_{j′} = c) / [ P(X_{ij} = c | Y_j = c) P(X_{ij′} = k | Y_{j′} = c) ]
  = P(X_{i′j} = k | Y_j = c) P(X_{i′j′} = c | Y_{j′} = c) / [ P(X_{i′j} = c | Y_j = c) P(X_{i′j′} = k | Y_{j′} = c) ].

Without loss of generality, we choose i′ = 0, j′ = 0 as the fixed references. Then,

P(X_{ij} = k | Y_j = c) / P(X_{ij} = c | Y_j = c) ∝ [ P(X_{i0} = k | Y_0 = c) / P(X_{i0} = c | Y_0 = c) ] · [ P(X_{0j} = k | Y_j = c) / P(X_{0j} = c | Y_j = c) ].

By the fact that probabilities are nonnegative, we can write

P(X_{i0} = k | Y_0 = c) = exp[σ_i(c, k)],    P(X_{0j} = k | Y_j = c) = exp[τ_j(c, k)].

The probabilistic labeling model in Equation (6) follows immediately. It is easy to verify that, due to the symmetry between item difficulty and worker ability, we can instead start from formulating the requirement (2) in the objective measurement principle to achieve the same result. Hence, in this sense, the two requirements are actually redundant.
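The property can also be checked numerically. In the following sketch (our own, with arbitrary placeholder σ and τ), the worker-specific factors exp[σ_i(c, ·)] cancel in the ratio, so P(A|E) computed under the model (6) is the same for every worker.

import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 5, 3
sigma = rng.normal(size=(m, K, K))       # placeholder worker confusion matrices
tau = rng.normal(size=(n, K, K))         # placeholder item confusion matrices

def P_cond(i, j):
    # P(X_ij = k | Y_j = c) under Equation (6); row c, column k
    s = sigma[i] + tau[j]
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def p_A_given_E(i, j, jp, c, k):
    # probability that item j gets label k and item j' gets label c, given that
    # one of the two items (both in class c) received label k and the other label c
    a = P_cond(i, j)[c, k] * P_cond(i, jp)[c, c]
    b = P_cond(i, j)[c, c] * P_cond(i, jp)[c, k]
    return a / (a + b)

vals = [p_A_given_E(i, 0, 1, c=0, k=2) for i in range(m)]
print(np.allclose(vals, vals[0]))        # True: P(A|E) does not depend on the worker i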

5 Extension to Ordinal Labels

In this section, we extend the minimax conditional entropy principle from multiclass to ordinal labels. Eliciting ordinal labels is important in tasks such as judging the relative quality of web search results or consumer products. Since ordinal labels are a special case of multiclass labels, the approach that we have developed in the previous sections can be used to aggregate ordinal labels. However, we observe that, in ordinal labeling, workers usually have an error pattern different from what we observe in multiclass labeling. We summarize our observation as the adjacency confusability assumption, and formulate it by introducing a different set of constraints for workers and items.

5.1 Adjacency Confusability

In ordinal labeling, workers usually have difficulty distinguishing between two adjacent ordinal classes whereas distinguishing between two classes which are far away from each other is much easier. We refer to this observation as adjacency confusability. To illustrate this observation, let us consider the example of screening mammograms. A mammogram is an x-ray picture used to check for breast cancer in women. Radiologists often rate mammograms on a scale such as no cancer, benign cancer, possible malignancy, or malignancy. In screening mammograms, a radiologist may rate a mammogram which indicates possible malignancy as malignancy, but it is less likely that she rates a mammogram which indicates no cancer as malignancy.
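As a toy illustration of this error pattern (with made-up numbers of our own, not estimated from data), a confusion matrix exhibiting adjacency confusability concentrates its off-diagonal mass next to the diagonal, so the probability of confusing two classes decays with their distance on the ordinal scale.

import numpy as np

# hypothetical 4-level ordinal scale: no cancer < benign < possible malignancy < malignancy
ordinal_confusion = np.array([
    [0.85, 0.12, 0.02, 0.01],            # most of the error mass sits in the adjacent class
    [0.10, 0.75, 0.13, 0.02],
    [0.02, 0.14, 0.70, 0.14],
    [0.01, 0.03, 0.16, 0.80]])
assert np.allclose(ordinal_confusion.sum(axis=1), 1.0)
print(ordinal_confusion[0, 1] > ordinal_confusion[0, 2] > ordinal_confusion[0, 3])  # True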

5.2 Ordinal Minimax Conditional Entropy

In what follows, we construct a different set of worker and item constraints to encode adjacency confusability. The formulation leads to an ordinal labeling model parameterized with structured confusion matrices for workers and items. We introduce two symbols ∆ and ∇ which take on arbitrary binary relations in {≥,
