Overcoming Calibration Problems in Pattern Labeling with Pairwise Ratings: Application to Personality Traits
Baiyu Chen4, Sergio Escalera1,2,3, Isabelle Guyon3,5, Víctor Ponce-López1,2,6, Nihar Shah4, and Marc Oliu Simón6
1 Computer Vision Center, Campus UAB, Barcelona; 2 Dept. Mathematics and Computer Science, University of Barcelona; 3 ChaLearn, California, USA; 4 University of California Berkeley, California, USA; 5 U. Paris-Saclay, France; 6 EIMT at the Open University of Catalonia, Barcelona.
Abstract. We address the problem of calibrating workers whose task is to label patterns with continuous variables, which arises for instance in labeling images or videos of humans with continuous traits. Worker bias is particularly difficult to evaluate and correct when many workers contribute just a few labels each, a situation typically arising when labeling is crowd-sourced. In the scenario of labeling short videos of people facing a camera with personality traits, we evaluate the feasibility of the pairwise ranking method to alleviate bias problems. Workers are exposed to pairs of videos at a time and must order them by preference. The variable levels are reconstructed by fitting a Bradley-Terry-Luce model with maximum likelihood. This method may at first sight seem prohibitively expensive because, for N videos, N(N−1)/2 pairs must potentially be processed by workers rather than N videos. However, by performing extensive simulations, we determine an empirical law for the scaling of the number of pairs needed as a function of the number of videos in order to achieve a given accuracy of score reconstruction, and show that the pairwise method is affordable. We apply the method to the labeling of a large-scale dataset of 10,000 videos used in the ChaLearn Apparent Personality Trait challenge.
Keywords: Calibration of labels, Label bias, Ordinal labeling, Variance Models, Bradley-Terry-Luce model, Continuous labels, Regression, Personality traits, Crowd-sourced labels.
1 Introduction
Computer vision problems often involve labeled data with continuous values (regression problems). These include job interview assessments [1], personality analysis [2,3], and age estimation [4], among others. To acquire continuous labeled data, it is often necessary to hire professionals that have had training on the task
ECCV-16 submission ID 9
of visually examining image or video patterns. For example, the data collection that motivated this study requires the labeling of 10,000 short videos with personality traits on a scale of −5 to 5. Because of the limited availability of trained professionals, one often resorts to the “wisdom of crowds” and hires a large number of untrained workers whose proposed labels are averaged to reduce variance. A typical service frequently used for crowd-sourced labeling is Amazon Mechanical Turk1 (AMT). In this paper, we work on the problem of obtaining accurate labels for continuous target variables, under time and budgetary constraints. The variance between labels obtained by crowd-sourcing stems from several factors, including the intrinsic variability of the labeling of a single worker (who, due to fatigue and lapses of concentration, may be inconsistent with his/her own assessments) and the bias that a worker may have (his/her propensity to over-rate or under-rate, e.g., a given personality trait). Intrinsic variability is often referred to as “random error” while “bias” is referred to as “systematic error”. The problem of intrinsic variability can be alleviated by pre-selecting workers for their consistency and by shortening labeling sessions to reduce worker fatigue. The problem of bias reduction is the central subject of this paper. Reducing bias has been tackled in various ways in the literature. Beyond simple averaging, aggregation models using confusion matrices have been considered for classification problems with binary or categorical labels (e.g. [5]). Aggregating continuous labels is reminiscent of Analysis of Variance (ANOVA) models and factor analysis (see, e.g., [6]) and has been generalized with the use of factor graphs [5]. Such methods are referred to in the literature as “cardinal” methods, to distinguish them from the “ordinal” methods we consider in this paper. Ordinal methods require that workers rank patterns as opposed to rating them.
Typically, a pair of patterns A and B is presented to a worker, who is asked to judge whether value(A) < value(B), for instance extroverted(A) < extroverted(B). Ordinal methods are by design immune to additive biases (at least global biases, not discriminative biases such as gender or race bias). Because of their built-in insensitivity to global biases, ordinal methods are well suited when many workers each contribute only a few labels [7]. In addition, there is a large body of literature [8–13] showing evidence that ordinal feedback is easier to provide than cardinal feedback for untrained workers. In preliminary experiments we conducted ourselves, workers were also more engaged and less easily bored if they had to make comparisons rather than rate single items. In the applications we consider, however, the end goal is to obtain a cardinal rating for every pattern (such as the level of friendliness). To that end, pairwise comparisons must be converted to cardinal ratings so as to obtain the desired labels. Various models have been proposed in the literature, including the Bradley-Terry-Luce (BTL) model [14], the Thurstone class of models [15], and non-parametric models based on stochastic transitivity assumptions [16]. Such methods are commonly used, for instance, to convert tournament wins in chess to ratings, and in online video games such as Microsoft's Xbox [17]. In this paper, we present experiments performed with the Bradley-Terry-Luce (BTL)
1 https://www.mturk.com/
model [14], which provided us with satisfactory results. By performing simulations, we demonstrate the viability of the method within the time and budget constraints of our data collection.
Contribution For a given target accuracy of cardinal rating reconstruction, we determine the practical economic and computational feasibility of such a data labeling effort by running extensive numerical experiments with artificial and real sample data from the problem at hand. We investigate the advantages of our proposed method from the scalability, noise-resistance, and stability points of view. We derive an empirical scaling law for the number of pairs necessary to achieve a given level of accuracy of cardinal rating reconstruction as a function of the number of videos. We provide a fast implementation of the method using Newton's conjugate gradient algorithm, which we make publicly available on GitHub. We propose a novel design for the choice of pairs based on small-world graph connectivity and experimentally demonstrate its superiority over random selection of pairs.
2 Problem Formulation
2.1 Application Setting: The Design of a Challenge
The main focus of this research is the organization of a pattern recognition challenge in the ChaLearn Looking at People (LAP) series [18–25], which is being run for ECCV 2016 [3] and ICPR 2016. This paper provides the methodology that we are using in our challenge on automatic personality trait analysis from video data [26]. The automatic analysis of videos to characterize human behavior has become an area of active research with a wide range of applications [1, 2, 27, 28]. Research advances in computer vision and pattern recognition have led to methodologies that can successfully recognize consciously executed actions or intended movements, for instance gestures, actions, and interactions with objects and other people [29]. However, much remains to be done in characterizing sub-conscious behaviors [30], which may be exploited to reveal aptitudes or competence, hidden intentions, and personality traits. Our present research focuses on a quantitative evaluation of personality traits represented by a numerical score for a number of well-established psychological traits known as the “big five” [31]: extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience. Personality refers to individual differences in characteristic patterns of thinking, feeling, and behaving. Characterizing personality automatically from video analysis is far from trivial because perceiving personality traits is difficult even for professionally trained psychologists and recruiting specialists. Additionally, quantitatively assessing personality traits is challenging due to the subjectivity of assessors and the lack of precise metrics. We are organizing a
challenge on “first impressions”, in which participants will develop solutions for recognizing personality traits of subjects from a short video sequence of the person facing the camera. This work could become very relevant to training young people to present themselves better by changing their behavior in simple ways, as the first impression made is very important in many contexts, such as job interviews. We made available a large newly collected data set, sponsored by Microsoft Research, of 10,000 15-second videos collected from YouTube, annotated with the “big five” personality traits by AMT workers. See the data collection interface in Figure 1. We budgeted 20,000 USD for labeling the 10,000 videos. We originally estimated that by paying 10 cents per rating of a video pair (a conservative estimate of the cost per task), we could afford rating 200,000 pairs. This paper presents the methodology we used to evaluate whether this budget would allow us to accurately estimate the cardinal ratings, which we support by numerical experiments on artificial data. Furthermore, we investigated the computational feasibility of running maximum likelihood estimation of the BTL model for such a large number of videos. Since this methodology is general, it could be used in other contexts.
Fig. 1: Data collection interface. The AMT workers must indicate their preference for five attributes representing the “big five” personality traits.
2.2 Model Definition
Our problem is parameterized as follows. Given a collection of N videos, each video has a trait with a value in [−5, 5] (this range is arbitrary; other ranges can be chosen). We treat each trait separately; in what follows, we consider a single trait. We require that only p pairs out of the P = N(N−1)/2 possible pairs be labeled by the AMT workers. For scaling reasons that we explain later, p is normalized by N log N to obtain the parameter α = p/(N log N). We
consider a model in which the ideal ranking may be corrupted by “noise”, the noise representing errors made by the AMT workers (controlled by a parameter σ). The three parameters α, N, and σ fully characterize our experimental setting, depicted in Figure 2, which we now describe. Let w* be the N-dimensional vector of “true” (unknown) cardinal ratings (e.g. of videos) and w̃ be the N-dimensional vector of estimated ratings obtained from the votes of workers after applying our reconstruction method based on pairwise ratings. We consider that i is the index of a pair of videos {j, k}, i = 1 : p, and that y_i ∈ {−1, 1} represents the ideal ordinal rating (+1 if w*_j > w*_k and −1 otherwise, ignoring ties). We use the notation x_i to represent a special kind of indicator vector, which has value +1 at position j, −1 at position k, and zero otherwise, such that ⟨x_i, w*⟩ = w*_j − w*_k. We formulate the problem as estimating the cardinal rating values of all videos based on p independent samples of ordinal ratings y_i ∈ {−1, 1} coming from the distribution:
P[y_i = 1 | x_i, w*] = F(⟨x_i, w*⟩ / σ),
where F is a known function with values in [0, 1] and σ is the noise parameter. We use the Bradley-Terry-Luce model, a special case in which F is the logistic function, F(t) = 1/(1 + exp(−t)). In our simulated experiments, we first draw the cardinal ratings w*_j uniformly in [−5, 5], then we draw p pairs randomly as training data and apply noise to get the ordinal ratings y_i. As test data, we draw another set of p pairs from the remaining data. It can be verified that the likelihood function of the BTL model is log-concave. We simply use the maximum likelihood method to estimate the cardinal rating values and obtain our estimate w̃. For such a convex optimization problem, this method should lead to a single global optimum.
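For concreteness, the maximum-likelihood fit of the BTL model can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation released on GitHub: the function name fit_btl is illustrative, and we use scipy.optimize.minimize with the Newton-CG method as a stand-in for the Newton's conjugate gradient algorithm mentioned later in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_btl(pairs, y, n, sigma=1.0):
    """Maximum-likelihood BTL scores from ordinal ratings.

    pairs : (p, 2) integer array, row i holds the video indices (j, k)
    y     : (p,) array of ordinal ratings in {-1, +1}
    n     : number of videos
    """
    j, k = pairs[:, 0], pairs[:, 1]

    def nll(w):
        # d_i = y_i * <x_i, w> / sigma; -log P[y_i] = log(1 + exp(-d_i))
        d = y * (w[j] - w[k]) / sigma
        return np.sum(np.logaddexp(0.0, -d))

    def grad(w):
        d = y * (w[j] - w[k]) / sigma
        coef = -(y / sigma) * expit(-d)   # derivative of -log P w.r.t. (w_j - w_k)
        g = np.zeros(n)
        np.add.at(g, j, coef)             # accumulate per-video gradient
        np.add.at(g, k, -coef)
        return g

    # log-concave likelihood: any local optimum is global
    return minimize(nll, np.zeros(n), jac=grad, method="Newton-CG").x
```

Because the likelihood depends only on score differences, the recovered scores are identified up to a shift (and a scale, through σ), which is why the evaluation of Section 2.3 normalizes the estimates before scoring.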
2.3 Evaluation
To evaluate the accuracy of our cardinal rating reconstruction, we use two different scores (computed on test data):
Coefficient of Determination (R2). We use the coefficient of determination to measure how well w̃ reconstructs w*. The residual sum of squares is defined as SS_res = Σ_i (w*_i − w̃_i)². The total sum of squares is defined as SS_var = Σ_i (w*_i − w̄*)², where w̄* denotes the average rating. The coefficient of determination is defined as R2 = 1 − SS_res/SS_var. Note that since the w*_i are on an arbitrary scale [−5, +5], we must normalize the w̃_i before computing the R2. This is achieved by finding the optimal shift and scale that maximize the R2.
Test-accuracy. We define test accuracy as the fraction of pairs correctly reoriented using w̃ on the test data pairs, i.e. those pairs not used for estimating w̃.
Fig. 2: Workflow diagram.
2.4 Experiment Design
In our simulations, we follow the workflow of Figure 2. We first generate a score vector w* using a uniform distribution in [−5, 5]^N. Once w* is chosen, we select training and test pairs. One original contribution of our paper is the choice of pairs. We propose to use a small-world graph construction method to generate the pairs [32]. Small-world graphs provide high connectivity, avoid disconnected regions, have well-distributed edges, and keep the distance between nodes small [33]. An edge is selected at random from the underlying graph, and the chosen edge determines the pair of items compared. We compare the small-world strategy with drawing pairs at random from a uniform distribution, which according to [7] yields near-optimal results. The ordinal rating of the pairs is generated with the BTL model using the chosen w* as the underlying cardinal rating, flipping pairs according to the noise level. Finally, the maximum likelihood estimator for the BTL model is employed to estimate w̃. We are interested in the effect of three variables: the total number of pairs available, p; the total number of videos, N; and the noise level, σ. First we examine how performance (as measured by R2 and Accuracy on test data) progresses for fixed values of N and σ when varying the number of pairs p. According to [14], with no noise and error, the minimum number of pairs needed for exactly recovering the original ordering of the data is N log N. This prompted us to vary p as a multiple of N log N. We define the parameter α = p/(N log N). The results are shown in Figures 3
and 7. This allows us, for a given level of reconstruction accuracy (e.g. 0.95) or R2 (e.g. 0.9), to determine the number of pairs needed. We then fix p and σ and observe how performance progresses with N (Figures 6 and 8).
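The small-world pair selection can be sketched as a Watts-Strogatz-style construction. This is our illustrative interpretation of the construction of [32] (the exact procedure used for the challenge may differ): each video is first connected to its k nearest neighbours on a ring, then edges are rewired with probability beta to random targets, creating the shortcuts that keep the comparison graph well connected.

```python
import random

def small_world_pairs(n, k, beta=0.1, seed=0):
    """Generate comparison pairs from a Watts-Strogatz-style graph.

    n    : number of videos (nodes)
    k    : even number of ring-lattice neighbours per node
    beta : probability of rewiring an edge to a random target
    """
    rng = random.Random(seed)
    pairs = set()
    for j in range(n):
        for step in range(1, k // 2 + 1):
            target = (j + step) % n             # nearest neighbour on the ring
            if rng.random() < beta:             # rewire to a random shortcut
                target = rng.randrange(n)
                while target == j or (j, target) in pairs or (target, j) in pairs:
                    target = rng.randrange(n)
            pairs.add((j, target))
    return sorted(pairs)
```

The construction yields close to n*k/2 edges, so the comparison budget p is controlled directly through k, while the rewired shortcuts keep the graph diameter small.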
3 Results and Discussion
In this section, we examine performance in terms of test-set R2 and Accuracy for reconstructing the cardinal scores and recovering the correct pairwise ratings when noise is applied at various levels in the BTL model.
278 279 280 281
281
282
282 283
1
283
284
0.9
284
286 287 288
R2 Score
285
289 290
285
0.8 0.7
287
N = 2500
0.5
N = 10000
289
R2 = 0.9
290
N = 5000
0.4 0
292
0.5
293
295
286
N = 1250
0.6
291
294
N = 625
1
1.5 2 α = p/(N logN )
2.5
3
288
291 292 293
Fig. 3: Evolution of R2 for different α with noise level σ = 1.
Fig. 4: Evolution of α*, the α at which R2 = 0.9, as a function of N, for small-world pairs with and without noise and for random pairs with noise (σ = 1).
3.1 Number of pairs needed
We recall that one of the goals of our experiments was to figure out scaling laws for the number of pairs p as a function of N for various levels of noise. From theoretical analyses, we expected that p would scale with N log N rather than N². In a first set of experiments, we fixed the noise level at σ = 1. We were pleased to see in Figures 3 and 7 that our two scores (the R2 and Accuracy) in fact increase with α = p/(N log N). This indicates that our presumed scaling law is, in fact, pessimistic. To determine an empirical scaling law, we fixed a desired value of R2 (0.9, see the horizontal line in Figure 3). We then plotted the five points resulting from the intersection of the curves with the horizontal line as a function of N, obtaining the red curve in Figure 4. Two other curves are shown for comparison: the blue curve is obtained without noise and the brown curve with the small-world heuristic. All three curves present a quasi-linear decrease of α with N with the same slope. From this we infer that α = p/(N log N) ≈ α0 − 4 × 10⁻⁵ N, and thus we obtain the following empirical scaling law of p as a function of N: p = α0 N log N − 4 × 10⁻⁵ N² log N. In this formula, the intercept α0 changes with the various conditions (choice of pairs and noise), but the scaling law remains the same. A similar scaling law is obtained if we use Accuracy rather than R2 as the score.
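In code, the empirical scaling law above reads as follows. The intercept value alpha0 = 2.57 in the example is a hypothetical illustration (it is not stated explicitly in the text); logarithms are natural, consistent with the α values quoted in Section 3.3.

```python
import math

def pairs_needed(n, alpha0):
    """Empirical scaling law: p = alpha0 * N log N - 4e-5 * N^2 log N."""
    return (alpha0 - 4e-5 * n) * n * math.log(n)

# With a hypothetical intercept alpha0 = 2.57, N = 10,000 videos need
# (2.57 - 0.4) * N log N = 2.17 * N log N, i.e. roughly 200,000 pairs.
print(round(pairs_needed(10_000, 2.57)))
```

Note that the correction term grows as N² log N, so for very large N the formula eventually stops being meaningful; it is an empirical fit over the range of N studied here (625 to 10,000).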
3.2 Small-world heuristic
Our experiments indicate that the small-world heuristic yields an increase in performance compared to a random choice of pairs (Figure 4). We therefore adopted it in all other experiments.
3.3 Experiment budget
In the introduction, we indicated that our budget to pay AMT workers would cover at least p = 200,000 pairs. However, the efficiency of our data collection setting reduced the cost per elementary task, and we ended up labeling p = 321,684 pairs within our budget. For our N = 10,000 videos, this corresponds to α = p/(N log N) = 3.49. We see in Figure 4 that, for N = 10,000 videos, in all cases examined, the α required to attain R2 = 0.9 is lower than 2.17; therefore, our budget was sufficient to obtain this level of accuracy. Furthermore, we varied the noise level in Figures 6 and 8. In these plots, we selected a smaller value of α than what our monetary budget could afford (α = 1.56). Even at that level, we can see that we have a sufficient number of pairs to achieve R2 = 0.9 for all levels of noise and all values of N considered. We also achieve an accuracy near 0.95 for N = 10,000 for all levels of noise considered. As expected, a larger σ requires a larger number of pairs to achieve the same level of R2 or Accuracy.
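The α values above follow directly from the definition α = p/(N log N) with natural logarithms, as a quick check confirms (the helper name is ours):

```python
import math

def alpha(p, n):
    """Normalized pair budget: alpha = p / (N log N), natural log."""
    return p / (n * math.log(n))

print(round(alpha(321_684, 10_000), 2))  # pairs actually labeled: 3.49
print(round(alpha(200_000, 10_000), 2))  # original budget estimate: 2.17
```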
3.4 Computational time
Fig. 5: Evolution of running time for different α and N with noise and σ = 1 on log scale.
One of the feasibility aspects of using ordinal ranking concerns computational time. Given that collecting and annotating data takes months of work, any computational time ranging from a few hours to a few days would be reasonable. However, to be able to run systematic experiments, we optimized our algorithm sufficiently that every experiment we performed took less than three hours. Our implementation, which uses Newton's conjugate gradient algorithm [34], was made publicly available on GitHub2. In Figure 5 we see that the log of the running time increases quite rapidly with α at the beginning and then almost linearly. We also see that the log of the running time increases linearly with N for any fixed value of α. In the case of our data collection, we were interested in α = 2.17 (see the previous section), which corresponds to using 200,000 pairs for 10,000 videos (our original estimate). For this value of α, we were pleased to see that the calculation of the cardinal labels would take less than three hours. This reassured us of the feasibility of using this method for our particular application.

3.5 Experiments on real data
The data collection process included collecting labels from AMT workers. Each worker followed the protocol described in Section 2 (see Figure 1). We obtained 321,684 pairs of real human votes for each trait; 300,000 pairs were used for training and the remaining 21,684 pairs for testing. This corresponds to α = 3.26 for training.3
2 https://github.com/andrewcby/Speed-Interview
3 These experiments concern only cardinal label reconstruction; they have nothing to do with the pattern recognition task from the videos, for which a different split between training/validation/test sets was used for the challenge.
Fig. 6: Evolution of R2 for different σ with α = 1.56, a value that guarantees R2 ≥ 0.9 when σ = 1.
Fig. 7: Evolution of Accuracy for different α with noise level σ = 1.
Fig. 8: Evolution of accuracy for different σ with α = 1.56, a value that guarantees accuracy ≥ 0.9 when σ = 1.
We ran our cardinal score reconstruction algorithm on this data set and computed the test accuracy. The results, shown in Table 1, give test accuracies between 0.66 and 0.73 for the various traits. Such reconstruction accuracies are significantly worse than those predicted by our simulated experiments: in Figure 7, the accuracies for α > 3 are above 0.95.
Several factors can explain these lower reconstruction accuracies:
1. Use of “noisy” ground truth estimation in real data to compute the target ranking in the accuracy calculation. The overly optimistic estimation of the accuracy in simulations stems in part from using exact ground truth, which is not available in real data. In real data, we compared the human ranking and the ranking reconstructed by the BTL model on test data. This may at least double the variance, one source of error being introduced when estimating the cardinal scores and the other when estimating the accuracy using pair reconstruction with “noisy” real data.
2. Departure of the real label distribution from the uniform distribution. We carried out complementary simulations with a Gaussian distribution of labels instead of a uniform one (closer to a natural distribution) and observed a decrease of 6% in accuracy and of 7% in R2.
3. Departure of the real noise distribution from the BTL model. We evaluated the validity of the BTL model by comparing its results to those of a simple baseline method introduced in [35]. This method consists in averaging the ordinal ratings for each video (counting +1 each time it is rated higher than another video and −1 each time it is rated lower). The performance of the BTL model is consistently better across all traits, based on the one-sigma error bars calculated with 30 repeated experiments. Therefore, even though the baseline method is considerably simpler and faster, it is worth running the BTL model to estimate the cardinal ratings. Unfortunately, there is no way to quantitatively estimate the effect of this factor.
4. Under-estimation of the intrinsic noise level (random inconsistencies in rating the same video pair by the same worker). We evaluated σ in the BTL model using bootstrap re-sampling of the video pairs. With an increasing level of σ, the results consistently decrease, as shown in Figure 8. Therefore, the parameters we chose for the simulation model proved to be optimistic and underestimated the intrinsic noise level.
5. Sources of bias not accounted for. We only took into account a global source of bias, not stratified sources of bias such as gender bias and racial bias. These are voter-specific factors that we did not take into consideration when setting up the simulation. As this kind of bias is hard to measure, especially quantitatively, it can negatively affect the accuracy of the prediction.
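The baseline of factor 3, averaging ordinal ratings per video, can be sketched as follows (a minimal sketch of the method of [35]; the function name is ours):

```python
import numpy as np

def ordinal_averaging(pairs, y, n):
    """Score each video by the average of its ordinal outcomes:
    +1 each time it is rated higher than the other video in a pair,
    -1 each time it is rated lower."""
    total = np.zeros(n)
    count = np.zeros(n)
    for (j, k), yi in zip(pairs, y):
        total[j] += yi      # y = +1 means video j won the comparison
        total[k] -= yi
        count[j] += 1
        count[k] += 1
    # videos never compared keep a score of 0
    return np.divide(total, count, out=np.zeros(n), where=count > 0)
```

Unlike the BTL fit, this baseline requires no optimization, which is why it is considerably simpler and faster, at the cost of the lower accuracies reported in Table 1.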
4 Discussion and conclusion
In this paper we evaluated the viability of an ordinal rating method based on labeling pairs of videos, a method intrinsically insensitive to (global) worker bias. Using simulations, we showed that it is in principle possible to accurately produce a cardinal rating by fitting the BTL model with maximum likelihood, using artificial data generated with this model. We calculated that it was possible
to remain within our financial budget of 200,000 pairs and incur a reasonable computational time (under 3 hours). However, although in simulations we pushed the model to levels of noise that we thought were realistic, the performance we attained with simulations (R2 = 0.9 and Accuracy = 0.95 on test data) turned out to be optimistic. Reconstruction of cardinal ratings from ordinal ratings on real data led to a lower level of accuracy (between 66% and 73%), showing that there are still other types of noise that are not reducible by the model. Future work can focus on methods to reduce this noise. Our financial budget and time constraints also did not allow us to conduct a comparison with direct cardinal rating. An ideal, but expensive, experiment would be to duplicate the ground truth estimation by using AMT workers to directly estimate cardinal ratings, within the same financial budget. Future work includes validating our labeling technique in this way on real data.
Acknowledgment
This work was supported in part by donations from Microsoft Research to prepare the personality trait challenge, by Spanish projects TIN2012-38187-C03-02 and TIN2013-43478-P, and by the European Commission Horizon 2020 project SEE.4C under call H2020-ICT-2015. We are grateful to Evelyne Viegas, Albert Clapés i Sintes, Hugo Jair Escalante, Ciprian Corneanu, Xavier Baró Solé, Cécile Capponi, and Stéphane Ayache for stimulating discussions. We are thankful to Prof. Alyosha Efros for his support and guidance.
Table 1: Estimation accuracy for 10,000 videos and 321,684 pairs (3.49 × N log N).
Trait              BTL Model        Averaging ordinal ratings
                   Accuracy ± STD   Accuracy ± STD
Extraversion       0.692 ± 0.027    0.575 ± 0.095
Agreeableness      0.720 ± 0.025    0.533 ± 0.087
Conscientiousness  0.669 ± 0.032    0.559 ± 0.092
Neuroticism        0.706 ± 0.022    0.549 ± 0.084
Openness           0.735 ± 0.021    0.542 ± 0.089
References
1. Marcos-Ramiro, A., Pizarro-Perez, D., Marron-Romera, M., Nguyen, L., Gatica-Perez, D.: Body communicative cue extraction for conversational analysis. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. (April 2013) 1–8
2. Aran, O., Gatica-Perez, D.: One of a kind: Inferring personality impressions in meetings. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction. ICMI, New York, NY, USA, ACM (2013) 11–18
3. : Chalearn LAP 2016: First round challenge on first impressions - dataset and results
4. Escalera, S., Gonzàlez, J., Baró, X., Pardo, P., Fabian, J., Oliu, M., Escalante, H.J., Huerta, I., Guyon, I.: Chalearn looking at people 2015 new competitions: Age estimation and cultural event recognition. In: 2015 International Joint Conference on Neural Networks (IJCNN). (July 2015) 1–8
5. Venanzi, M., Guiver, J., Kazai, G., Kohli, P., Shokouhi, M.: Community-based bayesian aggregation models for crowdsourcing. In: Proceedings of the 23rd International Conference on World Wide Web. WWW '14, New York, NY, USA, ACM (2014) 155–164
6. Miller, J., Haden, P.: Statistical Analysis with The General Linear Model. (2006)
7. Shah, N., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., Wainwright, M.: Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. CoRR abs/1505.01462 (2015)
8. Whitehill, J., Wu, T.f., Bergsma, J., Movellan, J.R., Ruvolo, P.L.: Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A., eds.: Advances in Neural Information Processing Systems 22. Curran Associates, Inc. (2009) 2035–2043
9. Welinder, P., Branson, S., Perona, P., Belongie, S.J.: The multidimensional wisdom of crowds. In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A., eds.: Advances in Neural Information Processing Systems 23. (2010) 2424–2432
10. Welinder, P., Perona, P.: Online crowdsourcing: rating annotators and obtaining cost-effective labels. In: Workshop on Advancing Computer Vision with Humans in the Loop. (2010)
11. Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L.: Learning from crowds. J. Mach. Learn. Res. 11 (August 2010) 1297–1322
12. Kamar, E., Hacker, S., Horvitz, E.: Combining human and machine intelligence in large-scale crowdsourcing. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. AAMAS '12, Richland, SC, International Foundation for Autonomous Agents and Multiagent Systems (2012) 467–474
13. Bachrach, Y., Graepel, T., Minka, T., Guiver, J.: How To Grade a Test Without Knowing the Answers — A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing. ArXiv e-prints (2012)
14. Bradley, R., Terry, M.: Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika 39 (1952) 324–345
15. Thurstone, L.L.: A law of comparative judgment. Psychological Review 34(4) (1927) 273
16. Shah, N.B., Balakrishnan, S., Guntuboyina, A., Wainwright, M.J.: Stochastically transitive models for pairwise comparisons: Statistical and computational issues. arXiv preprint arXiv:1510.05610 (2015)
17. Herbrich, R., Minka, T., Graepel, T.: TrueSkill: A Bayesian skill rating system. Advances in Neural Information Processing Systems 19 (2007) 569
18. Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopés, O., Guyon, I., Athitsos, V., Escalante, H.J.: Multi-modal gesture recognition challenge 2013: Dataset and results. In: ChaLearn Multi-Modal Gesture Recognition Workshop, ICMI. (2013)
19. Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Guyon, I., Athitsos, V., Escalante, H., Argyros, A., Sminchisescu, C., Bowden, R., Sclaroff, S.: Chalearn multi-modal gesture recognition 2013: grand challenge and workshop summary. ICMI (2013) 365–368
20. Escalera, S., Baró, X., Gonzàlez, J., Bautista, M., Madadi, M., Reyes, M., Ponce-López, V., Escalante, H., Shotton, J., Guyon, I.: Chalearn looking at people challenge 2014: Dataset and results. (2014)
21. Escalera, S., Gonzàlez, J., Baró, X., Pardo, P., Fabian, J., Oliu, M., Escalante, H.J., Huerta, I., Guyon, I.: Chalearn looking at people 2015 new competitions: Age estimation and cultural event recognition. In: IJCNN. (2015)
22. Baró, X., Gonzàlez, J., Fabian, J., Bautista, M., Oliu, M., Escalante, H., Guyon, I., Escalera, S.: Chalearn looking at people 2015 challenges: action spotting and cultural event recognition. In: ChaLearn LAP Workshop, CVPR. (2015)
23. Escalera, S., Fabian, J., Pardo, P., Baró, X., Gonzàlez, J., Escalante, H., Misevic, D., Steiner, U., Guyon, I.: Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results. In: International Conference in Computer Vision, ICCVW. (2015)
24. Escalera, S., Athitsos, V., Guyon, I.: Challenges in multimodal gesture recognition. Journal of Machine Learning Research (2016)
25. Escalera, S., Gonzàlez, J., Baró, X., Shotton, J.: Special issue on multimodal human pose recovery and behavior analysis. IEEE Trans. Pattern Analysis and Machine Intelligence (2016)
26. Park, G., Schwartz, H., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., Seligman, M.: Automatic personality assessment through social media language. Journal of Personality and Social Psychology 108 (2014) 934–952
27. Ponce-López, V., Escalera, S., Baró, X.: Multi-modal social signal analysis for predicting agreement in conversation settings. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction. ICMI, New York, NY, USA, ACM (2013) 495–502
28. Ponce-López, V., Escalera, S., Pérez, M., Janés, O., Baró, X.: Non-verbal communication analysis in victim-offender mediations. Pattern Recognition Letters 67, Part 1 (2015) 19–27, Cognitive Systems for Knowledge Discovery
29. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. (June 2008) 1–8
30. Pentland, A.: Honest Signals: How They Shape Our World. The MIT Press, Massachusetts (2008)
31. Goldberg, L.: The structure of phenotypic personality traits. (1993)
32. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684) (1998) 409–410
33. Humphries, M., Gurney, K., Prescott, T.: The brainstem reticular formation is a small-world, not scale-free, network. Proceedings of the Royal Society of London B: Biological Sciences 273(1585) (2006) 503–511
34. Knoll, D.A., Keyes, D.E.: Jacobian-free Newton-Krylov methods: a survey of approaches and applications. Journal of Computational Physics 193 (January 2004) 357–397
35. Shah, N.B., Wainwright, M.J.: Simple, robust and optimal ranking from pairwise comparisons. arXiv preprint arXiv:1512.08949 (2015)