Overcoming Calibration Problems in Pattern Labeling with Pairwise Ratings: Application to Personality Traits

Baiyu Chen4, Sergio Escalera1,2,3, Isabelle Guyon3,5, Víctor Ponce-López1,2,6, Nihar Shah4, and Marc Oliu Simón6

1 Computer Vision Center, Campus UAB, Barcelona; 2 Dept. Mathematics and Computer Science, University of Barcelona; 3 ChaLearn, California, USA; 4 University of California Berkeley, California, USA; 5 U. Paris-Saclay, France; 6 EIMT at the Open University of Catalonia, Barcelona.

Abstract. We address the problem of calibration of workers whose task is to label patterns with continuous variables, which arises for instance in labeling images or videos of humans with continuous traits. Worker bias is particularly difficult to evaluate and correct when many workers contribute just a few labels each, a situation which typically arises when labeling is crowd-sourced. In the scenario of labeling short videos of people facing a camera with personality traits, we evaluate the feasibility of the pairwise ranking method to alleviate bias problems. Workers are exposed to pairs of videos at a time and must order them by preference. The variable levels are reconstructed by fitting a Bradley-Terry-Luce model with maximum likelihood. This method may at first sight seem prohibitively expensive because, for N videos, up to P = N(N − 1)/2 pairs must potentially be processed by workers rather than N videos. However, by performing extensive simulations, we determine an empirical law for the scaling of the number of pairs needed as a function of the number of videos in order to achieve a given accuracy of score reconstruction, and show that the pairwise method is affordable. We apply the method to the labeling of a large-scale dataset of 10,000 videos used in the ChaLearn Apparent Personality Trait challenge.


Keywords: Calibration of labels, Label bias, Ordinal labeling, Variance Models, Bradley-Terry-Luce model, Continuous labels, Regression, Personality traits, Crowd-sourced labels.

1 Introduction

ECCV-16 submission ID 9

Computer vision problems often involve labeled data with continuous values (regression problems). This includes job interview assessments [1], personality analysis [2, 3], and age estimation [4], among others. To acquire continuous labeled data, it is often necessary to hire professionals who have been trained in the task of visually examining image or video patterns. For example, the data collection that motivated this study requires the labeling of 10,000 short videos with personality traits on a scale of −5 to 5. Because of the limited availability of trained professionals, one often resorts to the "wisdom of crowds" and hires a large number of untrained workers whose proposed labels are averaged to reduce variance. A service frequently used for crowd-sourced labeling is Amazon Mechanical Turk1 (AMT). In this paper, we work on the problem of obtaining accurate labels for continuous target variables under time and budgetary constraints. The variance between labels obtained by crowd-sourcing stems from several factors, including the intrinsic variability of a single worker's labeling (who, due to fatigue or lapses in concentration, may be inconsistent with his/her own assessments) and the bias a worker may have (his/her propensity to over-rate or under-rate, e.g., a given personality trait). Intrinsic variability is often referred to as "random error", while "bias" is referred to as "systematic error". The problem of intrinsic variability can be alleviated by pre-selecting workers for their consistency and by shortening labeling sessions to reduce worker fatigue. The problem of bias reduction is the central subject of this paper.

Reducing bias has been tackled in various ways in the literature. Beyond simple averaging, aggregation models using confusion matrices have been considered for classification problems with binary or categorical labels (e.g., [5]). Aggregating continuous labels is reminiscent of Analysis of Variance (ANOVA) models and factor analysis (see, e.g., [6]) and has been generalized with the use of factor graphs [5]. Such methods are referred to in the literature as "cardinal" methods, to distinguish them from the "ordinal" methods we consider in this paper. Ordinal methods require that workers rank patterns as opposed to rating them.
Typically, a pair of patterns A and B is presented to a worker, who is asked to judge whether value(A) < value(B), for instance extroverted(A) < extroverted(B). Ordinal methods are by design immune to additive biases (at least global biases, as opposed to discriminative biases such as gender or race bias). Because of their built-in insensitivity to global biases, ordinal methods are well suited when many workers each contribute only a few labels [7]. In addition, a large body of literature [8–13] shows evidence that ordinal feedback is easier to provide than cardinal feedback for untrained workers. In preliminary experiments we conducted ourselves, workers were also more engaged and less easily bored when they had to make comparisons rather than rate single items. In the applications we consider, however, the end goal is to obtain a cardinal rating for every pattern (such as the level of friendliness). To that end, pairwise comparisons must be converted into cardinal ratings so as to obtain the desired labels. Various models have been proposed in the literature, including the Bradley-Terry-Luce (BTL) model [14], the Thurstone class of models [15], and non-parametric models based on stochastic transitivity assumptions [16]. Such methods are commonly used, for instance, to convert tournament wins in chess to ratings, and in online video games such as Microsoft's Xbox [17]. In this paper, we present experiments performed with the Bradley-Terry-Luce (BTL) model [14], which provided us with satisfactory results. By performing simulations, we demonstrate the viability of the method within the time and budget constraints of our data collection.

1. https://www.mturk.com/


Contribution. For a given target accuracy of cardinal rating reconstruction, we determine the practical economic feasibility of running such a data labeling effort, and its practical computational feasibility, by running extensive numerical experiments with artificial and real sample data from the problem at hand. We investigate the advantages of our proposed method from the scalability, noise-resistance, and stability points of view. We derive an empirical scaling law for the number of pairs necessary to achieve a given level of accuracy of cardinal rating reconstruction for a given number of videos. We provide a fast implementation of the method using Newton's conjugate gradient algorithm, which we make publicly available on Github. We propose a novel design for the choice of pairs based on small-world graph connectivity and experimentally demonstrate its superiority over random selection of pairs.

2 Problem Formulation

2.1 Application Setting: The Design of a Challenge

The main focus of this research is the organization of a pattern recognition challenge in the ChaLearn Looking at People (LAP) series [18–25], which is being run for ECCV 2016 [3] and ICPR 2016. This paper provides a methodology, which we are using in our challenge on automatic personality trait analysis from video data [26]. The automatic analysis of videos to characterize human behavior has become an area of active research with a wide range of applications [1, 2, 27, 28]. Research advances in computer vision and pattern recognition have led to methodologies that can successfully recognize consciously executed actions or intended movements, for instance gestures, actions, and interactions with objects and other people [29]. However, much remains to be done in characterizing sub-conscious behaviors [30], which may be exploited to reveal aptitudes or competence, hidden intentions, and personality traits. Our present research focuses on a quantitative evaluation of personality traits represented by a numerical score for a number of well-established psychological traits known as the "big five" [31]: extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience. Personality refers to individual differences in characteristic patterns of thinking, feeling, and behaving. Characterizing personality automatically from video analysis is far from a trivial task, because perceiving personality traits is difficult even for professionally trained psychologists and recruiting specialists. Additionally, quantitatively assessing personality traits is challenging due to the subjectivity of assessors and the lack of precise metrics.

We are organizing a challenge on "first impressions", in which participants will develop solutions for recognizing personality traits of subjects from a short video sequence of the person facing the camera. This work could become very relevant to training young people to present themselves better by changing their behavior in simple ways, since the first impression made is very important in many contexts, such as job interviews. We made available a large, newly collected data set sponsored by Microsoft Research of 10,000 15-second videos collected from YouTube, annotated with the "big five" personality traits by AMT workers (see the data collection interface in Figure 1). We budgeted 20,000 USD for labeling the 10,000 videos. We originally estimated that by paying 10 cents per rating of a video pair (a conservative estimate of the cost per task), we could afford rating 200,000 pairs. This paper presents the methodology we used to evaluate whether this budget would allow us to accurately estimate the cardinal ratings, which we support by numerical experiments on artificial data. Furthermore, we investigated the computational feasibility of running maximum likelihood estimation of the BTL model for such a large number of videos. Since this methodology is general, it could be used in other contexts.

Fig. 1: Data collection interface. The AMT workers must indicate their preference for five attributes representing the “big five” personality traits.

2.2 Model Definition

Our problem is parameterized as follows. Given a collection of N videos, each video has a trait with a value in [−5, 5] (this range is arbitrary; other ranges can be chosen). We treat each trait separately; in what follows, we consider a single trait. We require that only p pairs be labeled by the AMT workers out of the P = N(N − 1)/2 possible pairs. For scaling reasons that we explain later, p is normalized by N log N to obtain the parameter α = p/(N log N). We consider a model in which the ideal ranking may be corrupted by "noise", the noise representing errors made by the AMT workers (controlled by a parameter σ). The three parameters α, N, and σ fully characterize our experimental setting, depicted in Figure 2, which we now describe.

Let w∗ be the N-dimensional vector of "true" (unknown) cardinal ratings (e.g., of videos) and w̃ be the N-dimensional vector of estimated ratings obtained from the votes of workers after applying our reconstruction method based on pairwise ratings. We consider that i is the index of a pair of videos {j, k}, i = 1 : p, and that yi ∈ {−1, 1} represents the ideal ordinal rating (+1 if w∗j > w∗k and −1 otherwise, ignoring ties). We use the notation xi to represent a special kind of indicator vector, which has value +1 at position j, −1 at position k, and zero otherwise, such that ⟨xi, w∗⟩ = w∗j − w∗k. We formulate the problem as estimating the cardinal rating values of all videos based on p independent samples of ordinal ratings yi ∈ {−1, 1} coming from the distribution:

P[yi = 1 | xi, w∗] = F(⟨xi, w∗⟩ / σ),

where F is a known function with values in [0, 1] and σ is the noise parameter. We use the Bradley-Terry-Luce model, which is the special case where F is the logistic function, F(t) = 1/(1 + exp(−t)). In our simulated experiments, we first draw the cardinal ratings w∗j uniformly in [−5, 5], then we draw p pairs randomly as training data and apply noise to get the ordinal ratings yi. As test data, we draw another set of p pairs from the remaining data. It can be verified that the likelihood function of the BTL model is log-concave. We simply use the maximum likelihood method to estimate the cardinal rating values and get our estimate w̃. This method should lead to a single global optimum for such a convex optimization problem.
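The maximum-likelihood fit described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' released code: the function and variable names are ours, and the tiny ridge term is a practical addition to keep the optimum bounded (and unique, since the likelihood only depends on score differences) when an item happens to win every comparison it appears in.

```python
import numpy as np
from scipy.optimize import minimize

def fit_btl(pairs, y, n, sigma=1.0):
    """Maximum-likelihood BTL scores from p ordinal ratings.
    pairs: (p, 2) int array of item indices (j, k); y: (p,) array in {-1, +1},
    +1 meaning item j was preferred over item k."""
    j, k = pairs[:, 0], pairs[:, 1]

    def nll(w):
        d = y * (w[j] - w[k]) / sigma           # signed score differences
        # sum of -log F(d) in a numerically stable form, plus a tiny ridge
        return np.sum(np.logaddexp(0.0, -d)) + 1e-6 * np.dot(w, w)

    def grad(w):
        d = y * (w[j] - w[k]) / sigma
        g = -y / (sigma * (1.0 + np.exp(d)))    # per-pair gradient contribution
        out = 2e-6 * w                          # ridge gradient
        np.add.at(out, j, g)                    # accumulate onto item j ...
        np.add.at(out, k, -g)                   # ... and with opposite sign on k
        return out

    # The problem is convex, so Newton-CG (as used in the paper) reaches the
    # global optimum from any starting point.
    return minimize(nll, np.zeros(n), jac=grad, method="Newton-CG").x
```

Because the likelihood is invariant to a common shift of all scores, the returned estimate must be calibrated (shift and scale) before comparing it to the true ratings, as the paper does when computing R2.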

2.3 Evaluation


To evaluate the accuracy of our cardinal rating reconstruction, we use two different scores (computed on test data):

Coefficient of Determination (R2). We use the coefficient of determination to measure how well w̃ reconstructs w∗. The residual sum of squares is defined as SSres = Σi (wi∗ − w̃i)². The total sum of squares is defined as SSvar = Σi (wi∗ − w̄∗)², where w̄∗ denotes the average rating. The coefficient of determination is then R2 = 1 − SSres/SSvar. Note that since the wi∗ are on an arbitrary scale [−5, +5], we must normalize the w̃i before computing the R2; this is achieved by finding the optimal shift and scale that maximize the R2.

Test-accuracy. We define test Accuracy as the fraction of pairs correctly reoriented using w̃ among the test data pairs, i.e., those pairs not used for estimating w̃.

Fig. 2: Workflow diagram.

2.4 Experiment Design

In our simulations, we follow the workflow of Figure 2. We first generate a score vector w∗ using a uniform distribution in [−5, 5]^N. Once w∗ is chosen, we select training and test pairs.

One original contribution of our paper is the choice of pairs. We propose to use a small-world graph construction method to generate the pairs [32]. Small-world graphs provide high connectivity, avoid disconnected regions in the graph, have well-distributed edges, and minimize the distance between nodes [33]. An edge is selected at random from the underlying graph, and the chosen edge determines the pair of items compared. We compare the small-world strategy for drawing pairs with drawing pairs at random from a uniform distribution, which according to [7] yields near-optimal results. The ordinal ratings of the pairs are generated with the BTL model using the chosen w∗ as the underlying cardinal rating, flipping pairs according to the noise level. Finally, the maximum likelihood estimator for the BTL model is employed to estimate w̃.

We are interested in the effect of three variables: the total number of pairs available, p; the total number of videos, N; and the noise level, σ. First we examine performance (as measured by R2 and Accuracy on test data) for fixed values of N and σ, varying the number of pairs p. According to [14], with no noise and no error, the minimum number of pairs needed for exactly recovering the original ordering of the data is N log N. This prompted us to vary p as a multiple of N log N. We define the parameter α = p/(N log N). The results are shown in Figures 3 and 7. This allows us, for a given level of reconstruction Accuracy (e.g., 0.95) or R2 (e.g., 0.9), to determine the number of pairs needed. We then fix p and σ and observe how performance progresses with N (Figures 6 and 8).
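The small-world pair-selection step can be sketched as follows. The paper does not specify the exact construction parameters, so the Watts-Strogatz-style lattice degree k and rewiring probability beta below, like the function name, are illustrative assumptions rather than the authors' settings:

```python
import numpy as np

def small_world_pairs(n_items, n_pairs, k=4, beta=0.1, seed=0):
    """Draw comparison pairs as edges of a Watts-Strogatz-style small-world
    graph: a ring lattice with k nearest neighbours per node, where each
    lattice edge is rewired to a random node with probability beta."""
    rng = np.random.default_rng(seed)
    edges = set()
    for i in range(n_items):
        for offset in range(1, k // 2 + 1):
            j = (i + offset) % n_items          # ring-lattice neighbour
            if rng.random() < beta:             # rewire with probability beta
                j = int(rng.integers(n_items))
            if j != i:
                edges.add((min(i, j), max(i, j)))
    while len(edges) < n_pairs:                 # top up with uniformly random
        a = int(rng.integers(n_items))          # pairs if more are requested
        b = int(rng.integers(n_items))          # than the lattice provides
        if a != b:
            edges.add((min(a, b), max(a, b)))
    all_edges = sorted(edges)
    chosen = rng.choice(len(all_edges), size=n_pairs, replace=False)
    return [all_edges[i] for i in chosen]
```

Each returned edge corresponds to one comparison shown to a worker; the random-pair baseline simply replaces the lattice construction by uniform sampling of distinct pairs.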

3 Results and Discussion


In this section, we examine performance in terms of test-set R2 and Accuracy for reconstructing the cardinal scores and recovering the correct pairwise ratings when noise is applied at various levels in the BTL model.

Fig. 3: Evolution of R2 for different α with noise level σ = 1.

Fig. 4: Evolution of α∗ (the α at which R2 = 0.9) with and without noise, σ = 1. Curves: random pairs with noise, small-world pairs with noise, small-world pairs without noise.

3.1 Number of pairs needed


We recall that one of the goals of our experiments was to figure out scaling laws for the number of pairs p as a function of N at various noise levels. From theoretical analyses, we expected p to scale with N log N rather than N². In a first set of experiments, we fixed the noise level at σ = 1. We were pleased to see in Figures 3 and 7 that our two scores (R2 and Accuracy) in fact increase with α = p/(N log N). This indicates that our presumed scaling law is, in fact, pessimistic. To determine an empirical scaling law, we fixed a desired value of R2 (0.9; see the horizontal line in Figure 3). We then plotted the five points resulting from the intersection of the curves with the horizontal line as a function of N to obtain the red curve in Figure 4. Two other curves are shown for comparison: the blue curve is obtained without noise, and the brown curve with an initialisation with the small-world heuristic. All three curves present a quasi-linear decrease of α with N with the same slope. From this we infer that α = p/(N log N) ≈ α0 − 4 × 10−5 N, and thus we obtain the following empirical scaling law for p as a function of N:

p = α0 N log N − 4 × 10−5 N² log N.

In this formula, the intercept α0 changes with the conditions (choice of pairs and noise), but the scaling law remains the same. A similar scaling law is obtained if we use Accuracy rather than R2 as the score.
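The scaling law is easy to check against the paper's own numbers, using natural logarithms throughout: with N = 10,000, N log N ≈ 92,103, so the 321,684 pairs eventually collected correspond to α ≈ 3.49. The intercept α0 must be read off Figure 4 for the chosen noise and pair-selection condition; the value used below is only illustrative, not taken from the figure:

```python
import math

def pairs_needed(n, alpha0):
    """Empirical scaling law: p = alpha0 * N * log(N) - 4e-5 * N^2 * log(N),
    i.e. alpha = p / (N log N) decreases linearly in N with slope -4e-5."""
    return (alpha0 - 4e-5 * n) * n * math.log(n)

# alpha actually achieved by the collected data (about 3.49)
alpha_collected = 321_684 / (10_000 * math.log(10_000))

# pair budget predicted by the law for an illustrative intercept alpha0 = 2.6
budget = pairs_needed(10_000, alpha0=2.6)
```

The same computation with the 300,000 training pairs of the real-data experiment gives α ≈ 3.26, matching the value quoted later in the paper.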

3.2 Small-world heuristic

Our experiments indicate that the small-world heuristic yields an increase in performance compared to a random choice of pairs (Figure 4). We therefore adopted it in all other experiments.

3.3 Experiment budget

In the introduction, we indicated that our budget to pay AMT workers would cover at least p = 200,000 pairs. However, the efficiency of our data collection setting reduced the cost per elementary task, and we ended up labeling p = 321,684 pairs within our budget. For our N = 10,000 videos, this corresponds to α = p/(N log N) = 3.49. We see in Figure 4 that, for N = 10,000 videos, in all cases examined, the α required to attain R2 = 0.9 is lower than 2.17; therefore, our budget was sufficient to obtain this level of accuracy. Furthermore, we varied the noise level in Figures 6 and 8. In these plots, we selected a smaller value of α than what our monetary budget could afford (α = 1.56). Even at that level, we have a sufficient number of pairs to achieve R2 = 0.9 for all noise levels and all values of N considered. We also achieve an Accuracy near 0.95 for N = 10,000 for all noise levels considered. As expected, a larger σ requires a larger number of pairs to achieve the same level of R2 or Accuracy.

3.4 Computational time

Fig. 5: Evolution of running time (log scale) for different α and N, with noise level σ = 1.


One of the feasibility aspects of using ordinal ranking concerns computational time. Given that collecting and annotating data takes months of work, any computational time ranging from a few hours to a few days would be reasonable. However, to be able to run systematic experiments, we optimized our algorithm sufficiently that every experiment we performed took less than three hours. Our implementation, which uses Newton's conjugate gradient algorithm [34], was made publicly available on Github2. In Figure 5 we see that the log of the running time increases quite rapidly with α at first and then almost linearly. We also see that the log of the running time increases linearly with N for any fixed value of α. In the case of our data collection, we were interested in α = 2.17 (see the previous section), which corresponds to using 200,000 pairs for 10,000 videos (our original estimate). For this value of α, we were pleased to see that the calculation of the cardinal labels would take less than three hours. This comforted us in the feasibility of using this method for our particular application.

3.5 Experiments on real data


The data collection process included collecting labels from AMT workers. Each worker followed the protocol described in Section 2 (see Figure 1). We obtained 321,684 pairs of real human votes for each trait, which we divided into 300,000 pairs for training and the remaining 21,684 pairs for testing. This corresponds to α = 3.26 for training.3

2. https://github.com/andrewcby/Speed-Interview
3. These experiments concern only cardinal label reconstruction; they have nothing to do with the pattern recognition task from the videos, for which a different split between training/validation/test sets was used in the challenge.

Fig. 6: Evolution of R2 for different σ with α = 1.56, a value that guarantees R2 ≥ 0.9 when σ = 1.

Fig. 7: Evolution of Accuracy for different α with noise level σ = 1.

Fig. 8: Evolution of Accuracy for different σ with α = 1.56, a value that guarantees Accuracy ≥ 0.9 when σ = 1.


We ran our cardinal score reconstruction algorithm on this data set and computed the test accuracy. The results, shown in Table 1, give test accuracies between 0.66 and 0.73 for the various traits. Such reconstruction accuracies are significantly worse than those predicted by our simulated experiments: looking at Figure 7, the accuracies for α > 3 are larger than 0.95.

Several factors can explain these lower reconstruction accuracies:


1. Use of a "noisy" ground truth estimation on real data to compute the target ranking in the accuracy calculation. The overly optimistic estimation of the accuracy in simulations stems in part from using exact ground truth, which is not available in real data. On real data, we compared the human ranking and the BTL-model-reconstructed ranking on test data. This may account for at least a doubling of the variance, one source of error being introduced when estimating the cardinal scores and the other when estimating the accuracy using pair reconstruction with "noisy" real data.

2. Departure of the real label distribution from the uniform distribution. We carried out complementary simulations with a Gaussian distribution of labels instead of a uniform one (closer to a natural distribution) and observed a decrease of 6% in Accuracy and a decrease of 7% in R2.

3. Departure of the real noise distribution from the BTL model. We evaluated the validity of the BTL model by comparing its results to those produced with a simple baseline method introduced in [35]. This method consists in averaging the ordinal ratings for each video (counting +1 if it is rated higher than another video and −1 if it is rated lower). The performance of the BTL model is consistently better across all traits, based on the one-sigma error bar calculated over 30 repeated experiments. Therefore, even though the baseline method is considerably simpler and faster, it is worth running the BTL model for the estimation of cardinal ratings. Unfortunately, there is no way to quantitatively estimate the effect of this third factor.

4. Under-estimation of the intrinsic noise level (random inconsistencies in rating the same video pair by the same worker). We evaluated σ in the BTL model using bootstrap re-sampling of the video pairs. With an increasing level of σ, the results consistently decrease, as shown in Figure 8. Therefore the parameters we chose for the simulation model proved to be optimistic and underestimated the intrinsic noise level.

5. Sources of bias not accounted for. We only took into account a global source of bias, not stratified sources of bias such as gender bias or racial bias. These are voter-specific factors that we did not take into consideration when setting up the simulation. As this kind of bias is hard to measure, especially quantitatively, it can negatively influence the accuracy of the prediction.
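The baseline of factor 3 (averaging ordinal ratings per video) reduces to a simple win/loss tally normalized by the number of comparisons; a minimal sketch with the same pair/label layout as the rest of our examples (array names are ours, not from [35]):

```python
import numpy as np

def average_ordinal_ratings(pairs, y, n_items):
    """Baseline cardinal score: each video receives +1 for every pairwise
    comparison it wins and -1 for every one it loses, averaged over the
    number of comparisons it appears in."""
    scores = np.zeros(n_items)
    counts = np.zeros(n_items)
    np.add.at(scores, pairs[:, 0], y)     # item j gains y (+1 on a win)
    np.add.at(scores, pairs[:, 1], -y)    # item k gains -y
    np.add.at(counts, pairs[:, 0], 1)
    np.add.at(counts, pairs[:, 1], 1)
    return scores / np.maximum(counts, 1)  # avoid division by zero
```

Unlike the BTL fit, this requires no optimization at all, which is why it is so much faster; the price is that it ignores the strength of the opponents each video was compared against.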

4 Discussion and conclusion

In this paper we evaluated the viability of an ordinal rating method based on labeling pairs of videos, a method intrinsically insensitive to (global) worker bias. Using simulations, we showed that it is in principle possible to accurately produce a cardinal rating by fitting the BTL model with maximum likelihood, using artificial data generated with this model. We calculated that it was possible to remain within our financial budget of 200,000 pairs and incur a reasonable computational time (under 3 hours). However, although in simulations we pushed the model to levels of noise that we thought were realistic, the performance we attained in simulations (R2 = 0.9 or Accuracy = 0.95 on test data) turned out to be optimistic. Reconstruction of cardinal ratings from ordinal ratings on real data led to a lower level of accuracy (in the range of 69% to 73%), showing that there are still other types of noise that are not reducible by the model. Future work can focus on methods to reduce this noise. Our financial budget and time constraints also did not allow us to conduct a comparison with direct cardinal rating. An ideal, but expensive, experiment would be to duplicate the ground truth estimation by using AMT workers to directly estimate cardinal ratings within the same financial budget. Future work includes validating our labeling technique in this way on real data.

Acknowledgment

This work was supported in part by donations from Microsoft Research to prepare the personality trait challenge, by Spanish Projects TIN2012-38187-C03-02 and TIN2013-43478-P, and by the European Commission Horizon 2020 project SEE.4C under call H2020-ICT-2015. We are grateful to Evelyne Viegas, Albert Clapés i Sintes, Hugo Jair Escalante, Ciprian Corneanu, Xavier Baró Solé, Cécile Capponi, and Stéphane Ayache for stimulating discussions. We are thankful to Prof. Alyosha Efros for his support and guidance.

Table 1: Estimation accuracy for 10,000 videos and 321,684 pairs (3.49 × N log N).

Trait             | BTL Model (Accuracy ± STD) | Averaging ordinal ratings (Accuracy ± STD)
Extraversion      | 0.692 ± 0.027              | 0.575 ± 0.095
Agreeableness     | 0.720 ± 0.025              | 0.533 ± 0.087
Conscientiousness | 0.669 ± 0.032              | 0.559 ± 0.092
Neuroticism       | 0.706 ± 0.022              | 0.549 ± 0.084
Openness          | 0.735 ± 0.021              | 0.542 ± 0.089
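The pair budget in the caption can be checked against the exhaustive alternative. The snippet below assumes the constant 3.49 and the natural logarithm, as suggested by the caption; small rounding differences from the reported 321,684 are expected.

```python
import math

# Pair budget for N = 10,000 videos: empirical ~N log N law from the table
# caption vs. the exhaustive number of distinct pairs N(N-1)/2.
N = 10_000
exhaustive = N * (N - 1) // 2       # all possible pairs: 49,995,000
budget = 3.49 * N * math.log(N)     # ~321,000 pairs, under 1% of exhaustive
print(exhaustive, round(budget))
```

This makes concrete why the pairwise method is affordable: the number of comparisons collected is two orders of magnitude smaller than rating every pair.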

