TRECVID-2005 High-Level Feature task: Overview - TNO Publications

TRECVID-2005 High-Level Feature task: Overview
Wessel Kraaij (TNO) & Paul Over (NIST)

High-level feature task

o Goal: Build benchmark collection for detection methods
o Secondary goal: feature-indexing could help search/browsing
o Feature set selected from feature set used for annotation of development data (LSCOM-lite)
o Examples of thing/activity/person/location
o Collaborative development data annotation effort
    n Tools from CMU and IBM (new tool)
    n 39 features and about 100 annotators
    n multiple annotations of each feature for a given shot
o Range of frequencies in the common development data annotation

TRECVID 2005

True examples in the common training data

[Figure: number of true examples (y-axis, 0 to 7000) by feature number (x-axis, 38 to 47); chart labels: 13% and 2.3%]

High-level feature evaluation

o Each feature assumed to be binary: absent or present for each master reference shot
o Task: Find shots that contain a certain feature, rank them according to confidence measure, submit the top 2000
o NIST pooled submissions to depth 250
o Evaluate performance quality by measuring the average precision etc. of each feature detection method
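The ranking and scoring steps above can be sketched as follows. This is a minimal illustration of non-interpolated average precision over a ranked shot list cut off at 2000 results, not NIST's evaluation software. The function name `average_precision`, the shot identifiers, and the choice to divide by the total number of known true shots are assumptions made for this example.

```python
def average_precision(ranked_shots, true_shots, max_results=2000):
    """Non-interpolated average precision of one ranked list for one feature.

    ranked_shots: shot IDs ordered by decreasing confidence (hypothetical IDs).
    true_shots:   shot IDs judged to contain the feature.
    """
    true_shots = set(true_shots)
    if not true_shots:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Only the top `max_results` shots of a submission are scored.
    for rank, shot in enumerate(ranked_shots[:max_results], start=1):
        if shot in true_shots:
            hits += 1
            precision_sum += hits / rank  # precision at this true shot
    # Assumption: normalize by all known true shots, found or not.
    return precision_sum / len(true_shots)


# Made-up example: 5 submitted shots, 3 known true shots, 2 of them found.
ranked = ["shot1_3", "shot1_7", "shot2_1", "shot2_9", "shot3_4"]
truth = {"shot1_3", "shot2_1", "shot9_9"}
ap = average_precision(ranked, truth)
```

Here the true shots are found at ranks 1 and 3, so the score is (1/1 + 2/3) / 3.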

10 Features

38. People walking/running: segment contains video of more than one person walking or running (tv4: 35)
39. Explosion or fire: segment contains video of an explosion or fire
40. Map: segment contains video of a map
41. US flag: segment contains video of a US flag
42. Building exterior: segment contains video of the exterior of a building (tv3: 14)
43. Waterscape/waterfront: segment contains video of a waterscape or waterfront
44. Mountain: segment contains video of a mountain or mountain range with slope(s) visible
45. Prisoner: segment contains video of a captive person, e.g., imprisoned, behind bars, in jail, in handcuffs, etc.
46. Sports: segment contains video of any sport in action (tv3: 23)

Participants (22/42) (up from 12/33 in 2004)

Participant                                         Country
Bilkent University                                  Turkey
Carnegie Mellon University                          USA
CLIPS-IMAG, LSR-IMAG, Laboratoire LIS               France
Columbia University                                 USA
Fudan University                                    China
FX Palo Alto Laboratory                             USA
Helsinki University of Technology                   Finland
IBM                                                 USA
Imperial College London                             UK
Institut Eurecom                                    France
Johns Hopkins University                            USA
Language Computer Corporation (LCC)                 USA
LIP6-Laboratoire d'Informatique de Paris 6          France
Lowlands Team (CWI, Twente, U. of Amsterdam)        Netherlands
Mediamill Team (Univ. of Amsterdam)                 Netherlands
National ICT Australia                              Australia
National University of Singapore (NUS)              Singapore
SCHEMA-Univ. Bremen Team                            EU
Tsinghua University                                 China
University of Central Florida / University of Modena  USA, Italy
University of Electro-Communications                Japan
University of Washington                            USA

--SB -SB SB -SB SB ------SB --SB SB ---

LL -–-LL –-––-----LL LL --LL LL ---

HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL HL

SE SE -SE SE SE SE SE SE --SE -SE SE -SE SE SE SE ---


Who worked on which features

[Table: the 22 participating groups (rows) by feature number 38 to 47 (columns), marking the features each group worked on]

Number of runs of each training type

Tr-Type       2005          2004          2003
A             79 (71.8%)    45 (54.2%)    22 (36.7%)
B             24 (21.8%)    27 (32.5%)    20 (33.3%)
C             7 (6.3%)      11 (13.3%)    18 (30.0%)
Total runs    110           83            60

System training type:
A - Only on common dev. collection and the common annotation
B - Only on common dev. collection but not on (just) the common annotation
C - not of type A or B

AvgP by feature (all runs)

[Figure: average precision (y-axis) by feature number (x-axis) over all runs; the boxes show the middle half of the data, with the median marked]
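The "middle half of the data" and "median" shown per feature can be computed as in the sketch below. It is only an illustration: the function name `summarize_by_feature` and the scores are made up, and the quartile convention (Python's "inclusive" method) is an assumption, since the slides do not say how the quartiles were drawn.

```python
import statistics


def summarize_by_feature(avgp_by_feature):
    """Return {feature: (q1, median, q3)} over all runs for each feature.

    q1..q3 bounds the middle half of the scores; the quartile method used
    here is an assumption, not taken from the slides.
    """
    summary = {}
    for feature, scores in avgp_by_feature.items():
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summary[feature] = (q1, median, q3)
    return summary


# Made-up average precision scores of five runs for two features.
scores = {
    38: [0.05, 0.10, 0.15, 0.20, 0.25],
    40: [0.10, 0.30, 0.50, 0.70, 0.90],
}
summary = summarize_by_feature(scores)
```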

2005: AvgP by feature (top 10 runs)

38. People walking/running
39. Explosion/fire
40. Map
41. US flag
42. Building exterior
43. Waterscape/waterfront
44. Mountain
45. Prisoner
46. Sports
47. Car

[Figure: average precision by feature for the top 10 runs (ranked 1 to 10), with the median marked; the previous best result on CNN/ABC is shown for comparison]

2004: AvgP by feature (top 10 runs)

28. Boats/ships
29. M. Albright
30. B. Clinton
31. Trains
32. Beach
33. Basket score
34. Airplane takeoff
35. People walk/run
36. Phys. violence
37. Road

[Figure: average precision by feature for the top 10 runs (ranked 1 to 10), with the median marked]

2003: AvgP by feature (top 10 runs)

11. Indoors
12. News subject face
13. People
14. Building
15. Road
16. Vegetation
17. Animal
18. Female speech
19. Car/truck/bus
20. Aircraft
21. News subject monologue
22. Non-studio setting
23. Sporting event
24. Weather news
25. Zoom in
26. Physical violence
27. Madeleine Albright

[Figure: average precision by feature for the top 10 runs (ranked 1 to 10), with the median marked]

AvgP by feature (top 3 runs per feature)

38. People walking/running
39. Explosion/fire
40. Map
41. US flag
42. Building exterior
43. Waterscape/waterfront
44. Mountain
45. Prisoner
46. Sports
47. Car

[Figure: average precision of the top 3 runs for each feature, labeled by run]

Max AvgP by number of annotated training examples

[Figure: scatter plot of maximum average precision against the number of annotated training examples]

Median AvgP by number of annotated training examples

[Figure: scatter plot of median average precision against the number of annotated training examples]

Max AvgP by number of true shots found

[Figure: scatter plot of maximum average precision against the number of true shots found]

Median AvgP by number of true shots found

[Figure: scatter plot of median average precision against the number of true shots found]

% of true shots by source language for each feature

[Figure: bar chart; y-axis: % (0 to 100); x-axis: feature number, with bars for all test shots and for 38 People walking/running, 39 Explosion/fire, 40 Map, 41 US flag, 42 Building exterior, 43 Waterscape/waterfront, 44 Mountain, 45 Prisoner, 46 Sports, 47 Car]

True shots contributed uniquely by team for each feature

[Figure: 3-D bar chart of the number of unique true shots (0 to 40) by team and by feature (38 People walking/running to 47 Car)]
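One way to count the true shots that only a single team contributed is sketched below. The function name `unique_true_shots`, the team names, and the shot IDs are hypothetical; the slides do not describe how the counts were produced, so this is an assumed reading of "contributed uniquely" for one feature.

```python
from collections import Counter


def unique_true_shots(true_shots_by_team):
    """Count, per team, the true shots that no other team returned.

    true_shots_by_team: {team: set of true shot IDs that team submitted}
    for a single feature.
    """
    # How many teams returned each true shot.
    team_count = Counter()
    for shots in true_shots_by_team.values():
        team_count.update(set(shots))
    # A shot is unique to a team if exactly one team returned it.
    return {
        team: sum(1 for shot in set(shots) if team_count[shot] == 1)
        for team, shots in true_shots_by_team.items()
    }


# Made-up submissions for one feature.
found = {
    "team_a": {"s1", "s2", "s3"},
    "team_b": {"s2", "s4"},
    "team_c": {"s2", "s3"},
}
unique = unique_true_shots(found)
```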

Observations

o Participation almost doubled over 2004 (12 -> 22)
o Focus on category A runs (increased comparability)
o Scores are generally higher than in 2004 despite
    o new sources
    o errorful text from speech (via MT)
    o What does it mean?
        o Did anybody run last year's system on this year's task?
o Features were generally found in all language sources
o Top scores come from fewer systems/groups

To follow: overview of the systems with MAP > 0.16 (median)

o Only systems that were tested on all 10 features
o Only category A
o Runs were compared on MAP across 10 features
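The selection criteria above can be sketched as a small filter. The run records, field names (`name`, `type`, `avgp`), and the function `select_runs` are hypothetical; the sketch assumes MAP is the plain mean of the per-feature average precision over features 38 to 47.

```python
from statistics import mean

FEATURES = range(38, 48)  # the 10 features, numbered 38 to 47


def mean_average_precision(avgp_by_feature):
    """Mean of the per-feature average precision over all 10 features."""
    return mean(avgp_by_feature[f] for f in FEATURES)


def select_runs(runs, threshold=0.16):
    """Keep category A runs covering all 10 features with MAP above threshold."""
    selected = []
    for run in runs:
        covers_all = all(f in run["avgp"] for f in FEATURES)
        if run["type"] == "A" and covers_all:
            score = mean_average_precision(run["avgp"])
            if score > threshold:
                selected.append((run["name"], score))
    # Best run first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Made-up runs: only run_x meets all three criteria.
runs = [
    {"name": "run_x", "type": "A", "avgp": {f: 0.20 for f in FEATURES}},
    {"name": "run_y", "type": "B", "avgp": {f: 0.30 for f in FEATURES}},
    {"name": "run_z", "type": "A", "avgp": {f: 0.40 for f in range(38, 43)}},
    {"name": "run_w", "type": "A", "avgp": {f: 0.10 for f in FEATURES}},
]
selected = select_runs(runs)
```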

Overview of approaches

o HLF systems draw from a very wide range of signal processing and machine learning techniques
    n Generic vs feature specific
    n How to do feature selection for visual modalities such as color and texture
    n Visual representation: grid or salient feature clusters
    n Various fusion methods, normalization methods
    n Range of classifiers
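As one illustration of the fusion and normalization methods mentioned above, the sketch below min-max normalizes the scores of each unimodal detector and combines them by a weighted average (a simple form of late fusion). This is a generic example, not the method of any particular participant; the function names, detector names, and scores are made up.

```python
def min_max_normalize(scores):
    """Rescale one detector's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def late_fusion(score_sets, weights=None):
    """Weighted average of normalized scores from several detectors."""
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    fused = {}
    for weight, scores in zip(weights, score_sets):
        for shot, s in min_max_normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + weight * s
    return fused


# Made-up confidence scores from two hypothetical unimodal detectors.
color_scores = {"s1": 2.0, "s2": 0.0, "s3": 1.0}
text_scores = {"s1": 0.8, "s2": 0.9, "s3": 0.1}
fused = late_fusion([color_scores, text_scores])
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalizing first keeps a detector with a large raw score range from dominating the average.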

Carnegie Mellon University

o Approach
    n unimodal / multimodal (as in 2004)
    n learn dependencies between semantic features (by using various graphical model representations): inconclusive
    n global fusion < local fusion
    n multilingual > monolingual
    n multiple text sources > single text source
    n Best run: local fusion
o Results:

[Figure: results chart; y-axis from 0 to 0.4]

Columbia University - presentation follows

o Approach
    n Parts based object representation (ARG)
    n Captures:
        o topological structure (spatial relationships among parts)
        o Local attributes of parts
    n Model learns the parameter distribution properties due to differences in photometric conditions and geometry
    n Runs vary across classifier combination schemes (fusion/selection)
o Results:
    n Significantly better than global (i.e. grid based) approach
    n Esp. good for visual concepts where topology and local attributes are important (e.g. US flag)
    n Text features play only a marginal role (contrastive experiment)

[Figure: results chart; y-axis from 0 to 0.4]

Fudan University

o Approach:
    n Several runs
        o Specific feature detectors
        o ASR based
        o Fusion of several unimodal SVM classifiers
        o Contrastive experiments with different dimension reduction techniques (PCA, locality preserving projection)
o Results:
    n Best run: 0.19

FXPAL

o Approach
    n SVM trained on low level features donated by CMU
    n Classifier combination schemes based on various forms of regression
    n 1st time participation
o Results
    n Best result: MAP = 0.18

Helsinki University of Technology

o Approach:
    n Self-organizing maps trained on multimodal features and LSCOM-lite annotations
o Result:
    n 1 run: MAP 0.2

IBM

o Approach
    n Features:
        o Visual: Extensive experiments for selecting best feature type and granularity for individual modalities (color, texture etc.)
        o Motion, Text, LSCOM-lite concepts
        o Features also included meta-information such as time of broadcast, channel etc.
    n SVM > (ME, KNN, GMM)
    n Flat and hierarchical feature fusion
    n Variations in classifier fusion methods
    n Feature specific approaches (selection based on held-out data)
o Results:

[Figure: results chart; y-axis from 0 to 0.4]

Imperial College London

o Approach
    n 1. "Naïve model":
        o locate salient clusters in feature space
        o Learn HLF cluster models
    n 2. Nonparametric density estimation (kernel smoothing)
o Results:
    n Naïve model: performance problems
    n NPDE >> Naïve model

[Figure: results chart; y-axis from 0 to 0.4]
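Nonparametric density estimation by kernel smoothing can be illustrated with the sketch below. It is a generic one-dimensional example, not the Imperial College system: the Gaussian kernel, the fixed bandwidth, the use of Bayes' rule to turn two class densities into a confidence score, and all names and data values are assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples by kernel smoothing."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian kernels centered on each training sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def feature_posterior(x, pos_density, neg_density, prior_pos):
    """Bayes' rule on the two estimated class-conditional densities."""
    p_pos = pos_density(x) * prior_pos
    p_neg = neg_density(x) * (1.0 - prior_pos)
    total = p_pos + p_neg
    return p_pos / total if total > 0 else prior_pos


# Made-up 1-D descriptor values for shots with and without the feature.
positives = [0.8, 0.9, 1.0, 1.1]
negatives = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
pos_density = gaussian_kde(positives, bandwidth=0.1)
neg_density = gaussian_kde(negatives, bandwidth=0.1)
prior = len(positives) / (len(positives) + len(negatives))

high = feature_posterior(0.95, pos_density, neg_density, prior)
low = feature_posterior(0.20, pos_density, neg_density, prior)
```

Shots can then be ranked by this posterior, which serves as the confidence measure for a feature.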

Mediamill team (Univ. of Amsterdam) - presentation follows

o Approach
    n Authoring metaphor
    n Feature specific combination of content, style and context analysis
    n 101 concept lexicon
o Results:
    n Textual features contribute only a small performance gain

[Figure: results chart; y-axis from 0 to 0.4]

National University of Singapore (NUS)

o Approach
    1. Ranked maximal figure of merit: ASR only, texture only, 2 fused runs
    2. HMM for visual dependency (4x4 grid): ASR only, +visual, +audio, genre, OCR. RankBoost fusion
o Results:
    n 2nd approach >> 1st approach

[Figure: results chart; y-axis from 0 to 0.4]

University of Washington

o Approach: ?
    n (notebook paper not available yet)
o Results:

[Figure: results chart; y-axis from 0 to 0.4]