Visual Relationship Detection with Language Priors Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei Stanford University
* = equal contribution
[Figure: two images, image #1 and image #2, each containing a llama and a person. Relationship labels shown: next to, chasing.]
Problem formulation
Input (image only)
Output: objects (person, person, horse, horse) and the relationships between them (riding, riding, in front of)
Related work
Spatial relationships: cup on top of table
Action relationships: person kick ball
Common relationships: person wear shirt
Roger et al. ICCV 2008; Galleguillos, CVPR 2008; Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009; Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR 2011
Visual Genome dataset
33K object categories
42K relationship categories
The dataset also contains descriptions, question answers, and attributes (Krishna et al. IJCV 2016)
Observation #1
Quadratic explosion: N objects and K relationships lead to N²K detectors
Visual Genome dataset: N = 33K, K = 42K
Example relationships: ride, next to, lying, drag, falling off, carry, resting on, throw
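To make the scale of the quadratic explosion concrete, the count can be worked out from the Visual Genome figures on this slide. This is only back-of-the-envelope arithmetic; the variable names are illustrative.

```python
# Number of detectors needed if every <object, relationship, object>
# triple had its own dedicated detector.
N = 33_000  # object categories in Visual Genome (from the slide)
K = 42_000  # relationship categories in Visual Genome (from the slide)

detectors_per_triple = N * N * K

print(f"{detectors_per_triple:,}")  # 45,738,000,000,000
```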
Observation #2
Long tail distribution of relationships: makes supervised training difficult
[Figure: # of occurrences per relationship. Labeled examples: car on street, dog ride skateboard, elephant drink milk, dog ride surfboard.]
Input → Visual module ⋅ Language module → Output
Visual module tackles: quadratic explosion of N²K detectors
Language module tackles: long tail distribution of relationships
Visual module
Proposals: Uijlings et al. IJCV 2013
Input → object detector ⋅ relationship detector → Output
Definitions:
b1, b2 are object proposals
o1, o2 ∈ [person, horse, …]
r ∈ [on, in, ride, front of, …]
T is a triple
Sample output (visual module alone): person in horse
Visual module ⋅ Language module
Language module input: o1: man, r: ride, o2: horse
Sample output (visual module ⋅ language module): person riding horse
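The sample on these slides, where the prediction changes from person in horse to person riding horse once the language module is attached, can be sketched roughly as follows. The sketch assumes that the ⋅ between the modules means their scores are multiplied. Every number and the names `visual_score` and `combined_score` are made up for illustration and are not taken from the talk.

```python
# Toy illustration: weight a visual score by a language prior.
# All scores below are hypothetical.

# Hypothetical object-detector probabilities for proposals b1 and b2.
p_o1 = {"person": 0.9}
p_o2 = {"horse": 0.8}

# Hypothetical relationship-detector scores for the pair (b1, b2).
rel_score = {"in": 0.6, "ride": 0.5, "on": 0.3}

# Hypothetical language-prior scores for each triple (o1, r, o2).
language_prior = {
    ("person", "in", "horse"): 0.1,
    ("person", "ride", "horse"): 0.9,
    ("person", "on", "horse"): 0.6,
}

def visual_score(o1, r, o2):
    """Product of the two object scores and the relationship score."""
    return p_o1[o1] * rel_score[r] * p_o2[o2]

def combined_score(o1, r, o2):
    """Visual score weighted by the language prior for the same triple."""
    return visual_score(o1, r, o2) * language_prior[(o1, r, o2)]

triples = list(language_prior)
best_visual = max(triples, key=lambda t: visual_score(*t))
best_combined = max(triples, key=lambda t: combined_score(*t))

print(best_visual)    # ('person', 'in', 'horse')
print(best_combined)  # ('person', 'ride', 'horse')
```

With these toy numbers the visual score alone prefers "in", while the language prior makes "ride" the top-ranked predicate.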
Visual module
Proposals: Uijlings et al. IJCV 2013
Tackles: quadratic explosion; only requires N + K detectors (object detector, relationship detector)
Language module
Tackles: long tail distribution; can predict rare relationships
Input: o1: man, r: ride, o2: horse
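A small enumeration shows why N + K detectors are enough to score every triple. The toy vocabularies below are illustrative only; the Visual Genome sizes are the ones quoted earlier in the talk.

```python
from itertools import product

# Toy vocabularies (illustrative; the real ones have N = 33K and K = 42K entries).
objects = ["person", "horse", "hat"]
predicates = ["ride", "wear"]

# One detector per object category plus one per predicate.
num_detectors = len(objects) + len(predicates)

# Every triple that can be scored by composing those detectors.
triples = list(product(objects, predicates, objects))

print(num_detectors)  # 5
print(len(triples))   # 18

# The same detector count at Visual Genome scale.
N, K = 33_000, 42_000
print(f"{N + K:,}")  # 75,000
```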
Training the visual module
1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly (ranking loss)
Definitions: b1, b2 are object proposals; o1, o2 ∈ [person, horse, …]; r ∈ [on, in, ride, front of, …]
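The slide names a ranking loss but its formula did not survive in the text. As a rough sketch, a common hinge-style ranking loss looks like the following. The exact form used in the talk is an assumption here, and `hinge_ranking_loss`, `margin`, and the scores are hypothetical.

```python
def hinge_ranking_loss(scores, ground_truth, margin=1.0):
    """Hinge-style ranking loss: zero once the ground-truth triple
    outscores every other candidate by at least `margin`."""
    best_other = max(s for t, s in scores.items() if t != ground_truth)
    return max(0.0, margin - scores[ground_truth] + best_other)

# Hypothetical scores for candidate triples on one pair of proposals.
scores = {
    ("person", "ride", "horse"): 2.0,
    ("person", "in", "horse"): 1.5,
    ("person", "on", "horse"): 0.5,
}

loss = hinge_ranking_loss(scores, ("person", "ride", "horse"))
print(loss)  # 0.5
```

The loss stays positive until the correct triple leads the strongest competitor by the full margin, which is what pushes the two detectors to be tuned together.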
Training the language module
Example relationships: dog ride skateboard, dog ride surfboard
[Equation not preserved in the extracted text], where cos is the cosine distance
Minimize: [objective not preserved in the extracted text]
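The formula on this slide was lost, but it refers to the cosine distance between the two example relationships. A minimal sketch of such a distance follows, assuming each word has an embedding vector and that the per-word distances are simply summed. Both the summed form and the 3-dimensional vectors are assumptions for illustration.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 3-d word embeddings; real ones would come from a trained model.
embedding = {
    "dog": [1.0, 0.0, 0.0],
    "ride": [0.0, 1.0, 0.0],
    "skateboard": [0.0, 0.6, 0.8],
    "surfboard": [0.0, 0.8, 0.6],
}

def relationship_distance(t1, t2):
    """Sum of per-word cosine distances between two (o1, r, o2) triples (assumed form)."""
    return sum(cosine_distance(embedding[a], embedding[b]) for a, b in zip(t1, t2))

d = relationship_distance(("dog", "ride", "skateboard"), ("dog", "ride", "surfboard"))
print(round(d, 2))  # 0.04
```

The two triples differ only in their last word, so the small distance reflects how close "skateboard" and "surfboard" are in the toy embedding space.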
Training both modules iteratively
Visual module ⋅ Language module
Our results
Relationship types: spatial, comparative, asymmetrical, verb, prepositional
[Figure: detected relationships in an example image. Objects: person, person, shirt, snow, ski. Predicates: taller than, left of, wear, on.]
Ablation study
Method                 Recall @ 50   Recall @ 100   mAP
Sadeghi et al. 2011    0.07          0.09           0.04
Visual only            1.58          1.85           0.84
Visual + language      13.86         14.76          1.52
Example predictions shown on the slide: person wear shirt, person wear shirt; person in horse, person in shirt (shown with the visual only row); person ride horse, person near horse (shown with the visual + language row)
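For readers unfamiliar with the Recall @ K columns, the metric can be sketched as below. This is a simplified version that matches predictions on labels only. The talk's exact matching criterion is not given in the slide text, and `recall_at_k` and the sample data are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)

# Hypothetical predictions for one image, sorted by descending confidence.
ranked = [
    ("person", "ride", "horse"),
    ("person", "in", "shirt"),
    ("person", "wear", "shirt"),
    ("person", "near", "horse"),
]
truth = [("person", "ride", "horse"), ("person", "wear", "shirt")]

print(recall_at_k(ranked, truth, k=1))  # 0.5
print(recall_at_k(ranked, truth, k=3))  # 1.0
```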
[Figure: example detection: person ride bicycle]
[Figure: example detection: person throw frisbee]
Zero-shot detection
person sit chair: 948 training examples
hydrant on ground: 29 training examples
person sit hydrant: 0 training examples
person ride horse: 578 training examples
person wear hat: 1023 training examples
horse wear hat: 0 training examples
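The zero-shot setting on these slides can be checked mechanically: a triple counts as zero-shot when it never appears in training even though each of its parts does. The sketch below uses the training counts from the slides; `is_zero_shot` is a hypothetical helper.

```python
# Training counts taken from the slides.
train_counts = {
    ("person", "sit", "chair"): 948,
    ("hydrant", "on", "ground"): 29,
    ("person", "ride", "horse"): 578,
    ("person", "wear", "hat"): 1023,
}

def is_zero_shot(triple):
    """True if the triple has no training examples but its objects and predicate do."""
    if train_counts.get(triple, 0) > 0:
        return False
    seen_objects = {x for (s, _, o) in train_counts for x in (s, o)}
    seen_predicates = {r for (_, r, _) in train_counts}
    o1, r, o2 = triple
    return o1 in seen_objects and o2 in seen_objects and r in seen_predicates

print(is_zero_shot(("person", "sit", "hydrant")))  # True
print(is_zero_shot(("horse", "wear", "hat")))      # True
print(is_zero_shot(("person", "ride", "horse")))   # False
```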
Poster #4
Questions?