DeepFace for Unconstrained Face Recognition
Yaniv Taigman (1), Ming Yang (1), Marc'Aurelio Ranzato (1), Lior Wolf (2)
(1) Facebook AI Research   (2) Tel Aviv University
11/26/2014
Era of big visual data
Daily uploads and total photo counts across major platforms (platform logos omitted):
• 1.6M daily uploads, 6B photos (12/2013)
• 60M daily uploads, 20B photos (3/2014)
• 215M daily uploads (11/2013)
• 350M daily uploads (11/2013)
• 400M daily uploads, 350B photos (3/2014)
• 100 hours of video uploaded per minute (4/2014)
• 1.75B smartphone users in 2014 • 880B digital photos will be taken in 2014 Sources: www.expandedramblings.com, www.emarketer.com
Tag suggestions
No automatic face recognition service in EU countries
Face recognition: main objective. Find a representation & similarity measure such that:
• Intra-subject similarity is high • Inter-subject similarity is low
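A minimal sketch of this objective as a decision rule, assuming some embedding function `embed()` that maps a face crop to a feature vector; both the function and the threshold value are illustrative placeholders, not the actual DeepFace components:

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Similarity between two face descriptors; higher means more likely the same person."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def same_person(face_a, face_b, embed, threshold=0.6):
    """Verification decision: intra-subject pairs should score above the threshold,
    inter-subject pairs below it. `embed` and `threshold` are placeholders; in
    practice the threshold is tuned on a validation set."""
    return cosine_similarity(embed(face_a), embed(face_b)) > threshold
```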
Milestones in face recognition (slightly modified version of Anil Jain's timeline)
• 1964  Bledsoe: face recognition
• 1973  Kanade's thesis
• 1991  Turk & Pentland: Eigenfaces
• 1997  Belhumeur: Fisherfaces
• 1999  Blanz & Vetter: Morphable faces
• 1999  Wiskott: EBGM
• 2001  Viola & Jones: Boosting
• 2006  Ahonen: LBP
Problem solved? NIST FRVT's best performers achieve: 1. Verification: FRR = 0.3% at FAR = 0.1%. 2. Identification with 1.6 million identities: 95.9%. 3. Identification on LFW with 4,249 identities: 56.7%.
Answer: No. • L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Trans. Information Forensics and Security, 2014.
Constrained vs. unconstrained
CONSTRAINED example: FRVT; UNCONSTRAINED example: Labeled Faces in the Wild

property       constrained        unconstrained
resolution     about 2000x2000    about 50x50
viewpoint      fully frontal      rotated, loose
illumination   controlled         arbitrary
occlusion      disallowed         allowed
Challenges in unconstrained face recognition (gallery vs. probe examples):
1. Pose
2. Illumination
3. Expression
4. Aging
5. Occlusion
A case study • Gallery images: 1 million mug-shots + 6 web images • Probe images: 5 faces • Ranking results, with and without demographic filtering
Probe faces:
A case study of automated face recognition: the Boston Marathon bombing suspects, J. C. Klontz and A.K. Jain, IEEE Computer, 2013
Unconstrained Face Recognition Era: The Labeled Faces in the Wild (LFW)
13,233 photos of 5,749 celebrities
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments, Huang, et al., ECCVW, 2008
Face verification (1:1): given a pair of face images, decide same (=) or not same (!=)
Human-level performance
• User study on Mechanical Turk: 10 different workers per face pair, average human performance reported
• Original images: 99.20%; tight crops: 97.53%; inverse crops (face masked out): 94.27%
• "These results suggest that automatic face verification algorithms should not use regions outside of the face, as they could artificially boost accuracy in a manner not applicable on real data."
Attribute and simile classifiers for face verification, Kumar, et al., ICCV 2009
LFW: Progress over the recent 7 years
• Labeled faces in the wild: A database for studying face recognition in unconstrained environments, ECCVW, 2008.
• Attribute and simile classifiers for face verification, ICCV 2009.
• Multiple one-shots for utilizing class label information, BMVC 2009.
• Large scale strongly supervised ensemble metric learning, with applications to face verification and retrieval, NEC Labs TR, 2012.
• Learning hierarchical representations for face verification with convolutional deep belief networks, CVPR, 2012.
• Bayesian face revisited: A joint formulation, ECCV 2012.
• Tom-vs-Pete classifiers and identity preserving alignment for face verification, BMVC 2012.
• Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, CVPR 2013.
• Probabilistic elastic matching for pose variant face verification, CVPR 2013.
• Fusing robust face region descriptors via multiple metric learning for face recognition in the wild, CVPR 2013.
• Fisher vector faces in the wild, BMVC 2013.
• A practical transfer learning algorithm for face verification, ICCV 2013.
• Hybrid deep learning for computing face similarities, ICCV 2013.
Several of these employed deep learning models for face verification on LFW. Please check http://vis-www.cs.umass.edu/lfw/ for the latest updates.
LFW: Progress over the recent 7 years
[Chart: verification accuracy per year, rising toward the 97.53% human level, and the corresponding reduction of error with respect to human performance per year]
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments (results page), Gary B. Huang, Manu Ramesh, Tamara Berg and Erik Learned-Miller.
High-dim LBP
• Accurate dense facial landmarks (27 points)
• Concatenate multi-scale descriptors: ~100K-dim LBP, SIFT, Gabor, etc.
• Transfer learning with Joint Bayesian: likelihood ratio test, EM updates of the between-/within-class covariances
• WDRef dataset: 99,773 images of 2,995 individuals
• 95.17% => 96.33% on LFW (unrestricted protocol)
Face alignment by explicit shape regression, Cao, et al., CVPR 2012
Bayesian face revisited: A joint formulation, Chen, et al., ECCV 2012
Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, Chen, et al., CVPR 2013
A practical transfer learning algorithm for face verification, Cao, et al., ICCV 2013
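The Joint Bayesian score on the slide above is a likelihood ratio between the "same person" and "different person" hypotheses. A minimal sketch, assuming the between-class covariance S_mu and within-class covariance S_eps have already been estimated (e.g. by the EM updates mentioned above) and that descriptors are mean-centered; it evaluates the ratio directly from the two joint Gaussians rather than the closed-form matrices of Chen et al.:

```python
import numpy as np

def gaussian_logpdf(z, cov):
    """Log-density of a zero-mean Gaussian with covariance `cov` at point z."""
    d = len(z)
    _, logdet = np.linalg.slogdet(cov)
    maha = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def joint_bayesian_llr(x1, x2, S_mu, S_eps):
    """log P([x1;x2] | same) - log P([x1;x2] | different).
    Under 'same', the two faces share one latent identity, so the joint
    covariance has S_mu in the off-diagonal blocks; under 'different' the
    off-diagonal blocks are zero."""
    z = np.concatenate([x1, x2])
    S = S_mu + S_eps
    cov_same = np.block([[S, S_mu], [S_mu, S]])
    cov_diff = np.block([[S, np.zeros_like(S)], [np.zeros_like(S), S]])
    return gaussian_logpdf(z, cov_same) - gaussian_logpdf(z, cov_diff)
```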
Hybrid deep learning
• 12 x 5 Siamese ConvNets over 12 face regions, 8 pairs of inputs each, with an RBM on top for classification
• CelebFaces dataset: 87,628 images of 5,436 individuals
Hybrid deep learning for computing face similarities, Sun, Wang, Tang, ICCV 2013.
Face recognition pipeline: Detect → Align → Represent → Classify
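A skeleton of this four-stage pipeline; `detector`, `aligner`, `network` and `gallery` are stand-ins for the real components, not the actual DeepFace implementation:

```python
import numpy as np

def recognize(image, detector, aligner, network, gallery, threshold=0.6):
    """Detect -> Align -> Represent -> Classify.
    `gallery` maps identity names to pre-computed, L2-normalized descriptors."""
    box = detector(image)                     # Detect: face bounding box / fiducial points
    crop = aligner(image, box)                # Align: frontalized, fixed-size face crop
    descriptor = network(crop)                # Represent: e.g. the 4096-d F7 activations
    descriptor /= np.linalg.norm(descriptor)  # normalize so a dot product is cosine similarity
    # Classify: nearest gallery identity, rejected if below the threshold
    name, score = max(((n, float(descriptor @ g)) for n, g in gallery.items()),
                      key=lambda t: t[1])
    return (name, score) if score > threshold else ("unknown", score)
```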
Faces are 3D objects
Reconstruction accuracy and discriminability
Bornstein et al. 2007
Face alignment (‘Frontalization’)
Detect → 2D-Aligned → 3D-Aligned
2D alignment: localize fiducial points, then 2D-align with a similarity transform
3D alignment: 2D-align, localize 67 fiducial points (x2d), piece-wise affine warping, rendering of new views
Example (Calista_Flockhart_0002.jpg): detection & localization → frontalization
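A sketch of the 2D-alignment step using OpenCV: estimate a least-squares similarity transform from detected fiducial points to canonical template positions, then warp. The template coordinates and crop size below are illustrative assumptions, not the values used by DeepFace:

```python
import cv2
import numpy as np

# Illustrative canonical positions (eyes, nose tip, mouth corners) in a 152x152 crop.
TEMPLATE = np.float32([[51, 55], [101, 55], [76, 90], [58, 118], [94, 118]])
CROP_SIZE = (152, 152)

def align_2d(image, landmarks):
    """Warp `image` so that `landmarks` (Nx2, same order as TEMPLATE) land on the
    canonical template positions, using a similarity transform
    (rotation + uniform scale + translation) estimated robustly."""
    src = np.float32(landmarks)
    M, _ = cv2.estimateAffinePartial2D(src, TEMPLATE, method=cv2.LMEDS)
    return cv2.warpAffine(image, M, CROP_SIZE)
```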
Network architecture
Front-end ConvNet:
• C1: 32 filters, 11x11
• M2: max-pooling, 3x3
• C3: 16 filters, 9x9
Local (untied) convolutions:
• L4: 16 x 9 x 9 x 16
• L5: 16 x 7 x 7 x 16
• L6: 16 x 5 x 5 x 16
Globally connected:
• F7: 4096-d (the REPRESENTATION)
• F8: 4030-d (SFC labels, softmax)
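A PyTorch sketch of this architecture, assuming 152x152x3 3D-aligned inputs as in the paper. The locally connected layers use untied weights (a separate filter bank per output location), approximated here with a small unfold-based module since PyTorch has no built-in equivalent; strides and spatial sizes are my reading of the paper and should be treated as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    """Convolution-like layer with untied weights: each output location has its own filters."""
    def __init__(self, in_ch, out_ch, in_size, kernel, stride=1):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        self.out_size = (in_size - kernel) // stride + 1
        n_loc = self.out_size ** 2
        self.weight = nn.Parameter(0.01 * torch.randn(n_loc, out_ch, in_ch * kernel * kernel))
        self.bias = nn.Parameter(torch.zeros(out_ch, self.out_size, self.out_size))

    def forward(self, x):
        n = x.size(0)
        patches = F.unfold(x, self.kernel, stride=self.stride)    # (N, C*k*k, L)
        patches = patches.permute(2, 0, 1)                        # (L, N, C*k*k)
        out = torch.einsum('lnc,loc->lno', patches, self.weight)  # (L, N, out_ch)
        return out.permute(1, 2, 0).reshape(n, -1, self.out_size, self.out_size) + self.bias

class DeepFaceSketch(nn.Module):
    def __init__(self, num_identities=4030):
        super().__init__()
        self.c1 = nn.Conv2d(3, 32, 11)                        # C1: 152x152 -> 142x142
        self.m2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)   # M2: -> 71x71
        self.c3 = nn.Conv2d(32, 16, 9)                        # C3: -> 63x63
        self.l4 = LocallyConnected2d(16, 16, 63, 9)           # L4: -> 55x55
        self.l5 = LocallyConnected2d(16, 16, 55, 7, stride=2) # L5: -> 25x25 (assumed stride)
        self.l6 = LocallyConnected2d(16, 16, 25, 5)           # L6: -> 21x21
        self.f7 = nn.Linear(16 * 21 * 21, 4096)               # F7: the face representation
        self.f8 = nn.Linear(4096, num_identities)             # F8: SFC identity logits (softmax)

    def forward(self, x):                                     # x: (N, 3, 152, 152)
        for layer in (self.c1, self.m2, self.c3, self.l4, self.l5, self.l6):
            x = torch.relu(layer(x))
        rep = torch.relu(self.f7(x.flatten(1)))               # 4096-d REPRESENTATION
        return self.f8(F.dropout(rep, 0.5, self.training)), rep
```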
SFC Training dataset
4.4 million photos blindly sampled, containing more than 4,000 identities (permission granted)
Transferred similarities (at test time):
(a) Cosine angle between DeepFace replicas
(b) Kernel methods on top of DeepFace replicas
(c) Siamese network
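For variant (b), the DeepFace paper describes a weighted chi-squared similarity between the non-negative F7 descriptors, with per-dimension weights learned by a linear SVM on same/not-same pairs. A minimal sketch of the similarity itself (the weight vector would come from that SVM; here it is just a parameter):

```python
import numpy as np

def weighted_chi2(f1, f2, w, eps=1e-8):
    """chi2(f1, f2) = sum_i w_i * (f1_i - f2_i)^2 / (f1_i + f2_i + eps).
    f1, f2 are non-negative descriptors (e.g. ReLU outputs of F7); w holds
    per-dimension weights, e.g. learned by a linear SVM over the per-dimension
    chi-squared terms, which turns this sum into a verification score."""
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))
```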
Results on LFW
YouTube Faces dataset (YTF) • Data collection – 3,425 YouTube videos of 1,595 celebrities (a subset of LFW subjects) – 5,000 video pairs in 10 splits – Detected and roughly aligned face frames available.
• Metric: mean recognition accuracy over 10 folds – Restricted protocol: only same/not-same labels – Unrestricted protocol: face identities, additional training pairs
Face recognition in unconstrained videos with matched background similarity, Wolf, Hassner, Maoz, ICCV 2011
Results on YouTube Faces (Video)
Trade-offs (LFW accuracy, %)
1. Alignment: not "astonishing". Variants compared: no alignment, 2D alignment, 3D perturbation, 3D alignment, 3D alignment + LBP; accuracy ranges from 87.9 up to 97.35 with 3D alignment.
2. Dimensionality: 97.17 at 4096-d, with only a mild drop at the lower dimensions shown (96.72, 96.07).
3. Sparsity @ 4k dims: [chart: accuracy as a function of the fraction of active components (0.1 to 1.0), for 4096-bit, 1024-bit and 256-bit codes; values shown include 95.87 and 95.53]
Trade-offs, cont'd (DB size / DNN classification test error, %)
4. Training data size: 100% of the data: 8.74; 50% of the data: 10.9; 20% of the data: 15.1; 10% of the data: 20.7
5. Network architecture: full C1+M2+C3+L4+L5+L6+F7: 8.74; removing C3: 11.2; removing L4, L5: 12.6; removing C3, L4, L5: 13.5
Failure cases • All false negatives on LFW (1%): age, sunglasses, occlusion/hats, profile views, labeling errata
Failure cases • All false positives on LFW (0.65%)
Failure cases • Sample false negatives on YTF
Failure cases • Sample false positives on YTF
Face identification (1:N): match each probe face against a gallery of known identities
Challenges unaccounted for in verification:
I. Reliability
II. Large confusion matrix (P x G)
III. Different distributions
IV. Unknown class (impostors)
LFW identification (1:N): two protocols (2), with gallery, probes and impostor probes
1. Closed set: #Gallery(1): 4,249; #Probes: 3,143. Measured(3) by rank-1 rate.
2. Open set: #Gallery(1): 596; #Probes: 596; #Impostors: 9,491 (the 'unknown class'). Measured(3) by rank-1 rate at 1% false alarm rate.
(1) Each identity has a single gallery example
(2) Unconstrained Face Recognition: Identifying a Person of Interest from a Media Collection, Best-Rowden, Han, Otto, Klare and Jain, IEEE Trans. Information Forensics and Security, 2014
(3) Training is not permitted on LFW ('unsupervised')
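A sketch of how the two protocols can be scored from a probe-vs-gallery similarity matrix; `sims` is (num_probes x num_gallery), `probe_ids` / `gallery_ids` hold ground-truth identities, and impostor probes carry an identity not present in the gallery. This is illustrative evaluation code, not the official benchmark script:

```python
import numpy as np

def closed_set_rank1(sims, probe_ids, gallery_ids):
    """Fraction of probes whose top-scoring gallery entry has the correct identity."""
    best = np.asarray(gallery_ids)[np.argmax(sims, axis=1)]
    return float(np.mean(best == np.asarray(probe_ids)))

def open_set_rank1_at_far(sims, probe_ids, gallery_ids, far=0.01):
    """Rank-1 detection & identification rate at a given false alarm rate.
    The acceptance threshold is set so that at most `far` of the impostor probes
    (identities not in the gallery) score above it; genuine probes count as
    correct only if accepted AND ranked first with the right identity."""
    probe_ids, gallery_ids = np.asarray(probe_ids), np.asarray(gallery_ids)
    top_idx = np.argmax(sims, axis=1)
    top_score = sims[np.arange(len(sims)), top_idx]
    is_genuine = np.isin(probe_ids, gallery_ids)
    impostor_scores = np.sort(top_score[~is_genuine])[::-1]
    thr = impostor_scores[int(far * len(impostor_scores))]  # threshold at the FAR quantile
    correct = (gallery_ids[top_idx] == probe_ids) & (top_score > thr) & is_genuine
    return float(correct.sum() / is_genuine.sum())
```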
LFW identification (1:N) results (compared against NIST's state of the art)
• Confusion matrix = G^T P, where G is the 4096 x 4,249 matrix of gallery descriptors and P is the 4096 x 3,143 matrix of probe descriptors
• Cosine similarity measure ('unsupervised')
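The "confusion" (similarity) matrix above is just a matrix product of L2-normalized descriptors. A minimal sketch with the dimensions from the slide; the random arrays are placeholders standing in for the actual DeepFace descriptors:

```python
import numpy as np

def cosine_confusion(G, P):
    """G: (4096, num_gallery) gallery descriptors, P: (4096, num_probes) probe
    descriptors. Returns the (num_gallery x num_probes) cosine-similarity matrix
    G^T P after L2-normalizing each column."""
    Gn = G / np.linalg.norm(G, axis=0, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=0, keepdims=True)
    return Gn.T @ Pn

# Rank-1 identification: for each probe (column), pick the best-matching gallery entry.
G = np.random.randn(4096, 4249)   # placeholder gallery descriptors
P = np.random.randn(4096, 3143)   # placeholder probe descriptors
rank1_indices = np.argmax(cosine_confusion(G, P), axis=0)  # one gallery index per probe
```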
Bottleneck regularizes transfer learning
[Diagram: DNN → FC7 (bottleneck) → FC8 → softmax → one-hot labels (e.g. 00000000100)]
Web-Scale Training for Face Identification; Taigman, Yang, Ranzato, Wolf
Low-dim DeepFace representation
• Naive binarization of the representation
[Chart: verification accuracy (%) on LFW (restricted protocol) for float vs. binary descriptors at dimensions 4096, 1024, 512, 256, 128, 64, 32, 16 and 8; values shown include 97.17 for the 4096-d float descriptor, 95.87 and 95.53 for binarized codes, degrading toward roughly 87 at the smallest dimensions]
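A sketch of what a naive binarization can look like: threshold each (ReLU, hence non-negative) component at zero, pack the bits, and compare codes by Hamming agreement. The exact scheme used in the work may differ; this is an illustrative choice:

```python
import numpy as np

def binarize(descriptor):
    """Naive binarization: 1 where the (ReLU) feature is active, 0 elsewhere.
    Bits are packed 8 per byte for compact storage."""
    return np.packbits(np.asarray(descriptor) > 0)

def hamming_similarity(code_a, code_b, n_bits):
    """Fraction of matching bits between two packed binary codes."""
    diff = np.bitwise_xor(code_a, code_b)
    return 1.0 - np.unpackbits(diff).sum() / n_bits
```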
CNNs (can) saturate. "Results can be improved simply by waiting for faster GPUs and bigger datasets to become available" -- Krizhevsky et al. What happens when the network is fixed and the amount of training data grows from 4M toward 0.5B images? Answer: our findings reveal that this holds only to a certain degree.
Data is practically infinite:
– >350 billion photos
– >400M photos uploaded/day
– 3,500 photos every second
– One ImageNet every 1:20h
– One Flickr every 4 weeks
Scaling up:
• DeepFace: 4.4 million images / 4,030 identities
• Random 108k: 6 million images / 108,000 identities
• Random 250k: 10 million images / 250,000 identities (yes: a 250K-way softmax)
Saturation
Scaling up: Semantic Bootstrapping • 0.5B images → 10M hyperplanes • Lookalike hyperplanes → DB2 • Training on DB2 with more capacity.
Web-Scale Training for Face Identification; Taigman, Yang, Ranzato, Wolf
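A rough sketch of the "lookalike hyperplanes" idea as I read it (the details below are assumptions): take the per-identity classifier hyperplanes (rows of the final softmax weight matrix), find for each identity its most similar, i.e. most confusable, identities, and use those groups to assemble the harder bootstrapped training set DB2 for a second, higher-capacity round:

```python
import numpy as np

def lookalike_groups(W, k=20):
    """W: (num_identities, d) matrix of identity hyperplanes (final-layer weights).
    Returns, for each identity, the indices of its k most similar identities by
    cosine similarity between hyperplanes. These confusable groups are candidates
    for the harder bootstrapped training set (DB2). Purely illustrative."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = Wn @ Wn.T
    np.fill_diagonal(sims, -np.inf)              # exclude the identity itself
    return np.argsort(-sims, axis=1)[:, :k]      # (num_identities, k) neighbor indices
```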
Second-round results on LFW
Comparison to NIST's state of the art: second-round DeepFace vs. the same system that achieved 92% rank-1 accuracy on a gallery of 1.6 million identities (NIST's SOTA, constrained).
SOTA as a single network; 2nd best single network: 95.43% [DeepID2]
DeepFace efficiency (at test time) • For a single 720p image on a single 2.2GHz Intel CPU core:
– Face detection: 0.3 sec
– 2D + 3D alignment: 0.05 sec
– Feed-forward: 0.18 sec
– Classification: ~0 sec
– Overall: 0.53 sec
Summary • Coupling 3D alignment with large-capacity locally-connected networks • At the brink of human-level performance for face verification (1:1) • Pushing the performance significantly for face identification (1:N)
(A small part of) a t-SNE visualization of LFW, constructed from all-pairs (~88M) dot products, i.e. unsupervised.
Thank you! • Questions • Comments • Suggestions • Facebook AI Research: www.facebook.com/fair – Human attributes – Object recognition – NLP/word embedding – GPU training platform – ……
Recent work
• 25 CNNs combining identification and verification losses
  Deep learning face representation by joint identification-verification, Sun, Wang, Tang, technical report, arXiv, 6/2014
• Pyramid CNN: a group of layer-wise trained CNNs
  Learning deep face representation, Fan, Cao, Jiang, Yin, technical report, arXiv, 3/2014