DeepFace for Unconstrained Face Recognition

DeepFace for Unconstrained Face Recognition

Yaniv Taigman¹, Ming Yang¹, Marc’Aurelio Ranzato¹, Lior Wolf²
¹ Facebook AI Research   ² Tel Aviv University
11/26/2014

Era of big visual data
• 1.6M daily uploads, 6B photos (12/2013)
• 60M daily uploads, 20B photos (3/2014)
• 215M daily uploads, ?B photos (11/2013)
• 350M daily uploads, 0B photos (11/2013)
• 400M daily uploads, 350B photos (3/2014)
• 100 hours of video uploaded per minute (4/2014)
• 1.75B smartphone users in 2014
• 880B digital photos will be taken in 2014
Sources: www.expandedramblings.com, www.emarketer.com

Tag suggestions

No automatic face recognition service in EU countries

Face recognition's main objective: find a representation & similarity measure such that:
• Intra-subject similarity is high
• Inter-subject similarity is low
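The two bullets above can be made concrete with a toy sketch: embed each face, score pairs with a similarity measure (cosine, which the talk also uses later), and threshold. All names here are illustrative, not from DeepFace:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity measure: cosine of the angle between two representations.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(rep_a, rep_b, threshold=0.5):
    # Declare "same subject" when intra-subject similarity is high enough.
    return cosine_similarity(rep_a, rep_b) >= threshold

# Toy check: a vector is maximally similar to itself (intra-subject),
# while orthogonal vectors have zero similarity (inter-subject).
x = np.array([1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 0.0])
print(verify(x, x))  # True
print(verify(x, y))  # False
```

The entire pipeline question then becomes how to learn an embedding for which this simple decision rule works on unconstrained photos.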

Milestones in face recognition
(slightly modified version of Anil Jain's timeline)
• 1964: Bledsoe, face recognition
• 1973: Kanade's thesis
• 1991: Turk & Pentland, Eigenfaces
• 1997: Belhumeur, Fisherfaces
• 1999: Blanz & Vetter, Morphable faces; Wiskott, EBGM
• 2001: Viola & Jones, Boosting
• 2006: Ahonen, LBP

Problem solved? NIST FRVT best performers on:
1. Verification: FRR = 0.3% at FAR = 0.1%
2. Identification with 1.6 million identities: 95.9%
3. Identification on LFW with 4,249 identities: 56.7%

Answer: No.
• L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Trans. Information Forensics and Security, 2014.

Constrained vs. unconstrained (e.g., FRVT vs. Labeled Faces in the Wild)

property       CONSTRAINED        UNCONSTRAINED
resolution     about 2000x2000    about 50x50
viewpoint      fully frontal      rotated, loose
illumination   controlled         arbitrary
occlusion      disallowed         allowed

Challenges in unconstrained face recognition (gallery vs. probe images, for example):
1. Pose
2. Illumination
3. Expression
4. Aging
5. Occlusion

A case study
• Gallery images: 1 million mug-shots + 6 web images
• Probe images: 5 faces
• Ranking results, with and without demographic filtering

Probe faces:

A case study of automated face recognition: the Boston Marathon bombing suspects, J. C. Klontz and A.K. Jain, IEEE Computer, 2013

Unconstrained Face Recognition Era: The Labeled Faces in the Wild (LFW)

13,233 photos of 5,749 celebrities

Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Huang, Jain, Learned-Miller, ECCVW, 2008

Face verification (1:1): given a pair of face images, decide whether they show the same person ( = ) or not ( != ).

Human-level performance
• User study on Mechanical Turk: 10 different workers per face pair; average human performance reported
• Original images: 99.20%; tight face crops: 97.53%; inverse crops (face masked out): 94.27%

"These results suggest that automatic face verification algorithms should not use regions outside of the face, as they could artificially boost accuracy in a manner not applicable on real data."

Attribute and simile classifiers for face verification, Kumar, et al., ICCV 2009

LFW: Progress over the recent 7 years
• Labeled faces in the wild: A database for studying face recognition in unconstrained environments, ECCVW, 2008
• Attribute and simile classifiers for face verification, ICCV 2009
• Multiple one-shots for utilizing class label information, BMVC 2009
• Large scale strongly supervised ensemble metric learning, with applications to face verification and retrieval, NEC Labs TR, 2012
• Learning hierarchical representations for face verification with convolutional deep belief networks, CVPR 2012 *
• Bayesian face revisited: A joint formulation, ECCV 2012
• Tom-vs-Pete classifiers and identity preserving alignment for face verification, BMVC 2012
• Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, CVPR 2013
• Probabilistic elastic matching for pose variant face verification, CVPR 2013
• Fusing robust face region descriptors via multiple metric learning for face recognition in the wild, CVPR 2013
• Fisher vector faces in the wild, BMVC 2013
• A practical transfer learning algorithm for face verification, ICCV 2013
• Hybrid deep learning for computing face similarities, ICCV 2013 *
(* employed deep learning models for face verification on LFW)
Please check http://vis-www.cs.umass.edu/lfw/ for the latest updates.

LFW: Progress over the recent 7 years
[Chart: verification accuracy per year, rising from roughly 73.93% toward 96.33%, alongside the corresponding reduction of error relative to human performance (97.53%) per year]
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments (results page), Gary B. Huang, Manu Ramesh, Tamara Berg and Erik Learned-Miller.

High-dim LBP
• Accurate (27) dense facial landmarks
• Concatenate multi-scale descriptors: ~100K-dim LBP, SIFT, Gabor, etc.
• Joint Bayesian: likelihood ratio test; EM update of the between-/within-class covariances
• Transfer learning on the WDRef dataset: 99,773 images of 2,995 individuals; 95.17% => 96.33% on LFW (unrestricted protocol)

Face alignment by explicit shape regression, Cao, et al., CVPR 2012
Bayesian face revisited: A joint formulation, Chen, et al., ECCV 2012
Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, Chen, et al., CVPR 2013
A practical transfer learning algorithm for face verification, Cao, et al., ICCV 2013
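The Joint Bayesian likelihood ratio test above can be sketched directly from its generative model: a face is x = mu + eps, with mu ~ N(0, S_mu) (identity, the between-class covariance) and eps ~ N(0, S_eps) (within-class variation); under "same identity" the pair shares mu. This is a minimal numpy illustration, not Chen et al.'s closed-form derivation, and all names are illustrative:

```python
import numpy as np

def gaussian_logpdf(x, cov):
    # Zero-mean multivariate normal log-density.
    d = len(x)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

def joint_bayesian_score(x1, x2, S_mu, S_eps):
    # Log-likelihood ratio of "same identity" vs. "different identities".
    # Same: the pair shares mu, so the cross-covariance block is S_mu.
    # Different: independent draws, so the cross-covariance block is zero.
    z = np.concatenate([x1, x2])
    S = S_mu + S_eps
    same = np.block([[S, S_mu], [S_mu, S]])
    diff = np.block([[S, np.zeros_like(S)], [np.zeros_like(S), S]])
    return gaussian_logpdf(z, same) - gaussian_logpdf(z, diff)

# Illustrative check: with identity covariance for mu and small
# within-class noise, an identical pair scores positive and an
# opposite pair scores negative.
S_mu, S_eps = np.eye(2), 0.1 * np.eye(2)
x = np.array([1.0, 0.0])
print(joint_bayesian_score(x, x, S_mu, S_eps) > 0)   # True
print(joint_bayesian_score(x, -x, S_mu, S_eps) > 0)  # False
```

In the actual method, S_mu and S_eps are estimated from labeled data by EM, which is what the "EM update of the between/within class covariance" bullet refers to.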

Hybrid deep learning
• 12 x 5 Siamese ConvNets over 12 face regions, each fed 8 pairs of inputs, with an RBM on top for classification
• CelebFaces dataset: 87,628 images of 5,436 individuals
Hybrid deep learning for computing face similarities, Sun, Wang, Tang, ICCV 2013.

Face recognition pipeline: Detect → Align → Represent → Classify
[Example faces: Yaniv, Lubomir, Marc’Aurelio]

Faces are 3D objects

Reconstruction accuracy and discriminability

Bornstein et al. 2007

Face alignment ('Frontalization')

[Pipeline: Detect → 2D-Aligned → 3D-Aligned]

2D alignment: localize fiducial points, then apply a 2D alignment.

3D alignment: start from the 2D alignment, localize 67 additional 2D anchor points, fit a 3D model, warp with a piecewise affine transformation, and render new (frontal) views.

[Example on Calista_Flockhart_0002.jpg: detection & localization, then frontalization]

Network architecture
• Front-end ConvNet — C1: 32 filters 11x11; M2: max-pooling 3x3; C3: 16 filters 9x9
• Local (untied) convolutions — L4: 16 x 9 x 9 x 16; L5: 16 x 7 x 7 x 16; L6: 16 x 5 x 5 x 16
• Globally connected — F7: 4096-d (the REPRESENTATION); F8: 4030-d (SFC labels)

SFC training dataset: 4.4 million photos, blindly sampled, containing more than 4,000 identities (permission granted)
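The "local (untied) convolutions" of L4-L6 can be sketched in numpy: the same sliding-window structure as a convolution, but with a separate filter bank at every output location (no weight sharing), which suits aligned faces where each region has its own statistics. Shapes are hypothetical, and stride, padding, and biases are omitted:

```python
import numpy as np

def locally_connected(x, weights):
    # x:       (H, W, C_in) input feature map
    # weights: (H_out, W_out, k, k, C_in, C_out) location-specific filters;
    #          an untied layer therefore has H_out*W_out times the
    #          parameters of the equivalent shared convolution.
    H_out, W_out, k, _, _, C_out = weights.shape
    out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i + k, j:j + k, :]  # local receptive field at (i, j)
            out[i, j] = np.einsum('abc,abcd->d', patch, weights[i, j])
    return out

# Toy shape check with a 5x5x2 map and 3x3 untied filters.
x = np.ones((5, 5, 2))
w = np.ones((3, 3, 3, 3, 2, 4))
print(locally_connected(x, w).shape)  # (3, 3, 4)
```

Untying is only affordable here because frontalization guarantees that, say, the left-eye region always lands on the same output locations.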

Transferred similarities (test):
(a) Cosine angle (DeepFace vs. replica)
(b) Kernel methods (DeepFace vs. replica)
(c) Siamese network

Results on LFW

YouTube Faces dataset (YTF)
• Data collection: 3,425 YouTube videos of 1,595 celebrities (a subset of LFW subjects); 5,000 video pairs in 10 splits; detected and roughly aligned face frames available
• Metric: mean recognition accuracy over 10 folds
  – Restricted protocol: only same/not-same labels
  – Unrestricted protocol: face identities, additional training pairs

Face recognition in unconstrained videos with matched background similarity, Wolf, Hassner, Maoz, ICCV 2011
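The 10-fold metric can be sketched as follows, assuming a simple threshold-on-similarity classifier where the threshold is chosen on nine splits and accuracy is reported on the held-out tenth; this is an illustration, not the official benchmark code:

```python
import numpy as np

def mean_fold_accuracy(scores, labels, folds):
    # scores: similarity score per pair; labels: same/not-same per pair;
    # folds: split index per pair. For each fold, tune the threshold on
    # the remaining folds and evaluate on the held-out one, then average.
    accs = []
    for f in np.unique(folds):
        train, test = folds != f, folds == f
        # pick the threshold maximizing accuracy on the training folds
        cands = np.unique(scores[train])
        best = max(cands,
                   key=lambda t: np.mean((scores[train] >= t) == labels[train]))
        accs.append(np.mean((scores[test] >= best) == labels[test]))
    return float(np.mean(accs))

# Toy check: perfectly separable scores give mean accuracy 1.0.
scores = np.array([0.9, 0.1, 0.9, 0.1, 0.9, 0.1])
labels = np.array([True, False, True, False, True, False])
folds = np.array([0, 0, 1, 1, 2, 2])
print(mean_fold_accuracy(scores, labels, folds))  # 1.0
```

Averaging over held-out folds is what makes the reported number an unbiased estimate rather than a tuned-on-test figure.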

Results on YouTube Faces (Video)

Trade-offs (LFW accuracy, %)

1. Alignment: not "astonishing"
[Bar chart over alignment variants — No Alignment, 2D Alignment, 3D Perturbation, 3D Alignment, 3D Alignment + LBP — with accuracies 87.9, 91.3, 93.7, 94.3 and 97.35]

2. Dimensionality:
[Bar chart over representation dimension (4096, 1024, 256), float vs. binarized: accuracies between 95.53 and 97.17]

3. Sparsity @ 4k dims:
[Histogram of activation sparsity]

Trade-offs (cont'd): SFC test error (%) vs. dataset size and network depth

4. Training data size:
• 100% of the data: 8.74
• 50% of the data: 10.9
• 20% of the data: 15.1
• 10% of the data: 20.7

5. Network architecture:
• C1+M2+C3+L4+L5+L6+F7: 8.74
• without C3: 11.2
• without L4, L5: 12.6
• without C3, L4, L5: 13.5

Failure cases
• All false negatives on LFW (1%): age, sunglasses, occlusion/hats, profile views, errata
• All false positives on LFW (0.65%)
• Sample false negatives on YTF
• Sample false positives on YTF

Face identification (1:N): match each probe image against a gallery of known identities.

Challenges unaccounted for in verification:
I. Reliability
II. Large confusion matrix (P x G)
III. Different distributions
IV. Unknown class

LFW identification (1:N) — two protocols

1. Closed set
– Gallery¹: 4,249 identities
– Probes: 3,143
– Measured³ by Rank-1 rate

2. Open set
– Gallery¹: 596 identities
– Probes: 596
– Impostors: 9,491 ('unknown class')
– Measured³ by Rank-1 rate @ 1% false alarm rate

¹ Each identity with a single example
² Unconstrained Face Recognition: Identifying a Person of Interest from a Media Collection, Best-Rowden, Han, Otto, Klare and Jain, IEEE Trans. Information Forensics and Security
³ Training is not permitted on LFW ('unsupervised')

LFW identification (1:N) results
Cosine similarity measure ('unsupervised'): the confusion matrix is Gᵀ·P, where G (4096 x 4249) stacks the gallery descriptors and P (4096 x 3143) the probe descriptors; open-set probes include the 'unknown class' impostors.
[Results compared against NIST's numbers]
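The Gᵀ·P recipe amounts to a few lines of numpy (illustrative sketch; the array contents below are hypothetical toy descriptors):

```python
import numpy as np

def rank1_identify(G, P):
    # Unsupervised identification as on the slide: L2-normalize the
    # descriptor columns so that G.T @ P is a matrix of cosine
    # similarities, then take the best-matching gallery column for
    # each probe (Rank-1).
    G = G / np.linalg.norm(G, axis=0, keepdims=True)  # (d, n_gallery)
    P = P / np.linalg.norm(P, axis=0, keepdims=True)  # (d, n_probe)
    sims = G.T @ P                                    # (n_gallery, n_probe)
    return sims.argmax(axis=0)                        # Rank-1 identity per probe

# Toy check with 3 gallery identities and 2 probes (d = 4):
G = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
P = np.array([[0.9, 0.1], [0.1, 0.0], [0.0, 0.95], [0.0, 0.0]])
print(rank1_identify(G, P))  # probe 0 -> identity 0, probe 1 -> identity 2
```

For the open-set protocol one would additionally reject a probe whose best similarity falls below a threshold calibrated to the 1% false-alarm rate.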

Bottleneck regularizes transfer learning
[Architecture: DNN → FC7 (bottleneck representation) → FC8 → softmax over labels (one-hot, e.g., 00000000100)]
Web-Scale Training for Face Identification; Taigman, Yang, Ranzato, Wolf

Low-dim DeepFace representation
• Naive binarization
[Chart: verification accuracy (%) on LFW (restricted protocol) for float vs. binary descriptors at dim = 4096, 1024, 512, 256, 128, 64, 32, 16, 8; the float representation degrades gracefully from 97.17 at 4096-d down to roughly 87.15 at the lowest dimensions, with the binary codes slightly behind at each dimension]
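The naive binarization above can be sketched as thresholding each dimension at zero and comparing codes by bit agreement (an assumed stand-in for the exact similarity used; all names illustrative):

```python
import numpy as np

def binarize(features):
    # Naive binarization: keep only the sign of each dimension, turning
    # a float descriptor into a compact bit vector
    # (e.g., 4096 floats -> 4096 bits = 512 bytes).
    return (features > 0).astype(np.uint8)

def hamming_similarity(a, b):
    # Fraction of agreeing bits; a cheap proxy for cosine similarity
    # on the binarized codes.
    return float(np.mean(a == b))

x = np.array([0.3, -1.2, 0.7, -0.1])
y = np.array([0.5, -0.8, -0.2, -0.4])
print(hamming_similarity(binarize(x), binarize(y)))  # 0.75
```

The accuracy chart shows the price of this compression: a modest drop per dimension in exchange for a descriptor small enough for web-scale indexing.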

CNNs (can) saturate
"Results can be improved simply by waiting for faster GPUs and bigger datasets to become available" -- Krizhevsky et al.
What happens when the network is fixed and the training set grows from 4M to 0.5B images? Answer: our findings reveal that this holds only to a certain degree.

Data is practically infinite:
• >350 billion photos
• >400M photos uploaded per day
• 3,500 photos every second
• One ImageNet every 1:20h
• One Flickr every 4 weeks

Scaling up
• DeepFace: 4.4 million images / 4,030 identities
• Random 108k: 6 million images / 108,000 identities
• Random 250k: 10 million images / 250,000 identities (yes: a 250K-way softmax)
→ Saturation

Scaling up: semantic bootstrapping
• 0.5B images → 10M hyperplanes
• Lookalike hyperplanes → DB2
• Training on DB2 with more capacity

Web-Scale Training for Face Identification; Taigman, Yang, Ranzato, Wolf

Second-round results on LFW

Comparison to NIST's state of the art: the second-round DeepFace is evaluated against the same system that achieved 92% Rank-1 accuracy on a table of 1.6 million identities (NIST's SOTA, constrained).

SOTA single network; 2nd best: 95.43% [DeepID2]

DeepFace efficiency (at test)
• For a single 720p image on a single 2.2GHz Intel CPU core:
– Face detection: 0.3 sec
– 2D + 3D alignment: 0.05 sec
– Feed-forward: 0.18 sec
– Classification: ~0 sec
– Overall: 0.53 sec

Summary • Coupling 3D alignment with large-capacity locally-connected networks • At the brink of human-level performance for face verification (1:1) • Pushing the performance significantly for face identification (1:N)

(A small part of a) t-SNE visualization of LFW, constructed from all ~88M pairwise dot products, i.e., unsupervised.

Thank you! • Questions • Comments • Suggestions • Facebook AI Research: www.facebook.com/fair – Human attributes – Object recognition – NLP/word embedding – GPU training platform – ……

Recent work
• 25 CNNs combining identification and verification losses
Deep learning face representation by joint identification-verification, Sun, Wang, Tang, technical report, arXiv, 6/2014

• Pyramid CNN: a group of layer-wise trained CNNs
Learning deep face representation, Fan, Cao, Jiang, Yin, technical report, arXiv, 3/2014
