Learning a Sparse Representation from Multiple Still Images for On-Line Face Recognition in an Unconstrained Environment

Johan W.H. Tangelder

Ben A.M. Schouten

Centre for Mathematics and Computer Science (CWI), Amsterdam, The Netherlands {J.W.H.Tangelder, B.A.M.Schouten}@cwi.nl

Abstract

In a real-world environment a face detector can be applied to extract multiple face images from multiple video streams without constraints on pose and illumination. The extracted face images will have varying image quality and resolution, and the detected faces will not be precisely aligned. This paper presents a new approach to on-line face identification from multiple still images obtained under such unconstrained conditions. Our method learns a sparse representation of the most discriminative descriptors of the detected face images according to their classification accuracies. On-line face recognition is supported using a single descriptor of a face image as a query. We apply our method to our newly introduced BHG descriptor, the SIFT descriptor, and the LBP descriptor, which obtain limited robustness against illumination, pose, and alignment errors. Our experimental results using a video face database of pairs of unconstrained low resolution video clips of ten subjects show that our method achieves a recognition rate of 94% with a sparse representation containing 10% of all available data, at a false acceptance rate of 4%.

1 Introduction

This paper investigates real-time face recognition from multiple still images, applicable to real environments like smart offices and smart homes. To recognize people unobtrusively from a distance, facial images can be acquired e.g. from video using one or more webcams. In such an unconstrained environment illumination, pose, and resolution vary. Also, faces detected by a face detector may not be exactly centered in the extracted frames. Hence, two video shots of the same person may share only a few frames showing the person's face with similar pose and illumination, as illustrated by Fig. 1. In recent years, much research has been devoted to face recognition from multiple still images; for an extensive review, see the paper by Zhou et al. [14].

Figure 1. Face identification using video clips. Each video frame is shown together with its BHG descriptor. A1, A2, A3 belong to the same shot. B1, B2, B3 belong to another shot of the same person, and C1, C2, C3 belong to another shot of another person. Only the descriptors of A1 and B1 are very similar.

In our framework we assume that a database of face image descriptors is built using a subset of the face images extracted from multiple sources, e.g. a number of webcams. This paper studies efficient representation of such a database and on-line recognition using a single query image. The contribution of our paper is twofold. We introduce a method to obtain a sparse representation of a set of descriptors of multiple still images, and apply it for recognition using a single descriptor. Also, we introduce the bi-cubic interpolated and histogram equalized gray level image (BHG) descriptor, which obtains limited robustness against registration errors by bi-cubic interpolation, and against illumination variations by histogram equalization.

A sparse representation of the face images can be obtained by projecting high-dimensional descriptor data into a lower dimensional space using methods like Principal Component Analysis, Multidimensional Scaling [3], Self-Organizing Maps [8], or Locally Linear Embedding [10], or by selecting representatives of clusters of the descriptor data using hierarchical clustering [6] or partitional clustering, e.g. k-means [4]. Instead of applying clustering, our method selects the most discriminative representatives by their recognition accuracies.

Descriptors like 'eigenfaces' [13] or 'Fisherfaces' [1] require that the face images to be compared are registered to reasonably high precision. However, this cannot be achieved in unconstrained environments with illumination and pose variations and low resolution images. Gorodnichy [5] describes a neural network capable of recognizing faces from a query image with a resolution of 24x24 pixels; the network memorizes the faces after processing an image sequence and recognizes faces from a single low resolution image. To match sets of faces against sets of faces, Sivic et al. [12] applied the SIFT descriptor [9], which is invariant to a registration error of a few pixels and to a linear transformation of image intensity within its support region. Hadid et al. [7] compared face sets against face sets using the local binary patterns (LBP) descriptor, which is a histogram of patterns obtained by thresholding the 3x3 neighbourhood of each pixel with the gray value of the center pixel.
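To make the LBP idea concrete, the following is a minimal sketch of the basic 3x3 LBP histogram just described. Note that [7] actually uses the multi-scale LBP_{4,1} + LBP^{u2}_{8,1} variant rather than this plain form; the function name and neighbour ordering below are our choices, not the authors'.

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 3x3 local binary pattern histogram (illustrative sketch only;
    [7] uses the LBP_{4,1} + LBP^{u2}_{8,1} variant, not this plain form)."""
    h, w = gray.shape
    center = gray[1:h-1, 1:w-1]
    codes = np.zeros(center.shape, dtype=np.int32)
    # The 8 neighbours of each pixel, in a fixed clockwise order.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1+dy:h-1+dy, 1+dx:w-1+dx]
        # Threshold each neighbour against the center pixel; pack into a code.
        codes |= (neighbour >= center).astype(np.int32) << bit
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()  # normalize so differently sized faces compare
```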

2 Approach

In the preprocessing phase the IDIAP frontal face detector [11] is used to extract near-frontal face frames of size n×n pixels, n ≥ 24, in which the eyes are approximately located at coordinates (n/4, n/4) and (3n/4, n/4), according to the format proposed by Gorodnichy [5]. In this face representation the nose tip is approximately located at the frame center (n/2, n/2) and the mouth at coordinates (n/2, 3n/4). Due to the low resolution and illumination variations, errors of a few pixels cannot be avoided in the detection of face frames. Also, we did not apply separate eye, nose and mouth detectors, because low resolution and illumination variations are additional sources of possible errors for these detectors. As a face descriptor we introduce the bi-cubic interpolated and histogram equalized gray level (BHG) image. Given a fixed size s, e.g. 8, 16, 24, ..., the s-BHG descriptor is obtained by bi-cubic interpolation from the frame image of size n×n pixels to a face descriptor of size s×s pixels. The bi-cubic interpolation is robust against registration errors, because it estimates the gray value at a pixel in the face descriptor by a weighted average of the 16 pixels surrounding the closest corresponding pixel in the frame image [2]. To increase robustness to illumination variations the BHG images are also histogram equalized. We assume that in the preprocessing phase the detected face frames of subjects with known identity are labeled with the identity of the subject. Hence, in this set-up we obtain a set F of face descriptors f, each labeled with the identity I(f) of its subject.
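As an illustration, a minimal sketch of this s-BHG computation using OpenCV follows; the use of cv2 and the exact interpolation settings are our assumptions, not the authors' implementation.

```python
import cv2

def bhg_descriptor(face_frame, s=16):
    """Compute an s-BHG descriptor from an n-by-n detected face frame:
    bicubic downsampling to s-by-s, then histogram equalization."""
    gray = face_frame
    if face_frame.ndim == 3:  # colour input: convert to gray levels first
        gray = cv2.cvtColor(face_frame, cv2.COLOR_BGR2GRAY)
    # Bicubic interpolation: each output gray value is a weighted average
    # of 16 surrounding input pixels, giving some robustness to small
    # registration errors.
    small = cv2.resize(gray, (s, s), interpolation=cv2.INTER_CUBIC)
    # Histogram equalization for (limited) illumination invariance.
    return cv2.equalizeHist(small)
```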

On-line face recognition is supported using a single face descriptor as a query. If the query descriptor is within a certain distance of another descriptor in the database, the identity of the query face is found. Otherwise, the system is not able to recognize the face, because the face is not in the database or its quality is too bad; in that case we discard the query. Instead of using all of F as a large database, we use a smaller database D ⊂ F as a sparse representation. We use the identities of the descriptors d ∈ D to recognize the identity of new descriptors d ∉ F by the identity of their nearest neighbour. We assume that F \ D is more or less representative for the descriptors to be recognized. Therefore, given a candidate database D, at the learning stage we use the descriptors d ∈ F \ D to estimate D's recognition accuracy. D is learned incrementally by optimizing the recognition accuracy for the face descriptors not (yet) added to D. We use a greedy approach to build D, which initializes D with the most discriminative face descriptor, i.e. the one with the highest recognition accuracy, and adds elements one by one until the user specified database size is reached.

We formalize this approach as follows. Let #X denote the number of elements of a set X, and let X \ Y denote the set of all elements of X that are not in Y. Let F denote the set of all face descriptors, and D ⊂ F the incrementally learned set of face descriptors representing F sparsely. Let d(f, g) denote a distance between two face descriptors f and g, and let I(f) denote the identity label of face descriptor f. For a face descriptor f, let

$$N_D(f) \equiv \arg\min_{g \in D} d(f, g)$$

denote the nearest neighbour of f in D. We use the descriptors f ∈ F \ D to estimate the recognition rate. Note that for f ∈ D we have N_D(f) = f and hence I(f) = I(N_D(f)); such descriptors would trivially be recognized correctly, which is why they are excluded from the estimate. Let the set of correctly recognized face descriptors f ∈ F \ D be defined by

$$R(D) \equiv \{ f \in F \setminus D \mid I(f) = I(N_D(f)) \}$$

We define the recognition accuracy A(D) of D as the fraction

$$A(D) \equiv \frac{\#R(D)}{\#(F \setminus D)}$$

and its mean recognition distance M(D) by

$$M(D) \equiv \frac{\sum_{f \in R(D)} d(f, N_D(f))}{\#R(D)}$$

We construct a sparse database representation by applying the following greedy algorithm.
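A direct transcription of these definitions into Python may help make them concrete. This is a minimal sketch; the names `evaluate`, `F`, `labels`, `D_idx`, and `dist` are ours, and the distance function is deliberately left abstract.

```python
def evaluate(F, labels, D_idx, dist):
    """Estimate A(D) and M(D) from the definitions above.
    F: list of descriptors; labels[i]: identity I of F[i];
    D_idx: set of indices of the candidate sparse set D within F;
    dist: the descriptor distance d(f, g)."""
    held_out = [i for i in range(len(F)) if i not in D_idx]  # F \ D
    correct, dists = 0, []
    for i in held_out:
        # N_D(f): the nearest neighbour of F[i] within D.
        j = min(D_idx, key=lambda k: dist(F[i], F[k]))
        if labels[i] == labels[j]:          # F[i] is in R(D)
            correct += 1
            dists.append(dist(F[i], F[j]))
    A = correct / len(held_out)             # A(D) = #R(D) / #(F \ D)
    M = sum(dists) / correct if correct else float('inf')  # M(D)
    return A, M
```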


Figure 2. Comparison of RRs using a learned representation to a randomly selected representation. (Plot "Recognition learning a sparse representation": RR learning (solid) and RR random (dashed) versus set size, with curves for the BHG24, BHG16, BHG8, SIFT, and LBP descriptors.)

Figure 3. RRs and DRs using identity specific thresholds (C=0.95). (Plot: RR (solid) and DR (dashed) versus set size, with curves for the BHG24, BHG16, BHG8, SIFT, and LBP descriptors.)

Given a user specified maximal set size we use a greedy approach to build D. Initially we compare the recognition accuracies of all subsets consisting of one element and select one with maximal A(D); in case of ties we choose one with minimal M(D). Until the maximal set size is reached, we add to D the element which maximizes the recognition accuracy A(D), again choosing one with minimal M(D) in case of ties. Also, we learn identity specific thresholds T_D(I) from the maximum distance of the face descriptors f ∈ R(D) with identity I to their nearest neighbours, defined by

$$T_D(I) \equiv C \max_{f \in R(D),\ I(f) = I} d(f, N_D(f))$$

where C is a user specified constant. At recognition a query descriptor q is recognized using N_D and T_D as follows: if d(q, N_D(q)) < T_D(I(N_D(q))) we identify q by I(N_D(q)); otherwise, we discard q.
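The greedy construction and the threshold-based recognition rule can be sketched along the same lines. This is again a non-authoritative sketch reusing the hypothetical `evaluate` above; it naively re-evaluates A(D) for every candidate at every step, so a practical implementation would cache nearest-neighbour distances.

```python
def _score(F, labels, D_idx, dist):
    A, M = evaluate(F, labels, D_idx, dist)
    return (A, -M)   # maximize A first, break ties by minimal M

def learn_sparse_set(F, labels, max_size, dist):
    """Greedy construction of D: start from the single descriptor with the
    highest A(D), then repeatedly add the element maximizing A(D)."""
    D_idx = set()
    while len(D_idx) < max_size:
        candidates = [i for i in range(len(F)) if i not in D_idx]
        best = max(candidates,
                   key=lambda i: _score(F, labels, D_idx | {i}, dist))
        D_idx.add(best)
    return D_idx

def learn_thresholds(F, labels, D_idx, dist, C=0.95):
    """T_D(I) = C times the maximum distance from correctly recognized
    held-out descriptors with identity I to their nearest neighbour in D."""
    worst = {}
    for i in (i for i in range(len(F)) if i not in D_idx):
        j = min(D_idx, key=lambda k: dist(F[i], F[k]))
        if labels[i] == labels[j]:  # correctly recognized: F[i] in R(D)
            d = dist(F[i], F[j])
            worst[labels[i]] = max(worst.get(labels[i], 0.0), d)
    return {I: C * d for I, d in worst.items()}

def recognize(q, F, labels, D_idx, thresholds, dist):
    """Identify a query q by its nearest neighbour in D, or discard it."""
    j = min(D_idx, key=lambda k: dist(q, F[k]))
    if labels[j] in thresholds and dist(q, F[j]) < thresholds[labels[j]]:
        return labels[j]
    return None  # discard: unknown face or too low quality
```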

3 Experimental results

We tested our method for application in an unconstrained environment using the IIT-NRC facial video database [5], publicly available from http://synapse.vit.iit.nrc.ca/db/video/faces/cvglab/. The database contains pairs of short video clips, each showing the face of a computer user sitting in front of the monitor and exhibiting a wide range of facial expressions and orientations, as captured by a webcam mounted on the computer monitor. Fig. 1 shows some frames extracted from these video clips by the IDIAP face detector. These video clips contain difficulties inherent to real-time environments: low resolution, motion blur, faces out of focus, facial expression variation, and occlusions. For 11 individuals a pair of video clips is recorded in this database. In our experiments we used the same set-up as Gorodnichy [5], using one video clip of 10 pairs for learning. For these pairs the other video clip is used to estimate the Recognition Rate (RR) and the Discard Rate (DR), using each extracted face frame as a query. One video clip of the remaining pair is used to estimate the False Acceptance Rate (FAR) when an unknown subject is presented to the system. The DR is defined as the ratio of the number of queries discarded by the system to the number of queries presented to the system. For queries with a known identity, the RR is defined as the ratio of the number of queries correctly recognized by the system to the number of queries not discarded by the system. For queries with an unknown identity, the FAR is defined as the ratio of the number of queries not discarded by the system to the number of queries presented to the system.
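Under the stated definitions, the three rates could be computed as in the following hypothetical sketch; the data layout and function name are our assumptions.

```python
def evaluation_rates(known_results, unknown_results):
    """known_results: list of (true_label, prediction or None), one per
    query from a known subject (None means the query was discarded);
    unknown_results: list of prediction or None, one per query from the
    unknown subject. Assumes both lists are non-empty."""
    discarded = sum(1 for _, p in known_results if p is None)
    DR = discarded / len(known_results)
    answered = [(t, p) for t, p in known_results if p is not None]
    RR = sum(t == p for t, p in answered) / len(answered)
    # Any non-discarded query from an unknown subject is a false acceptance.
    FAR = sum(p is not None for p in unknown_results) / len(unknown_results)
    return RR, DR, FAR
```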

In our tests we compare implementations of the 24-BHG, 16-BHG, and 8-BHG descriptors with implementations of the SIFT descriptor and the LBP descriptor, both described in the literature. We compare two s-BHG descriptors by their L1 distance, defined as the sum of the absolute values of the gray value differences over the s² image pixels, s = 8, 16, 24. The SIFT descriptor, consisting of five overlapping local descriptors [9], is computed for faces normalized to 51x51 pixels [12]. Two SIFT descriptors are compared by their Euclidean distance in the 360-dimensional SIFT feature space [12]. The LBP representation used is the LBP_{4,1} + LBP^{u2}_{8,1} descriptor from [7], which is computed for faces normalized to 19x19 pixels. The LBP features, which are histograms of local binary patterns, are compared by the χ² metric [7].
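For illustration, the three distance functions could be sketched as follows. The array layouts, the integer cast guarding against unsigned wraparound, and the epsilon guard in the chi-square metric are our assumptions.

```python
import numpy as np

def l1_distance(a, b):
    """L1 distance between two s-BHG descriptors: the sum of absolute
    gray value differences over the s*s pixels (cast avoids uint8 wrap)."""
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def euclidean_distance(a, b):
    """Euclidean distance, as used for the 360-dimensional SIFT features."""
    return np.linalg.norm(a.astype(np.float64) - b.astype(np.float64))

def chi2_distance(a, b, eps=1e-10):
    """Chi-square metric for LBP histograms: sum_i (a_i - b_i)^2 / (a_i + b_i)."""
    return (((a - b) ** 2) / (a + b + eps)).sum()
```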

The first experiment compares RRs using a learned representation to one obtained by randomly selecting representatives, with the same number of representatives for each identity. Also, instead of applying identity specific thresholds, we always set the identity of the query to the identity of its nearest neighbour. The results, illustrated by Fig. 2, show that our strategy improves the RR by over 10%. However, without identity specific thresholds the RR does not exceed 85%.

The second experiment tests recognition using identity specific thresholds, setting C = 0.95. Figs. 3 and 4 compare RRs, DRs, and FARs for the 24-BHG, 16-BHG, 8-BHG, SIFT, and LBP descriptors. Using a set size of 50 (10% of the data) the SIFT descriptor obtains the highest recognition rate (95%), followed by the 16-BHG descriptor (94%). However, the DR and FAR for the SIFT descriptor (47% and 18%) are much higher than for the 16-BHG descriptor (26% and 4%). We conclude that the 16-BHG descriptor achieves a RR of 94% using 10% of all data at a FAR of 4%.

Figure 4. RRs and FARs using identity specific thresholds (C=0.95). (Plot: RR (solid) and FAR (dashed) versus set size, with curves for the BHG24, BHG16, BHG8, SIFT, and LBP descriptors.)

4 Conclusions and further research

We proposed a new method to obtain a sparse representation of face descriptors and introduced the BHG descriptor. We evaluated our method and our descriptor for on-line face recognition using a single query on a video face database of pairs of unconstrained low resolution video clips of ten subjects. Our experimental results show that the proposed scheme to learn a sparse representation of the data using the newly introduced BHG descriptor achieves a recognition rate of 94% using 10% of all data, at a false acceptance rate of 4%.

We identify several further research topics. Alternatives to the greedy selection of the most discriminative face descriptors may produce smaller databases or improve recognition rates. To take into account data continuously recorded in a real environment, on-line learning can be used to prune and update the database of face descriptors. Also, the time dimension can be exploited, e.g. to avoid using two face descriptors recorded close in time. Moreover, fusion of different face descriptors may improve recognition rates. For larger databases our approach can be applied as an effective filter, after which more detailed comparisons can be made. Finally, fusion of speaker and face recognition is a promising research direction to handle larger databases.

Acknowledgments

We thank Sebastian Marcel from IDIAP, Martigny, Switzerland for his assistance with using the IDIAP face detector downloaded from http://www.idiap.ch/~marcel/en/facedemos.html and Josef Sivic from the University of Oxford for providing their implementation of the SIFT descriptor.

References

[1] Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI, 19(7):711-720, 1997.
[2] Bourke, P.: Bicubic Interpolation for Image Scaling. http://astronomy.swin.edu.au/~pbourke/colour/bicubic/
[3] Cox, T., Cox, M.: Multidimensional Scaling. Chapman & Hall.
[4] Gersho, A., Gray, R.M.: Vector Quantization and Signal Compression. Kluwer, Boston.
[5] Gorodnichy, D.O.: Video-based Framework for Face Recognition in Video. Proc. Second Canadian Conference on Computer and Robot Vision (CRV'05), 2005.
[6] Johnson, S.C.: Hierarchical clustering schemes. Psychometrika, 32:241-254.
[7] Hadid, A., Pietikäinen, M., Ahonen, T.: A Discriminative Feature Space for Detecting and Recognizing Faces. Proc. CVPR'04, 2004.
[8] Kohonen, T.: Self-Organizing Maps. Springer-Verlag.
[9] Lowe, D.G.: Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[10] Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[11] Sauquet, T., Marcel, S., Rodriguez, Y.: Multiview Face Detection. Internal report IDIAP-RR-05-49, ftp://ftp.idiap.ch/pub/reports/2005/rr-05-49.pdf, IDIAP, 2005.
[12] Sivic, J., Everingham, M., Zisserman, A.: Person Spotting: Video Shot Retrieval for Face Sets. Proc. CIVR 2005, 2005.
[13] Turk, M., Pentland, A.P.: Face recognition using eigenfaces. Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586-591, 1991.
[14] Zhou, S., Chellappa, R.: Beyond One Still Image: Face Recognition from Multiple Still Images and Videos. In: Face Processing: Advanced Modeling and Methods, Academic Press, 2005.
