Face Detection Using Locally Linear Embedding

By Samuel Kadoury

Department of Electrical and Computer Engineering
McGill University, Montreal, Canada

November 2005

A thesis submitted to McGill University in partial fulfillment of the requirements of the Degree of Master of Engineering

© Samuel Kadoury 2005

Abstract

Human face detection in gray scale images has been researched extensively over the past decade, due to the recent emergence of applications such as security access control, visual surveillance and content-based information retrieval. However, this problem remains challenging because faces are non-rigid objects that have a high degree of variability in size, shape, color and texture. Indeed, few of the proposed face detection methods have been analyzed for performance under different conditions, such as head rotation, illumination, facial expression, occlusion and aging.

Nowadays, most face detection methods are based upon statistical and learning strategies. Many of these appearance-based methods tend to increase data complexity, by mapping it onto a higher-dimensional space in order to extract the predominant features; this, however, often requires much more computational time. A novel technique that is gaining in popularity, known as Locally Linear Embedding (LLE), adopts a different approach to the problem by applying dimensionality-reduction to the data for learning and classification. Proposed by Roweis and Saul, the objective of this method is to determine a locally-linear fit, so that each data point can be represented by a linear combination of its closest neighbors.

The first objective of the current research is to apply the LLE algorithm to 2D facial images, so as to obtain their representation in a sub-space under the unfavorable conditions stated above. The low-dimensional data then will be used to train a Support Vector Machine to classify images as being face or non-face. For this research, six different databases of cropped facial images, corresponding to variations in head rotation, illumination, facial expression, occlusion and aging, were built to train and test the classifiers. The second objective is to evaluate the feasibility of using the combined efficacy of the six SVM classifiers in a two-stage face detection approach. Experimental results obtained with image databases demonstrated that the performance of the proposed method was similar to and sometimes better than other face detection methods, introducing a viable and accurate alternative to previously existing techniques.


Sommaire

Face detection in images has been a very active research area in recent years, owing notably to the emergence of applications such as authentication, recognition and the retrieval of faces from databases. However, few of the methods proposed for face detection have analyzed their performance under varying conditions such as head pose, illumination, facial expression, occlusion or aging.

Nowadays, most new detection methods are based on statistical learning. However, many of these methods seek to increase the complexity of the supplied data in order to extract the predominant features, which risks making the task more burdensome. A new technique, called Locally Linear Embedding (LLE), advocates the opposite approach to this problem by reducing the complexity of the data for learning and classification purposes. Proposed by Roweis and Saul, the objective of this method is to select a piecewise approximately linear domain, so that each point can be represented by a linear combination of its nearest neighbors.

The first objective of this research is to apply the LLE technique to facial images in order to obtain their representation in a simplified subspace, under the unfavorable conditions mentioned above. These features are then used to train Support Vector Machines (SVMs) to classify images as faces or non-faces. For this project, six databases of facial images, corresponding to variations in head pose, illumination, facial expression, occlusion and aging, were used. The second objective is to evaluate the feasibility of combining six classifiers in a two-stage face detection process. Experiments carried out on image databases show that the proposed detection method yields results similar to, and sometimes better than, other detection methods, thereby providing a reliable and accurate technique.


Acknowledgments

First and foremost, I would like to thank my supervisor, Professor Martin D. Levine, for having proposed such a fascinating and meaningful research project, for his guidance and patience throughout the preparation of this thesis, and for sharing with me his vast experience in the field of computer vision. But I would be remiss not to thank many others. Hence, I would like to express my gratitude to several professors from McGill University, the University of Montreal and École Polytechnique for having helped me to understand the various aspects and challenges of computer vision. They include Prof. Doina Precup for teaching me about machine learning and probabilistic reasoning; Prof. Richard Rose for automatic speech recognition and perception; Prof. Martin D. Levine for image processing; Prof. Hannah Michalska for optimization algorithms; Prof. Benoit Godbout for computer graphics; and Prof. James Clark for statistical computer vision. I would like to thank Prof. Carl-Éric Aubin and Prof. Farida Cheriet for having accommodated my working part-time during my thesis, and for having provided me the opportunity to pursue research activities with their group. The research itself was supported by Le Fonds Québécois de la Recherche sur la Nature et les Technologies.

Throughout the course of this thesis, my interactions with my colleagues have been most beneficial, so I would like to thank Gurman Singh Gill, Donovan Parks, Jean-Philippe Gravel, and especially Hakirat Shambi, all of whom offered great suggestions and tips towards solving my problems, as well as help in developing the LLE software that was fundamental to my project. Thank you again, so much, for your help and cooperation. Finally, and most importantly, I dedicate this thesis to my wonderful wife, Pascale, and to my loving parents, Morris and Dominique, for having supported and encouraged me throughout my Master's. They have always shown me, by way of example, the meanings of perseverance and patience; and, for that, I am extremely grateful.

- To my wife and parents.


Table of Contents

Abstract
Sommaire
Acknowledgments
List of Figures
List of Tables

Chapter 1 - Introduction
    1.1 Motivation for a Low-Dimensional Face Detection Classifier
    1.2 Thesis Contributions
    1.3 Thesis Outline

Chapter 2 - Background: Face Detection Methods for Single Image Input
    2.1 Challenges of Face Detection
        2.1.1 Scale
        2.1.2 Pose
        2.1.3 Illumination
        2.1.4 Facial expression
        2.1.5 Occlusion
        2.1.6 Aging
    2.2 Face Detection Techniques
        2.2.1 Knowledge-based methods
        2.2.2 Feature invariant approaches
        2.2.3 Template matching methods
        2.2.4 Appearance-based methods
        2.2.5 Summary

Chapter 3 - Locally Linear Embedding
    3.1 Literature Review on Locally Linear Embedding (LLE)
    3.2 Image Processing Techniques
        3.2.1 Bi-cubic image scaling
        3.2.2 Illumination normalization
    3.3 Locally Linear Embedding (LLE)
        3.3.1 Motivation behind LLE
        3.3.2 Theory
        3.3.3 LLE algorithm
    3.4 Effect of the Neighborhood Size K and Intrinsic Dimensionality d on the LLE Embeddings
    3.5 Measures for Determining Optimal Parameter Values for LLE
        3.5.1 Measures for the intrinsic dimensionality d
        3.5.2 Selecting the neighbourhood size K
    3.6 Embedding New Data Points
        3.6.1 Mapping using K closest neighbours
        3.6.2 Forward mapping using Support Vector Regression (SVR)
    3.7 Other Dimensionality Reduction Methods
        3.7.1 Principal Component Analysis (PCA)
        3.7.2 Eigenfaces
    3.8 Summary

Chapter 4 - Statistical Learning Methods
    4.1 Hyperplane Classifiers
    4.2 Support Vector Classifiers
    4.3 Support Vector Regression
    4.4 Kernel Methods
    4.5 Summary

Chapter 5 - Experiments and Results
    5.1 Experimental Methodology
        5.1.1 Face databases
        5.1.2 System description
    5.2 Experiments on the Frontal Database (DB1)
        5.2.1 Experiments on the D parameter
        5.2.2 Experiments on the K parameter
        5.2.3 Experiments on the d parameter
        5.2.4 Experiments on mapping new images using SVR
        5.2.5 Face detection accuracy
    5.3 Experiments on the Head Pose Database (DB2)
        5.3.1 Experiments on the K parameter
        5.3.2 Experiments on the d parameter
        5.3.3 Experiments on mapping new images using SVR
        5.3.4 Face detection accuracy
    5.4 Experiments on the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases
        5.4.1 Experiments on the K parameter
        5.4.2 Experiments on the d parameter
        5.4.3 Face detection accuracy
    5.5 Experiments on the Aging Database (DB6)
    5.6 Experiments on Random Input Images
        5.6.1 Testing the system on facial databases
        5.6.2 Testing the system on the CBCL face database
    5.7 Summary

Chapter 6 - Face Detection in a Single Image
    6.1 Face Detector Implementation
        6.1.1 Scanning the detector
        6.1.2 Integration of multiple detections
    6.2 Improvements on the Face Detector
        6.2.1 Illumination compensation
        6.2.2 Non-face training set
        6.2.3 Fusion method for the six-decision process
        6.2.4 Contribution from classifiers
    6.3 Face Detection Results
        6.3.1 Image database
        6.3.2 Face detection results on the MIT-CMU Database

Chapter 7 - Conclusion

References


List of Figures

Figure 1.1 : Sample gray scale image containing a human face
Figure 1.2 : The problem of non-linear dimensionality reduction, as illustrated for a three-dimensional model of a face
Figure 1.3 : Diagram depicting the proposed strategy for face detection using LLE
Figure 2.1 : Example of a single image from the MIT-CMU database [65], with faces at different scales
Figure 2.2 : Effect of illumination on a face
Figure 2.3 : Effect of facial expression on facial geometry
Figure 2.4 : Effect of occlusion on the same face
Figure 2.5 : Example of horizontal and vertical signatures used in [37]
Figure 2.6 : A 14x16 pixel ratio template, composed of 16 regions and 23 relationships, used for face detection as described in [74]
Figure 2.7 : The distance measures used in Sung and Poggio's method
Figure 2.8 : System diagram of the Rowley-Kanade neural network method [65]
Figure 2.9 : Hidden Markov model for face detection with five hidden states [68]
Figure 2.10 : Face detection cascade of classifiers shown in [84]
Figure 3.1 : Images of faces mapped into the embedding space [61] described by the first two coordinates of LLE, using the 12 nearest neighbours
Figure 3.2 : Obtaining new data from a face image sequence which was dimensionality-reduced by LLE [85]
Figure 3.3 : Illumination normalization on a given image, using histogram equalization and the resulting phase image from the FFT
Figure 3.4 : The problem of non-linear dimensionality reduction [63], as illustrated for three-dimensional data sampled from a given manifold
Figure 3.5 : Summary of the main steps in the LLE algorithm
Figure 3.6 : Sample Eigenfaces for a given database
Figure 3.7 : Example given in [80] of a face space used for face detection with Eigenfaces
Figure 4.1 : The optimal hyperplane in SVMs is orthogonal to the shortest line connecting the convex hulls of the two classes
Figure 4.2 : Representation of a linear separating plane between classes O and ∆
Figure 4.3 : Transformation of the input space into a higher-dimensional feature space, via Φ
Figure 4.4 : Schematic representation of the Support Vector Regression (SVR) algorithm
Figure 4.5 : Schematic representation of the sequence of operations for SVR
Figure 4.6 : Example of a Support Vector classifier found by using a radial basis function kernel
Figure 5.1 : Sample facial images from the Notre-Dame Database [51]
Figure 5.2 : Sample facial images from the AR Database [48]
Figure 5.3 : Sample facial images from the UMIST Database [21]
Figure 5.4 : Sample facial images from the Aging Database [17]
Figure 5.5 : Schematic description of the training procedure for the proposed LLE face detection system
Figure 5.6 : LLE embeddings of the Frontal Database (DB1) at d=2, with varying initial D dimensions
Figure 5.7 : CPU time for embedding the Frontal Database (DB1) with varying initial D dimensions
Figure 5.8 : Graphical representation of LLE embeddings on the Frontal Database (DB1), with varying K value (number of neighbours) at d=2
Figure 5.9 : Evolution of the percentage (%) increase in I(K) (number of significant weights) with respect to the number of K neighbours for the Frontal Database
Figure 5.10 : Representation of embeddings at d=1, 2, 3 using the Head Pose Database (DB2)
Figure 5.11 : Evolution of the residual variance with respect to the lower d dimension for the Frontal Database (DB1)
Figure 5.12 : Cross-validation accuracy for the regression model at d=2 for the Frontal Database (DB1), with respect to the C and γ parameters of the RBF kernel
Figure 5.13 : Cross-validation accuracy for the regression model at d=3 for the Frontal Database (DB1), with respect to the C and γ parameters of the RBF kernel
Figure 5.14 : Embedding of new sample images onto the LLE embedding of the Frontal Database (DB1)
Figure 5.15 : Evolution of the detection accuracy (%) on DB1, with respect to the number of Principal Components (PC) for the PCA face detection approach
Figure 5.16 : ROC curves for the proposed LLE and PCA methods
Figure 5.17 : Graphical representation of LLE embeddings in three dimensions (d=3) of the Head Pose Database (DB2)
Figure 5.18 : Evolution of the percentage (%) increase in I(K) (number of significant weights) with respect to the number of K neighbours for the Head Pose Database
Figure 5.19 : Evolution of the residual variance, ρ, with respect to the lower d dimension for the Head Pose Database (DB2)
Figure 5.20 : Embedding of non-face images using SVR analytical mapping, onto the LLE map of training facial images from the Head Pose Database (DB2)
Figure 5.21 : Evolution of the detection accuracy (%) on DB2, with respect to the lower d dimension for the proposed LLE face detection approach
Figure 5.22 : Evolution of the percentage (%) increase in I(K) with respect to the number of K neighbours for the DB3, DB4 and DB5 Databases
Figure 5.23 : Evolution of the residual variance, ρ, with respect to the lower d dimension for the DB3, DB4 and DB5 Databases
Figure 5.24 : Evolution of the detection accuracy (%) for the DB3, DB4 and DB5 Databases, with respect to the d dimension for the LLE face detection approach
Figure 5.25 : LLE embeddings in three dimensions (d=3, with K=10) for the DB3, DB4 and DB5 Training Databases
Figure 5.26 : Evolution of: (a) percentage (%) increase in I(K) with respect to the number of K neighbours for the Aging Database (DB6); (b) residual variance, ρ, with respect to the lower d dimension for the Aging Database (DB6)
Figure 5.27 : Evolution of the detection accuracy (%) for the Aging Database (DB6), with respect to the d dimension for the proposed LLE face detection approach
Figure 5.28 : ROC curves for the proposed LLE and PCA methods using the testing images from DB1, DB2, DB3, DB4, DB5 and DB6
Figure 5.29 : Schematic representation of the testing system for the proposed LLE face detection method
Figure 5.30 : Distribution of classifiers used for detecting faces in the six testing databases (DB1, DB2, DB3, DB4, DB5, DB6)
Figure 5.31 : Sample face and non-face images from the CBCL database
Figure 5.32 : Distribution of cropped images in the CBCL database identified as faces, using the proposed LLE method
Figure 5.33 : ROC curves of different systems tested on images from the CBCL database
Figure 6.1 : ROC curves using three different pre-processing methods for illumination compensation
Figure 6.2 : Testing six different non-face image sets for training the proposed LLE face detector
Figure 6.3 : ROC curves using three different fusion strategies
Figure 6.4 : ROC curves using each classifier individually
Figure 6.5 : Schematic representation of the final version of the LLE detector
Figure 6.6 : Sample images from the subsets of the MIT-CMU image database
Figure 6.7 : ROC curves for the proposed LLE face detector, the Rowley system [65] and the Viola-Jones method [84], each tested on the MIT-CMU Database
Figure 6.8 : Results obtained from the proposed LLE face detector on a number of test images from the MIT-CMU Database
Figure 7.1 : Example of face images undetected by the LLE face detector
Figure 7.2 : Examples of false positives generated by the LLE face detector


List of Tables

Table 2.1 : Categorization of face detection methods
Table 2.2 : Comparison of detection rates for various face detection techniques
Table 5.1 : Face databases available on the Internet that were used in this research to create six facial databases in order to train the proposed face detector
Table 5.2 : Initial D dimension values for the correlation experiment on the Frontal Database (DB1)
Table 5.3 : Correlation values between LLE embeddings of the Frontal Database (DB1) at d=2, while varying initial D dimensions
Table 5.4 : Cross-validation accuracy for all six regression models for the Frontal Database (DB1), with the optimal C and γ parameters of the RBF kernel
Table 5.5 : Cross-validation accuracy for SVM classification with varying d dimensions on DB1, with the optimal C and γ parameters of the RBF kernel
Table 5.6 : Detection results for the Frontal Database (DB1) with varying d dimensions
Table 5.7 : Cross-validation accuracy for the SVM classifier using the PCA face detection approach on the Frontal Database (DB1)
Table 5.8 : Area Under Receiver Operating Characteristic Curves for the LLE and PCA methods when evaluated on the testing images from the Frontal Database (DB1)
Table 5.9 : Cross-validation accuracy for all six regression models for the Head Pose Database (DB2), with the optimal C and γ parameters of the RBF kernel
Table 5.10 : Optimal detection results for the Head Pose Database (DB2)
Table 5.11 : Optimal detection results for the DB3, DB4 and DB5 Databases
Table 5.12 : Optimal detection results for the Aging Database (DB6)
Table 5.13 : Optimal K and d values obtained from experiments conducted on the DB1, DB2, DB3, DB4, DB5 and DB6 Databases
Table 5.14 : Comparison of detection rates between identifying faces with a single classifier and using the proposed combination system
Table 5.15 : Numerical comparison of results on images of the CBCL testing set
Table 6.1 : Comparison of detection rates for various face detection techniques, tested on the MIT-CMU Database [70]


Chapter 1 - Introduction

Human face detection has been researched extensively over the past decade, due to the recent emergence of applications such as security access control, visual surveillance, content-based information retrieval, and advanced human-to-computer interaction. It is also usually the first task performed in a face recognition system; consequently, to ensure good results in the recognition phase, face detection is a crucial procedure. In the last ten years, face and facial expression recognition have attracted much attention, though they have truly been studied for more than 20 years by psychophysicists, neuroscientists and engineers. Many research demonstrations and commercial applications have been developed from these efforts. The first step of any face processing system is to locate all faces that are present in a given image. However, face detection from a single image is a challenging task because of the high degree of spatial variability in scale, location and pose (rotated, frontal, profile). Facial expression, occlusion and lighting conditions also change the overall appearance of faces, as described in reference [90].

In the literature, many different approaches are described in which skin color has been used as an important cue for reducing the search space [71]; but, in general, features in gray scale images have been studied the most. In essence, there are four categories of approach to face detection, two of which now are predominant in the research field: (1) Knowledge-based approaches that use component extraction, such as eyes, nose and


mouth, as indicators to deduce whether a global face exists [88]; and (2) Appearance-based methods that employ Neural Networks or Support Vector Machines (SVMs). These appearance-based methods are rooted in learning algorithms and have attracted much attention recently, mostly because they have yielded excellent results. A full review of past and recent face detection techniques is presented later in this document.

To solve this problem correctly, one must understand that face detection is a two-class decision problem: discriminating facial patterns from background (“non-faces”) at every location in an image. The main difficulty in developing a reliable face detection approach arises from the tremendous variability that exists in the overall appearance of a face. Furthermore, noisy images and illumination conditions in the scene can increase the complexity of the detection process. Therefore, a successful face detection approach must be able to represent facial images in a manner that accounts for all of these challenges, in order to detect the presence of a face under a wide spectrum of conditions.

The goal of this research is to develop an appearance-based method which can detect all faces present in a static 2D gray scale image of decent resolution (Figure 1.1 shows an example), under a variety of unfavorable conditions. These conditions include variations in head rotation, illumination, facial expression, occlusion and aging. The main challenge that had to be addressed was to learn how to use a recently-developed approach (Locally Linear Embedding) on 2D facial images, so as to obtain their representation in a low-dimensional sub-space. Once the facial data are transformed into a lower-dimensional space d, Support Vector Regression (SVR) is used to define a mapping from the input to the output space for these data. Thus, the SVR provides a new way to compute the location of a point in d-space, given its location in the input dimension. We demonstrate experimentally that, if this very same SVR is used to map previously unseen non-face data, the latter will in general be clustered separately from the face data. An SVM then is used to classify new input patterns as being face or non-face. Six such classifiers were trained using six different databases from different sources. The six decisions then were fused to provide a final decision.

Figure 1.1 : Sample gray scale image containing a human face, from the MIT-CMU database [65].
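The two-stage strategy just described (an LLE embedding of face data, an SVR mapping for unseen patches, and an SVM face/non-face decision) can be sketched with scikit-learn. The toy data, dimensions, neighbourhood size and kernels below are illustrative assumptions, not the configuration used in this thesis:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Toy stand-ins for vectorized 20x20 face and non-face patches (400-D).
faces = rng.normal(0.6, 0.1, size=(80, 400))
nonfaces = rng.normal(0.3, 0.2, size=(80, 400))

# Step 1: embed the face training data into a low-dimensional space.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
Y_faces = lle.fit_transform(faces)

# Step 2: learn an input-space -> embedding-space mapping with SVR
# (one regressor per output dimension), so that previously unseen
# patches can be projected into the same space.
mapper = MultiOutputRegressor(SVR(kernel='rbf')).fit(faces, Y_faces)

# Step 3: map both classes through the SVR and train a face/non-face SVM.
X_all = np.vstack([faces, nonfaces])
y_all = np.array([1] * len(faces) + [0] * len(nonfaces))
clf = SVC(kernel='rbf').fit(mapper.predict(X_all), y_all)

pred = clf.predict(mapper.predict(faces[:5]))
```

In the actual system, six such classifiers (one per condition database) would be trained and their decisions fused.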

1.1 Motivation for a Low-Dimensional Face Detection Classifier

To detect faces in images efficiently, dimensionality reduction is an important and necessary operation for multi-dimensional data. The goal of dimensionality reduction is to obtain a more compact representation of the original data, a representation that nonetheless captures all the information necessary for higher-level decision-making. In reference [85], the author presents four reasons for reducing the dimensionality of observation data: (1) To compress the data to reduce storage requirements; (2) To eliminate noise; (3) To extract features from data for face detection; and (4) To project data to a lower-dimensional space, so as to be able to discern the data distribution. For face detection and recognition, classical dimensionality reduction methods have included Eigenfaces, Principal Component Analysis (PCA) [75], Independent Component Analysis [3], and Linear Discriminant Analysis [4]. However, these methods all have serious drawbacks, such as being unable to reveal the intrinsic distribution of a given dataset, or inaccuracies in detecting faces that exhibit variations in head pose, facial expression or illumination.
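For comparison, the classical linear reduction underlying eigenface methods can be sketched in a few lines of NumPy; the random matrix below is only a stand-in for vectorized face images:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 400))   # 100 hypothetical 20x20 face vectors

# PCA / eigenfaces: project the centred data onto the top d
# eigenvectors of the covariance matrix, computed here via SVD.
d = 10
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:d]               # the d principal axes ("eigenfaces")
Z = (X - mean) @ components.T     # 100 x d low-dimensional codes

# Reconstruction from only d coefficients shows the linearity of the
# model, exactly the limitation that motivates non-linear methods.
X_hat = Z @ components + mean
```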

Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique recently proposed by Roweis and Saul [63]. The LLE algorithm maps high-dimensional observation data, that are presumed to lie on a non-linear manifold, onto a single global


coordinate system of lower dimensionality. LLE preserves neighborhood relationships, thereby revealing the underlying structure of the data. Many types of high-dimensional data can be characterized in this way - for example, images generated by different views of the same three-dimensional object, such as a human face. This problem is illustrated by the non-linear manifold presented in Figure 1.2.

Figure 1.2 : The problem of non-linear dimensionality reduction, illustrated for a three-dimensional model of a face (a). The 2D map (b) shows the neighbourhood-preserving mapping discovered by the LLE algorithm, which recovers the global internal coordinates of the manifold without being told how the data should be embedded in two dimensions.

In this example, dimensionality reduction by LLE succeeds in recovering the underlying manifold, whereas linear embedding methods, such as PCA or Multi-Dimensional Scaling (MDS), would map faraway data points to nearby points in the plane, creating distortions in both the local and global geometry. Another advantage of LLE is the low number of parameters that must be tuned to apply the algorithm correctly. Many qualities of LLE are shared by Tenenbaum’s ISOMAP algorithm [79], which is another successful method for non-linear dimensionality reduction. ISOMAP’s embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE embeddings, on the other hand, are optimized from local symmetries, linear coefficients, and reconstruction errors. LLE therefore avoids the need to solve large dynamic-programming problems, and it also tends to produce very sparse matrices, whose structure can be exploited for savings in time and space.
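The difference can be reproduced on the standard "swiss roll" data set, which stands in for the curved face manifold of Figure 1.2; the parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# 800 points lying on a curved 2-D manifold embedded in 3-D.
X, color = make_swiss_roll(n_samples=800, random_state=0)

# Linear embedding (PCA) flattens the roll and maps faraway points
# close together; LLE unrolls it by preserving local neighbourhoods.
Y_pca = PCA(n_components=2).fit_transform(X)
Y_lle = LocallyLinearEmbedding(n_neighbors=12,
                               n_components=2).fit_transform(X)
```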


At this time, LLE has not been used widely in computer vision applications, although it is becoming increasingly popular. Therefore, many problems that arise when using LLE remain unresolved. Some of the objectives of this thesis are: 1) to study the different challenges that arise when using LLE; 2) to propose solutions to these problems; and 3) to study the overall feasibility of LLE when confronting a face detection problem. To achieve this, a series of preliminary experiments was conducted to study the effect of LLE on a set of facial images. After discovering that selecting arbitrary values for the LLE parameters yielded unsatisfactory mappings, methods to determine optimal parameter values for a given data set were investigated and implemented. Another challenge was to discover how to embed new sample points into a previously obtained LLE map. A thorough search of the literature revealed two feasible methods to perform such a task. Experiments found that statistical regression gave optimal results for mapping unseen points from a D-space to a low-dimensional LLE map. Once these tests were performed, a strategy was established to perform face detection using LLE embeddings. This strategy is summarized in Figure 1.3.

Figure 1.3 : Diagram depicting the proposed strategy for face detection using LLE.

1.2 Thesis Contributions

The main contribution that this thesis brings to the computer vision community resides in proposing and implementing a face detection method for single gray scale images, based upon LLE mappings. This method integrates the detection of a face under different conditions. For this to be feasible, six databases were assembled, corresponding to the challenges of 1) pose, 2) facial expression, 3) occlusion, 4) image orientation, 5) imaging conditions and 6) aging, in order to train six classifiers able to detect faces under these various conditions. Face detection experiments on single images were conducted to assess the performance of this method, by comparing the results to other methods; thereafter, modifications to increase face detection accuracy were proposed and tested.

Other contributions relate to providing a comprehensive review of existing face detection techniques for use with gray scale images, first by describing the different challenges, and then by presenting the most significant work after dividing the field into four categories. A final contribution relates to the LLE algorithm, by proposing and developing programs that can automatically determine optimal LLE parameters (specifically, the number of neighbors, represented by the symbol K, and the intrinsic dimensionality, d) for a given data set. This will help future LLE studies to determine optimal parameters in an efficient way.
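One simple selection criterion along these lines (an assumption for illustration, not the exact procedure developed in Chapter 3) is to scan candidate neighbourhood sizes K and keep the one that minimizes the LLE reconstruction error:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=400, random_state=0)

# Fit LLE for several K and record the residual reconstruction error.
errors = {}
for K in (4, 8, 12, 16, 20):
    lle = LocallyLinearEmbedding(n_neighbors=K, n_components=2).fit(X)
    errors[K] = lle.reconstruction_error_

best_K = min(errors, key=errors.get)   # K with the smallest residual
```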

1.3 Thesis Outline

This thesis is divided into seven chapters. Following this introduction, Chapter 2 presents a full review of the various challenges in face detection, as well as the four existing categories of face detection techniques. 1) Knowledge-based methods are presented first, and include rule-based methods that encode human knowledge regarding what constitutes a typical face. Usually, the rules capture the relationships between facial features. 2) Feature-invariant approaches are algorithms that aim to find structural features that exist even when the pose, viewpoint or lighting conditions vary, and then use these to locate faces. 3) Template-matching methods are described next. These usually consist of several standard facial patterns, which are stored to describe the face as a whole or in terms of separate facial features. The correlation between an input image and the stored patterns is computed for detection. 4) The fourth and last category consists of appearance-based methods. In contrast to template matching, the models (or templates) used here are learned from a set of training images that are meant to capture the representative variability of facial appearance. These learned models then are used for detection.


Chapter 3 presents a thorough description of the Locally Linear Embedding algorithm and its challenges. A review of previous applications using LLE is presented, followed by a description of the image processing techniques that we will use, in this research, before applying LLE. Then, the motivation and theory behind the LLE algorithm are presented in detail, followed by the effect of its parameters (number of neighbors, initial dimension and intrinsic dimensionality) on the LLE embeddings. A description of algorithms used to obtain the optimal values for these parameters follows. Finally, methods that perform the mapping of new sample points, which were unseen during the initial mapping, also are described.

Chapter 4 provides a brief overview of statistical learning methods, and more specifically of SVMs, as they have a fundamental role in this project. An introduction to separating hyperplane classifiers is provided, followed by an exposition of the theory behind Support Vector classifiers, as their use forms the basis of face/non-face classification. An introduction to Support Vector Regression (SVR) also is presented, as this statistical technique is used for mapping new sample points onto an LLE face map. Finally, two popular kernel methods are presented: the polynomial and the Gaussian Radial Basis Function (RBF) kernels.
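A minimal sketch of such a Support Vector classifier with a Gaussian RBF kernel, k(x, x') = exp(-gamma * ||x - x'||^2), on toy two-dimensional data (the classifiers in this thesis operate on LLE coordinates instead):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Two well-separated toy classes standing in for face / non-face
# feature vectors.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [0] * 50)

# gamma controls the kernel width, C the softness of the margin.
clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)
acc = clf.score(X, y)   # training accuracy on separable toy data
```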

Chapter 5 is dedicated to experiments on facial databases. The first section of this chapter presents the six facial databases that were employed for this thesis, and is followed by a schematic description of the face detection system based upon the LLE algorithm. The following sections present the experiments and results from tests on each individual facial database, in order to obtain the optimal parameters for the LLE and SVM modules. The rationale behind testing each database individually is that, while it is true that some databases are “corrupted” by another factor, such as the age factor, using a database that only is age-related gives an idea of what this “corruption” is. An analysis of the results obtained from these databases indicated that each achieved optimal results in a different d-dimensional space. Hence, instead of combining all databases into a single cluster, the decision was made to adopt a six-decision process. The last section of this chapter presents detection results obtained on a standard database with cropped facial images, which then are compared to other established methods.
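The six-decision process can be sketched as follows; the OR and majority rules shown are plausible fusion choices for illustration, not necessarily the exact rule adopted later in the thesis:

```python
import numpy as np

# Hypothetical 1/0 (face / non-face) outputs of the six per-condition
# SVM classifiers for one candidate window.
decisions = np.array([0, 1, 0, 0, 0, 0])   # e.g. only the pose SVM fires

# Rule 1: declare a face if any of the six classifiers fires.
face_any = bool(decisions.any())

# Rule 2 (stricter): require a majority of the six classifiers.
face_majority = int(decisions.sum()) >= 3
```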

Chapter 6 presents the face detection problem that exists for a single image. Implementation of the face detector is described, followed by a description of improvements made to the LLE detector based upon the experiments conducted in Chapter 5. Experimental results obtained on an image database then are presented, and detection accuracy is compared to that achieved with other well-known face detection methods.

Chapter 7 concludes the thesis with a summary of the work that has been accomplished, a review of the objectives and their fulfillments, and a glimpse at future work that is planned to improve the proposed method.


Chapter 2

Background: Face Detection Methods for Single Image Input

To build fully-automated systems that analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the goal of face detection is to identify all image regions that contain a face, regardless of its three-dimensional position, orientation or lighting conditions. Such a problem is challenging, because faces are non-rigid objects that have a high degree of variability in size, shape, color and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this chapter is to categorize and review these algorithms. The motivation behind face detection is that numerous research efforts, such as face recognition, face tracking and pose estimation, assume that all faces in the images are identified and localized (i.e., that the image position of each single face is known). Therefore, to obtain robust automated systems, one must be able to detect faces within images in an efficient and highly reproducible manner. In reference [90], the author defines the problem as follows: “Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, return the image location and extent of each face”. The challenges attributed to this problem are: 1) pose, 2) facial expression, 3) occlusion, 4) image orientation, 5) imaging conditions and 6) aging. The first section in this chapter will examine some of these challenges, while the second will enumerate existing face detection methods that deal with these challenges when given a single image. Face detection in video sequences is not discussed here, as the focus of this research is on static gray scale images, although detection in color images will be discussed briefly.

2.1 Challenges of Face Detection

Face detection provides interesting challenges due to the stringent requirements for pattern classification and learning techniques. When a raw or filtered image is considered as input to a pattern classifier, the dimension of the feature space is extremely large (i.e., the number of pixels in normalized training images). These classes of face and non-face images are decidedly characterized by multimodal distribution functions, and effective decision boundaries likely will be non-linear. To be effective, a classifier either must be able to extrapolate from a modest number of training samples, or it must be efficient when dealing with a very large number of these high-dimensional training samples. The factors that affect a system’s performance in face detection are scale, pose, illumination, facial expression, occlusion and age.

2.1.1 Scale

In a single image, a group of faces may appear at different scales, as illustrated in Figure 2.1. The scale (or size) of a face may be dealt with by a simple resizing process or by warping, based on what is termed a stretching transform [6]. This recently-proposed transform requires localization of several feature points, such as the eyes, nose, or mouth, in order to warp the face in a way that corresponds to the face’s biometrics. These feature points may be located manually [6], or by automatic low-level processing based on facial geometry. However, in general, the latter is not sufficiently robust and, because the location of the feature points is unknown in a face detection problem, a resizing process usually is preferred.
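The resizing process can be sketched as an image pyramid: the input is repeatedly downsampled so that a fixed-size detection window effectively scans the image at every face scale. The scale factor, window size and nearest-neighbour resampling below are illustrative choices:

```python
import numpy as np

def pyramid(image, scale=1.2, min_size=20):
    """Yield successively smaller copies of a grayscale image until
    a fixed-size (min_size x min_size) window no longer fits."""
    while min(image.shape) >= min_size:
        yield image
        h = int(image.shape[0] / scale)
        w = int(image.shape[1] / scale)
        rows = (np.arange(h) * image.shape[0] / h).astype(int)
        cols = (np.arange(w) * image.shape[1] / w).astype(int)
        image = image[np.ix_(rows, cols)]   # nearest-neighbour resample

levels = list(pyramid(np.zeros((100, 120))))
```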


Figure 2.1 : Example of a single image from the MIT-CMU database [65], with faces in different scales.

2.1.2 Pose

The performance of face detection systems drops significantly when pose variations are present. Varying poses occur from a change in viewpoint or when the head is rotated in three dimensions. This also may lead to large variations in facial appearance, due to self-occlusion and self-shading.

Most of the previous work in face detection is limited to frontal views. Some investigators have attempted to adapt their method to different views [60][65][54]; still, dealing with rotation in depth and, hence, being able to detect faces across multiple views remains difficult. With respect to the feature-based approach, Cootes et al. [12] proposed a 3D active appearance model to compute face pose variations explicitly. In reference [6], a template-matching approach was used for face recognition with varying poses, whereby faces are represented by a large number of facial feature templates associated with different poses and different individuals. To reduce the computational demand, a hierarchical coarse-to-fine searching strategy was proposed, to allow for recognition of a new face as a multi-level pyramid representation of a face image. Still, these methods require a large set of templates that may not be available in some applications, and thus are very demanding.

A similar strategy, which would impose fewer computational demands, is to use an appearance-based method to learn the representation of a face under varying poses. Li et al. developed an SVM-based multi-view face detection and pose estimation model in reference [44], and improved their system in [45] by using Support Vector Regression to solve the problem of pose estimation. They then used a combined Eigenface and Support Vector Machines (SVMs) method, which improves the overall performance of detection in terms of speed and accuracy. However, this method demonstrates weak performance when other factors, such as illumination or occlusion, come into play.

2.1.3 Illumination

The problems that illumination creates are illustrated in Figure 2.2, in which the same face, with the same facial expression, seen from the same viewpoint, appears different due to changes in lighting. In fact, changes brought about by differences in illumination often are larger than the differences that exist between individuals, causing systems based upon comparisons of images to misclassify input images.

Figure 2.2 : Effect of illumination on a face.

While there has been a great deal of literature in computer vision that details methods to handle image variation produced by changes in pose, few efforts have been devoted to image variation produced by changes in illumination. For the most part, object detection algorithms either have ignored variations in illumination, or have dealt with them by measuring some property or feature of the image. A traditional way of solving this dilemma is by intensity normalization, which computes the ratio of local intensity to the average brightness in a suitable neighborhood [9]. This method works well only when there is a slight change in the light source direction with no shadowing. Recently, an illumination subspace approach has been proposed to deal with scene lighting conditions [18]. This method constructs an illumination cone of a face from a set of images taken under unknown lighting conditions. This approach is reported to perform significantly better, especially when evaluating images with extreme illumination, but has yet to be tested for face detection under a combination of other factors.
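The intensity-normalization step mentioned above (the ratio of each pixel to the average brightness of its neighborhood) can be sketched with SciPy; the neighborhood size is an illustrative choice:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def normalize_illumination(image, size=15, eps=1e-6):
    """Divide each pixel by the mean brightness of its size x size
    neighbourhood, flattening slowly varying lighting."""
    local_mean = uniform_filter(image.astype(float), size=size)
    return image / (local_mean + eps)

rng = np.random.default_rng(3)
img = rng.uniform(0.0, 255.0, size=(64, 64))   # toy grayscale image
flat = normalize_illumination(img)
```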

2.1.4 Facial expression

As opposed to the effects of scale, pose and illumination, facial expression can significantly modify the geometry of a face, as shown in Figure 2.3. Attempts have been made in computer graphics to model facial expressions from a muscular point of view [67]. An affine transformation method also has been proposed to normalize expressions [59], where various facial expressions are warped to obtain a normal facial expression.

Figure 2.3 : Effect of facial expression on facial geometry.

It was shown that facial expressions particularly affect the automated detection of facial features [87]. Even today, the problem of effective and expression-invariant face detection and segmentation remains unsolved. Experimental findings have revealed that detecting facial features using a knowledge-based approach is especially affected by expressions of happiness and disgust [22]. In particular, the detection of the nose and mouth is affected by facial images with expressions of disgust and happiness, with detection accuracies of 75% and 62%, respectively. Such deteriorated detection results show that a face detector must take these conditions into account to detect faces with various expressions.

2.1.5 Occlusion

Occlusion is another issue confronted by face detection in practice. Glasses, scarves and beards all change the appearance of a face. Figure 2.4 shows examples of the same face under different disguises. Most research so far only has addressed the problem of glasses [4][40]. The problem of hairstyle rarely is treated in face detection because, as long as it does not obstruct any facial features, it can be ignored easily using a proper cropping method.

Figure 2.4 : Effect of occlusion on the same face.

In reference [26], an SVM method with local kernels was proposed to realize a robust face detector under partial occlusion. The robustness of the proposed method was demonstrated by evaluating its performance in detecting occluded faces; it was found that the detector could recognize faces wearing sunglasses or a scarf. However, the authors proposed a polynomial kernel under constrained conditions, which might not be suitable for detecting occluded faces under a large variety of conditions.

2.1.6 Aging

Aging can be another problem confronted by face detection, since the same individual can have significantly different appearances throughout his or her lifetime. We note that research image databases that are available on the Internet usually contain a small range of ages (mostly graduate student age), and are not annotated by age. Therefore, it has not been possible to consider this factor. In the literature, there still are no studies that have analyzed the effect of aging on the performance of face detection.

2.2 Face Detection Techniques

Face detection can be viewed as a two-class recognition problem in which an image region is classified as being either a “face” or a “non-face”. With over 170 approaches to face detection reported [90], research in this area has broad implications for face recognition. The various approaches to face detection can be classified into four categories:


1. Knowledge-based methods use rule-based methods, which encode human knowledge of what a face is.
2. Feature invariant approaches regroup methods that aim to find robust structural features, invariant to pose, lighting, etc.
3. Template matching methods compute the correlation between standard patterns of a face and an input image for detection.
4. Appearance-based methods, in contrast to template matching, use models learned from training sets to represent the variability of facial appearance.

Table 2.1 summarizes the four categories by indicating representative approaches for each category.

Approach                        Representative Approaches
Knowledge-based                 Multi-resolution rule-based method [88]
Feature invariant
  - Facial features             Grouping of edges [43]
  - Skin color                  Gaussian mixtures [49]
  - Multiple features           Integration of skin color, size and shape [76]
Template matching
  - Predefined face templates   Shape template [19]
  - Deformable templates        Active Shape Model (ASM) [41]
Appearance-based
  - Eigenface                   Eigenvector decomposition and clustering [80]
  - Distribution-based          Gaussian distribution and multilayer perceptron [77]
  - Neural Network              Ensemble of neural networks and arbitration schemes [65]
  - Support Vector Machines     SVM with polynomial kernel [54]
  - Hidden Markov Model         Higher order statistics with HMM [58]
  - Adaboost                    Attentional cascade with boosted features [84]

Table 2.1 : Categorization of face detection methods.

In this section, we present a review of these four general approaches to detecting faces in images. Each approach has produced a subset of techniques that also are described briefly. A summary of the strengths and weaknesses inherent to each method is presented at the end, with a comparative table on the performance of some methods.

2.2.1 Knowledge-based methods

With this approach, face detection methods are based upon rules derived from the researcher’s knowledge of the geometry and anthropometry of a human face. Following a set of simple rules (for example, the symmetric properties of the eyes and the relative distance between features), face candidates are identified. A verification step often is added to reduce the occurrence of false positives. However, the difficulty with this method is that, if the rules are either too strict or too general (not strict enough), the algorithm will not perform well.

Yang and Huang [88] use a hierarchical knowledge-based method, consisting of three levels of rules. At the highest level, all possible face candidates are identified by applying a set of rules consisting of general descriptors. At the mid-level, a local histogram equalization is applied, followed by edge detection. Finally, at the lowest level, the surviving candidates are processed for facial features using another set of rules. Although the idea of multi-resolution has been reused frequently since, this technique does not yield high detection rates.
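The equalization step can be sketched as follows for a global (rather than local) histogram equalization of an 8-bit grayscale image, remapping intensities through the normalized cumulative histogram:

```python
import numpy as np

def equalize(image):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # scale to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)          # look-up table
    return lut[image]

rng = np.random.default_rng(4)
dark = rng.integers(0, 64, size=(32, 32), dtype=np.uint8)  # low contrast
bright = equalize(dark)   # intensities stretched over the full range
```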

Kotropoulos and Pitas [37] proposed a method similar to [88], in which a rule-based localization procedure is utilized. Their technique locates the facial boundary using the horizontal and vertical projections of the image intensity I(x, y), defined as:

HI(x) = ∑_{y=1}^{n} I(x, y)    and    VI(y) = ∑_{x=1}^{m} I(x, y)

for an m × n image. While HI determines the left and right side locations of the head, VI locates the mouth, lips, nose tip and eyes. Figure 2.5 displays an example of this procedure. This technique was tested on images of faces against uniform backgrounds from the European ACTS M2VTS database [55]. Experiments show a detection rate of 86.5%. However, this algorithm is not able to detect multiple faces in an image and fails to perform well when dealing with non-uniform backgrounds.

A primary advantage of the knowledge-based approach is that it is easy to come up with simple rules to describe the features of a face and their relationships. It is founded on coded rules: facial features in an input image are extracted first, and face candidates are identified. It works well for face localization in uncluttered backgrounds. Unfortunately, it is difficult to translate human knowledge into rules: detailed rules fail to detect faces, and general rules often identify many false positives. It also is extremely difficult to


extend this approach to detecting faces in different poses; frankly, it just is implausible to enumerate all the possible cases.

Figure 2.5 : Example of horizontal and vertical signatures used in [37].
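The projection profiles used by this method can be computed directly; the convention below (rows indexed by x, columns by y) is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
I = rng.uniform(0.0, 255.0, size=(48, 64))   # toy m x n intensity image

# HI(x) = sum over y of I(x, y): one value per row.
HI = I.sum(axis=1)
# VI(y) = sum over x of I(x, y): one value per column.
VI = I.sum(axis=0)

# Sharp transitions in HI and VI would then be thresholded to locate
# the head boundary and the eyes, nose tip, lips and mouth.
```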

2.2.2 Feature invariant approaches

Using the assumption that humans can detect faces and objects in different poses and lighting conditions effortlessly, researchers have been trying to identify features that are constant, or invariant, over a large range of imaging conditions. The idea behind the feature invariant approach lies in the extraction of specific facial features such as eyes, nose, etc. In contrast with the knowledge-based approach, these features then are used to build statistical models that describe their spatial relationships and, thereby, verify the existence of a face. However, one of the major disadvantages of this approach lies in its being easily corruptible by factors such as noise and occlusion, more so than any other approach.

Facial features

While Sirohey [75] developed a rudimentary method consisting of a boundary-fitting ellipse enclosing the head region using the edge image, Chetverikov [11] went a step further by using blobs and streaks instead of edges. His face model consisted of different blobs representing the eyes, nose and cheekbones. Triangular configurations that connect these blobs then were used to encode their spatial relationships. With the knowledge that certain regions of the face exhibit relatively high intensity, another way to locate these features is to use histogram peaks, and to perform adaptive image segmentation by computing an adaptive threshold based upon these peaks. This is the method proposed by Graf et al. [20], in which connected components then are identified to locate facial features, and subsequently the face. However, it is not clear from their explanation how morphological operations are applied to enhance regions.

Leung et al. [43] discuss a probabilistic approach to locating a face in a cluttered scene based upon local feature detectors and random graph matching. The problem is viewed as a search problem in which the goal is to identify an arrangement of facial features that is likely to be a face. Candidate facial features are obtained by matching a filter response at each pixel of a test image. Given the top two features, a search for the other features is performed using the expected locations from a statistical model of mutual distances. This system achieves an 86% detection rate. The same authors proposed a variation of this method in which, instead of mutual distances, a statistical theory of shape developed by Kendall [32] is used. In this case, the shape statistics are modeled by a joint probability density function over N feature points. Maximum-likelihood (ML) estimation then is applied to determine the face location. One advantage of this method is the possibility of detecting partially occluded faces.

Another feature-based method that uses a large amount of evidence from the visual image was proposed by Yow and Cipolla [94]. They first detect interest points in a multi-stage process using a filter that indicates possible facial features. Then, edge examination is performed in the neighborhood of these points by grouping and verifying edge length, strength, etc. Mahalanobis distances subsequently are used to validate facial features with respect to a certain threshold and a Bayesian network is used to verify the presence of a face. One interesting feature is that this method can detect faces in different orientations and poses. This method subsequently was enhanced by employing active contours.
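The Mahalanobis validation step can be sketched as follows, with a toy two-dimensional distribution standing in for the learned facial-feature statistics; only the distance computation is shown, the acceptance threshold being application-dependent:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Distance of a feature vector from a learned feature
    distribution; candidates beyond a threshold are rejected."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(6)
samples = rng.normal(size=(200, 2)) * [2.0, 0.5]   # training features
mean = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)

d_typical = mahalanobis(np.array([0.5, 0.1]), mean, cov)    # plausible
d_outlier = mahalanobis(np.array([10.0, 10.0]), mean, cov)  # rejected
```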


Skin color

Extensive research on skin color has proven that this method is a viable way of detecting faces, showing also that the difference in people's skin color lies mostly in intensity, rather than in chrominance [20][89]. A simple method is to define a region of skin tone pixels from skin samples and to classify skin tones depending on the range they fall into. While Crowley and Coutaz [14] used a histogram of normalized Red-Green-Blue (RGB) values to classify pixels according to a threshold τ, other non-parametric methods have used the histogram intersection in the Hue-Saturation-Value (HSV) color space [71][78] between skin patches and the input image, to identify facial regions.
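The range-based skin classification described above can be sketched as follows; the chromaticity bounds below are illustrative assumptions, not the thresholds used by Crowley and Coutaz.

```python
import numpy as np

def classify_skin(image, r_range=(0.36, 0.465), g_range=(0.28, 0.363)):
    """Label each pixel as skin (True) or non-skin (False) by testing its
    normalized-RGB chromaticity against fixed bounds. The bounds are
    illustrative; a real system would derive them from skin samples."""
    image = image.astype(np.float64)
    total = image.sum(axis=2) + 1e-8        # avoid division by zero on black pixels
    r = image[:, :, 0] / total              # normalized red
    g = image[:, :, 1] / total              # normalized green
    return ((r >= r_range[0]) & (r <= r_range[1]) &
            (g >= g_range[0]) & (g <= g_range[1]))

# A 1x2 "image": one skin-like pixel and one blue pixel.
pixels = np.array([[[180, 120, 90], [10, 20, 200]]], dtype=np.uint8)
mask = classify_skin(pixels)
```

Because only chromaticity is tested, the classifier is largely insensitive to the intensity differences noted above.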

However, color appearance often is unstable due to changes in both background and foreground lighting. Several methods have addressed this problem, such as that proposed by McKenna et al. [49], in which, instead of relying on color constancy, stochastic models are exploited to accommodate view and lighting changes. In reference [7], a specularity detection and removal technique is proposed, whereby a wave-front is generated outwards, starting from the peak of the specularity onto its boundary. Upon attaining the specularity boundary, the wave-front contracts inwards while coloring in the specularity, until the latter no longer exists. Still, the skin tone approach cannot be applied on its own for multiple face detection, as it yields noisy detection results because of exposed body parts and skin-tone-like background regions.

Multiple features

Global features (size, shape and skin color) and local features (the eye, nose and hair features described above) recently have been combined in numerous methods to locate or detect faces. A typical approach first detects skin-like regions (as described above), and then groups the skin pixels using clustering algorithms. If the shape of a resulting region is elliptic, it is considered to be a face candidate. Local features then are used for verification.


Sobottka and Pitas [76] also used shape and color, by first performing color segmentation in the HSV color space to extract skin regions. Connected components are determined and, if a best-fit ellipse fits well, the region is selected as a facial candidate. After determining skin/non-skin regions using class-conditional density functions, an elliptical face template is used to establish the similarity of skin regions based upon a Hausdorff distance. Inherent symmetry properties are used to locate the eye centers, and to deduce the position of the nose tip and mouth on the face. One drawback is that this method is effective only with frontal views. Other reported methods have used structure, color and geometry instead of pixel information [91], while some have looked at range and color for detection [33].

Feature-based methods can be invariant to pose and orientation change. However, it is often difficult to locate facial features because these methods cope poorly with external factors such as illumination, noise and occlusion. It also is difficult to detect features in complex backgrounds.

2.2.3 Template matching methods

Given an input image, a template matching method will correlate it with a predefined or parameterized standard face pattern, either independently for the eyes, nose and mouth, or with the entire facial image. Based on the correlation values, the existence of a face subsequently is determined.

Predefined templates

Sakai et al. [66] first reported detecting frontal faces using several sub-templates for the eyes, nose and mouth. Each of these sub-templates is defined as line segments, and lines extracted from the input image then are matched against the sub-templates. The first phase determines the focus of attention, while the second step examines details in order to determine the existence of a face.


Govindaraju et al. [19] also presented a two-stage face detection method, whereby face hypotheses are generated and tested. Face models are built using edges in order to define the features, which describe the curves of the face (left side, right side, hair-line and frontal face). The Marr-Hildreth edge operator is used to obtain an edge map of the input image. Pairs of fragmented contours are linked based upon their proximity; contours are segmented into feature curves based upon corner detections; and the curves then are labeled based upon geometric properties. The ratio of feature pairs forming an edge is compared to a golden ratio, and a cost is assigned accordingly; if this value is low, the group of feature curves becomes a hypothesis. Their system reportedly has a detection rate of 70%, as tested on a set of 50 photographs. However, this system only detects frontal faces that are un-occluded.

In order to counter illumination variations in template matching techniques, Sinha proposed a method that uses small sets of spatial image invariants to describe the space of face patterns [74]. His insight is that, while variations in illumination change the individual brightness of different parts of faces, the relative brightness of these parts remains unchanged. Pairwise ratios of the brightness of a few regions and the "directions" of these ratios (one region brighter than the other) are computed. These regions correspond to key facial features such as the eyes, cheeks and forehead. Hence, this provides a robust invariant, and a face can be located if an image region satisfies all the pairwise relationships between the different sub-regions. Figure 2.6 shows 16 sub-regions with their 23 relationships (defined by the arrows). The head of each arrow denotes the denominator of the brightness ratio, and a relationship is satisfied if the ratio exceeds a certain threshold.


Figure 2.6 : A 14x16 pixel ratio template, composed of 16 regions and 23 relationships, is used for face detection as described in [74].
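The spirit of this pairwise-ratio test can be sketched as below; the region names, the four relationships and the ratio thresholds are invented for illustration and are not Sinha's actual 23 relationships.

```python
# Hypothetical pairwise brightness relationships: each entry states that
# region A should be brighter than region B by at least `ratio`.
RELATIONS = [
    ("forehead", "left_eye", 1.2),
    ("forehead", "right_eye", 1.2),
    ("left_cheek", "mouth", 1.1),
    ("right_cheek", "mouth", 1.1),
]

def satisfies_template(region_means, relations=RELATIONS):
    """Return True only if every pairwise brightness ratio holds, i.e. the
    window satisfies all the directed relationships of the template."""
    for brighter, darker, ratio in relations:
        if region_means[brighter] < ratio * region_means[darker]:
            return False
    return True

face_like = {"forehead": 200, "left_eye": 80, "right_eye": 85,
             "left_cheek": 170, "right_cheek": 165, "mouth": 90}
flat_patch = {name: 100 for name in face_like}
```

Since only ratios are tested, uniformly scaling the brightness of face_like leaves the decision unchanged, which is precisely the illumination invariance Sinha exploits.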

Deformable templates

Deformable templates, which are described by parameterized models, have been used by Yuille et al. [95] to fit an a-priori elastic model to facial features. Energy functions then are applied to link edges and peaks from the input image with corresponding parameters in the template. Even though it performs well in tracking objects, this deformable template method cannot work unless it is initialized near the region of interest.

Lanitis et al. [41] then put forward a method that combines both shape and intensity information for representing faces. To do this, a vector of sample points is obtained from training images of eye, nose and chin contours. These points then are used in a Point Distribution Model (PDM), along with an approach similar to that of Kirby and Sirovich [34], in order to obtain a shape-normalized intensity appearance. Consequently, a face-shape PDM is available for locating faces in new images, using an Active Shape Model (ASM) search to estimate the face location and shape parameters. Face patches then are deformed to the average shape, and intensity parameters can be extracted. Shape and intensity parameters later can be used together for classification purposes. Results showed that, after training, 35 faces were located out of 40 test images (88%). Since then, ASMs have been extended with Kalman filters to estimate shape-free intensity parameters, as well as to track faces in image sequences.


Even though this general approach is quite simple, templates still need to be initialized near the face images, and the approach has proven inadequate for face detection, because it copes poorly with variations in scale, pose and shape. However, other methods based upon multi-resolution, multi-scale and deformable templates [95] have been proposed to counter this problem. Unfortunately, as with the knowledge-based approach, it is tedious to enumerate templates for all possible head poses.

2.2.4 Appearance-based methods

With appearance-based methods, the "templates" that are used for face detection are learned from images, rather than predefined by experts, as in template matching methods. They rely on techniques from statistical analysis and machine learning to discover the characteristics of face and non-face images. Dimensionality reduction also is an important aspect of these methods. Many of them can be viewed in a probabilistic framework, using the Bayesian classification rule to classify a candidate image according to the density functions p(x|face) and p(x|non-face), where x is a random variable corresponding to the feature vector derived from the image. However, due to the high dimensionality of x, and because of the multimodal behavior of both these density functions, classification is not so straightforward. Discriminant functions between the face and non-face classes (i.e., decision surfaces) also have been used for appearance-based methods, traditionally by projecting image patterns onto a lower-dimensional space, or by using multi-layer neural networks to form non-linear decision surfaces. Recently, SVMs and other kernel methods have been proposed to implicitly project patterns onto a higher-dimensional space in order to separate these classes.
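As a minimal illustration of this Bayesian framing, the sketch below classifies a single scalar feature with two hand-set Gaussian class-conditional densities and equal priors; real systems face the high-dimensional, multimodal densities discussed above.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def classify(x, face_params=(0.7, 0.1), nonface_params=(0.3, 0.2)):
    """Bayes decision rule with equal priors: label 'face' whenever
    p(x|face) > p(x|non-face). The (mean, std) pairs are toy values,
    not densities learned from images."""
    if gaussian_pdf(x, *face_params) > gaussian_pdf(x, *nonface_params):
        return "face"
    return "non-face"
```

With unequal priors, each density would simply be weighted by its prior before the comparison.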

Eigenfaces

Kirby and Sirovich [34] first demonstrated that images of faces can be encoded linearly using a modest number of basis images. This demonstration is based upon the Karhunen-Loeve transform, also known as Principal Component Analysis [30]. Given a set of n by m pixel training images, each represented as a vector of size m × n, basis vectors spanning an optimal subspace are determined, such that the mean square error between the projection of the training images onto the subspace and the original images is minimized. This set of optimal basis vectors is denoted as Eigenfaces, since these are simply the eigenvectors of the covariance matrix computed from the vectorized face images in the training set. Similar to reference [34], Turk and Pentland applied Principal Component Analysis to face recognition and detection, using a training set of face images to generate Eigenfaces that span a subspace (called the face space) of the image space [80]. To detect the presence of a face in a scene, the distance between an image region and the face space is computed for all locations in the image. However, this method often cannot detect faces that exhibit variations in head rotation.

Distribution-based methods

Sung and Poggio developed a distribution-based system for face detection [77], which demonstrated how the distribution of image patterns from one object class can be learned from positive and negative examples (i.e., images) of that class. Their system consists of two components: a distribution-based model for face/non-face patterns and a multilayer perceptron classifier. The patterns are grouped into six face and six non-face clusters using a modified k-means algorithm, as shown in Figure 2.7.

Figure 2.7 : The distance measures used in Sung and Poggio's method presented in [77]. (a) Computation of the distance between a test pattern and the clusters. (b) Each distance measure is a two-value metric: D1 is a Mahalanobis distance, while D2 is the Euclidean distance.


Each of the six clusters is represented as a Gaussian function, but set at a different scale, hence possessing a different mean image and covariance for each cluster. Distance metrics are used to distinguish face window patterns from non-face patterns, using twelve pairs of distances to each face and non-face cluster. Because it is difficult to generate a representative sample of non-face patterns, the problem was alleviated by means of a bootstrap method that selectively adds images to the training set as training progresses. The bootstrap process consists of incrementally adding, into the training set, non-face patterns that the current system wrongly detects as faces. This has significantly improved the performance of the system.
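The bootstrap idea can be sketched with a toy nearest-mean classifier standing in for the full distribution-based model; the synthetic data, the classifier and the number of rounds are all assumptions made for illustration.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Toy stand-in for the real classifier: one mean vector per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def nearest_mean_predict(model, X):
    """Assign each row of X to the class with the closest mean."""
    labels = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in labels])
    return np.array(labels)[dists.argmin(axis=0)]

def bootstrap_train(faces, nonface_pool, rounds=3):
    """Bootstrap sketch: start with a small non-face set, then repeatedly
    add the pool images that the current model wrongly calls faces."""
    nonfaces = nonface_pool[:5]
    for _ in range(rounds):
        X = np.vstack([faces, nonfaces])
        y = np.array([1] * len(faces) + [0] * len(nonfaces))
        model = nearest_mean_fit(X, y)
        preds = nearest_mean_predict(model, nonface_pool)
        false_pos = nonface_pool[preds == 1]   # non-faces detected as faces
        if len(false_pos) == 0:
            break
        nonfaces = np.vstack([nonfaces, false_pos])
    return model

rng = np.random.default_rng(1)
faces = rng.normal(loc=1.0, size=(30, 16))     # synthetic "face" vectors
pool = rng.normal(loc=0.0, size=(100, 16))     # large pool of non-faces
model = bootstrap_train(faces, pool)
```

Each round concentrates the non-face model on exactly the patterns the current classifier confuses with faces, which is the point of the bootstrap.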

Other probabilistic visual learning methods followed. Moghaddam and Pentland [53] developed a technique rooted in density estimation in a high-dimensional space, using Eigenspace decomposition. Principal Component Analysis (PCA) was used to define the subspace which best represents a set of face patterns. However, PCA does not define a proper density model for the data, since the cost of coding a data point is the same anywhere along a principal component. Furthermore, it is not immune to independent noise in the features of the data. Two methods, one based on factor analysis (FA) and another based on Fisher's Linear Discriminant (FLD), were proposed in reference [92] in order to project samples from the high-dimensional image space onto a lower-dimensional feature space.

Neural Networks

Neural networks have been applied successfully to many pattern recognition problems, such as optical character recognition, object recognition, and autonomous robot driving. Since face detection can be treated as a two-class pattern recognition problem, various neural network architectures have been proposed. The advantage of using neural networks for face detection is the feasibility of training a system to capture the complex class-conditional density of face patterns.

There have been many attempts to apply NN (Neural Networks) to face detection, but the most significant work in this field was done by Rowley et al. [64][65], who demonstrated significant improvements in many aspects of NN-based detection when compared to other methods. A multilayer neural network is used to learn the face and non-face patterns from face/non-face images (i.e., the intensities and spatial relationships of pixels). They also used multiple neural networks and several arbitration methods to improve the performance of the system. There are two major components: multiple neural networks (to detect face patterns) and a decision-making module (to render the final decision from multiple detection results). Figure 2.8 presents a diagram summarizing Rowley's method in which, given a test pattern, the output of the trained neural network indicates evidence for a non-face (close to -1) or a face pattern (close to +1). Nearly 1000 face samples of various sizes, orientations, positions and intensities were used to train the network. One limitation of the methods of Rowley [65] and Sung [77] is that they can only detect upright, frontal faces. Furthermore, one major drawback is that the network architecture must be tuned extensively (number of layers, number of nodes, learning rates, etc.) to achieve exceptional performance.

Figure 2.8 : System diagram of the Rowley-Kanade neural network method [65].
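A toy forward pass of such a network is sketched below; the window size, the architecture and the randomly drawn weights are illustrative stand-ins for Rowley's trained detector, whose output also lies in [-1, +1].

```python
import numpy as np

def mlp_forward(window, W1, b1, W2, b2):
    """Forward pass of a toy two-layer network: the input is a vectorized
    grayscale window, and the single tanh output is pushed toward +1 for
    faces and -1 for non-faces. Weights here are hand-set, not trained."""
    hidden = np.tanh(W1 @ window + b1)        # hidden-layer activations
    return float(np.tanh(W2 @ hidden + b2))   # scalar score in (-1, +1)

rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.1, size=(4, 16))      # 16-pixel window, 4 hidden units
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=4)
b2 = 0.0

score = mlp_forward(rng.normal(size=16), W1, b1, W2, b2)
label = "face" if score > 0 else "non-face"
```

In the full system this pass is repeated over every window position and scale, with arbitration across several such networks.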

Support Vector Machines

Support Vector Machines can be considered a new paradigm for training polynomial functions, neural networks and Radial Basis Function (RBF) classifiers. While most methods for training a classifier (e.g., Bayesian, neural networks, and RBF) are founded on minimizing the training error (i.e., the empirical risk), SVMs operate on another principle, called structural risk minimization, which aims to minimize an upper bound on the expected generalization error. SVMs initially were applied to face detection by Osuna et al. [54], who developed an efficient method to train an SVM for large-scale problems and then applied it to face detection.

Since then, a good number of face detection methods based upon Support Vector Machines have been proposed. In reference [1], a subspace approach was presented, in which a linear SVM classifier is trained as a filter to produce a subspace, which then is used by a non-linear SVM classifier with an RBF kernel for face detection. This procedure also was used in reference [47], where a method rooted in hierarchical support vector machines was chosen for face detection. The authors of reference [57] experimented with a similar method, whereby an SVM was trained in the Eigenspace instead of in a linear space. Even though experimental results demonstrated promising performance, these approaches were unable to detect faces in various poses. The head pose problem remained unresolved.
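A two-stage decision of this kind can be sketched as follows: a cheap linear score filters windows before the more expensive RBF-kernel evaluation. All weights, support vectors and thresholds below are hand-set toy values, not trained SVMs.

```python
import numpy as np

def two_stage_detect(x, w, b, sv, alphas, sv_labels, gamma=0.5, t1=0.0, t2=0.0):
    """Stage 1: linear SVM score w.x + b; windows below t1 are dropped.
    Stage 2: non-linear SVM decision with an RBF kernel over the support
    vectors, applied only to windows that survive stage 1."""
    if x @ w + b < t1:
        return False                                   # rejected by the linear filter
    k = np.exp(-gamma * ((sv - x) ** 2).sum(axis=1))   # RBF kernel values
    return float((alphas * sv_labels) @ k) > t2        # kernel SVM decision

w = np.array([1.0, 1.0])                # toy linear-stage weights
b = -1.0
sv = np.array([[1.0, 1.0], [-1.0, -1.0]])   # toy support vectors
alphas = np.array([1.0, 1.0])
sv_labels = np.array([1.0, -1.0])
```

Because most windows in an image are rejected by the inexpensive linear stage, the costly kernel evaluation runs on only a small fraction of candidates.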

Striving to solve the head pose problem, Li et al. developed an SVM-based multi-view face detection and pose estimation model in reference [44], and improved their system in reference [45] by using Support Vector Regression to first solve the problem of pose estimation. They subsequently used a combined Eigenface and Support Vector Machine (SVM) method, which improves the overall performance in terms of speed and accuracy. A similar method was proposed by the authors of reference [86], whereby a combination of SVMs, using both cascading and bagging methods, was developed to detect faces in seven different views, under complex backgrounds. However, as stated before, these methods perform poorly when other factors such as illumination or occlusion come into play.

Hidden Markov Model (HMM)

The underlying assumptions of the Hidden Markov Model (HMM) are 1) that patterns can be characterized as a parametric random process, and 2) that the parameters of this process can be estimated in a precise, well-defined manner. In developing an HMM to solve a pattern recognition problem, the number of hidden states for the model must be defined. Then, one can train an HMM to learn the transitional probabilities between states from the examples, with each example represented as a sequence of observations. The output probability of an observation then can identify the class to which it belongs.

During training and testing, an image is scanned in a given order (usually from top to bottom) and an observation is taken as a block of pixels. For face patterns, the boundaries between strips of pixels are represented by probabilistic transitions between states, as shown in Figure 2.9, and the image data within a region are modeled by means of a multivariate Gaussian distribution. If the face likelihood obtained for each rectangular pattern in the image is above a threshold, a face is detected.

Figure 2.9 : Hidden Markov model for face detection with five hidden states [68].

Samaria and Young applied 1D and pseudo-2D HMMs to facial feature extraction and face recognition [68]. Since significant facial regions such as the hair, forehead, eyes, nose and mouth occur in a natural order from top to bottom, each of these regions is assigned to a state in a one-dimensional continuous HMM. Figure 2.9 shows these five hidden states.

Rajagopalan et al. proposed two probabilistic methods for face detection [58] centered around HMMs. The first method uses Higher Order Statistics (HOS) for density estimation. Unknown distributions of faces and non-faces are clustered using six density functions rooted in higher order statistics of the patterns. The second method in reference [58] uses an HMM to learn the face-to-non-face and non-face-to-face transitions in an image. The optimal state sequence then is processed for binary classification. Experimental results show that both the HOS and HMM methods have a higher detection rate (97.1% and 94.2%, respectively) than the Rowley-Kanade method [65] (87.5%), but with more false detections (10 and 17 for HOS and HMM, respectively, versus 1 for Rowley).
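The face-likelihood computation can be sketched with the scaled forward algorithm for a five-state, left-to-right Gaussian HMM as in Figure 2.9; every numeric parameter below is a toy value rather than a model trained with Baum-Welch.

```python
import numpy as np

def forward_loglik(obs, means, stds, trans, init):
    """Log-likelihood of a 1-D observation sequence (e.g., mean intensity
    of successive pixel strips) under a Gaussian HMM, computed with the
    scaled forward algorithm."""
    def emissions(x):
        z = (x - means) / stds
        return np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2 * np.pi))
    alpha = init * emissions(obs[0])
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for x in obs[1:]:
        alpha = (alpha @ trans) * emissions(x)   # propagate, then emit
        c = alpha.sum()                          # scaling constant
        loglik += np.log(c)
        alpha /= c
    return float(loglik)

means = np.array([200.0, 150.0, 80.0, 150.0, 100.0])   # toy strip intensities
stds = np.full(5, 20.0)
trans = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],           # left-to-right chain
                  [0.0, 0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.0, 0.0, 1.0]])
init = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

top_down = [200, 150, 80, 150, 100]     # strips in the natural face order
scrambled = [100, 150, 80, 150, 200]    # same values, wrong order
```

Thresholding the resulting log-likelihood is what turns the model into a detector: sequences scanned in the natural top-to-bottom face order score higher than scrambled ones.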


Adaboost

The Adaboost method initially was presented by Freund and Schapire [16], whereby a set of weak classifiers (obtained from a series of observed distributions) is used for learning. Viola and Jones [84] presented one of the first face detection methods that used this statistical approach. The principal concept behind this technique is to select important features, using a focus-of-attention method, and then to rely on integral images for fast feature evaluation. Fundamental to the whole approach are Haar-like features (so called because they are computed similarly to the coefficients of Haar wavelets), whereupon each feature is described by its template (the shape of the feature). Very simple decision tree classifiers that usually have just two terminal nodes are built to yield a face/non-face response. Not every classifier is able to detect a face; some, called weak classifiers, merely react to some simple feature in the image that may relate to the face. A complex and robust classifier then is constructed out of multiple weak classifiers to filter out the regions that most likely do not contain faces, by means of the Adaboost method. Viola [84] suggests building several boosted classifiers, Fk, with constantly increasing complexity, and then chaining them into a cascade with the simpler classifiers going first. During the detection stage, each current search window is analyzed by each of the Fk classifiers, which may reject it or let it go through, as depicted in Figure 2.10.

Figure 2.10 : Face detection cascade of classifiers shown in [84], whereby rejection can happen at any stage.

In experiments, about 70-80% of the candidates are rejected during the first two stages that use the simplest features (about 10 weak classifiers each), so this technique speeds up the detection process. Furthermore, each of the stages need not offer optimal results; in fact, the stages are trained to achieve high detection rates, rather than low false positive rates. By choosing the desired hit rate and false positive rate at every stage, and by choosing the number of stages accurately, it is possible to achieve very good detection performance. By stacking 20 stages into a cascade, a detection rate of 98% and a false-detection rate of 0.0001% were achieved.
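The cascade's early-rejection control flow can be sketched as follows; the two Haar-like contrast features and their thresholds are invented for a 4x4 toy window and bear no relation to the features Adaboost would actually select.

```python
def cascade_detect(window, stages):
    """Attentional cascade sketch: each stage is a (score_fn, threshold)
    pair; a window is rejected as soon as any stage scores below its
    threshold, so most non-faces exit after the cheap early stages."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False            # rejected: stop spending time on it
    return True                     # survived every stage: report a face

# Hypothetical Haar-like contrasts on a 4x4 grayscale window
# (row 0: forehead, row 1: eyes, row 2: cheeks, row 3: mouth).
def eye_band_contrast(w):
    return sum(w[2]) - sum(w[1])            # cheeks brighter than eye band

def bridge_contrast(w):
    return w[1][1] + w[1][2] - w[1][0] - w[1][3]   # bright nose bridge

stages = [(eye_band_contrast, 50), (bridge_contrast, 20)]

face_window = [
    [200, 200, 200, 200],   # forehead
    [60, 150, 150, 60],     # dark eyes, bright bridge
    [180, 180, 180, 180],   # cheeks
    [100, 100, 100, 100],   # mouth
]
flat_window = [[100] * 4 for _ in range(4)]
```

Ordering the cheapest, most permissive tests first is what gives the cascade its speed: the expensive later stages only ever see the few windows that resemble faces.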

Lienhart [46] subsequently improved this method with two extensions: a novel set of rotated Haar-features was added, as well as an enhanced boosting algorithm, which diminished the number of false detections by 10%. A review of the face detection literature shows that this rapid detection technique, initially proposed by Viola and Jones and improved by Lienhart, is the approach that offers the best results with the fewest constraints on the data, because of its high accuracy and great computational speed. Table 2.2 shows the performance of this method, along with results from various other face detection methods.

Method                                Detection Rate    False Detections
Linear SVM+G+SVM [1]                  82.3%             322
Linear SVM+G-SVM1+G-SVM2 [1]          75.4%             68
Hierarchical SVM [47]                 82.5%             301
Hybrid Cascade-Wavelet System [83]    82.4%             319
Rowley-Kanade System 5 [65]           90.5%             570
Rowley-Kanade System 11 [65]          90.2%             167
Viola-Jones Rapid System [84]         90.0%             52

Table 2.2 : Comparison of detection rates for various face detection techniques, tested on the MIT-CMU Database [65].

2.2.5 Summary

In this chapter, we presented a literature review of the four general methods currently used to detect faces in images. The advantage of the knowledge-based approach is that it is easy to generate simple rules to describe the features of a face and their relationships. Unfortunately, it is difficult to translate human knowledge into precise rules: detailed rules fail to detect faces, and general rules often produce many false positives. It also is extremely difficult to extend this approach to detecting faces in different poses, and implausible to enumerate all possible cases. Feature-based methods can be invariant to pose and orientation change. However, it may be difficult to locate all facial features because these methods cope poorly with external factors such as illumination, noise or occlusion. It also is difficult to detect features in complex backgrounds. The template approach is quite simple and straightforward, but templates need to be initialized near the face images. As with the knowledge-based approach, it is difficult to enumerate templates for different poses. It also has proven inadequate for face detection, because of its inefficiency in dealing with scale, pose and shape. Finally, appearance-based methods use powerful machine learning algorithms. They have demonstrated accurate, fast and fairly robust empirical results, and can be extended to detect faces in different pose orientations. On the other hand, they usually need to search over space and scale, and require many positive and negative examples.

The current state of the field shows that, in general, appearance-based methods centered on statistical learning techniques constitute the approach that demonstrates the best results under the least number of constraints. More specifically, the Viola-Jones technique, a fast and robust method that can be used in real-time applications, stands out as the gold standard in the field of face detection. However, its training stage is very time-consuming, sometimes requiring days. Furthermore, it requires a very large number of training examples and does not attempt to train or detect faces under a wide variety of conditions. Consequently, the research presented in this thesis proposes an appearance-based face detection method that can achieve high detection rates, under a wider spectrum of facial conditions, while using a lower number of representative training facial images.



Chapter 3
Locally Linear Embedding

Traditional methods in data projection have focused upon the problem of dimensionality reduction. Methods are based either on Multi-Dimensional Scaling (MDS) [13], which computes embeddings that attempt to preserve pairwise distances between data points, or on linear projection, such as the well-known Principal Component Analysis (PCA) [34]. They also include methods for non-linear projection, such as Sammon's non-linear mapping [69] and Self-Organizing Maps (SOM) [35]. Research on data projection is being revitalized by new non-linear mapping methods, such as ISOMAP [79], Curvilinear Component Analysis (CCA) [15] and Curvilinear Distance Analysis (CDA) [42]. ISOMAP applies classical MDS with a geodesic distance measure.

The Locally Linear Embedding (LLE) algorithm recently has emerged as a promising technique for non-linear dimensionality reduction of high-dimensional data. Unlike classical linear dimensionality reduction methods, LLE provides information that can reveal the intrinsic manifold structure of the data. In this chapter, we describe the algorithm proposed by Roweis and Saul [63], which assumes that each data point and its neighbors lie on a locally-linear patch of the manifold, and then maps these patches into a low-dimensional space to generate the data configuration. LLE recovers global non-linear structure from locally-linear fits. In contrast, CCA tries to preserve short distances by ignoring distances over a given threshold; and CDA uses geodesic distances in CCA, similar to how ISOMAP extends classical MDS. The common objective of these non-linear mapping methods is to project data by unfolding the data distribution in a non-linear way. However, each of these methods requires a parameter that selects the number of neighbors.

We begin this chapter with a review of applications which use or are founded on LLE, followed by image processing techniques which are applied as a pre-processing step for the LLE algorithm. Then, a detailed explanation of the LLE algorithm will be presented, followed by a discussion of the different challenges that have to be addressed when using this method for an application such as face detection.

3.1 Literature Review on Locally Linear Embedding (LLE)

Roweis and Saul [63] presented the LLE algorithm as an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Their motivation for such a method was the need to analyze large amounts of multivariate data, which raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. They present LLE as a method that maps its input into a single global coordinate system of lower dimensionality. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of non-linear manifolds. The authors present an overview of the algorithm, as well as the many properties of this method, including the few parameters that must be tuned (the number of neighbors and the intrinsic dimension). In addition, numerous comparisons are made with Tenenbaum's ISOMAP method [79]. LLE is tested on facial images (Figure 3.1), which are embedded in a low-dimensional map that demonstrates the algorithm's neighborhood-preserving properties.


Figure 3.1 : Images of faces mapped into the embedding space [61] described by the first two coordinates of LLE, using the 12 nearest neighbours. The bottom images correspond to points along the red path, illustrating one particular mode of variability in expression.

In reference [70], the same authors offered a much more extensive description of LLE, addressing the problem of estimating parameters such as the intrinsic dimensionality and the number of neighbors. Implementation of the algorithm is detailed and several extensions that enhance its performance are discussed. The algorithm then is applied to manifolds of known structure, as well as real data sets of facial images, digits and lips. Extensive illustrations of the algorithm's performance are provided.
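The three steps of the algorithm (neighbor search, reconstruction weights, bottom eigenvectors) can be condensed into the sketch below; the regularization constant and the toy data set are assumptions, and production code would use sparse eigensolvers rather than a dense eigendecomposition.

```python
import numpy as np

def lle(X, n_neighbors=8, d=2, reg=1e-3):
    """Minimal LLE sketch following Roweis and Saul's three steps:
    (1) find each point's nearest neighbors, (2) solve for the weights
    that best reconstruct each point from its neighbors, (3) take the
    bottom eigenvectors of M = (I-W)^T (I-W) as the embedding."""
    n = len(X)
    # Step 1: k nearest neighbors by Euclidean distance (column 0 is self).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    neighbors = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]
    # Step 2: reconstruction weights from regularized local Gram systems.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                  # center neighbors on x_i
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(n_neighbors)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()            # weights sum to one
    # Step 3: embedding = eigenvectors 2..d+1 of M (skip the constant one).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]

# Toy example: points on a noisy circle embedded in 5-D.
rng = np.random.default_rng(2)
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t), 0 * t, 0 * t, 0 * t])
X += 0.01 * rng.normal(size=X.shape)
Y = lle(X, n_neighbors=6, d=2)
```

The regularization term keeps the local Gram matrix invertible when the number of neighbors exceeds the intrinsic dimension, a point discussed further in [70].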

Classification of handwritten digits was one of the first LLE applications and, with this, many extensions and variants were proposed in [38][39][60][61]. In reference [61], a supervised variation on LLE is proposed. This mapping, when combined with simple classifiers such as the nearest mean classifier, is discovered to yield remarkably good classification results. Their method, applied to the MNIST dataset [52], demonstrated classification rates as high as 97%. Furthermore, a number of algorithmic improvements that simplify the application of both traditional and supervised LLE are described, which function by eliminating the need to set some of the parameters.


While the selection of the intrinsic dimensionality (the lower dimension d) was discussed extensively in [63][70][61], the selection of the number of neighbors is analyzed in detail in reference [38]. A hierarchical method to automatically select the optimal number of neighbors is proposed. The approach is verified experimentally on two large data sets of real-world images, and applied to the visualization of multidimensional data. Computational complexity is used as an estimator of its efficiency.

In reference [39], another supervised variant of LLE is described, called the Supervised Locally Linear Embedding (SLLE) algorithm. Unlike LLE, SLLE projects high-dimensional data onto the embedded space using class membership relations. This generates well-separated clusters in the embedded space (if the dimensionality of the embedded space is one less than the number of classes); moreover, each cluster is represented by only one point in the embedded space. To test SLLE, it was coupled with Support Vector Machines (SVM) [81], since SVMs have yielded excellent results in many tasks. SVM performs classification by mapping data onto a high-dimensional space where classes are linearly separable by hyperplanes. The combination of SLLE and SVM was applied to the task of recognizing handwritten digits from the MNIST database [52].
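One common way to realize the supervision in SLLE is to inflate inter-class distances before the neighbor search, so that neighbors come predominantly from the same class; the additive form and the value of alpha below are one illustrative variant, not necessarily the exact formulation of [39].

```python
import numpy as np

def supervised_distances(X, y, alpha=0.5):
    """SLLE-style distance modification sketch: start from Euclidean
    distances and add a penalty between points of *different* classes,
    so same-class points become neighbors and the classes separate in
    the subsequent LLE embedding."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    different = (y[:, None] != y[None, :]).astype(float)   # 1 across classes
    return d + alpha * d.max() * different

# Toy set where raw geometry alone does not separate the classes.
X = np.array([[0.0], [1.0], [0.0], [1.0]])
y = np.array([0, 0, 1, 1])
D = supervised_distances(X, y, alpha=0.5)
```

With alpha large enough that every cross-class distance exceeds every within-class distance, each cluster collapses as described above, and a simple classifier (or an SVM, as in [39]) suffices in the embedded space.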

Face manifolds, face recognition and multi-pose analysis also were explored using LLE. The Machine Vision group at the University of Oulu in Finland investigated face pose analysis using unsupervised Locally Linear Embedding [23]. The performance of the LLE algorithm was assessed on real-world data, for example on faces with variable poses. The results were compared with two other approaches, PCA and SOM, and extensions to the LLE algorithm were proposed. The LLE estimator correctly recognized the face pose of their test set up to 90% of the time. Two more applications using LLE include: face recognition with weighted locally linear embedding [50], which uses LLE to model the local shape of the manifold using neighboring nodes of the graph; and using LLE to select representative face samples (models) from raw videos and build an appearance-based face recognition system [24]. Both of these methods consistently offer better results than PCA, SOM or ISOMAP.


LLE can elicit the underlying manifold of observational data in an embedded space, but it lacks an explicit mapping between input data and output data. In [85], a non-linear method is proposed to obtain a definite mapping for LLE, which also achieves an analytical representation of the manifold of high-dimensional data. By learning the common information from high-density data prepared in advance, methods based on statistical learning theory are applied to establish the mapping. Experiments on the synthesis of multi-pose face images were conducted, and they demonstrated this approach's efficiency in mapping new data. Figure 3.2 summarizes this method.

Figure 3.2 : Obtaining new data from a face image sequence which was dimensionality reduced by LLE [85].

3.2 Image Processing Techniques

3.2.1 Bi-cubic image scaling

A first pre-processing step, necessary before running the LLE algorithm, consists of resizing the input images in order to bring them all to the same initial dimension. A wide spectrum of methods can be used to enlarge or reduce an image; these generally trade off speed against the degree to which they suppress visual artefacts. The simplest method to enlarge an image by a factor of 2, say, is to


replicate each pixel 4 times. Of course, this leads to more pronounced jagged edges than in the original image. A standard approach, called bi-cubic interpolation, estimates the intensity of a pixel in the destination image from an average of the 16 pixels surrounding the closest corresponding pixel in the source image. Two methods are in common use for interpolating the 4x4 pixel region: the cubic B-spline and a cubic interpolation function. The B-spline approach is used in the current studies because of its computational efficiency.

3.2.2 Illumination normalization

Illumination changes caused by light sources at arbitrary positions and intensities contribute a significant amount of variability to acquired images. In order to cope with these different lighting conditions, a pre-processing step can be applied to the face images to make them independent of illumination conditions. In this research, we use two possible pre-processing steps to normalize illumination before applying the LLE algorithm to a dataset of face images. The first is similar to that proposed in reference [77], where histogram equalization was chosen to perform this task, as it is a straightforward, efficient and typical method for gray-level transformation. However, this method has the side effect of creating face-like patterns in relatively smooth areas, such as background regions. To counter this problem, another possibility is to compute the Fast Fourier Transform (FFT) of the image [36] and take the phase image, which is an illumination- and contrast-invariant measure of feature significance in the image. Figure 3.3 presents an illustration of both methods, and Chapter 6 will present a thorough comparison of the two.
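The two illumination-normalization options can be sketched as follows (a minimal NumPy sketch for illustration only, not the implementation used in the experiments):

```python
import numpy as np

def hist_equalize(img):
    """Histogram equalization of an 8-bit gray-scale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    cdf_min = cdf[cdf > 0].min()                  # first non-empty bin
    lut = (cdf - cdf_min) * 255.0 / (cdf[-1] - cdf_min)
    return np.clip(lut, 0, 255).astype(np.uint8)[img]

def fft_phase(img):
    """Phase of the 2-D FFT: an illumination- and contrast-invariant
    measure of feature significance in the image."""
    return np.angle(np.fft.fft2(img.astype(float)))
```

The equalized image stretches the gray-level histogram over the full [0, 255] range, while the phase image discards the FFT magnitude (where illumination and contrast mostly live) and keeps only feature locations.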

Figure 3.3 : Illumination normalization on a given image (a), using histogram equalization (b) and the resulting phase image from the FFT (c).


3.3 Locally Linear Embedding (LLE)

3.3.1 Motivation behind LLE

As stated earlier, the motivation for developing LLE derived from a peculiar degeneracy exhibited by an extended line of work on mixture models. Prior work demonstrated that non-linear manifolds can be learned from examples and parameterized by mixtures of locally linear models. However, their objective function, which measures the least-squares reconstruction error, was invariant to arbitrary rotations and reflections of the local coordinate systems in each linear model. In other words, instead of favoring a consistent alignment of the local linear models, these learning algorithms yielded internal representations that changed unpredictably. LLE was designed to solve this problem by discovering a single global coordinate system of lower dimensionality. Roweis and Saul [63] developed a non-parametric setting in which the local coordinate systems are defined by each data point and its nearest neighbors. The main novelty of LLE lies in its appeal to particular symmetries: the reconstruction weights capture the intrinsic geometric properties of local neighborhoods, namely those properties invariant to translation, rotation and scaling.

3.3.2 Theory

Essentially, LLE attempts to compute a low-dimensional embedding of high-dimensional data assumed to lie on a non-linear manifold, with the property that points that are nearby in the high-dimensional space remain nearby, and similarly co-located with respect to one another, in the low-dimensional space. Hence, embeddings are optimized to preserve the local configurations of nearest neighbors. At its core, LLE uses linear methods, such as least-squares optimization and matrix diagonalization, to obtain highly non-linear embeddings. The only element of non-linearity is introduced in the nearest-neighbor search, which can be viewed as a highly non-linear thresholding procedure. The goal is to learn such representations from examples: to discover, in a general setting and without the use of prior knowledge, the few degrees of freedom that underlie the observed modes of continuous variability.


3.3.3 LLE algorithm

The LLE transformation algorithm is founded on simple geometric intuitions. The input data consist of N points Xi, Xi ∈ RD, i ∈ [1,N], each of dimensionality D, sampled from an underlying manifold. As output, the algorithm produces N points Yi, Yi ∈ Rd, i ∈ [1,N], where d << D.
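The algorithm just outlined can be condensed into a short sketch (a didactic NumPy rendering of the standard Roweis and Saul formulation, not the thesis's own implementation; the regularization constant `reg` is an assumption, needed when the number of neighbors K exceeds the rank of the local covariance):

```python
import numpy as np

def lle(X, K=10, d=2, reg=1e-3):
    """Map the (N, D) data matrix X to an (N, d) embedding Y."""
    N = X.shape[0]
    # Step 1: find the K nearest neighbours of each point (Euclidean).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nbrs = np.argsort(dists, axis=1)[:, :K]
    # Step 2: reconstruction weights minimizing ||Xi - sum_j Wij Xj||^2,
    # subject to the weights of each point summing to one.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                   # centre neighbours on Xi
        C = Z @ Z.T                             # local covariance (K x K)
        C += reg * np.trace(C) * np.eye(K)      # regularize singular C
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()
    # Step 3: embedding from the bottom eigenvectors of M = (I-W)^T (I-W).
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    _, vecs = np.linalg.eigh(M)                 # ascending eigenvalues
    return vecs[:, 1:d + 1]                     # skip the constant eigenvector
```

The weights found in Step 2 are exactly the translation-, rotation- and scale-invariant quantities described in Section 3.3.1; Step 3 reuses them to place every point in the low-dimensional space.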


Figure 4.6 shows a graphical representation of a Support Vector classifier using a Radial Basis Function kernel to separate two classes.

Figure 4.6 : Example of a Support Vector classifier found by using a radial basis function kernel k(x, x') = exp(-||x - x'||^2 / 2σ^2).

4.5 Summary

This chapter presented an overview of Support Vector Machines, with their classification and regression models, as well as two popular kernel methods frequently used for pattern recognition problems. Together with the LLE methods presented in Chapter 3, all the necessary tools are now assembled to conduct the initial experiments for the proposed face detection method, which will be described in the following chapter.


Chapter 5 - Experiments and Results

This chapter describes the proposed face detection system based on LLE embeddings and Support Vector Machines. It presents the series of experiments that were conducted to determine the system's optimal parameters and to evaluate the performance of the method in terms of detection accuracy. The first section presents the different face databases that were used to train the system. A detailed explanation of the system's training phase then follows, along with the experiments performed on these databases. The chapter concludes by testing the detection software on facial databases.

5.1 Experimental Methodology

The first step in training the proposed face detector was to gather facial images from various databases available on the Internet, in order to create six distinct databases corresponding to the conditions enumerated in Chapter 2: variations in head pose, illumination, occlusion, expression and aging. The following subsection describes the databases used in this research, followed by a description of the proposed training procedure for the face detector.


5.1.1 Face databases

Notre-Dame Database [51]

The Notre-Dame Database [51] was created at the University of Notre-Dame, Indiana. It consists of frontal images of 277 individuals. For 77 individuals, there exists exactly one facial image per person; for the remaining 200 individuals, there are between 3 and 8 facial images. The time interval between acquisitions of different images of a given individual varies between 3 and 13 weeks. All images in the Notre-Dame Database show ambient-illuminated frontal faces with a normal expression. This image database was chosen because it offers facial images under ideal conditions, and therefore can be used to detect faces in images under normal conditions. Hence, the entire image set of this database was utilized to create the Frontal Database, which will be referred to as "DB1", and used for training and testing a classifier that detects frontal faces under ambient lighting with normal facial expressions. Figure 5.1 depicts sample images of the Notre-Dame Database.

Figure 5.1 : Sample facial images from the Notre-Dame Database [51], which includes frontal faces under ambient lighting with a normal expression.

Aleix-Robert (AR) Database [48]

The Aleix-Robert (AR) Database [48] was created by Aleix Martinez and Robert Benavente at the Computer Vision Center at Purdue University. It contains over 4,000 color images corresponding to 126 human faces (70 men and 56 women). Images feature frontal-view faces with variations in facial expression, illumination conditions and occlusions (sunglasses and scarf). The images were taken under strictly controlled conditions. No restrictions on what to wear (clothes, glasses, etc.), make-up or hairstyle were imposed on participants. Each individual participated in two sessions, separated by two weeks. The same pictures were taken during both sessions. For this study,


all images were converted to gray scale, and cropped to eliminate all hair and clothing. This image database was chosen because it offers facial images under three distinct conditions: illumination, occlusion and expression, and therefore can be used to detect faces in images under these conditions. Hence, the entire image set was divided into three databases, corresponding to the three facial conditions: the Illumination Database, henceforth referred to as "DB3", the Occlusion Database, referred to as "DB4", and the Expression Database, referred to as "DB5". These three databases were used for training and testing three distinct classifiers that detect faces under different lighting conditions, degrees of occlusion, and facial expressions. Figure 5.2 shows sample images from the AR Database.

Figure 5.2 : Sample facial images from the AR Database [48]. The conditions are: (a) neutral, (b) smile, (c) anger, (d) scream, (e) both lights on, (f) left light on, (g) right light on, (h) wearing sunglasses and (i) wearing a scarf.

UMIST Database [21]

The University of Manchester Institute of Science and Technology (UMIST) Database [21] consists of 564 gray-scale images of 20 different individuals. Each individual is captured over a range of poses from profile to frontal views (approximately 30 images between -90 and 0 degrees), and the subjects cover a range of race, sex and ethnicity. To obtain the full rotation from -90 to +90 degrees, each facial image between -90 and 0 degrees


was mirrored symmetrically about the vertical axis to generate the remaining head pose images from 0 to 90 degrees. This image database was chosen because it offers facial images with large variations in head rotation: a face pose is represented approximately every 3 degrees, so the database can be used to detect faces under these conditions. The entire image set of this database was taken to create the Head Pose Database, which will be referred to as "DB2", and used for training and testing a classifier that detects faces under different head positions, with ambient lighting and a normal expression. Figure 5.3 shows sample images from the UMIST Database.


Figure 5.3 : Sample facial images from the UMIST Database [21], which offers facial images with large variations in head rotation between +90 and -90 degrees. The head positions are at: (a) +90°, (b) +80°, (c) +75°, (d) +60°, (e) +45°, (f) +10°, (g) 0°, (h) -5°, (i) -15°, (j) -60°, (k) -75°, (l) -85° and (m) -90°.

Aging Database [17]

The Aging Database [17] was created by Maulin Gandhi at the Center of Intelligent Machines, McGill University, using a method for automatic synthesis of aged human facial images. It contains 818 frontal face images (of which 585 are male and 233 are female) labeled by age and gender. The main source was the Internet Movie Database [27], which contains a dated list of thousands of celebrity photographs with associated birthdates. The most significant uncontrolled factor in the Internet Movie Database was facial expression, but images with extreme facial expressions were discarded during the


selection process. The images include different celebrities of diverse ethnicity to cover a broad range of ages, from 15 years up to 99 years. This image database was chosen because it offers faces at different stages in life. In this database, a facial image is provided for every year in a lifetime, between 15 and 99 years old (some aged faces were generated synthetically, using the automatic method). The entire image set of this database was selected to create the Aging Database, referred to as "DB6", and used for training and testing a classifier that detects frontal faces at different stages of a lifetime, with ambient lighting and a normal expression. Figure 5.4 shows sample images from the Aging Database.

Figure 5.4 : Sample facial images from the Aging Database [17], consisting of facial images that cover a broad range of ages, from 15 years up to 99 years. Faces at different ages: (a) 15 yrs, (b) 25 yrs, (c) 40 yrs, (d) 48 yrs, (e) 58 yrs, (f) 78 yrs, (g) 92 yrs and (h) 99 yrs.

Table 5.1 summarizes the face databases available on the Internet that were used to create the six databases employed in this research for the face detector.

Internet Face Database    Created Face Database
Notre-Dame DB [51]        Frontal DB (DB1)
AR Face DB [48]           Illumination DB (DB3), Occlusion DB (DB4), Expression DB (DB5)
UMIST DB [21]             Head Pose DB (DB2)
Maulin DB [17]            Aging DB (DB6)

Table 5.1 : Face databases available on the Internet that were used in this research to create six facial databases, in order to train the proposed face detector.


5.1.2 System description

After gathering and cropping all the facial images from the face databases available on the Internet, these images were subdivided into six distinct databases to address the six challenges: a Frontal Database (DB1), a Head Pose Database (DB2), an Illumination Database (DB3), an Occlusion Database (DB4), an Expression Database (DB5) and an Aging Database (DB6). The next step was to train the face detection system. A pre-processing stage was first applied to normalize and resize all of the training images. In this chapter, illumination compensation was achieved by histogram equalization solely for testing purposes; Chapter 6 will examine this pre-processing stage in more detail. Then, for each database, all of the training input images (X) at an initial D-dimensional space (D=1024) were individually embedded into a lower d-dimensional space using the LLE algorithm (see Section 3.3), thereby creating a set of embedded points (Y). This mapping then was regressed using the analytical Support Vector Regression (SVR) method presented in Section 3.6. Subsequently, these regressed models were used to obtain a d-dimensional representation of non-face images (the same d as for the face images). The final step was to train a classifier to distinguish between face and non-face images in the d-dimensional space generated by LLE. This procedure was repeated six times, once for each of the six databases, so that regression and classification models were obtained for each database in order to embed and classify new sample images. A schematic description of this process is presented in Figure 5.5. To obtain the best results, experiments on each database had to be conducted to find the optimal parameters for the LLE method (the number of neighbors K and the lower dimension d) and the SVM method (the cost C and gamma γ parameters of the Radial Basis Function kernel). The following sections present these experiments.

5.2 Experiments on the Frontal Database (DB1)

The first set of experiments tested the Frontal Database (DB1), as described in the section above. This enabled us to evaluate the proposed method under normal conditions, with all facial images being frontal, ambient-illuminated and with a neutral


expression. At the end of this subsection, a comparison of the proposed method with the PCA-derived approach described in Section 3.7 gives an indication of the feasibility of an LLE approach for face detection.


Figure 5.5 : Schematic description of the training procedure for the proposed LLE face detection system.

5.2.1 Experiments on the D parameter

The variability of the initial dimension of the input vectors, defined as the D parameter in the LLE algorithm, was tested first. Given a set of sample facial images from


DB1, in which each image is represented as a vector of D dimensions at the input of the system, the LLE algorithm embedded each sample of the set from D dimensions into a lower dimension d. All input images were pre-processed by histogram equalization (see Section 3.2) to compensate for illumination variance, and then normalized between 0 and 1 so as to have a reasonable range of data for training the SVM classifiers (see Section 5.2.5). Once again, illumination compensation was achieved by histogram equalization solely for testing purposes, as the next chapter analyzes other illumination normalization methods.

Consider the problem of obtaining an embedding space for a given data set with a particular value of D. Intuitively, the embedding space should remain very similar for the exact same set of input facial images at a different D dimension. This hypothesis follows from the fact that, if each input image is scaled by the same factor, each sample point in the data set should keep the same K closest neighbors. Therefore, the correlation between embeddings mapped at different D dimensions should be relatively high.

In order to validate this hypothesis, and consequently to perform experiments that are independent of the D parameter, the following experiment applied the LLE algorithm to the Frontal Database (DB1) with constant values of K and d (K=10, d=2). The only varying parameter was the initial dimension D of the images, obtained by scaling the input images to various sizes. The initial dimensions for each image took on the values shown in Table 5.2, with the width (X), height (Y) and the corresponding D value. Square image sizes were chosen as they are frequently used in face detection systems [65][77][84].

X    Y    D (X × Y)
8    8    64
16   16   256
24   24   576
32   32   1024
48   48   2304
64   64   4096

Table 5.2 : Initial D dimension values for the correlation experiment on the Frontal Database (DB1).


The measure that was used to indicate the similarity between the various embeddings was the correlation factor, r^2, defined as:

r^2 = (n∑xy − ∑x∑y)^2 / {[n∑x^2 − (∑x)^2][n∑y^2 − (∑y)^2]}    (5.1)

where n is the number of samples, x represents the data in the first set (reference data) and y represents the data in the second set (comparative data). The graphical representation of the embeddings at these various D dimensions is shown in Figure 5.6. It must be noted that, in some cases, embeddings had to be rotated by 90 or 180 degrees, and sometimes flipped symmetrically, in order to evaluate the correlation correctly.
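Equation (5.1) is simply the squared Pearson correlation, and can be computed directly (a small helper sketch; the function name is illustrative):

```python
import numpy as np

def r_squared(x, y):
    """Correlation factor r^2 of Equation (5.1) between a reference
    embedding x and a comparative embedding y, flattened to vectors."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.size
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
    return num**2 / den
```

An r^2 of 1 indicates that one embedding is an affine rescaling of the other, which is why the rotated or flipped embeddings mentioned above must first be aligned before the comparison.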

Figure 5.6 : LLE embeddings of the Frontal Database (DB1) at d=2 with varying D dimensions (D = 64, 256, 576, 1024, 2304 and 4096; Y1 on the horizontal axis, Y2 on the vertical axis).

By observing the embedding spaces at different D dimensions (Figure 5.6), we can see that, when D takes on a value of 576 or greater, the embedding space remains relatively unchanged. This observation is confirmed by the correlation results shown in Table 5.3. In this experiment, the reference embedding space was set at D=1024.


D      r^2
64     0.56
256    0.72
576    0.85
1024   1.00
2304   0.98
4096   0.97

Table 5.3 : Correlation values between LLE embeddings of the Frontal Database (DB1) at d=2, with varying initial D dimensions. The reference data were taken at D=1024.

The only significant difference observed was at D=64, corresponding to faces scaled to 8x8 pixels. This result is not surprising: with 8x8 facial images, very few characteristics can be extracted by the algorithm, as the majority of important facial features are lost when resizing the images.

On the other hand, one could argue that the greater the D dimension, the more information the LLE algorithm possesses, and the better the embedding should be. However, as D increases, so do the size of the sparse matrix in the LLE algorithm and the computational time. Figure 5.7 shows the computational time to perform the LLE algorithm with respect to D. Hence, when increasing D over the range [64, 4096], the optimal D dimension can be considered found as soon as the embedding space has established a constant pattern. This experiment was performed on an Intel Pentium 4 2.9 GHz processor with 512 MB of RAM.

Figure 5.7 : CPU time for embedding the Frontal Database (DB1) with varying initial D dimensions.


By observing the graph shown in Figure 5.7, we can state that the evolution of the computational time versus the input D dimension tends to be linear. Furthermore, the correlation values presented in Table 5.3 show that, from D=1024 onward, the embedded spaces stay relatively the same. Hence, for the remainder of the experiments conducted in this research, the dimension of the initial input vectors was assigned the value D=1024.

5.2.2 Experiments on the K parameter

The next set of experiments varied the number of neighbors K while embedding in two dimensions (d=2). As stated in Chapter 3, the value of d has no influence on the measure of K, as the step that determines the closest neighbors is performed independently of the intrinsic dimension d. The objective is to determine the best value of K for the Frontal Database (DB1), which consists of a collection of various faces in frontal view, with ambient illumination and normal expression. This experiment increased the value of K at each iteration; for each value, the embedding space was plotted (Figure 5.8) and the increase in the number of significant reconstruction weights I(K), as defined in Section 3.5.2, was recorded. Figure 5.9 shows the evolution of the increase in I(K) (significant reconstruction weights Wij) with respect to the K parameter. This evolution stopped increasing consistently at K=10, indicating that the neighborhood was large enough to capture most local contexts.


Figure 5.8 : Graphical representation of LLE embeddings of the Frontal Database (DB1) for varying values of K (number of neighbours; K = 1 to 20, 30, 50 and 75), with d=2 (Y1 on the horizontal axis, Y2 on the vertical axis).


Figure 5.9 : Evolution of the percentage (%) increase in I(K) (number of significant weights) with respect to the number of K neighbours for the Frontal Database (DB1).

Results obtained from this experiment showed that the increase in the number of significant weights fell under 0.5% at K=10. This indicated that the evolution had stabilized and that the embedding space was well distributed. This value concurs with the graphical representation of the embeddings for various K values shown in Figure 5.8: the embeddings go from a sparse distribution to a more compact one as K increases from 1 to 10. As K increases beyond 10, the embeddings start to become sparse again, indicating that too large a number of neighbors loses the non-linear aspect of LLE, which then behaves like traditional PCA, as the entire data set is interpreted as a single local neighborhood.

5.2.3 Experiments on the d parameter

The next set of experiments tested the value of the intrinsic dimensionality, d, and observed the results in the embedded space. Because the measure used to determine the optimal K parameter is independent of d, the former parameter was kept at a constant value of neighbors (K=10). The first test used a subset of the Head Pose Database (DB2), consisting of a slowly rotating face. The observation that can be made when the intrinsic dimensionality takes on the value (d+1) is that it simply adds one extra column vector to those previously computed with a


dimension of d. Hence, when increasing the value of d by one, instead of re-computing the whole embedding, only the new dth dimension needs to be computed, which concurs with the theoretical aspects of LLE discussed in Chapter 3. Figure 5.10 illustrates that the embedding at d dimensions is simply a projection of the (d+1)-dimensional embedding space.

Figure 5.10 : Representation of embeddings at d =1, 2, 3 using the Head Pose Database (DB2), showing the analogy that the embedding at d dimensions is simply a projection of the (d+1) embedding space.
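This projection property follows directly from the spectral step of LLE: the d-dimensional embedding consists of the first d bottom eigenvectors, so raising d appends a column without changing the previous ones. A quick numerical check (with a random symmetric matrix standing in for the actual LLE cost matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30))
M = A @ A.T                          # stand-in for the LLE matrix (I-W)^T (I-W)
_, vecs = np.linalg.eigh(M)          # eigenvectors, ascending eigenvalues
Y2 = vecs[:, 1:3]                    # d = 2 embedding (bottom vector discarded)
Y3 = vecs[:, 1:4]                    # d = 3 embedding
print(np.allclose(Y3[:, :2], Y2))    # the d=2 embedding is a slice of the d=3 one
```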

To determine the optimal intrinsic dimensionality (the lower dimension d) for the Frontal Database (DB1), the residual variance, ρ, was the measure chosen to determine d (see Section 3.5.1). The value of d varied such that d = [1, N], with N=10. Figure 5.11 shows the evolution of the residual variance with respect to d.


Figure 5.11 : Evolution of the residual variance, ρ, with respect to the lower d dimension for the Frontal Database (DB1).

The curve seems to stabilize at d=3 (ρ=0.11) and reaches a minimum at d=6 (ρ=0.055). We can therefore expect the optimal intrinsic dimension to be found when d ≥ 3.
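Assuming Section 3.5.1 uses the standard Tenenbaum-style definition, the residual variance compares pairwise distances before and after embedding (a hedged sketch; the exact definition in the thesis may differ):

```python
import numpy as np

def residual_variance(X, Y):
    """rho = 1 - r^2 between pairwise Euclidean distances in the input
    space X and in the embedding Y (assumed definition of Section 3.5.1)."""
    def pdist(A):
        D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        return D[np.triu_indices(len(A), 1)]
    r = np.corrcoef(pdist(X), pdist(Y))[0, 1]
    return 1.0 - r**2
```

A value of ρ near zero means the embedding preserves the geometry of the input data, which is why the d at which the curve flattens is taken as the intrinsic dimensionality.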

5.2.4 Experiments on mapping new images using Support Vector Regression

Given the K and d parameters that generate optimal LLE embeddings for the Frontal Database (DB1), the next step was to assess the mapping of random images (face and non-face) into the embedded space with K=10. This mapping was performed using the Support Vector Regression (SVR) analytical mapping. As described in Section 3.6, the optimal kernel and its corresponding parameters must be determined in order to obtain the best functions for mapping new sample images. Many studies have shown the efficiency and robustness of the Gaussian kernel in SVMs [72], demonstrating a significant improvement in accuracy compared to polynomial kernels (see Section 4.4). Therefore, for the remainder of our experiments, Support Vector Machines (SVMs) used Radial Basis Function (RBF) kernels.


A separate model had to be trained for each d dimension to map new sample images, and for each model optimal parameters had to be determined. In the case of a Radial Basis Function (RBF) kernel, the C (cost) and γ (gamma) parameters had to be tuned to obtain the best results, and cross-validation was used to evaluate the efficiency. In this study, a "grid-search" on C and γ was chosen, where pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. In [10], it was determined that trying exponentially growing sequences of C and γ is a practical method to identify good parameters. Hence, C = 2^-5, 2^-3, ..., 2^5 and γ = 2^-11, 2^-9, ..., 2^5 for this experiment. Cross-validation consists of dividing the training data into N disjoint parts of equal size. For each part, a model is built from the N-1 other folds and evaluated on the remaining fold; this procedure is repeated for all N folds. The final model is obtained using the entire data set. In our case, N=10. The three-dimensional figures show the cross-validation accuracy with respect to the cost and gamma values, for d=2 (Figure 5.12) and d=3 (Figure 5.13). Table 5.4 presents the numerical values that give optimal regression models for all six intrinsic dimensions.
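The grid search with N-fold cross-validation can be sketched generically (here a small RBF kernel ridge regressor stands in for the SVR trainer, since the thesis's own SVR code is not shown; the grid bounds follow the recipe from [10]):

```python
import numpy as np

def rbf(A, B, gamma):
    """RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(Xtr, ytr, Xte, C, gamma):
    """Kernel ridge regression as a stand-in SVR (regularization = 1/C)."""
    K = rbf(Xtr, Xtr, gamma)
    alpha = np.linalg.solve(K + np.eye(len(Xtr)) / C, ytr)
    return rbf(Xte, Xtr, gamma) @ alpha

def grid_search(X, y, Cs, gammas, folds=10):
    """Try each (C, gamma) pair; keep the one with the best N-fold
    cross-validated mean squared error."""
    n = len(X)
    best = (None, None, np.inf)
    for C in Cs:
        for g in gammas:
            err = 0.0
            for k in range(folds):
                te = np.arange(n) % folds == k        # interleaved folds
                pred = fit_predict(X[~te], y[~te], X[te], C, g)
                err += ((pred - y[te]) ** 2).mean()
            if err / folds < best[2]:
                best = (C, g, err / folds)
    return best

# Exponentially growing grids, as suggested in [10].
Cs = [2.0 ** p for p in range(-5, 6, 2)]
gammas = [2.0 ** p for p in range(-11, 6, 2)]
```

The exponential grid keeps the search coarse but cheap; a finer grid around the winning (C, γ) pair can then be searched if needed.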

Figure 5.12 : Cross-validation accuracy of the regression model at d=2 for the Frontal Database (DB1), with respect to the cost (C) and gamma (γ) parameters of the RBF kernel.


Figure 5.13 : Cross-validation accuracy of the regression model at d=3 for the Frontal Database (DB1), with respect to the cost (C) and gamma (γ) parameters of the RBF kernel.

d   C     γ      # Support Vectors (SV)   Cross-Validation Accuracy
1   200   0.02   94                       0.942
2   200   0.02   151                      0.922
3   200   0.02   208                      0.850
4   200   0.02   206                      0.906
5   200   0.02   211                      0.917
6   200   0.02   213                      0.884

Table 5.4 : Cross-validation accuracy for all six regression models for the Frontal Database (DB1), with the optimal cost (C) and gamma (γ) parameters of the RBF kernel.

Looking at the kernel parameters (C and γ) that give optimal cross-validation accuracy for each d dimension, it is very interesting to note that the cost and gamma values remain the same for all six models (C=200 and γ=0.02). This means that all six intrinsic dimensions of the training points can be modeled with the exact same RBF kernel.

Using the optimal parameters for the RBF kernels, three experiments mapped new sample images onto the LLE embedding of DB1. The three image sets consisted of:

1) Non-face images from the Viola Database [84] (Figure 5.14 (a))


2) Face images from the Frontal Database (DB1), previously used during the training procedure (Figure 5.14 (b))
3) Face images from the Frontal Database (DB1), unseen during the training procedure (Figure 5.14 (c))

(a) Non-face images from the Viola DB, embedded using the regressed SVR models (o), plotted over the training frontal faces of DB1 embedded using LLE (+).
(b) Training frontal faces of DB1, embedded using the regressed SVR models (+), plotted over the same faces embedded using LLE (+).
(c) Testing frontal faces of DB1, embedded using the regressed SVR models (+), plotted over the training faces embedded using LLE (+).

Figure 5.14 : Embedding of new sample images onto the LLE embedding of the Frontal Database (DB1). (a) Non-face images, (b) face images seen in training, (c) face images unseen in training.

The different embeddings, shown in Figure 5.14, have been plotted with only two intrinsic dimensions (d=2). Even so, we already can see that the face and non-face images fall on distinct manifolds of the embedding, while new faces fall relatively


close to the original region. Consequently, we were ready to test the final step of the process, which consisted of identifying face and non-face images.

5.2.5 Face detection accuracy

The final experiment with the Frontal Database (DB1) was to evaluate detection accuracy with an SVM classifier as described in Chapter 4, using a Leave-One-Out (LOO) procedure [73]. This classifier would distinguish between frontal faces with ambient illumination and non-face images in the LLE space. To perform this experiment, the database was separated into two parts, with training and testing subsets of 235 facial images each. Non-face images for training and testing were taken from the Viola database [84], with 3000 images in each case. This low number of non-face images was used solely for testing purposes, and experiments on larger sets will be reported in the next chapter. The RBF kernel was used once again for SVM training, and cross-validation was performed to determine the optimal C and γ parameters for the two-class problem. With the optimal parameters in hand, detection results on the testing database could be obtained.
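The C and γ grid search used throughout this section can be reproduced with a standard cross-validated grid search. This is only a sketch, not the thesis code: the data are two synthetic Gaussian clusters standing in for the embedded face and non-face points.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data: low-dimensional "LLE coordinates" with binary labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3))])
y = np.array([1] * 40 + [0] * 40)  # 1 = face, 0 = non-face

# Grid-search the RBF cost (C) and gamma parameters by cross-validation,
# as done here to select the kernel for each d dimension.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [2, 20, 200, 2000], "gamma": [0.002, 0.02, 0.2, 2.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The best (C, γ) pair found this way is then reused to train the final classifier on the full training set.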

The system can exhibit two types of errors: false positives (FP), when the system accepts a non-face image as a face, and false negatives (FN), when the system rejects a face image. It also can have two types of correct answers: true positives (TP), when the system accepts a face image, and true negatives (TN), when the system rejects a non-face image. In order to be independent of the specific dataset distribution, the performance of the system often is measured in terms of these rates:

FPR = false positive rate = number of FPs / number of non-face images
FNR = false negative rate = number of FNs / number of face images
TPR = true positive rate = number of TPs / number of face images
TNR = true negative rate = number of TNs / number of non-face images
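These four rates can be computed directly from the true and predicted labels. The sketch below is illustrative (not the thesis code), assuming labels of 1 for face and 0 for non-face:

```python
import numpy as np

def detection_rates(y_true, y_pred):
    """Compute (FPR, FNR, TPR, TNR) from binary labels: 1 = face, 0 = non-face."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    faces, non_faces = (y_true == 1), (y_true == 0)
    tpr = np.mean(y_pred[faces] == 1)      # accepted faces / all faces
    fnr = np.mean(y_pred[faces] == 0)      # rejected faces / all faces
    tnr = np.mean(y_pred[non_faces] == 0)  # rejected non-faces / all non-faces
    fpr = np.mean(y_pred[non_faces] == 1)  # accepted non-faces / all non-faces
    return fpr, fnr, tpr, tnr
```

Note that TPR + FNR = 1 and TNR + FPR = 1, which is why tables in this chapter report complementary pairs of counts.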

This experiment reported these measures at varying d values of the LLE algorithm (intrinsic dimensionality of the embedded space), with d = [2-6]. The objective was to observe the effect of the number of features used in the training of the SVM classifier to


obtain the best results, and therefore determine the optimal dimensionality for the embedded space (lower d dimension). The first step was to determine the optimal kernel parameters for all d dimensions. Results are shown in Table 5.5:

d    C     γ      # SV   Cross-Validation Accuracy
2    200   0.02   16     0.998
3    200   0.2    8      0.998
4    20    0.2    11     0.998
5    2     0.2    30     0.998
6    2     0.2    30     0.997

Table 5.5 : Cross-validation accuracy for SVM classification with varying d dimensions on the Frontal Database (DB1), with the optimal cost (C) and gamma (γ) parameters of the RBF kernel.

With these kernel values, prediction on the testing database was performed with the trained models for each d dimension. Table 5.6 presents a series of detection indicators for each d dimension:

d (# features in    Overall Accuracy      TPR                 TNR                   FPR              FNR
SVM training)
2                   98.85% (3198/3235)    85.53% (201/235)    99.90% (2997/3000)    0.10% (3/3000)   14.47% (34/235)
3                   99.21% (3211/3235)    91.12% (213/235)    99.91% (2998/3000)    0.09% (2/3000)   8.88% (22/235)
4                   98.91% (3201/3235)    85.95% (202/235)    99.93% (2998/3000)    0.09% (2/3000)   14.05% (33/235)
5                   98.88% (3199/3235)    85.10% (200/235)    99.96% (2999/3000)    0.04% (1/3000)   14.90% (35/235)
6                   98.79% (3196/3235)    83.82% (197/235)    99.96% (2999/3000)    0.04% (1/3000)   16.18% (38/235)

Table 5.6 : Detection results for the Frontal Database (DB1) with varying d dimensions.

These results are interesting: the optimal detection results were obtained when d=3, even though the experiments in Section 5.2.3 demonstrated that the residual variance ρ was lower when d=6, suggesting a better mapping from the high-dimensional data to a lower dimension. The difference in false negative rates (8.88% for d=3 versus 16.18% for d=6) may indicate that using six features for SVM training is either over-fitting the data or allotting more importance to the non-facial images in the training stage. Hence, a simpler representation of frontal, ambient-illuminated facial images might give better results. Furthermore, these high false negative rates indicate that using 3000 non-faces clearly is insufficient to train the classifier. Chapter 6 will look more thoroughly at increasing the size of the non-face set.

To evaluate the performance of this method, a comparison was made with the PCA face detection method described in Section 3.7. To do this, the correct number of principal components (or “Eigenfaces”) for DB1 had to be determined by measuring the detection accuracy with respect to the number of principal components. Figure 5.15 shows this experiment, with the number of principal components (#PC) varying such that #PC = [2, 100]. All pre-processing steps that were performed with the proposed LLE method, such as illumination normalization (histogram equalization) and resizing, also were applied.

Figure 5.15 : Evolution of the detection accuracy (%) on DB1, with respect to the number of Principal Components (PC) (“Eigenfaces”) for the PCA face detection approach.
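The component sweep above can be sketched as a PCA-plus-SVM pipeline evaluated at several numbers of components. This is a sketch, not the thesis code: the data below are synthetic stand-ins for the vectorized face patches, and while the C=2000, γ=0.02 values echo Table 5.7, everything else is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for face / non-face image vectors (100-D patches).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (60, 100)), rng.normal(1, 1, (60, 100))])
y = np.array([1] * 60 + [0] * 60)  # 1 = face, 0 = non-face

# Project onto the leading principal components ("Eigenfaces") and classify
# in that subspace, sweeping the number of components.
for n_pc in (2, 10, 30):
    pipe = make_pipeline(PCA(n_components=n_pc),
                         SVC(kernel="rbf", C=2000, gamma=0.02))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(n_pc, round(acc, 3))
```

The #PC value giving the highest cross-validated accuracy would then play the role of the optimum reported in the text.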

The optimal detection accuracy when using PCA was obtained when the number of principal components was set at #PC=30 (98.54% detection accuracy). The optimal kernel parameters for SVM classification in the case of PCA then were obtained using cross-validation, and are shown in Table 5.7.


#PC    C      γ      Cross-Validation Accuracy
30     2000   0.02   0.987

Table 5.7 : Cross-validation accuracy for the SVM classifier using the PCA face detection approach on the Frontal Database (DB1).

The Receiver Operating Characteristic (ROC) curves then were plotted for the LLE and PCA methods, and are shown in Figure 5.16.

Figure 5.16 : ROC curves for the proposed LLE and PCA methods, using the testing images from DB1.

Clearly, the proposed LLE method outperforms the PCA approach in terms of detection results. A useful summary measure that can be extracted from these curves is the Area Under the ROC Curve (AUROC), which is used to compare the two detection methods quantitatively.

Detection Method    Area Under the ROC Curve (AUROC)
LLE                 0.991
PCA                 0.982

Table 5.8 : Area Under the ROC Curve for the LLE and PCA methods when evaluated on the testing images from the Frontal Database (DB1).
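The AUROC values of Table 5.8 are single-number summaries of the ROC curves. A toy illustration with hypothetical detector scores (not the thesis data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# True labels (1 = face, 0 = non-face) and hypothetical decision scores
# from two detectors; higher score = more "face-like".
y_true = np.array([1, 1, 1, 0, 0, 0])
scores_lle = np.array([0.9, 0.8, 0.6, 0.4, 0.2, 0.1])  # ranks all faces above non-faces
scores_pca = np.array([0.9, 0.5, 0.4, 0.6, 0.2, 0.1])  # mis-ranks one non-face

print(roc_auc_score(y_true, scores_lle))
print(roc_auc_score(y_true, scores_pca))
```

A perfect ranking yields AUROC = 1.0; each face/non-face pair ranked in the wrong order lowers the area, which is why the better detector has the larger AUROC.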

The numerical results shown in Table 5.8 confirm that, under normal conditions with frontal, ambient-illuminated and neutral-expression facial images, the proposed LLE method yields better results. Further experiments with different facial conditions such as


head rotation, illumination and expression will indicate how it can adapt to and thereby accommodate various challenges.

5.3  Experiments on the Head Pose Database (DB2)

The next set of experiments tested the head pose facial images from the Head Pose Database (DB2), as described in Section 5.1, to enable the face detection software to account for head rotation. The first two experiments looked at determining the optimal K and d parameters, by means of the same methods that were used for the Frontal Database (DB1). The embeddings then were regressed using SVR, and detection experiments were performed to evaluate the ability of the software to detect faces at various head poses.

5.3.1 Experiments on the K parameter

The first experiment for the Head Pose Database consisted of varying the number of K neighbors while embedding the images in two dimensions (d=2). As stated before, the value of d has no influence on the measure of K, as the closest neighbors are determined in the first step of the LLE algorithm. The objective was to determine the best value of K for the Head Pose Database (DB2), which consists of a collection of various faces at head positions between –90° and +90°, with ambient illumination and normal expression. The optimal K value was reached when the mapping in the intrinsic dimensionality (lower d dimension) was adequately distributed and had stabilized. This experiment increased the value of K at each iteration and, for each value, recorded the increase in the number of significant reconstruction weights, I(K) (see Section 3.5.2).
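The neighborhood-size sweep described above can be sketched with scikit-learn's LLE implementation. This is not the thesis code: the data are a synthetic noisy curve standing in for the face images, and scikit-learn's `reconstruction_error_` is used only as a rough analogue of the significant-weight criterion I(K).

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in data: points on a noisy 1-D curve embedded in 3-D.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 3 * np.pi, 300))
X = np.column_stack([np.cos(t), np.sin(t), t]) + rng.normal(0, 0.01, (300, 3))

# Sweep the neighborhood size K, as in the experiments of this section;
# the reported reconstruction error indicates how well each K captures
# the local structure of the manifold.
for K in (5, 10, 20, 50):
    lle = LocallyLinearEmbedding(n_neighbors=K, n_components=2, random_state=0)
    Y = lle.fit_transform(X)
    print(K, lle.reconstruction_error_)
```

In the thesis the stopping rule is instead the point where the growth of I(K) falls under 0.5%, but the sweep structure is the same.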

Figure 5.18 shows the evolution of the increase in the number of weights I(K) (significant reconstruction weights Wij ) with respect to the K parameter, while Figure 5.17 illustrates sample embeddings at four different values of K (K=5, 10, 20, 50).



Figure 5.17 : Graphical representation of LLE embeddings in three-dimensions (d=3) of the Head Pose Database (DB2), consisting of facial images at various head rotations, with varying K neighbours (K=5, 10, 20, 50).

Figure 5.18 : Evolution of the percentage (%) increase in I(K) (number of significant weights) with respect to number of K neighbours for the Head Pose Database (DB2).


Results obtained from this experiment (Figure 5.18) indicate that the increase in the number of significant weights fell under 0.5% when K=10, indicating that the evolution had stabilized and that the neighborhood was large enough to capture most local contexts in the embedding space. This value is identical to the value of K obtained for the Frontal Database (DB1).

5.3.2 Experiments on the d parameter

The next experiment varied the value of the intrinsic dimensionality, d, and computed the residual variance, ρ, of the embeddings for each value of d. The number of neighbors was kept constant (K=10), in accordance with the value obtained in the previous experiment. To determine the optimal intrinsic dimensionality for the Head Pose Database (DB2), the residual variance, ρ, was used as the measure to determine d. The value of d varied such that d = [1, N], N = 10. Figure 5.19 shows the evolution of the residual variance, ρ, with respect to d.

Figure 5.19 : Evolution of the residual variance, ρ, with respect to the lower d dimension for the Head Pose Database (DB2).
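The residual variance used here follows the definition of Roweis and Saul (and Tenenbaum et al.): ρ = 1 − r², where r is the correlation between pairwise distances in the input space and in the embedded space. A sketch on a synthetic curve (not the face data):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import LocallyLinearEmbedding

def residual_variance(X, Y):
    """rho = 1 - r^2, with r the correlation between pairwise distances
    in the input space (X) and the embedded space (Y)."""
    dx, dy = pdist(X), pdist(Y)
    r = np.corrcoef(dx, dy)[0, 1]
    return 1.0 - r ** 2

# Synthetic stand-in data: a noisy 1-D curve in 3-D.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t]) + rng.normal(0, 0.01, (200, 3))

# Sweep the embedding dimensionality d and report rho for each value,
# as in the d-parameter experiments of this chapter.
for d in (1, 2, 3):
    Y = LocallyLinearEmbedding(n_neighbors=10, n_components=d,
                               random_state=0).fit_transform(X)
    print(d, round(residual_variance(X, Y), 3))
```

The optimal d is then read off as the value where ρ stops decreasing appreciably.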

Contrary to the evolution shown for DB1 (Figure 5.11), the curve seems to stabilize later, more specifically at d=5 (ρ=0.10), indicating that a greater intrinsic dimension is required for head pose. The detection experiments in Section 5.3.4 will determine if optimal detection results are obtained when d=5, or if an even greater value is required.


5.3.3 Experiments on mapping new images using SVR

Given the K and d parameters which generate optimal LLE embeddings for the Head Pose Database (DB2), consisting of facial images varying from –90 to +90 degrees, the next step was to examine the mapping of random images (face and non-face) into the embedded space with K=10 and d=3. This mapping was performed using SVR analytical mapping. As described in the algorithm (see Section 3.6), the optimal kernel, as well as its corresponding parameters, must be obtained first, in order to have the best kernel functions available to map new sample images. Table 5.9 presents the kernel parameters that offer the best cross-validation accuracy for the first six regression models.

d    C     γ      Cross-Validation Acc.
1    200   0.02   0.982
2    200   0.02   0.984
3    200   0.02   0.983
4    200   0.02   0.983
5    200   0.02   0.985
6    200   0.02   0.984

Table 5.9 : Cross-validation accuracy for the first six regression models of the Head Pose Database (DB2), with the optimal cost (C) and gamma (γ) parameters of the RBF kernel.

Scrutinizing the kernel parameters (C and γ) which yield optimal cross-validation accuracy for each d dimension (Table 5.9), it is interesting to see that the cost and gamma parameters remain the same for all d models. We, therefore, will make the assumption for the remainder of the experiments in this chapter that, for any d dimension that has to be modeled using SVR, the kernel parameters will take on the values of C=200 and γ=0.02. Figure 5.20 presents the LLE embedded space of the Head Pose Database (DB2), along with non-face images that were mapped analytically onto this LLE space, using the kernel functions that were obtained in this experiment.


Figure 5.20 : Embedding of non-face images (o points) using SVR analytical mapping, onto the LLE map of training facial images (+ points) from the Head Pose Database (DB2).
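The per-dimension SVR mapping can be sketched as follows: one RBF SVR model is fit for each embedded coordinate, so unseen images can be projected into the LLE space without recomputing the embedding. The data below are synthetic stand-ins for the face images; the C=200, γ=0.02 values echo Table 5.9, but everything else is illustrative.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.svm import SVR

# Synthetic stand-in for vectorized training images and their LLE coordinates.
rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, (150, 30))
Y_train = LocallyLinearEmbedding(n_neighbors=10, n_components=3,
                                 random_state=0).fit_transform(X_train)

# One RBF SVR model per embedded dimension: image space -> LLE coordinate j.
models = [SVR(kernel="rbf", C=200, gamma=0.02).fit(X_train, Y_train[:, j])
          for j in range(Y_train.shape[1])]

# New (unseen) images are mapped analytically, one coordinate per model.
X_new = rng.normal(0, 1, (5, 30))
Y_new = np.column_stack([m.predict(X_new) for m in models])
print(Y_new.shape)  # one 3-D embedded coordinate per new image
```

This is the step that makes the embedding usable at detection time, since LLE itself provides no out-of-sample extension.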

5.3.4 Face detection accuracy

The final experiment with the Head Pose Database (DB2) was to evaluate detection accuracy by training a Leave-One-Out (LOO) SVM classifier, as described in Chapter 4 and [73]. This classifier would distinguish between faces with varying head rotations and non-face images in the LLE space. To perform this experiment, the database (DB2) was separated into two parts, with training and testing subsets of 535 facial images each. Non-face images for training and testing were selected from the Viola database [84], with 3000 images in each case. Once again, this low number of non-face images was used solely for testing purposes in this chapter, and experiments on larger sets will be reported in the next chapter.

The first test was to see the evolution of detection accuracy with respect to the number of features (lower d dimension). As observed in the experiment conducted in Section 5.3.2, the residual variance curve for DB2 seemed to stabilize later than the evolution observed with DB1 (Figure 5.11). Hence, DB2 might require a greater value of d (d>10) to obtain a minimum residual variance value, ρ. Therefore, a greater range was chosen, so that d = [1, N], N = 25 in this experiment. The RBF kernel was used for SVM training, while keeping the kernel parameters constant at C=20 and


γ=2.0. Detection accuracy on the testing database was reported for each value of d, and the graphical representation is shown in Figure 5.21.

Figure 5.21 : Evolution of the detection accuracy (%) on DB2, with respect to the lower d dimension for the proposed LLE face detection approach.

According to Figure 5.21, the optimal detection results were obtained at d=25, where the detection accuracy stabilized at 99%. Table 5.10 presents the detailed detection results at d=25. This may suggest that, to represent head rotation correctly, the data must be embedded at a much higher intrinsic dimensionality, d, so as to offer enough features to obtain good classification results for all possible head rotations. This value is much higher than that observed with frontal faces (DB1), for which only three features were required to represent faces.

d (LLE)   Overall Acc.          TPR                 TNR                  FPR              FNR
25        98.96% (3515/3552)    96.55% (533/552)    99.20% (2982/3000)   0.60% (18/3000)  3.45% (19/552)

Table 5.10 : Optimal detection results for the Head Pose Database (DB2), given at d=25.

5.4  Experiments on the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases

The following experiments tested the subsets of the AR Database [48], which previously had been subdivided into three databases: Illumination (DB3), Occlusion (DB4) and Expression (DB5). The Illumination DB consisted of facial images illuminated with a directional light source at three different positions (left, right and


center). The Expression DB included normal, surprised, screaming and frustrated facial expressions. Finally, the Occlusion DB had faces occluded alternately with sunglasses or scarves. The objective here was to account for these three challenges within the proposed face detection software, by determining the optimal parameters and generating the corresponding regression and classification models.

5.4.1 Experiments on the K parameter

The following experiment looked at obtaining the optimal K parameter (number of neighbors) for each of the three databases (Illumination, Expression and Occlusion), by recording the increase in the number of significant reconstruction weights, I(K), with respect to the value of K. The evolution for all three databases is shown in Figure 5.22.

Figure 5.22 : Evolution of the percentage (%) increase in I(K) (number of significant weights) with respect to the number of K neighbours for the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases.

As for the previous two databases, Frontal (DB1) and Head Pose (DB2), results showed that the increase in the number of significant weights fell under 0.5% when K=10 for all three curves, indicating that the embedding spaces were well distributed.


5.4.2 Experiments on the d parameter

The next experiment varied the value of the intrinsic dimensionality, d (lower d dimension for the LLE embeddings), and computed the residual variance, ρ, of the embeddings for each value of d, while keeping K=10. The value of d varied such that d = [1, N], N = 10. Figure 5.23 shows the evolution of the residual variance, ρ, with respect to d, for all three databases tested in this section.

Figure 5.23 : Evolution of the residual variance, ρ, with respect to the lower d dimension for the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases.

While the three databases begin at different initial residual variances (ρ_Expression=0.38, ρ_Occlusion=0.30 and ρ_Illumination=0.18 at d=1), they all tend to group together when d=6 (ρ=0.10). From then on, as the value of d increases, the evolution of the residual variances stabilizes. Therefore, we can expect good detection accuracy when d ≥ 6. The following experiment will determine the optimal d value for these databases with respect to detection accuracy.

5.4.3 Face detection accuracy

The final experiment with these three databases (DB3, DB4 and DB5) was to evaluate detection accuracy, by training three distinct SVM classifiers with respect to their challenge, and then evaluating them on their respective testing images. These three


classifiers would distinguish between faces (with varying illumination, occlusion and expression) and non-face images in the LLE space. To perform this experiment, the three databases were subdivided into subsets, with a training and testing subset for all three cases. Non-face images were taken from the Viola database [84] for training and testing, with 3000 images in each case. Experiments on larger sets of non-face images will be reported in the next chapter.

As stated in the previous section, to map new sample images onto the LLE embeddings using SVR, an RBF kernel with the same kernel parameters was required (C=200 and γ=0.02). This assumption was made for any d dimension that has to be modeled using SVR, and is based on the series of experiments that were conducted for the Frontal DB1 (Section 5.2.4) and Head Pose DB2 (Section 5.3.3) databases. Then, for each value of d with d = [1, N], N = 25, all three SVM classifiers were trained and tested while keeping the kernel parameters constant at C=20 and γ=0.2. Detection accuracy on the testing databases then was recorded for each value of d. This procedure was repeated for each of the three databases, and the graphical representations are shown in Figure 5.24.

Figure 5.24 : Evolution of the detection accuracy (%) for the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases, with respect to the lower d dimension for the proposed LLE face detection approach.

From Figure 5.24, we can select the values of d which offer optimal detection results for all three databases. These values are shown in Table 5.11, along with the measures obtained from the testing databases.


The first observation is that the optimal intrinsic dimensionality, d (number of features in SVM), which offers the best results differs across the three databases. While seven features are sufficient for the Illumination Database (DB3) to obtain 100% accuracy, the two other databases require higher values (d=15 for the Occlusion DB and d=20 for the Expression DB). However, the Occlusion DB (DB4) and Expression DB (DB5) curves do not reach a 100% true positive rate. Still, all three achieved 0% false positive rates, confirming that the proposed system has the potential to be quite robust and accurate in rejecting non-facial and background images.

Database             d    # SV   Overall Acc.          TPR                  TNR                   FPR              FNR
Illumination (DB3)   7    43     100.00% (3505/3505)   100.00% (505/505)    100.00% (3000/3000)   0.00% (0/3000)   0.00% (0/505)
Occlusion (DB4)      15   48     99.73% (3371/3380)    97.63% (371/380)     100.00% (3000/3000)   0.00% (0/3000)   2.37% (9/380)
Expression (DB5)     20   202    99.65% (3493/3505)    97.62% (493/505)     100.00% (3000/3000)   0.00% (0/3000)   2.38% (12/505)

Table 5.11 : Optimal detection results for the Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases, presented for d=7, d=15 and d=20, respectively.

That the Illumination Database (DB3) does better in terms of true positive rates and has a lower d value, when compared to DB4 and DB5, can be explained by the fact that an illumination compensation operation was applied in the first stage of the proposed method. In this experiment, histogram equalization was used to eliminate illumination variance within the images, thus diminishing the illumination factor of the face detection problem. Chapter 6 will analyze whether this technique offers the best results. It is not surprising, therefore, that fewer features (7) are required to represent these faces in the LLE space, which is closer to the number of features needed for the frontal, ambient-illuminated faces (d=3) obtained in Section 5.2.

As for the Expression and Occlusion Databases, a simple inspection of their images reveals a greater variation in facial expressions across individuals. In the Occlusion Database, however, the sunglasses and scarf are positioned in the same manner for every individual; it therefore is easier for LLE to find a locally linear map, and a lower intrinsic dimensionality d is required. Figure 5.25 shows the LLE embeddings for the


Illumination (DB3), Occlusion (DB4) and Expression (DB5) Databases in three dimensions (d=3). For DB3, red points depict ambient-illuminated faces, while green, blue and yellow points represent faces illuminated with a left, right and bi-directional light source, respectively. For DB4, non-occluded faces are shown by red dots, while green and blue points represent faces occluded with sunglasses and scarves. For both of these databases, LLE is able to separate the different conditions efficiently. However, for DB5, where red, blue, green and yellow points show faces with normal, surprised, frustrated and screaming facial expressions, respectively, LLE does not seem able to distinguish between the different facial expressions when embedding these faces in three dimensions.

Figure 5.25 : LLE embeddings in three dimensions (d=3, with K=10) for the (a) Illumination (DB3), (b) Occlusion (DB4) and (c) Expression (DB5) Training Databases. Each color point in a given database represents a particular facial condition.

In the following section, we will conduct the final face database experiment by testing the Aging Database (DB6). This will enable us to obtain the last set of parameters necessary to operate the proposed face detection software.


5.5  Experiments on the Aging Database (DB6)

The following experiments tested the Aging Database, which includes faces of individuals whose ages range between 15 and 99 years old. As for the previous subsections, the first two experiments consisted of observing the increase in the number of significant reconstruction weights with respect to the value of K (Figure 5.26 (a)), as well as the evolution in residual variance, ρ (Figure 5.26 (b)), to determine the values of the K and d parameters, respectively, for the Aging Database (DB6).

Figure 5.26 : Evolution of: (a) percentage (%) increase in I(K) (number of significant weights) with respect to the number of K neighbours for the Aging Database (DB6); (b) residual variance, ρ, with respect to the lower d dimension for the Aging Database (DB6).

From these two graphs, we can observe, once again, that the increase in the number of significant reconstruction weights has stabilized when K=10, while the residual variance, ρ, falls to a minimum when d=3.

Using the same Radial Basis Function (RBF) kernels as the previous experiments (see Sections 5.2 through 5.4), regression models for mapping new sample points onto the LLE maps were obtained. Then, for each value of d, with d = [1, N], N = 25, an SVM classifier was trained and tested with C=20 and γ=0.2, in order to distinguish between faces with varying ages and non-face images in the LLE space. Detection accuracy on the aging testing database then was reported for each value of d. Figure 5.27 shows the


evolution of detection accuracy (%) with respect to d, while Table 5.12 shows the measures obtained on the testing images.

Figure 5.27 : Evolution of the detection accuracy (%) for the Aging Database (DB6), with respect to the lower d dimension for the proposed LLE face detection approach.

d    # SV   Overall Acc.         TPR                TNR                   FPR              FNR
7    34     99.64% (3392/3404)   97.02% (392/404)   100.00% (3000/3000)   0.00% (0/3000)   2.98% (12/404)

Table 5.12 : Optimal detection results for the Aging Database (DB6), given at d=7.

Similar to the Illumination (DB3) and Frontal (DB1) Databases, a lower d intrinsic dimensionality is required to detect a maximum number of faces while rejecting most of the non-face images. Indeed, only seven features are sufficient for the Aging Database to obtain a detection accuracy close to 100%. This may indicate that the aging factor is more easily identifiable in the LLE algorithm, when compared to other factors such as head pose or expression.

Table 5.13 gives a summary of the optimal K (number of neighbors) and d (intrinsic dimensionality) values found for all six databases:


Database   K    d
DB1        10   3
DB2        10   25
DB3        10   7
DB4        10   15
DB5        10   20
DB6        10   7

Table 5.13 : Optimal K and d values obtained from experiments conducted on the Frontal (DB1), Head Pose (DB2), Illumination (DB3), Occlusion (DB4), Expression (DB5) and Aging (DB6) Databases, presented in Sections 5.2 through 5.5.

With these optimal values, a full comparison between the LLE method and the PCA face detection approach (see Section 3.7) was performed. In this experiment, each of the six classifiers trained with the proposed LLE method was compared to the PCA approach, and all six were tested on the facial images provided by their respective testing databases. The ROC curves are shown in Figure 5.28. The optimal number of principal components (“Eigenfaces”), as described in Section 5.2.5, was determined in advance for each DB.

Figure 5.28 : ROC curves for the proposed LLE and PCA methods, using the testing images from (a) DB1 (#PC=30), (b) DB2 (#PC=52), (c) DB3 (#PC=35), (d) DB4 (#PC=38), (e) DB5 (#PC=47), and (f) DB6 (#PC=30).

The ROC curves in Figure 5.28 show that the proposed LLE method outperforms the PCA approach on each database. In particular, it demonstrates the inability of the PCA


method to cope with variations in head pose (DB2) (see Section 2.2.4), which clearly is not the case for LLE (Figure 5.28 (b)). In the following section, we will present the proposed testing system, as well as the results obtained from testing the system on cropped face and non-face images.

5.6  Experiments on Random Input Images

After conducting the training phase of the system, which determined the optimal K and d parameters for LLE and generated the regression models (for mapping new sample images) and classification models (for distinguishing between face and non-face images) for all six databases, the next challenge was to determine how to bring all these databases, models and parameters together to compose a single face detector. The objective is that, given a random input image, the system must utilize all six databases corresponding to the different challenges in face detection (see Chapter 2) in order to determine whether the given image is a facial image or not. One possibility was to join all these databases into a single LLE embedding, which would represent what a face is in the LLE domain under these different conditions. However, because different lower intrinsic d dimensions were required to obtain optimal detection results for the various databases, the dimensionality of this single subspace cannot be fixed at a single d value. Instead, it was decided to treat the databases individually: as soon as an image falls in one of the face sub-spaces, given a certain threshold, it is identified as a facial image. The proposed testing strategy is illustrated in Figure 5.29.

In summary, when an input image enters into the system, it is pre-processed by having its histogram equalized and classified using all six trained models, sequentially, to predict whether it is a facial image or not. It must be noted that, in this chapter, illumination compensation was achieved by histogram equalization. However, the next chapter will experiment on other illumination normalization operations, such as using the FFT phase image, and will analyze the performance of face detection when using this new method. At the end of the process, the highest confidence level out of the six


given models (or classifiers) is saved and, if this value passes a threshold, the image is classified as a face.
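A minimal sketch of this decision rule, assuming six already-trained SVM classifiers and using the SVM decision-function value as the confidence level (all data and cluster centers below are synthetic stand-ins, not the thesis models):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

def make_clf(center):
    """Train one stand-in classifier: a 'face' cluster near `center`
    versus a shared 'non-face' cluster near -3."""
    X = np.vstack([rng.normal(center, 0.5, (40, 3)),
                   rng.normal(-3, 0.5, (40, 3))])
    y = np.array([1] * 40 + [0] * 40)
    return SVC(kernel="rbf", C=20, gamma=0.2).fit(X, y)

# One classifier per database (DB1..DB6), as in Figure 5.29.
classifiers = [make_clf(c) for c in (0, 1, 2, 3, 4, 5)]

def detect_face(x, threshold=0.0):
    """Keep the highest confidence over all six classifiers; accept the
    image as a face if it passes the threshold."""
    confidences = [clf.decision_function(x.reshape(1, -1))[0]
                   for clf in classifiers]
    return max(confidences) > threshold

print(detect_face(np.array([2.0, 2.0, 2.0])))    # near a "face" cluster
print(detect_face(np.array([-3.0, -3.0, -3.0]))) # near the "non-face" cluster
```

The max-confidence rule means an image needs to fall near only one of the six face sub-spaces to be accepted, which is what allows the classifiers to cover for each other in Table 5.14.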


Figure 5.29 : Schematic representation of the testing system for the proposed LLE face detection method.

5.6.1 Testing the system on facial databases

This experiment tested the system on facial databases. The facial images were selected from the six testing databases which were built for this research (see Sections 5.2 to 5.5), each corresponding to one of the six challenges presented in Chapter 2. Hence, the frontal testing facial images (DB1), the head pose testing images (DB2), the illumination testing images (DB3), the occlusion testing images (DB4), the expression testing images (DB5) and the aging testing images (DB6) all were given as input to the


system. The first system measurement was to record the face detection accuracy for each of these testing databases, which combines the classification results from all six classifiers (see Figure 5.29). These results then were compared with the results obtained on the same databases using only the single classifier associated with each database. For instance, Section 5.3.4 gave the detection accuracy for the testing images from the Head Pose Database (DB2), using the single classifier which treats head pose. This value (%) was compared to the face detection accuracy (%) obtained when using the combination of the six classifiers, as presented in the proposed testing system (Figure 5.29). The comparative results shown in Table 5.14 confirm the advantage of using a combination system instead of using the classifiers independently, as certain faces in a particular database were detected by a classifier treating another condition, instead of their respective classifier.

Detection Method                   DB1     DB2     DB3     DB4     DB5     DB6
Individual classifiers             91.1%   96.5%   100%    97.6%   97.6%   97.0%
Combination of all six classifiers 97.1%   99.2%   99.8%   98.9%   97.6%   99.5%

Table 5.14 : Comparison of detection rates between identifying faces with a single classifier and using the proposed combination system, made on all six testing databases used in Sections 5.2 through 5.5.

This characteristic can be visualized more easily with the second measurement collected from the system: the distribution of identified faces across the six classifiers, for all six testing databases. These distributions are represented in Figure 5.30. For example, in Figure 5.30 (a), which tested the frontal facial images (DB1), 79% of the detected faces were found by the Frontal classifier (DB1), while 12% were detected by the Illumination classifier (DB3) and 8% by the Aging classifier (DB6).

(a) Frontal Testing DB (DB1)

(b) Head Pose Testing DB (DB2)


(c) Illumination Testing DB (DB3)

(d) Occlusion Testing DB (DB4)

(e) Expression Testing DB (DB5)

(f) Aging Testing DB (DB6)

Figure 5.30 : Distribution of classifiers used for detecting faces in the six testing databases (DB1, DB2, DB3, DB4, DB5, DB6).

In general, the testing images which represented faces under a particular condition (e.g., illumination) were detected by the classifier which treats that respective condition. The only exception was the expression testing images, where most of the detected faces were identified by the illumination classifier (DB3) (55% by the Illumination DB versus 28% by the Expression DB). Finally, one can notice that the head pose classifier rarely is used to detect a face in any testing database besides the head pose testing database itself (DB2). This can be explained by the fact that all the other databases offer frontal facial images.

5.6.2 Testing the system on the CBCL face database

The next experiment tested the proposed method on a standardized database available on the Internet, in order to compare the results obtained with the proposed method against other face detection techniques. The chosen face database for this experiment was the MIT Center for Biological & Computational Learning (CBCL) database [25], which includes 31022 face and non-face images. All images in the dataset are 19 x 19 gray scale images, which had their histograms equalized, and then were normalized between 0 and 1. The database is separated into a training and a testing set. The training dataset consisted of 2429 face images and 4548 non-face images, while the testing dataset included 472 face images and 23573 non-face images. The positive examples were generated from a large number of individuals of different ethnicities, with varying illumination and head pose orientations. Two lighting conditions were used: ambient light and a directional source, which varied between −30° and 30° in azimuth and between 30° and 60° in elevation. Head pose orientation varied from −15° to 15° in azimuth, and −8° to 8° in elevation. Figure 5.31 shows some positive samples from this dataset.

It must be noted that this experiment does not treat the face detection problem in a single image, as all images already were cropped to 19x19 windows. However, because the nature of each cropped image was known in advance (face or non-face), it was straightforward to compute the hit rates and the number of false positives.
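The CBCL pre-processing described above, histogram equalization followed by normalization to [0, 1], can be sketched as follows. This NumPy snippet is illustrative only and is not the toolchain used for the thesis experiments:

```python
import numpy as np

def equalize_and_normalize(patch: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale patch and scale it to [0, 1]."""
    hist = np.bincount(patch.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf /= cdf[-1]            # cumulative distribution, already in [0, 1]
    return cdf[patch]         # map each pixel through the CDF

patch = np.random.randint(0, 256, size=(19, 19), dtype=np.uint8)
out = equalize_and_normalize(patch)
```

Mapping each pixel through the cumulative distribution flattens the histogram and, since the CDF itself lies in [0, 1], yields the normalization in the same step.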


Figure 5.31 : Sample (a) face and (b) non-face images from the CBCL database [25], which includes facial images under a wide variety of conditions, and non-face images provided from various backgrounds.

This experiment ran all of the cropped testing images from the CBCL database, collected all the images which were identified as faces by the system, and recorded the distribution of identified faces from the six models (classifiers) used in the proposed algorithm (corresponding to the six challenges). Figure 5.32 presents this distribution for all the cropped images identified as faces.


Figure 5.32 : Distribution of cropped images in the CBCL database identified as faces, using the proposed LLE method, which combines six classifiers trained in the LLE domain, corresponding to the six challenges described in Chapter 2.

With the results obtained on the CBCL database, the performance of the proposed system then was compared to other face detection techniques. Figure 5.33 shows the ROC curves of different face detection methods, including the proposed LLE method.

Figure 5.33 : ROC curves of different systems (the proposed LLE method, Viola-Jones, SNoW, SVM and FDCM), tested on images from the CBCL database.
It is clearly apparent that the blue curve, representing the true/false positive ratio for the proposed LLE method tested on images of the CBCL database, outperforms three of the four selected methods: the SVM, SNoW and FDCM (Face Detection Committee Machine) methods [2][93][29]. The FDCM method, which is based on a combination of SVM, SNoW and NN techniques, performs identically to the LLE method up to a false positive rate of 5%. From then on, the LLE method tends to yield


higher true positive rates as the false positive rate increases. On the other hand, the Viola-Jones rapid detection approach [84] offers slightly better results than the LLE method. More specifically, the curve representing the Viola-Jones method outperforms LLE up to a false positive rate of 0.3, offering higher hit rates for the same false positive rates. Both curves offer equal performance from then on. Chapter 6 will treat the problem of false positives, by looking in more detail at the effect of the size of the non-face set and by using another method for illumination compensation besides histogram equalization. Table 5.15 presents the numerical values of the ROC curves in Figure 5.33.

                               False Detection Rate
True Positive Rate    NN       SNoW     SVM      FDCM     Viola-Jones   LLE
10%                   0.56%    0.41%    0.05%    0.02%    0.00%         0.00%
20%                   1.37%    1.09%    0.16%    0.07%    0.00%         0.01%
30%                   2.54%    1.67%    0.44%    0.14%    0.04%         0.12%
40%                   4.11%    2.92%    0.83%    0.41%    0.13%         0.33%
50%                   6.32%    4.91%    1.60%    0.77%    0.49%         0.81%
60%                   9.47%    8.47%    3.07%    1.41%    0.78%         1.83%
70%                   13.89%   14.67%   5.98%    3.90%    1.59%         3.29%
80%                   26.97%   27.62%   12.32%   7.79%    2.51%         5.71%
90%                   48.95%   49.26%   28.60%   22.92%   5.71%         13.25%

Table 5.15 : Numerical comparison of results on images of the CBCL testing set, between the proposed LLE face detection approach and other face detection methods.

5.7 Summary

In this chapter, a series of experiments conducted on the six facial databases used in the proposed face detection system was presented. These experiments helped to determine the optimal parameters for the LLE and SVM methods. In the final section of the chapter, the proposed method was tested on face and non-face images, and compared to other face detection methods. The results showed the potential of using an LLE technique for face detection. The next chapter will address the final challenge in this research, which is to perform face detection in a single input image.


Chapter 6 Face Detection in a Single Image

This final chapter looks at the face detection problem in a single image. Given a gray scale input image, the software must detect the faces present in the image and identify them by creating windows in an output image that surround the faces. The first section describes the implementation of the face detector by addressing the problems of scanning the detector and coping with multiple detections. Improvements to the face detector then are presented, focusing primarily on diminishing the number of false positives. The final section presents the experiments conducted on the MIT-CMU database, which consists of 130 images containing 510 face images. The results obtained with the proposed LLE method are compared to those achieved with other face detection methods, which constitutes a valid assessment of the proposed method.

6.1 Face Detector Implementation

6.1.1 Scanning the detector

The images were scanned at multiple scales and locations. Good results were obtained by increasing the size of the scanning window by a factor of 1.2 at successive scales. For each size, the detector sequentially visited all image locations, and the extracted windows were input to the LLE face detector. Sequential locations in an image were obtained by shifting the window by ∆ pixels. The scale s of the scanning window affects this shifting process: for example, if the scale of the detector is set at s1, then the window should be shifted by s1∆. The choice of ∆ affects both the speed and the accuracy of the detector. The authors in [84] showed that ∆=1.5 provided a significant speedup with only a slight decrease in accuracy.
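The multi-scale scan described above can be sketched as follows. Here `classify_window` is a hypothetical stand-in for the LLE/SVM detector; the base window size, the scale factor of 1.2 and the shift ∆ follow the text:

```python
import numpy as np

def scan_image(image, classify_window, base_size=19, scale_factor=1.2, delta=1.5):
    """Slide a detector window over the image at multiple scales.

    At scale s the window is round(s * base_size) pixels wide and is
    shifted by round(s * delta) pixels, as described in the text.
    """
    detections = []
    h, w = image.shape
    s = 1.0
    while s * base_size <= min(h, w):
        size = int(round(s * base_size))
        step = max(1, int(round(s * delta)))
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                if classify_window(image[y:y + size, x:x + size]):
                    detections.append((x, y, size))
        s *= scale_factor
    return detections

boxes = scan_image(np.zeros((60, 80)), classify_window=lambda w: False)
```

Scaling the shift with the window keeps the overlap between successive windows constant, which is what makes the larger stride affordable at coarse scales.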

6.1.2 Integration of multiple detections

Since the face detector is insensitive to small changes in translation and scale, multiple detections usually will occur around each face in a scanned image. The same often is true of some types of false positives. In practice, it makes sense to return one final detection per face. To obtain this result, it is useful to post-process the detected sub-windows so as to combine overlapping detections into a single detection. In the experiments presented in this chapter, detections are combined in a very simple fashion. The set of detections first is partitioned into disjoint subsets: two detections belong to the same subset if their bounding regions overlap. Each subset yields a single final detection, whose bounding region corners are the average of the corners of all detections in the subset.
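The merging step can be sketched as follows, under the assumption that detections are axis-aligned boxes (x1, y1, x2, y2); overlapping boxes are grouped transitively and each group's corners are averaged:

```python
def overlap(a, b):
    """True if two boxes (x1, y1, x2, y2) intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_detections(boxes):
    """Partition boxes into transitively overlapping groups, then return
    one box per group whose corners are the average of the group's corners."""
    groups = []
    for box in boxes:
        # All existing groups this box touches get fused into one.
        linked = [g for g in groups if any(overlap(box, b) for b in g)]
        merged = [box] + [b for g in linked for b in g]
        groups = [g for g in groups if g not in linked] + [merged]
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

dets = [(10, 10, 29, 29), (12, 12, 31, 31), (60, 60, 79, 79)]
final = merge_detections(dets)
```

In this example the two overlapping boxes collapse into one averaged box, while the isolated box survives unchanged.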

6.2 Improvements on the Face Detector

The detection results presented in the previous chapter were obtained with a very rudimentary classifier, using only 3000 non-face images for training the system, and using histogram equalization for illumination compensation as a pre-processing step. Furthermore, certain aspects of the detector, such as different fusion strategies for the six-decision process, were not investigated. In this section, we present experiments that analyzed the performance of the proposed LLE face detection system under the effect of three different factors: (1) the illumination compensation technique used in the pre-processing steps of the system; (2) the size of the non-face training set; and (3) the method used to fuse the information provided by the six-decision process. The tests were conducted on images taken from the MIT-CMU database [65]. The different configurations were analyzed and compared by studying their respective Receiver Operating Characteristic (ROC) curves.


6.2.1 Illumination compensation

The version of the system presented in Chapter 5 used histogram equalization as a pre-processing step for eliminating illumination variance. Another possibility is to compute the Fast Fourier Transform (FFT) of the image and take its phase image, which is an illumination- and contrast-invariant measure of feature significance in the image [36] (see Section 3.2). In this experiment, three different configurations were tested for illumination compensation: (1) no compensation for illumination; (2) compensation using histogram equalization; and (3) taking the FFT of the images and using only the phase image as the training image. The ROC curves for these three configurations are depicted in Figure 6.1.

Figure 6.1 : ROC curves indicating the performance of the proposed LLE face detector, using three different pre-processing methods for illumination compensation.

By looking at the curves in Figure 6.1, it is clear that using the phase image obtained by computing the FFT of the input image outperforms equalizing the image's histogram to eliminate illumination variance. A possible explanation is that histogram equalization tends to create face-like patterns in relatively smooth areas, such as background regions, thereby producing a higher number of false positives. Histogram equalization does, however, generate better results than using no compensation at all.
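The phase-image idea can be illustrated as follows: set every Fourier magnitude to one and invert, so that only phase information survives. This sketch assumes the phase image is used directly as a spatial-domain training image; because brightness and contrast live mostly in the magnitude spectrum, rescaling the input leaves the result essentially unchanged:

```python
import numpy as np

def phase_image(img: np.ndarray) -> np.ndarray:
    """Keep only the Fourier phase: force all magnitudes to one, then invert.

    Feature locations are carried by the phase spectrum, so the result is
    (nearly) invariant to brightness and contrast changes of the input.
    """
    spectrum = np.fft.fft2(img)
    unit_spectrum = np.exp(1j * np.angle(spectrum))  # magnitude forced to 1
    return np.real(np.fft.ifft2(unit_spectrum))

img = np.random.rand(19, 19)
out = phase_image(img)
# A brightness/contrast change of the input barely affects the phase image:
out_scaled = phase_image(0.5 * img + 0.1)
```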


6.2.2 Non-face training set

As stated earlier, Chapter 5 presented a version of the face detector that used only 3000 non-face images from the Viola Database [84] to train the classifiers. In this experiment, the effects of the size of the non-face set, as well as of a bootstrap method adapted from [77], were analyzed on the system, now using the phase image computed in the pre-processing step (see Section 6.2.1) to compensate for illumination. The bootstrap method, which is described below, used three sets, each consisting of 40 scenery and background images, to collect negative examples.

Bootstrap method:

Step 1 - Using the initial set of 3000 non-face images provided by the Viola Database [84], apply the pre-processing steps (resizing and computing the FFT phase image) to each of these images.

Step 2 - Train the proposed LLE system to produce an output of +1 for the face examples, and -1 for the non-face examples.

Step 3 - Run the system on a set of scenery and background images that contains no faces. Collect the sub-images for which the system incorrectly identifies a face (confidence level > 0.5).

Step 4 - Extract all of the selected sub-images, apply the same pre-processing operations as in Step 1, and add them to the training set as negative examples. Go to Step 2. This is repeated two more times, as suggested in reference [77].
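The four steps above can be sketched as the following loop; `train`, `preprocess` and `find_false_positives` are hypothetical stand-ins for the thesis' SVM training, pre-processing and scanning stages:

```python
def bootstrap(faces, nonfaces, scenery_sets, train, preprocess,
              find_false_positives, rounds=3):
    """Iteratively grow the negative set with the detector's own mistakes.

    Step 1: pre-process the initial negatives; Step 2: train (+1 for faces,
    -1 for non-faces); Steps 3-4: collect false positives (confidence > 0.5)
    on face-free scenery images, add them as negatives, and retrain.
    """
    negatives = [preprocess(n) for n in nonfaces]        # Step 1
    model = train(faces, negatives)                      # Step 2
    for scenery in scenery_sets[:rounds]:                # Steps 3-4, repeated
        mistakes = find_false_positives(model, scenery)  # confidence > 0.5
        negatives += [preprocess(m) for m in mistakes]
        model = train(faces, negatives)
    return model, negatives

# Tiny mock run: 'train' just reports the negative-set size, and every
# scenery sub-image is treated as a false positive.
model, negs = bootstrap(
    faces=["f"], nonfaces=["n1", "n2"],
    scenery_sets=[["s1"], ["s2"], ["s3"]],
    train=lambda f, n: len(n),
    preprocess=lambda x: x,
    find_false_positives=lambda m, s: s)
```

The value of the loop is that the negatives added are precisely the patterns the current classifier confuses with faces, which is far more informative than sampling random background windows.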

Six different non-face image sets were tested for training the SVM classifiers:

1- 3000 non-face images from the Viola Database;
2- 3000 non-face images from the Viola Database, then using the bootstrap method to retrain the SVM classifiers;
3- 14000 non-face images from the Viola Database;
4- 37000 non-face images from the Viola Database;


5- 100000 non-face images, combining 37000 from the Viola Database, 25000 from the CBCL Database and 38000 background and scenery images from the MIT Database;
6- 100000 non-face images, combining 37000 from the Viola Database, 25000 from the CBCL Database and 38000 background and scenery images from the MIT Database, then using the bootstrap method to retrain the SVM classifiers.

The ROC curves for these six configurations are shown in Figure 6.2. In the proposed face detection system, six classifiers each output a probability, P (see Section 5.6), that the input image is a face. For a given input image, we therefore obtain six probabilities: P1, P2, P3, P4, P5 and P6. In this experiment, the method of fusion was the maximum probability, P = max(P1, P2, P3, P4, P5, P6).


Figure 6.2 : Testing six different non-face image sets for training. (a) ROC curves indicating the performance of the proposed LLE face detector, (b) Equal Error Rate (EER) with respect to the number of non-face images.

Given these results, it is clear that face detection accuracy is influenced greatly by the number of non-face images. The bootstrap method also appears to increase the accuracy of face detection significantly, as seen by comparing the ROC curves for 100000 non-faces with and without the bootstrap method. Furthermore, assuming that the curve in Figure 6.2 (b) is linear, it would take 209000 non-faces to obtain an Equal Error Rate of 0. However, there are too few points to be sure of the trend, and this hypothesis is invalid if the curve is, for example, exponential.


6.2.3 Fusion method for the six-decision process

Using the classifier configuration that demonstrated the best results in the experiment conducted above (configuration #6, using the FFT phase image), three types of fusion strategies presented in reference [62] were tested: (1) P = max(P1, P2, P3, P4, P5, P6); (2) P = P1 + P2 + P3 + P4 + P5 + P6; and (3) P = P1P2P3P4P5P6. Figure 6.3 presents the ROC curves for each of these three methods:

Figure 6.3 : ROC curves indicating the performance of the proposed LLE face detector, using three different fusion strategies.

While the sum of the probabilities, P = P1 + P2 + P3 + P4 + P5 + P6, generates slightly better results than the maximum probability, P = max(P1, P2, P3, P4, P5, P6) (Area Under Curve = 0.9912 versus 0.9904), the product fusion method, P = P1P2P3P4P5P6, is inferior to both other methods. Furthermore, the sum requires no additional computation.
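The three fusion rules can be written directly. The numeric example below uses hypothetical classifier outputs and illustrates why the product rule suffers: a single sceptical classifier drives the fused score toward zero even when another classifier is confident.

```python
def fuse_max(ps):
    """P = max(P1, ..., P6): a face if any single classifier is confident."""
    return max(ps)

def fuse_sum(ps):
    """P = P1 + ... + P6: the raw sum; dividing by six would give a score
    in [0, 1] but only shifts the decision threshold."""
    return sum(ps)

def fuse_product(ps):
    """P = P1 * ... * P6: one near-zero output vetoes all the others."""
    out = 1.0
    for p in ps:
        out *= p
    return out

probs = [0.9, 0.2, 0.1, 0.3, 0.1, 0.2]   # hypothetical six classifier outputs
scores = (fuse_max(probs), fuse_sum(probs), fuse_product(probs))
```

With these outputs the maximum keeps the one confident vote (0.9), the sum remains well away from zero, and the product collapses to about 1e-4, which matches the ranking observed in Figure 6.3.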

6.2.4 Contribution from classifiers

The last experiment analyzed the contribution of each of the six databases (DB1, Frontal; DB2, Head Pose; DB3, Illumination; DB4, Occlusion; DB5, Expression; and DB6, Aging) to the performance of the face detection system. We compared the performance of the system that used the sum of probabilities from all six decisions (see 6.2.3) against each of the databases assessed individually. This was done by running the


system six times, each time with only one database. This experiment was performed on the enhanced system, computing the FFT phase image at the pre-processing stage (Section 6.2.1) and using 100000 non-facial images with the bootstrap method (configuration #6 in Section 6.2.2), while fusing the classifiers' outputs with a sum. The ROC curves are shown in Figure 6.4.

Figure 6.4 : ROC curves indicating the performance of the proposed LLE face detector when using each classifier individually.

The first observation is that the database that influences the classifier the most is the Frontal Database (DB1). At a 90% detection rate, this database alone yields a false acceptance rate of 1e-05, compared to 4e-06 for the system that fuses all the data. The expression, illumination and aging databases all offer respectable results, while the occlusion and head pose databases yield the lowest scores. This can be explained by the fact that the majority of faces present in the input images of the MIT-CMU database [65] are frontal faces of various ages, with a wide variety of facial expressions. Very few faces are occluded, and even fewer exhibit the wide range of head rotations that the Head Pose Database (DB2) represents.

In conclusion, the best configuration for the proposed LLE face detection system first compensates for illumination variance by means of the phase image computed by the FFT of the input image. The system's SVM classifiers were trained with 100000 non-face images using the bootstrap method. Finally, the six classifier outputs were fused by summing them and estimating the probability, P, that an input image is a face. Figure 6.5 provides a schematic representation of the final version of the proposed LLE face detector, which will be tested on the MIT-CMU database in the next section.

Figure 6.5 : Schematic representation of the final version of the proposed LLE detector.

6.3 Face Detection Results

6.3.1 Image database

The final system was tested on the MIT-CMU database [65], which consists of three large sets of images. Test Set A was collected at Carnegie Mellon University (CMU), and consists of 42 scanned photographs, newspaper pictures, images collected from the World Wide Web, and digitized television pictures. These images contain 169 frontal views of faces, and require that the system examine 22,053,124 windows. Test Set B consists of 23 images containing 155 faces (9,678,084 windows); it was used in [77] to measure the accuracy of that system. Test Set C is similar to Test Set A, but contains some images with more complex backgrounds and without any faces, in order to measure the false detection rate more accurately. It contains 65 images, 183 faces, and 51,368,003 windows. Figure 6.6 presents sample images from these three sets.


Figure 6.6 : Sample images from the three subsets of the MIT-CMU image database [65], consisting of 130 images containing 510 face images. (a) Sample image from Test Set A, (b) Sample image from Test Set B and (c) Sample image from Test Set C.

6.3.2 Face detection results on the MIT-CMU Database

The proposed LLE face detector was tested on the images from the MIT-CMU database. The resulting ROC curve was plotted along with the ROC curves of two other face detection methods, the Rowley-Kanade system [65] and the Viola-Jones method [84], both of which also were tested on the MIT-CMU database. These curves provide the opportunity to assess and compare the performance of the proposed LLE method against other techniques. In our experiments, 75,081,800 sub-windows were scanned to test the entire MIT-CMU database. The ROC curves are shown in Figure 6.7.

Scrutinizing the three ROC curves, the proposed LLE method clearly outperforms the Rowley-Kanade system, which is based on a set of neural networks, and compares very well with the Viola-Jones method. Table 6.1 presents the detection rates and numbers of false detections for several face detection methods, which amounts to taking a single point on each ROC curve. Comparing the numbers of false detections at identical detection rates, the proposed technique outperforms most of the other methods. At a detection rate of 90%, the proposed method generates 71 false detections, compared to 167 for the best system presented by Rowley-Kanade [65], but yields a slightly higher number of false detections than the Viola-Jones method, with 52. However, at a higher detection rate of 97%, LLE generates 3750 false detections, compared to 6520 for the Viola-Jones method. Finally, the proposed LLE method offers a lower Equal Error Rate (EER) than the Viola-Jones method (0.08 versus 0.09) (see Figure 6.7). The final chapter will present some examples of undetected faces and false detections generated by the proposed system.

Figure 6.7 : ROC curves (using a log scale on the horizontal axis) for the proposed LLE face detector, the Rowley-Kanade system [65] and the Viola-Jones method [84], each tested on the MIT-CMU Database.

Method                                  Detection Rate   False Detections
Linear SVM+G+SVM [1]                    82.3%            322
Linear SVM+G-SVM1+G-SVM2 [1]            75.4%            68
Hierarchical SVM [47]                   82.5%            301
Hybrid Cascade-Wavelet System [83]      82.4%            319
Proposed LLE System                     82.5%            15
Rowley-Kanade System 11 [65]            90.2%            167
Viola-Jones Rapid System [84]           90.0%            52
Proposed LLE System                     90.0%            71
Rowley-Kanade System 11 [65]            97.0%            64529
Viola-Jones Rapid System [84]           97.0%            6520
Proposed LLE System                     97.0%            3750

Table 6.1 : Comparison of detection rates for various face detection techniques, tested on the MIT-CMU Database [65].

Figure 6.8 shows results of our face detector on images from the MIT-CMU Database.


Figure 6.8 : Results obtained from the proposed LLE face detector on a number of test images from the MIT-CMU Database.


Chapter 7 Conclusion

As stated at the outset, the problem of human face detection is challenging, because faces are non-rigid objects that have a high degree of variability in size, shape, color, and texture. Indeed, few methods in the literature have analyzed the performance of face detection under different conditions such as head rotation, illumination, facial expression, occlusion and aging. The most popular recent methods are based on statistical learning, and tend to increase data complexity by mapping it onto a higher-dimensional space in order to extract the predominant features. A novel technique, known as Locally Linear Embedding (LLE), adopts a different approach to the problem by applying dimensionality reduction to the data for learning and classification. The objective of this method is to determine a locally-linear fit so that each data point can be represented by a linear combination of its closest neighbors.

This thesis presents a face detection method that applies the LLE algorithm to 2D facial images to obtain their representation in a sub-space under the unfavorable conditions stated above. The low-dimensional data then is used to train Support Vector Machine (SVM) classifiers to classify images as being a face or a non-face. In this thesis, six different databases of cropped facial images, corresponding to ideal frontal faces and to variations in head rotation, illumination, facial expression, occlusion and aging, were utilized to train and test these classifiers.
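The LLE computation at the heart of this approach (reconstruct each point from its nearest neighbors, then find low-dimensional coordinates that preserve those reconstruction weights) can be sketched in NumPy as follows. This toy version is illustrative only: it omits the SVR out-of-sample mapping and the SVM classification stage of the full system, and is run here on a synthetic curve rather than facial images.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Minimal Locally Linear Embedding (Roweis & Saul).

    1) find each point's nearest neighbors, 2) solve for the weights that
    best reconstruct each point from its neighbors, 3) take the bottom
    eigenvectors of M = (I - W)^T (I - W) as the embedding coordinates.
    """
    n = X.shape[0]
    # Pairwise squared distances -> indices of the k nearest neighbors.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude each point itself
    knn = np.argsort(d2, axis=1)[:, :n_neighbors]

    W = np.zeros((n, n))
    for i in range(n):
        Z = X[knn[i]] - X[i]                     # neighbors centred on x_i
        C = Z @ Z.T                              # local Gram matrix
        C += np.eye(n_neighbors) * reg * np.trace(C)   # regularization
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, knn[i]] = w / w.sum()               # weights sum to one

    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, eigvecs = np.linalg.eigh(M)               # ascending eigenvalues
    # Discard the constant eigenvector (smallest eigenvalue, ~0).
    return eigvecs[:, 1:n_components + 1]

# Toy data: points along a smooth 1-D curve embedded in 3-D.
t = np.linspace(0, 3, 120)
X = np.c_[np.cos(t), np.sin(t), t]
Y = lle(X, n_neighbors=8, n_components=1)
```

On this curve the single embedding coordinate recovers the ordering of the points along the curve (up to sign), which is the property the thesis exploits: nearby facial patches stay nearby in the low-dimensional space where the SVM operates.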


This thesis evaluated the feasibility of using the combined efficacy of the six SVM classifiers within a two-stage face detection approach. This thesis also presented a set of detailed experiments on face detection databases that have been used widely in previous studies. These datasets included faces under a very wide range of conditions, including variations in illumination, scale, pose, and camera conditions. Experiments conducted on the CBCL and MIT-CMU databases demonstrated that the results obtained with this new system compared very well to other popular face detection methods. They also demonstrated that face detection accuracy using LLE can be similar to and sometimes better than most of the best previously-reported detection methods, indicating that the system functions well under many adverse conditions. We, therefore, conclude that the proposed method is a viable and accurate technique to perform face detection, and possibly also to perform face recognition.

We propose several improvements and additions to our system:

 In this thesis, the size of the positive training sets (facial images) varied approximately between 300 and 500 images for each of the six classifiers, while the non-face set included 100000 images. Increasing the size of the positive and negative training databases may help to improve the hit rate of the face detector, while lowering the number of false detections.

 The computational time for training and testing the system is quite demanding, taking approximately 30 minutes in both cases (testing with a 320x280 image). Significant software improvements, such as using optimized mathematical libraries, can be made to improve the time performance of the proposed method.

 A property inherent in LLE is that its embeddings are optimized to preserve the geometry of nearby inputs. Thus, the most common failure mode of LLE is to map faraway inputs to nearby outputs in the embedding space. This makes it difficult to optimally separate face points from background images, which are sometimes mapped near the face map. Additionally, the proposed method uses analytical regression to map a point from dimension D (input space) to d (intrinsic dimensionality), using d regression models that were learned independently with SVR. This ignores possible cross-over information, which may lead to inefficient regression models. Implementing a multi-output support vector regression method [82] for mapping new data may help to improve the training of the SVM classifiers.

 The current version of the Head Pose Database (DB2) includes images of faces rotating laterally between +90° and −90°. Images of faces rotating in the frontal plane, or in elevation, were not included. This may cause the system to miss faces with such rotations, as illustrated in Figure 7.1, which shows faces rotated in the frontal plane that were not detected. Hence, including images with larger variations in head pose, such as in elevation, may help detect faces under a wider spectrum of rotations.

Figure 7.1 : Examples of face images undetected by the proposed LLE face detector, because of head rotation in the frontal plane.

 While testing images from the MIT-CMU database (Figure 6.8), the proposed system identified some non-face images as being faces. Even though these images clearly are false positives, they do give the appearance of a face in these gray scale examples (Figure 7.2). However, if these were color images, it would be quite clear that these regions do not contain skin-tone pixels. Hence, adding color information to the LLE algorithm when embedding 2D images into a lower-dimensional space, when input color images are available, may help to diminish the number of false positives.

Figure 7.2 : Examples of false positives generated by the proposed LLE face detector, which give the appearance of faces.

The proposed modifications can be integrated seamlessly into our software for improved results. In conclusion, our approach is an excellent stepping-stone for further research in the field of face analysis. Our system also can be made quite generic and have broader applications in computer vision and image processing.


References

References [1] [2] [3]

[4]

[5] [6] [7] [8]

[9] [10] [11] [12] [13] [14] [15]

[16]

[17]

H. Ai, L. Ying, and G. Xu, “A Subspace Approach to Face Detection with Support Vector machines”, Proc. IEEE Int’l Conf. on Pattern Recognition, 2002. M. Alvira and R. Rifkin, “An Empirical Comparison of SNow and SVMs for Face Detection”, CBCL Paper #193 / AI Memo, MIT, 2001. M. Bartlett and T.J. Sejnowski, “Independent components of face images: A representation for face recognition”, Proc. of the 4th Annual Joint Symposium on Neural Computation, 1997. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997. C. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?”, SIGKDD Explorations, 2(2):1-13, 2000. D. J. Beymer, “Face recognition under varying pose”, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 756-761, 1994. J. Bhattacharyya and M. Levine, "Detecting and Removing Specularities in Facial Images", Computer Vision and Image Understanding, 2004. B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers”, Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, 1992. R. Brunelli and T. Poggio, “Face recognition: Features versus templates”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, 1993. C. Chang and C. Lin, LIBSVM: a Library for Support Vector Machines, National Taiwan University, Dept. of Comp. Science and Information Eng., Taiwan, 2004. D. Chetverikov and A. Lerch, “Multiresolution Face Detection”, Theoretical Foundations of Computer Vision, vol. 69, pp. 131-140, 1993. T. Cootes, K. Walker, and C. Taylor, “View-based active appearance models”, Proc. of Int’l Workshop on Automatic Face- and Gesture- Recognition, 2000. T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994. J.L. Crowley and J.M. Bedrune, “Integration and Control of Reactive Visual Processes”, Proc. 
3rd European Conf. Computer Vision, vol. 2, pp. 47-58, 1994. P. Demartines and J. Herault, “Curvilinear component analysis: A self-organizing neural network for non-linear mapping of data sets”, IEEE Trans. on Neural Networks, 8(1):148-154, 1997. Y. Freund and R. Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting”, Computational Learning Theory: Eurocolt ’95, pp. 23–37, Springer-Verlag, 1995. M. Gandhi, “A Method for Automatic Synthesis of Aged Human Facial Images”, Master Thesis, McGill University, 2004.

119

References

[18] A.S. Georghiades, D.J. Kriegman, and P.N. Belhumeur, “Illumination cones for recognition under variable lighting: Faces”, Proc. IEEE Int’l Conf. on Computer Vision and Pattern Recognition, pp. 52-59, 1998.
[19] V. Govindaraju, “Locating Human Faces in Photographs”, Int’l J. Computer Vision, 19(2):129-146, 1996.
[20] H.P. Graf, T. Chen, E. Petajan, and E. Cosatto, “Locating Faces and Facial Parts”, Proc. 1st Int’l Workshop on Automatic Face and Gesture Recognition, pp. 41-46, 1995.
[21] D.B. Graham and N.M. Allinson, “Characterizing Virtual Eigensignatures for General Purpose Face Recognition”, Face Recognition: From Theory to Applications, Computer and Systems Sciences, vol. 163, pp. 446-456, 1998.
[22] I. Guizatdinova and V. Surakka, “Detection of Facial Landmarks from Neutral, Happy, and Disgust Facial Images”, Int’l Conf. in Central Europe on Computer Graphics, pp. 55-62, 2005.
[23] A. Hadid, O. Kouropteva, and M. Pietikäinen, “Unsupervised learning using locally linear embedding: experiments in face pose analysis”, Proc. IEEE Int’l Conf. on Pattern Recognition, vol. 1, pp. 111-114, 2002.
[24] A. Hadid and M. Pietikäinen, “Selecting models from videos for appearance-based face recognition”, Proc. Int’l Conf. on Pattern Recognition, pp. 304-308, 2004.
[25] B. Heisele, T. Poggio, and M. Pontil, “Face Detection in Still Gray Images”, A.I. Memo AIM-1687, Artificial Intelligence Laboratory, MIT, 2000.
[26] K. Hotta, “A Robust Face Detection under Partial Occlusion”, Proc. Int’l Conf. on Image Processing, pp. 597-600, 2004.
[27] The Internet Movie Database, http://www.imdb.com/.
[28] A. Jalali and F. Oroumchian, “Rich Document Representation for Document Clustering”, Conf. on Comp. Assisted Info. Retrieval, 2004.
[29] K.F. Jang, H.M. Tang, M. Lyu, and I. King, “A Face Processing System Based on Committee Machine: The Approach and Experimental Results”, Proc. Comp. Analysis of Images and Patterns, pp. 614-622, 2003.
[30] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[31] F. Katagiri and J. Glazebrook, “Local Context Finder (LCF) reveals multidimensional relationships among mRNA expression profiles of Arabidopsis responding to pathogen infection”, Proc. Natl. Acad. Sci., pp. 10842-10847, 2003.
[32] D.G. Kendall, “Shape Manifolds, Procrustean Metrics, and Complex Projective Shapes”, Bull. London Math. Soc., vol. 16, pp. 81-121, 1984.
[33] S.-H. Kim, N.-K. Kim, S.C. Ahn, and H.-G. Kim, “Object Oriented Face Detection Using Range and Color Information”, Proc. 3rd Int’l Conf. Automated Face and Gesture Recognition, pp. 68-81, 1998.
[34] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve Procedure for the Characterization of Human Faces”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(1):103-108, 1990.
[35] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, vol. 30, Springer, Berlin, 1997.
[36] P. Kovesi, “Symmetry and Asymmetry From Local Phase”, 10th Australian Joint Conf. on A.I., 1997.



[37] C. Kotropoulos and I. Pitas, “Rule-Based Face Detection in Frontal Views”, Proc. Int’l Conf. Acoustics, Speech and Signal Processing, vol. 4, pp. 2537-2540, 1997.
[38] O. Kouropteva, O. Okun, and M. Pietikäinen, “Selection of the Parameter Value for the Locally Linear Embedding Algorithm”, Int’l Conf. on Fuzzy Systems and Knowledge Discovery, pp. 359-363, 2003.
[39] O. Kouropteva, O. Okun, and M. Pietikäinen, “Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine”, Proc. 11th Eur. Symp. on Artificial Neural Networks, pp. 229-234, 2003.
[40] K.M. Lam and H. Yan, “An analytic-to-holistic approach for face recognition based on a single frontal view”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(7):673-686, 1998.
[41] A. Lanitis, C.J. Taylor, and T.F. Cootes, “An Automatic Face Identification System Using Flexible Appearance Models”, Image and Vision Computing, 13(5):393-401, 1995.
[42] J.A. Lee, A. Lendasse, N. Donckers, and M. Verleysen, “A robust non-linear projection method”, Proc. 8th Eur. Symp. on Artificial Neural Networks, 2000.
[43] T.K. Leung, M.C. Burl, and P. Perona, “Finding Faces in Cluttered Scenes Using Random Labeled Graph Matching”, Proc. IEEE Int’l Conf. Computer Vision, pp. 637-644, 1995.
[44] Y. Li, S. Gong, and H. Liddell, “Support vector regression and classification based multi-view face detection and recognition”, Proc. Int’l Workshop on Automatic Face- and Gesture-Recognition, 2000.
[45] Y. Li, S. Gong, J. Sherrah, and H. Liddell, “Multi-view Face Detection Using Support Vector Machines and Eigenspace Modelling”, Proc. Int’l Conf. on Knowledge-Based Intelligent Eng. Sys. and Allied Tech., pp. 241-245, 2000.
[46] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection”, Proc. 25th German Pattern Recognition Symposium, 2003.
[47] Y. Ma and X. Ding, “Face Detection Based on Hierarchical Support Vector Machines”, Proc. IEEE Int’l Conf. on Pattern Recognition, 2002.
[48] A. Martinez and R. Benavente, “The AR Face Database”, Technical Report 24, Computer Vision Center (CVC), Barcelona, Spain, 1998.
[49] S. McKenna, Y. Raja, and S. Gong, “Tracking Color Objects Using Adaptive Mixture Models”, Image and Vision Computing, 17(3):223-229, 1998.
[50] N. Mekuz, C. Bauckhage, and J.K. Tsotsos, “Face Recognition with Weighted Locally Linear Embedding”, 2nd Canadian Conf. on Computer and Robot Vision, pp. 290-296, 2005.
[51] J. Min, P. Flynn, and K. Bowyer, “Assessment of time dependency in face recognition”, Technical Report TR-04-12, University of Notre Dame, 2004.
[52] The MNIST Database, http://yann.lecun.com/exdb/mnist/index.html.
[53] B. Moghaddam and A. Pentland, “Probabilistic Visual Learning for Object Recognition”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.
[54] E. Osuna, R. Freund, and F. Girosi, “Training Support Vector Machines: An Application to Face Detection”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 130-136, 1997.



[55] S. Pigeon and L. Vandendorpe, “The M2VTS Multimodal Face Database”, Proc. 1st Int’l Conf. on Audio- and Video-Based Biometric Person Authentication, 1997.
[56] M. Polito and P. Perona, “Grouping and dimensionality reduction by locally linear embedding”, Advances in Neural Information Processing Systems, 2001.
[57] V. Popovici and J.-P. Thiran, “Face Detection Using an SVM Trained in Eigenfaces Space”, Proc. 4th Int’l Conf. on Audio- and Video-Based Biometric Person Authentication, 2003.
[58] A. Rajagopalan, K. Kumar, J. Karlekar, R. Manivasakan, M. Patil, U. Desai, P. Poonacha, and S. Chaudhuri, “Finding Faces in Photographs”, Proc. IEEE Int’l Conf. Computer Vision, pp. 640-645, 1998.
[59] D. Reisfeld, N. Arad, and Y. Yeshurun, “Normalization of face image using few anchors”, Proc. IEEE Int’l Conf. on Pattern Recognition, pp. 761-763, 1994.
[60] D. de Ridder, O. Kouropteva, and O. Okun, “Supervised locally linear embedding”, Proc. Joint Int’l Conf. ICANN/ICONIP, vol. 2714, pp. 333-341, 2003.
[61] D. de Ridder and R.P.W. Duin, “Locally linear embedding for classification”, Pattern Recognition Group Technical Report Series, Delft University of Technology, 2002.
[62] A. Ross and A. Jain, “Information fusion in biometrics”, Pattern Recognition Letters, vol. 24, pp. 2115-2125, 2003.
[63] S. Roweis and L. Saul, “Non-linear Dimensionality Reduction by Locally Linear Embedding”, Science, vol. 290, 2000.
[64] H. Rowley, S. Baluja, and T. Kanade, “Human Face Detection in Visual Scenes”, Advances in Neural Information Processing Systems 8, D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, eds., pp. 875-881, 1996.
[65] H. Rowley, S. Baluja, and T. Kanade, “Neural Network-Based Face Detection”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[66] T. Sakai, M. Nagao, and S. Fujibayashi, “Line Extraction and Pattern Detection in a Photograph”, Pattern Recognition, vol. 1, pp. 233-248, 1969.
[67] A. Samal and P.A. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey”, Pattern Recognition, 25(1):65-77, 1992.
[68] F. Samaria and S. Young, “HMM Based Architecture for Face Identification”, Image and Vision Computing, vol. 12, pp. 537-583, 1994.
[69] J.W. Sammon, “A non-linear mapping for data structure analysis”, IEEE Trans. on Computers, 18(5):401-409, 1969.
[70] L.K. Saul and S.T. Roweis, “Think globally, fit locally: unsupervised learning of non-linear manifolds”, Technical Report MS CIS-02-18, University of Pennsylvania, 2002.
[71] D. Saxe and R. Foulds, “Toward Robust Skin Identification in Video Images”, Proc. 2nd Int’l Conf. Automatic Face and Gesture Recognition, pp. 379-384, 1996.
[72] B. Scholkopf, “Statistical Learning and Kernel Methods”, Technical Report MSR-TR-2000-23, Microsoft Research Limited, 2000.
[73] A. Shilton, M. Palaniswami, D. Ralph, and A. Tsoi, “Incremental Training of Support Vector Machines”, IEEE Trans. on Neural Networks, 16(1):114-131, 2005.
[74] P. Sinha, “Object Recognition via Image Invariants: A Case Study”, Investigative Ophthalmology and Visual Science, 35(4):1735-1740, 1994.



[75] S.A. Sirohey, “Human Face Segmentation and Identification”, Technical Report CS-TR-3176, Univ. of Maryland, 1993.
[76] K. Sobottka and I. Pitas, “Face Localization and Feature Extraction Based on Shape and Color Information”, Proc. IEEE Int’l Conf. Image Processing, 1996.
[77] K.-K. Sung and T. Poggio, “Example-Based Learning for View-Based Human Face Detection”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(1):39-51, 1998.
[78] M.J. Swain and D.H. Ballard, “Color Indexing”, Int’l J. Computer Vision, 7(1):11-32, 1991.
[79] J.B. Tenenbaum, V. de Silva, and J.C. Langford, “A global geometric framework for non-linear dimensionality reduction”, Science, vol. 290, pp. 2319-2323, 2000.
[80] M.A. Turk and A.P. Pentland, “Face Recognition Using Eigenfaces”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 586-591, 1991.
[81] V. Vapnik, The Nature of Statistical Learning Theory, 2nd Edition, Springer-Verlag, 1997.
[82] E. Vazquez and E. Walter, “Multi-output support vector regression”, 13th IFAC Symposium on System Identification, pp. 1820-1825, 2003.
[83] R. Verschae and J. Ruiz-del-Solar, “A Hybrid Face Detector Based on an Asymmetrical Adaboost Cascade Detector and a Wavelet-Bayesian-Detector”, Int’l Work-Conf. on Artificial and Natural Neural Networks, pp. 742-749, 2003.
[84] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2001.
[85] J. Wang, Z. Changshui, and K. Zhongbao, “An Analytical Mapping for LLE and Its Application in Multi-Pose Face Synthesis”, 14th British Machine Vision Conf., 2003.
[86] P. Wang and Q. Ji, “Multi-View Face Detection under Complex Scene Based on Combined SVMs”, Proc. IEEE Int’l Conf. on Pattern Recognition, 2004.
[87] Y. Yacoob, H.-M. Lam, and L. Davis, “Recognizing faces showing expressions”, Proc. IWAFGR ’95, pp. 278-283, 1995.
[88] G. Yang and T.S. Huang, “Human Face Detection in Complex Background”, Pattern Recognition, 27(1):53-63, 1994.
[89] J. Yang and A. Waibel, “A Real-Time Face Tracker”, Proc. 3rd Workshop on Applications of Computer Vision, pp. 142-147, 1996.
[90] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002.
[91] M.-H. Yang and N. Ahuja, “Detecting Human Faces in Color Images”, Proc. IEEE Int’l Conf. Image Processing, vol. 1, pp. 127-130, 1998.
[92] M.-H. Yang, N. Ahuja, and D. Kriegman, “Mixtures of Linear Subspaces for Face Detection”, Proc. 4th Int’l Conf. Automatic Face and Gesture Recognition, 2000.
[93] M.-H. Yang, D. Roth, and N. Ahuja, “A SNoW-Based Face Detector”, Advances in Neural Information Processing Systems 12, pp. 855-861, 2000.
[94] K.C. Yow and R. Cipolla, “A Probabilistic Framework for Perceptual Grouping of Features for Human Face Detection”, Proc. 2nd Int’l Conf. Automatic Face and Gesture Recognition, pp. 16-21, 1996.
[95] A. Yuille, P. Hallinan, and D. Cohen, “Feature Extraction from Faces Using Deformable Templates”, Int’l J. Computer Vision, 8(2):99-111, 1992.
