FAST AND ACCURATE HOLISTIC FACE RECOGNITION USING OPTIMUM-PATH FOREST

João P. Papa, Alexandre X. Falcão
University of Campinas, Computer Science Institute
[email protected], [email protected]

Alexandre L. M. Levada, Débora C. Corrêa
University of São Paulo, Physics Institute of São Carlos
[email protected], [email protected]

Denis H. P. Salvadeo, Nelson D. A. Mascarenhas
Federal University of São Carlos, Computer Science Department
[email protected], [email protected]
ABSTRACT

This paper presents a novel, fast and accurate holistic method for face recognition using the Optimum-Path Forest (OPF) classifier. Our objective is to improve face recognition accuracy and to reduce the computational effort of face recognition tasks. During the feature extraction stage we apply Principal Component Analysis to create reduced vector descriptors of several dimensionalities. Experiments using face images from two public datasets (ORL and CBCL) show good results. Comparisons with two other widely used supervised classifiers, Artificial Neural Networks based on the Multilayer Perceptron and Support Vector Machines, show that the proposed method drastically reduces the computational cost while achieving correct classification rates at least comparable to SVM.

Index Terms— Face Recognition, OPF, SVM, MLP, PCA

1. INTRODUCTION

In recent decades, the human face has been explored in a variety of neural network, computer vision and pattern recognition applications. One of the greatest challenges among them is face recognition. Face recognition has several advantages over other biometric technologies, the most important one being that it is a natural and nonintrusive approach [1]. A face recognition system can operate in two distinct modes: face verification (authentication) or face identification (recognition), the latter being the main focus of this work. Recognition involves one-to-many matching, comparing a query face against all template images in the database [1].

Face recognition methods fall into three categories: holistic matching methods, feature-based matching methods and hybrid methods [2]. Many technical challenges in face recognition are common to all existing approaches; according to [1], the key ones are large variability, highly complex nonlinear manifolds, high dimensionality and small sample size. Holistic methods use the whole face as input. Statistical techniques based on principal component analysis (Eigenfaces) [3] and other feature extraction methods such as Linear Discriminant Analysis (Fisherfaces) [4] and Independent Component Analysis (ICA) [5] belong to this class. The proposed Optimum-Path Forest (OPF) face recognition approach is also a holistic method, since it uses features that are linear combinations of the entire face. Feature-based methods rely on the extraction of local features such as statistical and geometrical measures. Finally, hybrid methods combine local features with the global face representation during the classification stage.

Nowadays, one of the most widely used approaches for face recognition is the Support Vector Machine (SVM) [6–8], since SVMs can give much higher recognition accuracy than other traditional methods. SVMs have been proposed as an effective method for classification problems [9]. However, despite their good generalization capacity, SVMs have a major drawback in real applications: they are time consuming and impose a high computational burden. This is crucial mainly in situations with large datasets, such as police and criminal files. Our motivation for using the OPF classifier is to significantly reduce the training time and computational burden while achieving classification performance comparable to state-of-the-art methodologies such as SVM and Artificial Neural Networks using Multilayer Perceptrons (ANN-MLP) [10, 11]. The OPF classifier [12] presents theoretical advantages over ANN-MLP and SVM: (i) it handles multiple classes without modifications or extensions, (ii) it makes no assumptions about the shape and/or separability of the classes, and (iii) its training runs much faster.

In this work, our objective is to improve face recognition accuracy and to reduce the computational effort of this task by using the OPF classifier on two public datasets. Comparisons with two other widely used supervised classifiers, ANN-MLP and SVM, are also presented. The remainder of the paper is organized as follows: Section 2 describes the OPF classifier, Section 3 briefly describes PCA and Section 4 describes the experiments and results. Finally, Section 5 presents the conclusions and final remarks.
2. OPTIMUM-PATH FOREST CLASSIFIER

Let Z1 and Z2 be the training and test sets, with |Z1| and |Z2| samples such as points or image elements (e.g., pixels, voxels, shapes and texture information). Let λ(s) be the function that assigns the correct label i, i = 1, 2, . . . , c, of class i to any sample s ∈ Z1 ∪ Z2. Z1 is a labeled set used for the design of the classifier, and Z2 is used to assess the performance of the classifier and is kept unseen during the design. Let S ⊂ Z1 be a set of prototypes of all classes (i.e., key samples that best represent the classes). Let v be an algorithm which extracts n attributes (color, shape or texture properties) from any sample s ∈ Z1 ∪ Z2 and returns a vector \(\vec{v}(s) \in \Re^n\). The distance d(s, t) between two samples s and t is the distance between their feature vectors \(\vec{v}(s)\) and \(\vec{v}(t)\). One can use any valid metric (e.g., Euclidean) or a more elaborate distance algorithm. Our problem consists of using S, (v, d) and Z1 to design an optimal classifier which can predict the correct label λ(s) of any sample s ∈ Z2. The OPF classifier creates a discrete optimal partition of the feature space such that any sample s ∈ Z2 can be classified according to this partition. This partition is an optimum-path forest (OPF) computed in \(\Re^n\) by the image foresting transform algorithm [13].

2.1. Training

Let (Z1, A) be the complete graph whose nodes are the samples in Z1 and whose arcs connect all pairs of samples. A path is a sequence of distinct samples with a given terminus, and it is optimum when its cost is minimum among all paths with the same terminus. We use the path-cost function fmax, which assigns to a path the maximum arc weight (distance) along it:

\[ f_{max}(\langle s \rangle) = \begin{cases} 0 & \text{if } s \in S \\ +\infty & \text{otherwise} \end{cases} \tag{1} \]

\[ f_{max}(\pi \cdot \langle s, t \rangle) = \max\{f_{max}(\pi), d(s, t)\} \tag{2} \]

Algorithm 1 below computes, for every sample t ∈ Z1, the cost C(t) of the optimum path from S to t, its predecessor P(t) and its label L(t), using a priority queue Q ordered by the costs C.

Algorithm 1 (OPF training):
1. For each s ∈ Z1 \ S, set C(s) ← +∞.
2. For each s ∈ S, do
3.   C(s) ← 0, P(s) ← nil, L(s) ← λ(s), and insert s in Q.
4. While Q is not empty, do
5.   Remove from Q a sample s such that C(s) is minimum.
6.   For each t ∈ Z1 such that t ≠ s and C(t) > C(s), do
7.     Compute cst ← max{C(s), d(s, t)}.
8.     If cst < C(t), then
9.       If C(t) ≠ +∞, then remove t from Q.
10.      P(t) ← s, L(t) ← L(s) and C(t) ← cst.
11.      Insert t in Q.
Lines 1−3 initialize maps and insert prototypes in Q. The main loop computes an optimum path from S to every sample s in a non-decreasing order of minimum cost (Lines 4−11). At each iteration, a path of minimum cost C(s) is obtained in P when we remove its last node s from Q (Line 5). Ties are broken in Q using a first-in-first-out policy: when two optimum paths reach an ambiguous sample s with the same minimum cost, s is assigned to the first path that reached it. Note that C(t) > C(s) in Line 6 is false when t has already been removed from Q, and therefore C(t) ≠ +∞ in Line 9 is true only when t ∈ Q. Lines 8−11 evaluate whether the path that reaches an adjacent node t through s is cheaper than the current path with terminus t, and update the position of t in Q, C(t), L(t) and P(t) accordingly.

We say that S∗ is an optimum set of prototypes when Algorithm 1 minimizes the classification errors in Z1. S∗ can be found by exploiting the theoretical relation between the minimum-spanning tree (MST) [14] and the optimum-path tree for fmax [15, 16]. By computing an MST in the complete graph (Z1, A), we obtain a connected acyclic graph whose nodes are all samples of Z1 and whose arcs are undirected and weighted by the distances d between adjacent samples. The spanning tree is optimum in the sense that the sum of its arc weights is minimum compared to any other spanning tree in the complete graph. In the MST, every pair of samples is connected by a single path which is optimum according to fmax; that is, the minimum-spanning tree contains one optimum-path tree for any selected root node.
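Before turning to prototype selection, a minimal Python sketch of Algorithm 1 may help fix ideas. It is illustrative, not the authors' implementation: it assumes Euclidean distance d, training samples stored as rows of a NumPy array, and a prototype set S given as a list of indices (its selection from the MST is described next). A lazy-deletion heap stands in for the priority queue Q, so the explicit removal of t in Line 9 becomes a staleness check.

```python
import heapq
import numpy as np

def opf_train(X, y, prototypes):
    """Algorithm 1: build the optimum-path forest on the training set.

    X          -- (n, d) array of feature vectors (the training set Z1)
    y          -- (n,) array of true labels (the function lambda)
    prototypes -- indices of the prototype set S
    Returns the cost map C, label map L and predecessor map P.
    """
    n = X.shape[0]
    C = np.full(n, np.inf)           # Line 1: +inf for non-prototypes
    L = np.zeros(n, dtype=int)       # label map
    P = np.full(n, -1, dtype=int)    # predecessor map (nil = -1)

    heap = []                        # priority queue Q ordered by cost
    for s in prototypes:             # Lines 2-3: initialize prototypes
        C[s] = 0.0
        L[s] = y[s]
        heapq.heappush(heap, (0.0, s))

    while heap:                      # Line 4: main loop
        cost_s, s = heapq.heappop(heap)   # Line 5: minimum-cost node
        if cost_s > C[s]:
            continue                 # stale queue entry (lazy deletion)
        for t in range(n):           # Line 6: complete graph
            if t == s or C[t] <= C[s]:
                continue
            cst = max(C[s], np.linalg.norm(X[s] - X[t]))  # Line 7: f_max
            if cst < C[t]:           # Lines 8-11: cheaper path found
                C[t] = cst
                L[t] = L[s]
                P[t] = s
                heapq.heappush(heap, (cst, t))
    return C, L, P
```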
The optimum prototypes are the closest elements of the MST with different labels in Z1. By removing the arcs between different classes, their adjacent samples become prototypes in S∗, and Algorithm 1 can compute an optimum-path forest in Z1. Note that a given class may be represented by multiple prototypes (i.e., optimum-path trees), and there must exist at least one prototype per class. It is not difficult to see that the optimum paths between classes tend to pass through the same removed arcs of the minimum-spanning tree. The choice of prototypes described above aims to block these passages, reducing the chances of samples in any given class being reached by optimum paths from prototypes of other classes.

2.2. Classification

For any sample t ∈ Z2, we consider all arcs connecting t with samples s ∈ Z1, as though t were part of the training graph. Considering all possible paths from S∗ to t, we find the optimum path P∗(t) from S∗ and label t with the class λ(R(t)) of its most strongly connected prototype R(t) ∈ S∗. This path can be identified incrementally by evaluating the optimum cost C(t) as

\[ C(t) = \min_{\forall s \in Z_1} \{\max\{C(s), d(s, t)\}\}. \tag{3} \]
Let the node s∗ ∈ Z1 be the one that satisfies Equation (3) (i.e., the predecessor P(t) in the optimum path P∗(t)). Given that L(s∗) = λ(R(t)), the classification simply assigns L(s∗) as the class of t. An error occurs when L(s∗) ≠ λ(t). A similar procedure can be applied when a separate evaluation set is available; in that case, misclassified evaluation samples can be used to learn the distribution of the classes in the feature space and improve the classification performance on unseen data.
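Classification by Equation (3) is then a linear scan over Z1 for each test sample. A minimal sketch consistent with the training function above (again assuming Euclidean distance and NumPy arrays):

```python
import numpy as np

def opf_classify(X_train, C, L, X_test):
    """Label each test sample t by Equation (3): connect t to every
    training node s and take the minimum of max{C(s), d(s, t)}."""
    preds = np.empty(X_test.shape[0], dtype=int)
    for i, t in enumerate(X_test):
        dists = np.linalg.norm(X_train - t, axis=1)  # d(s, t) for all s
        costs = np.maximum(C, dists)                 # max{C(s), d(s, t)}
        preds[i] = L[np.argmin(costs)]               # label of the best s*
    return preds
```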
2.3. Complexity Analysis

The OPF algorithm can be divided into two phases, as mentioned above: training and classification. Most of the computational effort is spent in the training phase, in which we need to compute the optimum prototypes (the closest samples in the MST) and then run the OPF algorithm (Algorithm 1). Let |V1| and |E1| be the number of samples (vertices) and the number of edges, respectively, in the training set represented by the complete graph (Z1, A). We can compute the MST using Prim's algorithm [14] with complexity O(|E1| log |V1|) and find the prototypes in O(|V1|). The main (Line 4) and inner (Line 6) loops of Algorithm 1 each run O(|V1|) times, because we have a complete graph, so the OPF algorithm runs in O(|V1|²). The overall OPF training step therefore executes in O(|E1| log |V1|) + O(|V1|) + O(|V1|²), which is dominated by O(|V1|²).

The classification step runs in linear time, i.e., O(|V1|) per test sample. As described in Section 2.2, a sample t ∈ Z2 to be classified is connected to all samples in Z1 and its optimum-path cost is evaluated. Since the OPF classifier can be understood as a dynamic programming algorithm [14], we do not need to execute the OPF algorithm again in the test phase: each node si ∈ Z1, i = 1, 2, . . . , |Z1|, already has its optimum-path value stored, and we only need to evaluate the path extensions from each node si to t. The estimated OPF total time is therefore O(|V1|²) + O(|V1|), i.e., O(|V1|²) operations.
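As a concrete illustration (using the ORL setup of Section 4 only for the arithmetic, not as a reported measurement): a 50% split of the 400 images gives |V1| = 200, so the complete graph has |E1| = |V1|(|V1| − 1)/2 = 19900 undirected arcs. Prim's algorithm then costs on the order of |E1| log |V1| ≈ 1.5 × 10⁵ operations, and the O(|V1|²) OPF pass about 4 × 10⁴ path evaluations, which is consistent with the sub-second OPF training times in Table 2.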
One of the main advantages of the OPF algorithm is that it does not interpret the classification task as a hyperplane optimization problem. An ANN-MLP, for example, can address linearly and piecewise-linearly separable feature spaces by estimating the hyperplanes that best separate the dataset, but it cannot efficiently handle non-separable problems. As an unstable classifier, collections of ANN-MLPs [17] can improve its performance up to some unknown limit on the number of classifiers [18]. SVMs have been proposed to overcome this problem by assuming linearly separable classes in a higher-dimensional feature space [19], but their computational cost rapidly increases with the training set size and the number of support vectors. As a binary classifier, multiple SVMs are required to solve a multi-class problem [20]. These points make the SVM quadratic optimization problem a demanding task in situations with large datasets. The problem is aggravated by the choice of the nonlinear mapping function, generally Radial Basis Functions (RBF), whose optimal parameters must be chosen by cross-validation, making the training phase prohibitively time consuming.
3. PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis is the technique that implements the Karhunen-Loève Transform, or Hotelling Transform, a classical unsupervised second-order method that uses the eigenvalues and eigenvectors of the covariance matrix to transform the feature space, creating orthogonal, uncorrelated features. It is a second-order method because all the necessary information is available directly from the covariance matrix of the data, and no information regarding probability distributions is needed. Given a multivariate dataset, the objective of PCA is to reduce the dimensionality and redundancy existing in the data in order to find the best representation in terms of the Mean Square Error (MSE). More details concerning PCA and its relation to many interesting statistical and geometrical properties are widely found in the pattern recognition literature [21–23].
Fig. 1. Two ORL database example faces.

3.1. PCA by the Minimization of the Mean Square Error

During dimensionality reduction, PCA seeks a set of m basis vectors (m < n) that span an m-dimensional subspace in which the mean square error between the new representation and the original one is minimum. The projection of x (the lexicographic notation of a face image) onto the subspace spanned by the vectors wj, j = 1, . . . , m, defines the following criterion:

\[ J_{MSE}^{PCA}(\mathbf{w}_j) = E\left[\left\|\mathbf{x} - \sum_{j=1}^{m} (\mathbf{x}^T \mathbf{w}_j)\mathbf{w}_j\right\|^2\right] \tag{4} \]
subject to \(\|\mathbf{w}_j\| = 1\). Considering that the data are centralized (the mean vector is null) and using the orthonormality of the basis, Equation (4) simplifies to:

\[ J_{MSE}^{PCA}(\mathbf{w}_j) = E\left[\|\mathbf{x}\|^2\right] - \sum_{j=1}^{m} \mathbf{w}_j^T C_x \mathbf{w}_j \tag{5} \]
where Cx denotes the covariance matrix of the observed data. As the first term does not depend on wj, in order to minimize the MSE we have to maximize the second term of (5), which represents the variance of the projected data. This can be achieved by using Lagrange multipliers, resulting in:
\[ J_{MSE}^{PCA}(\mathbf{w}_j, \lambda_j) = \sum_{j=1}^{m} \mathbf{w}_j^T C_x \mathbf{w}_j - \sum_{j=1}^{m} \lambda_j \left(\mathbf{w}_j^T \mathbf{w}_j - 1\right) \tag{6} \]
Differentiating the above expression with respect to wj and setting the result to zero leads to the following result [24]:
\[ C_x \mathbf{w}_j = \lambda_j \mathbf{w}_j \tag{7} \]
Therefore, we have an eigenvector problem: the vectors wj of the new basis that maximize the variance of the transformed data are the eigenvectors of the covariance matrix Cx, which in face recognition are used to generate the eigenfaces (projections of the faces onto the space spanned by the principal components). Thus, inserting Equation (7) into (5) leads to:

\[ J_{MSE}^{PCA}(\mathbf{w}_j) = E\left[\|\mathbf{x}\|^2\right] - \sum_{j=1}^{m} \lambda_j \tag{8} \]

showing that, in order to obtain the representation that minimizes the MSE, one must choose the m eigenvectors associated with the m largest eigenvalues of the covariance matrix.
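As an illustration of this result (a sketch, not the authors' code), the eigenface basis can be computed directly from the eigendecomposition of Cx. For very high-dimensional inputs such as ORL's 10304-D vectors, practical implementations use the SVD or the Turk-Pentland trick on the smaller Gram matrix, since Cx is n × n; the direct version below simply mirrors Equations (7)-(8):

```python
import numpy as np

def pca_eigenfaces(X, m):
    """Project face vectors onto the m principal components.

    X -- (N, n) array, one lexicographic face vector per row
    m -- number of components to keep (m < n)
    Returns the centered projections and the basis W of shape (n, m).
    """
    mean = X.mean(axis=0)
    Xc = X - mean                          # centralize: null mean vector
    C = np.cov(Xc, rowvar=False)           # covariance matrix C_x, (n, n)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C_x is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    W = eigvecs[:, order[:m]]              # m eigenvectors = eigenfaces
    return Xc @ W, W
```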
Fig. 2. Face descriptors for template patterns in Fig. 1.

4. EXPERIMENTAL RESULTS

We used two face image databases to test and evaluate the proposed classification method: ORL (available at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) and CBCL (available at http://cbcl.mit.edu/software-datasets/FaceData2.html).

The ORL (Olivetti Research Laboratory, Cambridge, UK) database contains 400 images from 40 different individuals (10 different images of each). The face images, available in PGM format, were taken under variations in lighting, facial expression and facial details. All face images have the same spatial dimension of 92 × 112 pixels, with 256 grey levels per pixel. Each 92 × 112 image was transformed into a 1-D signal with 10304 elements; therefore, each face image is represented by a 10304-D vector, which we call the face descriptor. Two sample face images from the ORL database are shown in Figure 1 and their respective face descriptors in Figure 2.

The CBCL (Center for Biological and Computational Learning, MIT) database is intended for face and non-face recognition. Its training set contains 2429 face samples and 4548 non-face samples; the test set comprises 472 face samples and 23573 non-face samples. The samples, also available in PGM format, are 19 × 19 pixels, and each 19 × 19 image was transformed into a 1-D signal with 361 elements. Two sample face images from the CBCL database are shown in Figure 3 and their respective face descriptors in Figure 4.

Fig. 3. Two CBCL database example faces.
PCA features   OPF          SVM          ANN-MLP
5              88.76±1.40   89.71±1.07   68.82±1.48
10             94.48±1.43   96.35±1.19   74.12±1.60
25             96.51±0.94   97.17±1.01   72.00±2.59
50             97.58±1.08   98.51±0.87   71.17±3.06
75             97.02±0.47   98.00±0.80   69.66±2.00
100            96.94±0.87   98.07±0.65   69.53±1.93
200            96.84±0.56   98.17±1.00   64.00±1.86

Table 1. Mean accuracies (%) over 10 runs using 50% of the samples for training and 50% for testing on the ORL dataset.

PCA features   OPF      SVM        ANN-MLP
5              0.0057   17.4659    113.9888
10             0.0065   19.9625    115.0964
25             0.0105   27.1684    121.8251
50             0.0175   40.3648    137.0656
75             0.0230   52.5302    148.2339
100            0.0256   65.3766    142.7118
200            0.0390   115.2084   109.6110

Table 2. Mean execution times in seconds for the ORL dataset.

Fig. 4. Face descriptors for template patterns in Fig. 3.
We applied PCA to reduce the dimensionality of the input patterns (361-D for the CBCL database and 10304-D for the ORL database), so as to avoid problems caused by high-dimensional data. We then performed OPF classification using 5, 10, 25, 50, 75, 100 and 200 principal components for ORL, and 10, 15, 20, 25 and 50 for CBCL, in order to analyse the training time and the classification performance in each situation. In the following, we compare the results obtained by OPF with two well-known classification methods, SVM and ANN-MLP.

Table 1 shows the accuracies of OPF, SVM and MLP classification on the ORL database. The dataset was split into 50% of the images for training and 50% for testing. Each accuracy is an average over 10 runs for each number of principal components; in each run, training and test sets are randomly generated. Similarly, Table 3 shows the accuracies of OPF, SVM and MLP classification on the CBCL database. The objective here is a two-class classification task: face vs. non-face. The same methodology was applied, except that we used 25% of the samples for training and 75% for testing, since CBCL is a very large database. Finally, Table 4 presents the execution time for each training run reported in Table 3. Note that, in all experiments, the proposed OPF classifier drastically decreases the execution times while achieving performance at least equivalent to the SVM approach and much better than MLP.

PCA features   OPF          SVM          ANN-MLP
10             83.54±0.61   78.14±1.46   64.03±6.75
15             85.09±0.57   81.98±0.97   71.85±1.80
20             85.06±0.30   82.02±0.75   74.82±1.19
25             85.56±0.47   84.76±1.12   76.32±1.51
50             84.73±0.56   86.63±0.52   74.25±1.24

Table 3. Mean accuracies (%) over 10 runs using 25% of the samples for training and 75% for testing on the CBCL dataset.

PCA features   OPF       SVM        ANN-MLP
10             12.4417   11237.82   785.91
15             14.3565   10686.18   887.98
20             17.5660   11594.53   956.17
25             20.0343   12841.75   1050.15
50             34.5864   18855.51   1480.99

Table 4. Mean execution times in seconds for the CBCL dataset.
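The following sketch reproduces the spirit of one such run for ORL. It is illustrative rather than the authors' code: prototype selection follows the MST rule of Section 2.1, load_orl_faces is a hypothetical loader returning face vectors and labels, opf_train and opf_classify are the sketches from Section 2, and scikit-learn/SciPy supply the PCA, the random split and the MST.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

def find_prototypes(X, y):
    """S*: endpoints of MST arcs whose two nodes have different labels."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    protos = set()
    for s, t in zip(mst.row, mst.col):
        if y[s] != y[t]:             # arc between classes: both ends
            protos.update((int(s), int(t)))
    return sorted(protos)

# One run of the ORL protocol: random 50/50 split, PCA to 50 components
# (as in Table 1), OPF training and classification.
X, y = load_orl_faces()              # hypothetical loader
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.5, stratify=y)
pca = PCA(n_components=50).fit(Xtr)
Ztr, Zte = pca.transform(Xtr), pca.transform(Xte)
C, L, P = opf_train(Ztr, ytr, find_prototypes(Ztr, np.asarray(ytr)))
accuracy = np.mean(opf_classify(Ztr, C, L, Zte) == yte)
```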
5. CONCLUSIONS

This paper presented a novel method for face recognition with eigenfaces using the OPF classifier. The main advantages of this method are that it reduces the computational effort by drastically decreasing processing times, a major drawback of SVM-based systems, and that it maintains performance statistically equivalent to the state-of-the-art technique (SVM), with much better performance than MLP neural networks. The obtained results show that this novel approach is fast and accurate, representing a step towards real-time face recognition. Future work may include the use of other feature extraction methods, such as Linear Discriminant Analysis or block-based PCA, and the application of the OPF classifier to other face databases. It is also interesting to verify how the OPF classifier performs when the face image is corrupted with noise or is not completely available, which may occur in real-time applications.

6. REFERENCES

[1] A. K. Jain and S. Z. Li, Handbook of Face Recognition, Springer-Verlag, 2005.

[2] W. Zhao, R. Chellapa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys, vol. 35, pp. 399–459, 2003.

[3] M. A. Turk and A. P. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[4] K. Etemad and R. Chellapa, "Discriminant analysis for the recognition of human face images," Journal of the Optical Society of America A, vol. 14, pp. 1727–1733, 1997.

[5] B. A. Draper, K. Baek, M. S. Bartlett, and J. R. Beveridge, "Recognizing faces with PCA and ICA," Computer Vision and Image Understanding, vol. 91, no. 1, pp. 115–137, 2003.

[6] G. Guo, S. Z. Li, and K. Chan, "Face recognition by support vector machines," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 196–201, 2000.

[7] K. Jonsson, J. Matas, J. Kittler, and Y. P. Li, "Learning support vectors for face verification and recognition," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 208–213, 2000.

[8] B. Heisele, P. Ho, and T. Poggio, "Face recognition with support vector machines: Global versus component-based approach," in Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), pp. 688–694, 2001.

[9] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.

[10] M. H. Ahmad Fadzil and H. Abu Bakar, "Human face recognition using neural networks," in Proceedings of ICIP-94, vol. 3, pp. 936–939, 1994.

[11] M. Oravec and J. Pavlovicova, "Face recognition methods based on principal component analysis and feedforward neural networks," in Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 1, 2004.

[12] J. P. Papa, A. X. Falcão, C. T. N. Suzuki, and N. D. A. Mascarenhas, "A discrete approach for supervised pattern recognition," in 12th International Workshop on Combinatorial Image Analysis, vol. 4958, pp. 136–147, Springer Berlin/Heidelberg, 2008.

[13] A. X. Falcão, J. Stolfi, and R. A. Lotufo, "The image foresting transform: Theory, algorithms, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 19–29, Jan. 2004.

[14] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, MIT Press, 1990.

[15] C. Allène, J. Y. Audibert, M. Couprie, J. Cousty, and R. Keriven, "Some links between min-cuts, optimal spanning forests and watersheds," in Mathematical Morphology and its Applications to Image and Signal Processing, pp. 253–264, MCT/INPE, 2007.

[16] P. A. V. Miranda, A. X. Falcão, A. Rocha, and F. P. G. Bergo, "Object delineation by κ-connected components," EURASIP Journal on Advances in Signal Processing, 2008, to appear.

[17] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[18] L. Reyzin and R. E. Schapire, "How boosting the margin can also boost classifier complexity," in Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, pp. 753–760, ACM Press, 2006.

[19] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Workshop on Computational Learning Theory, New York, NY, USA, pp. 144–152, ACM Press, 1992.

[20] K. Duan and S. S. Keerthi, "Which is the best multiclass SVM method? An empirical study," Multiple Classifier Systems, pp. 278–285, 2005.

[21] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2006.

[22] A. Webb, Statistical Pattern Recognition (second edition), Wiley, 2006.

[23] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (second edition), Wiley-Interscience, 2001.

[24] T. Y. Young and T. W. Calvert, Classification, Estimation and Pattern Recognition, Elsevier, 1974.