FAST FACE SEQUENCE MATCHING IN LARGE-SCALE VIDEO DATABASES

Hung Thanh Vu1, Thanh Duc Ngo2, Thao Ngoc Nguyen1, Duy-Dinh Le3, Shin'ichi Satoh3, Bac Hoai Le1, Duc Anh Duong1
1 University of Sciences, 227 Nguyen Van Cu, Ho Chi Minh City, Vietnam
2 The Graduate University for Advanced Studies (Sokendai), Japan
3 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

ABSTRACT

Many methods have recently been proposed for matching face sequences in the field of face retrieval. However, most of them have proven inefficient on large-scale video databases because they typically require a huge computational cost to obtain high accuracy. We present an efficient method for matching face sequences (called face tracks) in large-scale video databases. The key idea is to capture the distribution of a face track in the fewest, lowest-cost computational steps. To do so, each face track is represented by a vector that approximates the first principal component of the face track distribution, and the similarity between face tracks is based on the similarity of these vectors. Experimental results on a large-scale database of 457,320 human faces extracted from 370 hours of TRECVID videos from 2004-2006 show that the proposed method scales well by maintaining a good balance between speed and accuracy.

Index Terms— Face retrieval, face track matching, subspace method.

1. INTRODUCTION

A huge number of videos are generated daily from sources such as television programs, broadcast news, surveillance videos, and movies. The principal subject in these videos is people, which makes the human face the most important object in video retrieval. Currently, frontal faces can be efficiently extracted from these video sources using face detectors [1]. These face detectors can produce large databases of up to tens of millions of faces. The problem is how to organize these databases for efficient and accurate retrieval.
Solving this problem would benefit a wide range of applications, from video indexing and event detection to person search in videos. The conventional approach is to use a single image for matching: each image is represented as a point in a high-dimensional feature space, and the similarity between the query image of a person and an image in the database is the distance between the two corresponding points in the
feature space. The main drawback of this approach is that the matching results rely entirely on these points, which are unstable due to the many variations of the human face, such as head pose, facial expression, and illumination. Another approach avoids this dependence on unstable points by using a face sequence instead of a single image, so that a person is represented by the distribution of a point set in the feature space. Methods following this face sequence based approach usually try to model the distribution. Shakhnarovich et al. [2] modeled a face sequence as a probability distribution. Cevikalp and Triggs [3] treated a face sequence as a set of points and computed the convex geometric region spanned by these points. The min-min method [4, 5, 6] considered a face sequence as a cluster of points and measured the distance between clusters. Subspace methods [7, 8, 9] viewed a face sequence as points spread over a subspace. Although these methods can be highly accurate, a lot of computation is needed to represent the distribution of a face sequence, such as computing the convex hulls in [3], the probability models in [2], and the eigenvectors in [7, 8, 9]. For this reason, they are not scalable to large video databases. Other methods that can efficiently match numerous face tracks, such as the k-Faces method [10], usually sacrifice accuracy for speed. Therefore, demand is growing for algorithms that balance speed and accuracy at large scale. We propose an efficient method for matching face tracks in large-scale video databases. We follow the idea of the subspace methods [7, 8, 9], in which the similarity between two face tracks is estimated by the similarity between two distributions. However, scalability is as important as accuracy in such databases.
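To make the conventional single-image approach concrete, here is a minimal sketch of our own (toy made-up feature vectors, not from the paper): each image is one point, and retrieval simply ranks database images by their distance to the query point.

```python
import numpy as np

# Hypothetical toy data: each row is a D-dimensional face feature (one image).
database = np.array([[0.2, 0.1, 0.7],
                     [0.9, 0.05, 0.05],
                     [0.25, 0.15, 0.6]])
query = np.array([0.22, 0.12, 0.66])

# Single-image matching: similarity is just the distance between two points.
dists = np.linalg.norm(database - query, axis=1)
rank = np.argsort(dists)   # best match first
print(rank)                # -> [0 2 1]: the two nearby points rank first
```

A single noisy point, caused for instance by a pose change, can reorder such a ranking, which is exactly the instability that the face sequence based approach addresses.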
Therefore, unlike the subspace methods, we approximate the first principal component, which corresponds to the direction of largest variation, by a single vector instead of computing full subspaces at a huge computational cost. In this way, the computational cost is significantly reduced while the accuracy is maintained at a level comparable to other methods. The rest of this paper is organized as follows. Section 2 introduces an overview of our framework: Subsections 2.1 and 2.2 present our previous work in [10] for face track extraction;
our proposed method is described in Subsection 2.3. Finally, our experiments and conclusions are presented in Sections 3 and 4.

2. FRAMEWORK OVERVIEW

2.1. Face sequence extraction

There are several approaches for extracting face tracks from videos. Sivic et al. [6] use a face detector to locate human faces in every frame; faces of the same person are associated by tracking affine covariant regions over time. Their method yields good results but is too complex due to the huge cost of running affine covariant detectors and trackers. Another efficient extraction method, used in [4], is the Kanade-Lucas-Tomasi (KLT) tracker. KLT is applied to every frame to track interest points within a shot. A pair of face regions in different frames is linked by comparing the number of tracked points that pass through both face regions to the total number of points in the two regions. However, tracked points are usually sensitive to illumination changes, occlusions, and false face detections, and thus many fragmented face sequences may be created. We also apply a KLT tracker to associate faces of the same person, but unlike [4], we maintain the interest points in the face region instead of the whole shot and re-compute the tracked points every frame. Our method was shown to be more efficient and robust than Everingham et al.'s method in [11].

Fig. 1. Eigenvectors (with the largest eigenvalues) and mean vectors of two 3D face tracks.

Fig. 2. Computing the cosine distance before (a) and after (b) the zero-mean normalization step.

2.2. Facial feature representation
We use the Local Binary Pattern (LBP) [12] feature to represent the extracted faces. LBP has recently become one of the most popular features for face representation. Its remarkable advantages are that it is invariant to monotonic changes in illumination and can be computed quickly. A direct extension of the LBP proposed in [12] is LBP_{P,R}, which considers LBP operators at different scales and rotations. The LBP_{P,R} operator at a point (xc, yc) in an image, where P is the number of sampling points on a circle of radius R, compares the intensity of each sampling point with the intensity of (xc, yc) to produce a binary string. Each string is converted into an integer that falls into a unique bin of a k-bin histogram. A face image is partitioned into a regular grid of n × n cells; a k-bin histogram is built individually for every cell, and these histograms are concatenated to create the LBP feature of the whole image. After this step, each face image is represented by a feature of D = n × n × k dimensions.

2.3. Face track representation and matching

Since each face is represented as a feature point, each face track describes the distribution of the faces of one person in the feature space. When the number of faces in each face track is huge, an efficient representation of this distribution is needed. To overcome this issue, we represent each face track by its mean vector. Given a face track Fi of ni faces, the mean vector is:

vi = (1/ni) Σ_{j=1..ni} fij,

where fij is the j-th face of face track Fi. The motivation for using the mean vector is that it closely approximates the first principal component of the face track distribution, which corresponds to the direction of maximum variance of the data. In other words, the mean vector can replace the first eigenvector (the one with the largest eigenvalue) of the subspace used in subspace methods.
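As a quick numerical sanity check of this approximation (a synthetic sketch of our own, not the paper's data): for a point cloud that sits away from the origin, as nonnegative LBP histograms do, the first singular vector of the uncentered data matrix is nearly parallel to the mean vector.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 50, 300
mu = rng.random(D) + 0.5                       # synthetic "face track" center, away from the origin
X = mu + 0.05 * rng.standard_normal((n, D))    # n faces with small variation around it

mean_vec = X.mean(axis=0)
# First right singular vector of the *uncentered* data matrix, i.e., the
# leading direction that a subspace method would extract.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
first_pc = Vt[0]

cos = abs(mean_vec @ first_pc) / np.linalg.norm(mean_vec)
print(round(cos, 4))   # close to 1.0: the two directions nearly coincide
```

Note that this holds because the cloud's offset from the origin dominates its spread; for zero-mean data the mean vector would carry no directional information.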
Figure 1 shows an example of two face tracks in a three-dimensional space. The green vectors are the mean vectors of the face tracks, while the red (blue) vector is the eigenvector (corresponding to the largest eigenvalue) representing the subspace of the red (blue) face track. It is easy to see that the
mean vectors and the eigenvectors are nearly identical. Based on this approximation, we believe our method inherits the accuracy advantages of the subspace methods. In addition, our method only requires O(ni × D) operations to find the mean vector, so the face track representation can be computed quickly.

After the extraction and representation processes are complete, the face tracks are organized into databases for the matching phase. Given an input face track, the similarity between the input face track and each face track in the database is estimated, and a ranked list is returned according to the similarity scores. There are several common similarity measures for matching in face retrieval, such as the Euclidean, L1, and HIK distances. However, we choose the cosine distance to measure the similarity between two mean vectors. This idea is based on the angle distance between two subspaces, which is used successfully in subspace methods. We refer to our method as mean-cos (using the mean vector for representation and the cosine distance for matching); its details are as follows:

mean-cos
1. Compute the mean of all faces in the database:

u = (1/Nf) Σ_{i=1..N} Σ_{j=1..ni} fij,

where N is the number of face tracks and Nf = Σ_{i=1..N} ni is the number of faces in the database.
2. Find the mean vector vi for each face track Fi.
3. Normalize the mean vector: vi = vi − u.
4. Compute the distance from the query face track G to each face track Fi.
5. Return a ranked list.

Steps 1 and 3 compose a zero-mean normalization step. The purpose of normalizing the data to zero mean is to enhance the discrimination of the cosine distance. Figure 2a shows an example where this measure makes a mistake: face track F3 is considered further from F2 than F1 because the angle ϕ between v3 and v2 is greater than the angle θ between v1 and v2. Meanwhile, the distances between the face tracks are correctly estimated (ϕ < θ) after the zero-mean normalization in Figure 2b.

3. EXPERIMENTS

3.1. Database and evaluation

We used the database described in [10] to evaluate our method. This database was collected from 370 hours of TRECVID news video from 2004-2006. According to [10], the faces were extracted, annotated, and organized into a database of 1,510 face tracks of 49 people, containing 457,320 face images. The LBP feature was extracted for each image using a 3 × 3 grid and 59-bin LBP histograms, creating a 531-dimensional feature. For evaluation, we used the mean average precision (MAP), a common measure for information retrieval systems in general and face retrieval systems in particular. MAP is also the standard benchmark in established competitions such as the TRECVID workshop [13] and the PASCAL VOC challenge [14]. We used each face track in the database as a query, giving 1,510 queries; the MAP over these 1,510 queries was used to compare the methods.

3.2. Results and analysis

We compared our method (mean-cos) with MSM, CMSM, min-min, and k-Faces. All experiments were performed on a Linux server with 24 cores at 2.66 GHz and 128 GB RAM.

Method      MAP (%)
CMSM        58.39
mean-cos    58.13
MSM         57.72
min-min     56.93
k-Faces     54.97

Table 1. MAP results from TRECVID data.

Table 1 lists the MAP values of all the methods. The mean-cos method achieved a MAP of 58.13%, outperforming k-Faces (54.97%), min-min (56.93%), and MSM (57.72%), and is comparable to CMSM (58.39%). These results show that the mean vectors describe the first principal component of the face track distributions very well. Let N be the number of face tracks in the database, M the average number of faces per face track, D the number of feature dimensions, and Dc the constrained subspace dimension in the CMSM method. In our experiments, N = 1,510, M = 302, D = 531, and Dc = 500 (Dc is chosen so that CMSM yields the highest MAP). The running times and computational complexities of all methods are listed in Table 2. The min-min method does not have a representation phase, but its matching time is huge: its complexity is O(N × N × D × M × M) against O(N × N × D) for the other methods.
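The mean-cos procedure from Section 2.3 can be sketched in a few lines (a toy illustration of our own, not the evaluated implementation; the real system uses 531-dimensional LBP features):

```python
import numpy as np

def mean_cos_rank(tracks, query_track):
    """Rank face tracks by cosine similarity of zero-mean-normalized mean vectors.

    tracks: list of (n_i, D) arrays, one per database face track.
    query_track: (n_q, D) array of query faces.
    """
    # Step 1: global mean over all faces in the database.
    u = np.concatenate(tracks).mean(axis=0)
    # Steps 2-3: per-track mean vectors, shifted by the global mean.
    vs = np.stack([t.mean(axis=0) - u for t in tracks])
    q = query_track.mean(axis=0) - u
    # Step 4: cosine similarity between the query and each track.
    sims = (vs @ q) / (np.linalg.norm(vs, axis=1) * np.linalg.norm(q))
    # Step 5: ranked list, most similar first.
    return np.argsort(-sims)

# Toy example: three 2-D tracks; tracks 0 and 2 point the same way as the query.
tracks = [np.array([[1.0, 0.1], [1.2, 0.0]]),
          np.array([[0.1, 1.0], [0.0, 1.1]]),
          np.array([[0.9, 0.2], [1.1, 0.1]])]
query = np.array([[1.0, 0.15], [1.05, 0.1]])
print(mean_cos_rank(tracks, query))   # tracks 0 and 2 rank ahead of track 1
```

The per-query cost is one dot product and two norms per database track, which is the O(N × D) matching step that the complexity analysis below relies on.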
In the face track representation phase, the subspace methods (MSM and CMSM) need O(N × D × D × (D + M)), while our method requires only O(N × M × D); roughly speaking, our method is on the order of D times faster than the subspace methods. The measured times confirm this gap: mean-cos is 438 and 559 times faster than MSM and CMSM, respectively. In other words, MSM and CMSM need more than 6 minutes to represent the 1,510 face tracks while
Method      Face track representation                 Matching
            Time (s)   Big O                          Time (s)    Big O
min-min     0          None                           6,543,790   O(N × N × D × M × M)
MSM         396        O(N × D × D × (D + M))         449         O(N × N × D)
CMSM        505        O(N × D × D × (D + M))         407         O(N × N × Dc)
mean-cos    0.9        O(N × D × M)                   184         O(N × N × D)
k-Faces     0.28       O(N × D)                       120         O(N × N × D)
Table 2. Performance time and computational complexity of all methods.

our method only requires 1 second. In the matching phase, all the methods except min-min have a very similar computational complexity of O(N × N × D) (CMSM actually has O(N × N × Dc), with Dc < D). However, mean-cos is about 2 times faster than the subspace methods and about 35,000 times faster than min-min, since those methods spend more time on complex operations, such as the matrix operations and eigenvector decompositions in the subspace methods or the distance calculations between all point pairs of two face tracks in min-min, which makes them unsuitable at large scale. The k-Faces method is the fastest of these methods (3.28 times faster than our method in the representation phase and 1.53 times in the matching phase), but it is less accurate than the other methods (Table 1). These experimental results show that the proposed method satisfies the tradeoff between the two key requirements of scalability: accuracy and speed.

4. CONCLUSION

We introduced an efficient and accurate method for matching face tracks in large-scale databases. The face tracks extracted from video sequences are represented by the mean vector of each face track. After a normalization step, the cosine distance is used to measure the similarity between two face track distributions. The efficiency of our method was demonstrated in both theory and practice on a large-scale face track database extracted from TRECVID videos, while its accuracy is comparable to state-of-the-art methods.

5. REFERENCES

[1] Paul A. Viola and Michael J. Jones, "Rapid object detection using a boosted cascade of simple features," CVPR, 2001.
[2] Gregory Shakhnarovich, John W. Fisher, III, and Trevor Darrell, "Face recognition from long-term observations," ECCV, 2002.
[3] Hakan Cevikalp and Bill Triggs, "Face recognition based on image sets," CVPR, 2010.
[4] M. Everingham, J. Sivic, and A. Zisserman, ""Hello! My name is... Buffy" – automatic naming of characters in TV video," BMVC, 2006.
[5] A. Hadid and M. Pietikäinen, "From still image to video-based face recognition: An experimental analysis," FG, 2004.
[6] J. Sivic, M. Everingham, and A. Zisserman, "Person spotting: Video shot retrieval for face sets," CIVR, 2005.
[7] Wei Fan and Dit-Yan Yeung, "Locally linear models on face appearance manifolds with application to dual-subspace based classification," CVPR, 2006.
[8] O. Yamaguchi, K. Fukui, and K. Maeda, "Face recognition using temporal image sequence," FG, 1998.
[9] Kazuhiro Fukui and Osamu Yamaguchi, "Face recognition using multi-viewpoint patterns for robot vision," ISRR, 2003.
[10] Thao Ngoc Nguyen, Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh, Bac Hoai Le, and Duc Anh Duong, "An efficient method for face retrieval from large video datasets," CIVR, 2010.
[11] Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh, and Duc Anh Duong, "Robust face track finding in video using tracked points," Proc. Intl. Conf. on Signal-Image Technology and Internet-Based Systems, pp. 59-64, 2008.
[12] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," PAMI, 2002.
[13] Alan F. Smeaton, Paul Over, and Wessel Kraaij, "Evaluation campaigns and TRECVID," MIR, 2006.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) challenge," IJCV, 2010.