An Efficient Method for Face Retrieval from Large Video Datasets

Thao Ngoc Nguyen
University of Science, Faculty of Information Technology, 227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam

Thanh Duc Ngo
The Graduate University for Advanced Studies, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

Duy-Dinh Le
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

Shin'ichi Satoh
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

Bac Hoai Le
University of Science, Faculty of Information Technology, 227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam

Duc Anh Duong
University of Science, Faculty of Information Technology, 227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam

ABSTRACT

The human face is one of the most important objects in videos, since it provides rich information for spotting people of interest, such as government leaders in news videos or the hero of a movie, and serves as a basis for interpreting video content. Detecting and recognizing faces appearing in video are therefore essential tasks in many video indexing and retrieval applications. Due to large variations in pose, illumination conditions, occlusions, hairstyles, and facial expressions, robust face matching remains a challenging problem. In addition, when the number of faces in the dataset is huge, e.g., tens of millions of faces, a scalable matching method is needed. To this end, we propose an efficient method for face retrieval in large video datasets. To make the retrieval robust, the faces of the same person appearing in an individual shot are grouped into a single face track using a reliable tracking method. Retrieval is done by computing the similarity between the face tracks in the database and the input face track. For each face track, we select one representative face, and the similarity between two face tracks is the similarity between their two representative faces. The representative face is the mean face of a subset selected from the original face track. In this way, we achieve high retrieval accuracy while maintaining low computational cost. For the experiments, we extracted approximately 20 million faces from 370 hours of TRECVID video, a scale that has not been addressed in previous work. Results evaluated on a manually annotated subset of 457,320 faces show that the proposed method is effective and scalable.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models; I.5.3 [Pattern Recognition]: Applications—Computer vision

General Terms Algorithms, Experimentation, Performance

Keywords face retrieval, face matching, face recognition, local binary patterns, TRECVID

1. INTRODUCTION

The human face is one of the most important objects in videos, especially in news programs, dramas, and movies. By extracting and organizing face information from such videos, we can facilitate content-based multimedia applications such as video retrieval, video indexing, and video mining [11, 10, 12, 2, 5, 13]. Current state-of-the-art face detectors [15], which can reliably and quickly detect frontal faces at different sizes and locations against complex backgrounds, can be used to extract faces from videos. However, recognizing faces of the same person is still a challenging task due to large variations in pose, illumination conditions, occlusions, hairstyles, and facial expressions. Figure 1 shows an example of these face variations.

Figure 1: Large variations in facial expressions, poses, illumination conditions, and occlusions make face recognition difficult. Best viewed in color.

When working on video datasets, one popular approach is to match face tracks (face sequences containing the faces of one person) instead of single faces as in static image datasets. The main idea is to take advantage of the abundance of frames in each sequence. For example, X. Liu and T. Chen [6] used adaptive Hidden Markov Models (HMM) to model temporal dynamics for face recognition; A. Hadid and M. Pietikainen [3] proposed an efficient method for extracting representative faces from each face track using the Locally Linear Embedding (LLE) algorithm; and Sivic et al. [12] modeled each face track by a histogram of facial part appearances. These methods have shown better recognition performance than methods using single faces. However, the numbers of individuals and face tracks used in their experiments are rather small, so scalability was not taken into account. When the dataset becomes huge (for example, TRECVID datasets [14] contain several hundred hours of video), these methods [6, 3, 12] must be revised, or a new scalable method must be developed. The key to solving the scalability problem is how two face sequences are matched. The traditional approach, computing the minimum distance over all pairs of faces in two face tracks, is not applicable when the number of faces in the dataset is huge, e.g., several million faces. S. Satoh and N. Katayama [10] proposed using an SR-tree to reduce the matching complexity of this approach.

We propose an alternative approach for efficient face track matching in large video datasets. The main idea is to select a subset of faces in each face track and compute one representative face for matching. Specifically, given a number k of faces to select from each face track, we divide the track into k equal parts according to its temporal order and select one face to represent each part. These faces are represented as points in a high-dimensional feature space, and the mean of these points is taken as the representative face of the track. The similarity between two face tracks is then defined as the similarity between their two representative faces. In this way, we achieve very low computational cost. Although this method is simple, comprehensive experiments on a dataset of 457,320 faces in 1,511 face tracks extracted from the TRECVID dataset show that it achieves matching performance comparable to other state-of-the-art methods.

2. METHOD OVERVIEW

2.1 Face Track Extraction

There are several approaches for grouping faces into face tracks. For example, Sivic et al. [12] track facial regions and connect them for grouping. This approach is accurate but computationally expensive. To reduce the computational cost while maintaining accuracy, Everingham et al. [2] used tracked points obtained from the Kanade-Lucas-Tomasi (KLT) tracker. However, face tracks obtained by this method may be fragmented, since tracked points are sensitive to illumination changes, occlusions, and false face detections. We previously proposed a method [7] that successfully handles these cases. It also uses tracked points to group the faces of an individual detected in a video sequence into a face track. However, instead of generating interest points in a certain frame and tracking them through the following frames as in [2], we re-generate tracked points to compensate for points lost to occlusions and to cover newly appearing faces. Since tracked points are also disturbed by camera flashes, a simple flash detector identifies flash frames and removes them from the grouping process. This method has proven robust and efficient in experiments on various long video sequences from the TRECVID dataset, achieving 94.17% accuracy versus 81.19% for Everingham et al.'s method. For more details, refer to [7].
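For illustration, the sketch below links face detections across consecutive frames by counting KLT tracked points shared between face boxes, in the spirit of [2, 7]. It is a minimal outline under our own assumptions (OpenCV tracker, per-frame point re-generation, an arbitrary min_shared threshold), not the authors' implementation; the flash detector and fragmentation handling of [7] are omitted.

```python
import cv2
import numpy as np

def in_box(p, box):
    """True if point p = (x, y) lies inside box = (x, y, w, h)."""
    x, y, w, h = box
    return x <= p[0] <= x + w and y <= p[1] <= y + h

def link_faces(prev_gray, cur_gray, prev_faces, cur_faces, min_shared=5):
    """Link face boxes across two frames by counting shared tracked points.

    prev_faces / cur_faces are lists of (x, y, w, h) detections. A KLT point
    supports a link when it lies in a previous-frame box before tracking and
    in a current-frame box after. Returns (prev_idx, cur_idx) pairs.
    """
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return []
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    ok = status.ravel() == 1
    p0, p1 = pts.reshape(-1, 2)[ok], nxt.reshape(-1, 2)[ok]
    links = []
    for i, pb in enumerate(prev_faces):
        for j, cb in enumerate(cur_faces):
            shared = sum(in_box(a, pb) and in_box(b, cb)
                         for a, b in zip(p0, p1))
            if shared >= min_shared:  # enough common points: same face track
                links.append((i, j))
    return links
```

Re-detecting corners for every frame pair crudely mimics the point re-generation of [7]; a faithful implementation would only re-generate points when they are lost to occlusion or when new faces appear.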

2.2 Face Track Representation

We use the LBP (local binary patterns) feature to represent the faces in each face track. The LBP feature proposed by Ojala et al. [8] is a powerful method for texture description. It is invariant with respect to monotonic grey-scale changes; hence, no grey-scale normalization is needed prior to applying the LBP operator. The operator labels the pixels of an image by thresholding the neighborhood of each pixel against the center pixel value and interpreting the result as a binary number. The LBP operator has been extended to account for different neighborhood sizes [8]: in general, the operator LBP_{P,R} refers to a neighborhood of P equally spaced pixels on a circle of radius R, forming a circularly symmetric neighbor set. It has been shown that certain bins contain more information than others, so it is possible to use only a subset of the 2^P local binary patterns to describe textured images. In [8], Ojala et al. defined these fundamental patterns (also called "uniform" patterns) as those with a small number of bitwise transitions from 0 to 1 and vice versa. Accumulating all patterns with more than two transitions into a single bin yields an LBP descriptor, denoted LBP^{u2}_{P,R}, which has fewer than 2^P bins. In our experiments, the input image is divided into sub-images by an n × n grid, the LBP operator is applied to each sub-image, and a k-bin histogram is computed per sub-image. Consequently, an (n × n × k)-dimensional feature vector is formed for each input image. We did not use PCA features for face representation, since PCA usually requires robust detection of facial feature points (e.g., eyes, nose, and mouth) for normalization, and building such robust detectors for real video data is expensive. Furthermore, PCA incurs additional computational cost for projecting from the original feature space to the eigenspace.
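As an illustration, this descriptor can be reproduced with scikit-image's uniform-pattern LBP. The sketch below is written under our own assumptions (per-cell histogram normalization, P=8 neighbors at radius R=1), not the authors' exact implementation:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_face_descriptor(face_gray, grid=3, P=8, R=1):
    """Grid-of-histograms LBP descriptor for a grayscale face image.

    scikit-image's 'nri_uniform' mapping gives P*(P-1)+3 codes, i.e. the
    59 u2 codes for P=8, so a 3x3 grid yields a 3*3*59 = 531-dim vector,
    matching the 3x3x59 configuration used in the experiments.
    """
    n_bins = P * (P - 1) + 3                      # 59 for P = 8
    lbp = local_binary_pattern(face_gray, P, R, method='nri_uniform')
    h, w = lbp.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = lbp[i * h // grid:(i + 1) * h // grid,
                       j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            cells.append(hist / max(hist.sum(), 1))  # per-cell normalization
    return np.concatenate(cells)                  # shape: (grid*grid*n_bins,)
```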

2.3 Matching Method

The main purpose of face retrieval in videos is to find face tracks relevant to a given query. To do so, a method for estimating the similarity between face tracks is needed.

Figure 2: The face track extraction method. Faces of the same person appearing in one shot are grouped into one face track.

To compute the similarity of two face tracks, we can adopt the idea of hierarchical agglomerative clustering, in which each cluster is a face track and the distance between clusters is the distance between face tracks. There are two common approaches following this idea.

1. Single Linkage Clustering based Distance: the distance between two clusters (i.e., two face tracks) is defined as the minimum distance between elements (i.e., faces) of the two clusters:

$$D(A, B) = \min_{x \in A,\, y \in B} d(x, y) \qquad (1)$$

where A and B are face tracks, and x and y are faces of A and B, respectively. This method is widely used in many state-of-the-art methods [10, 3, 12, 2].

2. Average Linkage Clustering based Distance: the distance between two clusters (i.e., two face tracks) is defined as the mean distance between elements (i.e., faces) of the two clusters:

$$D(A, B) = \frac{1}{|A| \cdot |B|} \sum_{x \in A} \sum_{y \in B} d(x, y) \qquad (2)$$

where A and B are face tracks, and x and y are faces of A and B, respectively.

Both methods lead to a huge amount of computation because they employ pair-wise matching, and face tracks usually contain a large number of faces. To reduce the computational cost, representative faces can be used for matching instead of all the faces in the face tracks. One intuitive way of doing so is to choose the middle face of the track as its representative. However, this does not work well when the face tracks have large variations, as shown in Figure 3.

We propose a robust but computationally cheap matching method called k-Faces. Inspired by the idea of selecting representatives to reduce computational cost, it overcomes the aforementioned weakness. For each face track, we select one representative face, and the similarity between two face tracks is the similarity between their two representatives. The representative face is the mean face of a subset selected from the original set of faces in the face track. In particular, k-Faces selects the representative face of each face track in the following steps (see the code sketch at the end of this subsection):

1. Divide the face track into k equal parts according to its temporal order. For example, with k equal to five, a face track F containing 100 faces is divided into five parts, each comprising 20 faces extracted from consecutive frames.

2. For each part, select the middle face as the representative of that part. We thus obtain a subset of k faces from the original set of faces in the face track.

3. Compute the mean face of this subset of k faces. Note that the mean face may not be a real face: we define the mean face (or representative face) as the 'face' whose feature vector is the average of the feature vectors of the k faces from the previous step.

4. Finally, compute the Euclidean distance between the two 'mean faces'.

We expect that averaging multiple faces smooths out variations and therefore produces a better representative face. In this way, we can achieve high retrieval accuracy while maintaining low computational cost. Figure 4 gives an example of selecting representatives when k = 3.

Given a query face track, we compute the similarity between the query and each face track in the database; a list of nearest neighbors can then be returned quickly by employing the LSH (Locality-Sensitive Hashing) technique for indexing [4].
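To make the three matching schemes concrete, here is a minimal sketch of Min-Min, Avg-Min, and k-Faces, assuming each face track is a NumPy array of per-face feature vectors (e.g., the LBP histograms above) stacked in temporal order. Function and variable names are ours, not from the paper.

```python
import numpy as np

def pairwise_dists(A, B):
    """All Euclidean distances between faces of tracks A (m, d) and B (n, d)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def min_min(A, B):
    """Single-linkage distance (Min-Min): minimum over all face pairs, eq. (1)."""
    return pairwise_dists(A, B).min()

def avg_min(A, B):
    """Average-linkage distance (Avg-Min): mean over all face pairs, eq. (2)."""
    return pairwise_dists(A, B).mean()

def k_faces_representative(track, k=5):
    """k-Faces representative: split the track into k temporal parts, take
    the middle face of each part, and average their feature vectors.
    Assumes the track has at least k faces (short tracks are filtered out)."""
    parts = np.array_split(track, k)              # k near-equal temporal parts
    mids = np.stack([p[len(p) // 2] for p in parts])
    return mids.mean(axis=0)                      # the 'mean face' vector

def k_faces(A, B, k=5):
    """k-Faces distance: Euclidean distance between the two mean faces."""
    return np.linalg.norm(k_faces_representative(A, k) -
                          k_faces_representative(B, k))
```

Note the asymptotic gap: min_min and avg_min compare |A| × |B| face pairs, while k_faces touches only 2k vectors per comparison (and each track's representative can be precomputed once), which is consistent with the speed gap reported in Table 2.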

3. EXPERIMENTS

3.1 Dataset

We used the TRECVID news video datasets from 2004 to 2006. These datasets contain about 370 hours of video broadcasts in several languages, including English, Chinese, and Arabic. The total number of frames we processed was about 35 million, from which 157,524 face tracks with about 20 million faces were extracted. This amount is much larger than in previous studies such as [1, 9].

Figure 3: Selecting the middle face of a face track might lead to poor matching performance because variation information is lost.

Table 1: MAP accuracy of four methods: Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

Method          MAP (%)
Min-Min         56.93
k-Faces (k=5)   54.97
Avg-Min         53.69
Single Face     46.46

Figure 4: The k-Faces matching method. A subset of k faces is selected from the face track to compute the mean face.

We filtered out short face tracks with fewer than ten faces, which left 35,836 face tracks. From these, 49 people were annotated in 1,511 face tracks containing 457,320 faces, among them political figures such as George W. Bush, Hu Jintao, and Saddam Hussein. Figure 5 shows statistics of the evaluated dataset.

3.2 Evaluation Criteria

We evaluated performance using measures commonly used in information retrieval: precision, recall, and average precision. Given a query face track of a person, let N_ret be the total number of face tracks returned, N_rel the number of relevant face tracks among them, and N_hit the total number of relevant face tracks in the dataset. Recall and precision are computed as follows:

$$\mathrm{Recall} = \frac{N_{rel}}{N_{hit}} \qquad (3)$$

$$\mathrm{Precision} = \frac{N_{rel}}{N_{ret}} \qquad (4)$$

Average precision (AP) emphasizes returning relevant face tracks earlier. It is computed with the following formula:

$$\mathrm{AveragePrecision} = \frac{\sum_{r=1}^{N_{ret}} \mathrm{Precision}(r) \times \mathrm{rel}(r)}{N_{hit}} \qquad (5)$$

where r is the rank, N_ret the number of face tracks returned, N_hit the total number of relevant face tracks, rel() a binary function indicating the relevance of the face track at a given rank, and Precision() the precision at a given cut-off rank. In addition, to evaluate performance over multiple queries, we used mean average precision (MAP), the mean of the average precisions computed over all queries.
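As a sanity check, these metrics can be computed directly from a ranked list of relevance labels. The small sketch below (names are ours) mirrors equations (3)-(5):

```python
import numpy as np

def average_precision(ranked_rel, n_hit):
    """AP per equation (5): ranked_rel is a 0/1 relevance list over the
    returned ranking; n_hit is the total number of relevant tracks (N_hit)."""
    rel = np.asarray(ranked_rel, dtype=float)
    prec_at_r = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # Precision(r)
    return float((prec_at_r * rel).sum() / n_hit)

def mean_average_precision(per_query):
    """MAP: mean AP over (ranked_rel, n_hit) pairs, one per query."""
    return float(np.mean([average_precision(r, n) for r, n in per_query]))

# Example: 3 relevant tracks in total, two retrieved at ranks 1 and 3
# -> AP = (1/1 + 2/3) / 3 ~= 0.556
print(average_precision([1, 0, 1], 3))
```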

3.3 Results

We compared k-Faces with three other face matching methods: the single-face method (i.e., k-Faces with k=1, picking the middle face of the face track as the representative face for matching), the Single Linkage Clustering based method, and the Average Linkage Clustering based method. For brevity, they are referred to as k-Faces, Single Face, Min-Min, and Avg-Min, respectively.

Figure 6: Precision-Recall of Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

The LBP feature configuration is 3×3×59: the input face is divided by a 3×3 grid, and the patterns of each cell are quantized into a 59-bin histogram. As shown in Table 1 and Figure 6, using one face to represent a face track (i.e., Single Face) gave the worst results. Min-Min gave the best result, and k-Faces was comparable with Avg-Min and Min-Min. The Single Face method uses the middle faces to estimate the distance between two face tracks. It is obviously fast: experimental results showed that Single Face takes only six seconds on our dataset of 1,511 face tracks. However, because real-life videos have large variations, the method fails when the middle faces of two face tracks differ in pose, illumination conditions, etc. In contrast, matching whole face tracks, which considers multiple faces, avoids this obstacle. Figure 7 shows an example of the weakness of Single Face. For the query face track Q in Figure 7a, Single Face ranked the relevant face track A 10th (see Figure 7b) and face track B 43rd (see Figure 7c). The face in the rectangle is the representative (middle) face chosen by Single Face. The middle faces of Q and A are similar in pose, while B's differs from Q's.

Figure 5: Statistics of the evaluated dataset.

Figure 7: (a) The query face track Q. (b) The returned face track A. (c) The returned face track B. The middle face is the one in the rectangle. The faces shown are sampled from the full face tracks, which contain too many faces to display.

Table 2: Computational cost of four methods: Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

Method          Time (seconds)
Min-Min         124,393
Avg-Min         124,119
k-Faces (k=5)   19
Single Face     6

This explains why Single Face wrongly ranked A higher than B.

From Table 1 and Figure 6, we note that k-Faces is comparable in performance to the other face track based matching methods, Min-Min and Avg-Min. In particular, its MAP was 1.28% higher than Avg-Min's and slightly lower (by 1.96%) than Min-Min's. However, k-Faces's advantage in speed is impressive: it is over 6,500 times faster than Avg-Min and Min-Min (see Table 2). k-Faces's processing time is slightly higher than Single Face's due to the cost of computing the 'mean face', but this cost is much smaller than that of computing all pairwise distances in Min-Min.

Avg-Min often gives bad results because the faces in a face track vary greatly from beginning to end. Averaging all pairwise distances helps eliminate noise, but it also makes the estimated distance deviate from the actual observation, so a relevant face track may no longer be judged relevant. For example, given two face tracks A and B of the same person, if most of the faces in A are turned left while those in B are turned right, the average distance will be large, and the tracks will be judged less relevant to each other. In contrast, by selecting an appropriate k, we can pick a representative for each variation and avoid letting the majority override the minority. In Figure 8, both the query face track Q and the relevant face track R have large variations. R is ranked third by k-Faces (k = 5) but 94th by Avg-Min.

Although Min-Min is better than k-Faces overall, there are still cases where Min-Min gives worse results. For example, two face tracks belonging to different people can be considered a match by Min-Min because they contain two faces that are unexpectedly similar, as shown in Figure 9. Given a query face track Q, face track A contains the same person as Q, and face track B contains a different person. Min-Min ranked B third and A 11th, after many irrelevant face tracks. Because of large variations, Min-Min could not find a suitable minimum pair for face track A (see Figure 9a), while faces in Q and B were very similar in pose and illumination conditions (see Figure 9b). In contrast, k-Faces, by averaging subsets of k faces of the face tracks, formed feature vectors that revealed the differences better. In this example, k-Faces ranked A at position 3 and the irrelevant face track B at a more reasonable position of 196.

However, the k-Faces method depends on the choice of an appropriate subset of faces. Figure 10 shows an example in which Min-Min is better than k-Faces. Min-Min succeeded in finding an extremely similar pair of faces (see Figure 10a), whereas k-Faces selected faces that differed greatly in pose, so the mean faces also differed (see Figures 10b and 10c). The question raised here is therefore how to choose an appropriate k so that we can achieve high accuracy in retrieval while maintaining low computational cost.

Figure 11: The performance of the k-Faces method for different k, evaluated by MAP. Note that when k = 1, the 'mean face' corresponds to the middle face in the feature space (i.e., Single Face).

Figure 12: The computational cost of the k-Faces method for different k.

To answer that question, we investigated different values of the parameter k. Figure 11 and Figure 12 show the accuracy (MAP) and the computational cost for each k. The computational cost increases linearly with k, while the performance becomes stable from k = 5 onward. We therefore conclude that simply selecting the k with the highest MAP is not a good solution because of the trade-off between accuracy and computational cost: too small a k gives poor results, while too large a k gives only marginally better results and unnecessarily consumes time.

We also compared the performance of our method with that of a method using k-means clustering to select representative faces for each face track. Figure 13 shows that the performances are comparable across feature configurations, while our method is simpler and requires less computational cost. This figure can also be used to weigh accuracy against processing speed: for example, feature 3.3.10, extracted from a 3×3 grid and quantized into 10 bins, has 90 dimensions and can be computed faster than feature 3.3.59.

4. CONCLUSIONS

The proposed method offers a robust but computationally cheap way of performing face retrieval in videos compared to existing baseline methods.

Figure 8: (a) Faces in the queried face track. (b) Faces in the relevant face track R. (c) Five representative faces picked from the queried face track. (d) Five representative faces picked from the relevant face track R.

Figure 9: (a) Minimum pair computed by Min-Min that contains relevant people (left: query, right: relevant face track A). (b) Minimum pair computed by Min-Min that contains irrelevant people (left: query, right: irrelevant face track B). (c) Five representative faces picked from the queried face track. (d) Five representative faces picked from A. (e) Five representative faces picked from B.

Figure 10: (a) Minimum pair computed by Min-Min (left: query, right: relevant face track). (b) Five representative faces picked from the queried face track. (c) Five representative faces picked from the relevant face track.

Figure 13: The performance of the proposed method compared with that of a method using k-means clustering to select representative faces for each face track.

Real-life video data is huge, and the faces in it have large variations; hence, efficient methods like ours are essential. In future work, we will investigate principled ways of selecting appropriate values of k and features for representing faces in order to further improve retrieval performance.

5. REFERENCES

[1] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who's in the picture? In Advances in Neural Information Processing Systems, 2004.
[2] M. Everingham, J. Sivic, and A. Zisserman. "Hello, My name is... Buffy" – automatic naming of characters in TV video. In Proc. British Machine Vision Conf., pages 899–908, 2006.
[3] A. Hadid and M. Pietikainen. From still image to video-based face recognition: An experimental analysis. In Proc. Intl. Conf. on Automatic Face and Gesture Recognition, pages 813–818, 2004.
[4] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Symposium on Theory of Computing, pages 604–613, 1998.
[5] D.-D. Le, S. Satoh, M. Houle, and D. Nguyen. Finding important people in large news video databases using multimodal and clustering analysis. In Proc. 2nd IEEE Intl. Workshop on Multimedia Databases and Data Management, pages 127–136, 2007.
[6] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 340–345, 2003.
[7] T. Ngo, D.-D. Le, S. Satoh, and D. Duong. Robust face track finding in video using tracked points. In Proc. Intl. Conf. on Signal-Image Technology & Internet-Based Systems, pages 59–64, 2008.
[8] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
[9] D. Ramanan, S. Baker, and S. Kakade. Leveraging archival video for building face datasets. In Proc. Intl. Conf. on Computer Vision, volume 1, pages 1–8, 2007.
[10] S. Satoh and N. Katayama. An efficient implementation and evaluation of robust face sequence matching. In Proc. 10th Intl. Conf. on Image Analysis and Processing, pages 266–271.
[11] S. Satoh, Y. Nakamura, and T. Kanade. Name-It: Naming and detecting faces in news videos. IEEE Multimedia, 6(1):22–35, 1999.
[12] J. Sivic, M. Everingham, and A. Zisserman. Person spotting: Video shot retrieval for face sets. In Proc. Intl. Conf. on Image and Video Retrieval, pages 226–236, 2005.
[13] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" – learning person specific classifiers from video. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, pages 1145–1152, 2009.
[14] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proc. 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330, 2006.
[15] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.