Real-Time 2D Face Detection and Features-Based Tracking in Video Sequences

Guillaume Lemaître, Eloisa Vargiu, José Antonio Lorenzo Fernández, and Felip Miralles

Barcelona Digital, Roc Boronat 117, 08018 Barcelona, Spain
[email protected]; {evargiu, jalorenzo, fmiralles}@bdigital.org
WWW home page: http://www.bdigital.org/

Abstract. In this paper, we propose a system aimed at detecting and tracking 2D human faces. The system is designed to perform in real-time and to be robust to partial occlusions, illumination changes and non-rigid transformations of the faces. After detecting video shot boundaries, multi-view detectors localize frontal and profile faces, and a features-based tracking algorithm tracks them over time, even under non-rigid deformations. The system is validated on different video sequences, performing at an average frame rate of 20 fps. Results show a clear improvement (in recall, precision and processing time) over tracking-by-detection.

Keywords: real-time, face detection, face tracking, multimedia videos, hard-cut detection, SIFT

1 Introduction

The motivation behind our contribution is to develop an automatic system for the semantic annotation of video frames in the context of a video retrieval system. Manually tagging frames is a very time-consuming task, and our aim is to provide tools that help the human operators of the system. One very relevant piece of information about the content of a video frame is whether people appear in it, which we address in this contribution by detecting their faces. This is the first stage towards more advanced semantic tagging of the video, such as face recognition. Face detection and tracking find applications in various areas of computer vision, such as video structuring, indexing and visual surveillance. Our contribution is part of a broader project¹ on content retrieval in multimedia videos, where part of our motivation is to propose fast automatic content tagging. The diversity of our video dataset implies being able to deal with videos ranging from high to low resolution. Human detection and tracking is part of our problem and is addressed using a 2D face detection and tracking approach.

¹ The project name is omitted for blind review.

The face detection-tracking problem can be tackled in two different ways [1]: (i) tracking-by-detection approaches or (ii) integrated detection and tracking approaches. In tracking-by-detection methods, an exhaustive search for faces is performed in each frame independently. This approach is made practical by fast face detection algorithms, such as those proposed in [2,3]. However, these methods present a drawback: the fully exhaustive search inside each image makes the processing speed highly dependent on the image resolution.

In integrated detection and tracking methods, faces are detected during an initialisation step (i.e., in the first frame) and then tracked through the video sequence by an independent algorithm. Face tracking can be divided into two categories [4]: (i) facial feature tracking and (ii) head tracking. The first group includes trackers which follow the eyes, the mouth and other meaningful facial features, and requires an independent tracker for each of these features. Head tracking is a more global approach, using information from the entire head, and can be organized as [4]: region-based, colour-based, shape-based or model-based. Colour-based approaches are not illumination-invariant [1]. Shape-based [5] and model-based [6] approaches rely on statistical models (of shape and appearance) and can deform to match shape and appearance variations of faces; however, they require an exhaustive training stage [5,6].

In this work, we build on an approach that combines a fast multi-view face detector [2,18], usually used in a tracking-by-detection context, with a pure features-based tracking algorithm. However, instead of using facial features, we propose to use generalized features through a combination of the Scale-Invariant Feature Transform (SIFT) [7] and Speeded Up Robust Features (SURF) [8] detector-descriptors. This approach allows us to overcome: (i) partial occlusions, (ii) illumination changes and (iii) non-rigid deformations. Moreover, we follow the scheme introduced in [9] by subdividing videos into multiple shots, combining two shot boundary detectors: Colour Histogram Differences (CHD) [10] and Edge Change Ratio (ECR) [11].

The rest of the paper is organized as follows. Section 2 presents the overall system as well as each of its modules: (i) shot boundaries detection, (ii) face detection and (iii) face tracking. Experiments and results are discussed in Section 3, including a comparison with tracking-by-detection algorithms. In Section 4, we summarize our contribution and propose some future directions.

2 The Proposed System

As previously mentioned, our system has to perform under several conditions. Due to the project constraints, it has to run in “real-time” on both high- and low-resolution multimedia videos. It has to deal with scenarios where new faces move in and out of the frame, so face re-detection is essential, as well as with multi-shot videos. Moreover, it has to be robust to illumination changes and partial occlusions, and flexible to non-rigid deformations.

This section describes the overall system as well as each of the modules composing it.

Fig. 1. The proposed face detection-tracking system.

2.1 Overview

The proposed system is composed of three main modules (see Fig. 1): the first finds shot transitions so that each shot of the video can be processed independently; the second detects faces; and the third tracks them. First, the shot transition detector subdivides the video into continuous shots, as proposed in [9]. Then, for each sub-sequence, an initial face detection is performed on the first frame. The faces found are tracked using the algorithm presented in Section 2.4. In order to handle faces entering and exiting the frame, a re-detection has to be carried out. Based on the approach proposed in [12], the face detection algorithm (Section 2.3) is run only on the changed parts of a frame. This re-detection is performed every 15 frames. Changed parts are computed by subtracting two frames separated by n images (with n fixed to 15 in our experiments); the difference image is thresholded and filtered using a median filter combined with some morphological operations (i.e., opening and closing). To avoid detecting the same face multiple times, every area where a face is already being tracked is removed from this mask image. Finally, each newly found face is tracked using the face tracking algorithm, which runs at every frame. The output provides the locations of the detected faces.
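A minimal sketch of this change-mask computation with OpenCV's Python bindings is given below; the threshold value, filter aperture and kernel size are illustrative assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def change_mask(frame, frame_n_ago, tracked_boxes, thresh=25):
    """Binary mask of the changed parts between two frames n images apart,
    with the regions of currently tracked faces removed."""
    gray_now = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_old = cv2.cvtColor(frame_n_ago, cv2.COLOR_BGR2GRAY)

    # Subtract the two frames and threshold the difference image.
    diff = cv2.absdiff(gray_now, gray_old)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)

    # Median filter plus morphological opening and closing, as in the text.
    mask = cv2.medianBlur(mask, 5)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Remove every area where a face is already tracked, so the detector
    # is not re-run on faces we already follow.
    for (x, y, w, h) in tracked_boxes:
        mask[y:y + h, x:x + w] = 0
    return mask
```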

2.2 Shot Boundaries Detection

Automatic shot boundary detection is an important field of research, since it is the first step in simplifying any further processing of a video. A shot is defined as an unbroken sequence of frames from a single camera. Thus, in our case, shot detection is necessary in order to process each independent video sequence separately. Transitions between shots can be classified into two main categories: hard cuts and gradual transitions. In this work, we focus only on the former, detecting it in two stages: a scoring stage and a detection stage.

Scoring Stage. We propose an automatic shot boundary detection approach where the score is estimated using a combination of Colour Histogram Differences (CHD) and Edge Change Ratio (ECR), the algorithms introduced by Ueda et al. [10] and Zabih et al. [11], respectively. CHD, the most robust histogram-based detection algorithm [13], is insensitive to motion. Unfortunately, its main drawback lies in the fact that two images can have the same colour histograms while their content differs. ECR solves this issue, since it is based on edges and, therefore, directly on the image content. However, ECR has a drawback of its own: it is only partially motion-insensitive, which tends to increase the false hit ratio. By combining the CHD and ECR scores, both drawbacks are overcome. The corresponding score is calculated as follows:

$$
S_i = \frac{1}{2}\left[\frac{1}{N}\sum_{r=0}^{2^B-1}\sum_{g=0}^{2^B-1}\sum_{b=0}^{2^B-1}\bigl|p_i(r,g,b) - p_{i-1}(r,g,b)\bigr| + \max\!\left(\frac{X_i^{\text{in}}}{\sigma_i},\; \frac{X_{i-1}^{\text{out}}}{\sigma_{i-1}}\right)\right] \tag{1}
$$

where p_i(r, g, b) is the number of pixels of colour (r, g, b) in frame I_i, N is the number of pixels in a frame, and each colour component is sampled to 2^B values, so that r, g, b ∈ [0, 2^B − 1]. X_i^in and X_{i−1}^out are, respectively, the number of edge pixels entering frame I_i and exiting frame I_{i−1}, while σ_i and σ_{i−1} denote the number of edge pixels in I_i and I_{i−1}.
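The following sketch shows how Eq. (1) could be computed with OpenCV. It is a minimal illustration following the usual ECR formulation: the number of histogram bits B, the Canny thresholds and the dilation width are illustrative assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def shot_score(prev, curr, B=4, dilate_iter=2):
    """Combined CHD+ECR score of Eq. (1) between two consecutive frames."""
    n_pixels = curr.shape[0] * curr.shape[1]
    bins = 2 ** B

    # CHD term: L1 distance between 3D colour histograms, normalised by N.
    h_prev = cv2.calcHist([prev], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    h_curr = cv2.calcHist([curr], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    chd = np.abs(h_curr - h_prev).sum() / n_pixels

    # ECR term: edge pixels entering I_i and exiting I_{i-1}. Dilating the
    # edge maps tolerates small motion, as in the usual ECR formulation.
    e_prev = cv2.Canny(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY), 100, 200) > 0
    e_curr = cv2.Canny(cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY), 100, 200) > 0
    kernel = np.ones((3, 3), np.uint8)
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel, iterations=dilate_iter) > 0
    d_curr = cv2.dilate(e_curr.astype(np.uint8), kernel, iterations=dilate_iter) > 0
    sigma_prev = max(int(e_prev.sum()), 1)   # edge pixels in I_{i-1}
    sigma_curr = max(int(e_curr.sum()), 1)   # edge pixels in I_i
    x_in = np.logical_and(e_curr, ~d_prev).sum()   # edges appearing in I_i
    x_out = np.logical_and(e_prev, ~d_curr).sum()  # edges vanishing from I_{i-1}
    ecr = max(x_in / sigma_curr, x_out / sigma_prev)

    return 0.5 * (chd + ecr)
```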

Fig. 2. Shot transitions are represented by peaks. (a) Scoring signal (blue) and the detected baseline (red); the scoring signal is corrected by subtracting the baseline. (b) Corrected scoring signal (blue) and the threshold found using Otsu's method (red line).

Detection Stage. Instead of defining a global threshold (as done in [13]), we develop a decision stage that incorporates a pre-processing step aimed at correcting the previously computed scoring signal. The correction is done by detecting and subtracting the baseline of the scoring signal². In our case, the baseline is estimated by fitting a low-degree polynomial using least-squares minimization, as shown in Fig. 2(a). Once the scoring signal is corrected, shot transitions are detected using the unsupervised clustering method proposed by Otsu [15], which finds the optimal threshold between two classes by minimizing the intra-class variance. A shot cut is considered to occur whenever the signal rises above the threshold (see Fig. 2(b)).
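A minimal sketch of this detection stage, assuming NumPy and OpenCV; the polynomial degree is an illustrative choice, since the paper only specifies a "low degree".

```python
import cv2
import numpy as np

def detect_cuts(scores, degree=3):
    """Baseline-correct the score signal and threshold it with Otsu's method."""
    x = np.arange(len(scores))

    # Baseline estimation by least-squares polynomial fitting.
    baseline = np.polyval(np.polyfit(x, scores, degree), x)
    corrected = np.clip(scores - baseline, 0.0, None)

    # Otsu thresholding on the corrected signal, rescaled to 8 bits since
    # cv2.threshold expects a single-channel uint8 array.
    peak = max(float(corrected.max()), 1e-9)
    scaled = np.uint8(255 * corrected / peak)
    t, _ = cv2.threshold(scaled.reshape(-1, 1), 0, 255,
                         cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    threshold = t / 255.0 * peak

    # A cut is declared wherever the corrected score rises above the threshold.
    return np.where(corrected > threshold)[0]
```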

2.3 Face Detection

As the face detection algorithm, we adopt the approach proposed by Viola and Jones [2], a machine-learning approach for object detection that is able to process images extremely rapidly while achieving high detection rates for frontal faces. In their work, images are represented by a technique called the “Integral Image”, which allows the features used by the detector to be computed very quickly. Features are then selected by a learning algorithm based on AdaBoost [16]. Finally, a “cascade” of increasingly complex classifiers quickly disregards background regions while spending more computation on promising object-like regions. Lienhart proposed an extension that modifies the set of Haar-like features in order to make the detector more robust and to avoid false alarms [17]. Being interested in detecting both frontal and profile faces, we use the approach proposed in [18], which combines a frontal and a profile cascade and can be seen as a fast multi-view face detector. The implementation was done using the OpenCV library³.

² Baseline detection is a common problem tackled in NMR and MRI spectroscopy [14].
³ The OpenCV library is available at the following link: http://sourceforge.net/projects/opencvlibrary/. Version 2.3 was used.

The output of this module is the set of rectangles around the locations of the detected faces. These rectangles are passed to the tracker.
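A possible realisation of this module with OpenCV's Python bindings is sketched below; the cascade file names and detectMultiScale parameters are assumptions, since the paper does not specify them.

```python
import cv2

# Stock OpenCV cascades; paths are illustrative.
frontal = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")
profile = cv2.CascadeClassifier("haarcascade_profileface.xml")

def detect_faces(frame, mask=None):
    """Run frontal and profile cascades, optionally restricted to a change mask."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if mask is not None:
        gray = cv2.bitwise_and(gray, gray, mask=mask)  # changed parts only
    faces = list(frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4))
    faces += list(profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4))
    # Note: the stock profile cascade covers one side only; mirroring the
    # frame and re-running it would cover the other side.
    return faces  # list of (x, y, w, h) rectangles
```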

2.4 Features-Based Face Tracking

A multi-cascade face detector is computationally more expensive than a single cascade classifier. Therefore, to avoid a tracking-by-detection strategy, we developed a pure tracking module that follows the faces already detected. Let us point out that, in order to be competitive, the proposed algorithm should be robust to occlusions and illumination changes. To this end, we determine the new location of a face by estimating the position and deformation of the polygon around it. This is done by finding corresponding points inside the polygon between two consecutive frames, then estimating the affine transformation, followed by a warping. Thus, the algorithm is composed of three tasks: (i) correspondences matching, to find corresponding points in two consecutive frames; (ii) affine transformation estimation; and (iii) warping refinement, to estimate the deformation of the original rectangle.

Correspondences Matching. First, in order to later estimate the affine displacement of the face, corresponding features of the target have to be found between the images at times t and t + 1. Hence, we extract features in the target area returned by the face detector using the difference of Gaussians (DoG), as proposed in [7], and the scale-normalized determinant of the Hessian (DoH), as proposed in [8]. These two detectors combined find points of interest that are invariant to uniform scaling, orientation, affine distortion and illumination changes [7], [19], the last of these properties addressing the problem of illumination changes. The next step consists of describing the detected features so that they can be matched between times t and t + 1. To find correspondences, the Scale-Invariant Feature Transform (SIFT) descriptor proposed by Lowe [7] is adopted.

Affine Estimation. Before estimating the affine transformation, outliers are discarded using RANSAC [20]. Then, the affine parameters (i.e., rotation, translation and scale) are estimated using the least-median-of-squares approach [21]. This affine estimation can be done in the presence of partial occlusions, since only three inliers are needed to find the transformation.

Warping Refinement. As already pointed out, model-based and shape-based methods have the drawback of requiring an exhaustive training stage in order to learn the face model and its deformations. However, these methods offer an appealing feature: they are flexible to non-rigid deformations of the faces. In order to integrate this feature, we adopt a warping refinement that fits non-rigid deformations, computed using the thin-plate splines proposed by Bookstein [22]. Although this approach is less efficient than model-based or

shape-based approaches, it has the advantage of not requiring any prior knowledge or training stage.
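The sketch below condenses one tracking step, assuming OpenCV's Python bindings with SIFT available (the SURF/DoH detector lives in opencv-contrib and is omitted here, and the thin-plate-spline refinement of [22] is likewise left out). cv2.estimateAffinePartial2D folds robust outlier rejection and affine estimation into a single call, so it stands in for the paper's separate RANSAC and least-median steps.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()              # DoG detector + SIFT descriptor
matcher = cv2.BFMatcher(cv2.NORM_L2)

def track_step(patch_prev, patch_curr):
    """Estimate the affine motion of a face patch between frames t and t+1."""
    kp1, des1 = sift.detectAndCompute(patch_prev, None)
    kp2, des2 = sift.detectAndCompute(patch_curr, None)
    if des1 is None or des2 is None:
        return None

    # Lowe's ratio test keeps only distinctive correspondences.
    good = [p[0] for p in matcher.knnMatch(des1, des2, k=2)
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 3:  # three inliers suffice for rotation+translation+scale
        return None

    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])

    # Robust estimation of rotation, translation and scale.
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.LMEDS)
    return A  # 2x3 matrix mapping the old face polygon onto the new frame
```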

3 Experiments and Results

We are interested in testing the proposed system in real-world scenarios. To that end, we evaluated the combination of CHD and ECR on two videos: Project⁴ and a sitcom episode⁵. Moreover, to show the limits of the CHD algorithm when a single colour dominates, we also selected a synthetic video, Cartoon⁶. Table 1 gives a brief description of the videos used.

Table 1. The test video set

                      Project      Sitcom episode  Sitcom extract  Cartoon    Web-cam
duration (mm:ss)      03:10        23:15           00:43           02:39      00:07
resolution (pixels)   1280 × 720   720 × 480       720 × 480       854 × 480  640 × 480
# of frames           4774         34885           1292            4767       −
# of cuts             22           276             −               34         −
# of person(s)        1            −               2               −          1

The overall system was then evaluated on Project, on an extract of the Sitcom episode, and on a web-cam stream⁷.

3.1 Shot Boundaries Detection Results

The performance of the system was evaluated using precision and recall, as defined in [13]. Table 2 summarizes the results of the proposed shot boundary detector. Two main results can be pointed out. On the one hand, combining CHD and ECR increases the hit rate over the CHD detector alone, which can fail when the colour histograms across a transition are very similar. On the other hand, the combination decreases the false hits that ECR produces due to camera motion. Unfortunately, due to the lack of an available implementation, it was not feasible to perform comparisons against ECR alone.
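For reference, the usual definitions are sketched below; we assume they coincide with those of [13], which the paper does not restate:

```latex
% Assumed standard shot-boundary precision/recall, following [13].
\mathrm{precision} = \frac{\#\,\text{correct detections}}{\#\,\text{correct detections} + \#\,\text{false hits}},
\qquad
\mathrm{recall} = \frac{\#\,\text{correct detections}}{\#\,\text{correct detections} + \#\,\text{missed cuts}}
```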

⁴ The project name is omitted for blind review.
⁵ The video is the sitcom IT Crowd (first season, first episode).
⁶ This video can be found at the following link: http://www.youtube.com/watch?v=r-VHmLn5_q4.
⁷ Part of the experiment can be seen at the following link: http://www.youtube.com/watch?v=PwKsbtWOp0M.

Table 2. Precision and recall of the different algorithms tested on the test video set

                       Project   Sitcom episode   Cartoon
CHD       Precision    100%      98.3%            100%
          Recall       100%      83.8%            61.2%
CHD+ECR   Precision    100%      99.6%            100%
          Recall       100%      93.8%            97.1%

3.2 Overall Results

In this section, a quantitative evaluation and a time analysis of the tracking-by-detection approaches and of our approach are carried out on two selected videos. In order to analyse the performance of the proposed system, we compared it against two tracking-by-detection algorithms: the original frontal face detector of Viola and Jones [2] and its multi-view variant [18] (i.e., the one used in this work), which allows both frontal and profile detection.

Table 3. Overall system evaluation

                             Project                         Sitcom
                        Precision  Recall  Time Proc.   Precision  Recall  Time Proc.
Frontal Viola           50.3%      76.0%   1.5 fps      72.3%      75.9%   3.0 fps
Frontal-Profile Viola   42.3%      87.4%   0.3 fps      73.7%      95.4%   1.3 fps
Our system              87.9%      95.1%   ≈ 20 fps     100.0%     96.8%   ≈ 20 fps

Project video. Let us introduce the scenario of the Project video: shots are taken from several cameras, allowing a person to be followed throughout an entire apartment. Example shots are depicted in Fig. 3 (top). Even with a high-resolution video (1280 × 720), the proposed system performs in “real-time”, at about 20 fps, while both tracking-by-detection strategies are far more time consuming (1.5 fps and 0.3 fps, see Table 3). Table 3 also shows the resulting precision and recall for the different strategies. Using the multi-view detector increases the hit rate, with recall rising from 76% to 87.4%; however, the number of false detections also grows, with precision dropping from 50.3% to 42.3%. The proposed system outperforms both methods for two reasons: (i) by running the face detector only on changed parts, the number of false alarms is greatly reduced (precision of 87.9%); (ii) recall rises to 95.1% thanks to the ability of our tracking algorithm to follow faces in complicated poses (see Fig. 3(c), 3(d)) where the multi-view face detector fails. Figure 3 (bottom) illustrates a comparison of the trajectories given by

Fig. 3. TOP: (a) Shot 1; (b) Shot 2; (c) Shot 3; (d) Shot 4 — scenario where a person is tracked by several cameras inside an apartment. BOTTOM: comparison of the trajectory (x-centre position) between the ground-truth (blue circles) and the output of the proposed system (red crosses). The proposed system is robust, without drifting over time; nevertheless, two moments can be noticed where the face detector failed.

the ground-truth and those returned by the proposed system. It can be seen that the proposed system is robust over time, without any drift.

Sitcom extract. We invite the reader to consult the following link in order to evaluate the results visually: http://www.youtube.com/watch?v=pWQ1esUdr6A. As in the previous scenario, our system outperforms the tracking-by-detection methods while still offering “real-time” processing, running at 20 fps. The precision and recall obtained are 100.0% and 96.8%, respectively. The improvement over the tracking-by-detection algorithms is due to the same reasons pointed out in the previous section.

Fig. 4. Example of tracking using web-cam streaming showing the robustness of the proposed system to partial occlusions and non-rigid transformations.

Web-cam streaming. In order to explicitly show the ability of the proposed system to handle partial occlusions and non-rigid deformations, we propose a simple experiment using a web-cam stream. The result can be consulted at the following link: http://www.youtube.com/watch?v=PwKsbtWOp0M and is also depicted in Fig. 4.

4 Conclusions and Future Work

In this paper, a system based on the combination of a fast multi-view face detector with a pure features-based tracking algorithm has been presented. The proposed system detects profile and frontal faces in “real-time” and tracks them under difficult conditions (partial occlusions, illumination changes, non-rigid deformations), increasing the performance over a tracking-by-detection-only strategy. We have demonstrated the results on three different video sequences. Experimental results show excellent performance in the face of different scenarios, such as pose changes, partial occlusions and non-rigid deformations. As future work, we plan to integrate the system with a Kalman filter to increase robustness. Furthermore, we are setting up further experiments on other video sequences within our project.

References

1. Verma, R.C., Schmid, C., Mikolajczyk, K.: Face detection and tracking in a video by propagating detection probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 1215–1228
2. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57 (2004) 137–154
3. Kalal, Z., Matas, J., Mikolajczyk, K.: Weighted sampling for large-scale boosting. Methods 1 (2008) 22
4. Wang, H., Wang, Y., Cao, Y.: Video-based face recognition. World Academy of Science, Engineering and Technology 60 (2009) 293–302
5. DeCarlo, D., Metaxas, D.: Deformable model-based face shape and motion estimation. In: Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (1996) 146–150
6. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 681–685
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60 (2004) 91–110
8. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110 (2008) 346–359
9. Liu, Z., Wang, Y.: Face detection and tracking in video using dynamic programming. In: ICIP (2000)
10. Ueda, H., Miyatake, T., Yoshizawa, S.: IMPACT: an interactive natural-motion-picture dedicated multimedia authoring system. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '91), New York, NY, USA, ACM (1991) 343–350
11. Zabih, R., Miller, J., Mai, K.: A feature-based algorithm for detecting and classifying scene breaks. In: Proceedings of the Third ACM International Conference on Multimedia (MULTIMEDIA '95), New York, NY, USA, ACM (1995) 189–200
12. Nechyba, M.C., Brandy, L., Schneiderman, H.: PittPatt face detection and tracking for the CLEAR 2007 evaluation. In: CLEAR (2007) 126–137
13. Lienhart, R.: Reliable transition detection in videos: A survey and practitioners guide. International Journal of Image and Graphics 1 (2001) 469–486
14. Lemaître, G., Walker, P.M.: Absolute quantification in 1H MRSI of the prostate at 3T. Master's thesis, Université de Bourgogne, Universitat de Girona, Heriot-Watt University (2011)
15. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 62–66
16. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Proceedings of the Second European Conference on Computational Learning Theory, London, UK, Springer-Verlag (1995) 23–37
17. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: IEEE ICIP 2002 (2002) 900–903
18. Viola, M., Jones, M.J., Viola, P.: Fast multi-view face detection. In: Proc. of Computer Vision and Pattern Recognition (2003)
19. Juan, L., Gwon, O.: A comparison of SIFT, PCA-SIFT and SURF. International Journal of Image Processing (IJIP) 3 (2009) 143–152
20. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 (1981) 381–395
21. Rousseeuw, P.J.: Least median of squares regression. Journal of the American Statistical Association 79 (1984) 871–880
22. Bookstein, F.L.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 567–585