Scale Adaptive Face Detection and Tracking in Real Time with SSR filter and Support Vector Machine

Shinjiro Kawato and Nobuji Tetsutani
ATR Media Information Science Laboratories
Keihanna Science City, Kyoto 619-0288, Japan
e-mail: {skawato tetsutani}@atr.co.jp

Abstract

In this paper, we propose a method for detection and tracking of faces in video sequences in real time. It can be applied to a wide range of face scales. Our basic strategy for detection is fast extraction of face candidates with a Six-Segmented Rectangular (SSR) filter, followed by face verification with a support vector machine. A motion cue is used in a simple way to avoid picking up false candidates in the background. In face tracking, the between-the-eyes pattern is tracked with updating template matching. To cope with various scales of faces, we use a series of scale-down images with a ratio of approximately 1/sqrt(2), and an appropriate scale is selected according to the distance between the eyes. We tested our algorithm with 7146 frames of a broadcasted sign language news video of 320 x 240 frame size, in which one or two persons appeared. Although gesturing hands often hid faces and interrupted tracking, 89% of faces were correctly tracked. We implemented the system on a PC with a Xeon 2.2-GHz CPU, running at 15 frames/second without any special hardware.

1. Introduction

For head gesture recognition and/or facial expression recognition, real-time head and face detection is the first step. Once a face is detected, it should be tracked for further analysis. In this paper, we propose a real-time face detection and tracking algorithm for video sequences.

Many methods and algorithms have been proposed for face detection. Hjelmas [4] did a comprehensive survey on this subject, listing more than two hundred references. He organized face detection techniques into two broad categories: the feature-based approach and the image-based approach. Most image-based approaches apply a window-scanning technique [4] for face detection, which makes them computationally expensive. Therefore, most real-time systems take the feature-based approach, combining it with various kinds of fast face candidate (point or region) extraction methods.

The most popular face candidate extraction method is to extract a skin-tone region in color images [2][5][17][19][20]. However, color information is very sensitive to lighting conditions, and it is very difficult to adapt the skin-tone model to the lighting environment in real time. Depth information from stereo systems is also used to extract foreground objects (i.e., heads) from the background [10][3][12]. However, dense depth recovery involves rather heavy computation, and stereo systems require precise calibration, which is not an easy task. Background subtraction is another way to extract face candidate regions [11], but it requires a background image or model in advance; it is therefore not applicable to, for example, broadcasted video, for which we cannot obtain the background image in advance. Accordingly, we take a different approach to face candidate extraction.

Another issue we want to address is that most feature-based approaches require features related not only to the eyes but also to the nose and mouth, or skin colors and/or face contours, which are likely to be affected by beards, mustaches or hair styles. The nostrils are very stable facial features, but they are sometimes invisible to the camera. We proposed a filtering technique to extract the Between-the-Eyes in real time in [9]. This approach was attractive because it only required information around the eyes. However, one drawback of this filtering approach was that it failed to detect the Between-the-Eyes when hair covered the forehead, which was likely in many cases. Another drawback was that the calculation of the filter was somewhat heavy. Therefore, for real-time processing, we had to extract skin-tone regions first as face candidates, as in other feature-based approaches, and apply the filter only to the candidate regions. However, skin-tone region extraction has the problems mentioned above.
The filter size was also fixed, so the applicable face scale was limited. Here, we propose another filtering approach for face candidate extraction in real time. Viola and Jones's work [18] provided hints for this new approach: they introduced an intermediate representation of an image called an "integral image" to calculate their rectangular features. We use the integral image to calculate our filter. First, a rectangle of a certain size is scanned over the input image. The rectangular region is divided into six segments, and the average gray level in each segment is calculated from the integral image. Then, bright-dark relations between the segments are tested to see whether the rectangle's center can be a candidate for a face. We call this filter a six-segmented rectangular filter, or SSR filter.

The idea of using bright-dark relations of patches for face detection was first proposed by Sinha [16]. Scassellati implemented the idea in robot vision in a practical manner [15], but he used 16 patches distributed from the forehead and temples to the chin, with 23 relations between them. It was therefore a rather strict template, not applicable to a wide variety of facial appearances. We instead use the SSR filter as a very loose template for fast screening. Since the calculation of the SSR filter is simple and fast, we can apply different sizes of filters even in real-time processing.

For each face candidate, eye-like regions are extracted. For each combination of right and left eye candidates, the orientation, scale, and gray level of the face candidate are normalized, and a region corresponding to the area between the eyebrows and the nose is extracted, excluding the forehead and the mouth. Final face confirmation is then done with a support vector machine [13][7]. In face tracking, the between-the-eyes pattern is tracked with updating template matching. To cope with various sizes of faces, we use a series of scale-down images with a ratio of approximately 1/sqrt(2), and an appropriate scale is selected according to the distance between the eyes. We tested our algorithm with 7146 frames of a broadcasted sign language news video of 320 x 240 frame size, in which one or two persons appeared.
Although gesturing hands often hid faces and interrupted tracking, 89% of the faces were correctly tracked. We implemented the system on a PC with a Xeon 2.2-GHz CPU, running at 15 frames/second without any special hardware. In Section 2, we explain the integral image. In Section 3, we describe the six-segmented rectangular filter for extracting face candidates, and in Section 4, we explain our implementation strategy. In Section 5, some experimental results are described, and Section 6 concludes the paper.

2. Integral Image

For an image I(x, y), the integral image II(x, y) is defined as follows [18]:

    II(x, y) = sum over x' <= x, y' <= y of I(x', y')        (1)

The integral image can be computed in one pass over the original image by

    s(x, y)  = s(x, y - 1)  + I(x, y)                        (2)
    II(x, y) = II(x - 1, y) + s(x, y)                        (3)

where s(x, y) is the cumulative row sum, s(x, -1) = 0, and II(-1, y) = 0.

Figure 1: Integral image. The sum of the pixels within rectangle D is computed with four references: Sum(D) = II(P4) + II(P1) - II(P2) - II(P3), where P1 is the corner above and to the left of D, P2 the corner above and to the right, P3 the corner below and to the left, and P4 the bottom-right corner of D.
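Equations (1)-(3) and the four-reference rectangle sum can be sketched in a few lines of NumPy; the function names here are illustrative, not from the paper.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all rows <= y and columns <= x (Eq. (1))."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of img[y1:y2+1, x1:x2+1] from four integral-image references."""
    total = ii[y2, x2]
    if x1 > 0:
        total -= ii[y2, x1 - 1]          # strip to the left of the rectangle
    if y1 > 0:
        total -= ii[y1 - 1, x2]          # strip above the rectangle
    if x1 > 0 and y1 > 0:
        total += ii[y1 - 1, x1 - 1]      # added back: subtracted twice above
    return total

img = np.arange(12, dtype=np.int64).reshape(3, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 2))  # 48, equal to img[1:3, 1:4].sum()
```

Note that `rect_sum` runs in constant time regardless of the rectangle size, which is what makes the cost of the SSR filter independent of its scale.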

3. SSR Filter

For face candidate extraction, a rectangle is scanned over the input image. The rectangle is segmented into six parts, S1, ..., S6, as in Fig. 2: S1, S2 and S3 form the top row from left to right, and S4, S5 and S6 the bottom row. We denote the average pixel value within segment Si as avg(Si). Then, when one eye and eyebrow are within S1 and the other within S3, we can expect

    avg(S1) < avg(S2),  avg(S1) < avg(S4)                    (4)
    avg(S3) < avg(S2),  avg(S3) < avg(S6)                    (5)

A point where (4) and (5) are satisfied can be a face candidate. We call this an SSR filter.

Figure 2: Six-Segmented Rectangular (SSR) Filter. The average pixel values in the segments are computed and compared with each other to determine whether they satisfy the conditions above.

The feasibility of the SSR filter was examined [14] using face images of the ORL database [1]. The database contains 400 facial images of 40 people, i.e., 10 per person. For some subjects, the images were taken at different times, under varied lighting, with open/closed eyes, smiling/not smiling, and with and without glasses. Each image is 92 x 112 pixels with 256 gray levels. Points satisfying inequalities (4) and (5) usually come out in clusters, so we select one point at the center of the bounding box of each cluster as a face candidate. An SSR filter with suitably chosen segment widths and heights could extract 97.25% of the true face candidates. Notably, the computing time does not depend on the filter size when we use the integral image.

4. Implementation

Using SSR filters and a Support Vector Machine (SVM), we implemented a face detection and tracking system that is face-scale-adaptive and applicable to multiple faces. Our implementation strategies are as follows.

(1) No color information is used; we take the green component of an RGB image as the gray-scale image.

(2) SSR filters are used for face candidate extraction and an SVM for face confirmation.

(3) The forehead and mouth regions are excluded from the training patterns for the SVM. Figure 3 shows a typical face pattern for SVM training. The pattern size is 35 x 21. Scale and orientation are normalized based on the eye locations: the distance between the eyes is 23 pixels, and the eyes are aligned on the 8th row. Histogram equalization is applied to the gray levels.
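A minimal sketch of this normalization, assuming canonical eye positions at columns 6 and 29 of the 8th row (so the eyes are 23 pixels apart in the 35 x 21 pattern) and a nearest-pixel similarity warp; the exact conventions in the paper may differ, and all names are illustrative.

```python
import numpy as np

PAT_W, PAT_H = 35, 21                      # pattern size from the paper
EYE_L, EYE_R = (6.0, 7.0), (29.0, 7.0)     # assumed canonical eye positions:
                                           # 23 px apart, on the 8th row

def equalize(pat):
    """Histogram equalization over 256 gray levels (uint8 input)."""
    hist = np.bincount(pat.ravel(), minlength=256)
    cdf = hist.cumsum()
    lut = (255 * cdf / cdf[-1]).astype(np.uint8)
    return lut[pat]

def normalize_face(img, eye_l, eye_r):
    """Warp a uint8 gray image so the detected eyes land on the canonical
    positions, sampling with the nearest-pixel rule, then equalize."""
    # similarity transform (rotation + scale) mapping pattern coords -> image
    dx, dy = eye_r[0] - eye_l[0], eye_r[1] - eye_l[1]
    scale = np.hypot(dx, dy) / (EYE_R[0] - EYE_L[0])
    ang = np.arctan2(dy, dx)
    c, s = np.cos(ang) * scale, np.sin(ang) * scale
    out = np.zeros((PAT_H, PAT_W), dtype=img.dtype)
    for v in range(PAT_H):
        for u in range(PAT_W):
            px, py = u - EYE_L[0], v - EYE_L[1]
            x = int(round(eye_l[0] + c * px - s * py))   # nearest pixel
            y = int(round(eye_l[1] + s * px + c * py))
            if 0 <= x < img.shape[1] and 0 <= y < img.shape[0]:
                out[v, u] = img[y, x]
    return equalize(out)
```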

Figure 3: Typical face pattern for SVM learning.

(4) A normalized face candidate pattern for the SVM is obtained depending on the eye locations. Two local minimum (i.e., dark) points each are extracted from the S1 and S3 areas of the SSR filter as left and right eye candidates. Therefore, for one face candidate, at most four patterns are tested by the SVM. We extract two eye candidates per side because the darkness of an eyebrow is sometimes similar to that of an eye.

(5) In scaling and rotation of images, a nearest-pixel rule is followed, i.e., we take the value of the pixel nearest to the calculated coordinate. The quality of the resulting image may not be good, but this rule saves processing time.

(6) A motion cue is used to avoid false face candidates in the still background. Even if an SSR filter indicates a face candidate, when the number of pixels whose values change significantly from the previous frame is less than a threshold (typically a few percent) in the SSR filter's applied area, the point is not taken as a face candidate. The same use of a motion cue appears in [21].

(7) In face tracking, the between-the-eyes pattern is tracked with an updating template like that in [8]. However, the success of tracking is confirmed with the SVM.

(8) A series of scale-down images is used to track various scales of faces with a fixed template size. To save calculation time in constructing the scale-down images, we use sub-sampled images. The series of sub-sampling rates is 2/3, 1/2, 1/3, 1/4, 1/6, which makes up an approximately 1/sqrt(2)-ratio series. The scale-down images thus consist of original pixel values, and no additional calculation is required to make them. In the tracking process, an appropriate scale is selected according to the distance between the eyes.

(9) The tracking process precedes the detection process. If a face is tracked successfully, its region is masked out so that the SSR filter is not applied there in the detection process.

(10) In the detection process, larger SSR filters are applied before smaller ones. If a face is detected, its region is masked out so that smaller SSR filters are not applied there. The sizes of the SSR filters are 120 x 72, 80 x 48, 60 x 36, 40 x 24, 30 x 18, and 20 x 12. This series corresponds to the scaling of the scale-down images mentioned above. The correspondence is not necessary, because tracking and detection are independent processes, but it is convenient for monitoring the process in the scale-down images. We have slightly changed the six-segment ratio of the SSR filter from [14] for the smallest filter size.

We implemented the system on a PC with a Xeon 2.2-GHz CPU.
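The sub-sampling in strategy (8) can be sketched as plain index selection, so every output pixel is an original pixel; the rate series 2/3, 1/2, 1/3, 1/4, 1/6 then approximates successive powers of 1/sqrt(2) (consecutive ratios alternate between 3/4 and 2/3). The helper name is illustrative.

```python
import numpy as np

# Sub-sampling rates from the paper, as (numerator, denominator) pairs.
RATES = [(2, 3), (1, 2), (1, 3), (1, 4), (1, 6)]

def subsample(img, p, q):
    """Keep p of every q rows and columns: no interpolation, so the
    scale-down image consists purely of original pixel values."""
    idx_r = [r for r in range(img.shape[0]) if r % q < p]
    idx_c = [c for c in range(img.shape[1]) if c % q < p]
    return img[np.ix_(idx_r, idx_c)]

img = np.zeros((240, 320))
for p, q in RATES:
    print(p, '/', q, '->', subsample(img, p, q).shape)  # e.g. 1/2 -> (120, 160)
```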
Figure 4 shows an example of the system monitor window, which helps in understanding how the system works. The upper-left area in Fig. 4 is the original input color image, with a size of 320 x 240; the tracked eyes are marked, and the current templates for tracking the between-the-eyes are shown in the upper-left corner. The lower-left area in Fig. 4 is the monochrome image (i.e., the green component of the original image), with green dots overlaid at pixels where the gray level has changed significantly from the previous frame. The right column in Fig. 4 shows the scale-down images; the scales are 2/3, 1/2, 1/3, 1/4, and 1/6 from top to bottom. Red dots are overlaid on the scale-down images where the corresponding size of SSR filter has positive output. Again, no calculation is needed to make these scale-down images.
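The updating template matching used for tracking (strategy (7)) can be sketched as follows. This is a generic SSD-based version, not the exact method of [8]; the function name and search radius are illustrative.

```python
import numpy as np

def track_update(frame, tmpl, prev_xy, search=8):
    """One step of updating template matching (sketch).

    Searches a (2*search+1)^2 neighborhood of prev_xy for the patch that
    best matches tmpl (sum of squared differences), then returns that
    location together with the matched patch as the updated template.
    """
    th, tw = tmpl.shape
    px, py = prev_xy
    best, best_xy = None, prev_xy
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = px + dx, py + dy
            if x < 0 or y < 0 or y + th > frame.shape[0] or x + tw > frame.shape[1]:
                continue                      # patch would leave the frame
            patch = frame[y:y + th, x:x + tw].astype(float)
            score = ((patch - tmpl) ** 2).sum()
            if best is None or score < best:
                best, best_xy = score, (x, y)
    x, y = best_xy
    new_tmpl = frame[y:y + th, x:x + tw].astype(float)   # template update
    return best_xy, new_tmpl
```

Because the template is replaced by the freshly matched patch each frame, it follows gradual appearance changes; the paper guards against drift by confirming each tracked pattern with the SVM.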

Figure 4: Monitor window of the system.

5. Experiment

5.1. Training SVM

For training the SVM, we have to gather appropriate face images and non-face images. First, we used the faces in the ORL database [1]. Eye locations were manually assigned, and then scale, orientation and gray level were normalized as described in the previous section. However, 400 face patterns seemed to be too few for SVM training. Therefore, we utilized broadcasted video. We recorded several short video sequences in which only one person appears, and we manually located the eyes in the first frame of each sequence. Once the eyes are located, it is fairly easy for the program to track them through the rest of the sequence as long as both eyes are visible. In this way, we obtained 4599 faces of various orientations of several persons. Among them we used every fourth one, because adjacent frames are very similar and thus redundant. Figure 5 shows an example of training pattern acquisition.

We obtained the non-face patterns using the same video sequences. When we extracted face candidates from them with the SSR filter, all of them except the one that includes the tracked eyes were false candidates. We extracted a pair of dark regions for each false candidate as if they were eyes, normalized the pattern as if it were a face, and registered the pattern as a non-face example. We balanced the numbers of face and non-face examples, which were 1550 and 1650, respectively. After training the SVM, the number of support vectors was 775.

Figure 5: Obtaining training face patterns from broadcasted video in a semi-automatic manner. (This scene is from a TV commercial for Mitsukan Vinegar.)

5.2. Test with Sign Language Video

We tested the performance of our system with a broadcasted sign language news video. First, we recorded a 15-minute sign language news program (from NHK, 31 July 2003). The frame rate was 17 frames per second, a limitation due to the performance of our recording system. We then extracted all of the sequences in which sign language performers appeared. There are eleven sequences, with a total of 7146 frames. Four persons appeared (Fig. 6): (A) and (B) are newscasters; (C) is a guest and not a sign language performer; (D) translated what (C) spoke into sign language. One or two persons appear at a time in a sequence, and our system tries to detect and track two faces even when only one person appears.

Figure 6: Four persons in the test video.

The results are shown in Table 1. In the table, "Frm." is the number of frames in the sequence. The column "Who" indicates which of persons (A), (B), (C), (D) appears; "Non" means no second person. "True" is the number of frames in which both eyes are correctly located, i.e., between or at the eye corners. "Near" counts the cases in which one of the eyes is tracked on the eyebrow or on the hair near the eye. Cases in which both eyes are tracked on the eyebrows are counted as "False". "None" means no point was detected.

In sequences #2, #7, #8, and #10, the false counts for the person are zero, while there is a rather large number of counts for "Non". This means that false tracking of a non-face occurs as a second (false) face while the tracked face of the person is evaluated as "True" or "Near". Most "False" cases

occur when gesturing hands and arms happen to produce local image patterns that pass the SSR filter and the SVM. In sequence #11, person (B) is tracked very little; this is because person (B) remains still for a long time from the beginning, so no motion cue appears. In total, 8706 faces out of 9811 (note that sequences #1, #9, and #11 contain two faces), or 88.7%, were correctly tracked. Gesturing hands often hide faces and interrupt tracking, but the system quickly recovers. Figure 7 shows an example of a two-face tracking result from sequence #1; the upper-left corner shows the two current tracking patterns. All of the sequences and tracking results can be seen at http://www.mis.atr.co.jp/~skawato.

Figure 7: Example of a two-face tracking result from sequence #1.

Table 1: Results of Test with Sign Language Video

Seq.    Frm.    Who    True    Near    False    None
#1       501    A       442       4       4       51
                B       447       5       0       49
#2       544    A       479       6       0       59
                Non       -       -      86        -
#3       596    B       505      29      31       31
                Non       -       -     144        -
#4       721    A       703       2       2       14
                Non       -       -      91        -
#5       738    B       695       4       3       36
                Non       -       -     111        -
#6       451    A       421      13       4       13
                Non       -       -      15        -
#7       912    B       802      41       0       69
                Non       -       -     138        -
#8       199    A       193       0       0        6
                Non       -       -      36        -
#9      2077    C      1848       0       8      221
                D      1788      88      64      137
#10      320    A       296       4       0       20
                Non       -       -      59        -
#11       87    A        69       0       0       19
                B        18       0       0       70
Total   7146           8706     196     796      795

5.3. Scale Adaptability

We tested zooming-up face tracking with an NTSC video signal. As shown in Fig. 8, the system continuously and successfully tracked the face from the upper-left image to the lower-right image. The face-scale difference is about nine times. The low quality of the images is due to the fact that they are recordings of an LCD panel made with a consumer-model video camera.

Figure 8: While zooming up about nine times, the moving face was successfully tracked.

5.4. Real-time Processing

The tracking process uses a very light calculation, and there is no problem tracking several faces at a time at 20 frames per second or faster. Face scale does not affect the calculation time at all, thanks to the series of scale-down images. The detection process, on the other hand, is not so simple. The detection processing time is approximately proportional to the number of face candidates. Considering random patterns, we can roughly say that the number of face candidates is inversely proportional to the area of the SSR filter, and the area ratio between successive filters in the series is about 1/2 (= 1/sqrt(2) x 1/sqrt(2)). Therefore, the number of face candidates roughly doubles at each step down the series of SSR filters, which are 120 x 72, 80 x 48, 60 x 36, 40 x 24, 30 x 18, and 20 x 12. Down to the size of 40 x 24, the total number of face candidates is usually less than twenty in our office environment, and our system runs at 15 frames/sec. This filter size corresponds to detecting the face of the person sitting in the background of Fig. 4; it is adequate for detecting a person within a few meters of the camera. Recent rapid advances in CPU capabilities will soon make it possible to detect smaller faces in real time.
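The area argument above can be checked with a few lines of arithmetic: the consecutive area ratios of the filter series alternate around 1/2 (4/9 and 9/16), because the linear ratios alternate between 2/3 and 3/4.

```python
# SSR filter sizes from the paper; each area is roughly half the previous one.
sizes = [(120, 72), (80, 48), (60, 36), (40, 24), (30, 18), (20, 12)]
areas = [w * h for w, h in sizes]
ratios = [areas[i + 1] / areas[i] for i in range(len(areas) - 1)]
print(areas)                          # [8640, 3840, 2160, 960, 540, 240]
print([round(r, 3) for r in ratios])  # alternates between 4/9 and 9/16
```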

6. Conclusions

We proposed a scale-adaptive face detection and tracking system. For face candidate detection, a six-segmented rectangular (SSR) filter is scanned over the entire input image. This approach is similar to the window-scanning technique often used in the image-based approach [4]. However, once the bright-dark relations between the six segments indicate a face candidate, eye candidate regions are searched for in the manner of the feature-based approach [4]. Then, based on the locations of a pair of eye candidates, the scale, orientation and gray levels are normalized. Finally, the normalized image is fed to an SVM for face confirmation.

For face candidate detection, we prepared a series of SSR filters from 120 x 72 to 20 x 12 with a scale ratio of approximately 1/sqrt(2). The system is therefore able to handle faces over a wide variety of scales, and the calculation time for the SSR filter is constant at any scale.

For tracking faces of various scales, we use a series of scale-down images with a scale ratio of approximately 1/sqrt(2). Based on the distance between the tracked eyes, an appropriate scale image is selected and the between-the-eyes pattern is tracked with an updating template. Following tracking of the between-the-eyes region, the eyes are relocated. Notably, no calculation is required to make the scale-down images, because they are simply sub-sampled images.

We tested the system with a broadcasted sign language news video. In total, 8706 faces out of 9811, or 88.7%, in 7146 frames were correctly tracked. In a zooming test of a single face, zoomed up to about nine times its initial size, the system tracked the face continuously. The system is implemented on a PC with a Xeon 2.2-GHz CPU. As long as the SSR filter size is at least 40 x 24, it runs at 15 frames/second for input images of 320 x 240. Increasing the accuracy of the SVM is left for future work.

Acknowledgments

We are grateful to AT&T Laboratories, Cambridge, UK, for letting us use "The ORL Database of Faces" [1]. We used "SVM light" [6] by Thorsten Joachims to implement the SVM in our system. This research was supported in part by the Telecommunications Advancement Organization of Japan.

References

[1] AT&T Laboratories, Cambridge, UK. "The ORL Database of Faces". http://www.uk.research.att.com/facedatabase.html.
[2] D. Chai and K. N. Ngan. Face segmentation using skin-color map in videophone applications. IEEE Trans. on Circuits and Systems for Video Technology, 9(4):551-564, 1999.
[3] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. Proc. CVPR 1998, pages 601-608, 1998.
[4] E. Hjelmas and B. K. Low. Face detection: A survey. Computer Vision and Image Understanding, 83(3):236-274, 2001.
[5] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain. Face detection in color images. IEEE Trans. on PAMI, 24(5):696-706, 2002.
[6] T. Joachims. SVM light. http://svmlight.joachims.org/.
[7] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods — Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
[8] S. Kawato and J. Ohya. Two-step approach for real-time eye tracking with a new filtering technique. Proc. Int. Conf. on Systems, Man & Cybernetics, pages 1366-1371, 2000.
[9] S. Kawato and N. Tetsutani. Real-time detection of between-the-eyes with a circle frequency filter. Proc. ACCV 2002, II:442-447, 2002.
[10] S.-H. Kim, N.-K. Kim, S. C. Ahn, and H.-G. Kim. Object oriented face detection using range and color information. Proc. IEEE 3rd Int. Conf. on Automatic Face and Gesture Recognition, pages 76-81, 1998.
[11] T.-K. Kim, S.-U. Lee, S.-C. Kee, and S.-R. Kim. Integrated approach of multiple face detection for video surveillance. Proc. ICPR 2002, pages 394-397, 2002.
[12] F. Moreno, J. Andrade-Cetto, and A. Sanfeliu. Localization of human faces fusing color segmentation and depth from stereo. Proc. Int. Conf. on Emerging Technologies and Factory Automation, 2:527-535, 2001.
[13] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. Proc. CVPR 97, pages 130-136, 1997.
[14] O. Sawettanusorn, Y. Senda, S. Kawato, N. Tetsutani, and H. Yamauchi. Real-time face detection using six-segmented rectangular filter. Accepted in ISPAC 2003.
[15] B. Scassellati. Eye finding via face detection for a foveated, active vision system. Proc. AAAI '98, pages 969-976, 1998.
[16] P. Sinha. Object recognition via image invariants: a case study. Investigative Ophthalmology & Visual Science, 35(4):1626, 1994.
[17] J. C. Terrillon and S. Akamatsu. Comparative performance of different chrominance spaces for color segmentation and detection of human faces in complex scene images. Proc. 12th Conf. on Vision Interface (VI '99), 2:180-187, 1999.
[18] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proc. CVPR 2001, 1:511-518, 2001.
[19] H. Wu, Q. Chen, and M. Yachida. Face detection from color images using a fuzzy pattern matching method. IEEE Trans. on PAMI, 21(6):557-563, 1999.
[20] J. Yang and A. Waibel. A real-time face tracker. Proc. 3rd IEEE Workshop on Applications of Computer Vision, pages 142-147, 1996.
[21] H.-X. Zhao and Y.-S. Huang. Real-time multiple-person tracking system. Proc. ICPR 2002, 2:897-900, 2002.