Face detection for video summaries

Jean Emmanuel Viallet and Olivier Bernier

France Télécom Recherche & Développement, Technopole Anticipa, 2, avenue Pierre Marzin, 22307 Lannion, France

jeanemmanuel.viallet, [email protected]

Abstract. In an image, the faces of persons are the first information looked for. Efficient face detection in a video featuring persons (excluding cartoons and nature videos) makes it possible to classify shots and to obtain face summaries automatically. Shot sampling greatly reduces processing time. Scene layout (same number of persons, similar face positions and sizes) provides a criterion for establishing a similarity measure between shots. Similar shots are gathered into shot clusters, and all but one shot of each cluster are discarded from the summary.

1 Introduction

Persons are the principal operators of shooting, its major subject, and also the primary concern of the audience. This concern and aptitude for persons, and specifically for their faces, is illustrated every day in television magazines (paper or electronic), where, by well-established convention, summaries of programs are almost systematically illustrated with images of persons, most of the time with (cropped) close-up shots of a face.

An alternative to summarizing a whole video with a single image consists in segmenting the video into shots. Segmentation consists in finding the location and the nature of the transition between two adjacent shots, and has led to numerous techniques tailored to the nature of the transition, both in the compressed [1] and uncompressed [2] domains. Each identified shot is then summarized by a key frame [3]. Shot detection and key frame extraction rely on low-level information (colour, movement), but nothing is known about the content of the shot or key frame (presence or absence of persons or of specific objects).

We present a technique to summarize video using face information obtained by face detection. This technique is adequate for videos with persons but unsuitable for videos such as cartoons or nature documentaries (without faces of persons). Sequences or scenes are narrative units at a level of abstraction higher than shots, and thus scene segmentation may vary subjectively according to the director, editor or audience. Some key frames or shots can be viewed as carrying little information (intermediate shots) or similar information (alternate shots in a dialogue scene). We remove such key frames and decrease the size of the summary [4].

2 Face detection

Since faces represent high-level information to which humans are very sensitive, face/non-face shot classification [5] and face-based summaries are relevant. Early work on face detection, performed by Rowley, Pentland and others, dealt with frontal face detection. Most of the work on face-based video indexing deals with news videos and face detection of anchors [6, 7]. Such videos are of particular interest since overlaid text and audio recognition contribute efficiently to indexing. As far as face detection is concerned, these videos are characterized by a typical frontal face, a waist-high shot and a central position when there is one anchor. The anchor usually looks at the camera, which eases the face detection process. Unfortunately, in most non-news videos, such as those dealt with in this paper, frontal views are not always available and side-view face detection is needed [8, 9]; our own face detector achieves detection up to angles of 60° [10].

The performance of a face detector on a video can be evaluated by processing every image. Apart from being tedious, such an evaluation is biased, since many images of a shot are highly similar. Performance is thus estimated on a shot basis (Table 1). Since this work does not focus on automatic shot detection, shot boundaries (cuts, for the videos processed) are manually determined. A shot is manually labelled a face shot if at least one image of the shot exhibits at least one face; otherwise it is labelled a non-face shot. A face shot is correctly classified if at least one face is detected in at least one frame of the shot, and incorrectly classified if no face is detected in any frame of the shot. A non-face shot is correctly classified if there is no detection in any frame of the shot, and incorrectly classified if there is at least one detection.

The face detection rate, estimated on shots, is 56% (Table 1) and is below the 75% rate obtained on still images [10]. Face detection typically fails because of face orientation, size, occlusion and colorimetry (when skin-tone pre-filtering is implemented in order to accelerate the face detection process). Only five non-face shots (less than 4%) are misclassified (Table 1). The false alarm rate on video is equivalent to the one obtained on still images [10] (one false alarm for every 250 images processed). Although the false alarms obtained are highly correlated (they look alike), their temporal stability is low, and no false alarm occurs on more than two consecutive frames. This low temporal stability could be used to filter the false alarms automatically [11]. The overall rate of correctly classified shots is 85%.

2.1 Frame sampling and face detection

An alternative to using a fast face detector such as the one described in [12] is to process a limited number of frames per shot. This is of particular interest for the non-face shots (the more numerous in the tested videos), which otherwise must be scanned entirely before being classified as non-face. On the 185 shots of the seven videos we processed, a rhythm of 737 shots/hour is found, corresponding to an average shot length of 109 frames.

Once the limits of a shot are known (obtained with automatic shot segmentation, for example), face detection is performed on frames sampled along the shot until a detection occurs; the corresponding frame becomes the key frame of the shot. When no detection is found, the shot is classified as non-face and is discarded from the video summary. A face shot is here a shot where detection occurs (a face or a false alarm), and the detection yield is the ratio of the number of face shots obtained with s samples to the number of face shots obtained when all frames are processed. On the processed videos, the face shot detection yield increases only slightly for sampling rates greater than 3 to 4 samples per shot. Sampling is equivalent to Group Of Pictures processing for compressed video [11]. On average, only 3.65 frames are processed when a maximum of four samples per shot is selected.

Table 1. Face and non-face shot classification.

                        Correctly classified    Incorrectly classified
50 face shots           56%                     44%
135 non-face shots      96.3%                   3.7%
Total (185 shots)       85.4%                   14.6%
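As an illustration, the following minimal Python sketch implements this sampling strategy. The detect_faces function is a hypothetical stand-in for the neural-network detector of [10] and is not part of the original system; the even spacing of the sampled frames is likewise an assumption.

    def classify_shot(frames, detect_faces, max_samples=4):
        """Classify a shot as face or non-face from a few sampled frames.

        frames: the images belonging to one shot.
        detect_faces: a callable returning the detections in a frame
            (a hypothetical stand-in for the detector of [10]).
        Returns ("face", key_frame_index) for the first sampled frame
        with a detection, or ("non-face", None) otherwise.
        """
        n = len(frames)
        samples = min(max_samples, n)
        # Spread the sampled frames evenly along the shot.
        indices = [i * (n - 1) // max(samples - 1, 1) for i in range(samples)]
        for idx in indices:
            if detect_faces(frames[idx]):
                return "face", idx   # this frame becomes the key frame
        return "non-face", None      # shot discarded from the summary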

3 Video summaries

The shot summary, obtained from shot segmentation, has a number of key frame images equal to the number of detected shots (key frames are manually selected as the middle images of the shots) (Fig. 1 top). Each of the shots (and corresponding key frames) has a priori the same importance. The face shot summary (Fig. 1 bottom left), far smaller than the shot summary, keeps only the key frames where a detection occurred. The face summary collects the (cropped) images of the detected faces and discards similar faces (Fig. 1 bottom right).

A video could also be summarized with the (cropped) image of the first face detected. A one-face-image summary limits processing time and provides a summary more interesting than the first image of a video (often a dark image or a credits image), with which, until recently, video search engines used to summarize videos before selecting a frame from within the video [13]. From the face information, different selection processes may be thought of. For example, retaining the face key frame corresponding to the longest face shot is presumably preferable to selecting the key frame of the longest shot. A more difficult process consists in selecting an image corresponding to the face detected in the greatest number of shots. Ascertaining that the faces belong to the same person [14] can be straightforward when the images differ little (for example, the top images in Fig. 1 bottom left) but is usually difficult (bottom images in Fig. 1 bottom left).
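A minimal sketch of the longest-face-shot policy discussed above; the shot representation (a pair of shot length and detected face frame index) is ours, not the paper's.

    def longest_face_shot_key_frame(shots):
        """Return the key frame of the longest face shot, or None.

        shots: list of (length_in_frames, face_frame_index) pairs,
            where face_frame_index is None for non-face shots.
        """
        face_shots = [s for s in shots if s[1] is not None]
        if not face_shots:
            return None
        # Prefer the longest face shot over the longest shot overall.
        longest = max(face_shots, key=lambda s: s[0])
        return longest[1]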

4 Scene layout similarity and shot clusters

The same person may be found in different scenes, corresponding to a change of location, time or characters (for example, the bottom images in Fig. 1 bottom left). The same person may also be encountered in different shots of a scene, owing to the editing technique of alternate shots or to the insertion of shots. From one scene to another, changes of pose, facial expression, lighting conditions and background are among the major reasons that make face identification difficult, quite apart from the fact that face recognition only succeeds with front-view faces [15]. On the contrary, within a scene without a camera change, the position of the face and the background do not change much.

We consider that two shots i, j have a similar scene layout and belong to the same shot cluster if the number of detected faces is the same in both shots and if the positions and scales of the detected faces have changed by less than a predefined amount between the two shots. Consider the relative variation of the horizontal position (Equation 1) and of the vertical position (Equation 2) with respect to the width w and the height h of the face, together with the relative variation of the size z of the face (Equation 3); x and y are the image coordinates of the position of the face. Equation (4) measures the scene layout similarity, according to our criteria, when only one person is found in shots i and j. These shots are said to be similar when their mutual similarity S_{i,j} is greater than a given threshold. If similar, the shots are merged within the same shot cluster; otherwise, if a shot cannot be merged into any cluster, a new cluster is initiated from this shot. The threshold used in the following experiment is set to 0.5, which corresponds to relative variations of the lateral position, vertical position and size of the face of 25%.

X_{i,j} = \frac{2\,|x_i - x_j|}{w_i + w_j}    (1)

Y_{i,j} = \frac{2\,|y_i - y_j|}{h_i + h_j}    (2)

Z_{i,j} = \frac{2\,|z_i - z_j|}{z_i + z_j}    (3)

S_{i,j} = \frac{1}{1 + X_{i,j}} \cdot \frac{1}{1 + Y_{i,j}} \cdot \frac{1}{1 + Z_{i,j}}    (4)
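As an illustration, here is a minimal Python sketch of this similarity measure and of the greedy clustering it drives, for the single-face case described above. The face representation (x, y, w, h, z) and the 0.5 threshold follow the definitions above; comparing each shot against the first shot of a cluster is our assumption, as the paper does not specify the cluster representative.

    def layout_similarity(a, b):
        """Scene-layout similarity of Equations (1)-(4), one face per
        shot; a and b are (x, y, w, h, z) tuples giving face position,
        width, height and size."""
        xa, ya, wa, ha, za = a
        xb, yb, wb, hb, zb = b
        X = 2 * abs(xa - xb) / (wa + wb)   # relative horizontal shift (1)
        Y = 2 * abs(ya - yb) / (ha + hb)   # relative vertical shift (2)
        Z = 2 * abs(za - zb) / (za + zb)   # relative size change (3)
        return 1 / ((1 + X) * (1 + Y) * (1 + Z))   # similarity (4)

    def cluster_shots(faces, threshold=0.5):
        """Merge a shot into the first cluster whose representative
        (here: its first shot) it resembles; otherwise start a new
        cluster. faces: one (x, y, w, h, z) tuple per shot."""
        clusters = []   # each cluster is a list of shot indices
        for i, face in enumerate(faces):
            for cluster in clusters:
                if layout_similarity(faces[cluster[0]], face) > threshold:
                    cluster.append(i)
                    break
            else:
                clusters.append([i])
        return clusters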

From the 28 shots in which faces were detected, and with the threshold value of 0.5, the shot clusters obtained are given in Fig. 3, presented on a video-by-video basis. The criteria used to compare the effectiveness of shot segmentation techniques [16] (Equations 5 to 8) are also used to measure the quality of the shot clusters obtained.

\text{Accuracy} = \frac{N_C - N_I}{N_T} = 0.77    (5)

\text{Recall} = \frac{N_C}{N_T + N_D} = 0.70    (6)

\text{Error rate} = \frac{N_D + N_I}{N_T + N_I} = 0.11    (7)

\text{Precision} = \frac{N_C}{N_C + N_I} = 1    (8)

The total number of clusters is estimated by the authors at N_T = 18. This estimation is subjective, and a different, although close, number of clusters could have been found by someone else; the situation is similar for shot segmentation, for which there is no ground truth. For instance, in one of the videos, the first cluster is obtained in the same room as the second cluster, but with a greater field of view and a slightly different camera angle (Fig. 2 top left and Fig. 3). In another video, the first cluster corresponds to a location estimated to be different from the location of the last cluster (Fig. 2 top right and Fig. 3). The number of correctly identified clusters is N_C = 14, the number of incorrectly inserted clusters is N_I = 0, and N_D = 2 is the number of incorrectly deleted clusters. The first deleted cluster corresponds to an obtained cluster that incorrectly merges the same person in a similar position but in two different places (Fig. 2 bottom left and Fig. 3). The second deleted cluster (Fig. 2 bottom right and Fig. 3) corresponds to a cluster that incorrectly merges a man and a woman. Had the colorimetry of the images been taken into account, these two errors would probably have been avoided, as shown by key frame clustering based on compressed chromaticity signatures [17]. Keeping only one sample per cluster yields smaller video summaries. Summaries assembling (cropped) images of faces (Fig. 1 bottom right) focus on the person to the detriment of contextual information.
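As a quick check, the counts above reproduce the values reported for Equations (5) to (8):

    # Counts reported above: 18 true clusters, 14 correct,
    # 0 inserted, 2 deleted.
    NT, NC, NI, ND = 18, 14, 0, 2
    accuracy   = (NC - NI) / NT         # 14/18 = 0.77...
    recall     = NC / (NT + ND)         # 14/20 = 0.70
    error_rate = (ND + NI) / (NT + NI)  # 2/18  = 0.11...
    precision  = NC / (NC + NI)         # 14/14 = 1.0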

5 Conclusion

Face detection is a means to obtain video summaries of the kind people are familiar with, that is, summaries that focus on face information. The size of the obtained video summaries is far smaller than that of the standard shot summary, and even benefits from undetected faces, together with a low false alarm rate. Many of the face images are similar; they can be gathered into shot clusters and discarded from the summary.

Fig. 1. Top: the standard key frame "shot" summary. Bottom left: the "shot-face" summary obtained by selecting shots where faces are detected. Bottom right: the "face" summary, keeping only facial parts of the images and discarding similar redundant faces.

Fig. 2. Top left: same person and location, different frames. Top right: same person, different location and face position. Bottom left: same person, different locations and similar positions. Bottom right: different persons and locations, similar face position.

Fig. 3. Face shot clusters. Each of the seven black frames corresponds to a video. Columns show the different clusters of a video and rows show the shots of a cluster, according to the chronology of the video. Only the information on the number, size and position of faces is used; image colorimetry is not taken into account. Two clusters are incorrect: one merges a woman and a man and, in the other, a man is first in front of a bookshelf, then in front of a window. For the top video, two images of the second shot cluster, enclosed in dashed lines, are positioned on the first row for convenience.

References

1. Wang, H.L., Chang, S.F.: A Highly Efficient System for Automatic Face Region Detection in MPEG Video. IEEE Trans. Circuits and Systems for Video Technology, 7(4) (1997) 615-628
2. Demarty, C.H., Beucher, S.: Efficient morphological algorithms for video indexing. Content-Based and Multimedia Indexing, CBMI'99 (1999)
3. Chen, J.-Y., Taskiran, C., Albiol, A., Delp, E.J., Bouman, C.A.: ViBE: A Video Indexing and Browsing Environment. Proceedings of the SPIE Conference on Multimedia Storage and Archiving Systems IV, 20-22 September, Boston, Vol. 3846 (1999) 148-164
4. Aoki, H., Shimotsuji, S., Hori, O.: A shot classification method of selecting effective key-frames for video browsing. Proc. of ACM Int'l Conf. on Multimedia, Boston, MA (1996) 1-10
5. Chan, Y., Lin, S.H., Tan, Y.P., Kung, S.Y.: Video Shot Classification Using Human Faces. ICIP (1996) 843-846
6. Eickeler, S., Muller, S.: Content-Based Indexing of TV Broadcast News Using Hidden Markov Models. IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP), Phoenix, Arizona (1999)
7. Liu, Z., Wang, Y.: Face Detection and Tracking in Video Using Dynamic Programming. ICIP00 (2000) MA02.08
8. Schneiderman, H., Kanade, T.: Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition. IEEE Computer Vision and Pattern Recognition, Santa Barbara (1998) 45-51
9. Wei, G., Li, D., Sethi, I.K.: Detection of Side View Faces in Color Images. WACV00 (2000) 79-84
10. Féraud, R., Bernier, O., Viallet, J.E., Collobert, M.: A fast and accurate face detector based on neural networks. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23 (2001) 42-53
11. Wang, H., Stone, H.S., Chang, S.-F.: FaceTrack: Tracking and Summarizing Faces from Compressed Video. SPIE Multimedia Storage and Archiving Systems IV, Boston (1999)
12. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Conference on Computer Vision (2001) II:747
13. AltaVista video search engine: http://www.altavista.com
14. Eickeler, S., Wallhoff, F., Iurgel, U., Rigoll, G.: Content-Based Indexing of Images and Video Using Face Detection and Recognition Methods. IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, Utah (2001)
15. Satoh, S.: Comparative Evaluation of Face Sequence Matching for Content-based Video Access. Proc. of Int'l Conf. on Automatic Face and Gesture Recognition (FG2000) (2000) 163-168
16. Ruiloba, R., Joly, P., Marchand-Maillet, S., Quenot, G.: Towards a standard protocol for the evaluation of video-to-shots segmentation algorithms. CBMI'99, Proceedings of the European Workshop on Content-Based Multimedia Indexing, Toulouse, France (1999)
17. Drew, M.S., Au, J.: Video keyframe production by efficient clustering of compressed chromaticity signatures. ACM Multimedia '00 (2000) 365-368