
FACE DETECTION, TRACKING, AND RECOGNITION FOR BROADCAST VIDEO Duy-Dinh Le, Xiaomeng Wu, and Shin’ichi Satoh National Institute of Informatics, Tokyo, Japan Synonyms: Face localization, face grouping, face identification, face matching, person information analysis; Definition: Face detection and tracking are techniques for locating faces in images and video sequences; face recognition is a technique for identifying or verifying unknown people using a stored database of known faces.

1. Introduction

Human face processing techniques for broadcast video, including face detection, tracking, and recognition, have attracted considerable research interest because of their value in various applications, such as video structuring, indexing, retrieval, and summarization. The main reason for this is that the human face provides rich information for spotting the appearance of certain people of interest, such as government leaders in news video, the pitcher in a baseball video, or the hero in a movie, and is thus a basis for interpreting video content. Face processing techniques have several applications. For example, Name-It [Satoh99MM] aims to associate names and faces appearing in news video for the purpose of person annotation. [DLe07MDDM] developed a system to find important people who appear frequently in large video archives. A celebrity search engine developed by Viewdle (http://www.viewdle.com) can find video segments in which a queried person (query by name) appears.

Localizing faces and recognizing their identities are challenging problems: facial appearance varies greatly because of intrinsic factors, such as aging, facial expressions, and make-up styles, and extrinsic factors, such as pose changes, lighting conditions, and partial occlusion. These factors make it difficult to construct good face models. Many efforts have been made in the fields of computer vision and pattern recognition, but good results have been limited to restricted settings.

This article describes state-of-the-art techniques for face detection, tracking, and recognition with applications to broadcast video. For each technique, we first describe the challenges to be overcome, then present several modern approaches, and finally give a discussion.

2. Face Detection

Face detection, the task of localizing faces in an input image, is a fundamental part of any face processing system. The extracted faces can then be used for initializing face tracking or automatic face recognition. An ideal face detector should possess the following characteristics:
- Robustness: it should be capable of handling appearance variations in pose, size, illumination, occlusion, complex backgrounds, facial expressions, and resolution.
- Speed: it should be fast enough for real-time processing, which is an important factor in processing large video archives.
- Simplicity: the training process should be simple. For example, the training time should be short, the number of parameters should be small, and training samples should be cheap to collect.

2.1. Real-time Face Detection Using Cascaded Classifiers

There are many approaches to building fast and robust face detectors [YangH02PAMI]. Among them, those using advanced learning methods, such as neural networks, support vector machines, and boosting, perform best. As shown in Figure 1, detecting the faces in an image typically takes the following steps:
- Window scanning: in order to detect faces at multiple locations and sizes, a fixed-size window (e.g. 24 x 24 pixels) is used to extract image patterns at every location and scale. The number of patterns extracted from a 320 x 240 frame is large, approximately 160,000, and only a small fraction of these patterns contain a face.
- Feature extraction: features are extracted from the given image pattern. The most popular feature type is the Haar wavelet because it is very fast to compute using the integral image [Viola01CVPR]. Other feature types include pixel intensities [Rowley98PAMI], local binary patterns [Hadid04CVPR], and edge orientation histograms [Levi04CVPR].
- Classification: the extracted features are passed through a classifier that has been previously trained to classify the input pattern associated with these features as a face or a non-face.
- Merging overlapping detections: since the classifier is insensitive to small changes in translation and scale, there may be multiple detections around each face. In order to return a single final detection per face, the overlapping detections must be combined into one. To this end, the set of detections is partitioned into disjoint subsets so that each subset consists of the nearby detections for a specific location and scale. The average of the corners of all detections in each subset is taken as the corners of the single face region returned for that subset.

Encyclopedia of Multimedia
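The integral-image trick that makes Haar-like features cheap to evaluate can be sketched as follows. This is an illustrative NumPy implementation, not the code of any particular detector, and the function names are our own:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns; ii[y, x] holds the sum of
    all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, top, left, h, w):
    """Sum of any rectangle in O(1) using four integral-image lookups."""
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def haar_two_rect_vertical(ii, top, left, h, w):
    """A two-rectangle Haar-like feature: upper-half sum minus
    lower-half sum (h must be even)."""
    upper = region_sum(ii, top, left, h // 2, w)
    lower = region_sum(ii, top + h // 2, left, h // 2, w)
    return upper - lower
```

Because each rectangle sum costs only four lookups regardless of its size, a feature can be evaluated at any scale without rescaling the image, which is what makes exhaustive window scanning feasible.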

Figure 1: A typical face detection system, in which a fixed-size window is scanned over every location and scale to extract image patterns, which are then passed through a classifier to check for the presence of a face.

Figure 2: A cascaded structure for fast face detection, in which easy patterns are rejected by simple classifiers in early stages while more difficult patterns are processed by more complicated classifiers in later stages.

Since the vast majority of processed patterns are non-faces, systems based on a single classifier, such as neural networks [Rowley98PAMI] and support vector machines [Hadid04CVPR], are usually slow. To overcome this problem, a combination of simple-to-complex classifiers has been proposed [Viola01CVPR], and this led to the first real-time robust face detector. In this structure, fast and simple classifiers are used as filters in the early stages of detection to quickly reject a large number of non-face patterns, whereas slower but more accurate classifiers are used in later stages for classifying face-like patterns. In this way, the complexity of the classifiers can be adapted to the increasing difficulty of the input patterns. Figure 2 shows an example of this structure.

Training the classifiers usually consists of the following steps:
- Training set preparation: supervised learning methods require a large number of training samples to obtain accurate classifiers. The training samples are patterns that must be labeled as face (positive samples) or non-face (negative samples) in advance. Face patterns are manually collected from images containing faces; they are then scaled to the same size and normalized to a canonical pose in which the eyes, mouth, and nose are aligned. To enlarge the number of positive samples, these face patterns can be used to generate artificial faces by randomly rotating the images (about their center points) by up to 10 degrees, scaling them between 90 and 110%, translating them by up to half a pixel, and mirroring them [Rowley98PAMI]. Non-face patterns are usually collected automatically by scanning through images that contain no faces. The accurate classifier described in [Viola01CVPR] requires about five thousand original face patterns and hundreds of millions of non-face patterns extracted from 9,500 face-free images. In [Levi04CVPR], a smaller number of training samples suffices to build a robust face detector by using an edge orientation histogram.
- Learning method selection: in an ideal situation with proper settings, advanced learning methods, such as neural networks, support vector machines, and AdaBoost, perform similarly. In practice, however, proper settings are difficult to find. A neural network entails designing layers, nodes, etc., which is a complicated task. Support vector machines are therefore often preferable, because only two parameters are needed if an RBF kernel is used and many tools are available. AdaBoost (and its variants) is another popular learning method used in many object detection systems; its advantage is that it can be used both for selecting features and for learning the classifier.
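The positive-sample augmentation described above can be sketched as follows. This is a deliberately minimal NumPy illustration: only mirroring and crude one-pixel shifts are shown (rotation by up to 10 degrees and 90-110% rescaling, as in [Rowley98PAMI], would be added with an image library such as scipy.ndimage), and the helper name augment_face is our own:

```python
import random
import numpy as np

def augment_face(face, rng=None):
    """Generate artificial positive samples from one aligned face patch:
    the original, a mirrored copy, and a few crude one-pixel
    translations (np.roll wraps at the border, which a real pipeline
    would replace with proper padded shifts)."""
    rng = rng or random.Random(0)
    samples = [face, np.fliplr(face)]              # original + mirror
    for _ in range(2):
        dx = rng.choice([-1, 1])                   # random small shift
        samples.append(np.roll(face, dx, axis=1))
    return samples
```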

Current frontal-view face detection systems work in real time and with high accuracy. However, developing a face detection system that handles arbitrary views of faces is still challenging. A simple approach is to divide the entire face space into subspaces corresponding to specific views, such as frontal, full-profile, and half-profile, and to build several face detectors so that each detector handles one view. Figure 3 shows the tree structure used in the multi-view face detection system described in [HuangC05ICCV].
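The simple-to-complex cascade of Figure 2 reduces to a few lines of control flow. The sketch below is an illustrative skeleton with hypothetical stage functions, not the trained classifiers of [Viola01CVPR]:

```python
def cascade_classify(pattern, stages):
    """Run a pattern through a simple-to-complex cascade.  Each stage is
    a (score_function, threshold) pair; the pattern is rejected as soon
    as one stage scores it below threshold, so most non-face patterns
    exit after the cheap early stages."""
    for score_fn, threshold in stages:
        if score_fn(pattern) < threshold:
            return False          # early rejection
    return True                   # survived every stage: report a face

# Hypothetical two-stage cascade over grey-level patterns in [0, 1]:
stages = [
    (lambda p: sum(p) / len(p), 0.3),   # cheap mean-intensity test
    (lambda p: max(p) - min(p), 0.2),   # slightly costlier contrast test
]
```

The speedup comes entirely from ordering: the cheap stages run on every window, while expensive stages run only on the tiny fraction of windows that survive them.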

Figure 3: The tree structure used in the multi-view face detection system proposed by Huang et al. [HuangC05ICCV]. Each node in the tree is a detector which handles a specific range of poses. (Courtesy of C. Huang, H. Ai, Y. Li and S. Lao.)

2.2. Discussion

The face detection techniques presented above are mainly for still images rather than videos. However, by treating each video frame as a still image, these techniques can be applied to videos. Although frame-based face detection techniques have been demonstrated on real images, their ability to detect faces in videos is still primitive. The performance of the detector may decrease for various reasons, including occlusions and changes in lighting conditions and face poses. Without additional information, the detector's responses can easily be rejected even when they indicate the presence of a face. To provide more complete video segments in which to track the person of interest, it is therefore important to incorporate the temporal information in a video sequence.

3. Face Tracking

Face tracking is the process of locating one or more moving faces over a period of time by using a camera, as illustrated in Figure 4. A given face is first initialized manually or by a face detector. The face tracker then analyzes the subsequent video frames and outputs the location of the initialized face within these frames by estimating the motion parameters of the moving face. This differs from face detection, the outcome of which is the position and scale of a single face in a single frame: face tracking acquires information on multiple consecutive faces within consecutive video frames, and, more importantly, these faces have the same identity.

Figure 4. Overview of face tracking

3.1. Benefits of Face Tracking

One of the main applications of face tracking is person retrieval from broadcast video, for example, "intelligent fast-forward", where the video jumps to the next scene containing a certain person or actor, or retrieval of different TV segments, interviews, shows, etc., featuring a given person in a video or a large collection of videos. [Sivic05CIVR] proposes a straightforward way of using face tracking for person retrieval from feature-length movie video. At run time, the user outlines a face in a video frame, and the face tracks within the movie are then ranked according to their similarity to the outlined query face, in the same way as Google ranks search results. Since one face track corresponds to one identity, unlike in frame-based face detection, the workload of intra-shot face matching is greatly reduced. In addition, face tracking provides multiple examples of the same character's appearance to help with inter-shot face matching.

Face tracking is also used for face-name association, the objective of which is to label television or movie footage with the identity of the person present in each frame of the video. Everingham et al. [Everingham06BMVC] proposed an automatic face-name association system. This system uses a face tracker similar to the one in [Sivic05CIVR] that can extract a few hundred tracks of each particular character in a single shot. Based on the temporal information obtained from the face tracker, textual information accompanying TV and movie footage, including subtitles and transcripts, is employed to assign the character's name to each face track. For instance, shots containing a particular person can be retrieved by inputting a keyword like "Bush" or "Julia Roberts" instead of an outlined query face, as is used in [Sivic05CIVR]. Besides broadcast video, face tracking also has important applications in humanoid robotics, visual surveillance, human-computer interaction (HCI), video conferencing, face-based biometric person authentication, etc.

3.2. Selection Criteria for Face Tracking Methods

Choosing a face tracker can be difficult because of the variety of face trackers currently available. The application provider must decide which face tracker is best suited to his or her individual needs and, of course, to the type of video to be processed. Generally speaking, the important issues are the tracker's speed, robustness, and accuracy.

Can the system run in real time? As with many processing tools for broadcast video, speed is not the most critical issue, because offline processing is permitted in most video structuring and indexing activities. However, a real-time face tracker is necessary if the target archive contains a very large quantity of video, e.g. 24 hours of continuous recording that needs daily structuring. Moreover, the speed of the tracker is critical in most non-broadcast video applications, e.g. HCI. Note that there is always a tradeoff between speed and performance-related issues such as robustness and accuracy.

Can the system cope with varying illumination, facial expressions, scales, poses, camerawork, occlusion, and large head motions? A number of illumination factors, e.g. light sources, background colors, luminance levels, and media, greatly affect the appearance of a moving face, for instance when tracking a person who is moving from an indoor to an outdoor environment. Face tracking also tends to fail when there are large deformations of the eyes, nose, mouth, etc., due to changes in facial expression. Unlike in non-broadcast video, e.g. video used for HCI, faces appearing in broadcast video vary from large in close-ups to small in long shots. A smaller face scale leads to a lower resolution, and most face trackers designed by computer vision researchers will reject such faces. Pose variations, i.e. head rotations in pitch, roll, and yaw, can cause parts of faces to disappear. In some cases, scale and pose variations may be caused by changes in camerawork. Occlusion by other objects also partially obscures faces, and other motions on screen may interfere with the acquisition of motion information. Moreover, face tracking becomes even more difficult when the head moves fast relative to the frame rate, so that the tracker fails to "arrive in time".

How accurate is the tracking? When the tracker is initialized with a face detector, the first factor that affects accuracy is false face detections. This problem is difficult to solve because the face detector has a fixed threshold: lowering the threshold reduces the number of false rejections but increases the number of false detections. Drift, or the long-sequence motion problem, also affects accuracy. These problems are generally a result of imperfect motion estimation techniques. A tracker may accumulate motion errors and eventually lose track of a face, for instance as the face turns from a frontal view to a profile.

3.3. Workflow of Face Tracking

Face tracking can be viewed as an algorithm that analyzes video frames and outputs the location of moving faces within each frame. For each tracked face, three steps are involved: initialization, tracking, and stopping, as illustrated in Figure 5.

Figure 5. Face tracking flowchart

Most methods use a face detector to initialize their tracking processes. An often ignored difficulty with this step is how to control false face detections, as described above. Another problem is handling new non-frontal faces. Although there have been studies on profile and intermediate-pose face detectors, they all suffer from the false-detection problem far more than a frontal face detector does. To alleviate these problems, Choudhury et al. [Choudhury03PAMI] used two face probability maps instead of a fixed threshold to initialize the face tracker, one for frontal views and one for profiles. All local maxima in these maps are chosen as face candidates, and their face probabilities are propagated throughout the temporal sequence. Candidates whose probabilities either go to zero or remain low over time are determined to be non-faces and are eliminated. The information from the two face probability maps is combined to represent an intermediate head pose. Their experiments showed that the proposed probabilistic detector was more accurate than a traditional face detector and could handle head movements covering ±90 degrees of out-of-plane rotation (yaw).

After initialization, one must choose the features to track. Color is one of the more common choices because it is invariant to facial expressions, scale, and pose changes [Boccignone05ICIAP, LiY06HCIW]. However, color-based face trackers often depend on a learning set dedicated to a certain type of processed video and might not work on unknown videos with varying illumination conditions or on faces of people of different races. Moreover, the color image is susceptible to occlusion by other head-like objects. Two other choices that are more robust to varying illumination and occlusion are key points [Sivic05CIVR, Everingham06BMVC] and facial features [Arnaud05ICIP, ZhuZ05CVIU, TongY07PR], e.g. the eyes, nose, mouth, etc. Although the generality of key points allows for tracking of different kinds of objects, without any face-specific knowledge this method's power to discriminate between the target and clutter might not be enough to deal with background noise or other adverse conditions. Facial features enable tracking of high-level facial information, but they are of little use when the video is of low quality. Most facial-feature-based face trackers [ZhuZ05CVIU, TongY07PR] have been tested only on non-broadcast video, e.g. webcam video, and their applicability to broadcast video is questionable. Note that the different cues described above may be combined.

An appearance-based, or featureless, tracker matches an observation model of the entire facial appearance with the input image, instead of choosing only a few features to track. One example is the appearance-based face tracker in [Choudhury03PAMI] mentioned above.
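A color-based tracking step of the kind discussed above can be sketched as a local histogram search. This is a deliberately simple illustration of the idea (real color trackers, e.g. mean-shift-style methods, are more sophisticated), and all names here are our own:

```python
import numpy as np

def color_histogram(patch, bins=16):
    """Normalized intensity histogram of an image patch, used as a
    simple appearance model for color-based tracking."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def best_match(model_hist, frame, box, search=2, bins=16):
    """Exhaustively search a small neighborhood of the previous face
    box and return the (dx, dy) offset whose patch histogram is closest
    (by sum of absolute differences) to the model histogram."""
    x, y, w, h = box
    best, best_d = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            patch = frame[y + dy:y + dy + h, x + dx:x + dx + w]
            if patch.shape != (h, w):      # skip out-of-bounds windows
                continue
            d = np.abs(color_histogram(patch, bins) - model_hist).sum()
            if d < best_d:
                best, best_d = (dx, dy), d
    return best
```

Note how this inherits the weaknesses listed in the text: a head-like object with a similar color distribution inside the search window would match just as well as the true face.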
Another example, in [LiY06HCIW], uses a multi-view face detector to detect and track faces in different poses. Besides the face-based observation model, a head model is also included to represent the back of the head. This model is based on the idea that the head can be an object of interest because the face is not always trackable. An extended particle filter is used to fuse these two sets of information to handle occlusions due to out-of-plane head rotations (yaw) exceeding ±90 degrees.

During the tracking procedure, face tracking systems usually use a motion model that describes how the image of the target might change under different possible face motions. Examples of simple motion models are as follows. Assuming the face to be a planar object, the corresponding motion model can be a 2D transformation, e.g. an affine transformation or a homography, of a facial image, e.g. the initial frame [Arnaud05ICIP, ZhuZ05CVIU]. Some research treats the face as a rigid 3D object; the resulting motion model defines its appearance as a function of 3D position and orientation [TongY07PR]. However, a face is actually both 3D and deformable. Some systems try to model faces in this sense, covering the image of the face with a mesh, i.e. a sophisticated geometry-and-texture face model [Dornaika04TSMC, Dornaika06CSVT]; the motion of the face is then defined by the positions of the nodes of the mesh. If the quality of the video is high, a more sophisticated motion model gives more accurate results. For instance, a sophisticated geometry-and-texture model might be less susceptible to false face detections and drifting than a simple 2D transformation model. However, most 3D-based and mesh-based face trackers require a relatively clear appearance, high resolution, and limited pose variations, e.g. out-of-plane head rotations (roll and yaw) far less than ±90 degrees. These requirements cannot be satisfied in the case of broadcast video. Therefore, most 3D-based and mesh-based face trackers are tested only on non-broadcast video, e.g. webcam video [Dornaika04TSMC, Dornaika06CSVT, TongY07PR].

Finally, the stopping procedure is rarely discussed. This is a major deficiency: face tracking algorithms are generally not able to stop a face track in the case of tracking errors, i.e. drifting. [Arnaud05ICIP] proposed an approach that uses a general object tracker for face tracking together with a stopping criterion based on an additional eye tracker to alleviate drifting. The positions of the two tracked eyes are compared with the tracked face position; if neither eye is in the face region, drifting is determined to be occurring and the tracking process stops. In addition, most mesh-based or top-down trackers are assumed to be able to avoid drifting.
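The eye-based stopping criterion of [Arnaud05ICIP] reduces to a simple geometric test. The sketch below assumes axis-aligned boxes and point eye positions; the names are our own:

```python
def should_stop(face_box, left_eye, right_eye):
    """Drift test in the spirit of [Arnaud05ICIP]: if neither tracked
    eye lies inside the tracked face rectangle, declare drift and stop
    the track.  face_box is (x, y, w, h); eyes are (x, y) points."""
    def inside(pt):
        x, y, w, h = face_box
        return x <= pt[0] <= x + w and y <= pt[1] <= y + h
    return not (inside(left_eye) or inside(right_eye))
```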

3.4. Discussion

Face tracking has attracted much attention from researchers in multimedia content analysis, computer vision, etc. However, while most face trackers in computer vision have been designed for high-quality video, only a limited number have been designed for broadcast video. This is because current face trackers still require a relatively clear appearance, high resolution, and limited pose variations, which cannot be guaranteed in broadcast video. Moreover, face trackers are still evaluated with different types of video and different criteria. A general evaluation criterion, in terms of speed, robustness, and accuracy, is needed for comparing the performance of face trackers with different purposes.

4. Face Recognition

Face recognition is the process of identifying or verifying one or more persons appearing in a scene by using a stored database of faces [ZhaoW03ACS]. Its applications in video are as follows:
- Face retrieval: retrieve shots containing a person's appearances using one or several face images as a query [Sivic05CIVR, Arandjelovic05CVPRa].
- Face matching: match face sequences of unknown people against annotated face sequences in the database for annotation or identification [Satoh00FG].
- Face grouping: organize detected face sequences into clusters for automatic cast listing [Arandjelovic06CVPR].
- Name-face association: associate names and faces in video by multi-modal analysis for annotation and retrieval [Satoh99MM, YangJ04ACMM, Everingham06BMVC].

Like face detection and face tracking, face recognition faces difficulties in handling variations in resolution, face size, pose, illumination, occlusion, and facial expression. In addition, a robust face recognition system must handle both inter-person variation, i.e., variation among individuals, and intra-person variation, i.e., variation affecting each individual.

4.1. Pre-processing Techniques

The detected faces usually vary widely and are not reliable for direct matching. Therefore, a normalization step is required to eliminate the effects of complex backgrounds and of different illuminations, poses, and sizes. A simple technique for handling different illuminations (Figure 6) is to subtract the best-fit brightness plane and then perform histogram equalization. To handle pose changes and different sizes, facial features, such as the eyes, nose, and mouth, are detected and used to rectify all faces to a canonical pose and scale them to the same size. Elliptical masks or other background subtraction techniques can be used to remove background clutter. [Arandjelovic05CVPRa] proposes a sophisticated face normalization technique that involves a series of transformations, each aimed at removing the effect of a particular variation.
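The two illumination steps mentioned above, best-fit brightness-plane subtraction and histogram equalization, can be sketched as follows. This is an illustrative NumPy version; the function names are our own:

```python
import numpy as np

def subtract_brightness_plane(img):
    """Least-squares fit of a plane a*x + b*y + c to the pixel
    intensities, then subtract it to cancel a linear lighting
    gradient across the face patch."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, img.ravel().astype(float), rcond=None)
    plane = (A @ coeffs).reshape(h, w)
    return img - plane

def equalize_histogram(img, levels=256):
    """Map intensities through the normalized cumulative histogram so
    that the output spreads over the full dynamic range."""
    flat = np.clip(img, 0, levels - 1).astype(int).ravel()
    hist = np.bincount(flat, minlength=levels)
    cdf = hist.cumsum()
    lut = (cdf - cdf.min()) * (levels - 1) / max(cdf.max() - cdf.min(), 1)
    return lut[flat].reshape(img.shape)
```

Applied in that order, the plane subtraction removes the global lighting trend and the equalization stretches the remaining local contrast, which is exactly the before/after effect illustrated in Figure 6.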

Figure 6: Faces before and after the normalization process.

4.2. Face Recognition Techniques

In general, a single face image can be viewed as a point in a Euclidean image space. The dimensionality, D, of this space is equal to the number of pixels of the input face image. Usually D is large, leading to the curse of dimensionality. However, the surfaces of faces are mostly smooth and have regular texture, making their appearance quite constrained. As a result, face images can be expected to be confined to a face space, a manifold of lower dimension d