MPEG-4 based animation with face feature tracking

Taro Goto, Marc Escher, Christian Zanardi, Nadia Magnenat-Thalmann
MiraLab, University of Geneva
http://www.miralab.unige.ch
E-mail: {goto, escher, zanardi, thalmann}@cui.unige.ch

Abstract

Human interfaces for computer graphics systems are evolving towards a multi-modal approach. Information gathered using visual, audio and motion capture systems is becoming increasingly important within user-controlled virtual environments. This paper discusses real-time interaction with a virtual world through the visual analysis of human facial features from video. The underlying approach used to recognize and analyze the facial movements of a real performance is described in detail. The output of the system is compatible with the MPEG-4 standard, so the resulting data can be used in any other MPEG-4 compatible application. Since the MPEG-4 standard focuses strongly on networking capabilities and the required bandwidth is quite low, it offers interesting possibilities for teleconferencing. The real-time facial analysis system enables the user to control facial animation; it is used primarily with real-time facial animation systems, where the synthetic actor reproduces the animator's expressions.

Keywords: Facial analysis, Real-time feature tracking, MPEG-4

1 Introduction

In the last few years, the number of applications that require a multi-modal interface to a virtual environment has steadily increased. Within this field of research, recognition of facial expressions is a complex and interesting subject that has attracted numerous research efforts. For instance, DeCarlo et al. [1] applied an algorithm based on optical flow and a generic face model. This method is robust, but recognition is slow and does not run in real time. Cosatto et al. [2] used a sample-based method, which requires building a sample set for each person. Kouadio et al. [3] used a database of expressions together with face markers to achieve effective real-time processing. However, the use of markers is not always practical, and it is more attractive to recognize features without them. Pandzic et al. [4] used an algorithm based on edge extraction that runs in real time without markers; the method presented here is inspired by this approach.

This paper describes a method to track face features in real time without the need for markers or lip make-up. The output is converted to MPEG-4 facial animation parameters (FAPs). These feature points are sent to a real-time player that deforms and displays a synthetic 3D animated face. In the next section, the complete system for interactive facial animation is described. A brief description of the MPEG-4 standard and the related facial animation parameters is given in Section 3. In Section 4, the facial feature tracking system is described in detail. The paper concludes with real-time results applied to an MPEG-4 compatible facial animation system.

2 System Overview

Figure 1 shows the overall system, with the different tasks and interactions required to generate a real-time virtual dialog between a synthetic clone and an autonomous actor, as developed in the systems of [7,8,9]. The video input and the speech of the user drive the facial animation of the clone, while the autonomous actor uses the speech and the facial expression information from the user to generate an automatic behavioral response. MPEG-4 Facial Animation Parameters (FAPs) are extracted in real time from the video input of the face. These FAPs can be used either to animate the cloned face or to compute the high-level emotions transmitted to the autonomous actor. Phonemes and text information can be extracted from the speech. The phonemes are then blended with the FAPs from the video to enhance the animation of the clone. The text is sent to the autonomous actor and processed together with the emotions to generate a coherent answer to the user.

In the part of the system that recognizes a face and produces a CG animation, a camera records the face of a person, and the face features are tracked in real time by the computer to which the camera is connected. The extracted face features are converted to MPEG-4 FAPs and sent to another computer that renders the virtual environment. The FAP stream requires very little bandwidth, which makes real-time animation over a network practical. The system is compliant with the MPEG-4 definition [10], briefly described in the next section.

Figure 1. System Overview
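Because only a small set of parameter values needs to cross the network for each frame, the link between the tracking host and the animation host can be very light. The sketch below is only an illustration of that point, not the MPEG-4 FAP bitstream syntax; the packet layout, host address, port and frame structure are assumptions made for this example.

```python
# Minimal sketch (not the MPEG-4 bitstream): send one frame of FAP values
# per UDP datagram from the tracking host to the rendering host.
import socket
import struct

FAP_COUNT = 68                          # MPEG-4 defines 68 FAPs
RENDER_HOST = ("192.168.0.10", 5000)    # hypothetical address of the player

def send_fap_frame(sock, frame_number, fap_values):
    """Pack a frame counter and 68 signed 16-bit FAP values and send them."""
    assert len(fap_values) == FAP_COUNT
    payload = struct.pack(f"<I{FAP_COUNT}h", frame_number, *fap_values)
    sock.sendto(payload, RENDER_HOST)   # ~140 bytes per frame

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_fap_frame(sock, 0, [0] * FAP_COUNT)   # neutral face
```

At 10-15 frames per second this amounts to only a few kilobits per second, which is why the approach is attractive for teleconferencing-style applications.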

3 MPEG-4

The newly standardized MPEG-4 was developed in response to the growing need for a coding method that can facilitate access to visual objects in natural and synthetic video and sound for various applications such as digital storage media, the internet, and various forms of wired or wireless communication. ISO/IEC JTC1/SC29/WG11 (the Moving Pictures Expert Group, MPEG) made it an International Standard [5]. Targets of standardization include mesh-segmented video coding, compression of geometry, synchronization between aural and visual (A/V) objects, multiplexing of streamed A/V objects, and spatial-temporal integration of mixed media types [6].

In a world where audio-visual data is increasingly stored, transferred and manipulated digitally, MPEG-4 sets its objectives beyond "plain" compression. Instead of regarding video as a sequence of frames with fixed shape and size and with attached audio information, the video scene is regarded as a set of dynamic objects. Thus the background of the scene might be one object, a moving car another, the sound of the engine a third, and so on. The objects are spatially and temporally independent and can therefore be stored, transferred and manipulated independently. The composition of the final scene is done at the decoder, potentially allowing great manipulation freedom to the consumer of the data. MPEG-4 also aims to enable the integration of synthetic objects within the scene. It provides support for 3D graphics, synthetic sound, text-to-speech, as well as synthetic faces and bodies. This paper describes the use of facial definitions and animation parameters in an interactive real-time animation system.

3.1 Face definition and animation

The Face and Body Animation Ad Hoc Group (FBA) has defined in detail the parameters for both the definition and the animation of human faces and bodies. Definition parameters allow a detailed description of body and face shape, size and texture. Animation parameters allow the definition of facial expressions and body postures. These parameters are designed to cover all natural expressions and postures, as well as, to some extent, exaggerated expressions and motions (e.g. for cartoon characters). The animation parameters are precisely defined in order to allow an accurate implementation on any facial or body model. Here we mostly discuss facial definitions and animations based on a set of feature points located at morphological places on the face.

3.2 FAP

The FAPs are encoded for low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications. FAPs manipulate key feature control points on a mesh model of the face to produce animated visemes (the visual counterpart of phonemes) for the mouth (lips, tongue, teeth), as well as animation of the head and facial features such as the eyes and eyebrows. All FAPs involving translational movement are expressed in terms of Facial Animation Parameter Units (FAPU). These units are defined so that FAPs can be interpreted on any facial model in a consistent way, producing reasonable results in terms of expression and speech pronunciation. They correspond to fractions of the distances between some essential facial features (e.g. the eye separation), and the fractions are chosen to provide sufficient accuracy.
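To make the FAPU mechanism concrete, the sketch below shows how a decoder might scale a received FAP value by a FAPU computed from its own neutral face model. The definition of the FAPUs as neutral-face distances divided by 1024 follows the standard; the specific measurements, feature names and the example value are illustrative assumptions.

```python
# Sketch of decoder-side FAP interpretation (model measurements are assumed).
# A translational FAP is a multiple of a FAPU, so the same stream produces
# proportionally correct motion on any face model.

# Distances measured once on the neutral face of *this* model, in model units.
NEUTRAL = {
    "mouth_width": 6.0,       # MW0: distance between the mouth corners
    "mouth_nose": 3.0,        # MNS0: mouth-nose separation
    "eye_separation": 6.5,    # ES0
    "iris_diameter": 1.2,     # IRISD0
}

FAPU = {
    "MW":    NEUTRAL["mouth_width"] / 1024.0,
    "MNS":   NEUTRAL["mouth_nose"] / 1024.0,
    "ES":    NEUTRAL["eye_separation"] / 1024.0,
    "IRISD": NEUTRAL["iris_diameter"] / 1024.0,
}

def displacement(fap_value, fapu_name):
    """Convert a received FAP value into a displacement in model units."""
    return fap_value * FAPU[fapu_name]

# A value of 102 on a mouth-corner FAP (expressed in MW units) moves the
# corresponding feature point by about one tenth of this model's mouth width.
print(displacement(102, "MW"))   # 102 * 6.0/1024 ≈ 0.60
```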

4 Face Tracking

Tracking face features is a key issue in face analysis. To obtain real-time tracking, many problems have to be resolved. One important problem lies in the variety of individual appearances, such as skin color, eye color, beards, glasses, and so on. The facial features are sometimes not separated by sharp edges, or edges appear at unusual places. This diversity increases the difficulty of recognizing faces and tracking facial features. In this application, the face features and their associated information are set during an initialization phase, which addresses the main problem of facial feature differences between people. Figure 2(a) shows this initialization. Initializing for a new face is very easy: the user only moves some feature boxes, such as the pupil boxes and a mouth box, to the proper regions. Once the user has set the feature positions, the information around the features, the edge information, and the face color information are extracted automatically. The various parameters used during feature tracking are then generated automatically. These parameters, gathered during the initialization phase for every face, contain all the relevant information for tracking the face position and its corresponding facial features without any marker. The tracking process is separated into two parts: 1) mouth tracking and 2) eye tracking. The edge and gray-level information around the mouth and the eyes is the main information used during tracking. Figure 2(b)-(c) displays two examples of the tracked features superimposed on the face images. The method used for tracking these features is described in more detail in Sections 4.2 and 4.3, for the mouth and the eyes respectively.

(a) Initialization

(b) Track sample 1

(c) Track sample 2

Figure 2. Initialization and sample tracking results

4.1 Initialization

In the initialization step, the user moves the feature boxes to the proper points. The eye box is moved to each pupil location and resized to match the pupil. This box is used to extract the edge strength of the pupil and the pixel information. After this step, the skin color and the pupil color are compared, and the weight information is decided. This weight indicates the relative importance of the edge and the pixel value when detecting the pupil position. The eyebrow box is moved to a position roughly above each pupil; the pupil and eyebrow boxes together provide the relative location information for each feature. Small jaw position markers are used only for texture extraction; these positions are not tracked but depend on the lower teeth positions. The nose mask is used for extracting the texture and for tracking the movement of the whole face. This face tracking system is designed to track face features in images from a head-mounted camera, so tracking the global movement of the face is not an important factor; hence the system only uses a small nose marker to track the face position.

The mouth mask is split into several regions: an upper lip region, a region above the upper lip (mustache information), a lower lip region, a jaw region, and a region for each corner of the mouth. The edge appearance between each pair of neighboring regions is then calculated.
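A compact way to picture what the initialization phase produces is a per-feature record holding the user-placed box, the reference edge and color statistics, and the weights used later during tracking. The field names below are illustrative assumptions about how such data could be organized, not the authors' actual data structures.

```python
# Illustrative container for the per-user data gathered at initialization
# (field names are assumptions, not the paper's implementation).
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class FeatureBox:
    x: int                      # top-left corner of the user-placed box
    y: int
    width: int
    height: int
    edge_strength: float = 0.0  # reference edge strength inside the box
    mean_color: Tuple[int, int, int] = (0, 0, 0)  # reference pixel color

@dataclass
class InitData:
    boxes: Dict[str, FeatureBox] = field(default_factory=dict)  # "left_pupil", "mouth", ...
    skin_color: Tuple[int, int, int] = (0, 0, 0)
    # Relative importance of edge vs. pixel value when scoring candidates,
    # derived from how well skin and pupil colors separate for this user.
    edge_weight: float = 0.5
    pixel_weight: float = 0.5
    # Relative offsets between pupils and eyebrows recorded at the neutral pose.
    eyebrow_offsets: Dict[str, Tuple[int, int]] = field(default_factory=dict)
```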

4.2 Mouth Tracking

The mouth is one of the most difficult face features to analyze and track. Indeed, the mouth has a very versatile shape, and almost every muscle of the lower face drives its motion. Furthermore, a beard, a mustache, the tongue or the teeth may appear at times, which further increases the difficulty of tracking. Our method takes into account some intrinsic properties of the mouth: 1) the upper teeth are attached to the skull, and therefore their position remains constant; 2) conversely, the lower teeth move down from their initial position according to the rotation of the jaw joint; 3) the basic mouth shape (open or closed) depends upon this bone movement. From these properties it follows that detecting the positions of hidden or visible teeth in an image is the best way to build a robust tracking algorithm for the mouth shape and its motion.

4.2.1 Mouth tracking algorithm

The basic algorithm for tracking the mouth is depicted in Figure 3. The system first extracts all edges crossing a vertical line going from the nose to the jaw. In a second phase, an energy expressing the plausibility of each candidate mouth shape is calculated from these edges and a database of possible edge combinations. Finally, among all possible mouth shapes, the best candidate is chosen according to a "highest energy" criterion.

Figure 3. Mouth tracking flow: extract edge points; calculate edge possibilities against the database; calculate the energy for every combination; loop; choose the best energy as the mouth

Figure 4 presents the gray-level value along the vertical line from the nose to the jaw for different possible mouth shapes and the corresponding detected edges:
• Closed mouth: the center edge appears strong, the other two edges are normally weak, and the teeth are hidden inside.
• Open mouth: when teeth are visible, their edges are stronger than the edge on the outside of the lips, between a lip and the teeth, or between a lip and the inside of the mouth. If the teeth are hidden by the lips (upper or lower), the tooth edge is of course not detected.

Figure 4. Edge configurations for possible mouth shapes

Once this edge detection process is finished, the extracted edge information is compared with the data from a generic shape database, and a first selection of possible corresponding mouth shapes is made.
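The following sketch illustrates the scan-line idea behind the mouth tracker: edge strengths are sampled along a vertical nose-to-jaw line, and each candidate shape from a small database is scored by how well its expected edge pattern matches the observed edges. The edge operator, the scoring function and the shape templates are simplified assumptions and do not reproduce the paper's exact energy definition.

```python
# Simplified sketch of the mouth-shape search: edge strengths along a vertical
# nose-to-jaw scan line are matched against the edge pattern expected for each
# candidate mouth shape; the highest-scoring shape wins.
import numpy as np

def vertical_edges(gray, x, y_top, y_bottom):
    """Absolute vertical gradient along the scan line at column x."""
    column = gray[y_top:y_bottom, x].astype(float)
    return np.abs(np.diff(column))

def shape_energy(edges, expected_rows, tolerance=3):
    """Sum the strongest edge found near each expected edge position."""
    energy = 0.0
    for row in expected_rows:
        window = edges[max(0, row - tolerance):row + tolerance]
        energy += window.max() if window.size else 0.0
    return energy

def best_mouth_shape(gray, x, y_top, y_bottom, shape_db):
    """shape_db maps a shape name to the rows where its edges should appear,
    e.g. {"closed": [40], "open_teeth_visible": [30, 38, 46, 55]}."""
    edges = vertical_edges(gray, x, y_top, y_bottom)
    scores = {name: shape_energy(edges, rows) for name, rows in shape_db.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```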

4.3 Eye Tracking

The eye tracking system includes the following subsystems: pupil tracking, eyelid position recognition, and eyebrow tracking. These three subsystems are deeply dependent on one another. For example, if the eyelid is closed, the pupil is hidden and it is obviously impossible to detect its position. Some researchers have used a deformable generic model for detecting the eye position and shape. This approach has the serious drawback of usually being quite slow. In our first attempt we also considered a generic model of an eye, but the system failed when the eye was closed, and the result was difficult to stabilize. We improved on this first method as follows: 1) both pupil positions are calculated, 2) the eyebrow positions are calculated, 3) the eyelid positions are extracted with respect to their possible range, and 4) all data are checked for consistent movement. After this stage, inconsistencies are checked again and a new best position is chosen if necessary. Figure 5 presents this algorithm.

Figure 5. Eye tracking flow: calculate pupil positions; calculate eyebrow positions; calculate eyelid positions; check against the possibility database for inconsistent ("strange") faces

4.3.1 Pupil position recognition

For pupil tracking, the same kinds of energy functions used for the mouth are applied. The main difference is that during pupil tracking the pupil may disappear when the eyelid closes. Furthermore, with the eyelid half closed, or when the person looks up, part of the pupil is hidden. The method takes such cases into account. Figure 6 displays a schematic eye, where l represents the length of the visible edge of the part of the pupil not hidden by the eyelid.

Figure 6. Eye definition

The eye tracking system first traces a box area (Figure 7(a) gives an example of such an area) and finds the largest energy value using the appearance of the pupil and the pixel information; secondly, the position of the pupil is extracted according to the method described before. Even with the eyelid half closed (Figure 7(b)), the equations do not change. In this case the energy value is smaller, because half of the pupil is hidden, but the value does not depend on whether the pupil is at the center or in the corner of the eye; wherever the pupil is, its energy remains a small constant. When the eyelid is completely closed, the position of the pupil is obviously undefined. This is not a major problem, since the eyelid is then very likely to be detected as closed.

(a) Search region

(b) Half closed case

Figure 7. Trace area
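The pupil search can be pictured as follows: inside the track area, a pupil-sized window is slid over every position and scored by a weighted combination of edge strength and darkness of the pixels inside it, and the maximum is kept. The operators and weighting below are assumptions standing in for the paper's actual energy function.

```python
# Illustrative pupil search: score every candidate position inside the track
# area by combining gradient strength with interior darkness, using the
# per-user weights chosen at initialization (simplified stand-in energy).
import numpy as np

def pupil_energy(gray, cx, cy, radius, edge_weight, pixel_weight):
    patch = gray[cy - radius:cy + radius, cx - radius:cx + radius].astype(float)
    darkness = 255.0 - patch.mean()          # pupils are dark
    gy, gx = np.gradient(patch)
    edge_strength = np.hypot(gx, gy).mean()  # strong edges around the iris rim
    return edge_weight * edge_strength + pixel_weight * darkness

def find_pupil(gray, track_area, radius, edge_weight=0.5, pixel_weight=0.5):
    """track_area = (x_min, y_min, x_max, y_max); returns best centre and score."""
    x_min, y_min, x_max, y_max = track_area
    best, best_score = None, -1.0
    for cy in range(y_min + radius, y_max - radius):
        for cx in range(x_min + radius, x_max - radius):
            score = pupil_energy(gray, cx, cy, radius, edge_weight, pixel_weight)
            if score > best_score:
                best, best_score = (cx, cy), score
    return best, best_score   # a low score suggests the eyelid is (half) closed
```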

4.3.2 Eyebrow tracking

An eyebrow is defined by a small box during initialization, positioned roughly at the center of the eyebrow. It is sometimes difficult to indicate an eyebrow region correctly, because some people have very thin eyebrows towards the sides; hence we use such a small box to define the position.

Figure 8. Eyebrow search

Figure 9. Eyebrow trace

The method used to detect the position of the eyebrow from this small box follows an approach similar to the mouth shape recognition. In the eyebrow case, a vertical line goes down from the forehead until the eyebrow is detected (Figure 8). The eyebrow position is given by the maximum energy calculated from the edge value and the pixel value. After the center of the eyebrow is found, the edge of the brow is followed to the left and to the right to recognize the shape, as shown in Figure 9. This method is similar to the closed-mouth case described before, but with the added advantage of using pixel information.

4.3.3 Eyelid recognition

As soon as the pupil and eyebrow locations have been detected using the methods described previously, it is possible to estimate the eyelid location. When the pupil energy is large, or almost the same as its initial value, the eye is open. When the energy is small, the eye may be closed or the person may be looking up. The eyebrow position narrows down the possible eyelid position: when the eyebrow is lower than its initial position, the eye is considered to be closed or half closed. This helps to detect the true eyelid position, as opposed to a possible wrong detection caused by a wrinkle. After this step, the program finds the strongest edge in the considered area and sets it as the eyelid.

4.3.4 Data check

After the data around the eyes have been obtained, they are checked again to see whether they correspond to a normal movement, by comparison with templates in the database. For example, if the two eyes moved in opposite directions, the next possible positions of the eyes have to be calculated again. This check improves the robustness and reliability of the whole process.
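The eyelid decision described in Section 4.3.3 is essentially a small set of rules combining the pupil energy with the eyebrow height. One possible encoding of those rules is sketched below; the thresholds and return labels are illustrative assumptions, not values from the paper.

```python
# Rule-based eyelid state guess (thresholds are illustrative assumptions).
def eyelid_state(pupil_energy, initial_pupil_energy,
                 eyebrow_y, initial_eyebrow_y,
                 open_ratio=0.8, brow_drop=4):
    """Classify the eye as 'open', 'half_closed' or 'closed' (rough heuristic)."""
    ratio = pupil_energy / max(initial_pupil_energy, 1e-6)
    brow_lowered = eyebrow_y > initial_eyebrow_y + brow_drop  # image y grows downward
    if ratio >= open_ratio and not brow_lowered:
        return "open"            # pupil about as visible as at initialization
    if brow_lowered and ratio < open_ratio:
        return "closed"          # lowered brow plus weak pupil evidence
    return "half_closed"         # weak pupil energy: lid partly down or looking up
    # The strongest edge in the remaining candidate area is then taken as the eyelid.
```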

4.4 FAP conversion

The data extracted by the image recognition are converted to MPEG-4 FAPs. It is attractive to obtain an output compatible with the latest standard: this way, the stream of output values can be used on any clone, avatar or even comic character without any reprogramming. Figure 10 shows some results, in which the first author's face movements drive the cloned face of another person. FAPs are computed from the displacement of each feature with respect to the neutral face, which is captured during initialization; we use the difference between the initial points and the tracked points to produce the FAPs. For example, if the mouth width is 100 pixels and the mouth corner moves by 10 pixels, the movement is expressed as follows:

Moveval = [10 / 100 * 1024]

The mouth corner movement is therefore encoded as 102. This value does not change when the cloned face is replaced by another person's face.
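On the analysis side, the conversion of the example above can be written directly: the tracked displacement in pixels is divided by the corresponding neutral-face distance measured at initialization and scaled by 1024. The helper below is a sketch of that calculation, not the authors' code.

```python
# Sketch of the pixel-to-FAP conversion used in the example above.
def to_fap(displacement_px, neutral_distance_px):
    """Express a tracked displacement as an integer FAP value in FAPU."""
    return round(displacement_px / neutral_distance_px * 1024)

# Mouth width of 100 pixels at initialization, corner moved by 10 pixels:
print(to_fap(10, 100))   # -> 102
```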

Figure 10. Results with FAPs

5 Conclusions

This paper has presented a method to track facial features in real time. The recognition method for facial expressions in our system does not use any special markers or make-up. It requires no training, only a simple initialization that allows a new user to adapt to the system immediately. The extracted data are transformed into MPEG-4 FAPs that can be used easily by any compatible facial animation system and transmitted over the web anywhere in the world. The system runs on Windows NT and 95/98. The recognition speed is 10 to 15 frames per second on a Pentium II 300 MHz processor, capturing 15 images per second at 320x240 resolution. The recognition time depends on the size of the regions set during the initialization period and on the range of motion considered for each feature.

6 Acknowledgement

This research is supported by the European project eRENA (Electronic Arenas for Culture, Performance, Art and Entertainment), ESPRIT IV Project 25379.

7 References

[1] D. DeCarlo, D. Metaxas, 'Optical Flow Constraints on Deformable Models with Applications to Face Tracking', CIS Technical Report MS-CIS-97-23.

[2] E. Cosatto, H.P. Graf, 'Sample-Based Synthesis of Photo-Realistic Talking Heads', Computer Animation 1998, pp. 103-110.

[3] C. Kouadio, P. Poulin, P. Lachapelle, 'Real-Time Facial Animation based upon a Bank of 3D Facial Expressions', Computer Animation 1998, pp. 128-136.

[4] I.S. Pandzic, T.K. Capin, N. Magnenat-Thalmann, D. Thalmann, 'Towards Natural Communication in Networked Collaborative Virtual Environments', FIVE '96, December 1996.

[5] SNHC, 'Information Technology - Generic Coding of Audio-Visual Objects, Part 2: Visual', ISO/IEC 14496-2, Final Draft of International Standard, ISO/IEC JTC1/SC29/WG11 N2502a, Atlantic City, Oct. 1998.

[6] SNHC, 'Text for ISO/IEC FDIS 14496-1 Systems (2nd draft)', ISO/IEC JTC1/SC29/WG11 N2501, Nov. 1998.

[7] P. Doenges, F. Lavagetto, J. Ostermann, I.S. Pandzic, E. Petajan, 'MPEG-4: Audio/Video and Synthetic Graphics/Audio for Mixed Media', Image Communications Journal, Vol. 5, No. 4, May 1997.

[8] I.S. Pandzic, T.K. Capin, E. Lee, N. Magnenat-Thalmann, D. Thalmann, 'A Flexible Architecture for Virtual Humans in Networked Collaborative Virtual Environments', Proceedings of Eurographics '97, Budapest, Hungary, 1997.

[9] G. Sannier, S. Balcisoy, N. Magnenat-Thalmann, D. Thalmann, 'VHD: A System for Directing Real-Time Virtual Actors', The Visual Computer, Springer, 1999.

[10] W.S. Lee, M. Escher, G. Sannier, N. Magnenat-Thalmann, 'MPEG-4 Compatible Faces from Orthogonal Photos', Computer Animation 1999, pp. 186-194.
