Multimodal interaction in collaborative virtual environments

Taro Goto, Marc Escher, Christian Zanardi, Nadia Magnenat-Thalmann
MiraLab, University of Geneva
http://www.miralab.unige.ch
E-mail: {goto, escher, zanardi, thalmann}@cui.unige.ch

Abstract

Human interfaces for computer graphics systems are now evolving towards a fully multimodal approach. Information gathered using visual, audio and motion capture systems is becoming increasingly important within user-controlled virtual environments. This paper discusses real-time interaction through the visual analysis of human face features. The underlying approach used to recognize and analyze the facial movements of a real performance is described in detail. The output of the program is directly compatible with MPEG-4 standard parameters, so the extracted data can be used in any other MPEG-4 compatible application. The real-time facial analysis system gives the user the ability to control the graphics system by means of facial expressions. This is used primarily with real-time facial animation systems, where the synthetic actor reproduces the animator's expressions. The MPEG-4 standard focuses strongly on networking capabilities and therefore offers interesting possibilities for teleconferencing, as the network bandwidth requirements are quite low.

Keywords: Facial analysis, Real-time feature tracking, MPEG-4, Real-time facial animation.

1. Introduction

In the last few years, the number of applications that require a fully multimodal interface with the virtual environment has steadily increased. Within this field of research, recognition of facial expressions is a complex and interesting subject that has attracted numerous research efforts. For instance, DeCarlo and Metaxas [1] applied an algorithm based on optical flow and a generic face model. This method is robust, but recognizing the face takes a long time and does not run in real time. Cosatto and Graf [2] used a sample-based method, which requires building a set of samples for each person. Kouadio et al. [3] also used a sample-based database, together with face markers, to process in real time. However, the use of markers is not always practical, and it is attractive to allow recognition without them. Pandzic et al. [4] use an edge-extraction-based algorithm that runs in real time without markers.

This paper describes a method to track face features in real time, in more detail, without markers or lip make-up. The output is converted to MPEG-4 Facial Animation Parameters (FAPs). These feature parameters are sent to a real-time player that deforms and displays a synthetic 3D animated face. In the next section, the complete system for interactive facial animation is described. A brief description of the MPEG-4 standard and the related facial animation parameters is given in Section 3. In Section 4, the facial feature tracking system is described in detail. The paper concludes with real-time results applied to a compatible facial animation system.

2. System Overview

Figure 1 sketches the different tasks and interactions needed to generate a real-time virtual dialog between a synthetic clone and an autonomous actor [1]. The video and the speech of the user drive the clone's facial animation, while the autonomous actor uses the information on speech and facial emotions from the user to generate an automatic behavioral response. MPEG-4 Facial Animation Parameters (FAPs) are extracted in real time from the video input of the face. These FAPs can either be used to animate the cloned face or be processed to compute high-level emotions that are transmitted to the autonomous actor. From the user's speech, phonemes and text information can be extracted. The phonemes are then blended with the FAPs from the video to enhance the animation of the clone. The text is sent to the autonomous actor and processed together with the emotions to generate a coherent answer to the user. Our system is compliant with the MPEG-4 definition [2], briefly described in the next section.

3. MPEG-4

The newly standardized MPEG-4 was developed in response to the growing need for a coding method that can facilitate access to visual objects in natural and synthetic video and sound, for applications such as digital storage media, the Internet, and various forms of wired or wireless communication. ISO/IEC JTC1/SC29/WG11 (the Moving Picture Experts Group, MPEG) has been working on this standard to make it an International Standard in March 1999 [5]. It provides support for 3D graphics, synthetic sound, text-to-speech, as well as synthetic faces and bodies. This paper describes the use of the facial definition and animation parameters in an interactive real-time animation system.

Figure 1 System Overview

3.1 Face definition and animation

The Face and Body animation Ad Hoc Group (FBA) has defined in detail the parameters for both the definition and the animation of human faces and bodies. Definition parameters allow a detailed description of body/face shape, size and texture. Animation parameters allow the definition of facial expressions and body postures. These parameters are designed to cover all naturally possible expressions and postures, as well as exaggerated expressions and motions to some extent (e.g. for cartoon characters). The animation parameters are precisely defined in order to allow an accurate implementation on any facial/body model. Here we mostly discuss facial definitions and animations based on a set of feature points located at morphological places on the face. The following section briefly describes the Facial Animation Parameters (FAPs).

3.2 FAP

FAPs are encoded for low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications. FAPs manipulate key feature control points on a mesh model of the face to produce animated visemes (the visual counterparts of phonemes) for the mouth (lips, tongue, teeth), as well as the animation of the head and of facial features such as the eyes and eyebrows.

All FAPs involving translational movement are expressed in terms of Facial Animation Parameter Units (FAPU). These units are defined so that FAPs can be interpreted on any facial model in a consistent way, producing reasonable results in terms of expression and speech pronunciation. They correspond to fractions of distances between essential facial features (e.g. the eye distance), and the fractions are chosen to provide enough accuracy.
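To make the FAPU mechanism concrete, the following minimal Python sketch converts a tracked pixel displacement into an integer FAP value, assuming the common MPEG-4 convention of defining one unit as a fraction (here 1/1024) of a neutral-face distance; the helper names and the example numbers are illustrative, not taken from the standard text.

```python
# Illustrative sketch: expressing a tracked pixel displacement in FAPU.
# The 1/1024 fraction and the neutral-face distance are assumptions for
# this example; the distance would be measured once at initialization.

def fapu(neutral_distance_px: float, fraction: int = 1024) -> float:
    """One Facial Animation Parameter Unit, as a fraction of a
    neutral-face distance measured in pixels."""
    return neutral_distance_px / fraction

def displacement_to_fap(displacement_px: float, unit: float) -> int:
    """Convert a tracked displacement (pixels) into an integer FAP value."""
    return round(displacement_px / unit)

# Example: a mouth-width unit measured as 180 px on the neutral face.
mw_unit = fapu(180.0)                      # 1 FAPU = 180/1024 px
corner_lip_fap = displacement_to_fap(12.5, mw_unit)
print(corner_lip_fap)                      # ~71 units for a 12.5 px motion
```

Because the same FAP value is interpreted relative to each model's own FAPU, the animation transfers consistently to faces of different proportions.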

Figure 3 Initialization and sample tracking results

4. Facial feature tracking system

The facial feature tracking system is outlined in Figure 2. The system not only recognizes face motion, it also animates an MPEG-4 compatible virtual face. Using a camera, facial features are tracked in real time, and the extracted feature motions and shapes are converted to MPEG-4 FAPs, which are then sent to a virtual environment over the Internet. To achieve real-time tracking, several problems must be solved. The main problem lies in the variety of individual appearances, such as skin color, eye color, beards, glasses, and so on. The facial features are sometimes not separated by sharp edges, or edges appear at unusual places. This diversity increases the difficulty of recognizing faces and tracking facial features.
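Since the extracted parameters are very compact, they can be streamed with minimal bandwidth. The sketch below is a hypothetical illustration of such a transport, packing one frame of (FAP id, value) pairs into a single UDP datagram; it is not the normative MPEG-4 FBA bitstream encoding, and all names are placeholders.

```python
# Hypothetical transport sketch: one frame of FAP (id, value) pairs sent
# as a small UDP datagram. Illustrates the low-bandwidth idea only.
import socket
import struct

def send_fap_frame(sock: socket.socket, address, frame_no: int, faps: dict) -> None:
    # Header: frame number and FAP count; body: (id, value) pairs.
    payload = struct.pack("!HH", frame_no & 0xFFFF, len(faps))
    for fap_id, value in sorted(faps.items()):
        payload += struct.pack("!Hh", fap_id, value)
    sock.sendto(payload, address)

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # A frame with a few FAPs (ids and values are placeholders).
    send_fap_frame(sock, ("127.0.0.1", 5005), 1, {3: 120, 4: -45, 51: 30})
```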

Figure 2 Tracking (image capture, motion tracking, animation making)

In this application, important facial feature characteristics and their associated information are set during an initialization phase, which helps solve the face-diversity problem. Figure 3(a) shows this simple initialization. The user only moves a few feature boxes (pupil boxes, a mouth box, etc.) to intuitive positions. Once the user has set the feature positions, the information around the features, edge information, and face color information are extracted automatically, together with face-dependent parameters containing all the relevant information needed to track, in real time and without any markers, the face position and its corresponding facial features.
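One possible way to hold the per-user data gathered during this initialization is sketched below; the field names, and the choice of a gray-level template, an edge profile and a mean color, are illustrative assumptions, since the paper only states that neighborhood, edge and face-color information are stored.

```python
# Hedged sketch of the face-dependent data captured at initialization.
# Field names and the exact information stored are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class FeatureBox:
    name: str                 # e.g. "left_pupil", "mouth"
    x: int
    y: int
    w: int
    h: int
    template: np.ndarray      # gray-level patch around the feature
    edge_profile: np.ndarray  # edge strengths inside the box
    mean_color: np.ndarray    # average color in the box (B, G, R)

def init_feature(gray: np.ndarray, color: np.ndarray,
                 name: str, x: int, y: int, w: int, h: int) -> FeatureBox:
    patch = gray[y:y + h, x:x + w].astype(np.float32)
    # Simple vertical gradient as a stand-in for the paper's edge information.
    edges = np.abs(np.diff(patch, axis=0)).mean(axis=1)
    mean_color = color[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)
    return FeatureBox(name, x, y, w, h, patch, edges, mean_color)
```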

The tracking process is separated into two parts: 1) mouth tracking and 2) eye tracking. Edge and gray-level information around the mouth and the eyes is the main information used during tracking. Figure 3(b) displays a sample result of the tracked features superimposed on the face image. The tracking methods for the mouth and the eyes are described in the next two sections.

4.1 Mouth Tracking

The mouth is one of the most difficult facial features to analyze and track. Indeed, the mouth has a very versatile shape and almost every muscle of the face drives its motion. Furthermore, a beard, a mustache, the tongue or the teeth may appear at times, which further complicates the already difficult tracking. Our method takes into account some intrinsic properties of the mouth: 1) the upper teeth are attached to the skull, so their position remains constant; 2) conversely, the lower teeth move down from their initial position according to the rotation of the jaw joints; 3) the basic mouth shape (open vs. closed) depends on this bone movement. From these properties it follows that detecting the positions of the hidden or visible teeth in an image is a good basis for a robust tracking algorithm of the mouth shape and its associated motion. The system first extracts all edges crossing the vertical line going from the nose to the jaw. In a second phase, a pattern-matching algorithm computes what we call here the energy, which corresponds to the similarity with the initial parameters extracted for the mouth. Finally, among all possible mouth shapes, the best candidate is chosen according to a "highest energy" criterion.
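A minimal sketch of this nose-to-jaw edge scan and of the "energy" maximization is given below, assuming grayscale frames as NumPy arrays; the negative mean-squared-difference score is a simple stand-in for the actual pattern-matching measure.

```python
# Minimal sketch of the mouth analysis: scan edges along the vertical
# nose-to-jaw line, then keep the candidate mouth shape with the highest
# "energy" (similarity to the data stored at initialization).
import numpy as np

def vertical_edges(gray: np.ndarray, x: int, y_nose: int, y_jaw: int) -> np.ndarray:
    """Absolute gray-level differences along the line from nose to jaw."""
    line = gray[y_nose:y_jaw, x].astype(np.float32)
    return np.abs(np.diff(line))

def energy(observed: np.ndarray, reference: np.ndarray) -> float:
    """Similarity with the reference profile (higher is better).
    Illustrative stand-in for the paper's pattern matching."""
    n = min(len(observed), len(reference))
    return -float(np.mean((observed[:n] - reference[:n]) ** 2))

def best_mouth_shape(edge_profile: np.ndarray, candidate_profiles: dict) -> str:
    """Among generic candidates (name -> expected edge profile), pick the
    one whose profile has the highest energy for the observed edges."""
    return max(candidate_profiles,
               key=lambda name: energy(edge_profile, candidate_profiles[name]))
```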

Figure 4 Edge configuration for possible mouth shapes (edge values along the nose-to-jaw line)

Figure 4 presents the gray-level values along the vertical line from the nose to the jaw for different possible mouth shapes, together with the corresponding detected edges:
• Closed mouth: the center edge is strong, the other two edges are normally weak, and the teeth are hidden inside. However, lip make-up may produce three strong edges, or the center edge may even be the weakest one; such person-dependent variations appear in the edge-detection result, and they also occur when the mouth is open.
• Opened mouth: as shown in the figure, when the teeth are visible, their edges are stronger than the edges of the outer lips, between a lip and the teeth, or between a lip and the inside of the mouth. If the teeth are hidden by the lips (upper or lower), their edges are of course not detected.

Once this edge-detection process is finished, the extracted edge information is compared with the data from a generic shape database and a first selection of possible mouth shapes is made, as shown in Figure 5.

Figure 5 Possible candidates (lip and teeth edge configurations)

After this selection, the corresponding energy is calculated for every possible candidate, and the position with the largest energy is chosen as the next mouth shape. This method fails for a closed mouth, however, because there is no region inside the lips. To compensate for this, the center of the mouth is first calculated as in the open case, and then the edge between the upper lip and the lower lip is followed to the left and to the right of the center. Figure 6 shows an example of this lip separation detection.

Figure 6 Detection of lips separation

The resulting open or closed mouth shapes are then transformed into FAP values to be transmitted over the network to the facial animation system.

4.2 Eye Tracking

The eye tracking system includes the following subsystems: pupil tracking, eyelid position recognition, and eyebrow tracking. These three subsystems are strongly dependent on one another. For example, if the eyelid is closed, the pupil is hidden and it is obviously impossible to detect its position. Some researchers have used a deformable generic model to detect the eye position and shape. This approach has a serious drawback when it comes to real-time face analysis because it is usually quite slow. In our first attempt, we also considered a generic model of the eye, but the system failed when the eye was closed, and stable results were difficult to obtain. We improved on this first method by 1) calculating both pupil positions, 2) calculating the eyebrow positions, 3) extracting the eyelid positions with respect to their possible range, and 4) checking every result for the presence of movement. In a last stage, inconsistencies are checked again and a new best position is chosen if necessary.

For pupil tracking, the same kind of energy function used for the mouth is applied. The main difference here is that the pupil might disappear during tracking when the eyelids close; our method takes such cases into account. The eye tracking system first traces a box area around the complete eye and finds the largest energy value; secondly, the position of the pupil is extracted according to the method described above for the mouth center. When the eyelid is completely closed, the position of the pupil is obviously undefined, but in return the eyelid has a high chance of being detected as closed.

Figure 7 Eyebrow search regions

The method to detect the position of the eyebrow, using the small box defined during initialization, follows an approach similar to the mouth shape recognition system. A vertical line goes down from the forehead until the eyebrow is detected (Figure 7). The eyebrow position is given by the maximum value of the energy.

After the center of the eyebrow is found, the edge of the brow is followed to the left and to the right to recognize its shape, as shown in Figure 7. As soon as the pupil and eyebrow locations have been detected using the methods described above, it is possible to estimate the eyelid location. When the energy of the pupil is large, or almost the same as its initial value, the eye is open. When the energy is small, the eye may be closed or the person may be looking up. The eyebrow position further narrows down the possible eyelid position: when the eyebrow is lower than its initial position, the eye is considered closed or half closed. This helps detect the true eyelid position, as opposed to a possible wrong detection that may occur with a wrinkle. After this process, the strongest edge in the considered area is detected and set as the eyelid position.
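The eyelid reasoning described above can be summarized as a small rule-based check combining the pupil energy and the eyebrow position; in the sketch below the threshold and the returned labels are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the eyelid reasoning: combine the pupil "energy" with the
# eyebrow position to decide whether the eye is open, closed, or ambiguous.
# Thresholds and labels are illustrative placeholders.

def eyelid_state(pupil_energy: float, initial_pupil_energy: float,
                 eyebrow_y: float, initial_eyebrow_y: float,
                 energy_ratio_open: float = 0.8) -> str:
    ratio = pupil_energy / max(initial_pupil_energy, 1e-6)
    if ratio >= energy_ratio_open:
        # Pupil found almost as clearly as at initialization: eye is open.
        return "open"
    # Weak pupil energy: either the eye is closed or the user is looking up.
    if eyebrow_y > initial_eyebrow_y:        # image y grows downward, so the
        return "closed_or_half_closed"       # eyebrow is lower than at start
    return "uncertain"                       # fall back to strongest-edge search

# Example: weak pupil response and a lowered eyebrow -> likely closed.
print(eyelid_state(0.3, 1.0, eyebrow_y=132.0, initial_eyebrow_y=125.0))
```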

5. Results and conclusion

Figure 8 presents several examples of the real-time tracking of facial features on different persons, together with the associated animated face. The program runs on Windows NT and 95/98. The recognition speed is 10 to 15 frames per second on a Pentium II 300 MHz processor, capturing 15 images of 320x240 pixels per second. The recognition time depends on the size of the feature boxes set during initialization and on the range of feature motion that is considered. This paper has presented a method to track facial features in real time. The recognition method for facial expressions in our system does not use any special markers or make-up. It needs no training, only a simple initialization of the system, allowing a new user to adapt immediately. The extracted data is transformed into MPEG-4 FAPs that can easily be used by any compatible facial animation system.

Figure 8 Result with FAPs

6. Acknowledgement

The authors would like to thank every member of MIRALab who helped create this system, especially Dr. I.S. Pandzic, who created the original face tracking software, and W.S. Lee for the cloning system. This research is supported by the European project eRENA (Electronic Arenas for Culture, Performance, Art and Entertainment), ESPRIT IV Project 25379.

7. References

[1] D. DeCarlo, D. Metaxas, "Optical Flow Constraints on Deformable Models with Applications to Face Tracking", CIS Technical Report MS-CIS-97-23.
[2] E. Cosatto, H.P. Graf, "Sample-Based Synthesis of Photo-Realistic Talking Heads", Computer Animation 1998, pp. 103-110.
[3] C. Kouadio, P. Poulin, P. Lachapelle, "Real-Time Facial Animation based upon a Bank of 3D Facial Expressions", Computer Animation 1998, pp. 128-136.
[4] I.S. Pandzic, T.K. Capin, N. Magnenat-Thalmann, D. Thalmann, "Towards Natural Communication in Networked Collaborative Virtual Environments", FIVE '96, December 1996.
[5] SNHC, "Information Technology - Generic Coding of Audio-Visual Objects, Part 2: Visual", ISO/IEC 14496-2, Final Draft of International Standard, version of 13 Nov. 1998, ISO/IEC JTC1/SC29/WG11 N2502a, Atlantic City, Oct. 1998.
[6] P. Doenges, F. Lavagetto, J. Ostermann, I.S. Pandzic, E. Petajan, "MPEG-4: Audio/Video and Synthetic Graphics/Audio for Mixed Media", Image Communications Journal, Vol. 5, No. 4, May 1997.
[7] I.S. Pandzic, T.K. Capin, E. Lee, N. Magnenat-Thalmann, D. Thalmann, "A Flexible Architecture for Virtual Humans in Networked Collaborative Virtual Environments", Proceedings of Eurographics '97, Budapest, Hungary, 1997.
[8] G. Sannier, S. Balcisoy, N. Magnenat-Thalmann, D. Thalmann, "VHD: A System for Directing Real-Time Virtual Actors", The Visual Computer, Springer, 1999.
[9] W.S. Lee, M. Escher, G. Sannier, N. Magnenat-Thalmann, "MPEG-4 Compatible Faces from Orthogonal Photos", Computer Animation 1999, pp. 186-194.
