Prototyping and transforming visemes for animated speech

Bernard Tiddeman and David Perrett
School of Computer Science and School of Psychology, University of St Andrews, Fife KY16 9JU
{bpt,dp}@st-and.ac.uk

Abstract

Animated talking faces can be generated from a set of predefined face and mouth shapes (visemes) by either concatenation or morphing. Each facial image corresponds to one or more phonemes, which are generated in synchrony with the visual changes. Existing implementations require a full set of facial visemes to be captured or created by an artist before the images can be animated. In this work we generate new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes. This allows us to generate visual speech from photographs and portraits where a full set of visemes is not available.

Keywords: Visual speech, facial animation, image transformation.

1. Introduction

The synthesis of realistic animated facial images remains a hot topic in computer graphics research, and several approaches have been taken over the last 20 years. Early work focussed on the animation of geometrical facial models, which can be performed using predefined expression and mouth shape components [1][2][3]. The focus then shifted to physics-based anatomical face models, with animation performed using numerical simulation of muscle actions [4][5]. More recently an alternative approach has emerged which takes photographs of a real individual enunciating the different phonemes and concatenates or interpolates between them [6][7][8]. The image-based approach can produce a photo-realistic face, but requires collection of all the required visemes. In many situations the collection of a complete set of visemes is inconvenient or impossible, for example if we want to animate a portrait or a photograph of a well-known historical figure, or to give a disabled speaker (such as Prof. Stephen Hawking) a more natural, interactive style when presenting using computer-generated speech.

This paper presents a method for creating a set of viseme images from a single neutral photograph. Other approaches to animating single images have ignored the colour changes (e.g. the appearance of the tongue and teeth inside the mouth) and have simply warped the original image (e.g. [9]). Incorrect or absent visual speech cues are an impediment to the correct interpretation of speech for both hearing and lip-reading viewers [10]. Facial transformations based on prototypes have been developed in psychology research for changing the apparent age and gender of facial images; two prototypes from different sets can be used to define a transformation from one set to the other [11][12]. For example, a young prototype and an old prototype could be used to define an ageing transformation, or a male prototype and a female prototype to define a gender transformation. In this work we propose using these facial prototyping and transformation methods to produce a novel set of visemes from a single example image. The resultant viseme images can then be integrated with a text-to-speech engine and animated to produce real-time, photo-realistic visual speech.

2. Methods

2.1 Data Collection

In order to elicit the required mouth shapes, 7 subjects were filmed speaking a standard sentence. The sentence ("A note about the bored worker who beat his fat chum's head in and got the boot") was designed to elicit the 16 mouth shapes defined by Ezzat and Poggio [8]. After viewing the films it was decided to extend the set of mouth shapes by splitting the /t,d,s,z,th,dh/ viseme into a /t,d,s,z/ viseme, where the tongue is not visible, and a /th,dh/ viseme, where the tongue can be seen. The frame corresponding to the most extreme position of each viseme was manually selected from each clip. The facial features of each viseme image were then delineated with a set of lines and contours. In this work the neutral (silent) viseme was automatically delineated using an active shape model [13][14], and adjustments away from the neutral position in the remaining visemes were made by hand.

2.2 Prototyping

The shape and colour average of each viseme was then constructed across the 7 subjects. Each prototype was constructed by averaging the set of viseme images in terms of 2D shape, pixel colour [15][16] and multiscale texture [12] (Figure 1). The shape of each face in the set is delineated with 179 points located along contours around the major facial features (eyes, nose and mouth) and the facial border. The average shape is found by averaging the position of each delineated point across the set. The colour of each pixel in the prototype image is then found by warping each component image into the average shape and calculating the mean colour.
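
The core prototyping operation can be sketched in a few lines of NumPy/scikit-image. This is a minimal illustration, not the authors' implementation: the function names (warp_to_shape, build_prototype), the choice of a piecewise-affine warp and the omission of the multiscale texture averaging [12] are all assumptions made here for clarity.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def warp_to_shape(image, src_points, dst_points):
    """Warp `image` so that features at src_points end up at dst_points.
    A piecewise-affine warp is one simple choice; the paper's own warping
    scheme is not specified here."""
    tform = PiecewiseAffineTransform()
    # warp() expects a mapping from output coordinates back to the input
    # image, so we estimate the transform from dst to src.
    tform.estimate(dst_points, src_points)
    return warp(image, tform)

def build_prototype(images, shapes):
    """images: list of HxWx3 float arrays in [0, 1]; shapes: list of
    (179, 2) arrays of (x, y) feature points, in the same order for
    every face."""
    mean_shape = np.mean(np.stack(shapes), axis=0)   # average point positions
    warped = [warp_to_shape(im, s, mean_shape)       # warp each face into
              for im, s in zip(images, shapes)]      # the average shape
    mean_image = np.mean(np.stack(warped), axis=0)   # average pixel colours
    return mean_shape, mean_image
```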

Figure 1. Constructing a shape and colour viseme prototype. The facial features of each input face (b) are delineated (a) and averaged to define the mean shape. Each component image is then warped into the average shape (c) and averaged to produce the prototype image.

2.3 Synthetic viseme construction

Given a static facial image in a neutral position, we can transform it into the desired mouth shape and coloration (Figure 2). After first aligning the facial feature points using a (least-squares error) rigid-body fit [17], the new shape is defined as the subject's silent shape plus the difference between the prototype silent shape and the desired prototype viseme shape. The subject's silent image, the silent prototype image and the viseme prototype image are then all warped into the new shape. At each pixel, the colour difference between the silent prototype image and the viseme prototype image is added to the new image, giving a shape and colour viseme of the subject. In this work we mask the colour changes so that only the area of the jaw is affected. This process is similar to image morphing [18][19], but the identity of the subject does not change through the sequence. We also use an additional texture transformation method [12] to ensure that we do not add unwanted artefacts (such as stubble to a female face) while preserving important textural cues such as smile lines and lip crinkling.

Figure 2. An illustration of the transformation process: (a) delineate the input subject and prototype images (from row (b)) and define the new shape; (b-c) warp the subject and prototype images into the new shape and (c) transform the colours at each pixel.
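
As a rough illustration of the transformation just described, the sketch below forms the new shape from the prototype shape difference and adds the masked prototype colour difference. It is hypothetical rather than the authors' code: it reuses the warp_to_shape helper from the prototyping sketch, and it assumes the rigid-body pre-alignment [17] and the texture transformation step [12] are handled elsewhere.

```python
import numpy as np

def make_viseme(subject_img, subject_shape,
                proto_silent_img, proto_silent_shape,
                proto_viseme_img, proto_viseme_shape,
                jaw_mask):
    """All images are HxWx3 float arrays in [0, 1]; all shapes are (179, 2)
    point arrays; jaw_mask is an HxWx1 weight image confining the colour
    change to the mouth/jaw region."""
    # New shape: subject's silent shape plus the prototype shape difference.
    new_shape = subject_shape + (proto_viseme_shape - proto_silent_shape)

    # Warp the subject and both prototypes into the new shape
    # (warp_to_shape as sketched in the prototyping listing above).
    subj_w   = warp_to_shape(subject_img, subject_shape, new_shape)
    silent_w = warp_to_shape(proto_silent_img, proto_silent_shape, new_shape)
    viseme_w = warp_to_shape(proto_viseme_img, proto_viseme_shape, new_shape)

    # Add the masked prototype colour difference at each pixel.
    out = subj_w + jaw_mask * (viseme_w - silent_w)
    return np.clip(out, 0.0, 1.0), new_shape
```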

2.4 Animation

The simplest approach to producing animated speech using visemes is to switch to whichever image corresponds to the phoneme currently being spoken. Several existing speech engines support concatenated-viseme visual speech by reporting either the current phoneme or a suggested viseme. For example, the Microsoft Speech API (SAPI) will report which of the 21 Disney visemes is currently being spoken; we use a table to convert between our extended Ezzat and Poggio set and the Disney set. For slightly smoother visual speech we detect changes in viseme and form a 50/50 blend of the previous and current visemes. Concatenation can nevertheless produce visual speech that is jerky and unrealistic, especially for slow speech, where the viseme transitions become visible. A better approach is to interpolate smoothly between visemes, in terms of both the 2D shape and the colour, i.e. to use the well-known technique of image morphing [18][19][20]. Ezzat and Poggio [8] used an automatic optical-flow-based method to interpolate between neighbouring visemes. Optical flow reflects the fact that the desired changes in the animated face should correspond to movement, but it cannot generally be computed in real time. In this work the main facial features have already been delineated in each viseme, so feature-based morphing can be used, as in [6][7]. We exploit the 3D graphics hardware available in most new PCs to perform the shape and colour changes in real time using texture mapping and alpha blending. The morphing viseme approach is most suitable for use with a diphone concatenation speech engine, such as the Festival TTS produced by Edinburgh University [21].
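
The per-frame morph can be sketched in software as a linear interpolation of the delineated shapes followed by a cross-dissolve of the warped images; in the actual system the equivalent warp and blend are performed on the graphics hardware with texture mapping and alpha blending. The sketch below is illustrative only and reuses the hypothetical warp_to_shape helper from the earlier listing.

```python
import numpy as np

def morph_visemes(img_a, shape_a, img_b, shape_b, t):
    """Morph from viseme A (t = 0) to viseme B (t = 1)."""
    shape_t = (1.0 - t) * shape_a + t * shape_b        # interpolate the 2D feature points
    warped_a = warp_to_shape(img_a, shape_a, shape_t)  # warp both endpoint images
    warped_b = warp_to_shape(img_b, shape_b, shape_t)  # into the in-between shape
    return (1.0 - t) * warped_a + t * warped_b         # cross-dissolve the colours

# The 50/50 blend used to smooth simple viseme concatenation (described above)
# is just morph_visemes(prev_img, prev_shape, cur_img, cur_shape, 0.5).
```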

More advanced speech animation needs to include context dependence in the choice of viseme (coarticulation) [22], as well as the other head movements (blinking, nods and expression changes) required for realism. Goyal et al. [23] have tackled some of these problems by adding additional images corresponding to nods and blinks. Alternatively, the hidden Markov model (HMM) method of Brand [9] builds a statistical model of head movement which is driven by the audio signal, and includes some coarticulation. We have only used the simple linear morphing method, but the method proposed here (for producing more realistic synthetic visemes with tongue, teeth and appropriate shading changes) could be combined with HMM-driven animation.

3. Results

The silence viseme and the seven visemes that we use to represent the 24 consonantal phonemes are shown in Figure 3. The nine visemes used to represent the 12 monophthong phonemes and the 2 diphthong phonemes are shown in Figure 4. The changes in the external mouth shape, as well as the appearance of the teeth and tongue, are visible in the prototypes.

Figure 3. Prototypes of the silence viseme (/#/) and the 7 consonant visemes.

Figure 4. The 7 monophthong and 2 diphthong viseme prototypes.

The results of transforming Leonardo da Vinci's Mona Lisa into a number of the selected visemes are shown in Figure 5. Examples of transforming an image of the well-known theoretical physicist Professor Stephen Hawking into a number of visemes are shown in Figure 6. Again the changes in the mouth shape are visible, along with the appearance and disappearance of the internal mouth features such as the teeth and tongue. Example video clips are available from http://monty.st-and.ac.uk/more/TextToVideo/.

Figure 5. Example visemes from the Mona Lisa set.

Figure 6. Example visemes from the Stephen Hawking set.

4. Conclusions

We have applied a facial transformation method designed for computer-graphic ageing and gender changing to the problem of generating a talking face from a static 2D image. The results show that we can render visual speech of comparable realism to other viseme-morphing approaches, without needing to collect the necessary visual corpus from every desired speaker. This has allowed us to produce animated visual speech where collecting a complete set of viseme images is impossible, such as from a portrait or a disabled subject. We have integrated these visemes into a real-time visual speech generator, which uses the hardware morphing capabilities of modern PCs.

5. References

[1] F.I. Parke, “Parameterised models for facial animation”, IEEE Computer Graphics and Applications, 1982, Vol. 2, No. 9, pp. 61-68.
[2] N.D. Duffy and J.F.S. Yau, “Facial image reconstruction and manipulation from measurements obtained using a structured lighting technique”, Pattern Recognition Letters, 1988, Vol. 7, pp. 239-243.
[3] N. Magnenat-Thalmann, H. Minh, M. Angelis and D. Thalmann, “Design, transformation and animation of human faces”, Visual Computer, 1989, Vol. 5, pp. 32-39.
[4] K. Waters, “A muscle model for animating three-dimensional facial expression”, in SIGGRAPH87 Conference Proceedings, 1987, pp. 17-24.
[5] Y. Lee, D. Terzopoulos and K. Waters, “Realistic face modelling for animation”, in SIGGRAPH95 Conference Proceedings, 1995, pp. 55-62.
[6] K.C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R. Wright, M. Lee and K.J. Hussey, “Synthesis of speaker facial movement to match selected speech sequences”, in Proc. 5th Australian Conf. on Speech Science and Technology, 1994, Vol. 2, pp. 620-625.
[7] S.H. Watson, J.R. Wright, K.C. Scott, D.S. Kagels, D. Freda and K.J. Hussey, “An advanced morphing algorithm for interpolating phoneme images to simulate speech”, Jet Propulsion Lab, California Institute of Technology, 1997.
[8] T. Ezzat and T. Poggio, “Visual speech synthesis by morphing visemes”, International Journal of Computer Vision, 2000, Vol. 38, No. 1, pp. 45-57.
[9] M. Brand, “Voice Puppetry”, in SIGGRAPH99 Conference Proceedings, 1999, pp. 21-28.
[10] H. McGurk and J. MacDonald, “Hearing lips and seeing voices”, Nature, December 1976, pp. 746-748.
[11] D.A. Rowland and D.I. Perrett, “Manipulating Facial Appearance through Shape and Color”, IEEE Computer Graphics and Applications, 1995, Vol. 15, No. 5, pp. 70-76.
[12] B.P. Tiddeman, D.M. Burt and D.I. Perrett, “Prototyping and Transforming Facial Textures for Perception Research”, IEEE Computer Graphics and Applications, Sept/Oct 2001, Vol. 21, No. 5, pp. 42-50.
[13] T. Cootes, C. Taylor, D. Cooper and J. Graham, “Active shape models - their training and application”, Computer Vision and Image Understanding, 1995, Vol. 61, No. 1, pp. 38-59.
[14] A. Lanitis, C.J. Taylor and T.F. Cootes, “Automatic interpretation and coding of face images using flexible models”, IEEE Trans. on PAMI, 1997, Vol. 19, No. 7, pp. 743-756.
[15] P.J. Benson and D.I. Perrett, “Synthesizing continuous-tone caricatures”, Image and Vision Computing, 1991, Vol. 9, pp. 123-129.
[16] P.J. Benson and D.I. Perrett, “Extracting prototypical facial images from exemplars”, Perception, 1993, Vol. 22, pp. 257-262.
[17] K.S. Arun, T.S. Huang and S.D. Blostein, “Least-squares fitting of two 3-D point sets”, IEEE Trans. on PAMI, 1987, Vol. 9, No. 5, pp. 698-700.
[18] T. Beier and S. Neely, “Feature based image metamorphosis”, in SIGGRAPH92 Conference Proceedings, 1992, pp. 35-42.
[19] S-Y. Lee, K-Y. Chwa, S.Y. Shin and G. Wolberg, “Image metamorphosis using snakes and free-form deformations”, in SIGGRAPH95 Conference Proceedings, 1995, pp. 439-448.
[20] G. Wolberg, Digital Image Warping, IEEE Computer Society Press, Los Alamitos, CA, 1990.
[21] A. Black and P. Taylor, The Festival Speech Synthesis System, http://www.cstr.ed.ac.uk/projects/festival/, University of Edinburgh, 1997.
[22] M.M. Cohen and D.W. Massaro, “Modeling coarticulation in synthetic visual speech”, in N.M. Thalmann and D. Thalmann, editors, Models and Techniques in Computer Animation, pp. 103-110, Philadelphia, Pennsylvania, 1998.
[23] U.K. Goyal, A. Kapoor and P. Kalra, “Text-to-audiovisual speech synthesizer”, Virtual Worlds, Lecture Notes in Artificial Intelligence, 2000, Vol. 1834, pp. 256-269.
