Expressive Face Animation Synthesis Based on Dynamic Mapping Method*

Panrong Yin, Liyue Zhao, Lixing Huang, and Jianhua Tao

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China
{pryin,lyzhao,lxhuang,jhtao}

Abstract. In this paper, we present a framework for a speech-driven face animation system with expressions. It systematically addresses audio-visual data acquisition, expressive trajectory analysis and audio-visual mapping. Based on this framework, we learn the correlation between neutral facial deformation and expressive facial deformation with a Gaussian Mixture Model (GMM). A hierarchical structure is proposed to map the acoustic parameters to lip FAPs. The synthesized neutral FAP streams are then extended with expressive variations according to the prosody of the input speech. The quantitative evaluation of the experimental results is encouraging, and the synthesized face shows a realistic quality.

1 Introduction

Speech-driven face animation aims to convert an incoming audio stream into a sequence of corresponding face movements. Possible applications include multimodal human-computer interfaces, virtual reality and videophones. Until now, most work has focused on lip movement [1][2][3]. Among these approaches, the rule-based method [2] and the Vector Quantization (VQ) method [3] are two direct and easily realized ways, but their results are usually inaccurate and discontinuous because of limited rules and codebooks. Neural networks are also an effective way to perform the audio-visual mapping. For instance, Massaro [1] trained a neural network to learn the mapping from LPCs to face animation parameters, using the current frame together with 5 backward and 5 forward time steps as input to model the context. Although a neural network needs only a moderate number of samples and produces smooth synthesized results, it is strongly influenced by the initial parameter setting and is easily trapped in local minima. The Hidden Markov Model (HMM) is also widely used in this area because of its successful application in speech recognition. Yamamoto [3] built a phoneme recognition model with HMMs and directly mapped the recognized phonemes to lip shapes, followed by a smoothing algorithm. Because the HMMs can only be built on phonemes, the work is tied to a specific language; furthermore, the synthesized lip sequence is not very smooth.

* The work is supported by the National Natural Science Foundation of China (No. 60575032) and the 863 Program (No. 2006AA01Z138).

A. Paiva, R. Prada, and R.W. Picard (Eds.): ACII 2007, LNCS 4738, pp. 1–11, 2007. © Springer-Verlag Berlin Heidelberg 2007



Most of these systems are based on a phonemic representation (phoneme or viseme) and show limited efficiency because of the restrictions of their algorithms. To reduce the computational complexity and make the synthesized result smoother, some researchers have applied dynamic mapping methods that reorder or concatenate existing audio-visual units to form a new visual sequence. For instance, Bregler (Video Rewrite) [6] reorders existing mouth frames based on recognized phonemes. Cosatto [7] selects visual frames according to the distance between the new audio track and the stored audio track, and concatenates the candidates to form the smoothest sequence.

While lip shapes are closely related to speech content (linguistic), facial expression is a primary way of passing non-verbal (paralinguistic) information, which contains a set of messages related to the speaker’s emotional state. Thus, it would be more natural and vivid for a talking head to show expressions when communicating with a human. Pengyu Hong [4] not only applied several MLPs to map LPC cepstral parameters to face motion units, but also used different MLPs to map the estimated motion units to expressive motion units. Ashish Verma [12] used optical flow between visemes to generate face animations with different facial expressions. Yan Li [16] adopted 3 sets of cartoon templates with 5 levels of intensity to show expressions synchronized with face animation. Although some work has been done on expressive facial animation from speech, the naturalness of the synthesized results is still an open question.

In our work, we design a framework for a speech-driven face animation system based on the dynamic mapping method. To investigate how facial expressions are formed, the correlation between neutral facial deformation and expressive facial deformation is first modeled by GMMs. We then combine the VQ method with a frame-based concatenation model to facilitate the audio-visual mapping and keep the result as realistic as possible. In the training procedure, we cluster the training audio vectors according to the phoneme information and use a codebook to denote the phoneme categories. The number of phonemes in the database determines the size of the codebook. During synthesis, we apply a hierarchical structure in the following steps. First, each frame of the input speech is compared with the codebook, which yields three candidate codes. Second, for each candidate code there are several phoneme samples; the target unit (the current 3 frames) of the input speech, together with its context, is shifted within the speech sequence of a tri-phone to find the best-matching sub-sequence. Third, the visual distance between two adjacent candidate units is computed to ensure that the concatenated sequence is smooth. Last, the expressive variation predicted by the k-th GMM, which is determined by the prosody of the input speech, is imposed on the synthesized neutral face animation sequence. Figure 1 shows a block diagram of our expressive talking head system.

In the rest of the paper, Section 2 introduces the data acquisition, Section 3 analyzes the trajectories of expressive facial deformation and the GMM modeling process, Section 4 focuses on the realization of lip movement synthesis from speech, Section 5 gives the experimental results, and Section 6 presents the conclusion and future work.



Fig. 1. Expressive speech driven face animation system framework

2 Data Acquisition and Preprocessing

To establish the audio-visual database, a digital camera and the microphone on the camera were used to record facial movement and the speech signal synchronously. The training database used in our work consists of 300 complete sentences and about 12000 frame samples. The speaker was directed to articulate the same text content once with a neutral accent and once with expressions of natural intensity. Here, we choose 3 emotional states: angry, happy and surprise. Once the training data is acquired, the audio and visual data are analyzed separately.

For visual representation, our tracking method is implemented in an estimation-and-refining way and is applied to each successive image pair. 20 salient facial feature points (Fig. 2(a)), including two nostrils (p1 and p2), six brow corners (p3, p4, p5, p6, p7 and p8), eight eye corners (p9, p10, p11, p12, p13, p14, p15 and p16), and four mouth corners (p17, p18, p19 and p20), are used to represent the facial shape. They are initialized interactively in the first frame, and the KLT tracker is used to estimate the feature points in the next frame. Assume that $X = (x_1, y_1, x_2, y_2, \ldots, x_N, y_N)^T$ are the positions of the feature points in the current frame and $dX$ is the offset estimated by KLT; we refine the initial tracking result $X + dX$ by applying the constraints of Point Distribution Models (PDM). Figure 2(b) shows some examples of expressive facial images. After coordinate normalization and affine transformation, 19 FAPs (see Table 1) related to lip and facial deformation are extracted.

For audio representation, the Mel-Frequency Cepstral Coefficients (MFCC), which give an alternative representation of the speech spectrum, are calculated. The speech signal, sampled at 16 kHz, is blocked into frames of 40 ms. 12-dimensional MFCC coefficients are computed for every audio frame, and one visual frame corresponds to one audio frame.
Furthermore, global statistics on the prosody features (such as pitch and energy) responsible for facial expressions are extracted in our work. The F0 range, the maximum of F0, the minimum of F0, the mean of F0 and the mean of energy, which have been confirmed to be useful for emotional speech classification, are selected for each sample.
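As an illustration of the global prosody statistics listed above, the following minimal sketch computes them from a per-frame F0 track and a per-frame energy track. The function names, the use of non-overlapping 40 ms frames, the mean-square definition of frame energy, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions made for this example; they are not specified in the text.

```python
def frame_energy(samples, sample_rate=16000, frame_ms=40):
    """Mean-square energy of consecutive non-overlapping frames (assumed definition)."""
    size = sample_rate * frame_ms // 1000  # 640 samples per frame at 16 kHz and 40 ms
    return [
        sum(s * s for s in samples[i:i + size]) / size
        for i in range(0, len(samples) - size + 1, size)
    ]


def prosody_stats(f0_track, energy_track):
    """Global prosody statistics for one utterance.

    Frames with F0 == 0 are treated as unvoiced and skipped (assumption).
    Raises ValueError if the utterance contains no voiced frame.
    """
    voiced = [f for f in f0_track if f > 0]
    return {
        "f0_range": max(voiced) - min(voiced),
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "f0_mean": sum(voiced) / len(voiced),
        "energy_mean": sum(energy_track) / len(energy_track),
    }
```

The resulting five values form one feature vector per sample. How this vector is mapped to an emotion category is left open here, as in the text.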






Fig. 2. (a) Facial feature points and (b) Expressive facial image

Table 1. FAPs for visual representation in our system

Group  FAP name
2      Open_jaw
4      Raise_l_i_eyebrow
4      Raise_r_i_eyebrow
4      Raise_l_m_eyebrow
4      Raise_r_m_eyebrow
4      Raise_l_o_eyebrow
4      Raise_r_o_eyebrow
8      Lower_t_midlip_o
8      Raise_b_midlip_o
8      Stretch_r_cornerlip_o
8      Stretch_l_cornerlip_o
8      Lower_t_lip_lm_o
8      Lower_t_lip_rm_o
8      Raise_b_lip_lm_o
8      Raise_b_lip_rm_o
8      Raise_l_cornerlip_o
8      Raise_r_cornerlip_o

3 Neutral-Expressive Facial Deformation Mapping

In our work, the problem of processing facial expression is simplified by taking advantage of the correlation between the facial deformation without expressions and the facial deformation with expressions for the same speech content. Here, we choose 3 points for the following analysis: p4 (middle point of the right eyebrow), p18 (middle of the upper lip) and p19 (left corner of the mouth). Figure 3 shows the dynamic vertical movement of p4 for “jiu4 shi4 xia4 yu3 ye3 qu4” in the neutral and surprise conditions, and the vertical movement of p18 in the neutral and happy conditions. It is evident from Fig. 3(a) that facial deformations are strongly affected by emotion. Although the trajectory of the vertical movement of the right eyebrow in the surprise condition follows the trend of that in the neutral condition, its intensity is much stronger and its duration tends to be longer. On the other hand, lip movement is influenced not only by the speech content but also by the emotional rhythm of the speech. The action of showing expressions deforms the original mouth shape. From Fig. 3(b) and Fig. 3(c), we can see that p18, which is on the upper lip, moves up more in the happy condition, and that p18 and p19 have similar vertical trajectories because they are coupled on the lip.




Fig. 3. Movements of (a) p4, (b) p18 and (c) p19 under different emotional states

Since the degree of expression lies in a continuous space, it is not reasonable to simply classify the intensity of expression into several levels. Therefore, to estimate natural expressive facial deformation from neutral face movements, GMMs are used to model the probability distribution of the neutral-expressive deformation vectors. Each emotion category corresponds to one GMM. To collect training data, the neutral (N) and expressive (E) vector sequences are first aligned with Dynamic Time Warping (DTW). Then, we cascade the neutral facial deformation features with the expressive facial deformation features to compose the joint feature vector $Z_{ik} = [X_k, Y_{ik}]^T$, $i \in [0,1,2,3,4,5]$, $k = 1, \ldots, N$. The joint probability distribution of the neutral-expressive deformation vectors is then modeled by a GMM, which is a weighted sum of Q Gaussian functions:


$$P(Z) = \sum_{q=1}^{Q} w_q N(Z; \mu_q; \Sigma_q), \qquad \sum_{q=1}^{Q} w_q = 1 \tag{1}$$


where $N(Z; \mu_q; \Sigma_q)$ is the Gaussian density component, $\mu_q$ and $\Sigma_q$ are the q-th mean vector and q-th covariance matrix, $w_q$ is the mixture weight and Q is the number of Gaussian functions in the GMM. The GMM parameters $(w, \mu, \Sigma)$ (Equation 2) can be obtained by training with the Expectation-Maximization (EM) algorithm.

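$$\mu_q = \begin{bmatrix} \mu_q^X \\ \mu_q^Y \end{bmatrix}, \qquad \Sigma_q = \begin{bmatrix} \Sigma_q^{XX} & \Sigma_q^{XY} \\ \Sigma_q^{YX} & \Sigma_q^{YY} \end{bmatrix}, \qquad q = 1, \ldots, Q. \tag{2}$$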
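The DTW alignment and the construction of joint neutral-expressive vectors described earlier in this section can be sketched as follows. This is a minimal illustration under simplifying assumptions: each frame is a single scalar deformation value rather than a FAP vector, the local distance is the absolute difference, and the names `dtw_path` and `joint_vectors` are hypothetical.

```python
def dtw_path(a, b):
    """Align two scalar sequences with classic DTW; return aligned index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimum accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def joint_vectors(neutral, expressive):
    """Pair each aligned neutral value X_k with its expressive value Y_k."""
    return [(neutral[i], expressive[j]) for i, j in dtw_path(neutral, expressive)]
```

For example, aligning a neutral trajectory `[0, 1, 2]` with a longer expressive trajectory `[0, 1, 1, 2]` repeats the middle neutral frame, so that every expressive frame has a neutral partner in the joint training set.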


After the GMMs are trained with the training data, the optimal estimate of the expressive facial deformation ($Y_{ik}$) given the neutral facial deformation ($X_k$) can be obtained from the conditional expectation (Equation 3):

$$Y_{ik} = E\{Y_{ik} \mid X_k\} = \sum_{q=1}^{Q} p_q(X_k) \left[ \mu_q^Y + \Sigma_q^{YX} (\Sigma_q^{XX})^{-1} (X_k - \mu_q^X) \right] \tag{3}$$




where $p_q(X_k)$ is the probability that the given neutral observation belongs to the q-th mixture component (Equation 4):

$$p_q(X_k) = \frac{w_q N(X_k; \mu_q^X; \Sigma_q^{XX})}{\sum_{p=1}^{Q} w_p N(X_k; \mu_p^X; \Sigma_p^{XX})}. \tag{4}$$
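The conditional expectation of Equations 3 and 4 can be illustrated with a small sketch. To stay self-contained it assumes scalar neutral and expressive features, so that the covariance blocks reduce to numbers and the matrix inverse becomes a division; the dictionary keys (`w`, `mu_x`, `mu_y`, `s_xx`, `s_yx`) and the function names are hypothetical, and the component parameters are assumed to come from an EM fit that is not shown.

```python
import math


def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean; var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predict_expressive(x, components):
    """Estimate the expressive value for a neutral value x (scalar case).

    Each component is a dict with mixture weight 'w', means 'mu_x' and 'mu_y',
    neutral variance 's_xx' and cross-covariance 's_yx'.
    """
    # Numerators of Equation 4: w_q * N(x; mu_q^X; Sigma_q^XX)
    weighted = [c["w"] * gauss(x, c["mu_x"], c["s_xx"]) for c in components]
    total = sum(weighted)
    y = 0.0
    for numerator, c in zip(weighted, components):
        p_q = numerator / total  # posterior of component q given x
        # Per-component linear regression term of Equation 3
        y += p_q * (c["mu_y"] + c["s_yx"] / c["s_xx"] * (x - c["mu_x"]))
    return y
```

With a single component the estimate is an ordinary linear regression of the expressive value on the neutral value; with several components the posteriors blend the per-component regressions smoothly, which matches the continuous notion of expression intensity argued for above.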

4 Lip Movement Synthesized from Speech

The dynamic concatenation model applied in our previous work [8] produced results of natural quality and could be easily realized. To improve the searching efficiency and the automatic performance of the system, we introduce a hierarchical structure and regard every three frames as a unit, instead of a phoneme as in our previous work. Unlike Cosatto’s work [7], in our approach every three frames of the input speech are not directly compared with the existing frames in the corpus, because a frame-based corpus would be too large to search. Therefore, we break the mapping process (see Fig. 4) into two steps: finding approximate candidate codes in the codebook, and finding the approximate sub-sequence by calculating a cost with context information.

Fig. 4. Hierarchical structure of improved dynamic concatenation model



4.1 Phoneme Based Acoustic Codebook

For higher efficiency, a phoneme-based codebook is produced in the training process. The training audio vectors are clustered according to the phoneme information. Every code in the codebook corresponds to a phoneme, so the number of phonemes in the database determines the size of the codebook. There are 48 codes in total, including the SIL (silence) part of sentences. In the synthesis process, the distance between each frame of the input speech and the codebook is calculated. The 3 closest phoneme categories, represented by candidate codes, are listed, so that the mistake of selecting a similar but wrong phoneme by choosing only one code can be avoided. The first-layer cost COST1 can be obtained by Equation 5:

$$COST_1 = \left\{ dist(k);\ k = \arg\min_{n} \left\| a^t - codebook(n) \right\| \right\}. \tag{5}$$
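The first-layer lookup can be sketched as a nearest-neighbour search that keeps the three closest codes. This is only an illustration: the codebook is assumed to be a dictionary from a phoneme label to one representative acoustic vector (for instance the mean MFCC vector of the cluster), and the name `nearest_codes` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_codes(frame, codebook, k=3):
    """Return the k (label, distance) pairs closest to the input frame.

    codebook maps a phoneme label to its representative acoustic vector.
    The returned distances play the role of COST1 for each candidate code.
    """
    scored = sorted((euclidean(frame, code), label) for label, code in codebook.items())
    return [(label, dist) for dist, label in scored[:k]]
```

Keeping three candidates instead of one defers the final phoneme decision to the second layer, where context and visual smoothness can still overrule the closest code.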


4.2 Cost Computation

For each candidate code, there are several phoneme samples. A more detailed audio-visual unit mapping is then employed to synthesize continuous FAP streams. The sub-layer cost computation is based on two cost functions:

$$COST_2 = \alpha C^a + (1 - \alpha) C^v, \tag{6}$$

where $C^a$ is the voice spectrum distance, $C^v$ is the visual distance between two adjacent units, and the weight $\alpha$ balances the effect of the two cost functions.

In the unit selection procedure, we regard every 3 frames of the input speech as a target unit. Selecting candidate units amounts to finding the closest sub-sequence within the range of the candidate phoneme together with its context. The voice spectrum distance (see Fig. 5) accounts not only for the context of the target unit of the input speech, but also for the context of the candidate phoneme. The context of the target unit covers the former m frames and the latter m frames of the current frame. The context of the candidate phoneme is a tri-phone, which covers the former phoneme and the latter phoneme. To measure how close a candidate unit is to the target unit, the target unit with its context is shifted from the beginning to the end of the tri-phone. In this way, all positions of the sub-sequence within the tri-phone are considered. Finally, the sub-sequence with the minimum average distance is output, and thus the candidate unit is selected. The voice spectrum distance is defined by Equation 7:

$$C^a = \sum_{t = curr - p} \; \sum_{m = -6} w_{t,m} \left\| a^{tar}_{t,m} - a^{can}_{t,m} \right\|. \tag{7}$$


The weights $w_{t,m}$ are determined by the method used in [10]: we compute the linear blending weights in terms of the target unit’s duration. $\left\| a^{tar}_{t,m} - a^{can}_{t,m} \right\|$ is the Euclidean distance between the acoustic parameters of the two frame sequences. To reduce the complexity of the Viterbi search, we limit the number of candidate sequences in every round of selection.
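The shifting of the target unit over a candidate tri-phone can be sketched as a sliding-window search. The sketch below is a simplified reading of the procedure: the weights depend only on the position inside the window, and a triangular profile stands in for the linear blending weights of [10], whose exact form is not reproduced here. The names `triangular_weights` and `best_subsequence` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triangular_weights(half_width):
    """Normalized weights that peak at the window centre (assumed profile)."""
    raw = [half_width + 1 - abs(m) for m in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def best_subsequence(target, triphone, weights):
    """Slide the target window over the tri-phone frames.

    target and weights have the same length; triphone is a longer list of
    acoustic frames. Returns (start index, weighted distance) of the best
    match, or (None, inf) if the tri-phone is shorter than the target window.
    """
    n = len(target)
    best_start, best_cost = None, float("inf")
    for start in range(len(triphone) - n + 1):
        window = triphone[start:start + n]
        cost = sum(w * euclidean(t, c) for w, t, c in zip(weights, target, window))
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, best_cost
```

The frames of the selected window that correspond to the target unit itself (its central frames) would then supply the candidate unit and its FAPs.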



Fig. 5. Voice spectrum cost

Not only must the correct speech signal unit be found, but the smoothness of the synthesized face animation must also be considered. The concatenation cost measures how closely the mouth shapes of adjacent units match. The FAPs of the last frame of the former candidate unit are therefore compared with those of the first frame of the current candidate unit (Equation 8):

$$C^v = v(can_{r-1}, can_r) \tag{8}$$


where $v(can_{r-1}, can_r)$ is the Euclidean distance between the adjacent visual features of the two candidate sequences.

Once the two costs COST1 and COST2 are computed, the graph for unit concatenation is constructed. Our approach finally aims to find the best path in this graph, that is, the path with minimum COST (Equation 9); the Viterbi algorithm is a valid search method for this application.

$$COST = \sum_{r=1}^{n/3} COST_1(r) \cdot COST_2(r), \tag{9}$$

where n is the total number of frames in the input speech and r indexes the target units contained in the input speech.
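The search for the minimum-cost path of Equation 9 can be sketched as a small dynamic program over the candidate lattice. The candidate fields (`cost1`, `ca`, `first_fap`, `last_fap`) and the function name are hypothetical, the visual cost of the first unit is assumed to be zero because it has no predecessor, and the pruning of candidates mentioned above is omitted for brevity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def viterbi_select(units, alpha=0.5):
    """Return (total cost, chosen candidate index per target unit).

    units[r] is the list of candidates for target unit r. Each candidate is a
    dict with first-layer cost 'cost1', voice spectrum distance 'ca', and the
    FAP vectors of its first and last frame, 'first_fap' and 'last_fap'.
    """
    # First unit: no predecessor, so C^v is taken as 0 (assumption).
    prev = [(c["cost1"] * alpha * c["ca"], [j]) for j, c in enumerate(units[0])]
    for r in range(1, len(units)):
        curr = []
        for j, cand in enumerate(units[r]):
            options = []
            for i, before in enumerate(units[r - 1]):
                cv = euclidean(before["last_fap"], cand["first_fap"])  # Equation 8
                cost2 = alpha * cand["ca"] + (1 - alpha) * cv          # Equation 6
                options.append((prev[i][0] + cand["cost1"] * cost2, prev[i][1] + [j]))
            # Keep only the cheapest way of reaching candidate j.
            curr.append(min(options, key=lambda o: o[0]))
        prev = curr
    return min(prev, key=lambda o: o[0])
```

Because the visual cost couples each candidate to its predecessor, the cheapest candidate of a single unit is not necessarily on the best path; the dynamic program accounts for this by carrying the best accumulated cost for every candidate.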

5 Experimental Result

When speech is input, the system calculates the MFCC coefficients and the prosody features. The former are used to drive the content-related face movement, and the latter are used to choose the appropriate GMM. Once the neutral FAP stream is synthesized by the improved concatenation model, the chosen GMM predicts the expressive FAP stream from the neutral FAP stream. Fig. 6 shows examples of the synthesized FAP streams, and Fig. 7 shows the synthesized expressive facial deformation of 3 points (related to 3 FAPs) compared with the recorded deformation.

In Fig. 6(a) and Fig. 6(b), we compare the synthesized FAP 52 stream with the recorded FAP stream from the validation set and the test set respectively. In Fig. 6(a), the two curves (the smoothed synthesized sequence and the recorded sequence) are very close, because for validation speech the system is more likely to find the complete sequence of the same sentence in the corpus. In Fig. 6(b), the input speech differs considerably from the speech used in the training process. Although the two curves are not as close, their slopes are similar in most cases. To reduce the magnitude of discontinuities, the final synthesized result is smoothed by curve fitting.



Fig. 6. Selected synthesized FAP streams from (a) validation sentence and (b) test sentence

In Fig. 7, the synthesized expressive trajectories follow the trend of the recorded ones well. Note that the ends of the synthesized trajectories do not estimate the facial deformation well. This is mainly because the longer part of the expressive training data is a tone extension, so the features at the end of the two aligned sequences do not strictly correspond.




Fig. 7. Synthesized expressive trajectories of (a) p4, (b) p18 and (c) p19 under different emotional states

For quantitative evaluation, the correlation coefficient (Equation 10) is used to represent the similarity between the recorded FAP stream and the synthesized stream. The closer the coefficient is to 1, the better the result follows the trends in the original values.

$$CC = \frac{1}{T} \sum_{t=1}^{T} \frac{(f(t) - \mu)(\hat{f}(t) - \hat{\mu})}{\sigma \hat{\sigma}} \tag{10}$$

where $f(t)$ is the recorded FAP stream, $\hat{f}(t)$ is the synthesized stream, T is the total number of frames in the database, and $\mu$, $\hat{\mu}$ and $\sigma$, $\hat{\sigma}$ are the corresponding means and standard deviations.

From Table 2, we can see that the estimates for p4 are generally better than those for the other points, which lie on the lips. Eyebrow movements are observed to be mainly affected by expressions, while lip movements are not only influenced by the speech content but also extended by emotional exhibition. The trajectories of p18 and p19 are determined by both content and expression, so they have smaller correlation coefficients.

Table 2. The average correlation coefficients for different expressions

Correlation Coefficients   Improved dynamic concatenation model (Neutral)   GMM for Surprise   GMM for Happy







0.737 0.729

0.656 0.725

0.688 0.701
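The correlation coefficient of Equation 10 can be computed as in the following sketch. It assumes two equal-length, non-constant FAP trajectories and uses the population standard deviation (division by T), which keeps the result consistent with the 1/T factor of the equation; the function name is hypothetical.

```python
import math


def correlation_coefficient(recorded, synthesized):
    """Correlation coefficient between a recorded and a synthesized FAP stream."""
    t = len(recorded)
    mu_r = sum(recorded) / t
    mu_s = sum(synthesized) / t
    # Population standard deviations (divide by T, matching Equation 10)
    sd_r = math.sqrt(sum((v - mu_r) ** 2 for v in recorded) / t)
    sd_s = math.sqrt(sum((v - mu_s) ** 2 for v in synthesized) / t)
    covariance_sum = sum((a - mu_r) * (b - mu_s) for a, b in zip(recorded, synthesized))
    return covariance_sum / (t * sd_r * sd_s)
```

Since the measure is invariant to scale and offset, a synthesized trajectory that follows the recorded trend with a different amplitude still scores close to 1, which is why it is read here as a measure of trend similarity rather than of absolute error.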

Finally, an MPEG-4 facial animation engine is used to assess our synthesized FAP streams qualitatively. The animation model is displayed at a frame rate of 30 fps. Fig. 8 shows some frames of the synthesized talking head compared with recorded images.

Fig. 8. Some frames of synthesized talking head

We have tried different voices on our system. The synthesized results match these voices well, and the face animation appears very natural. Although a male Chinese voice is used in the training process, our system can adapt to other languages as well.

6 Conclusion and Future Work

In the present work, we analyze the correlation between neutral facial deformation and expressive facial deformation and use GMMs to model the joint probability distribution. In addition, we make some improvements to the dynamic mapping method of our previous work. By employing a hierarchical structure, the mapping process is broken into two steps. The first-layer acoustic codebook gives a general classification of the input speech frames. The sub-layer frame-based cost computation ensures that the closest candidate sequence is selected. Through this method, the unit searching speed is greatly increased, manual operation in the animation process is avoided, and the synthesized result remains natural and realistic. However, our work is presently limited to some typical emotions. Capturing the affective information expressed by a speaker in a natural mental state is still a challenging task. In the future, more work will be done to investigate how dynamic expressive movement is related to prosody features. The FAP sequences also need to be aligned with a better strategy, and more appropriate parameters need to be found to smooth the synthesized trajectories.

References

1. Massaro, D.W., Beskow, J., Cohen, M.M., Fry, C.L., Rodriguez, T.: Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks. In: Proceedings of AVSP’99, Santa Cruz, CA, August, pp. 133–138 (1999)
2. Ezzat, T., Poggio, T.: MikeTalk: A Talking Facial Display Based on Morphing Visemes. In: Proc. Computer Animation Conference, Philadelphia, USA (1998)
3. Yamamoto, E., Nakamura, S., Shikano, K.: Lip movement synthesis from speech based on Hidden Markov Models. Speech Communication 26, 105–115 (1998)
4. Hong, P., Wen, Z., Huang, T.S.: Real-time speech-driven face animation with expressions using neural networks. IEEE Trans. on Neural Networks 13(4) (2002)
5. Brand, M.: Voice Puppetry. In: Proc. of SIGGRAPH’99, pp. 21–28 (1999)
6. Bregler, C., Covell, M., Slaney, M.: Video Rewrite: Driving Visual Speech with Audio. ACM SIGGRAPH (1997)
7. Cosatto, E., Potamianos, G., Graf, H.P.: Audio-visual unit selection for the synthesis of photo-realistic talking-heads. In: IEEE International Conference on Multimedia and Expo, ICME, vol. 2, pp. 619–622 (2000)
8. Yin, P., Tao, J.: Dynamic mapping method based speech driven face animation system. In: The First International Conference on Affective Computing and Intelligent Interaction (2005)
9. Tekalp, A.M., Ostermann, J.: Face and 2-D mesh animation in MPEG-4. Signal Processing: Image Communication 15, 387–421 (2000)
10. Wang, J.-Q., Wong, K.-H., Pheng, P.-A., Meng, H.M., Wong, T.-T.: A real-time Cantonese text-to-audiovisual speech synthesizer. In: Proc. Acoustics, Speech, and Signal Processing (ICASSP ’04), vol. 1, pp. 653–656 (2004)
11. Arun, K.S., Huang, T.S., Blostein, S.D.: Least-square fitting of two 3-D point sets. IEEE Trans. Pattern Analysis and Machine Intelligence 9(5), 698–700 (1987)
12. Verma, A., Subramaniam, L.V., Rajput, N., Neti, C., Faruquie, T.A.: Animating Expressive Faces Across Languages. IEEE Trans. on Multimedia 6(6) (2004)
13. Tao, J., Tan, T.: Emotional Chinese Talking Head System. In: Proc. of ACM 6th International Conference on Multimodal Interfaces (ICMI 2004), State College, PA (October 2004)
14. Gutierrez-Osuna, R., Kakumanu, P.K., Esposito, A., Garcia, O.N., Bojorquez, A., Castillo, J.L., Rudomin, I.: Speech-Driven Facial Animation with Realistic Dynamics. IEEE Trans. on Multimedia 7(1) (2005)
15. Huang, Y., Lin, S., Ding, X., Guo, B., Shum, H.-Y.: Real-time Lip Synchronization Based on Hidden Markov Models. ACCV (2002)
16. Li, Y., Yu, F., Xu, Y.-Q., Chang, E., Shum, H.-Y.: Speech-Driven Cartoon Animation with Emotions. In: Proceedings of the Ninth ACM International Conference on Multimedia (2001)
17. Rao, R., Chen, T.: Audio-to-Visual Conversion for Multimedia Communication. IEEE Transactions on Industrial Electronics 45(1), 15–22 (1998)