2006 IEEE Conference on Systems, Man, and Cybernetics October 8-11, 2006, Taipei, Taiwan
Lip Assistant: Visualize Speech for Hearing Impaired People in Multimedia Services

Lei Xie, Yi Wang and Zhi-Qiang Liu

Abstract— This paper presents a very low bit rate speech-to-video synthesizer, named lip assistant, to help hearing impaired people better access multimedia services via lipreading. Lip assistant automatically converts acoustic speech to lip parameters at a bit rate of 2.2kbps and decodes them to video-realistic mouth animation on the fly. We use multi-stream HMMs (MSHMMs) and principal component analysis (PCA) to model the audio-visual speech and the visual articulations, which are learned from AV facial recordings. Speech is converted to lip parameters with natural dynamics by an expectation maximization (EM)-based audio-to-lip converter. The video synthesizer generates video-realistic mouth animations from the encoded lip parameters via PCA expansion. Finally, the mouth animation is superimposed on the original video as an assistant that helps hearing impaired viewers better understand the audio-visual contents. Experimental results show that lip assistant can significantly improve the speech intelligibility of both machines and humans.
I. INTRODUCTION

With the rapid advances in microprocessors and networks, we can be entertained almost everywhere by various media gadgets such as PCs, PDAs, and mobile phones. Wide-band wired and wireless networks enable us to access a huge volume of audio-visual content, including news, movies, and other entertainment. For hearing impaired people, the loss of audio perception is the major obstacle that hinders them from fully enjoying pervasive media services. It is estimated that about 26.1 million people in the United States have hearing problems [1]. This figure is much larger in China. As China approaches an aging society, the portion of the population with hearing problems is expected to increase dramatically. How to entertain such a large population with modern media technologies is an urgent topic to explore.

As we know, lipreading plays a significant role in verbal communication. Besides body language, hearing impaired people make extensive use of visual speech cues in speech perception, and expert lip-readers can understand fluent conversational speech. Even normal listeners can benefit from lipreading to improve the intelligibility of speech in noisy environments. Sumby and Pollack [2] suggest that seeing the talker's face is equivalent to about a 15dB increase in the signal-to-noise ratio (SNR) of the acoustic signal.

Lei Xie and Zhi-Qiang Liu are with the School of Creative Media, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong SAR, China {xielei, zq.liu}@cityu.edu.hk
Yi Wang is with the Department of Computer Science and Technology, Tsinghua University, Beijing, China [email protected]
Motivated by the above facts, researchers have already started to help hearing impaired people by filling in the perception gap between the hearing impaired and normal listeners. Among these efforts, speech-driven facial animation systems have proven to be a great assistance in improving speech understanding. The videophone designed for deaf people is a good example of this kind of application. However, how to entertain hearing impaired people with current media technology is still an open and wide area to explore.

This paper presents a very low bit rate speech-to-video synthesizer, named lip assistant, to help hearing impaired people better enjoy audio-visual contents. As we know, current audio-visual contents (e.g. TV programs and movies) are usually subtitled for better understanding. However, in current pervasive media services, transcribing a huge volume of AV contents and adding subtitles are both time-consuming and labor-intensive. Automatic transcription can be realized by speech recognition techniques, but transcription errors cannot be avoided, especially for unconstrained speech contents [3]. These errors can mislead the viewers. Therefore, motivated by lipreading, lip assistant is designed to automatically convert the speech tracks in the AV contents to lip parameters (i.e. visual parameters) at a very low bit rate, and to synthesize video-realistic lip animation from these lip parameters to provide abundant lipreading information. The mouth animation is further superimposed onto the original video as an assistant that helps hearing impaired viewers better understand the audio-visual contents.

The rest of the paper is organized as follows. Section II describes the main phases of lip assistant in detail. Section III briefly introduces the MLLR-based speaker adaptation technique. In Section IV, objective and subjective evaluations are carried out to validate lip assistant. Finally, conclusions and future work are given in Section V.

II. LIP ASSISTANT

A. Framework Overview

Fig. 1 shows the block diagram of lip assistant, which is composed of the following parts: AV recording, AV signal processing, AV model training, the lip encoder, and the video synthesizer (lip decoder). Prior to modelling, a phonetically balanced, high-quality audio-visual database is needed to record the lip articulation process. The AV signal processing block is in charge of extracting low-dimension, informative, time-synchronized audio and visual features from the audio-visual database. Given the joint audio-visual features, the AV model training block builds up correspondences between audio and video, as well as their signal distributions.
Fig. 1. Diagram of Lip Assistant.
The lip encoder converts a new sound track to lip parameters at a very low bit rate. Finally, the video synthesizer generates the mouth video and overlays it on the original video.
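To make the dataflow concrete, the following is a minimal, hypothetical NumPy sketch of the online path (lip encoder followed by the video synthesizer/decoder). The array shapes, the smoothing window, and the injected audio_to_lip callable are illustrative stand-ins for the blocks in Fig. 1, not the actual implementation.

```python
import numpy as np

# Hypothetical sizes taken from the paper: 39-dim MFCCs at 100 Hz on the
# encoder side, 90 PCA lip parameters per frame, 25 fps on the decoder side.
N_VISUAL, FPS_AUDIO, FPS_VIDEO = 90, 100, 25

def lip_encoder(mfcc, audio_to_lip, smooth=5):
    """Sound track -> low bit rate lip parameters (sketch of the encoder path)."""
    visual = audio_to_lip(mfcc)                      # (T, 90) ML visual parameters
    kernel = np.ones(smooth) / smooth                # moving-average filter removes jitter
    visual = np.apply_along_axis(np.convolve, 0, visual, kernel, mode="same")
    return visual[:: FPS_AUDIO // FPS_VIDEO]         # down-sample 100 Hz -> 25 fps

def video_synthesizer(lip_params, eigen_mouths, mean_mouth):
    """Lip parameters -> mouth images via PCA expansion (lip decoder)."""
    return lip_params @ eigen_mouths.T + mean_mouth  # (T', n_pixels) mouth frames

# Toy end-to-end run with random stand-ins for the trained components.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((400, 39))                # 4 s of acoustic features
dummy_converter = lambda m: rng.standard_normal((len(m), N_VISUAL))
params = lip_encoder(mfcc, dummy_converter)
frames = video_synthesizer(params,
                           rng.standard_normal((64 * 64, N_VISUAL)),
                           rng.standard_normal(64 * 64))
print(params.shape, frames.shape)                    # (100, 90) (100, 4096)
```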
B. AV Recording

Prior to modelling the audio-visual articulation process, we recorded an audio-visual database with balanced English phonetics. A female subject read sentences from the TIMIT corpus. The training set was composed of 452 sentences (2 SA + 450 SX). A set of 50 SI sentences was used as a testing set, and another 22 SI sentences were selected as a small held-out set to tune parameters. In the database, the speaker's head-and-shoulder front view against a white background was recorded by a digital video camera in an AV studio, at an acoustic signal-to-noise ratio (SNR) of roughly 30dB. A big-screen TV served as a teleprompter. The audio-visual data are about 2500 seconds in duration. Fig. 2 shows some snapshots from the recording scene and the recorded video. We named this AV recording set the JEWEL AV database.
Fig. 2. Snapshots from the AV recording scene (a), the teleprompter (b) and the recorded video (c) and (d).
C. AV Signal Processing

Audio waveforms and video frames were extracted from the JEWEL database. Audio was acquired at a rate of 16kHz, and video was saved as image sequences at 25 frames per second.
Fig. 3. Some extracted mouth images.

Fig. 4. The two-stream MSHMM.
The speech signal was processed in frames of 25ms with a 15ms overlap (frame rate = 100Hz). The speech frames were pre-emphasized with an FIR filter (H(z) = 1 − az^{-1}, a = 0.97) and weighted by a Hamming window to avoid spectral distortions. After pre-processing, we extracted Mel Frequency Cepstral Coefficients (MFCCs) as the acoustic features. Each acoustic feature vector consisted of 12 MFCCs, the energy term, and their corresponding velocity (∆) and acceleration (∆²) derivatives, giving a dimensionality of 39 per frame.

Extensive research shows that although the entire facial expression can help comprehension, most information pertaining to visual speech stems from the mouth region [4]. Considering this and the system complexity, our animation system focuses on the mouth region only. The mouth region-of-interest (ROI) was tracked by a method described in [5] and further aligned [6]. Fig. 3 shows some extracted mouth images. Principal component analysis (PCA) was used to construct an orthonormal basis for modelling the lip appearance. We selected 1500 typical mouth ROI images from the JEWEL database and computed the eigen mouths for the red, green, and blue channels respectively. A mouth image can then be represented by a linear combination of these eigen mouths, and the combination weights were used as visual features. In total, a set of 90 visual features (30 for each channel) was collected for each image frame. The visual features were further interpolated to 100 frames/sec to match the audio rate.

D. AV Model Training

1) Multi-Stream HMMs: We used multi-stream HMMs (MSHMMs) [7] to model the audio-visual speech articulation process since they are suitable for modelling coupled multiple data streams. MSHMMs were first proposed for multiband automatic speech recognition and have recently been used successfully in audio-visual speech recognition (AVSR). In its general form, the class-conditional observation likelihood of an MSHMM is the product of the observation likelihoods of its single-stream components, where stream exponents are used to capture the reliability of each modality (or the confidence of each single-stream classifier). Fig. 4 shows the two-stream HMM for the audio-visual domain.

Given the bimodal observation o_t^{av} = [o_t^a, o_t^v] at frame t, the state emission likelihood of the MSHMM is

P(o_t^{av} \mid c) = \prod_{s \in \{a,v\}} \left[ \sum_{k=1}^{K_{sc}} \omega_{sck} \, N_s(o_t^s; \mu_{sck}, u_{sck}) \right]^{\lambda_{sct}},    (1)

where λ_{sct} denotes the stream exponents, which are non-negative and a function of the modality s, the HMM state c, and the frame t. The state-dependence is to model the local, temporal reliability of each stream. N_s(o_t^s; μ_{sck}, u_{sck}) is the Gaussian component for state c, stream s, and mixture component k with mean μ_{sck} and covariance u_{sck}.
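As a concrete illustration of Eq. (1), the sketch below evaluates the stream-weighted emission likelihood of one state in the log domain using SciPy Gaussians; the mixture parameters and stream exponents are random stand-ins, not values from the trained models.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mshmm_log_emission(o_a, o_v, streams, exponents):
    """Log of Eq. (1): stream-weighted product of per-stream GMM likelihoods.

    streams: dict mapping 's' in {'a', 'v'} to (weights, means, covs) of the
    GMM attached to one HMM state; exponents: dict of stream exponents lambda_s.
    """
    obs = {"a": o_a, "v": o_v}
    log_p = 0.0
    for s, (w, mu, cov) in streams.items():
        # log-sum-exp over the K mixture components of stream s
        comp = [np.log(w[k]) + multivariate_normal.logpdf(obs[s], mu[k], cov[k])
                for k in range(len(w))]
        log_p += exponents[s] * np.logaddexp.reduce(comp)
    return log_p

# Toy state with 2 mixtures per stream: 39-dim audio, 90-dim visual features.
rng = np.random.default_rng(1)
make_gmm = lambda d: (np.array([0.6, 0.4]),
                      rng.standard_normal((2, d)),
                      np.stack([np.eye(d)] * 2))
state = {"a": make_gmm(39), "v": make_gmm(90)}
lam = {"a": 0.7, "v": 0.3}
print(mshmm_log_emission(rng.standard_normal(39), rng.standard_normal(90), state, lam))
```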
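For the acoustic front end of Section II-C, the following sketch shows the pre-emphasis step in NumPy and then delegates the MFCC and derivative computation to librosa purely for illustration; the paper does not specify the toolkit used, and librosa's c0 coefficient serves here only as a stand-in for the energy term.

```python
import numpy as np
import librosa

SR, FRAME, SHIFT = 16000, 400, 160          # 16 kHz, 25 ms frames, 10 ms shift (100 Hz)

def acoustic_features(wav):
    """39-dim vectors per frame: 13 cepstra plus delta and delta-delta."""
    emphasized = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])   # H(z) = 1 - 0.97 z^-1
    # 13 cepstra (c0 as an energy-like term) on Hamming-windowed frames.
    mfcc = librosa.feature.mfcc(y=emphasized, sr=SR, n_mfcc=13,
                                n_fft=FRAME, hop_length=SHIFT,
                                window="hamming")
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T                   # (num_frames, 39)

feats = acoustic_features(np.random.default_rng(2).standard_normal(SR))  # 1 s of audio
print(feats.shape)                          # roughly (100, 39)
```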
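The eigen-mouth representation of Section II-C, which also provides the PCA expansion used later by the video synthesizer in Section II-F, can be sketched with scikit-learn as follows. The ROI size and training images are toy stand-ins; each colour channel would be handled separately, as described above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
H, W, N_COMP = 32, 48, 30                      # toy ROI size; 30 eigen mouths per channel

# Stand-in for 1500 mouth ROI images of one colour channel, flattened to vectors.
rois = rng.random((1500, H * W))

pca = PCA(n_components=N_COMP).fit(rois)       # eigen mouths = pca.components_

def encode(roi):
    """Mouth ROI -> 30 combination weights (visual features for one channel)."""
    return pca.transform(roi.reshape(1, -1))[0]

def decode(weights):
    """PCA expansion: weights -> reconstructed mouth ROI (one channel)."""
    return pca.inverse_transform(weights.reshape(1, -1))[0].reshape(H, W)

w = encode(rois[0])
mouth = decode(w)
print(w.shape, mouth.shape)                    # (30,) (32, 48)
# Doing this for the R, G, and B channels gives the 90-dim visual feature vector.
```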
2) Training Strategy: Since lip assistant aims to drive mouth animation with speaker-independent continuous/conversational speech from the sound tracks of AV contents, we need a large acoustic speech database to obtain good distribution estimates of the acoustic signal. We used the JEWEL AV database together with the TIMIT acoustic corpus to train 47 3-state, left-to-right phonemic MSHMMs.

Firstly, we used the time-synchronized joint audio-visual features extracted from the JEWEL training set to train the MSHMMs. This step learns the correspondences between the acoustic speech and the lip articulation while keeping the time-synchronization between the two modalities. We used the expectation maximization (EM) algorithm to estimate the model parameters. Secondly, we used the acoustic features extracted from the TIMIT training set (4620 utterances) as well as the JEWEL training set to train the audio observation distributions extensively, while keeping the visual observation probabilities and the transition probabilities intact. That is, single-stream audio-HMMs and visual-HMMs were separated from the MSHMMs by setting their transition probability vectors equal to those of the MSHMMs. After training on the large acoustic training set, the audio- and visual-HMMs were re-merged, combining the newly trained audio observation distributions with the former visual observation distributions and the transition probabilities of the MSHMMs.

E. Lip Encoder

The upper-right part of Fig. 1 shows the block diagram of the lip encoder, which consists of a hierarchical structure and an audio-to-lip converter.

1) Hierarchical Structure: At the very beginning, an audio signal processing unit enhanced the acoustic signal to remove or alleviate any noise contamination or microphone mismatch, and then MFCC features were extracted using the same process described in Section II-C. A conventional audio-only speech recognizer was used to obtain word-level transcriptions of the input audio, followed by a forced alignment via the Viterbi algorithm, resulting in sub-phonemic level (HMM state) transcriptions.
The sub-phonemic transcriptions were fed into an audio-to-lip converter, which produced optimal visual parameters (i.e. PCA parameters) given the input audio and the well-trained MSHMMs. After smoothing with a moving average filter to remove jitters, the visual parameters were down-sampled to 25 frames/sec and encoded as lip parameters at a rate of 2.2kbps (90×25). Since this rate is comparatively low, it is well suited for network transmission and storage.

2) Audio-to-Lip Converter: We did not use the phoneme-to-visual-parameter scheme, which generates animations by splining or morphing between key image frames of sub-phonemes (HMM states), since this kind of approach ignores important speech dynamics such as prosody, resulting in unnatural performance. Moreover, from the statistical point of view, the Viterbi state sequence has an obvious defect: it describes only a small fraction of the total probability mass, and many other slightly different sequences may have very similar probabilities. The Viterbi algorithm itself is also not robust to acoustic contamination such as additive noise. If the input acoustic signal is contaminated, the Viterbi state sequence will lead to wrong mouth shapes.

Therefore, we used a visual-parameter-from-audio scheme which directly estimates the visual parameters framewise under the Maximum Likelihood (ML) criterion, preserving the important speech dynamics. Given the input audio data O^a and the trained MSHMMs λ, we seek the missing visual observations (i.e. parameters) \hat{O}^v by maximizing the likelihood of the visual observations. According to the EM solution of ML, we maximize an auxiliary function:

\hat{O}^v = \arg\max_{O^{v'} \in \mathcal{O}^v} Q(\lambda, \lambda; O^a, O^v, O^{v'}),    (2)

where O^v and O^{v'} denote the old and new visual observation sequences in the visual observation space \mathcal{O}^v, respectively. By taking the derivative of Q(\lambda, \lambda; O^a, O^v, O^{v'}) with respect to o_t^{v'} and setting it to zero, we get [9]

\hat{o}_t^v = \frac{\sum_{q_t} \sum_k \gamma_t(q_t,k)\, \omega_{q_t v k}\, u_{q_t v k}^{-1}\, \mu_{q_t v k}}{\sum_{q_t} \sum_k \gamma_t(q_t,k)\, \omega_{q_t v k}\, u_{q_t v k}^{-1}},    (3)

where q_t is the possible state at frame t, and the occupation probabilities γ_t(q_t, k) can be computed using the forward-backward algorithm described in the E-step of EM.

Although Eq. (3) is able to estimate the visual parameters using all possible states at each t, the computation is time-consuming. Therefore, we used a sub-optimal approach, in which only the N best states with the N highest likelihoods were involved in computing Eq. (3). The N-best list was collected from the Viterbi alignment process. This is a trade-off between robustness and computational complexity.

We know that the automatic transcription approach cannot avoid recognition errors, which can heavily mislead the speech understanding process. Lip assistant, however, provides a soft way to offer speech perception information. Although the recognizer in the lip encoder might induce word-level transcription errors, the resultant transcriptions are acoustically similar to the ground truth.
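A minimal sketch of the N-best approximation to Eq. (3) for a single frame is given below, assuming the occupation probabilities, visual-stream mixture weights, and (diagonal) Gaussian parameters of the N-best states have already been obtained from the forward-backward pass; all inputs here are random stand-ins.

```python
import numpy as np

def estimate_visual_frame(gamma, weights, inv_var, means):
    """Eq. (3) for one frame, restricted to the N-best states.

    gamma:   (N, K)    occupation probabilities gamma_t(q, k) of the N-best states
    weights: (N, K)    visual-stream mixture weights omega_{qvk}
    inv_var: (N, K, D) inverse diagonal covariances u_{qvk}^{-1}
    means:   (N, K, D) visual-stream means mu_{qvk}
    returns: (D,)      ML estimate of the visual parameters o_t^v
    """
    resp = (gamma * weights)[..., None]                 # (N, K, 1) per-component weight
    numerator = np.sum(resp * inv_var * means, axis=(0, 1))
    denominator = np.sum(resp * inv_var, axis=(0, 1))
    return numerator / denominator

# Toy example: N = 3 best states, K = 2 mixtures, D = 90 visual parameters.
rng = np.random.default_rng(4)
N, K, D = 3, 2, 90
gamma = rng.random((N, K)); gamma /= gamma.sum()
o_v = estimate_visual_frame(gamma, rng.random((N, K)),
                            1.0 / rng.random((N, K, D)),
                            rng.standard_normal((N, K, D)))
print(o_v.shape)                                        # (90,)
```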
Fig. 5. An example of media services provided by lip assistant.
Subsequently, after the ML-based audio-to-lip conversion, the corresponding lip shapes will be visually similar to the true lip shapes, since we have already built up the correspondences between audio and video, and the N-best approach captures a large portion of the total probability mass, which makes the conversion process more robust.

F. Video Synthesizer

Whether transmitted or stored, the low-rate lip parameters can be decoded to mouth images using the PCA expansion process. Finally, the mouth image sequence is rendered and displayed with the original AV. Fig. 5 shows a potential multimedia application provided by lip assistant, where assistant lip animations are generated on the client side for a better understanding of the media contents. At the very beginning, the client downloads the player/synthesizer as well as the eigen mouth set to the media device, e.g., a PDA (a). After ordering (b), the AV material is sent to the client (c). The client converts the sound track to lip parameters and re-assembles the mouth images via the synthesizer (d). Finally, the mouth sequence is played on the screen with the original AV (e).

III. MLLR-BASED SPEAKER ADAPTATION

Since the audio distributions of the MSHMMs in Section II-D are trained using acoustic data collected from various speakers, reasonable lip movements can be generated through these general audio distributions. However, it is possible to build improved acoustic models by tailoring the model set to a specific speaker, leading to more accurate and natural mouth animations. Maximum likelihood linear regression (MLLR) [8] is a popular speaker adaptation technique, which uses only a small amount of training data from a new speaker to obtain a model set that better fits the characteristics of the new speaker. We performed offline, supervised speaker adaptation on the audio-HMMs using the MLLR technique. The adaptation of the audio mean vector is achieved by applying a transformation matrix W_{ack} to the extended mean vector ξ_{ack} to obtain an adapted mean vector \hat{μ}_{ack}:

\hat{\mu}_{ack} = W_{ack} \cdot \xi_{ack},    (4)

where W_{ack} is an n × (n + 1) matrix which maximizes the likelihood of the adaptation data for speech class c and mixture k, and ξ_{ack} is defined as

\xi_{ack} = [\omega, \mu_{ack}(1), \cdots, \mu_{ack}(n)],    (5)

where ω is the offset term of the regression, and μ_{ack}(i) is the ith coefficient of μ_{ack}. The estimation of the transformation matrix W_{ack} is detailed in [8].
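As an illustration of Eqs. (4) and (5), applying an already estimated MLLR transform to an audio mean vector is a single matrix-vector product; the transform below is a random stand-in, since its ML estimation (detailed in [8]) is outside the scope of this sketch.

```python
import numpy as np

def mllr_adapt_mean(W, mu, offset=1.0):
    """Eqs. (4)-(5): adapted mean = W . [omega, mu(1), ..., mu(n)].

    W:      (n, n + 1) regression matrix for one speech class / mixture
    mu:     (n,)       original audio mean vector
    offset: scalar     offset term omega of the regression
    """
    xi = np.concatenate(([offset], mu))   # extended mean vector, Eq. (5)
    return W @ xi                         # adapted mean, Eq. (4)

n = 39                                    # dimensionality of the audio features
rng = np.random.default_rng(5)
W = rng.standard_normal((n, n + 1))
mu = rng.standard_normal(n)
print(mllr_adapt_mean(W, mu).shape)       # (39,)
```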
IV. EVALUATIONS

To validate the proposed speech-to-video synthesizer, we carried out both objective and subjective evaluations in terms of the lipreading information contained in the estimated visual speech.

A. Objective Evaluations via Machine Lipreading

We performed objective evaluations using audio-visual speech recognition (AVSR) experiments. This kind of machine lipreading test evaluates the quality of the visual parameters (i.e. lip parameters) in terms of the improvement in speech recognition accuracy of an AVSR system over an audio-only ASR system. It provides a perceptual way of quantifying the amount of speech understanding information contained in the synthesized visual speech. We carried out speaker-dependent AVSR experiments using the JEWEL testing set and collected the word accuracy rate (WAR) for three systems: 1) AO: audio MFCC only; 2) AV-Syn: audio MFCC + visual parameters synthesized from clean speech; 3) AV-Ori: audio MFCC + original visual parameters from clean speech. Speech babble noise was added to the 72 utterances (22 in the held-out set and 50 in the testing set) at various SNRs. We again adopted MSHMMs as the audio-visual fusion scheme, where the stream exponents were selected a priori by maximizing the WAR on the held-out set. In the two AVSR systems, the 39 MFCC audio features (Section II-C) and the 90 visual parameters (synthesized or ground truth) were combined to train 47 3-state, left-to-right phoneme MSHMMs. The AO system trained 47 3-state, left-to-right phoneme HMMs. We performed experiments under mismatched training-testing conditions. An iterative mixture splitting scheme was used to obtain the optimal number of Gaussian mixtures.

Fig. 6 summarizes the evaluation results for 5 SNR conditions. From Fig. 6 we can clearly see that the AO system is heavily affected by additive noise. A 5dB SNR degradation (30dB to 25dB) results in a 34.3% absolute accuracy decrease. When the SNR is decreased to 10dB, the WAR falls to 13.2%. Insertion errors contribute substantially to this accuracy decrease. Not surprisingly, the two AVSR systems significantly curb the drop in WAR under noisy conditions. The AV-Syn system, which uses the synthesized visual parameters, achieves performance comparable to the AV-Ori system. The results show that the visual speech synthesized by lip assistant contains useful lipreading information which can effectively increase the accuracy of machine speech perception under noisy conditions.
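For reference, the word accuracy rate used here and in Section IV-B is the usual WAR = (N − D − S − I)/N; the sketch below computes it via a Levenshtein alignment of word lists and is our own illustration, not the scoring tool actually used in the experiments.

```python
def word_accuracy(ref, hyp):
    """WAR = (N - D - S - I) / N via dynamic-programming alignment of word lists."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum (sub + del + ins) to align ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i                       # i deletions
    for j in range(m + 1):
        cost[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    return 1.0 - cost[n][m] / n              # errors counted as D + S + I

ref = "the lip assistant helps hearing impaired viewers".split()
hyp = "a lip assistant helps the hearing impaired viewers".split()
print(round(word_accuracy(ref, hyp), 3))     # 0.714 for this toy pair
```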
Fig. 6. Objective evaluation results: WAR (%) versus audio SNR (dB) for the AO, AV-Syn, and AV-Ori systems.
B. Subjective Evaluations via Human Lipreading

Although hearing impaired people are the ideal subjects for measuring the perceptual quality of the synthesized visual speech, it is quite difficult to obtain an impartial evaluation because we would have to find individuals with equal levels of lipreading proficiency [1]. Therefore, we used a substitution evaluation approach, in which a group of 8 subjects with normal hearing was recruited. Acoustic speech was degraded by babble noise at various low SNR levels. Evaluations were conducted for four different configurations: 1) AO: audio only; 2) AV-Syn: audio + lip animations synthesized from clean speech; 3) AV-Ori: audio + original lip sequences; 4) AV-Syn-OriV: audio + lip animations synthesized from clean speech + original video.

In the evaluations, we used 270 sentences from the JEWEL database for speaker dependent (SD) tests, and an extra AV set for speaker independent (SI) tests. The AV set contains 270 AV snippets collected from the Internet, covering talks from 73 speakers. The SD test considered configurations 1)-3) listed above, and the SI test considered configurations 1), 2) and 4). Configuration 4) is intended to measure human speech perception when all the useful information is available, as when people enjoy multimedia contents. A set of 30 separate sentences was used for each test session (a given SNR and a given configuration) to prevent subjects from using prior knowledge of repeated sentences to assist perception. No sentence was repeated between sessions, while a similar number of words was kept in each session. Prior to the testing, the participants were given instructions on how to respond during the subjective evaluations. Fig. 7 shows a screenshot of the subjective evaluation program.

The WARs for the SD and SI tests, averaged over the 8 subjects, are summarized in Fig. 8 and Fig. 9 respectively. From the results, we observe that severe additive noise impairs human listeners' ability to transcribe audio speech. For example, when the SNR is decreased to -10dB, the WAR is 72.9% for the SD test and 74.8% for the SI test.
Fig. 7. A screenshot from the subjective evaluation program.
Fig. 8. Subjective evaluation results for SD: WAR (%) versus audio SNR (dB) for the AO, AV-Syn, and AV-Ori configurations.

Fig. 10. A synthesized mouth sequence by the lip assistant.
Similar to the objective evaluation results, the addition of the synthesized visual speech significantly improves human speech perception, especially at the SNR of -10dB. With the help of the synthesized mouth animation, the absolute WAR increase is 12% at -10dB SNR. Interestingly, the addition of the original video can further improve the perception performance.
Fig. 9. Subjective evaluation results for SI: WAR (%) versus audio SNR (dB) for the AO, AV-Syn, and AV-Syn-OriV configurations.
V. CONCLUSIONS

We have proposed a very low bit rate speech-to-video synthesizer, named lip assistant, to help hearing impaired people better enjoy multimedia services via lipreading. Different from the conventional approach of subtitling AV contents, lip assistant provides video-realistic animated lips that allow hearing impaired viewers to perceive the semantics of AV contents by reading the lips, since these viewers make extensive use of visual speech in communication. Lip assistant is capable of automatically converting speech tracks to lip parameters at a bit rate of 2.2kbps, and decoding them to video-realistic mouth animation on the fly. Extensive experiments have shown that lip assistant provides useful lipreading information that does help the speech perception of both machines and humans. Fig. 10 shows snapshots from a synthesized mouth sequence by lip assistant.

VI. ACKNOWLEDGMENTS

This work has been supported by research grant CityU 1247/03E.

REFERENCES

[1] J. J. Williams and A. K. Katsaggelos, An HMM-based speech-to-video synthesizer, IEEE Trans. on Neural Networks, vol. 13, no. 4, 2002, pp. 900-915.
[2] W. H. Sumby and I. Pollack, Visual contribution to speech intelligibility in noise, Journal of the Acoustical Society of America, vol. 26, no. 2, 1954, pp. 212-215.
[3] E. Cosatto, J. Ostermann, H. P. Graf and J. Schroeter, Lifelike talking faces for interactive services, Proc. of the IEEE, vol. 91, no. 9, 2003, pp. 1406-1428.
[4] F. Lavagetto, Converting speech into lip movements: A multimedia telephone for hard of hearing people, IEEE Trans. on Rehabilitation Engineering, vol. 3, no. 1, 1995, pp. 90-102.
[5] L. Xie, X.-L. Cai and R.-C. Zhao, A robust hierarchical lip tracking approach for lipreading and audio visual speech recognition, in Proc. ICMLC'04, 2004.
[6] C. Bregler, M. Covell and M. Slaney, Video Rewrite: driving visual speech with audio, in Proc. ACM SIGGRAPH'97, 1997.
[7] S. Young, G. Evermann, D. Kershaw, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2002.
[8] C. J. Leggetter and P. C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, vol. 9, 1995, pp. 171-185.
[9] L. Xie and Z.-Q. Liu, An articulatory approach to video-realistic mouth animation, in Proc. ICASSP'06, 2006.