Speech-To-Video Synthesis Using MPEG-4 Compliant Visual Features

Petar S. Aleksic, Student Member, IEEE, and Aggelos K. Katsaggelos, Fellow, IEEE
Abstract—There is a strong correlation between the building blocks of speech (phonemes) and the building blocks of visual speech (visemes). In this paper, this correlation is exploited and an approach is proposed for synthesizing the visual representation of speech from a narrow-band acoustic speech signal. The visual speech is represented in terms of the facial animation parameters (FAPs) supported by the MPEG-4 standard. The main contribution of this paper is the development of a correlation hidden Markov model (CHMM) system, which integrates independently trained acoustic HMM (AHMM) and visual HMM (VHMM) systems in order to realize speech-to-video synthesis. The proposed CHMM system allows for different model topologies for acoustic and visual HMMs. It performs late integration and reduces the amount of required training data compared to early integration modeling techniques. Temporal accuracy experiments, comparison of the synthesized FAPs to the original FAPs, and audio-visual automatic speech recognition (AV-ASR) experiments utilizing the synthesized visual speech were performed in order to objectively measure the performance of the system. The objective experiments demonstrated that the proposed approach reduces time alignment errors by 40.5% compared to the conventional temporal scaling method, that the synthesized FAP sequences are very similar to the original FAP sequences, and that the synthesized FAP sequences contain visual speechreading information that can improve AV-ASR performance.

Index Terms—Audio-visual speech recognition, correlation hidden Markov models (CHMMs), facial animation parameters (FAPs), speech-to-video synthesis.
I. INTRODUCTION
A SPEECH-TO-VIDEO synthesis system can find various applications. It can, for example, provide visual information during a telephone conversation that could be beneficial to people with impaired hearing [1]. Current videophones can transmit visual information, but due to bandwidth limitations they do not encode every visual frame, which can cause a loss of lip synchronization in the received video sequence. In addition, speech signals in videophone systems are compressed, which can cause a loss of speech quality. It has been shown that speech recognition accuracy in noise can be significantly improved when both audio and visual information are used, compared to the audio-only speech recognition scenario [1], [2]. Therefore, it is desirable that a listener also has visual
speechreading information available in addition to the acoustic information. In this paper, we address the problem of synthesizing lip movements from the acoustic measurements of the speech signal.

Hidden Markov models (HMMs) [3], [4], the statistical approach to speech recognition, are typically used to calculate the probability of generating a specific observation sequence. They can also be used to generate observations according to the observations' probability density functions (pdfs), which gives rise to another application of HMMs, that of speech synthesis. Several researchers have used HMMs to create speech-driven facial animation systems [5]–[13], and they are also used in this paper. In HMM-based methods, explicit phonetic information is available to help the analysis of coarticulation effects caused by surrounding phoneme contexts, which represents an advantage over other synthesis methods. Unlike neural network approaches [5], it is easier to understand the physical phenomenon through the HMM structure. A brief description of the visual speech synthesis methods that have appeared in the literature is provided next.

Simons and Cox [6] developed an HMM-based speech-driven synthetic head. They analyzed 50 phonetically rich sentences to obtain 10 000 speech and 5000 visual vectors. They used vector quantization (VQ) to produce speech and image codebooks of 64 and 16 codes, respectively. They created a fully connected Markov model of 16 states, each representing a particular vector-quantized mouthshape. They then determined the probability of each mouthshape producing each of the 64 VQ speech vectors, and estimated the transition probabilities and the joint occurrences of speech and mouthshape VQ symbols. After training the system, they used the Viterbi algorithm to generate the most likely visual state sequence from the input VQ speech observations.

Chen and Rao [7] trained HMMs using audio-visual observation parameters (henceforth such HMMs are referred to as AV-HMMs). They used the width and height of the outer contour of the mouth as visual features and 13 cepstrum coefficients as acoustic features. The AV-HMMs were then used to build acoustic HMMs (AHMMs) by generating a new set of pdfs from the trained AV-HMMs. In the training phase, the trained AHMMs and the training acoustic speech data are used to obtain (using the Viterbi algorithm) the optimal state sequences. The correspondence between each AHMM state and the visual speech parameters is then calculated and stored in a look-up table; the visual parameters per AHMM state are obtained by taking the average of all visual parameters assigned to the same AHMM state. In the synthesis process, the optimal state sequences for the testing acoustic speech data were
generated using trained HMMs and the Viterbi algorithm. Finally, the visual speech parameters are estimated for each state of the optimal state sequence.

Bregler et al. [8] created an HMM-based speech-driven facial animation system called Video Rewrite. They first trained an AHMM system using the TIMIT database. Next, they created a time-aligned phoneme-level transcription of an audio-visual database using the trained AHMMs and the Viterbi algorithm. They also created a database of triphone video segments using the time-aligned phoneme-level transcription and the video track of the database. In the synthesis stage, they first aligned the acoustic speech data using the trained AHMMs and the Viterbi algorithm. Next, they selected the closest triviseme segments from the video database according to certain distance metrics. Finally, they used warping techniques to smooth the video segments and synchronize them with the speech signal.

The Baum–Welch HMM inversion algorithm was first proposed for robust speech recognition [9]. Choi and Hwang [9], [10] utilized this algorithm for visual speech synthesis and compared it to the Gaussian mixture-based HMM inversion algorithm proposed by Chen and Rao [7]. Their results showed that the HMM inversion algorithm performed better than the Gaussian mixture-based HMM method at estimating visual speech parameters.

Nakamura [11] describes an HMM-based algorithm for speaking face synthesis based on the correlation between audio and visual speech parameters. He first describes the procedure for building an audio-visual speech recognition system. Independently trained acoustic and video HMMs are first built and then used to realize an AV-HMM system consisting of product HMMs for modeling the integration of the acoustic and visual data. The visual parameters used were the height and width of the outer lip contour and the protrusion of the lip sides from an original point. The audio features used consisted of 16 mel-cepstral coefficients, their delta coefficients, and the delta log power. The expectation-maximization (EM) algorithm is used to estimate the visual speech parameters by repeatedly estimating the visual speech parameters while maximizing the likelihood of the audio and visual joint probability of the AV-HMMs. The Euclidean error distance between the synthesized visual parameters and the original visual parameters extracted from the human movements is measured to evaluate the quality of the synthesis process.

Williams and Katsaggelos [12] utilized independently trained AHMMs and VHMMs and a correlation HMM (CHMM) for visual speech synthesis. Eigenlips were utilized as visual features. The subjective evaluation of the synthesized speech is described in [13]. The work described in this paper also builds on the concept of CHMMs.

MPEG-4 [14], [15] is an audio-visual object-based video representation standard supporting facial animation. MPEG-4 facial animation is controlled by the facial definition parameters (FDPs) and facial animation parameters (FAPs), which describe the face shape and movement, respectively [14], [15]. A synthetic face or avatar [Fig. 1(a)] [16] can be animated with different shapes and expressions using FDPs and FAPs. The MPEG-4 standard defines 68 FAPs, divided into 10 groups. Group-8 FAPs describe the movement of the outer lips [Fig. 1(b)] and are used as visual observations in the proposed
Fig. 1. (a) MPEG-4 model [16]. (b) Group-8 FAPs.
system. As a result, any MPEG-4 decoder can be used for playing the synthesized video. Transmission of all FAPs at 30 frames per second requires only around 20 kb/s (or just a few kb/s, if MPEG-4 FAP interpolation is efficiently used [17]), which is much lower than standard video transmission rates. FAPs contain important visual information that can be used in addition to audio information to improve speech understanding [18], [19]. There are many applications of facial animation using FAPs, including personal communications, messaging, and teleconferencing. Facial animation applications also include television, film, video games, and advertisements. FAPs can also be used to control an interactive virtual character in a human-like user interface environment that can be used for educational or entertainment purposes.

In this paper, we make use of FAPs to perform speech-to-video synthesis. We propose a CHMM system for mapping an acoustic signal into a video signal, which is described in terms of FAPs. In order to evaluate the quality of the results, several different objective tests were performed. The quality of the lip synchronization was evaluated by performing a comparison with the conventional temporal scaling approach. The FAP sequences synthesized by the proposed CHMM system were compared to the original FAPs (extracted from the original video using the algorithms described in [18], [19]). We also performed AV-ASR experiments using the synthesized FAPs as visual features in order to evaluate the improvement in speech recognition performance.

In the remainder of the paper, we first describe the proposed correlation HMM system and its advantages in Section II and the training procedures used in Section III. The experiments and the objective evaluation of the results are described in Section IV. Finally, conclusions are drawn and future work is proposed in Section V.

II. CHMM SYSTEM

The block diagram of the proposed speech-to-video synthesis system is shown in Fig. 2. The AHMM system is used to generate an acoustic state sequence that best describes the input acoustic speech signal. The reason for this step (AHMM) is to remove any speaker dependencies. The acoustic state sequence is mapped into a visual state sequence using the CHMM system. Finally, the resulting visual state sequence is used, in addition to the VHMMs, to produce a visual observation sequence described in terms of FAPs.

The CHMM system integrates an independently trained AHMM system and a VHMM system in order to realize speech-to-video synthesis. The proposed CHMM system allows for different model topologies for acoustic and visual HMMs.
Fig. 2. Block diagram of the proposed speech-to-video synthesizer.
It also reduces the amount of required training data compared to early integration modeling techniques. The proposed system only impacts the receiving end of the transmission channel; therefore, it can be easily integrated into existing telecommunication networks.

The goal of the CHMM system is to find the most likely visual state sequence that corresponds to the given acoustic observations. Let us denote an acoustic observation sequence and a visual observation sequence by

$$O^a = \{o^a_1, o^a_2, \ldots, o^a_{T_a}\} \qquad (1)$$

$$O^v = \{o^v_1, o^v_2, \ldots, o^v_{T_v}\} \qquad (2)$$

where $o^a_t$ and $o^v_t$ denote an acoustic and a visual observation, and $T_a$ and $T_v$ denote the number of acoustic and visual observations in an observation sequence (the lengths of the two sequences), respectively. An HMM is denoted by $\lambda = (A, B, \pi)$, where $A$, $B$, and $\pi$ represent, respectively, the state transition matrix, the observation probability density function (pdf), and the initial state occupancy. Let us also denote the acoustic and visual state sequences by

$$Q^a = \{q^a_1, \ldots, q^a_{T_a}\}, \qquad Q^v = \{q^v_1, \ldots, q^v_{T_v}\}. \qquad (3)$$

Given an acoustic (visual) observation sequence, the most likely acoustic (visual) state sequence can be produced using the acoustic (visual) model $\lambda^a$ ($\lambda^v$) and the Viterbi algorithm, according to

$$\hat{Q}^a = \arg\max_{Q^a} P(Q^a \mid O^a, \lambda^a), \qquad \hat{Q}^v = \arg\max_{Q^v} P(Q^v \mid O^v, \lambda^v). \qquad (4)$$

Since the acoustic frame rate $f_a$ is usually higher than the visual frame rate $f_v$, in order to estimate the visual state sequence from the acoustic state sequence, either the visual sequence should be up-sampled or the acoustic sequence should be down-sampled by a factor of

$$n = \frac{f_a}{f_v}. \qquad (5)$$

While down-sampling of the acoustic state sequence would cause loss of information contained in the acoustic state sequence, up-sampling of the visual state sequence would result in state duration errors due to constraints of the Markovian process [3], [12]. Therefore, we first use the aligned acoustic state sequence $\hat{Q}^a$, resulting from temporal alignment, together with the trained AHMMs to generate an acoustic observation sequence

$$\hat{O}^a = \{\hat{o}^a_1, \ldots, \hat{o}^a_{T_a}\} \qquad (6)$$

which is then filtered in order to smooth the parameters (see Fig. 2). The combination of a median filter followed by a Hanning filter was used to smooth the parameters. Afterwards, the observation sequence is down-sampled by $n$ to obtain a new sequence of length $T_v$, as follows:

$$\tilde{O}^a = \{\tilde{o}^a_1, \ldots, \tilde{o}^a_{T_v}\}. \qquad (7)$$

In order to generate a visual state sequence from the down-sampled acoustic observations, we trained a CHMM system $\lambda^c$. The CHMM system takes the down-sampled acoustic observations at the input and, using the Viterbi algorithm, generates the corresponding visual state sequence, that is,

$$\hat{Q}^v = \arg\max_{Q^v} P(Q^v \mid \tilde{O}^a, \lambda^c). \qquad (8)$$

The topology of the CHMM system is carefully chosen in order to perform the task at hand, that is, to generate a visual state sequence. The CHMMs were designed to have the same number of states as the VHMMs, in order to be able to estimate the optimal visual state sequence, and the same transition probabilities as the VHMMs, in order to inherit their constraints [12]. As a result of these constraints, only the observation pdfs (means and covariances) and mixture weights need to be estimated. In order to estimate these parameters, down-sampled acoustic observations were generated for all sentences in the training part of the Bernstein database using the trained AHMMs [see (4)–(7)]. Visual state sequences were also generated by force aligning the visual training data using the VHMMs and the Viterbi algorithm [see (4)]. Visual state sequences were used to distribute
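As a rough illustration of the filtering and down-sampling in (6) and (7), the sketch below smooths a sequence of acoustic observation vectors with a median filter followed by a normalized Hanning window and then keeps every n-th frame. The down-sampling factor n = 3 follows from the 90 Hz acoustic and 30 Hz visual rates quoted later in the paper; the window lengths are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_and_downsample(obs, n=3, med_len=5, han_len=7):
    """Smooth a (T_a, D) sequence of acoustic observation vectors with a
    median filter followed by a Hanning window, then keep every n-th frame.
    med_len and han_len are illustrative choices, not values from the paper."""
    obs = np.asarray(obs, dtype=float)
    # Median filter each feature dimension independently.
    smoothed = np.stack([medfilt(obs[:, d], kernel_size=med_len)
                         for d in range(obs.shape[1])], axis=1)
    # A normalized Hanning window acts as a weighted moving average (low-pass).
    win = np.hanning(han_len)
    win /= win.sum()
    smoothed = np.stack([np.convolve(smoothed[:, d], win, mode="same")
                         for d in range(obs.shape[1])], axis=1)
    # Down-sample from the acoustic frame rate to the visual frame rate.
    return smoothed[::n]

# Example: 39-dimensional acoustic observations at 90 Hz -> 30 Hz.
acoustic_obs = np.random.randn(300, 39)
visual_rate_obs = smooth_and_downsample(acoustic_obs, n=3)
print(visual_rate_obs.shape)   # (100, 39)
```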
Fig. 3. Training procedure for AHMMs, VHMMs, and CHMMs.
the down-sampled acoustic observations among the states of the CHMMs, in order to reestimate their observation pdfs. As a result, a set of trained CHMMs, which generate the visual state sequence from the acoustic input signal, was obtained. The training procedure is shown in Fig. 3 and described in detail in Section III.

III. TRAINING PROCEDURE

In this study, we utilized the TIMIT speech database and speechreading material from the Bernstein Lipreading Corpus [20]. The AHMMs were trained using the speech data from the TIMIT database. The training data consisted of the training utterances spoken by the female speakers in dialect regions DR1–DR8. Each acoustic observation consisted of 12 mel-frequency cepstral coefficients (MFCCs), the energy term, and the corresponding velocity and acceleration coefficients. The dimensionality of the acoustic observation vector was, therefore, equal to 39. The CMU Pronunciation Dictionary [21] was used to transcribe the Bernstein Lipreading Corpus. Since the phone set used for the TIMIT database differs from that of the CMU dictionary, the utterances in the TIMIT database were transcribed using the CMU dictionary.

The Bernstein database is a high-quality audio-visual database and it includes a total of 954 sentences, 474 of which were uttered by a female speaker and the remaining 480 by a male speaker. For each of the sentences, the database contains a speech waveform, a word-level transcription, and a video sequence time-synchronized with the speech waveform.
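A minimal sketch of how 39-dimensional acoustic observations of the kind described above can be built. The paper does not specify the toolchain; librosa is assumed here purely for illustration, the 0th cepstral coefficient is used as a stand-in for the energy term, and the file name is hypothetical.

```python
import numpy as np
import librosa

def acoustic_features(wav_path, sr=16000, hop_s=1.0 / 90):
    """Build 39-dimensional acoustic observations: 13 static coefficients
    (12 MFCCs plus an energy-like c0) with delta and delta-delta terms.
    A frame rate of roughly 90 Hz is assumed to match the acoustic rate
    quoted later in the paper; c0-as-energy is an approximation."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(round(sr * hop_s))                       # ~178 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    delta = librosa.feature.delta(mfcc)                # velocity coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)      # acceleration coefficients
    feats = np.vstack([mfcc, delta, delta2])           # shape (39, T_a)
    return feats.T                                     # one 39-dim vector per frame

# obs = acoustic_features("utterance.wav")   # hypothetical file name
# print(obs.shape)                           # (T_a, 39)
```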
Fig. 4. Variation of the coefficient corresponding to the first eigenvector: lip shapes corresponding to (a) +2 standard deviations, (b) the mean lip shape, and (c) −2 standard deviations.

Fig. 5. Variation of the coefficient corresponding to the second eigenvector: lip shapes corresponding to (a) −2 standard deviations, (b) the mean lip shape, and (c) +2 standard deviations.
The raw visual observations framed the head and shoulders of the speaker against a light blue background. Each utterance began and ended with a period of silence. The vocabulary size is approximately 1000 words. Audio was acquired at a rate of 16 kHz. The ten FAPs 8.1–8.10 describing the outer lip position [see Fig. 1(b)] were extracted from the visual part of the database using the algorithm described in [18], [19]. In order to decrease the dimensionality of the visual features and decorrelate them, principal component analysis (PCA) [22] was performed on the 10-dimensional FAP vectors.
Fig. 6. (a) The model topology for AHMMs. (b) The "silence" model's topology. (c) The model topology for VHMMs and CHMMs.
After the 10 × 1 mean FAP vector $\bar{f}$ and the 10 × 10 covariance matrix were obtained, the FAPs, $f_t$, were projected onto the eigenspace defined by the first $K$ eigenvectors, that is,

$$w_t = P_K^{T}\,(f_t - \bar{f}) \qquad (9)$$

where $P_K = [p_1, \ldots, p_K]$ is the 10 × $K$ matrix of eigenvectors which correspond to the $K$ largest eigenvalues, and $w_t$ is the vector of corresponding projection weights. The first six, two, and one eigenvectors represent 99.6%, 93%, and 81% of the total statistical variance, respectively. By varying the projection weights by ±2 standard deviations, we concluded that the first and second eigenvectors mostly describe the movement of the lower and upper lip, respectively, as shown in Figs. 4 and 5. In the middle column of these figures, the mean lip shape is shown. The lip shapes obtained by varying the first and second eigenvector weights by ±2 standard deviations are shown in parts (a) and (c) of Figs. 4 and 5, respectively.

When choosing the dimensionality of the visual feature vector to be used for the training of the VHMMs, one should keep in mind the tradeoff between the number of HMM parameters that have to be estimated and the amount of the speechreading information contained in the visual features. Based on the statistical variance distribution, the above-mentioned tradeoff, and the objective of obtaining good quality synthesized video, we decided to use six-dimensional projection weights as visual features. The visual observation vector, which is generated for each video frame (30 frames per second), consisted of six PCA parameters and the corresponding velocity and acceleration terms. Therefore, the dimensionality of the visual observation vector utilized for the training of the VHMMs was equal to 18.
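A sketch of the PCA projection in (9) using a plain eigendecomposition of the sample covariance. The choice of K = 6 follows the paper; the variable names and the random stand-in data are illustrative assumptions.

```python
import numpy as np

def fit_fap_pca(faps, K=6):
    """faps: (T, 10) matrix of Group-8 FAP vectors. Returns the mean FAP vector,
    the 10 x K matrix of leading eigenvectors, and the fraction of variance kept."""
    faps = np.asarray(faps, dtype=float)
    mean = faps.mean(axis=0)
    cov = np.cov(faps - mean, rowvar=False)            # 10 x 10 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    P = eigvecs[:, order[:K]]                          # leading K eigenvectors
    explained = eigvals[order[:K]].sum() / eigvals.sum()
    return mean, P, explained

def project(faps, mean, P):
    """Projection weights w_t = P^T (f_t - mean), as in (9)."""
    return (np.asarray(faps, dtype=float) - mean) @ P

faps = np.random.randn(1000, 10)          # stand-in for extracted FAP sequences
mean, P, explained = fit_fap_pca(faps, K=6)
weights = project(faps, mean, P)          # (1000, 6) visual features before deltas
print(weights.shape, round(float(explained), 3))
```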
As shown in Fig. 3, the AHMMs and VHMMs were trained independently of each other. A total of 355 context-dependent AHMMs (triphones, biphones, and phones) were trained using the Baum–Welch algorithm. Each acoustic model consisted of three states in a left-to-right topology [see Fig. 6(a)]. The silence model also included transitions from state three to state one and vice versa [see Fig. 6(b)] to handle the long segments of silence that exist at the beginning and end of sentences in both the TIMIT and Bernstein databases.

Each visual model consisted of three states, also configured in a left-to-right topology. The VHMMs, unlike the AHMMs, had a skip transition from state one to state three [see Fig. 6(c)], since the video sampling frequency (30 Hz) was lower than the acoustic one (90 Hz). The VHMMs were trained using 80% of the utterances from the Bernstein database, while the remaining 20% were used for testing. The VHMMs' covariance matrices were reduced to diagonal matrices. This assumption is made based on the orthogonality of the FAP PCA vectors. Only 63 biphone and phone VHMMs, shown in Table I, were trained using the Baum–Welch algorithm, because of the limited amount of visual data. The number of context-dependent models (biphones) was determined by setting a minimum threshold (50 in our experiments) on the number of their occurrences in the training part of the database. This approach is required to assure an adequate number of visual observations for training. Iterative mixture splitting and retraining were performed for both AHMMs and VHMMs to obtain the final 11-mixture and 3-mixture component HMMs, respectively, which were used for testing.
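A minimal sketch of the three transition structures in Fig. 6, written as state transition matrices: a plain three-state left-to-right model for the AHMMs, a silence model that additionally allows transitions between states three and one, and a VHMM/CHMM model with a skip transition from state one to state three. The numerical probabilities are arbitrary illustrative values, and exit transitions are omitted.

```python
import numpy as np

# Three-state left-to-right AHMM topology [Fig. 6(a)]:
# each state may self-loop or move to the next state.
A_ahmm = np.array([[0.6, 0.4, 0.0],
                   [0.0, 0.6, 0.4],
                   [0.0, 0.0, 1.0]])

# Silence model [Fig. 6(b)]: additionally allows state 3 -> state 1
# and state 1 -> state 3, to absorb long silence segments.
A_sil = np.array([[0.55, 0.40, 0.05],
                  [0.00, 0.60, 0.40],
                  [0.30, 0.00, 0.70]])

# VHMM/CHMM topology [Fig. 6(c)]: left-to-right with a skip from state 1 to
# state 3, needed because the visual rate (30 Hz) is a third of the acoustic rate.
A_vhmm = np.array([[0.5, 0.4, 0.1],
                   [0.0, 0.6, 0.4],
                   [0.0, 0.0, 1.0]])

for name, A in [("AHMM", A_ahmm), ("silence", A_sil), ("VHMM/CHMM", A_vhmm)]:
    assert np.allclose(A.sum(axis=1), 1.0), name   # rows are valid distributions
```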
TABLE I. VHMM AND CHMM CONTEXT-DEPENDENT MODEL TRAINING OCCURRENCES. CONTEXT-DEPENDENT MODELS (BIPHONES) WERE CREATED IF 50 OR MORE EXEMPLARS EXISTED IN THE TRAINING SET.
Fig. 7. (a) Original video frames. (b) Synthesized video frames.
Both acoustic and visual observations from the Bernstein database were used for the training of the CHMMs, as shown in Fig. 3. The CHMMs have the same topology as the VHMMs in order to be able to estimate the optimal visual state sequence. The number of trained CHMMs is also the same as the number of trained VHMMs, which is 63. The Viterbi decoding algorithm and the trained AHMMs and VHMMs were used to perform forced alignment of the Bernstein acoustic and visual training data in order to obtain acoustic and visual state sequences, respectively. The down-sampled acoustic observations were generated from the acoustic state sequences and used for the training of the CHMMs. The initial pdf means and variances of the CHMMs were set equal to the centroids of the pdf means and variances of all the AHMM states. All CHMM and VHMM means and variances were shared among the phone states. The aligned visual state sequences were used to distribute the down-sampled acoustic observations among the CHMM states. These observations were then used for the re-estimation of the CHMM mixture weights and the means and variances of the pdfs. At the end of the training process, a set of trained CHMMs that can be used to generate the visual state sequence from the acoustic input signal was obtained.
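A simplified sketch of the re-estimation step described above, assuming a single Gaussian per state rather than the mixture densities actually used: the down-sampled acoustic observations are pooled according to the force-aligned visual state labels, and a mean and diagonal variance are computed per CHMM state. The data in the usage example are random stand-ins.

```python
import numpy as np

def reestimate_chmm_pdfs(obs, visual_states, n_states):
    """obs: (T_v, D) down-sampled acoustic observations;
    visual_states: length-T_v array of force-aligned visual/CHMM state indices.
    Returns per-state means and diagonal variances (a single-Gaussian
    simplification of the mixture re-estimation used in the paper)."""
    obs = np.asarray(obs, dtype=float)
    labels = np.asarray(visual_states)
    D = obs.shape[1]
    means = np.zeros((n_states, D))
    variances = np.ones((n_states, D))
    for s in range(n_states):
        pooled = obs[labels == s]            # observations assigned to state s
        if len(pooled) > 1:
            means[s] = pooled.mean(axis=0)
            variances[s] = pooled.var(axis=0) + 1e-6   # small floor for stability
    return means, variances

# Example with random stand-in data: 3 CHMM states, 39-dim observations.
T_v, D = 200, 39
obs = np.random.randn(T_v, D)
states = np.random.randint(0, 3, size=T_v)
mu, var = reestimate_chmm_pdfs(obs, states, n_states=3)
print(mu.shape, var.shape)   # (3, 39) (3, 39)
```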
IV. EXPERIMENTS AND PERFORMANCE EVALUATION

After the CHMMs, AHMMs, and VHMMs were trained, we performed tests using acoustic observations from the testing part of the Bernstein database. The Viterbi algorithm and the trained AHMMs were used to obtain a time-aligned acoustic state sequence, which was used to generate acoustic observations, which were afterwards down-sampled. The Viterbi algorithm and the CHMMs were used together with the down-sampled observations to obtain the time-aligned visual state sequences. Time-aligned visual state sequences were also obtained by force-aligning the visual observations using the Viterbi algorithm and the trained VHMMs. Afterwards, synthesized visual observations were generated using the aligned visual state sequences and the trained VHMM pdf parameters. After the visual observations were smoothed, PCA expansion was performed in order to obtain 10-dimensional FAP vectors that control the movement of the outer lip of the synthetic face (see Fig. 2). FAP vectors were generated from both the CHMM and the VHMM visual state sequences. The FAP vectors were used together with the acoustic speech signal as an input to the MPEG-4 decoder [see Fig. 1(a)].

The evaluation of the performance of a speech-to-video synthesis system should be application-dependent and could also be performed with the use of subjective tests. Such tests are clearly elaborate and are beyond the scope of this paper. Some example original and corresponding synthesized frames are shown in Fig. 7. Since the original speech and video data are available, in this paper we propose and implement the following three objective evaluations:

1) Temporal accuracy: for its evaluation, we compared the time-aligned visual state sequences generated by the CHMM and VHMM systems.

2) Mean-squared error between the FAP sequences extracted from the original video data and the synthesized FAP sequences generated by the CHMM system, as well as the FAP sequences generated by the VHMM system.

3) Performance of an AV-ASR system utilizing the original speech contaminated by additive noise at various signal-to-noise ratios (SNRs) and the synthesized FAPs obtained by utilizing the noise-free speech signal, as well as the FAPs extracted from the original video and those synthesized by the VHMMs and the visual state sequence obtained by aligning the video data. This evaluation provides a quantitative measure of the speechreading information provided by the synthesized FAPs, or an objective evaluation of how good and useful the proposed system is for the AV-ASR application.
Fig. 8. Block diagram of the system used for generation of VHMM synthesized FAPs.
The experiments and the resulting evaluations mentioned above are described in Sections IV-A–C. For these experiments, we will make reference to the system in Fig. 8. According to it, FAPs are extracted from the original video utilizing the techniques described in [18] and [19]. They are then input to a VHMM system which provides a temporally aligned visual state sequence, which in turn is used as input to a VHMM system for generating the VHMM synthesized FAPs.

A. Temporal Accuracy

In order to objectively evaluate the performance of the CHMM system, we performed measurements of temporal accuracy. Temporal accuracy is quantified by comparing the time-aligned visual state sequences generated by the CHMM system (see Fig. 2) with reference time-aligned visual state sequences generated by force-aligning the visual testing observations (generated from the original video utilizing the techniques in [18] and [19]) using the trained VHMMs and the Viterbi algorithm, as shown in Fig. 8. Phone time alignment is also obtained using the AHMMs, where the context phone durations (see Fig. 2) were obtained by scaling the duration of the acoustic states by a factor of three to compensate for the frame rate difference.

All of the above-mentioned visual state alignment statistics are shown in Table II. The first and second columns in Table II contain the phone or biphone names and the number of their occurrences in the testing data. Columns 3–5 denote the mean and standard deviation of the number of frames assigned to each phoneme by the VHMM system through visual state alignment. Columns 6 and 7 (8 and 9) denote the number of frames in error and the error percentage obtained by comparing the phoneme counts calculated by the AHMMs (the CHMMs) to the phoneme counts obtained by the VHMMs. The results show that the use of the CHMM system reduced the time-alignment errors by 40.5% compared to the temporal scaling results (obtained by using the AHMMs).

Some of the phone models performed worse when the CHMMs were used. The unitary vowels /ow/, /aa/, and /er/, which are articulated by exciting a fixed vocal tract and are hard to perceive visually, were particularly challenging for the CHMMs. A relative decrease in performance is also detected for the fricative /f/, which is visually very similar to the fricative /v/. The CHMM system also performed worse for the liquid /r/ and the plosive /b/, due to the lack of context information. Most context-dependent models showed significant improvement compared to models trained without context.
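A sketch of how the per-model frame-error statistics reported in Table II can be computed, assuming that the reference (VHMM force-aligned) and test (CHMM or scaled-AHMM) alignments are available as per-frame phone labels; the label format and example values are hypothetical.

```python
import numpy as np

def frame_alignment_error(reference, test):
    """reference, test: equal-length sequences of per-frame phone (or state)
    labels. Returns the number of frames in error and the error percentage,
    in the spirit of the per-model columns of Table II."""
    reference = np.asarray(reference)
    test = np.asarray(test)
    assert reference.shape == test.shape
    errors = int(np.sum(reference != test))
    return errors, 100.0 * errors / len(reference)

# Hypothetical example: reference from VHMM forced alignment,
# test from the CHMM-generated visual state sequence.
ref  = ["sil", "sil", "b", "b", "ah", "ah", "ah", "t", "sil"]
test = ["sil", "b",   "b", "b", "ah", "ah", "t",  "t", "sil"]
print(frame_alignment_error(ref, test))   # (2, 22.2...)
```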
B. FAP Comparison

In order to objectively evaluate the quality of the FAP sequences generated by the proposed speech-to-video synthesis system, we compared the CHMM synthesized FAPs (Fig. 2) and the VHMM synthesized FAPs (Fig. 8) with the original FAPs (Fig. 8). The reason we compare the VHMM synthesized FAPs with the original FAPs is to evaluate the accuracy of the VHMMs, since the reference visual state sequences shown in Fig. 8 were used in training the CHMMs. All FAPs are expressed in terms of facial animation parameter units (FAPUs) [14], [15], shown in Fig. 9. These units are normalized by important facial feature distances in order to give an accurate and consistent representation. The FAPs from Group 8 that describe the outer lip contours are represented with the use of two FAPUs, the mouth width (MW) and the mouth–nose separation (MNS), shown in Fig. 9. Each of these two distances is normalized to 1024. The mean squared error (MSE) is used to compare two FAP sequences $F$ and $\hat{F}$ and is defined by

$$\mathrm{MSE} = \frac{1}{N_f N_F} \sum_{n=1}^{N_f} \sum_{i=1}^{N_F} \left( F_i(n) - \hat{F}_i(n) \right)^2 \qquad (10)$$

where $N_f$ denotes the number of frames in the sequence, $N_F$ the number of FAPs compared, and $F_i(n)$ the $i$th FAP in the $n$th frame of the sequence $F$ (similarly for $\hat{F}$). Since two of the Group-8 FAPs (FAPs 8.3 and 8.4) describe horizontal movement of the outer lip points and the remaining eight vertical movement [see Fig. 1(b)], the MSE was calculated separately for these two groups of FAPs. In order to express the error in terms of FAPUs, we also calculated the percentage normalized mean error (PNME) from the MSE according to

$$\mathrm{PNME} = \frac{\sqrt{\mathrm{MSE}}}{1024} \times 100\%. \qquad (11)$$

The PNME represents the percentage error with respect to the mouth width (for FAPs 8.3 and 8.4) and the mouth-to-nose distance (for the remaining Group-8 FAPs). The PNME is generated for the following pairs of FAP sequences: original/synthesized and original/force-aligned, and is calculated for all 95 sentences from the testing part of the Bernstein database. Fig. 10 shows the PNME calculated for the two FAPs that describe the horizontal outer lip movement, while Fig. 11 shows the PNME calculated for the remaining eight FAPs that describe the vertical outer lip movement. It can be observed in these figures that the PNMEs between the CHMM synthesized and original FAPs [shown in Figs. 10(b) and 11(b)] are close to the PNMEs between the VHMM synthesized and original FAPs [shown in Figs. 10(a) and 11(a)]. The average error of about 4% [for a total of two FAPs in Fig. 10(b)] and 11% [for a total of eight FAPs in Fig. 11(b)] between the CHMM synthesized and the original FAPs is considered to be quite small. Based on visual inspection, the CHMM synthesized FAPs are indistinguishable from the original FAPs. Major contributors to the PNMEs were the temporal errors discussed earlier.
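A direct transcription of (10) and (11) as given above, assuming that both FAP sequences are arrays of shape (frames, FAPs) expressed in FAP units, where 1024 corresponds to the relevant FAPU (MW or MNS); the example data are random stand-ins.

```python
import numpy as np

def mse(F, F_hat):
    """Mean squared error between two FAP sequences of shape (N_f, N_F), per (10)."""
    F, F_hat = np.asarray(F, float), np.asarray(F_hat, float)
    return np.mean((F - F_hat) ** 2)

def pnme(F, F_hat):
    """Percentage normalized mean error, per (11): the RMS FAP error expressed
    as a percentage of the FAPU distance (1024 FAP units = MW or MNS)."""
    return 100.0 * np.sqrt(mse(F, F_hat)) / 1024.0

# Horizontal (FAPs 8.3, 8.4) and vertical (remaining Group-8 FAPs) errors
# are computed separately, as in the paper. Random stand-in data:
orig_h = np.random.randn(100, 2) * 40
synth_h = np.random.randn(100, 2) * 40
print("horizontal PNME: %.2f%%" % pnme(orig_h, synth_h))
```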
TABLE II. VISEME DURATIONS OBTAINED FROM VHMM, AHMM, AND CHMM STATE ALIGNMENTS.
The PNMEs calculated for the FAPs that describe the horizontal lip movement were smaller than the PNMEs calculated for the FAPs that describe the vertical lip movement. This is to be expected, since the changes in the lip closure are larger than the changes in the lip width that occur during the articulation of speech.

C. Audio-Visual Automatic Speech Recognition
Another objective metric we propose for evaluating the performance of the proposed speech-to-video system, or the quality of the synthesized FAP sequences, is the increase in speech recognition accuracy of an AV-ASR system over an audio-only ASR (A-ASR) system. It provides a means of quantifying the amount of speechreading information contained in the synthesized visual speech, which is of interest in most, if not all, applications. The use of visual information in addition to audio improves speech understanding, especially in noisy environments. Improving ASR performance by exploiting the visual information of the speaker's mouth region is the main objective of AV-ASR [1], [18], [23]–[30]. Many researchers have reported results that demonstrate AV-ASR performance improvement over A-ASR systems (for a recent review, see [26]). In this study, we utilize the synthesized FAPs as visual features in the AV-ASR system proposed in [18].

The audio and visual speech information can be combined in AV-ASR systems using several techniques. One such technique is to simply concatenate the audio and visual features to form larger feature vectors (early integration), and then train a single-stream HMM system on the concatenated vectors. Although using concatenated audio-visual features can improve AV-ASR performance over A-ASR, it does not allow for the modeling of the reliability of the audio and visual streams. It therefore cannot take advantage of the information that might be available about the acoustic noise in the environment, which affects the audio features, or the visual noise, which affects the quality of the visual features. A second approach (used in this study) is to model the audio and visual features as separate feature streams (late integration), by use of multistream HMMs. Their log-likelihoods can be combined using weights that capture the reliability of each particular stream. Therefore, we can choose larger values for the acoustic stream weights in order to rely more on the acoustic data when there is not much acoustic noise in the environment, and smaller values for the acoustic weights in the presence of significant acoustic noise. The recombination of log-likelihoods can be performed at different levels of integration, such as the state, phone, word, and utterance levels [25].

The two streams, audio and visual, are combined in the approach we developed as shown in Fig. 12. The mel-frequency cepstral coefficients (MFCCs), the signal energy, and their first and second derivatives, widely used in speech processing, were used as audio features, while the six-dimensional projection weights (CHMM synthesized FAP vectors) and their first and second derivatives were used as visual features. Since the MFCCs were obtained at a rate of 90 Hz, while the FAPs were obtained at a rate of 30 Hz, the FAPs were interpolated in order to obtain synchronized data.
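Since the paper only states that the FAPs were interpolated to match the 90 Hz MFCC rate, the sketch below assumes simple linear interpolation of each visual feature dimension as one plausible realization of that step.

```python
import numpy as np

def upsample_visual(feats_30hz, factor=3):
    """Linearly interpolate (T_v, D) visual features from 30 Hz to 90 Hz so that
    each audio frame has a synchronized visual observation. Linear interpolation
    is an assumption; the paper does not specify the interpolation method."""
    feats = np.asarray(feats_30hz, dtype=float)
    T_v = feats.shape[0]
    t_video = np.arange(T_v)                                   # video frame times
    t_audio = np.linspace(0, T_v - 1, num=factor * (T_v - 1) + 1)
    return np.stack([np.interp(t_audio, t_video, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)

# 18-dimensional visual observations (6 PCA weights + deltas + accelerations).
fap_feats = np.random.randn(100, 18)
print(upsample_visual(fap_feats).shape)   # (298, 18)
```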
Fig. 9. FAPUs [14].
Fig. 10. PNME calculated for the FAPs describing the horizontal movement of the outer lips for the following FAP sequences: (a) VHMM synthesized and original and (b) CHMM synthesized and original.

Fig. 11. PNME calculated for the FAPs describing the vertical movement of the outer lips for the following FAP sequences: (a) VHMM synthesized and original and (b) CHMM synthesized and original.

1) Multistream HMM: We used a multistream HMM and a state log-likelihood recombination method to perform audio-visual integration, since it allows for easy modeling of the reliability of the audio and visual streams. The audio and visual data streams were used to separately model the audio and visual information [3]. The state emission probability of the multistream HMM is given by

$$b_j(o^a_t, o^v_t) = \prod_{s \in \{a, v\}} \left[ \sum_{m=1}^{M_s} w_{jsm}\, \mathcal{N}\!\left(o^s_t; \mu_{jsm}, \Sigma_{jsm}\right) \right]^{\lambda_s} \qquad (12)$$

where $j$ denotes the context-dependent state, $M_s$ denotes the number of mixtures in stream $s$, $w_{jsm}$ denotes the weight of the $m$th mixture of stream $s$, and $\mathcal{N}(o^s_t; \mu_{jsm}, \Sigma_{jsm})$ is a multivariate Gaussian with mean vector $\mu_{jsm}$ and diagonal covariance matrix $\Sigma_{jsm}$. The nonnegative stream weights are denoted by $\lambda_s$, and they depend on the modality $s$. We assumed that the stream weights satisfy

$$\lambda_a + \lambda_v = 1. \qquad (13)$$
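A minimal sketch of the state-level log-likelihood recombination implied by (12) and (13), with single Gaussians standing in for the mixture densities actually used; the stream dimensions, pdf parameters, and weight value are illustrative assumptions.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def av_state_loglik(o_a, o_v, audio_pdf, visual_pdf, lambda_a):
    """Weighted combination of per-stream log-likelihoods for one HMM state:
    log b_j = lambda_a * log p(o_a) + lambda_v * log p(o_v), with
    lambda_a + lambda_v = 1. Single Gaussians replace the paper's mixtures."""
    lambda_v = 1.0 - lambda_a
    return (lambda_a * diag_gauss_loglik(o_a, *audio_pdf)
            + lambda_v * diag_gauss_loglik(o_v, *visual_pdf))

# Illustrative 39-dim audio and 18-dim visual observations for one frame.
o_a, o_v = np.zeros(39), np.zeros(18)
audio_pdf = (np.zeros(39), np.ones(39))      # (mean, diagonal variance)
visual_pdf = (np.zeros(18), np.ones(18))
# Rely more on audio when the acoustic channel is clean (larger lambda_a).
print(av_state_loglik(o_a, o_v, audio_pdf, visual_pdf, lambda_a=0.8))
```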
In order to perform multistream HMM training, the stream weights must be chosen a priori. They were roughly optimized by minimizing the word error rate (WER) on a development data set.

2) AV-ASR Experiments: All ASR systems were developed using the HTK Toolkit, Version 3.2 [31]. The experiments used the portion of the audio-visual Bernstein database with the female speaker. Context-dependent phoneme models, biphones, were used as speech units. The HMMs used for the state-synchronous multistream systems were left-to-right, with three states. Iterative mixture splitting was performed to obtain the final nine-mixture biphone HMMs. Approximately 80% of the data was used for training, 18% for testing, and 2% as a development set for obtaining roughly optimized stream weights, the word insertion penalty, and the grammar scale factor. The bigram language model used for decoding was created based on the transcriptions of the training data set, and its perplexity was approximately 40. The same training and testing procedures were used for both the audio-only and the audio-visual experiments.

To test the algorithm over a wide range of SNRs (0, 10, 20, and 30 dB), white Gaussian noise was added to the audio signals. All results were obtained using HMMs trained in matched conditions, by corrupting the training data with the same level of noise as used for corrupting the testing data. This approach was used in order to accurately measure the influence of the visual data on ASR performance. The ASR results are summarized in Fig. 13. It can be observed that the A-ASR performance is severely affected by additive noise. In all AV-ASR experiments, the stream weights were estimated experimentally by minimizing the WER on the development data set. The AV-ASR system was trained using the original FAP sequences as visual features. The testing was performed for all three sets of FAP sequences (original, synthesized using CHMMs, and synthesized using VHMMs). The AV-ASR results obtained for the different FAP sequences are also shown in Fig. 13. As can be clearly seen, all AV-ASR systems perform better than the A-ASR system for all SNR values. The relative reduction in WER achieved by the AV-ASR system when original FAPs were used for testing, compared to the audio-only WER, ranges from 20% for the noisy audio with an SNR of 30 dB to 21% for an SNR of 0 dB.
Fig. 12. Audio-visual system for ASR.
Fig. 13. Audio-only and audio-visual system WER versus SNR.

An improvement in the ASR performance is also achieved when FAP sequences synthesized by VHMMs were used, although not as large as in the case of the original FAPs. It is important to point out that an improvement in the performance of the AV-ASR system was also achieved when FAP sequences synthesized using CHMMs were used for testing. The relative reduction in WER compared to the audio-only WER ranges from 12% for the noisy audio with an SNR of 30 dB to 13% for an SNR of 0 dB. These results clearly demonstrate that the speechreading information present in the synthesized FAP sequences is capable of improving speech understanding. It is also important to point out that this considerable performance improvement was achieved with the use of only six-dimensional visual feature vectors. Speech recognition results may possibly be further improved by better adjusting the audio and visual stream weights [30].

V. CONCLUSION

We proposed a system for synthesizing visual movements from the acoustic speech signal. The framework allows for AHMMs and VHMMs to be trained independently and to have different topologies. The results demonstrated considerable improvement in temporal accuracy (lip synchronization) over the conventional temporal scaling. The objective FAP comparison results confirmed the strong similarity between the synthesized FAPs and the original FAPs. We also performed AV-ASR experiments using FAPs synthesized by CHMMs as visual features in order to confirm the presence of useful speechreading information in them. The ASR performance improved over the wide range of SNRs tested, therefore confirming the usefulness of the synthesized FAPs for speechreading. Clearly, subjective tests can provide additional evaluation of the proposed system. A clear advantage of the use of FAPs is that any MPEG-4 decoder can be used to synthesize the visual articulators. Additional FAPs (describing inner lip and tongue movement) and different HMM topologies can be used with the proposed CHMM system. We also plan to perform experiments in order to determine the effect that noisy speech or speech uttered in a nonideal environment (such as Lombard speech, speech under stress, or speech in a car) would have on the quality of the FAPs synthesized using the CHMM system.
REFERENCES

[1] R. Lippman, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1–15, July 1997.
[2] Y. Gong, "Speech recognition in noisy environments: A survey," Speech Commun., vol. 16, pp. 261–291, 1995.
[3] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[4] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.
[5] F. Lavagetto, "Converting speech into lip movements: A multimedia telephone for hard of hearing people," IEEE Trans. Rehab. Eng., vol. 3, pp. 1–14, Mar. 1995.
[6] A. Simons and S. Cox, "Generation of mouth shapes for a synthetic talking head," Proc. Inst. Acoust., vol. 12, pp. 475–482, 1990.
[7] T. Chen and R. R. Rao, "Audio-visual integration in multimedia communication," Proc. IEEE, vol. 86, pp. 837–852, May 1998.
[8] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proc. ACM SIGGRAPH, 1997, pp. 353–360.
[9] K. Choi and J.-N. Hwang, "Baum-Welch hidden Markov model inversion for reliable audio-to-video conversion," in Proc. IEEE 3rd Workshop Multimedia Signal Processing, 1999, pp. 175–180.
[10] S. Moon and J.-N. Hwang, "Robust speech recognition based on joint model and feature space optimization of hidden Markov models," IEEE Trans. Neural Networks, vol. 8, pp. 194–204, Mar. 1997.
[11] S. Nakamura, "Fusion of audio-visual information for integrated speech processing," in Proc. Third Int. Conf. Audio- and Video-Based Biometric Person Authentication, Halmstad, Sweden, 2001, pp. 127–143.
[12] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. Neural Networks, vol. 3, pp. 900–915, July 2002.
[13] J. J. Williams, A. K. Katsaggelos, and D. C. Garstecki, "Subjective analysis of an HMM-based visual speech synthesizer," in Proc. SPIE Conf. Human Vision and Electronic Imaging, vol. 4299, San Jose, CA, Jan. 2001, pp. 544–555.
[14] Text for ISO/IEC FDIS 14 496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.
[15] Text for ISO/IEC FDIS 14 496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.
[16] G. A. Abrantes, FACE-Facial Animation System, Version 3.3.1. Instituto Superior Tecnico, 1997–98.
[17] F. Lavagetto and R. Pockaj, "An efficient use of MPEG-4 FAP interpolation for facial animation at 70 bits/frame," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1085–1097, Oct. 2001.
[18] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-visual speech recognition using MPEG-4 compliant visual features," EURASIP J. Appl. Signal Processing, pp. 1213–1227, 2002.
[19] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-visual continuous speech recognition using MPEG-4 compliant visual features," in Proc. Int. Conf. Image Processing, Rochester, NY, Sept. 2002, pp. 960–963.
[20] L. E. Bernstein, Lipreading Corpus V-VI: Disc 3. Washington, DC: Gallaudet University, 1991.
[21] (1990) Carnegie Mellon University Pronunciation Dictionary. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[23] E. Petajan, "Automatic lipreading to enhance speech recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, CA, 1985, pp. 40–47.
[24] D. G. Stork and M. E. Hennecke, Eds., Speechreading by Man and Machine. New York: Springer-Verlag, 1996.
[25] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, Workshop Audio-Visual Speech Recognition, Final Report. Baltimore, MD, Oct. 2000.
[26] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proc. IEEE, vol. 91, pp. 1306–1326, Sept. 2003.
[27] G. Potamianos, J. Luettin, and C. Neti, "Hierarchical discriminant features for audio-visual LVCSR," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, 2001, pp. 165–168.
[28] C. Bregler and Y. Konig, ""Eigenlips" for robust speech recognition," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Adelaide, Australia, 1994, pp. 669–672.
[29] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. Multimedia, vol. 2, pp. 141–151, Mar. 2000.
[30] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin, "Weighting schemes for audio-visual fusion in speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, 2001, pp. 165–168.
[31] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book. Cambridge, U.K.: Entropic Ltd., 2002.
[32] P. S. Aleksic and A. K. Katsaggelos, "Speech-to-video synthesis using facial animation parameters," in Proc. Int. Conf. Image Processing, Barcelona, Spain, Sept. 2003, pp. 1–4.
Petar S. Aleksic (S’01) received the B.S. degree in electrical engineering from Belgrade University, Belgrade, Yugoslavia, in 1999 and the M.S. degree in electrical engineering from Northwestern University, Evanston, IL, in 2001, where he is currently working toward the Ph.D. degree in electrical engineering. He has been a member of the Image and Video Processing Lab, Northwestern University, since 1999. His current research interests include audio-visual speech recognition, speech-to-video synthesis, audio-visual biometrics, visual feature extraction, multimedia communications, computer vision, and pattern recognition.
Aggelos K. Katsaggelos (S’80–M’85–SM’92–F’98) received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1981 and 1985, respectively. In 1985 he joined the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, where he is currently a Professor, holding the Ameritech Chair of Information Technology. He is also the Director of the Motorola Center for Telecommunications. During the 1986–1987 academic year, he was an Assistant Professor with the Department of Electrical Engineering and Computer Science, Polytechnic University, Brooklyn, NY. His current research interests include signal recovery and compression, and multimedia signal processing and communications. He is the editor of Digital Image Restoration (Heidelberg, Germany: Springer-Verlag, 1991), coauthor of Rate-Distortion Based Video Compression (Boston, MA: Kluwer, 1997), and coeditor of Recovery Techniques for Image and Video Compression and Transmission (Boston, MA: Kluwer, 1998). He is the coinventor of eight international patents. Dr. Katsaggelos is an Ameritech Fellow, a member of the Associate Staff, Department of Medicine, at Evanston Hospital, and a member of SPIE. He is currently a member of the Publication Board of the IEEE Signal Processing Society, the IEEE TAB Magazine Committee, the Publication Board of the IEEE Proceedings, the Scientific Board, Centre for Research and Technology Hellas-CE.R.T.H, the IEEE Technical Committees on Visual Signal Processing and Communications, and on Multimedia Signal Processing, an Editorial Board Member of Academic Press, Marcel Dekker, Signal Processing Series, Applied Signal Processing, the Computer Journal, and editor-in-chief of the IEEE Signal Processing Magazine. He has served as a member of the IEEE Signal Processing Society Board of Governors (1999–2001), an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990–1992), an area editor for the journal Graphical Models and Image Processing (1992–1995), a member of the Steering Committees of the IEEE TRANSACTIONS ON IMAGE PROCESSING (1992–1997) and the IEEE TRANSACTIONS ON MEDICAL IMAGING (1990–1999), and a member of the IEEE Technical Committee on Image and Multi-Dimensional Signal Processing (1992–1998). He has served as the General Chairman of the 1994 Visual Communications and Image Processing Conference (Chicago, IL), and as technical program co-chair of the 1998 IEEE International Conference on Image Processing (Chicago, IL). He is the recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).