
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 4, JULY 2002

An HMM-Based Speech-to-Video Synthesizer

Jay J. Williams, Member, IEEE, and Aggelos K. Katsaggelos, Fellow, IEEE

Abstract—Emerging broadband communication systems promise a future of multimedia telephony. The addition of visual information, for example, during telephone conversations would be most beneficial to people with impaired hearing and the ability to speechread. For the present, it is useful to consider the problem of generating the critical information useful for speechreading, based on existing narrowband communications systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements given the acoustic speech signal. In this application, the acoustic speech signal is analyzed and the corresponding articulatory movements are synthesized for speechreading. This paper describes a hidden Markov model (HMM)-based visual speech synthesizer designed to improve speech understanding. The key elements in the application of HMMs to this problem are the decomposition of the overall modeling task into key stages and the judicious determination of the observation vector's components for each stage. The main contribution of this paper is a novel correlation HMM model that is able to integrate independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows increased flexibility in choosing model topologies for the acoustic and visual HMMs. Moreover, the proposed model reduces the amount of training data required compared to early integration modeling techniques. Results from objective experimental analysis show that the proposed approach can reduce time-alignment errors by 37.4% compared to a conventional temporal scaling method. Furthermore, subjective results indicated that the proposed model can increase speech understanding.

Index Terms—Audio–visual recognition, hidden Markov model (HMM) modeling, multimodal signal processing, visual synthesis.

Manuscript received June 2, 2001; revised October 31, 2001. This work was supported by Motorola. The authors are with the Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60201 USA (e-mail: [email protected]; [email protected]). Publisher Item Identifier S 1045-9227(02)04413-2.

I. INTRODUCTION

In the last 25 years, telephone communication has become an essential part of our daily lives. Today, the telephone is by far the most commonly used communication device. For the population with impaired hearing, telephone use continues to be a major problem. It is estimated that 26.1 million individuals (10% of the population) in the United States have impaired hearing [1]. Over half of the U.S. population with a hearing impairment is age 65 years or older [2]. This statistic may not seem significant today; however, according to the U.S. Census Bureau, it is estimated that between the years 2010 and 2030, the post-World War II baby-boom generation will enter the 65 and older age group (elderly category). As a result, it is expected that the portion of the population with impaired hearing will increase dramatically.

In a recent study [3], 81% of the respondents with impaired hearing reported that their hearing condition had a moderate to severe effect on their use of the telephone. Currently available technology such as the Telecommunication Device for the Deaf (TDD) and relay systems limits the spontaneity of interpersonal telephone communication. Speech-assisted devices and communication techniques are used to compensate for some of the limitations of telephone systems. Some individuals are able to use an amplified handset or a hearing aid in conjunction with the telephone [3]. For those who have severe hearing loss, TDD and relay systems are the only alternative. However, TDD equipment is not always available in public facilities such as medical offices or in small/commercial businesses. This can limit a person's independence and unfairly restrict access to important services. Many listeners with impaired hearing may have the ability to speak clearly if their loss occurred at a later stage in their life. It would be beneficial to these individuals to have the opportunity to speak their responses rather than type them.

The addition of visual information during telephone conversations would be beneficial to individuals with impaired hearing who have the ability to speechread. Summerfield [4] defines lipreading, audio–visual speech perception, and speechreading as follows: "Lipreading is the perception of speech purely visually by observing the talker's articulatory gestures. Audio–visual speech perception is the perception of speech by combining lipreading with audition. Speechreading embraces a larger set of activities. It is the understanding of speech by observing the talker's articulation and facial and manual gestures, and may also include audition." The results of Summerfield's experiments showed that sentence recognition accuracy in noise can be improved by 43% by speechreading and 31% by audio–visual speech perception when compared to the auditory-only scenario. He concludes that [5]: "lipreading can compensate quite well for some of the consequences of moderate hearing losses." It has been shown that when auditory cues are augmented by visual cues, such as mouth shapes, most listeners with impaired hearing can achieve a higher level of recognition [6]. Therefore, it is desirable that these listeners have access to the visual cues for speechreading.

Current videophones can transmit visual information; however, they do not provide the full motion necessary to exploit speechreading cues. Furthermore, the speech signals in videotelephone systems are compressed in comparison to normal telephone speech, which might lead to a loss of speech quality. As a result of bandwidth limitations and storage constraints, videotelephone system coders do not encode every visual frame. Therefore, the received video sequence exhibits jerky motion and loss of lip synchronization [7].

The problems associated with speechreading are not limited to telephone and video communication systems. Emerging broadband communication systems promise a future of multimedia telephony.



The increasing demand for multimedia applications (i.e., electronic commerce, virtual reality, entertainment, etc.) will further limit access to these services for persons with impaired hearing if measures are not taken to address the needs of the hearing-impaired community.

In this work, the problem of synthesizing visual articulatory lip movements from acoustic measurements of the speech signal is addressed. Employing hidden Markov models (HMMs) and the Karhunen–Loève (KL) transform of the video sequence, the stochastic behavior of the speech and visual facial features is modeled. Using this stochastic model, acoustic signals can be mapped into the visual domain for increased speech understanding. To evaluate the effectiveness of the proposed approach, adults with normal hearing completed a series of audio and audio/visual sentence recognition perception tests. Results demonstrate that speech understanding can increase with the use of this speech-to-visual synthesizer. In addition, perception tests were carried out to provide insight into the perceptual boundaries related to speechreading.

The proposed visual synthesizer can extend the use of the telephone network to listeners with impaired hearing. Since it only impacts the receiving end of the transmission channel, it can be easily integrated into the existing telecommunication networks. It also reduces the effect of video jitter (i.e., jerky motion) and lip synchronization problems associated with today's videophones. Since our system performs image reconstruction rather than computer graphics animation, the reconstructed images closely match the visual quality of the original images in the Bernstein Lipreading Corpus [8]. As a result, the model eliminates the complexity associated with a computer graphics facial animation system and presents a more aesthetically pleasing video sequence. Furthermore, this model is able to retain more visual attributes in comparison to techniques which measure lip parameters (e.g., lip width, height, inner lip contour area, etc.). The novel modeling techniques presented may one day lead to a communication apparatus that will allow persons with impaired hearing to engage in fluent telephone conversations.

In Section II a discussion of the relevant work and techniques used for HMM-based speech-driven facial animation is presented. In Section III a novel HMM-based visual speech synthesizer is introduced. Section III-C discusses the procedures used to train the various HMM models used in this approach. In Sections IV and V discussions of the objective and subjective tests are presented, respectively. Finally, Section VI concludes with a summary of the research results, contributions, and future research directions for HMM-based visual speech synthesis.

II. BACKGROUND

Technological advances in text-to-speech (TTS) synthesis, speech recognition, and computer graphics have given researchers a new approach to synthesizing talking heads. The HMM is the predominant data-based statistical approach used for speech recognition. It is generally used to calculate the probability of generating an observation sequence.


However, since the states in an HMM model the observations' probability density functions (pdfs), they can be used to emit observations. Unlike neural networks (NNs), an HMM makes it easier to understand the mapping between the physical phenomenon and the model structure. It is also the modeling approach used in this paper.

In 1990 Simons and Cox [9] developed a speech-driven synthetic head using HMM techniques. In their system they used 50 phonetically rich sentences which yielded 10 000 and 5000 speech and visual vectors, respectively. Next, they used a vector quantization (VQ) algorithm to produce two codebooks of 16 and 64 speech and image codes, respectively. The results of their quantization led them to create a fully connected Markov model consisting of 16 states, each representing a particular vector-quantized mouthshape. Then they determined the probability of each mouthshape producing each of the 64 VQ speech vectors. They estimated these probabilities directly from the image and speech data. Similarly, they used the image and speech data directly to estimate the transition probabilities and the joint occurrences of the speech and mouthshape VQ symbols. After training the system, they could generate a visual mouthshape sequence by calculating the most likely state sequence given a sequence of input VQ speech observations via the Viterbi algorithm.

Bregler et al. [10] used HMM techniques to create a speech-driven facial animation system called Video Rewrite. This system is designed to automatically synthesize facial movements with proper lip synchronization. In the analysis stage, they first trained a set of phoneme acoustic HMMs using speech from the TIMIT speech database. Next, they created a time-aligned phoneme-level transcription of an audio–visual training database. This task was accomplished by using the audio track of the database in conjunction with the trained acoustic HMMs and the Viterbi algorithm. Then they created a database of triphone video segments (e.g., trivisemes) using the time-aligned phoneme-level transcription and the video track of the database. In the synthesis stage, they first aligned new speech using the trained acoustic HMMs. Then they selected the closest triviseme segments from the video database according to distance metrics. Finally, warping techniques were used to smooth the video segments and synchronize them with the speech signal.

Chen and Rao [11] have also used HMMs. In contrast to Bregler et al. [10], they trained their HMMs using joint audio–visual observation parameters. These HMMs were then used to realize acoustic HMMs. This was accomplished by integrating over the visual parameters to create a new set of pdfs. In the synthesis process, the acoustic HMMs are used to realize an optimal state sequence from an input acoustic speech sequence. Finally, the visual observation is estimated from each state in the optimal state sequence through integration. Since Chen and Rao used joint audio–visual HMMs for training, their system is inherently speaker dependent.

Tamura et al. [12] designed a visual speech synthesizer similar to the one in [11]. Instead of using integration to produce acoustic HMMs, they chose to calculate the likelihood for HMMs using only the auditory parameter stream.


Fig. 1. Block diagram of an HMM-based speech-to-visual synthesizer.

In the analysis phase, they extracted mel-cepstral features from the acoustic speech signal and ten visual parameters around the lip contours. In the synthesis phase, they first performed syllable-based (e.g., Japanese CV syllables) recognition using the auditory speech signal. As a result, they obtained a sequence of syllables and state duration information. According to the obtained syllable sequence, they constructed a sentence HMM by concatenating syllable HMMs. Finally, visual speech parameters were generated from the sentence HMM using an ML-based parameter generation algorithm [13]. This algorithm is designed to generate visual parameters which reflect both static and dynamic features. The result is a synthetic lip sequence which is smooth and realistic.

Yamamoto et al. [14] proposed an HMM-based lip movement synthesis model that incorporated lip movement coarticulation effects. They started by training acoustic phoneme HMMs from an audio–visual synchronized Japanese database. Next, they aligned the acoustic speech parameters into HMM state sequences using the Viterbi algorithm. Afterward, they classified the frames into visemes using succeeding phonemes for context. The associated lip parameters synchronized with the acoustic phone parameters were then averaged to form a visual database. In the synthesis phase, an acoustic speech signal is parameterized and aligned into HMM state sequences using the Viterbi algorithm. Finally, the viseme classes are determined and the associated lip parameters are extracted to form the visual output sequence. In a later paper [15] they proposed a method that extends their work using the expectation-maximization (EM) algorithm.

Choi and Hwang [16] describe a Baum–Welch hidden Markov model inversion (HMMI) algorithm for synthesizing visual parameters from speech. The HMMI algorithm was first proposed by Moon and Hwang [17] for robust speech recognition. Choi and Hwang extended the algorithm for visual speech synthesis and made comparisons against the Gaussian mixture-based HMM inversion algorithm proposed by Chen and Rao [11]. The results of their algorithm showed that the HMMI scheme performed better than the Gaussian mixture-based HMM method [11] at estimating visual speech parameters.

III. VISUAL SPEECH SYNTHESIS

Fig. 1 shows a block diagram of the HMM-based speech-to-visual synthesizer proposed in this paper [18]–[20]. In this approach, the acoustic speech signal is first preprocessed, then acoustic feature vectors are extracted to construct an acoustic observation sequence. Next, the acoustic HMMs (AHMMs) are used to realize an acoustic state sequence which best describes the input acoustic observation sequence. The resulting acoustic state sequence is mapped by a novel correlation model (see the boxed area in Fig. 1) into a visual state sequence. Finally, the visual state sequence and the corresponding visual HMMs (VHMMs) are used to produce a visual observation sequence for speechreading. In Section III-A the various techniques for integrating audio and visual information for audio–visual speech recognition and their application to visual synthesis are discussed. In Section III-B a novel HMM-based approach for speech-driven facial synthesis is discussed.

A. Multimodal Signal Decomposition

HMMs have been used to convert an acoustic speech signal into visual speech parameters [16], [21]–[24]. The key elements in the application of HMMs to this problem are 1) the decomposition of the overall modeling task into key stages and 2) the judicious determination of the observation vector's components for each stage. In this paper, an acoustic observation sequence of length $T_a$ is denoted as

$$\mathbf{O}^a = \{\mathbf{o}^a_1, \mathbf{o}^a_2, \ldots, \mathbf{o}^a_{T_a}\} \quad (1)$$

with $\mathbf{o}^a_t \in \mathbb{R}^{D_a}$ and $D_a$ the dimension of each acoustic observation vector. Similarly, a visual observation sequence of length $T_v$ is denoted as

$$\mathbf{O}^v = \{\mathbf{o}^v_1, \mathbf{o}^v_2, \ldots, \mathbf{o}^v_{T_v}\} \quad (2)$$

with $\mathbf{o}^v_t \in \mathbb{R}^{D_v}$ and $D_v$ the dimension of each visual observation vector in the sequence. Similarly, an HMM is denoted as

$$\lambda = (A, B, \pi) \quad (3)$$

where $A$, $B$, and $\pi$ represent the state transition matrix, observation probability density function (pdf), and initial state occupancy, respectively.
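To make the notation above concrete, the following minimal Python sketch (not the authors' code; all sizes and names are illustrative) shows how the observation sequences in (1)–(2) and the HMM parameter set $\lambda = (A, B, \pi)$ in (3) might be represented, with diagonal-covariance Gaussian output pdfs standing in for $B$:

```python
import numpy as np

# Acoustic and visual observation sequences as in (1)-(2). The sizes are
# illustrative: T_a frames of D_a-dimensional acoustic features and T_v frames
# of D_v-dimensional visual features, with T_a > T_v because the acoustic
# frame rate exceeds the video frame rate.
T_a, D_a = 273, 39
T_v, D_v = 91, 60
O_a = np.zeros((T_a, D_a))   # would hold mel-cepstral + delta + delta-delta features
O_v = np.zeros((T_v, D_v))   # would hold eigenlip coefficients + delta + delta-delta

def make_hmm(n_states, dim):
    """lambda = (A, B, pi) as in (3); B is represented by per-state Gaussian
    means and diagonal covariances."""
    return {
        "A": np.full((n_states, n_states), 1.0 / n_states),   # state transitions
        "means": np.zeros((n_states, dim)),                   # output pdf means
        "covars": np.ones((n_states, dim)),                   # diagonal covariances
        "pi": np.full(n_states, 1.0 / n_states),              # initial occupancy
    }
```

The later sketches in this section reuse this dictionary layout for an HMM.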


Fig. 2. Integration techniques used for audio–visual speech recognition.

When modeling bimodal signals, it is important to take into consideration the interaction between the modalities and the constraints they may place on the modeling framework. In the case of audio–visual speech recognition, the decomposition of acoustic and visual information can be contrasted into three major approaches as illustrated in Fig. 2. In all three integration schemes, the feature-extraction block represents a process which is capable of extracting the fundamental attributes necessary to describe the physical phenomenon accurately. The discriminating block represents any discriminating process (e.g., VQ, NN, HMM, etc.) that can accurately discriminate the attributes of a given class.

In the early integration approach, the audio and visual features are concatenated into a single observation vector

$$\mathbf{o}_t = \left[ (\mathbf{o}^a_t)^T, (\mathbf{o}^v_t)^T \right]^T. \quad (4)$$

The use of concatenated observation features presents problems for audio–visual speech recognition and synthesis. An early integration approach assumes the auditory and visual parameters have similar temporal dynamics. Unfortunately, this assumption is far from the truth. It is well known that parameters describing visual articulatory movements have much lower bandwidth than the acoustic speech signal. Furthermore, stimulus offset asynchrony (SOA) experiments have shown that asynchrony is the norm in bimodal speech [25]. This asynchrony is the result of the retina's and basilar membrane's response to light and sound, respectively. The basilar membrane's response to sound precedes the retina's response to light. Therefore, the discriminating model is forced to not only discriminate the spatial difference between the two modalities, but also compensate for the asynchrony that also exists. Furthermore, early integration techniques require more training data than independently trained acoustic and visual models. This is secondary to increasing the acoustic observation space from $D_a$ to $D_a + D_v$ dimensions. Experiments carried out by Stork and Hennecke [26] and Matthews et al. [27] have shown evidence that jointly trained audio–visual HMMs do not perform as well as independently trained models for robust speech recognition, supporting Massaro's SOA experiments [25].

The late integration approach considers the case of independent auditory and visual discriminating processes, allowing greater accuracy in modeling the spatial and temporal dynamics of each modality. This approach also has the additional benefit of allowing discriminating process architectures to be chosen independently of each other; however, a third discriminating process is required to integrate the two modalities. Typically, for robust audio–visual recognition, the third discriminating process analyzes the acoustic noise level and weighs the acoustic and visual discriminating processes according to the prevailing acoustic conditions. For example, when the acoustic noise level is above some threshold, more emphasis is placed on the visual discriminating process and vice versa. Similarly, if the lighting conditions in a room are inadequate for visual recognition, the integrating discriminating process would place more emphasis on the auditory discriminating process.

The intermediate integration approach draws on characteristics from both early and late integration approaches. This approach is best represented by Boltzmann zippers [26], where two linear chains of neural units are connected so that they interact with each other. This architecture effectively implements integration that is intermediate between early and late integration. Currently, there is no significant evidence to draw strong conclusions about its performance compared to early and late integration approaches for audio–visual speech recognition. However, Stork and Hennecke [26] have provided preliminary results which demonstrate that intermediate integration can achieve higher recognition accuracy than both early and late integration approaches.

All three approaches discussed above are capable of outperforming single-modality speech recognition systems.
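As a rough illustration of the difference between these schemes (a sketch only; the function names and the nearest-frame alignment are assumptions, not the authors' implementation), early integration concatenates the two feature streams per frame, which forces the slower visual stream onto the acoustic frame grid, while late integration combines independently computed class scores with a reliability weight:

```python
import numpy as np

def early_integration(O_a, O_v):
    """Concatenate audio and visual features frame by frame as in (4).
    The lower-rate visual stream is repeated (nearest frame) up to the acoustic
    frame rate; this crude alignment is exactly where the asynchrony problem
    discussed above enters."""
    T_a, T_v = O_a.shape[0], O_v.shape[0]
    idx = np.minimum((np.arange(T_a) * T_v) // T_a, T_v - 1)
    return np.hstack([O_a, O_v[idx]])                  # shape (T_a, D_a + D_v)

def late_integration(audio_scores, visual_scores, audio_weight):
    """Fuse per-class scores from independently trained audio and visual models;
    audio_weight would typically be derived from the estimated acoustic noise level."""
    return audio_weight * np.asarray(audio_scores) + \
           (1.0 - audio_weight) * np.asarray(visual_scores)
```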



This comes as no surprise since human speech perception is robust because human observers are able to use multiple sources of information for speech understanding [28]. Therefore, it is reasonable to expect multimodal speech recognition systems to outperform single-modality systems.

B. Correlation Model

In Section III-A it was discussed how late integration based audio–visual speech recognizers outperform early integration based recognizers. It can be surmised that independently trained acoustic and visual HMMs can outperform jointly trained audio–visual HMMs for the purpose of achieving speech-to-visual synthesis. The enclosed area shown in Fig. 1 represents the proposed correlating process that is capable of coupling independently trained acoustic and visual HMMs for synthesis. The ultimate goal of this process is to determine the most likely visual state sequence $\mathbf{q}^v$ that best describes the acoustic state sequence $\mathbf{q}^a$.

The synthesis process begins by aligning an acoustic observation sequence $\mathbf{O}^a$ using the acoustic model $\lambda^a$ and the Viterbi decoding algorithm [29] to produce the most likely acoustic state sequence

$$\mathbf{q}^a = \arg\max_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{O}^a, \lambda^a). \quad (5)$$

Similarly, given a visual observation sequence $\mathbf{O}^v$ and the visual model $\lambda^v$, the Viterbi decoding algorithm can be used to realize the most likely visual state sequence

$$\mathbf{q}^v = \arg\max_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{O}^v, \lambda^v). \quad (6)$$

Note that the acoustic frame rate $f_a$ is typically greater than the visual frame rate $f_v$, or the corresponding periods are related as

$$\tau_a = \frac{1}{f_a} < \tau_v = \frac{1}{f_v}. \quad (7)$$

Therefore, in order to estimate the visual state sequence $\mathbf{q}^v$ given the acoustic state sequence $\mathbf{q}^a$, either the acoustic state sequence must be down-sampled or the visual state sequence must be up-sampled by a factor of

$$\eta = \frac{f_a}{f_v}. \quad (8)$$

Intuition says that the visual state sequence $\mathbf{q}^v$ should be up-sampled to prevent the reduction of spatial information encoded in the acoustic state sequence. However, early experiments found that up-sampling the visual state sequence resulted in increased state duration errors. This is a direct result of the Markovian constraints placed on state duration. That is, the duration probability density for an HMM state is given by

$$p_i(d) = (a_{ii})^{d-1}(1 - a_{ii}) \quad (9)$$

where $d$ represents the number of consecutive observations in state $i$, and $a_{ii}$ the state self-transition probability. It is known that for most physical signals, the exponential state duration density is not appropriate. Therefore, up-sampling the visual state sequence compounds this problem and introduces duration constraints in the Viterbi decoding algorithm.

For example, when a state sequence is up-sampled, the Viterbi decoding algorithm must be augmented to enforce a minimum state duration that is a multiple of $\eta$. In practice, this constraint is difficult to strictly enforce. Therefore, the constraint is relaxed by assigning penalties to paths which do not meet the minimum state duration constraint. On the other hand, down-sampling the acoustic state sequence is also undesirable because it introduces quantization errors (e.g., skipped states). Therefore, a compromise must be found.

The effects of quantization can be reduced by first converting the state sequence $\mathbf{q}^a$ into an acoustic observation sequence

$$\hat{\mathbf{O}}^a = \{\hat{\mathbf{o}}^a_1, \ldots, \hat{\mathbf{o}}^a_{T_a}\}, \qquad \hat{\mathbf{o}}^a_t = E[\mathbf{o} \mid q^a_t, \lambda^a]. \quad (10)$$

Then the resultant observation sequence is filtered using cascaded median and Hanning filters to smooth the parameters (see Fig. 1). Afterwards, the observation sequence is down-sampled by retaining one out of every $\eta$ observation vectors to realize

$$\tilde{\mathbf{O}}^a = \{\hat{\mathbf{o}}^a_{\eta}, \hat{\mathbf{o}}^a_{2\eta}, \ldots\}. \quad (11)$$

The final mapping between the subsampled acoustic observation sequence and the optimal visual state sequence is determined according to

$$\hat{\mathbf{q}}^v = \arg\max_{\mathbf{q}^v} P(\mathbf{q}^v \mid \tilde{\mathbf{O}}^a). \quad (12)$$

The direct solution to this problem can be found by searching through all valid visual state sequences of length $T_v$. However, the search space for $\hat{\mathbf{q}}^v$ can be prohibitively large. Therefore a novel approach is introduced using a correlation HMM, $\lambda^c$, a statistical mapping of the acoustic state space into the visual state space, as a source of search constraint. The Viterbi decoding algorithm can now be used to find a suboptimal solution

$$\hat{\mathbf{q}}^v = \arg\max_{\mathbf{q}^v} P(\tilde{\mathbf{O}}^a, \mathbf{q}^v \mid \lambda^c). \quad (13)$$

The key elements to solving the audio–visual integration problem lie in the training procedure and the model architecture of $\lambda^c$. In order to ensure that the correlation model is able to approximate the optimal visual state sequence $\mathbf{q}^v$, the number of states in $\lambda^c$ must equal that of $\lambda^v$. Furthermore, since the underlying Markov process used to switch between states in $\lambda^v$ produces $\mathbf{q}^v$, $\lambda^c$ must also inherit the same Markovian constraint. As a result of the above constraints, only the observation pdfs and mixture weights of $\lambda^c$ have to be estimated. Using the visual state space as a source constraint, the training procedure for estimating the observation pdfs and mixture weights for $\lambda^c$ is as follows.

1) Train the acoustic and visual HMMs ($\lambda^a$ and $\lambda^v$) independently.
2) Use the trained $\lambda^v$ in conjunction with the Viterbi decoding algorithm to force align the visual training data.
3) Repeat step 2); however, replace the visual model with the acoustic model $\lambda^a$ and use the acoustic observation sequences associated with the visual training data.


Fig. 3. Training flowchart for an HMM-based visual speech synthesizer.

4) Generate subsampled acoustic observation sequences $\tilde{\mathbf{O}}^a$ using the aligned acoustic state sequences obtained from step 3).
5) Create a new HMM, $\lambda^c$, using both the initial and transition probability distributions of the visual HMM $\lambda^v$; that is

$$\pi^c = \pi^v \quad (14)$$

$$A^c = A^v. \quad (15)$$

6) Use the aligned visual state sequence generated in step 2) as a constraint to distribute the subsampled acoustic observations [step 4)] among the states of $\lambda^c$.
7) Re-estimate the observation pdfs of $\lambda^c$.

The result of the training procedure is a set of correlation/integration HMMs that are capable of producing estimates of the visual articulatory movements from measurements of the acoustic input signal; a compact illustrative sketch of steps 4)–7) and of the synthesis mapping in (10)–(13) is given below. Note that HMM-based visual speech synthesizers which use early integration techniques (e.g., joint audio–visual observations) are inherently speaker dependent. In the proposed approach, however, we allow the training process to be separated, therefore allowing speaker independence.
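The following Python sketch outlines how the synthesis mapping of (10)–(13) and the CHMM construction of (14)–(15) might be realized. It is a simplified, hedged illustration, not the authors' implementation: the concatenated phone-level models are collapsed into single monolithic HMMs (using the dictionary layout from Section III-A), single-Gaussian output pdfs stand in for the shared mixtures, and the median/Hanning filter widths are assumed since the paper does not specify them.

```python
import numpy as np
from scipy.signal import medfilt

def log_gauss(O, means, covars):
    """Diagonal-Gaussian log-likelihood of every frame under every state -> (T, N)."""
    diff = O[:, None, :] - means[None, :, :]
    return (-0.5 * np.sum(diff ** 2 / covars, axis=2)
            - 0.5 * np.sum(np.log(2.0 * np.pi * covars), axis=1))

def viterbi(logB, A, pi):
    """Most likely state path given frame log-likelihoods logB of shape (T, N)."""
    logA, logpi = np.log(A + 1e-300), np.log(pi + 1e-300)
    T, N = logB.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # (previous state, next state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]
    q = np.empty(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q

def make_correlation_hmm(vhmm, init_means, init_covars):
    """Step 5): the CHMM inherits pi and A from the visual HMM, eqs. (14)-(15);
    only its output pdfs remain to be estimated (seeded here with acoustic state
    centroids, as described in Section III-C)."""
    return {"A": vhmm["A"].copy(), "pi": vhmm["pi"].copy(),
            "means": init_means.copy(), "covars": init_covars.copy()}

def synthesize_visual_states(O_a, ahmm, chmm, eta, med_width=5, han_width=5):
    """Eqs. (10)-(13): acoustic Viterbi alignment, state-to-mean conversion,
    cascaded median/Hanning smoothing, subsampling by eta, CHMM decoding."""
    q_a = viterbi(log_gauss(O_a, ahmm["means"], ahmm["covars"]),
                  ahmm["A"], ahmm["pi"])                         # eq. (5)
    O_hat = ahmm["means"][q_a]                                   # eq. (10)
    O_hat = medfilt(O_hat, kernel_size=(med_width, 1))           # median stage
    w = np.hanning(han_width)
    w /= w.sum()
    O_hat = np.apply_along_axis(lambda x: np.convolve(x, w, mode="same"), 0, O_hat)
    O_tilde = O_hat[::eta]                                       # eq. (11)
    logB = log_gauss(O_tilde, chmm["means"], chmm["covars"])
    return viterbi(logB, chmm["A"], chmm["pi"])                  # eq. (13)
```

In this reading, re-estimating the CHMM output pdfs (steps 6 and 7) amounts to collecting, for each visual state, the subsampled acoustic observations assigned to it by the force-aligned visual state sequence and refitting that state's Gaussian parameters.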

C. Training Procedure

This section discusses the training procedures used for the HMMs for visual speech synthesis. Our objective is to show that the proposed correlation model is capable of restoring synchronization between the acoustic and visual observations.

This work utilized speechreading material from Bernstein and Eberhardt [8]. The Bernstein Lipreading Corpus is a high quality database that can be easily speechread. This database includes a total of 954 sentences, of which 480 were uttered by a single male speaker and the remaining 474 sentences by a single female speaker. For each of the sentences, the database contains a speech waveform, a word-level transcription, and a video sequence time synchronized with the speech waveform. The raw visual observations framed the head and shoulders of the speaker against a light blue background. Each utterance began and ended with a period of silence. Employing HMMs and the KL transform of the video sequences, the stochastic behavior of the speech and visual facial features is modeled. Using this stochastic model, acoustic signals can be mapped into the visual domain for increased speech understanding.

Fig. 3 shows a flowchart of the training procedure used for training the AHMMs, VHMMs, and correlation HMMs (CHMMs). As shown in the illustration, the AHMMs and VHMMs were trained independently of each other. The AHMMs were trained using speech utterances from the TIMIT speech corpus. The training data consisted of the training utterances spoken by the female speakers in dialect regions DR1–DR8. Each acoustic observation consisted of 12 static mel-cepstral coefficients and an energy feature computed from a 25 ms window shifted every 11 ms. The corresponding velocity (delta) and acceleration (delta–delta) features were concatenated to realize acoustic observation vectors of dimension $D_a = 39$.

The CMU Pronouncing Dictionary was used to transcribe the Bernstein Lipreading corpus. The phone set for this dictionary differs from that of the TIMIT database. As a result of these differences, the utterances in the TIMIT database were transcribed using the CMU dictionary. A total of 433 context dependent AHMMs were trained using the Baum–Welch algorithm [29]. Each acoustic model contained three states configured in a left-to-right topology as shown in Fig. 4. The silence model also included transitions from state three to state one and vice versa (see Fig. 4). This topology allows the silence model to handle the long segments of silence which exist in both the TIMIT and Bernstein databases. The models shown in Fig. 4 also have two nonemitting states (i.e., they do not emit an observation) that are used for concatenating phone-level HMMs to yield sentence-level Markov chains for training and testing.
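A librosa-based stand-in for the acoustic front end described above is sketched below. It is an approximation rather than the authors' exact feature extractor: librosa's MFCC implementation is used, the 0th cepstral coefficient stands in for the separate energy term, and the sampling rate is assumed.

```python
import numpy as np
import librosa

def acoustic_features(wav_path, sr=16000, win_ms=25, hop_ms=11, n_static=13):
    """13 static coefficients (12 mel-cepstra plus an energy-like 0th term) from a
    25 ms window shifted every 11 ms, with delta and delta-delta appended,
    yielding 39-dimensional acoustic observation vectors."""
    y, sr = librosa.load(wav_path, sr=sr)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_static,
                                  n_fft=int(sr * win_ms / 1000),
                                  hop_length=int(sr * hop_ms / 1000))
    delta = librosa.feature.delta(static)
    delta2 = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, delta2]).T      # shape (T_a, 39)
```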



TABLE I EXAMPLE SENTENCE MODEL CONSTRUCTION

Fig. 4. Model topologies for AHMMs. The top figure represents the topology for all context dependent models, while the bottom represents the silence model's topology. Note that states $s_1$ and $s_5$ are nonemitting states.

The training procedure began by concatenating the necessary phone models representing the utterance. An example of this process is illustrated in Table I. The phones between the brackets represent the left and right phone context, respectively. It should be noted that word boundary and silence context phones were not included in the model set. Next, the forward–backward probabilities were accumulated for the utterances represented by the training set. Finally, the model parameters were reestimated using the Baum–Welch algorithm and the accumulated probabilities.

As shown in Fig. 3, the VHMMs were trained using utterances from the Bernstein Lipreading Corpus. Approximately 80% of the data was set aside for training and the remaining 20% for testing. Similar to the AHMMs' case, the VHMMs' covariance matrices were also reduced to diagonal matrices. This assumption is made based on the orthogonality of the eigenlip basis functions.

The KL procedure was used to construct the orthonormal basis for modeling the lips of the female speaker in the Bernstein database. The ensemble used to determine the eigenlip basis consisted of 1460 images which realized an oral region of dimensions 80 × 45 pixels. Fig. 5 depicts the 40 most significant eigenlip images, which represent 95.5% of the statistical variance. An image $\mathbf{x}$ outside the ensemble can be represented by a linear combination of the fixed eigenlips $\boldsymbol{\phi}_i$. The coefficients $c_i$ of the image are of the form

$$c_i = \boldsymbol{\phi}_i^T (\mathbf{x} - \bar{\mathbf{x}}) \quad (16)$$

$$\mathbf{o}^v = [c_1, c_2, \ldots]^T \quad (17)$$

which represent the elements of a visual observation vector $\mathbf{o}^v$, with $\bar{\mathbf{x}}$ representing the mean of the images in the ensemble and $\boldsymbol{\phi}_i^T$ the transpose of the vector $\boldsymbol{\phi}_i$.

Each visual model consisted of three states configured in a left-to-right topology with a skip transition from state one to state three. This topology is illustrated in Fig. 6. The visual silence model had the same topology as the AHMM silence model.
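A minimal numpy sketch of the KL (eigenlip) decomposition in (16)–(17) is given below; the function names are illustrative and the SVD route is only one of several equivalent ways to obtain the basis, not necessarily the one used by the authors.

```python
import numpy as np

def eigenlips(images, k=40):
    """KL/PCA basis for an ensemble of vectorized lip images (one flattened
    80x45 frame per row); the k most significant eigenlips are retained."""
    mean = images.mean(axis=0)
    # SVD of the centered ensemble; rows of Vt are orthonormal eigenlips.
    _, _, Vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, Vt[:k]                              # shapes (D,), (k, D)

def project(image, mean, basis):
    """Eq. (16): c_i = phi_i^T (x - x_bar); the c_i form the static part of the
    visual observation vector in eq. (17)."""
    return basis @ (image.ravel() - mean)

def reconstruct(coeffs, mean, basis):
    """Approximate the lip image back from its eigenlip coefficients."""
    return mean + coeffs @ basis
```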

Fig. 5. The 40 most significant eigenlip images. The images are ordered in descending significance from left to right, top to bottom.

Fig. 6. Model topology for VHMMs. Note that states $s_1$ and $s_5$ are nonemitting states.

The visual observation vectors were generated at the video frame rate $f_v$ and consisted of 20 static eigenlip coefficients and the corresponding velocity and acceleration terms (i.e., $D_v = 60$).

A total of 65 context dependent VHMMs were trained using the Baum–Welch algorithm. This is much smaller than the number of AHMMs trained, and is a result of the limited amount of visual data available. In both cases (i.e., AHMM and VHMM training), the number of context dependent models was determined by setting a minimum threshold on the number of context dependent phone occurrences represented by the training database. Thresholding was required to assure that there would be an adequate number of examples for training. The resultant models used for the VHMMs and CHMMs are listed in Table II.

The CHMM shown in Fig. 3 was trained using the acoustic and visual observations from the Bernstein database. The Viterbi decoding algorithm was used in conjunction with the trained AHMMs and VHMMs to force align the Bernstein acoustic and visual training data, respectively. The resultant acoustic state sequences were then used to generate down-sampled acoustic observations for training the CHMMs.


TABLE II VHMM AND CHMM CONTEXT DEPENDENT MODEL TRAINING OCCURRENCES. NOTE, CONTEXT DEPENDENT MODELS WERE CREATED IF 60 OR MORE EXEMPLARS EXISTED IN THE TRAINING SET

CHMMs’ initial observation pdf means and variances were set to the centroids of the AHMMs’ states. Note that all of the CHMMs’ pdf means and variances were shared (i.e., semicontinuous HMMs) among the states. In the CHMMs’ training process, the aligned visual state sequences were used as a constraint to assign subsampled acoustic observations to the CHMMs’ states. The observations assigned to a particular state were then used to re-estimate the mixture weights and Gaussian parameters. IV. OBJECTIVE RESULTS Evaluating the performance of the correlation/integration model objectively requires measurements for temporal and spatial accuracy. In the case of temporal accuracy, the time-aligned visual state sequence produced by the correlation/integration model and those of the VHMMs were examined. In the case of spatial accuracy, the MSE of the visual observation vectors produced by the VHMMs and CHMMs were analyzed. A. Temporal Results Temporal accuracy was measured by comparing the visual state sequences generated by the correlation/integration model with a reference set constructed by force aligning the Bernstein visual test observations using the trained VHMMs. This process began by generating acoustic state sequences. This was accomplished by aligning the Bernstein acoustic test observations using the trained AHMMs. The resultant acoustic state sequences were then used to realize down-sampled acoustic observations for input into the CHMMs. Finally, the observations were aligned using the CHMMs and the resultant visual state sequences were compared against the reference visual state sequences. Table III shows the visual phone (viseme) time alignment statistics with and without the correlation/integration model. It should be noted that errors greater-than 100% indicate that the time-alignment procedure over predicted the duration. In the examples without the correlation model, the viseme durations were obtained by scaling the duration of the acoustic state


The overall results shown in Table III demonstrate that the correlation model reduced time alignment errors by 37.4% compared to the temporal scaling results. However, the table also shows that the CHMMs do not perform better than temporal scaling for all the models listed. Table IV shows the phone models which performed worse using the novel correlation/integration model.

The unitary vowels /ow/, /aa/, /ae/, and /er/ shown in the table were particularly challenging for the CHMMs. Unitary vowels are articulated by exciting a fixed vocal tract with pulses of air produced by quasiperiodic vibration of the vocal cords. Because of the fixed vocal tract configuration, vowels are difficult to perceive visually. However, acoustically, they can be easily identified by their spectra. Table II shows that all four vowels have good coverage in the training set, and are also well represented in the test set. Their poor performance is the result of coarticulation and duration modeling [see (9)]. When used in context, for example, in models z[aa,], hh[,ae], and w[,aa], time-alignment errors decrease dramatically. The lack of context information for these monophones (/ow/, /aa/, /ae/, and /er/) caused the CHMMs to absorb the temporal dynamics of all neighboring phones during the training process. As a result, the CHMMs' observation pdfs blurred the phone boundaries, causing misaligned state sequences. In addition, the exponential duration density constraint imposed by the Markov chain does not help matters. The same observations can be made for the liquid /r/. However, its performance was not affected as severely as with the unitary vowels. The smearing of phone boundaries is not isolated to the CHMMs. Boundary misalignments also occur from use of the acoustic and visual HMMs.

The fricative consonants /f/ and /v/ exhibit the same problems as the vowels discussed above. Fricatives have the longest duration among the consonants. Unlike plosives (stops), it is possible to articulate a fricative sound for any length of time. It should also be noted that /f/ and /v/ appear visually the same when articulated; subjective viseme studies [30]–[37] have shown that these two consonants are among the most visually alike of the viseme clusters. The data shown in Table II reveal that /f/ and /v/ are well represented in the training set. They are also well represented in the test set, as can be seen in Table III. The results for /f/ and /v/ show a 9.52% and 45.2% relative decrease in performance, respectively. Unfortunately, given the context dependent threshold set for the training process, there are no context dependent models to compare them against. The results for all the context models used, with the exception of /iy[sh,]/, showed significant improvements compared to models trained without context. Therefore, it is expected that the time alignments would improve significantly with context dependent models.

The results for the plosive /b/ shown in Table III reveal that the correlation model performed 10.7% worse than temporal scaling.



TABLE III VISEME DURATION ERRORS WITHOUT AND WITH THE CORRELATION INTEGRATION MODEL

The results for the fricatives /sh/ and /th/, the nasal /ng/, the diphthong /aw/, and the context dependent model /iy[sh,]/ do not have a sufficient number of observations in the test set to draw strong conclusions. Furthermore, Table II shows that all of these visemes did not occur as frequently as others in the training set.

B. Spatial Results

Measuring the spatial accuracy of an acoustic or visual synthesizer is difficult at best using objective measures.

When a speaker articulates an utterance multiple times, the acoustic and visual signals produced are always novel. Because of this novelty, comparisons between synthesized utterances and real utterances are best handled with subjective tests. However, conducting subjective tests is time consuming and costly. Therefore, objective measures are desirable to gauge when subjective tests would prove beneficial to determining overall system performance.



TABLE IV THE PHONES SHOWN ACHIEVE HIGHER ACCURACY WITHOUT THE USE OF THE CORRELATION/INTEGRATION MODEL

Fig. 7. Synthesized spatial errors. (Top) MSE between the original and synthesized. (Middle) Original and force aligned. (Bottom) Force aligned and synthesized. The horizontal axis represents the utterance number.

Shown in Fig. 7 are MSE plots detailing the spatial errors for the following conditions: original/synthesized, original/forced, and forced/synthesized. The plots reveal that the MSEs between the original visual observation vectors and those produced by the correlation model were the worst, and the MSEs between the Viterbi-aligned and the correlation estimates were the best.

The plots shown in Fig. 8 show the variations of the four most significant eigenlip coefficients for four sentences in the test data set. The curves labeled "Correlation Est.," "Viterbi Est.," and "Reference" represent the results for the correlation/integration model, the Viterbi-aligned VHMMs, and the actual eigenlip parameters, respectively. The plots reveal that the "Correlation Est." and "Viterbi Est." curves estimate the shape of the "Reference" curves well. These results suggest that the synthesized visual sequences should be adequate for speechreading purposes. However, a significant number of the plots reveal time alignment/SOA errors. For example, in Fig. 8 the onset times for the "Correlation Est." curves lag behind the "Viterbi Est." and "Reference" curves. However, the temporal alignments improved after 20–30 frames into the utterance. These misalignments are a major contribution to the MSEs shown in Fig. 7 and to the temporal errors discussed earlier. The alignment errors at the onset of the utterance are the result of temporal smearing of the acoustic segments assigned to the correlation silence model. These SOA delays suggest that there might be integration problems during subjective testing. The large spatial differences between the model curves "Correlation Est." and "Viterbi Est." and the "Reference" curves are expected because of the novelty of the test utterance, since it is impossible to reconstruct the original visual observation from the VHMMs.
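The objective measures used in this section reduce to per-segment duration comparisons and frame-wise MSEs between eigenlip-coefficient trajectories. A minimal sketch of how such measures might be computed is shown below; the exact definitions used by the authors are not spelled out, so both functions should be read as assumptions.

```python
import numpy as np

def run_lengths(labels):
    """Collapse a frame-level viseme/state label sequence into (label, duration) runs."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((labels[start], t - start))
            start = t
    return runs

def duration_percentages(predicted_labels, reference_labels):
    """Predicted segment duration expressed as a percentage of the force-aligned
    reference duration (values above 100% indicate over-prediction). Assumes both
    sequences contain the same segments in the same order."""
    return [(lab, 100.0 * dp / dr)
            for (lab, dp), (_, dr) in zip(run_lengths(predicted_labels),
                                          run_lengths(reference_labels))]

def trajectory_mse(estimated, reference):
    """Frame-wise mean-squared error between two eigenlip-coefficient trajectories."""
    est, ref = np.asarray(estimated), np.asarray(reference)
    return float(np.mean((est - ref) ** 2))
```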

V. SUBJECTIVE TESTING

It was stated earlier that the objective measures aimed to determine perceptual attributes which may give insight into human response. Objective measures work well at resolving surface problems; however, most fail to answer the finer details of an engineering design or model. Therefore, at some point subjective tests must be conducted to fully understand the constraints, weaknesses, and strengths imposed by the modeling framework. In this section, the performance of the HMM-based visual speech synthesizer is examined using subjective measures. In Section V-A, the methods used to generate the test stimuli are described, along with the subjects who participated in this study. In addition, the procedures used to facilitate the tests are presented. Then in Section V-B, the results of the subjective tests are analyzed.

A. Methods

1) Subjects: Intuition suggests that individuals with impaired hearing are the ideal subjects to use for measuring the perceptual quality of a visual speech synthesizer. However, locating and screening a population of individuals with impaired hearing that satisfies the following constraints:

• equal levels of speechreading proficiency,
• no formal speechreading training,
• matching audiograms,

is difficult. Therefore, an alternate population of subjects must be chosen to control the number of parameters which may have effects on the final results.


Fig. 8. Synthesis results for sentences 380 (top left), 381 (top right), 382 (bottom left), and 383 (bottom right).

In this study, a population of normal hearing adults was recruited. This allowed the above constraints to be easily met without having to perform an extensive search to locate a population of individuals with impaired hearing. Thirty normal hearing adults (16 males and 14 females) were recruited from the Northwestern University student body and the Chicago Metropolitan area. The selected participants indicated English as their primary language. They also demonstrated normal or corrected visual acuity no less than 20/30 as measured using an optical chart. The participant’s hearing

level was verified using pure-tone threshold checks as measured using a Beltone audiometer Model 10D.

2) Stimuli: The subjective tests were designed in a manner that facilitates speechreading, with emphasis on measuring the visual significance of the speech-driven visual synthesizer. The test material used to conduct this study consisted of utterances taken from the Bernstein Lipreading corpus. All the auditory speech signals used were degraded using Gaussian white noise to realize speech signals with −5 dB, −10 dB, and −15 dB SNR. The synthesized visual lip sequences were realized from


Fig. 9. Subjective test application screenshot.

the visual observations obtained from the CHMMs as discussed in Section III. In addition to these lip sequences, the original lip sequences were also used so that the subjective results for the two could be compared. All the visual sequences were constructed by superimposing the lip images over the speaker's lips (see the facial image in Fig. 9).

3) Procedure: The experiments were administered in a sound-treated room and required approximately 1 1/2 h of the participant's time. Each participant was seen individually and was seated in front of a 17-in computer monitor tilted to zero degrees, and each wore headphones (Fostex T20) throughout all the experimental tasks. Testing was conducted under three stimulus presentation conditions: 1) auditory only; 2) auditory–visual using the novel HMM-based visual speech synthesizer; and 3) auditory–visual using the original visual lip sequences. The participants were instructed that they would be listening to and viewing utterances on a computer monitor. Prior to testing, they were given instructions on how to enter their responses in the computer system. The initiation of all the stimulus presentations was under the participant's control via a button displayed in the video player application. Fig. 9 shows a screenshot of the application interface used to collect the subject's response.
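The noise degradation described above amounts to scaling Gaussian white noise to a target speech-to-noise power ratio. A generic sketch of such a procedure is shown below (an assumption about the stimulus generation, not the authors' code):

```python
import numpy as np

def degrade_to_snr(speech, snr_db, rng=None):
    """Add Gaussian white noise so that the resulting speech-to-noise power
    ratio equals snr_db (e.g. -5, -10, or -15 dB as in the listening tests)."""
    rng = np.random.default_rng() if rng is None else rng
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(p_noise), size=np.shape(speech))
    return speech + noise
```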


much lower. This result is an indication that the noise level was impairing the subject’s ability to transcribe the utterances. When the repeated utterances are removed from the test set, the overall word and sentence accuracy was 92.2% (3.68% standard deviation) and 73.3% (13.9% standard deviation), respectively. When the auditory signal was degraded to 10 dB SNR, word accuracy decreased by 27.6% while sentence accuracy decreased by 51.7%. This decline was the result of increased noise level in the auditory signal. When the repeated utterances were removed, performance degraded significantly, indicating that the subjects used prior knowledge to assist the transcription of the repeated utterances. It is also interesting to note that the number of deletions increased dramatically while the number of insertions remained constant. This can be attributed to the manner in which the subjects responded to the utterances. The majority of the subjects chose to skip utterances they could not comprehend, and respond to utterance they thought they could completely transcribe. The overall results with repeated utterances demonstrate word and sentence accuracy of 31.6% (5.90% standard deviation) and 18.5% (8.91% standard deviation), respectively. As in the previous auditory condition ( 10 dB SNR), the noise level has a significant effect on the participants’ ability to correctly transcribe the utterances. Here we can see a significant decrease in word and sentence accuracy compared to both the 5 dB and 10 dB SNR conditions. When the repeated utterances are removed from the results, it is evident that the subjects are relying on memory to transcribe the utterances. In this case word and sentence accuracy was 11.7% (7.72% standard deviation) and 1.21% (3.94% standard deviation), respectively. Compared to the 10 dB SNR condition, word and sentence accuracy decreased by 82.5% and 96.6%, respectively. The auditory results discussed clearly illustrate that noise had a significant effect on the subjects’ ability to speechread. Furthermore, the results show that the subjects were able to identify utterances in the presence of background noise when given multiple observations of the stimuli. These test results provide a reference from which to compare the audio–visual results which will be discussed. They will also help determine the visual contribution for speech intelligibility. 2) Audio With Visual Synthesis: The results for the 5 dB SNR with repeated utterances indicated word and sentence accuracy of 98.1% (1.88% standard deviation) and 87.7% (11.7% standard deviation), respectively. The word and sentence accuracy without the repeated utterances was 97.5% (2.72% standard deviation) and 85.2% (12.1% standard deviation), respectively. These results demonstrate that subjects performed better with the repeated utterance than without. Note that this result also reflects the observation in the auditory test discussed in Section V-BI. Compared to the 5 dB SNR auditory test results, one can see that the addition of visual information improved word and sentence accuracy by 6.08% and 16.1%, respectively. Note this measure of audio–visual significance is biased in favor of the auditory condition. An alternative measure is given by (18)



Fig. 10. Benefit distributions for auditory-synthesized and auditory-original conditions at −5 dB, −10 dB, and −15 dB SNR.

Applying this measure, the mean relative AV word and sentence visual benefits are and , respectively. Fig. 10(a) and (b) include histogram plots depicting the distribution of the word and sentence benefit scores, respectively.
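As a purely illustrative application of (18), using the overall −5 dB word accuracies quoted above (these are not benefit values reported by the authors, only an example of the computation):

```python
def relative_av_benefit(av_score, a_score):
    """Relative audio-visual benefit of eq. (18): the fraction of the errors made
    in the auditory-only condition that is recovered when visual cues are added."""
    return (av_score - a_score) / (1.0 - a_score)

# Overall -5 dB word accuracies quoted in the text: auditory-only 93.9%,
# auditory plus synthesized lips 98.1%.
print(round(relative_av_benefit(0.981, 0.939), 2))   # 0.69
```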


The overall word and sentence accuracy for the −10 dB SNR condition with the repeated utterances was 90.57% (4.76% standard deviation) and 72.89% (12.62% standard deviation), respectively.


When the repeated utterances were removed, the overall word and sentence accuracy decreased to 87.50% (6.21% standard deviation) and 63.67% (14.02% standard deviation), respectively. Compared to the −5 dB SNR results, the overall word and sentence accuracy decreased by 10.26% and 25.23%, respectively. Histogram plots depicting the distribution of the word and sentence benefit scores are presented in Fig. 10(c) and (d), respectively. The mean relative AV word and sentence benefits are and , respectively.

The overall word and sentence accuracy for the −15 dB SNR condition was 59.40% (9.87% standard deviation) and 35.17% (9.14% standard deviation), respectively. In the case without the repeated utterances, the overall word and sentence accuracies were 46.11% (13.01% standard deviation) and 17.56% (10.13% standard deviation), respectively. Compared to the −10 dB SNR results, the overall word and sentence accuracy decreased by 47.30% and 72.42%, respectively. This represents a significant decrease in speechreading performance; however, the visual information increased the speech intelligibility compared to the auditory-only condition discussed in Section V-B1. Histogram plots depicting the distribution of the word and sentence benefit scores are shown in Fig. 10(e) and (f), respectively. In this case, the mean relative AV scores are and , respectively.

3) Audio With Original Lips: Unlike the audio-synthesis and auditory tests, there were no repeated utterances included in these tests. The results for the −5 dB SNR condition demonstrate that the overall word and sentence accuracy was 98.3% (2.21% standard deviation) and 93.9% (7.56% standard deviation), respectively. As the noise level increased to −10 dB SNR, the word and sentence accuracy decreased to 95.5% (2.96% standard deviation) and 88.7% (8.78% standard deviation), respectively. At −15 dB SNR, the word and sentence accuracy further decreased to 86.5% (6.68% standard deviation) and 65.2% (11.3% standard deviation), respectively. These results are significantly better than the results obtained from the audio-synthesis tests, and are an indication that the temporal misalignments and spatial errors are affecting speech intelligibility. Shown in Fig. 10(g)–(l) are the histogram plots depicting the distributions of the AV benefit scores for the −5 dB, −10 dB, and −15 dB SNR levels, respectively. The mean relative AV word benefit scores for the three SNR levels are , respectively. The corresponding mean relative AV sentence benefit scores are , respectively.

C. Summary

Fig. 11 and Table V summarize the overall subjective speechreading results discussed in Sections V-A and V-B. These results indicate that the HMM-based visual speech synthesizer is able to increase speech understanding compared to the auditory tests alone. However, when compared to the original lip sequences, the HMM-based visual speech synthesizer's results at −15 dB are significantly worse.

Overall the results reveal that the HMM-based synthesizer does well down to −10 dB SNR. At −15 dB SNR, the temporal and spatial errors of the HMM-based visual speech synthesizer become perceptually significant.

Fig. 11. Summary of the overall subjective speechreading results.

Comments given by the subjects indicated that the synthesized visual sequences were not as smooth as the original lip sequences. In addition, many of the subjects stated that the speech preceded the lip movements near the beginnings of the utterances. This is a result of the SOA delays and can be observed in the plots shown in Fig. 8.

The results presented here compare favorably to results published by Yamamoto et al. [40] and Benoît et al. [41]. Both of these intelligibility studies used nonsense syllables. In Yamamoto's study [40], Japanese CVCV nonsense syllables were used, while in Benoît's study [41], French VCVCV syllables were used. The results of both studies showed that articulatory lip movements synthesized from speech can improve speech intelligibility compared to the auditory condition.

VI. DISCUSSION AND CONCLUSION

In this paper, we proposed a framework for synthesizing visual articulatory movements given the acoustic speech signal. The proposed correlation HMM approach has the unique property of allowing the acoustic HMMs and visual HMMs to be trained independently. Furthermore, the framework allows the model topologies of the audio HMMs and visual HMMs to be chosen independently. As a result of these attributes, the selected modeling framework reduces the number of model parameters that need to be estimated in contrast to early integration approaches. Results from quantitative measures showed that correlation HMMs are able to improve lip synchronization versus conventional temporal scaling. In addition, subjective analysis demonstrated that this framework can improve speech intelligibility.

The performance of the HMM-based visual speech synthesizer is not perfect, as indicated by the objective and subjective results. In particular, the results indicated a need for larger audio–visual databases for training context dependent models. This would improve the stimulus offset asynchrony (SOA) times of the synthesized utterances. The results also revealed that a common model topology is not necessarily the best approach for modeling visemes and context dependent visemes.


TABLE V SPEECHREADING RESULTS FOR SUBJECTIVE TESTS

The model proposed in this paper can be extended in many ways. As a result of our findings, we are investigating alternate HMM model topologies, normalization of frame rates, and facial models. Furthermore, future work will include comparative analyses, such as those with the methods proposed by Bregler [10] and Chen [11], using the Bernstein Lipreading Corpus [8].

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their suggestions and comments.

REFERENCES

[1] S. Kochkin, "Marketrak IV: 10-year trends in the hearing aid market-has anything changed?," Hearing J., vol. 49, Jan. 1996.
[2] D. C. Garstecki and S. F. Erler, "Hearing status and aging," in Aging and Communication: For Clinicians by Clinicians. Austin, TX: PRO-ED, Inc., 1997, ch. 5, pp. 97–116.
[3] L. J. Kepler, M. Terry, and R. H. Sweetman, "Telephone usage in the hearing-impaired population," Ear Hearing, vol. 13, pp. 331–319, Oct. 1992.
[4] Q. Summerfield, "Use of visual information in phonetic perception," Phonetica, vol. 36, pp. 314–331, 1979.
[5] Q. Summerfield, "Lipreading and audio–visual speech perception," Phil. Trans. R. Soc. Lond. B, vol. 335, pp. 71–78, 1992.
[6] K. W. Grant and L. D. Braida, "Evaluating the articulation index for auditory-visual input," J. Acoust. Soc. Amer., vol. 89, pp. 2952–2960, June 1991.
[7] T. Chen and R. R. Rao, "Audio–visual interaction in multimedia communication," in Proc. ICASSP, vol. 1, Apr. 1997, pp. 179–182.
[8] L. Bernstein and S. Eberhardt, "Johns Hopkins Lipreading Corpus I–II," Johns Hopkins Univ., Baltimore, MD, 1986.
[9] A. Simons and S. Cox, "Generation of mouthshapes for a synthetic talking head," in Proc. Inst. Acoust., vol. 12, 1990, pp. 475–482.
[10] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proc. ACM SIGGRAPH 97, 1997.
[11] T. Chen and R. R. Rao, "Audio–visual integration in multimodal communication," Proc. IEEE, vol. 86, May 1998.
[12] M. Tamura, T. Masuko, T. Kobayashi, and K. Tokuda, "Visual speech synthesis based on parameter generation from HMM: Speech-driven and text-and-speech-driven approaches," in Proc. Auditory-Visual Speech Processing 1998, Dec. 1998.
[13] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. ICASSP-95, 1995.
[14] E. Yamamoto, S. Nakamura, and K. Shikano, "Lip movement synthesis from speech based on hidden Markov models," Speech Commun., vol. 26, no. 1–2, pp. 105–115, 1998.
[15] E. Yamamoto, S. Nakamura, and K. Shikano, "Speech-to-lip movement synthesis based on EM algorithm using audio–visual HMMs," in Proc. Int. Conf. Spoken Language Processing, 1998, pp. 1275–1278.
[16] K. Choi and J.-N. Hwang, "Baum–Welch hidden Markov model inversion for reliable audio-to-visual conversion," in Proc. 1999 IEEE 3rd Workshop Multimedia Signal Processing, 1999, pp. 175–180.

[17] S. Moon and J.-N. Hwang, "Robust speech recognition based on joint model and feature space optimization of hidden Markov models," IEEE Trans. Neural Networks, vol. 8, pp. 194–204, Mar. 1997.
[18] J. J. Williams, A. K. Katsaggelos, and M. A. Randolph, "A hidden Markov model based visual speech synthesizer," in Proc. ICASSP, 2000.
[19] J. J. Williams, "Speech-to-video conversion for individuals with impaired hearing," Ph.D. dissertation, Northwestern Univ., Evanston, IL, 2000.
[20] J. J. Williams, A. K. Katsaggelos, and D. C. Garstecki, "Subjective analysis of an HMM-based visual speech synthesizer," in Proc. HVEI, Jan. 2001.
[21] M. Tomlinson, M. Russell, and N. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in Proc. 1996 IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996.
[22] R. R. Rao, T. Chen, and R. M. Mersereau, "Audio-to-visual conversion for multimedia communication," IEEE Trans. Ind. Electron., vol. 45, pp. 15–22, Feb. 1998.
[23] N. M. Brooke and S. D. Scott, "Two- and three-dimensional audio–visual speech synthesis," in Proc. Auditory-Visual Speech, 1998.
[24] M. Tamura, T. Masuko, T. Kobayashi, and K. Tokuda, "Visual speech synthesis based on parameter generation from HMM: Speech-driven and text-and-speech-driven approaches," in Proc. Auditory-Visual Speech, 1998.
[25] D. W. Massaro, Perceiving Talking Faces: From Speech Production to a Behavioral Principle. Cambridge, MA: MIT Press/Bradford Books, 1998.
[26] D. G. Stork and M. E. Hennecke, "Speechreading: An overview of image processing, feature extraction, sensory integration and pattern recognition techniques," in Proc. 2nd Int. Conf. Automatic Face Gesture Recognition, 1996.
[27] I. A. Matthews, J. A. Bangham, and S. J. Cox, "Scale based features for audiovisual speech recognition," in Proc. Inst. Elect. Eng. Colloquium Integrated Audio–Visual Processing Recognition, Synthesis, Commun., 1996.
[28] M. M. Cohen and D. W. Massaro, "What can visual speech synthesis tell visual speech recognition?," in Proc. Conf. Signals, Syst., Comput., 1994, pp. 566–571.
[29] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, ser. Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[30] N. Erber, "Sensory capabilities of hearing-impaired children," in Discussion: Lipreading Skills, R. E. Stark, Ed. Baltimore, MD: Univ. Park Press, 1974, pp. 69–73.
[31] C. Binnie, A. Montgomery, and P. Jackson, "Auditory and visual contributions to the perception of consonants," J. Speech Hearing Res., vol. 17, pp. 619–630, 1974.
[32] C. Binnie, P. Jackson, and A. Montgomery, "Visual intelligibility of consonants: A lipreading screening test with implications for aural rehabilitation," J. Speech Hearing Disorders, vol. 41, pp. 530–539, 1976.
[33] B. E. Walden, R. A. Prosek, A. A. Montgomery, C. K. Scherr, and C. Jones, "Effects of training on the visual recognition of consonants," J. Speech Hearing Res., pp. 130–145, 1977.
[34] B. Walden, S. Erdman, A. Montgomery, D. Schwartz, and R. Prosek, "Some effects of training on speech recognition by hearing-impaired adults," J. Speech Hearing Res., vol. 24, pp. 207–216, 1981.
[35] E. Owens and B. Blazek, "Visemes observed by hearing-impaired and normal hearing adult viewers," J. Speech Hearing Res., vol. 28, pp. 381–393, 1985.


[36] S. Lesner, S. Sandridge, and P. Kricos, "Training influences on visual consonant and sentence recognition," Ear Hearing, vol. 8, pp. 283–287, 1987.
[37] J. J. Williams, J. C. Rutledge, A. K. Katsaggelos, and D. C. Garstecki, "Frame rate and viseme analysis for multimedia applications to assist speechreading," J. VLSI Signal Processing, vol. 20, pp. 7–23, 1998.
[38] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, pp. 212–215, Mar. 1954.
[39] K. W. Grant and P. F. Seitz, "Measures of auditory-visual integration in nonsense syllables and sentences," J. Acoust. Soc. Amer., vol. 104, pp. 2438–2450, Oct. 1998.
[40] E. Yamamoto, S. Nakamura, and K. Shikano, "Subjective evaluation for HMM-based speech-to-lip movement synthesis," in Proc. Conf. Auditory-Visual Speech Processing, 1998.
[41] C. Benoît, "Synthesis and automatic recognition of audio–visual speech," in Proc. Inst. Elect. Eng. Colloquium Integrated Audio–Visual Processing Recognition, Synthesis, Commun., Nov. 1996.

Jay J. Williams (S’92–M’00) received the B.S. degree in electrical engineering from Howard University, Washington, DC, in 1993, and the M.S. and Ph.D. degrees both in electrical engineering from Northwestern University, Evanston, IL, in 1996 and 2000, respectively. He is now with Ingenient Technologies, Inc., Chicago, IL. His current research interests include multimedia processing, image and video signal processing, computer vision, and audio-visual interaction.


Aggelos K. Katsaggelos (S’80–M’85–SM’92–F’98) received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees both in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1981 and 1985, respectively.

In 1985, he joined the Department of Electrical Engineering and Computer Science at Northwestern University, Evanston, IL, where he is currently Professor, holding the Ameritech Chair of Information Technology. He is also the Director of the Motorola Center for Telecommunications. During the 1986–1987 academic year, he was an Assistant Professor at the Department of Electrical Engineering and Computer Science, Polytechnic University, Brooklyn, NY. His current research interests include signal recovery and compression, and multimedia signal processing and communications.

Dr. Katsaggelos is an Ameritech Fellow, a Member of the Associate Staff, Department of Medicine, at Evanston Hospital, and a Member of SPIE. He is currently a Member of the Publication Board of the IEEE Signal Processing Society, the IEEE TAB Magazine Committee, the Publication Board of the IEEE PROCEEDINGS, the Scientific Board, Centre for Research and Technology Hellas-CE.R.T.H, the IEEE Technical Committees on Visual Signal Processing and Communications, and on Multimedia Signal Processing, an Editorial Board Member of Academic Press, Marcel Dekker, Signal Processing Series, Applied Signal Processing, the Computer Journal, and Editor-in-Chief of the IEEE SIGNAL PROCESSING MAGAZINE. He has served as a Member of the IEEE Signal Processing Society Board of Governors from 1999 to 2001, an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 1990 to 1992, an Area Editor for Graphical Models and Image Processing from 1992 to 1995, a Member of the Steering Committees of the IEEE TRANSACTIONS ON IMAGE PROCESSING from 1992 to 1997, and the IEEE TRANSACTIONS ON MEDICAL IMAGING from 1990 to 1999, and a Member of the IEEE Technical Committee on Image and Multidimensional Signal Processing from 1992 to 1998. He has served as the General Chairman of the 1994 Visual Communications and Image Processing Conference, Chicago, IL, and as Technical Program Co-Chair of the 1998 IEEE International Conference on Image Processing, Chicago, IL.
