PRODUCT HMMS FOR AUDIO-VISUAL CONTINUOUS SPEECH RECOGNITION USING FACIAL ANIMATION PARAMETERS

Petar S. Aleksic and Aggelos K. Katsaggelos, Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL 60208. Email: {apetar, aggk}@ece.nwu.edu

ABSTRACT

The use of visual information in addition to acoustic information can improve automatic speech recognition (ASR). In this paper we compare different approaches to audio-visual information integration and show how they affect ASR performance. We utilize Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for visual representation, as visual features. We use both single-stream and multi-stream Hidden Markov Models (HMMs) to integrate audio and visual information, and we perform both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs, at various SNRs with additive white Gaussian noise.

1. INTRODUCTION

Visual information can be used in addition to audio in order to improve speech understanding, especially in noisy environments [1]. Improving ASR performance by exploiting the visual information of the speaker's mouth region is the main objective of audio-visual automatic speech recognition (AV-ASR) [2]. Many researchers have reported results that demonstrate AV-ASR performance improvement over A-ASR systems [3-5]. Most of these systems were tested on small vocabularies, while results on a large vocabulary have been reported recently [3, 6].

MPEG-4 is an audio-visual object-based video representation standard supporting facial animation. MPEG-4 facial animation is controlled by the Facial Definition Parameters (FDPs) and the FAPs, which describe the face shape and the face movement, respectively [7]. The MPEG-4 standard defines 68 FAPs, divided into 10 groups as shown in Figure 1 [7]. Transmission of all FAPs at 30 frames per second requires only around 20 kbps (or just a few kbps if MPEG-4 FAP interpolation is used efficiently [8]), which is much lower than standard video transmission rates.

FAPs contain important visual information that can be used in addition to audio information in ASR. To the best of our knowledge, no results have previously been reported on the improvement in AV-ASR performance when FAPs are used as visual features with a relatively large-vocabulary audio-visual database of about 1000 words and with different audio-visual information integration approaches. The main objective of this paper is to report such results.

In this paper, we first describe the visual features in Sec. 2 and the audio-visual integration approaches in Sec. 3. The speech recognition experiments are described in Sec. 4. Finally, conclusions are drawn and future work is proposed in Sec. 5.

2. VISUAL FEATURES

This work utilizes speechreading material from the Bernstein Lipreading Corpus [9, 14]. This high-quality audio-visual database includes a total of 954 sentences, of which 474 were uttered by a single female speaker and the remaining 480 by a male speaker. For each sentence, the database contains a speech waveform, a word-level transcription, and a video sequence time-synchronized with the speech waveform. Each utterance begins and ends with a period of silence. The vocabulary size is approximately 1,000 words, and the average utterance length is approximately 4 seconds.

In order to extract visual features from the database, the video (Figure 2a) was sampled at a rate of 30 frames/sec (fps) with a spatial resolution of 320 x 240 pixels, 24 bits per pixel. Only the luminance information was used in the algorithms and experiments. Audio was acquired at a rate of 16 kHz. The visual feature extraction algorithm used in this work for extracting the FAPs from the original video combines active contour and template algorithms and is described in [10]. Through visual evaluation of the FAP extraction results, we observed that the extracted parameters produced natural movement of the MPEG-4 decoder (Figure 2b) [11] that synchronized well with the audio. In this work, only group 8 FAPs, which describe the outer-lip movement, are considered (Figure 1).

Figure 3. The mean lip shape (middle column) and the lip shapes obtained by varying the first (a) and second (b) eigenvector weights by ±2 standard deviations (left and right columns).

Figure 4. Audio-visual system for ASR: audio (16 kHz) undergoes acoustic feature extraction (MFCCs, 90 Hz), and video (30 fps) undergoes visual feature extraction (FAPs), dimensionality reduction (PCA), and interpolation from 30 Hz to 90 Hz; the audio observations o_t^a and visual observations o_t^v are fed to single-stream and multi-stream HMMs that output the word sequence.

Figure 1. Facial animation parameters (FAPs)

Figure 2. a) Original video frame [9]; b) MPEG-4 model [11].

In order to decrease the dimensionality of the visual feature vector, Principal Component Analysis (PCA) was performed on the 10-dimensional FAP vectors f_t. After the 10x1 mean FAP vector \bar{f} and the 10x10 covariance matrix were obtained, the FAPs f_t were projected onto the eigenspace defined by the first K eigenvectors,

    f_t = \bar{f} + E \, o_t^f ,                                   (1)

where E = [e_1 e_2 ... e_K] is the matrix of the K eigenvectors corresponding to the K largest eigenvalues, and o_t^f is the Kx1 vector of the corresponding projection weights. The first six, two, and one eigenvectors represent 99.6%, 93%, and 81% of the total statistical variance, respectively. By varying the projection weights by ±2 standard deviations, we concluded that the first and second eigenvectors mostly describe the movement of the lower and upper lip, respectively (Figure 3).

When choosing the dimensionality of the visual features to be used for AV-ASR, one should keep in mind the trade-off between the number of HMM parameters that have to be estimated and the amount of speechreading information contained in the visual features. Based on the statistical variance distribution and the above-mentioned trade-off, we decided to use the two-dimensional projection weights (K=2) as visual features. These features were used in all AV-ASR experiments.
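As a concrete illustration of this dimensionality reduction, the following Python sketch (our own, not the authors' code; array shapes and function names are hypothetical) computes the PCA eigenspace from a set of 10-dimensional FAP vectors and projects each frame onto the first K = 2 eigenvectors, as in Eq. (1).

```python
import numpy as np

def pca_project_faps(faps, K=2):
    """Project 10-dimensional FAP vectors onto the first K eigenvectors.

    faps : (T, 10) array of group-8 FAP vectors, one row per video frame.
    Returns the (T, K) projection weights o_t^f plus the mean and basis, so
    each frame can be approximated as f_t ~ f_mean + E @ o_t^f (Eq. 1).
    """
    f_mean = faps.mean(axis=0)                    # 10x1 mean FAP vector
    cov = np.cov(faps - f_mean, rowvar=False)     # 10x10 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    E = eigvecs[:, order[:K]]                     # 10xK basis of largest eigenvalues
    weights = (faps - f_mean) @ E                 # (T, K) projection weights
    explained = eigvals[order[:K]].sum() / eigvals.sum()
    return weights, f_mean, E, explained

# Example usage with synthetic data (the real FAPs come from the tracker in [10]):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_faps = rng.normal(size=(300, 10))        # 300 frames, 10 FAPs each
    w, f_mean, E, var = pca_project_faps(fake_faps, K=2)
    print(w.shape, f"explained variance: {var:.1%}")
```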

3. AUDIO-VISUAL INTEGRATION

The audio and visual speech information can be combined using several techniques. One such technique (early integration) is simply to concatenate the audio and visual features, forming larger feature vectors, and then train a single-stream HMM system on the concatenated vectors. Although using concatenated audio-visual features can improve ASR performance over A-ASR, it does not allow for modeling the reliability of the audio and visual streams. It therefore cannot take advantage of information that might be available about the acoustic noise in the environment, which affects the audio features, or about the visual noise, which affects the quality of the visual features. A second approach is to model the audio and visual features as separate feature streams (late integration) by using multi-stream HMMs. Their log-likelihoods can be combined using weights that capture the reliability of each particular stream. The recombination of log-likelihoods can be performed at different levels of integration, such as state, phone, word, and utterance [3].

The two streams, audio and visual, are combined in the approach we developed as shown in Figure 4. Mel-Frequency Cepstral Coefficients (MFCCs), signal energy, and their first and second derivatives, widely used in speech processing, were used as audio features, while the two-dimensional projection weights and their first and second derivatives were used as visual features. Since the MFCCs were obtained at a rate of 90 Hz while the FAPs were obtained at a rate of 30 Hz, the FAP-based visual features were interpolated in order to obtain synchronized data.
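A minimal sketch of this rate-synchronization step is given below, assuming simple linear interpolation of the 30 Hz visual feature stream up to the 90 Hz audio frame rate; the paper does not specify the interpolation method, so this is only one plausible choice, and the function and variable names are ours.

```python
import numpy as np

def upsample_visual_features(visual_feats, src_rate=30.0, dst_rate=90.0):
    """Linearly interpolate a (T_v, D) visual feature matrix from src_rate
    to dst_rate so that each 90 Hz audio frame has a matching visual frame."""
    T_v, D = visual_feats.shape
    t_src = np.arange(T_v) / src_rate                 # timestamps of visual frames
    T_a = int(round(T_v * dst_rate / src_rate))
    t_dst = np.arange(T_a) / dst_rate                 # timestamps of audio frames
    out = np.empty((T_a, D))
    for d in range(D):                                # interpolate each dimension
        out[:, d] = np.interp(t_dst, t_src, visual_feats[:, d])
    return out

# Early integration then simply concatenates the synchronized streams, e.g.:
# o_t = np.hstack([audio_feats, upsample_visual_features(visual_feats)])
```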

3.1. The Single-Stream HMM

In this approach the audio-visual feature observation vector o_t is formed by appending the visual observation vector o_t^v to the audio observation vector o_t^a, that is,

    o_t = [ (o_t^a)^T  (o_t^v)^T ]^T .                                   (2)

The newly obtained joint features were used to train a single-stream HMM model, with state emission probabilities given by

    b_j(o_t) = \sum_{m=1}^{M} c_{jm} \, N(o_t; \mu_{jm}, \Sigma_{jm}) .                                   (3)

In (3), subscript j denotes a state of a context-dependent model, M denotes the number of mixture components, c_{jm} denotes the weight of the m-th mixture component, and N is a multivariate Gaussian with mean vector \mu_{jm} and diagonal covariance matrix \Sigma_{jm}. The mixture weights c_{jm} sum to 1.
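To make Eq. (3) concrete, here is a small Python sketch (our illustration, not part of the original system, which was built with HTK) that evaluates the log of a diagonal-covariance Gaussian mixture emission density for one state given a concatenated audio-visual observation vector.

```python
import numpy as np

def log_gmm_emission(o_t, weights, means, variances):
    """Log of b_j(o_t) = sum_m c_jm N(o_t; mu_jm, Sigma_jm) with diagonal Sigma.

    o_t       : (D,) concatenated audio-visual observation vector
    weights   : (M,) mixture weights c_jm, summing to 1
    means     : (M, D) mixture means mu_jm
    variances : (M, D) diagonal entries of the covariance matrices Sigma_jm
    """
    D = o_t.shape[0]
    # Per-mixture log Gaussian densities
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
    log_exp = -0.5 * (((o_t - means) ** 2) / variances).sum(axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the mixture components for numerical stability
    m = log_components.max()
    return m + np.log(np.exp(log_components - m).sum())
```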

3.2. The Multi-Stream HMM

We also used a multi-stream HMM with state log-likelihood recombination to perform audio-visual integration, since it allows for easy modeling of the reliability of the audio and visual streams. Two data streams were used to separately model the audio and visual information [3]. The state emission probability of the multi-stream HMM, b_j(o_t), is given by

    b_j(o_t) = \prod_{s \in \{a,v\}} \left[ \sum_{m=1}^{M_s} c_{jsm} \, N(o_{ts}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s} .                                   (4)

In (4), subscript j denotes a context-dependent state, M_s denotes the number of mixture components in stream s, c_{jsm} denotes the weight of the m-th mixture component of stream s, and N is a multivariate Gaussian with mean vector \mu_{jsm} and diagonal covariance matrix \Sigma_{jsm}. The nonnegative stream weights are denoted by \gamma_s, and they depend on the modality s. We assumed that the stream weights satisfy

    \gamma_a + \gamma_v = 1 .                                   (5)

In order to perform multi-stream HMM training, the stream weights must be chosen a priori. They were roughly optimized by minimizing the word error rate (WER) on a development data set.
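The following sketch (ours, with hypothetical names; the actual system used HTK's multi-stream models) shows how the state log-likelihood in Eq. (4) combines the per-stream GMM log-likelihoods using exponents gamma_a and gamma_v that sum to one, as required by Eq. (5).

```python
def combine_stream_loglikelihoods(log_b_audio, log_b_visual,
                                  gamma_a=0.7, gamma_v=0.3):
    """log b_j(o_t) for a state-synchronous multi-stream HMM state (Eqs. 4-5).

    log_b_audio, log_b_visual : per-stream GMM log-likelihoods,
                                i.e. log sum_m c_jsm N(o_ts; mu_jsm, Sigma_jsm)
    gamma_a, gamma_v          : stream weights with gamma_a + gamma_v = 1;
                                the paper tunes them on a development set,
                                the defaults here are arbitrary placeholders.
    """
    assert abs(gamma_a + gamma_v - 1.0) < 1e-9, "stream weights must sum to 1 (Eq. 5)"
    # Raising each stream likelihood to its weight becomes a weighted sum of logs.
    return gamma_a * log_b_audio + gamma_v * log_b_visual

# Example: an audio log-likelihood of -42.0 and a visual one of -18.5,
# combined with gamma_a = 0.8 and gamma_v = 0.2, give -37.3.
print(combine_stream_loglikelihoods(-42.0, -18.5, gamma_a=0.8, gamma_v=0.2))
```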

3.3. The Product HMM

Since visual speech activity precedes acoustic speech activity, there exists asynchrony between the audio and visual streams. The state-synchronous multi-stream HMMs described above do not allow for such asynchrony, so it is important to use an HMM topology that does. Multi-stream HMMs can be used to perform phone-level log-likelihood recombination by allowing asynchrony within a phone and forcing synchrony at the phone boundaries. Product HMMs can be used to model such a phone-synchronous audio-visual system [3, 12]. The topology of a product HMM is shown in Figure 5. The product HMM states have audio-visual emission probabilities of the form given in (4). We restrict the degree of asynchrony between the audio and visual streams to one state by excluding states 13 and 31 from the model shown in Figure 5; the final model therefore has seven states. The audio stream emission probabilities are tied along the same column and the visual stream emission probabilities are tied along the same row.

Figure 5. Topology of the product HMM (composite states 11-33 of a three-state audio model and a three-state video model, with states 13 and 31 excluded).
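As an illustration of this topology (a sketch under our own assumptions, not the authors' HTK implementation), the snippet below enumerates the composite product-HMM states for two three-state left-right phone models, drops the two states that would allow more than one state of audio-visual asynchrony, and records the emission-tying pattern described above.

```python
from itertools import product

def build_product_states(n_audio=3, n_video=3, max_asynchrony=1):
    """Composite states (i, j) of a product HMM for one phone.

    i indexes the audio-model state and j the video-model state (both 1-based).
    States with |i - j| > max_asynchrony (i.e. 13 and 31 for 3x3 models) are
    excluded, leaving 7 composite states.
    """
    states = [(i, j) for i, j in product(range(1, n_audio + 1),
                                         range(1, n_video + 1))
              if abs(i - j) <= max_asynchrony]
    # Emission tying: audio emission parameters are shared by composite states
    # with the same audio index i, visual ones by states with the same video index j.
    audio_tying = {s: s[0] for s in states}
    video_tying = {s: s[1] for s in states}
    return states, audio_tying, video_tying

states, audio_tying, video_tying = build_product_states()
print(len(states), states)   # 7 composite states
```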

4. SPEECH RECOGNITION EXPERIMENTS

The baseline ASR system was developed using the HTK toolkit, version 2.2. The experiments used the portion of the Bernstein database with the female speaker. Context-dependent phoneme models (triphones) were used as speech units. The HMMs used for the single-stream and state-synchronous multi-stream systems were left-right with 3 states, while product HMMs were used for the multi-stream phone-synchronous system. Iterative mixture splitting was performed to obtain the final 9-mixture triphone HMMs. Approximately 80% of the data was used for training, 18% for testing, and 2% as a development set for obtaining roughly optimized stream weights, word insertion penalty, and grammar scale factor. The bi-gram language model used for decoding was created from the transcriptions of the training data set, and its perplexity was approximately 40. The same training and testing procedures were used for both the audio-only and the audio-visual experiments.

To test the algorithm over a wide range of SNRs (0-30 dB), white Gaussian noise was added to the audio signals. All results were obtained using HMMs trained in matched conditions, i.e., by corrupting the training data with the same level of noise as used for corrupting the testing data. This approach was used in order to accurately measure the influence of the visual data on ASR performance. The A-ASR results are summarized in Figure 6. It can be observed that the ASR performance is severely affected by additive noise.
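The noise-corruption step is not detailed in the paper beyond "white Gaussian noise was added"; a minimal sketch of one common way to do it, scaling the noise to reach a target SNR in dB, is shown below (the function name and scaling convention are our own assumptions).

```python
import numpy as np

def add_white_gaussian_noise(speech, snr_db, rng=None):
    """Return speech corrupted by white Gaussian noise at the given SNR (dB).

    speech : 1-D float array of audio samples (e.g. a 16 kHz waveform)
    snr_db : desired signal-to-noise ratio in dB (0-30 dB in the experiments)
    """
    rng = rng or np.random.default_rng()
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise

# Matched-condition training corrupts training and test audio at the same SNR:
# noisy_train = add_white_gaussian_noise(train_wave, snr_db=10)
# noisy_test  = add_white_gaussian_noise(test_wave,  snr_db=10)
```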

Figure 6. WER (%) vs. SNR (dB) for the audio-only system and the audio-visual systems (early integration, late integration, and late integration with product HMMs).

4.1. Audio-visual speech recognition experiments

In all AV-ASR experiments the stream weights were estimated experimentally by minimizing the WER on the development data set. The AV-ASR results obtained with the different audio-visual information integration approaches are shown in Figure 6. As can be clearly seen, all AV-ASR systems perform considerably better than the A-ASR system for all SNR values. At the same time, the multi-stream HMM systems outperform the single-stream system for all values of SNR. The phone-synchronous AV-ASR system showed a performance improvement over the state-synchronous AV-ASR system, which justifies the use of product HMMs for information integration. The relative reduction in WER achieved by the phone-synchronous system, compared to the audio-only WER, ranges from 20% for noisy audio at an SNR of 30 dB to 21% at an SNR of 0 dB. It is important to point out that this considerable performance improvement was achieved although only two-dimensional visual features were used. Results may be further improved by better adjusting the audio and visual stream weights [6].

5. CONCLUSIONS

We evaluated several different approaches to audio-visual information integration using an AV-ASR system on a relatively large audio-visual database, over a wide range of noise levels. We used only two-dimensional visual feature vectors and obtained a considerable improvement in ASR performance for all noise levels tested. We built a product HMM AV-ASR system and tested its performance on the Bernstein database. This approach showed a speech recognition performance improvement over the other audio-visual integration methods used. We determined the improvement in ASR performance that can be obtained by exploiting the visual speech information contained in the FAPs that describe the outer-lip movement. We also plan to perform experiments on multi-speaker, very large vocabulary databases, such as [3], and to utilize new approaches for audio-visual information integration, such as [13].

6. REFERENCES

[1] R. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, no. 1, pp. 1-15, July 1997.
[2] D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines, Springer-Verlag, New York, 1996.
[3] C. Neti et al., "Audio-visual speech recognition," Tech. Rep., Johns Hopkins University, Baltimore, MD, 2000.
[4] C. Bregler and Y. Konig, "'Eigenlips' for robust speech recognition," in Proc. IEEE ICASSP, pp. 669-672, Adelaide, 1994.
[5] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. on Multimedia, vol. 2, no. 3, pp. 141-151, 2000.
[6] H. Glotin et al., "Weighting schemes for audio-visual fusion in speech recognition," in Proc. IEEE ICASSP, vol. 1, pp. 165-168, 2001.
[7] Text for ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.
[8] F. Lavagetto and R. Pockaj, "An efficient use of MPEG-4 FAP interpolation for facial animation at 70 bits/frame," IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 10, pp. 1085-1097, October 2001.
[9] L. E. Bernstein, Lipreading Corpus V-VI: Disc 3, Gallaudet University, Washington, D.C., 1991.
[10] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-visual speech recognition using MPEG-4 compliant visual features," EURASIP Journal on Applied Signal Processing, pp. 1213-1227, 2002.
[11] G. A. Abrantes, FACE - Facial Animation System, version 3.3.1, Instituto Superior Tecnico, 1997-98.
[12] S. Nakamura, "Fusion of audio-visual information for integrated speech processing," in Audio- and Video-Based Biometric Person Authentication (AVBPA), pp. 127-143, Halmstad, Sweden, June 2001.
[13] S. Nakamura, K. Kumatani, and S. Tamura, "Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition," in Proc. International Conference on Multimodal Interfaces, October 14-16, 2002.
[14] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. on Neural Networks, Special Issue on Intelligent Multimedia, vol. 13, no. 4, pp. 900-915, July 2002.
