Noise Robustness of Tract Variables and their Application to Speech Recognition

Vikramjit Mitra (1), Hosung Nam (2), Carol Espy-Wilson (1), Elliot Saltzman (2,3), Louis Goldstein (2,4)

(1) Department of Electrical and Computer Engineering, University of Maryland, USA
(2) Haskins Laboratories, New Haven, USA
(3) Department of Physical Therapy and Athletic Training, Boston University, USA
(4) Department of Linguistics, University of Southern California, USA

[email protected], [email protected], [email protected], [email protected], [email protected]

Note: The first two authors contributed equally to this paper and are listed in alphabetical order.

Abstract

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates its role in noise-robust speech recognition. We implemented a simple direct inverse model, using a feed-forward artificial neural network, to estimate vocal tract variables (TVs) from the speech signal. We first trained the model on clean synthetic speech and then tested its noise robustness on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as the corresponding TVs. Eight vocal tract constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], and tongue body [TBCL]). We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as a preprocessor for TV estimation, to observe its effect on noise robustness. Kalman smoothing was applied to the estimated TVs to reduce estimation noise. Finally, the TV estimation module was tested on naturally produced speech contaminated with noise at different signal-to-noise ratios. The TVs estimated from the natural speech corpus were then used in conjunction with the baseline features in automatic speech recognition (ASR) experiments. Results on the Aurora-2 dataset show average improvements over the baseline of 22% and 21% in ASR performance for car and subway noise, respectively, with the TVs estimated from MPO-enhanced speech.

Index Terms: gestural phonology, neural networks, noise-robust speech recognition, speech inversion, Kalman smoothing.

1. Introduction

Speech inversion, or acoustic-to-articulatory inversion of speech, has been a widely researched topic over the last 35 years. Various factors have stimulated research in this area, the most prominent being the failure of current state-of-the-art phone-based automatic speech recognition (ASR) systems on spontaneous or casual speech. There are several strong arguments for considering articulatory information in ASR systems. First, it may help to model coarticulation and reduction in a more systematic way. Second, according to Kirchhoff et al. [3], articulatory information is more robust to speaker variation and signal distortion. Finally, in [3, 4] it was demonstrated that articulatory features can significantly improve the performance of an ASR system in noisy environments, and that this effectiveness increases with a decrease in the signal-to-noise ratio (SNR).

Most current speech inversion research has been confined to data acquired from ElectroMagnetic Mid-sagittal Articulography (EMMA) or ElectroMagnetic Articulography (EMA) [5]. A large collection of such data is available from the MOCHA [6] and Microbeam [7] databases. These data are usually represented by the Cartesian coordinate displacements of transducers (pellets) placed on the articulators. Such articulatory data may be problematic, since they are often contaminated with measurement noise and may be inconsistent due to changes in transducer positions during the measurement procedure. Furthermore, the use of pellet information can aggravate the non-uniqueness of the mapping between the acoustic waveform and its corresponding articulatory configurations: one articulatory configuration can correspond to several different pellet locations in a Cartesian coordinate system [8]. For these reasons, we use vocal tract variables (TVs) rather than pellet data. Instead of absolute measures, the TVs, which specify constriction degree and location, are based on relative measures; consequently, the non-uniqueness in the articulatory data can be expected to be reduced [8]. The TVs used in this study were generated by the TAsk Dynamics Application model (TADA) [1]. Since these data were generated synthetically, they are completely free from measurement noise and inconsistencies.

A major motivation for the current study is our belief that an overlapping gesture-based architecture inspired by articulatory phonology [9, 10] can overcome many of the intrinsic limitations of phone-based units in addressing coarticulation. In articulatory phonology, the articulatory constriction gesture is an invariant action unit, and an utterance can be decomposed into a constellation of such gestures [9, 10], where intra- and inter-gestural temporal overlap accounts for acoustic variation in speech. To realize such an ASR system, estimation of the TVs may help in recovering the gestural information more accurately from the speech signal. Hence, TV estimation may be an important component of such a system, and its noise robustness should be a correlate of the noise robustness of the gestural estimates and, in turn, of the ASR performance.

This study does not exclusively deal with the non-uniqueness of speech inversion. It has been observed [11, 12] that a static solution for speech inversion suffers heavily from the non-uniqueness issue. It has been proposed that incorporating dynamic information about the acoustic data may help to disambiguate points of instantaneous one-to-many mappings, although doing so increases the difficulty of the non-linear mapping problem [11, 12].


In our approach, we use temporal context information to reduce the severity of the non-uniqueness. We implemented a simple 3-hidden-layer feed-forward artificial neural network (ANN) that estimates the TVs given acoustic features with contextual information as input. The configuration of the network, as well as the contextual span of the acoustic features, is based on our previous experimental results [13]. The network is trained with clean synthetic data obtained from TADA and is evaluated against noise-corrupted (car and subway noise) versions of the synthetic data set at various SNRs. In a parallel study, the noise-corrupted speech was first enhanced by an MPO-APP based speech enhancement algorithm [14], and the enhanced speech was then used to estimate the TVs. The network was then used to generate TVs, both with and without initial speech enhancement, for the car and subway noise sections of the Aurora-2 [15] dataset, and the estimated TVs were used in conjunction with the baseline features to test whether the TVs improve the accuracy of the ASR system in noisy conditions.

2. The Data



To develop the module for estimating the TVs, we randomly selected 960 utterances from the Aurora-2 training data. The digit sequence, mean pitch, and gender information from each utterance were input to TADA, which then generated the TVs (see Table 1), the vocal tract area function, and formant information for each utterance. Finally, the pitch and formant information were input to HLSyn, which generated the synthetic waveforms. Of the 960 files, 690 were used for training the ANN model and the remaining 270 were used for testing. The test files were further corrupted with subway and car noise at the same SNR levels as used in Aurora-2.
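Aurora-2 ships with its own noise-addition tools; purely to illustrate the SNR arithmetic involved, a minimal sketch follows (the function and the energy-based scaling are our simplification, not the Aurora-2 procedure, which also applies channel filtering before mixing):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so that the clean-speech-to-noise energy ratio
        equals snr_db, then add it to the clean waveform. A generic
        sketch of the SNR computation only."""
        speech = np.asarray(speech, dtype=float)
        noise = np.asarray(noise, dtype=float)[: len(speech)]  # trim to length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + scale * noise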

Table 1. Constriction organs, vocal tract variables (TVs), and involved model articulators.

Constriction organ | Tract variables (TVs)                                                          | Model articulators
Lip                | Lip Aperture (LA); Lip Protrusion (LP)                                         | upper/lower lip, jaw
Tongue Tip         | Tongue tip constriction degree (TTCD); Tongue tip constriction location (TTCL) | tongue body, tongue tip, jaw
Tongue Body        | Tongue body constriction degree (TBCD); Tongue body constriction location (TBCL) | tongue body, jaw
Velum              | Velum (VEL)                                                                    | velum
Glottis            | Glottis (GLO)                                                                  | glottis

Figure 1. RMSE (in mm) of the estimated TBCD TV at different SNRs (20 dB to -5 dB) for subway noise, for the four processing conditions (No Process, No Process + Kalman, MPOAPP, MPOAPP + Kalman).

In the second part of this study, the TV estimation module developed from synthetic data was used to generate the TVs for all of the naturally-produced clean utterances in the Aurora-2 training data. In addition, TVs were estimated for the test sets that were corrupted with car and subway noise.

3. The TV estimator and its noise robustness

Several researchers [16, 17, 18] have studied the potential of ANNs for speech inversion. Compared to other architectures, ANNs have a lower computational cost, both in terms of memory and execution speed [18]. In ANN-based direct inverse models, instantaneous non-uniqueness has been dealt with by using a suitable contextual window to exploit dynamic information in the input space. In this research, we used 13 Mel-frequency cepstral coefficients (MFCCs) as the acoustic feature, with an analysis window of 10 ms and a frame advance of 5 ms. The MFCCs were mean-subtracted, normalized by five times the standard deviation, and scaled to lie in the range -0.95 to +0.95. Based on our observations in [13], we selected a contextual window of 8 frames before and 8 frames after the current frame, with a frame shift of 2 (a time shift of 10 ms) between frames, giving an input vector of size (2*8+1)*13 = 221. The TVs were similarly normalized and scaled. The feed-forward network specification is also based on [13]: the numbers of processing elements (nodes) in the three hidden layers were 150, 100, and 150. The ANN was trained with the back-propagation algorithm for 5000 epochs, using scaled conjugate gradient as the optimization rule, and a tan-sigmoid function was used as the activation for all layers.

The estimation of the TVs from the noisy synthetic data was done in four different ways: (1) the noisy files were used as-is to estimate the TVs ("No Process TV"); (2) the noisy files were used as-is and the estimated TVs were smoothed with a Kalman smoother ("No Process + Kalman TV"); (3) the noisy files were first enhanced and TV estimation was then performed ("MPOAPP TV"); and (4) the noisy files were enhanced as in (3) and the estimated TVs were processed with Kalman smoothing ("MPOAPP + Kalman TV"). Figure 1 plots the root mean square error (RMSE) and Figure 2 the correlation of the estimated TBCD TV for all four cases at different SNR levels of subway noise (TBCD was chosen arbitrarily for the figures; the other TVs showed very similar characteristics).

It can be observed from Figures 1 and 2 that the RMSE of the estimated TBCD increases with decreasing SNR and, as a consequence, the correlation of the estimated tract variable with the ground truth decreases. However, at a given SNR, the correlation is higher (and the RMSE lower) for the TV estimated after MPOAPP enhancement than for the TV estimated without any enhancement; thus, enhancement improves TV estimation. Kalman smoothing is also found to help the estimation compared to not using it at all. Figure 3 shows the estimated TVs for the synthetic utterance 'two five', using no processing with Kalman smoothing (clean condition) and MPOAPP enhancement with Kalman smoothing (15 dB subway noise).
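The contextual input described above can be assembled in a few lines; the sketch below assumes the 13 MFCCs per frame are already computed, and the function names are ours, not part of any toolkit:

    import numpy as np

    def contextualize(mfcc, n_ctx=8, shift=2):
        """Stack n_ctx context frames on each side of the current frame,
        sampled every `shift` frames (10 ms at the 5 ms frame rate),
        giving a (2*8+1)*13 = 221-dimensional vector per frame."""
        n_frames, _ = mfcc.shape
        pad = n_ctx * shift
        # Replicate edge frames so that every frame has a full context.
        padded = np.pad(mfcc, ((pad, pad), (0, 0)), mode="edge")
        offsets = range(-pad, pad + 1, shift)
        return np.hstack([padded[pad + k : pad + k + n_frames] for k in offsets])

    def normalize(x, lim=0.95):
        """Mean-subtract, divide by five standard deviations, and limit
        to the range [-0.95, +0.95], as described above."""
        z = (x - x.mean(axis=0)) / (5.0 * x.std(axis=0))
        return np.clip(z, -lim, lim)

For an (n_frames, 13) MFCC array, contextualize(normalize(mfcc)) yields the (n_frames, 221) network input.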

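For concreteness, a hypothetical restatement of the network itself. Keras is used here purely as illustration: the original back-propagation training with scaled conjugate gradient over 5000 epochs has no direct Keras equivalent, so the Adam optimizer is substituted:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_tv_estimator(input_dim=221, n_tvs=8):
        """Sketch of the 3-hidden-layer 150-100-150 feed-forward network
        with tan-sigmoid units, mapping the contextual MFCC vector to
        the eight TVs."""
        model = keras.Sequential([
            layers.Input(shape=(input_dim,)),
            layers.Dense(150, activation="tanh"),
            layers.Dense(100, activation="tanh"),
            layers.Dense(150, activation="tanh"),
            # Targets are TVs scaled to [-0.95, +0.95], so tanh outputs fit.
            layers.Dense(n_tvs, activation="tanh"),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model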

Figure 2. Correlation of the estimated TBCD TV at different SNRs for subway noise, for the same four processing conditions as Figure 1.

Figure 3. Spectrogram of the synthetic utterance 'two five', along with the ground-truth and estimated (at clean and at 15 dB subway noise after MPOAPP) TVs for GLO, TBCL, TTCL and TTCD.

Figure 4. Average correlation and normalized RMSE of the estimated "No Process TVs" for car noise at different SNRs (the RMSE and correlation are measured with respect to the "No Process TVs" estimated in the clean condition).
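Conditions (2) and (4) above apply Kalman smoothing to the raw TV trajectories. The paper does not specify the smoother's state-space model; the sketch below assumes a scalar random-walk model for each TV and applies a standard Rauch-Tung-Striebel forward-backward pass, with q and r as hypothetical noise variances to be tuned:

    import numpy as np

    def kalman_smooth(y, q=1e-4, r=1e-2):
        """RTS smoother for one TV trajectory under the assumed
        random-walk model x_t = x_{t-1} + w_t. q and r are the
        process- and observation-noise variances."""
        n = len(y)
        xp = np.empty(n); pp = np.empty(n)   # predicted mean / variance
        xf = np.empty(n); pf = np.empty(n)   # filtered mean / variance
        x, p = y[0], r                       # simple initialisation
        for t in range(n):
            xp[t], pp[t] = x, p + q          # predict
            k = pp[t] / (pp[t] + r)          # Kalman gain
            x = xp[t] + k * (y[t] - xp[t])   # measurement update
            p = (1.0 - k) * pp[t]
            xf[t], pf[t] = x, p
        xs = xf.copy()                       # backward (smoothing) pass
        for t in range(n - 2, -1, -1):
            g = pf[t] / pp[t + 1]
            xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
        return xs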

The TV estimation module, which was trained on clean synthetic data, was then used on natural speech. Since no ground truth is known for the TVs of this database, the RMSE and the correlation cannot be computed directly. However, we can compare the TVs estimated at different noise levels with those estimated in the clean condition, and report relative RMSE and correlation values. Figure 4 shows that the relative (normalized) RMSE increases and the correlation decreases as the noise level rises. Figure 5 shows the estimated TVs for GLO, TBCL, TTCL and TTCD for the utterance 'two five' from the Aurora-2 database; TV estimation was performed after MPOAPP enhancement, followed by Kalman smoothing.

Figure 5. Spectrogram of the natural utterance 'two five' from Aurora-2, along with the reconstructed TVs (at clean and at 15 dB car noise after MPOAPP) for GLO, TBCL, TTCL and TTCD.
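The RMSE and correlation reported in Figures 1, 2, and 4 reduce to simple per-trajectory computations, sketched here for reference (for Figure 4, the reference trajectory is the clean-condition estimate rather than a ground truth):

    import numpy as np

    def tv_rmse(est, ref):
        """RMSE between an estimated and a reference TV trajectory
        (in the units of the TV, e.g. mm for TBCD in Figure 1)."""
        return np.sqrt(np.mean((np.asarray(est) - np.asarray(ref)) ** 2))

    def tv_correlation(est, ref):
        """Pearson correlation between the two trajectories."""
        return np.corrcoef(est, ref)[0, 1]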

4. ASR experiments

The estimated TVs were concatenated with the 39-dimensional MFCCs to perform the word recognition task on the Aurora-2 database. The subway and car noise subsets of test set A of Aurora-2 were used. TVs were estimated for the training set as well, and the ASR experiment was based on training on clean data and testing on multi-SNR noisy data. The recognizer trained with the Aurora-2 dataset was used for the ASR experiment: the back end uses eleven whole-word HMMs, each with 16 states and three Gaussian mixtures per state. Two pause models, "sil" and "sp", were used; the "sil" model has three states with six mixtures per state, and the "sp" model has a single state. The training and testing scripts provided with the Aurora-2 database were used, with minor modifications to accommodate testing only on car and subway noise and the change in feature dimension due to augmentation with the TVs. Figures 6 and 7 show the word recognition accuracy for car and subway noise using TV estimation with and without MPOAPP enhancement, both with Kalman smoothing post-processing. From the plots it is evident that the TVs helped to improve recognition rates over the baseline; the improvement is even larger when the TVs were estimated after MPOAPP enhancement of the noisy data.
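The feature augmentation itself is a frame-wise concatenation that raises the feature dimension from 39 to 47; a trivial sketch with illustrative array names:

    import numpy as np

    def augment(mfcc_39, tvs_8):
        """Concatenate the 39-dimensional baseline MFCC vector with the
        8 estimated TVs for each frame, giving a 47-dimensional feature."""
        assert mfcc_39.shape[0] == tvs_8.shape[0], "frame counts must match"
        return np.concatenate([mfcc_39, tvs_8], axis=1)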


Figure 6. Recognition accuracy (%) for car noise at different SNRs, using the TVs along with the baseline MFCC features (baseline; baseline + TVs (no-process + Kalman); baseline + TVs (MPOAPP + Kalman)).

Figure 7. Recognition accuracy (%) for subway noise at different SNRs, using the TVs along with the baseline MFCC features (same three systems as Figure 6).

5. Conclusion and Discussion

Two main tasks were performed in this research. The first involved evaluating the noise robustness of the TV estimation procedure; in the second, the estimated TVs were used in conjunction with the MFCCs to test whether they improve ASR noise robustness. It was observed that a decrease in SNR resulted in an increase in the RMSE of the estimated TVs and a simultaneous decrease in their correlation with the ground truth. A speech enhancement algorithm was used to clean the noisy speech, and the enhanced speech was in turn used for TV estimation. Using such a speech enhancement technique reduced the RMSE and simultaneously improved the correlation at a given SNR, and this trend was fairly consistent, as observed in Figures 1 and 2. The ASR experiments show that use of the TVs improved noise robustness, and the TVs obtained from the MPOAPP-enhanced speech yielded better ASR performance than those obtained without any processing. These results suggest that the TVs, if estimated properly, can help to improve the noise robustness of ASR systems. In this paper we used a direct inverse model to estimate the TVs; moreover, the inverse model was trained with a significantly smaller amount of data than is available in the Aurora-2 training database. The results are thus a preliminary study showing that tract variables can contribute positively toward improving the noise robustness of ASR systems. Future research should address using more training data for the TV estimator. In addition, using articulatory data to define a ground truth for spontaneous-speech TVs would greatly help spontaneous-speech TV estimation and would contribute more positively toward improving ASR noise robustness.

Acknowledgements

This research was supported by NSF Grants IIS0703859, IIS0703048, and IIS0703782.


6. References

[1] H. Nam, L. Goldstein, E. Saltzman, and D. Byrd, "TADA: An enhanced, portable task dynamics model in MATLAB", J. Acoust. Soc. Am., Vol. 115, No. 5, p. 2430, 2004. http://www.haskins.yale.edu/tada_download/index.html
[2] O. Deshmukh, C. Espy-Wilson, and L. H. Carney, "Speech Enhancement Using the Modified Phase Opponency Model", J. Acoust. Soc. Am., Vol. 121, No. 6, pp. 3886-3898, 2007.
[3] K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition", Speech Comm., Vol. 37, pp. 303-319, 2000.
[4] K. Kirchhoff, "Robust Speech Recognition Using Articulatory Information", PhD Thesis, University of Bielefeld, 1999.
[5] J. Ryalls and S. J. Behrens, "Introduction to Speech Science: From Basic Theories to Clinical Applications", Allyn & Bacon, 2000.
[6] A. A. Wrench and H. J. William, "A multichannel articulatory database and its application for automatic speech recognition", in Proc. 5th Seminar on Speech Production: Models and Data, pp. 305-308, Bavaria, 2000.
[7] J. Westbury, "X-ray microbeam speech production database user's handbook", University of Wisconsin, 1994.
[8] R. S. McGowan, "Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: preliminary model tests", Speech Comm., Vol. 14, No. 1, pp. 19-48, 1994.
[9] C. Browman and L. Goldstein, "Articulatory Gestures as Phonological Units", Phonology, Vol. 6, pp. 201-251, 1989.
[10] C. Browman and L. Goldstein, "Articulatory Phonology: An Overview", Phonetica, Vol. 49, pp. 155-180, 1992.
[11] S. Dusan, "Statistical Estimation of Articulatory Trajectories from the Speech Signal Using Dynamical and Phonological Constraints", PhD Thesis, Dept. of Electrical and Computer Eng., University of Waterloo, Waterloo, Canada, April 2000.
[12] S. Dusan, "Methods for Integrating Phonetic and Phonological Knowledge in Speech Inversion", in Proc. International Conference on Speech, Signal and Image Processing (SSIP'01), pp. 194-200, Malta, September 2001.
[13] V. Mitra, H. Nam, and C. Espy-Wilson, "A Step in the Realization of a Speech Recognition System based on Gestural Phonology and Landmarks", J. Acoust. Soc. Am., Vol. 125, p. 2577, 2008.
[14] V. Mitra, B. Borgstrom, C. Espy-Wilson, and A. Alwan, "A noise-type and level-dependent MPO-based speech enhancement architecture with variable frame analysis for noise-robust speech recognition", under review.
[15] H. G. Hirsch and D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions", in Proc. ISCA ITRW ASR2000, pp. 181-188, Paris, France, 2000.
[16] G. Papcun, J. Hochberg, T. R. Thomas, F. Laroche, J. Zachs, and S. Levy, "Inferring articulation and recognizing gestures from acoustics with a neural network trained on X-ray microbeam data", J. Acoust. Soc. Am., Vol. 92, No. 2, pp. 688-700, 1992.
[17] J. Zachs and T. R. Thomas, "A new neural network for articulatory speech recognition and its application to vowel identification", Computer Speech and Language, Vol. 8, pp. 189-209, 1994.
[18] K. Richmond, "Estimating Articulatory Parameters from the Acoustic Speech Signal", PhD Thesis, University of Edinburgh, 2001.
