Proceedings of the 6th National Conference on Fundamental and Applied Information Technology Research (FAIR) - Hue, June 20-21, 2013

A Comparison of Baseline Speech Recognition Performance under Reverberant Environments with CENSREC-4

Dinh Cuong Nguyen1, Suk-Nam Choi1, Hyun Yeol Chung1
1 Dept. of Information and Communication Engineering, Yeungnam University, Gyungsan, Gyungbuk, Korea
[email protected], [email protected], [email protected]

Abstract. The Corpus and Environment for Noisy Speech RECognition 4 (CENSREC-4) evaluation framework has been distributed for evaluating distant-talking speech under various reverberant environments. CENSREC-4 includes both real reverberant speech and simulated reverberant speech produced by convolving clean speech with impulse responses measured in the same environments. In this work, we compare the recognition performance on CENSREC-4 obtained with the Front-End processing of HTK (the Hidden Markov Model Toolkit) and of ETSI (the European Telecommunications Standards Institute toolkit). Experimental results show that, with the ETSI Front-End, the baseline recognition performance is improved by 0.45% under the clean condition and by 3.51% under multiple reverberant environment conditions.

Keywords: CENSREC-4, reverberant environment, speech recognition.

1 Introduction

Speech recognition systems have many applications in daily life. Most of today's applications still require a microphone located near the talker. However, almost all of these applications would benefit from distant-talking capture, where talkers are able to speak at some distance from the microphone without the encumbrance of hand-held or body-worn equipment. The major problem in distant-talking speech recognition is the corruption of speech signals by both interfering sounds and the reverberation caused by the large speaker-to-microphone distance. A range of successful techniques has been developed since the beginnings of speech recognition research to combat the additive and convolutional noise caused by interfering sounds and microphone mismatch. Consider a speech recognition system in a reverberant room, where speech is captured with one or more distant microphones. While traveling from the speaker's mouth to the microphones, the wavefront of the speech is repeatedly reflected; these reflections, perceived as reverberation, alter the acoustic characteristics of the original speech. The speech signals captured by the microphones are fed into the recognizer, whose processing steps are usually grouped into two broad units: the front end and the back end. The goal of reverberant speech recognition is to correctly transcribe speech corrupted by reverberation, regardless of the severity of the reverberation. Figure 1 illustrates a speech recognition system under reverberant environments.

Fig. 1. Speech recognition system in reverberant environments [3]

For research on speech recognition under reverberant environments, NII has provided a corpus called CENSREC-4, an evaluation framework for distant-talking speech under reverberant environments. It is an effective database for evaluating new dereverberation methods and their recognition performance, because traditional dereverberation processes and performance estimation criteria are not sufficient for improving and estimating recognition performance. CENSREC-4 is a first step in supporting research on effective algorithms for recognition under reverberation.

2 CENSREC-4 Corpus

CENSREC-4 is a corpus from NII that provides a framework for evaluating distant-talking speech under various reverberant environments. The data it contains are connected-digit utterances. The basic data sets were prepared as room-impulse-response-convolved speech data. Many room impulse responses were measured in real environments, so that speech recorded in various environments could be simulated by convolving a clean speech signal with a room impulse response. CENSREC-4 has impulse responses recorded in eight kinds of environments: an office, an elevator hall (the waiting area in front of an elevator), a car, a living room, a lounge, a Japanese-style room (a room with a tatami floor), a meeting room, and a Japanese-style bathroom (a prefabricated bath). The impulse responses were measured using the equipment and setup listed in Table 1. Figure 2 presents the microphone setting in all environments except those in the car and the Japanese-style bathroom.

Table 1. Recording equipment and conditions

Microphone:            SONY ECM-88B
Microphone amplifier:  PAVEC Thinknet MA-2016C
A/D board:             TOKYO ELECTRON DEVICE TD-BD-8CSUSB2.0
Mouth simulator:       B&K Type 4128
Speaker amplifier:     YAMAHA P4050
Sampling frequency:    48 kHz (downsampled to 16 kHz before convolving)
Quantization:          16 bits

Fig. 2. Recording setup for impulse responses in all environments except those in the car and in the Japanese-style bathroom

The microphone was positioned near the center of the space in all environments, except in the car and in the Japanese-style bathroom. For the car environment, we used a medium-sized sedan and positioned the mouth simulator on the driver's seat and the microphone on the sun visor; the mouth simulator and microphone were about 0.4 m apart. In the lounge environment, we positioned the microphone on a coffee table. For the bathroom environment, the mouth simulator was positioned over the bathtub, which was filled with cold water, and the microphone was placed on a side wall; the mouth simulator and the microphone were about 0.3 m apart. The reverberant speech was simulated by convolving the impulse responses with clean speech. The vocabulary of the simulated data included in CENSREC-4 consists of eleven Japanese number words: "ichi (1)", "ni (2)", "san (3)", "yon (4)", "go (5)", "roku (6)", "nana (7)", "hachi (8)", "kyu (9)", "zero (0)" and "maru (0)".
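To make the simulation procedure concrete, the following is a minimal sketch of how reverberant speech can be generated from a clean utterance and a measured room impulse response, as described above. The file names are hypothetical, and the resampling step reflects the Table 1 note that impulse responses were recorded at 48 kHz and downsampled to 16 kHz before convolution; this is an illustration under those assumptions, not the CENSREC-4 distribution tooling.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve, resample_poly

# Hypothetical file names; the clean utterance is assumed to be at 16 kHz.
fs_speech, clean = wavfile.read("clean_utterance.wav")
fs_rir, rir = wavfile.read("office_rir_48k.wav")  # RIR measured at 48 kHz

clean = clean.astype(np.float64)
rir = rir.astype(np.float64)

# Downsample the impulse response from 48 kHz to 16 kHz (Table 1).
rir = resample_poly(rir, up=1, down=3)

# Reverberant speech is the clean speech convolved with the room impulse
# response; truncate the convolution tail to the utterance length.
reverberant = fftconvolve(clean, rir)[: len(clean)]

# Rescale to the 16-bit range and save.
reverberant *= 32767.0 / np.max(np.abs(reverberant))
wavfile.write("reverberant_utterance.wav", fs_speech, reverberant.astype(np.int16))
```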

3 The Front-End of speech recognition processing

Front-End-based approaches, which aim at dereverberating corrupted feature vectors, can be categorized into three classes according to where the enhancement takes place in the chain of processing steps for feature extraction (see Figure 3). The first approach, called linear filtering, dereverberates time-domain signals or STFT coefficients. The second approach, called spectrum enhancement, dereverberates corrupted power spectra while ignoring the signal phases. The third approach, called feature enhancement, attempts to remove the effect of reverberation directly from the corrupted feature vectors. Detailed descriptions of various Front-End-based methods can be found in [5].
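As one small illustration of the feature-enhancement class (a standard technique, not a method evaluated in this paper), cepstral mean normalization subtracts the long-term cepstral mean of an utterance, which absorbs slowly varying channel and early-reverberation effects:

```python
import numpy as np

def cepstral_mean_normalization(features: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean from each cepstral coefficient.

    features: (num_frames, num_coeffs) matrix of feature vectors.
    """
    # Stationary convolutional distortion adds a near-constant offset in the
    # cepstral domain, so removing the utterance mean suppresses it.
    return features - features.mean(axis=0, keepdims=True)
```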


Fig. 3. Flow chart of Front-End feature extraction and the three Front-End-based approaches

The STFT of the reverberant signal y(t) is denoted by y_n[k], where n represents the index of a time frame. The log Mel-frequency filter bank feature elements and the corresponding feature vectors are denoted by y_{n,k} and y_n, respectively, and the MFCC vector by y_n^mfcc. Note that the variable k is used both when referring to a frequency bin and when referring to an element of a feature vector. The various representations of x(t) are defined similarly. Furthermore, j is used as the index for HMM states.
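The following is a minimal sketch of the Fig. 3 feature-extraction chain using the notation above: STFT coefficients y_n[k], the power spectrum |y_n[k]|^2, log Mel filter bank outputs y_{n,k}, and MFCCs obtained with a DCT. The frame and filter-bank sizes are illustrative assumptions (the analysis conditions actually used in the baseline appear in Section 4), and the triangular Mel filter construction is a common textbook variant rather than the exact HTK or ETSI implementation.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(fs, n_fft, n_mels):
    # Triangular filters spaced uniformly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def extract_features(y, fs=16000, frame_len=400, frame_shift=160,
                     n_fft=512, n_mels=24, n_ceps=12):
    # Pre-emphasis (1 - 0.97 z^-1), as in the analysis conditions of Sec. 4.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    frames = np.stack([y[i * frame_shift : i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    stft = np.fft.rfft(frames, n_fft)            # STFT coefficients y_n[k]
    power = np.abs(stft) ** 2                    # power spectrum |y_n[k]|^2
    fb = mel_filter_bank(fs, n_fft, n_mels)
    log_mel = np.log(power @ fb.T + 1e-10)       # log Mel features y_{n,k}
    mfcc = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]  # y_n^mfcc
    return log_mel, mfcc
```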

4 Experimental results

The training and testing data were prepared in the same way as those for CENSREC-1. The testing database was divided into two sets: Test set A (in an office, in an elevator hall, in a car, and in a living room) and Test set B (in a lounge, in a Japanese-style room, in a meeting room, and in a Japanese-style bathroom). There were a total of 4,004 utterances by 104 speakers (52 females and 52 males). The utterances for Test sets A and B were divided into four groups corresponding to the reverberant conditions; thus, each reverberant condition included 1,001 utterances. In CENSREC-1, the noise in Test set A was used for both the test set and the training set (known noises), whereas that in Test set B was used only for the test set (unknown noises). Similarly, the CENSREC-4 basic data sets also have two types of test sets: Test set A with known reverberant environments and Test set B with unknown reverberant environments. Two sets of training data were prepared, i.e., clean training and multi-condition training. There were a total of 8,440 utterances by 110 speakers (55 females and 55 males). For the multi-condition training data, four kinds of reverberation (in an office, in an elevator hall, in a car, and in a living room) were convolved with clean speech; thus, each reverberant condition included 2,110 utterances. The recording conditions of CENSREC-4 are shown in Figure 4.

In this work, we compare the baseline recognition performance of the Front-End between the HTK and ETSI toolkits. We specified six conditions for producing baseline scripts. The acoustic model set consisted of 18 phoneme models (/a/, /i/, /u/, /u:/, /e/, /o/, /N/, /ch/, /g/, /h/, /k/, /ky/, /m/, /n/, /r/, /s/, /y/, /z/), silence ('sil'), and a short pause ('sp'). Each phoneme model and 'sil' had five states (three emitting states), and 'sp' had three states (one emitting state). The output distribution of 'sp' was the same as that of the center state of 'sil'. Each state of the phoneme models had 20 Gaussian mixture pdfs, and each state of 'sil' and 'sp' had 36 Gaussian mixtures. The feature parameters of the baseline system were 39-dimensional feature vectors consisting of 12 MFCCs, 12 ΔMFCCs, 12 ΔΔMFCCs, log power, Δ power, and ΔΔ power. The analysis conditions were pre-emphasis (1 - 0.97 z^-1), a Hamming window, a 25-ms frame length, and a 10-ms frame shift. Tables 2 and 3 (explained below) show the recognition results for clean speech data and reverberant speech in the cases of clean and multi-condition training. The experimental results of speech recognition under the multi-environment condition are given in Table 2.
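As a rough sketch of how the 39-dimensional baseline feature vector described above can be assembled (12 MFCCs plus log power as the 13 static coefficients, with delta and delta-delta coefficients appended), the code below uses a standard regression formula for the deltas. The regression window width of 2 frames is an assumption; it mirrors a common HTK-style default but is not stated in the paper.

```python
import numpy as np

def deltas(feat: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta coefficients over a +/- width frame window."""
    n = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(t * t for t in range(1, width + 1))
    num = np.zeros_like(feat, dtype=np.float64)
    for t in range(1, width + 1):
        num += t * (padded[width + t : width + t + n]
                    - padded[width - t : width - t + n])
    return num / denom

def baseline_features(mfcc: np.ndarray, log_power: np.ndarray) -> np.ndarray:
    """mfcc: (N, 12); log_power: (N, 1). Returns an (N, 39) feature matrix:
    13 static + 13 delta + 13 delta-delta coefficients."""
    static = np.hstack([mfcc, log_power])   # 12 MFCCs + log power
    d = deltas(static)                      # delta coefficients
    dd = deltas(d)                          # delta-delta coefficients
    return np.hstack([static, d, dd])       # 39 dimensions in total
```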

Fig. 4. The environment recording conditions (panels: office room, conference room, in car, living room, class room, kitchen room, elevator, lounge room)

Table 2. Comparison of reverberant speech recognition performance with clean-condition training (Acc %)

Front-End      Condition      Office  EVhall  In-car  LivinR  Lounge  JPstyR  Conf.R  Bath.R  Avg
HTK baseline   Clean          99.48   99.37   99.52   99.35   99.48   99.37   99.52   99.35   99.43
HTK baseline   Reverberation  97.48   57.86   95.62   84.42   73.96   89.54   89.78   77.97   83.33
ETSI           Clean          99.54   99.52   99.70   99.44   99.54   99.52   99.70   99.44   99.55
ETSI           Reverberation  98.07   64.09   98.69   86.61   81.24   92.50   94.90   82.14   87.28

Table 3. Comparison of reverberant speech recognition performance with multi-condition training (Acc %)

Front-End      Condition      Office  EVhall  In-car  LivinR  Lounge  JPstyR  Conf.R  Bath.R  Avg
HTK baseline   Clean          96.38   96.58   96.24   95.99   96.38   96.58   99.52   96.24   96.74
HTK baseline   Reverberation  94.41   90.60   94.99   91.55   79.98   93.44   93.56   84.23   90.35
ETSI           Clean          97.39   98.10   97.02   97.56   97.39   98.10   97.02   97.56   97.52
ETSI           Reverberation  97.51   91.14   97.08   95.03   88.98   92.11   95.47   90.00   93.42


The experimental results show that the Front-End processing of the ETSI toolkit gives higher recognition performance than that of the HTK toolkit. With clean-condition training, the recognition performance of the ETSI Front-End increases by 0.12% on average for clean speech and by 3.95% on average for reverberant speech. It also gives good recognition performance with multi-condition training: the recognition performance of the ETSI Front-End is higher than that of the HTK Front-End by 0.78% for clean speech and by 3.07% for reverberant speech.

Acknowledgments. We wish to thank the members of the Speech Resources Consortium at the National Institute of Informatics (NII-SRC), Japan, for their generous assistance with our activities.

5 Conclusion

In this paper, we presented our experiments comparing the recognition performance of the ETSI and HTK Front-End processing. Experimental results showed that the recognition performance of the ETSI Front-End is better than that of the HTK baseline of NII: the baseline recognition performance is improved by 0.45% under the clean condition and by 3.51% under multiple reverberant environment conditions. In future work, we will apply adaptation methods under multiple reverberant environments to further improve speech recognition performance. By using adaptation methods with a small number of utterances, we expect to obtain recognition performance higher than the baseline.

References

1. Takahiro Fukumori, Takanobu Nishiura, Masato Nakayama, et al.: CENSREC-4: an evaluation framework for distant-talking speech recognition in reverberant environments. The Acoustical Society of Japan, 2011.
2. Takuya Yoshioka, Armin Sehr, Marc Delcroix, et al.: Survey on approaches to speech recognition in reverberant environments.
3. Takuya Yoshioka, Armin Sehr, Marc Delcroix, et al.: Making machines understand us in reverberant rooms. IEEE Signal Processing Magazine, 2012.
4. M. J. F. Gales, S. J. Young: Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech Audio Process., vol. 4, no. 5, pp. 352–359, 1996.
5. P. A. Naylor, N. D. Gaubitch (eds.): Speech Dereverberation. Springer-Verlag, Berlin, 2010.
6. B. E. D. Kingsbury, N. Morgan, S. Greenberg: Robust speech recognition using the modulation spectrogram. Speech Commun., vol. 25, no. 1–3, pp. 117–132, 1998.
7. T. H. Falk, W.-Y. Chan: Modulation spectral features for robust far-field speaker identification. IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 1, pp. 90–100, 2010.
8. S. Ganapathy, J. Pelecanos, M. K. Omar: Feature normalization for speaker verification in room reverberation. In: Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 4836–4839, 2011.
9. The Hidden Markov Model Toolkit, http://htk.eng.cam.ac.uk/
10. Aurora speech recognition experimental framework, http://aurora.hsnr.de/aurora-4.html

Comparison of Baseline Speech Recognition Performance in Reverberant Environments with CENSREC-4

Dinh Cuong Nguyen1, Suk-Nam Choi1, Hyun Yeol Chung1
1 Dept. of Information and Communication Engineering, Yeungnam University, Gyungsan, Gyungbuk, Korea
[email protected], [email protected], [email protected]

Abstract. CENSREC-4 (The Corpus and Environment for Noisy Speech RECognition 4) is a speech data standard developed for evaluating speech recognition performance in various reverberant environments, in particular under distant-talking conditions where there is a distance between the speaker and the microphone. CENSREC-4 includes speech recorded in real environments and simulated reverberant speech (created by convolving an impulse response of a reverberant environment with clean speech). In this study, the authors investigated the baseline recognition performance for CENSREC-4 using the Front-End processing tools of HTK (the Hidden Markov Model Toolkit) and the ETSI (European Telecommunications Standards Institute) standard. Experimental results show that the ETSI Front-End gives higher recognition performance than the results published by the corpus producer: the baseline recognition performance is improved by 0.45% for clean speech and by 3.51% for reverberant speech across different recognition environments.

Keywords: CENSREC-4, reverberant environments, speech recognition.