Eurospeech 2001 - Scandinavia
STATISTICAL SOUND SOURCE IDENTIFICATION IN A REAL ACOUSTIC ENVIRONMENT FOR ROBUST SPEECH RECOGNITION USING A MICROPHONE ARRAY

Takanobu Nishiura†‡, Satoshi Nakamura†, and Kiyohiro Shikano‡
† ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0288 Japan
‡ Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, 630-0101 Japan

ABSTRACT

It is very important for a hands-free speech interface to capture distant-talking speech with high quality. A microphone array is an ideal candidate for this purpose. However, this approach requires localizing the target talker. Conventional talker localization methods in multiple sound source environments not only have difficulty localizing the multiple sound sources accurately, but also have difficulty localizing the target talker among known multiple sound source positions. To cope with these problems, we propose a new talker localization method consisting of two algorithms. One algorithm is for multiple sound source localization based on CSP (Cross-power Spectrum Phase) analysis. The other is for sound source identification among the localized multiple sound sources, towards talker localization. In this paper, we particularly focus on the latter: statistical sound source identification among localized multiple sound sources with statistical speech and environmental sound models based on GMMs (Gaussian Mixture Models) and a microphone array, towards talker localization.

1. INTRODUCTION

For teleconference systems or voice control systems, high-quality sound capture of distant-talking speech is very important. However, background noise and room reverberation seriously degrade the sound capture quality in real acoustical environments. A microphone array is an ideal candidate for capturing distant-talking speech. With a microphone array, the desired speech signals can be acquired selectively by sensitively steering the microphone array toward the desired speech direction [1]. Accordingly, the microphone array is often used for the front-end processing of ASR (Automatic Speech Recognition) at present [2, 3]. However, to achieve high-quality sound capture of distant-talking speech, talker localization is necessary to steer the microphone array. Conventional talker localization methods in multiple sound source environments not only have difficulty localizing the multiple sound sources accurately, but also have difficulty localizing the target talker among known sound source positions. To cope with these problems, we propose a new talker localization method consisting of two algorithms. One algorithm is for multiple sound source localization. The other is for statistical sound source identification among the localized multiple sound sources, towards talker localization. We have already proposed a multiple sound source localization algorithm with the CSP (Cross-power Spectrum Phase) coefficient addition method [4]. In this paper, we particularly focus on sound source identification among the localized multiple sound sources with statistical speech and environmental sound models based on GMMs (Gaussian Mixture Models) and a microphone array, towards talker localization. If the talker can be localized, his/her speech can then be recognized.
Figure 1: Capture of two sound signals (speech and non-speech) with a microphone array.

2. TALKER LOCALIZATION ALGORITHM

As shown in Figure 1, we assume that the desired speech comes from the front direction and undesired noise comes from the right direction. In this situation, talker localization is necessary for effectively capturing and accurately recognizing distant-talking speech using a microphone array. However, conventional talker localization methods in multiple sound source environments not only have difficulty localizing multiple sound sources accurately, but also have difficulty localizing the desired talker among the localized sound source positions. Accordingly, we propose a new talker localization method, as shown in Figure 2. First, multiple sound sources are localized with the CSP coefficient addition method after the multiple sound signals are captured. Then, the localized sound signals are enhanced by steering the microphone array toward them. Finally, the talker is localized by identifying each localized and enhanced sound signal as "speech" or "non-speech" using statistical speech and environmental sound models. The system recognizes the input from a sound source identified as "speech". We have already proposed a multiple sound source localization algorithm with the CSP coefficient addition method [4]. In this paper, we particularly focus on sound source identification using statistical speech and environmental sound models and microphone array steering among the localized multiple sound sources, towards talker localization. If the talker can be localized, his/her speech can then be recognized.
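For concreteness, the following is a minimal sketch of CSP analysis for a single microphone pair, written in Python with NumPy. It is not the paper's implementation: the CSP coefficient addition over multiple microphone pairs used in [4] to sharpen the peaks of multiple sources is omitted, and the function names, FFT length, and sound-speed value are illustrative assumptions.

```python
import numpy as np

def csp_coefficients(x1, x2, n_fft=1024):
    """CSP (cross-power spectrum phase) coefficients for one microphone pair.

    The peak of the returned sequence indicates the time difference of
    arrival (TDOA) between the two channels.
    """
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    # Phase transform: keep only the phase of the cross-power spectrum.
    csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n_fft)
    return np.fft.fftshift(csp)  # put lag 0 at the center for readability

def estimate_direction(x1, x2, fs, mic_spacing, c=340.0, n_fft=1024):
    """Estimate a direction-of-arrival angle (degrees from broadside) for one pair."""
    csp = csp_coefficients(x1, x2, n_fft)
    lag = np.argmax(csp) - n_fft // 2          # peak lag in samples (sign is convention-dependent)
    tdoa = lag / fs                            # time difference of arrival in seconds
    sin_theta = np.clip(tdoa * c / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

In the coefficient addition method of [4], CSP sequences from several microphone pairs are summed so that peaks belonging to genuine sources reinforce each other, which is what makes multiple sound source localization feasible.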
Figure 2: Talker localization algorithm overview (signal capture with a microphone array, sound source localization, microphone array steering, speech / non-speech identification, talker localization, and automatic speech recognition).

2.1. Microphone array steering for speech enhancement

Microphone array steering is necessary to capture distant signals effectively. In this paper, a delay-and-sum beamformer [1] is used to steer the microphone array toward the desired sound direction. The localized sound signals are enhanced by microphone array steering because the delay-and-sum beamformer can form directivity toward the localized sound sources.
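As an illustration of this step, here is a minimal delay-and-sum sketch for a uniform linear array, assuming plane-wave arrival and applying fractional delays in the frequency domain. The default spacing matches Table 1, but the function and parameter names are mine, not the paper's.

```python
import numpy as np

def delay_and_sum(signals, fs, steer_deg, spacing=0.0283, c=340.0):
    """Delay-and-sum beamformer for a uniform linear array.

    signals: (n_mics, n_samples) array; steer_deg: steering angle measured
    from broadside (0 deg = straight ahead of the array).
    """
    n_mics, n_samples = signals.shape
    positions = np.arange(n_mics) * spacing
    # Relative arrival times of a plane wave coming from the steering direction.
    arrival = positions * np.sin(np.radians(steer_deg)) / c
    # Compensating delays align the desired direction across all channels.
    comp = arrival.max() - arrival
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for ch, tau in zip(signals, comp):
        # Apply the fractional delay as a linear phase shift, then accumulate.
        out += np.fft.irfft(np.fft.rfft(ch) * np.exp(-2j * np.pi * freqs * tau), n_samples)
    return out / n_mics
```

Steering the array toward each localized direction in this way enhances that source before the speech / non-speech identification described next.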
Figure 3: Experimental environment.

Table 1: Recording conditions
Microphone array: 14 transducers, 2.83 cm spacing
Sampling frequency: 16 kHz
Quantization: 16 bit
Impulse response: RWCP-DB [7, 8]
Reverberation time T[60]: 0.0, 0.3, and 1.3 sec.
SNR: -5 dB to 30 dB, and clean
2.2. Statistical sound source identification based on GMMs

The multiple sound signals are captured effectively and enhanced by microphone array steering. Therefore, the talker can be localized by identifying the enhanced multiple sound signals. Until now, a speech model alone has usually been used for speech / non-speech segmentation [5] or identification. However, a single speech model not only requires a threshold to decide between "speech" and "non-speech", but also degrades the identification performance in noisy, reverberant environments. To overcome these problems, we propose a new speech / non-speech identification algorithm that uses statistical speech and environmental sound GMMs. The multiple sound signals enhanced by microphone array steering are identified by Equation (1):

λ̂ = argmax_{λ ∈ {λ_s, λ_n}} P(S(ω) | λ),    (1)
where S(ω) is the signal enhanced by microphone array steering (in the frequency domain), λ_s is the statistical speech model, and λ_n is the statistical environmental sound model. The enhanced signals are identified as "speech" or "non-speech" by selecting the model with the maximum likelihood in Equation (1). This allows the talker to be localized among the localized multiple sound sources.

2.2.1. Speech and environmental sound database

Numerous sound sources are necessary to design the speech and environmental sound GMMs. Therefore, we use the ATR speech database (ATR-DB) [6] to design the speech model and the RWCP (Real World Computing Partnership) sound scene database (RWCP-DB) [7, 8], which includes various environmental sounds, to design the non-speech model. RWCP-DB also includes numerous impulse responses measured in various acoustical environments. These impulse responses are used to conduct evaluation experiments in various acoustical environments.
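Returning to the decision rule in Equation (1), the following is a minimal sketch using scikit-learn's GaussianMixture; the library and helper names are my choices for illustration. For simplicity a single feature stream per class is modeled, whereas Table 2 specifies separate mixture counts for MFCC, ∆MFCC, and ∆power.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(speech_feats, nonspeech_feats, n_mix=4):
    """Fit one GMM per class on frame-level feature vectors (frames x dims)."""
    speech_gmm = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(speech_feats)
    nonspeech_gmm = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(nonspeech_feats)
    return speech_gmm, nonspeech_gmm

def identify(segment_feats, speech_gmm, nonspeech_gmm):
    """Equation (1): pick the model with the higher likelihood for the enhanced segment."""
    ll_speech = speech_gmm.score(segment_feats)        # mean log-likelihood per frame
    ll_nonspeech = nonspeech_gmm.score(segment_feats)
    return "speech" if ll_speech > ll_nonspeech else "non-speech"
```

Because the decision directly compares the likelihoods of the two models, no threshold is needed, which is the point made later in the comparison with the speech-GMM-only baseline.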
3. EVALUATION EXPERIMENTS

We first evaluate the basic performance of statistical sound modeling through environmental sound recognition. Then, the sound source identification performance is evaluated for distant-talking robust speech recognition.

3.1. Preliminary experiment: environmental sound recognition

Environmental sound recognition experiments are carried out with 20 samples for each of 92 kinds of environmental sounds, using a single transducer in a clean environment. The feature vectors are MFCC, ∆MFCC, and ∆power. The resulting environmental sound recognition performance is an average rate of 88.7% over multiple occurrences of the same sounds. This result confirms that statistical modeling is very effective not only for speech recognition but also for environmental sound recognition.

3.2. Experimental conditions

The sound source identification performance is evaluated with known multiple sound source positions. Figure 3 shows the experimental environment. The desired signal comes from the front direction and white Gaussian noise comes from the right direction. The distance between the sound source and the microphone array is two meters. In this situation, the statistical sound source identification performance and the ASR performance are evaluated subject to variations in the SNR (Signal-to-Noise Ratio) and the acoustical environment. Table 1 shows the data recording conditions and Table 2 shows the experimental conditions for statistical sound source identification. We evaluate the statistical sound source identification performance using a single transducer and a microphone array, at SNRs from -5 dB to 30 dB and in clean conditions, with reverberation times of T[60] = 0.0, 0.3, and 1.3 sec. We also evaluate the ASR performance under the experimental conditions for ASR shown in Table 3.
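Before reporting results, here is a hedged sketch of the front-end features listed in Tables 2 and 3 (MFCC, ∆MFCC, and ∆power, plus CMS as used for ASR). The analysis parameters follow Table 2, but librosa and the helper name are my choices; the paper does not name a feature extraction toolkit.

```python
import numpy as np
import librosa

def extract_features(y, sr=16000, frame_len=0.032, frame_shift=0.008, n_mfcc=16):
    """Frame-level features: MFCC (16 orders), delta-MFCC, and delta log-power."""
    n_fft = int(frame_len * sr)      # 32 ms frame length (Table 2)
    hop = int(frame_shift * sr)      # 8 ms frame interval
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft,
                                hop_length=hop, window="hamming")
    mfcc -= mfcc.mean(axis=1, keepdims=True)     # cepstral mean subtraction (CMS)
    d_mfcc = librosa.feature.delta(mfcc)
    log_power = np.log(librosa.feature.rms(y=y, frame_length=n_fft,
                                           hop_length=hop) ** 2 + 1e-10)
    d_power = librosa.feature.delta(log_power)
    n = min(mfcc.shape[1], d_power.shape[1])     # align frame counts defensively
    return np.vstack([mfcc[:, :n], d_mfcc[:, :n], d_power[:, :n]]).T  # (frames, dims)
```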
In this paper, we evaluate the statistical sound source identification performance with 616 sounds consisting of speech (216 words × 2 subjects, 1 female and 1 male) and environmental sounds (92 sounds × 2 sets). The ASR performance is also evaluated with speech (216 words × 2 subjects). Equation (2) defines the sound source identification rate (SIR):

SIR = ( Σ_{n=0}^{N} I_cor[n] ) / N,   I_cor[n] = 1 if Q̂[n] = Q[n], 0 if Q̂[n] ≠ Q[n],    (2)

where Q[n] is the correct answer, Q̂[n] is the sound source identification result, and N is the number of all sounds. The ASR performance is evaluated by the word recognition rate (WRR).
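Equation (2) amounts to the fraction of test sounds whose identified label matches the reference label; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def sound_identification_rate(identified, reference):
    """Equation (2): share of sounds whose identified label equals the correct answer."""
    return float(np.mean(np.asarray(identified) == np.asarray(reference)))
```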
3.3. Experimental results

Figures 4(a) and 4(b) show the experimental results using a single transducer and using a microphone array that steers its directivity toward the known desired sound source position. In these figures, the bar graphs represent sound source identification rates (SIR) and the line graphs represent word recognition rates (WRR). First, we focus on the bar graphs in Figures 4(a) and 4(b). Comparing the results for the single transducer with those for microphone array steering, we can confirm that microphone array steering gives a higher sound source identification performance than the single transducer, especially in lower-SNR environments. We therefore confirm that the proposed algorithm achieves a higher sound source identification performance by using microphone array steering. Second, we describe the robustness of the sound source identification against reverberation. In Figure 4(b), the sound source identification performance with microphone array steering is almost the same in each reverberant environment, although it tends to decline slightly in the lower-SNR, more reverberant environments. From these results, we confirm that the proposed algorithm can distinguish between "speech" and "non-speech" accurately even in highly reverberant environments.
Table 2: Experimental conditions for sound source identification
Frame length: 32 msec. (Hamming window)
Frame interval: 8 msec.
Feature vector: MFCC (16 orders, 4 mixtures), ∆MFCC (16 orders, 4 mixtures), ∆power (1 order, 2 mixtures)
Number of models: speech: 1 model; non-speech: 1 model
Speech DB: ATR speech DB Set A [6]
Speech model training: 150 words × 16 subjects (8 female and 8 male)
Non-speech DB: RWCP-DB [7, 8]
Non-speech model training: 92 sounds × 20 sets
Test data (open): speech: 216 words × 2 subjects (1 female and 1 male); non-speech: 92 sounds × 2 sets

Table 3: Experimental conditions for ASR
Frame length: 25 msec. (Hamming window)
Frame interval: 10 msec.
Feature vector: MFCC, ∆MFCC, ∆power (+ CMS [9])
Test data (open): 216 words × 2 subjects (1 female and 1 male)

Figure 4: Experimental results for (a) the single transducer and (b) the microphone array. Bar graphs: sound source identification rate [%]; line graphs: speech recognition rate [%]; horizontal axis: SNR [dB] from -5 dB to 30 dB and clean; environments: anechoic room [0.0 sec.] and reverberant rooms [0.3 sec. and 1.3 sec.].
Next, we compare the proposed method with a conventional method that uses only a speech GMM. With the conventional method, statistical sound source identification is carried out by distinguishing "speech" from "non-speech" with a threshold. Figure 5 illustrates the threshold estimation. As shown in the figure, we calculate accumulated likelihood histograms from training data and then estimate the threshold for the conventional method by finding the equal probability point of the accumulated likelihood histograms. Figure 6 shows the results of the proposed method and the conventional method with microphone array steering in the T[60] = 0.3 sec. environment. With the conventional method, the sound source identification rate (SIR) is only about 70% even in the higher-SNR environments, although the identification performance improves as the SNR increases. With the proposed method, however, the sound source identification rate is more than 90% not only in the higher-SNR environments but also in the lower-SNR environments. The performance of the conventional method using only a speech GMM depends strongly on the threshold, whereas the proposed method using speech and environmental sound GMMs can distinguish between "speech" and "non-speech" accurately because it uses the difference between the two GMMs' likelihoods.

We also evaluate the relationship between the number of Gaussian mixtures for the MFCC and ∆MFCC feature vectors and the sound source identification rate. Figure 7 shows the results. In the figure, we can confirm that the sound source identification performance is almost the same with more than four mixtures, while the performance degrades with fewer than two mixtures. Therefore, the proposed method may be able to distinguish between "speech" and "non-speech" even if the speech and environmental sound GMMs consist of only a few mixtures.
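For the conventional baseline, one plausible reading of the "equal probability point" described above (and sketched in Figure 5) is the likelihood value at which the accumulated histograms for speech and non-speech training data cross. The following is a hedged sketch of that reading; the histogramming details and names are assumptions, not taken from the paper.

```python
import numpy as np

def estimate_threshold(speech_ll, nonspeech_ll, n_grid=1000):
    """Scan candidate thresholds over speech-GMM likelihoods of training data and
    return the point where the fraction of speech segments falling below the
    threshold equals the fraction of non-speech segments lying at or above it."""
    speech_ll = np.asarray(speech_ll)
    nonspeech_ll = np.asarray(nonspeech_ll)
    lo = min(speech_ll.min(), nonspeech_ll.min())
    hi = max(speech_ll.max(), nonspeech_ll.max())
    grid = np.linspace(lo, hi, n_grid)
    missed_speech = np.array([np.mean(speech_ll < t) for t in grid])
    accepted_noise = np.array([np.mean(nonspeech_ll >= t) for t in grid])
    return grid[np.argmin(np.abs(missed_speech - accepted_noise))]
```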
Figure 5: Threshold estimation by a conventional method (likelihood histograms and accumulated likelihood histograms for speech and for non-speech, plotted as count versus likelihood; the threshold is taken at the equal probability point of the two accumulated histograms).

Figure 7: Relationship between the number of Gaussian mixtures for the GMMs and the sound source identification rate at SNR = 30 dB (1 to 24 mixtures; anechoic room [0.0 sec.] and reverberant rooms [0.3 sec. and 1.3 sec.]).
Figure 6: Comparison of sound source identification performance using the proposed method (speech and non-speech GMMs) and the conventional method (speech GMM only).

Finally, we focus on the line graphs showing the word recognition rates (WRR) in Figures 4(a) and 4(b). Comparing the results for the single transducer with those for microphone array steering, we can confirm that microphone array steering gives a higher ASR performance than the single transducer, especially in lower-SNR environments. As an example, at T[60] = 0.0 sec. and SNR = 10 dB, the WRR is 58.3% for the single transducer but improves from 58.3% to 92.6% with the microphone array. Similarly, at T[60] = 1.3 sec. and SNR = 20 dB, the WRR is 37.1% for the single transducer but improves from 37.1% to 64.3% with the microphone array. This confirms that the proposed algorithm with the microphone array yields a higher ASR performance than the single transducer, not only in anechoic environments but also in reverberant environments.

According to the above evaluation experiments, we confirm that the talker can be localized accurately by sound source identification using statistical speech and environmental sound GMMs and microphone array steering among known multiple sound sources. We also confirm that the talker's speech can be recognized robustly with the microphone array in noisy, reverberant environments.

4. CONCLUSION

In this paper, we propose a new talker localization algorithm and focus on sound source identification among localized multiple sound sources with statistical speech and environmental sound models based on GMMs, towards talker localization. The evaluation experiments confirm that the talker is localized accurately by the proposed algorithm with known sound source positions in reverberant, noisy environments. In addition, the talker's speech is recognized robustly with the microphone array in noisy, reverberant environments. In future work, we will evaluate the proposed algorithm using multiple sound source positions localized by the CSP coefficient addition method.
5. REFERENCES

[1] J.L. Flanagan, J.D. Johnston, R. Zahn, and G.W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," J. Acoust. Soc. Am., Vol. 78, No. 5, pp. 1508-1518, Nov. 1985.
[2] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, "Microphone array based speech recognition with different talker-array position," Proc. ICASSP97, pp. 227-230, 1997.
[3] E. Lleida, J. Fernandez, and E. Masgrau, "Robust Continuous Speech Recognition System Based on a Microphone Array," Proc. ICASSP98, pp. 241-244, May 1998.
[4] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of Multiple Sound Sources Based on a CSP Analysis with a Microphone Array," Proc. ICASSP2000, pp. 1053-1056, Jun. 2000.
[5] R. Singh, M.L. Seltzer, B. Raj, and R.M. Stern, "Speech in Noisy Environments: Robust Automatic Segmentation, Feature Extraction, and Hypothesis Combination," Proc. ICASSP2001, May 2001.
[6] K. Takeda, Y. Sagisaka, and S. Katagiri, "Acoustic-Phonetic Labels in a Japanese Speech Database," Proc. European Conference on Speech Technology, Vol. 2, pp. 13-16, Oct. 1987.
[7] S. Nakamura, K. Hiyane, F. Asano, T. Yamada, and T. Endo, "Data Collection in Real Acoustical Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proc. Eurospeech99, pp. 2255-2258, Sep. 1999.
[8] S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proc. LREC2000, pp. 965-968, May 2000.
[9] B. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am., Vol. 55, pp. 1304-1312, 1974.