Automatic Multi-Speaker Speech Recognition System Based on Time-Frequency Blind Source Separation under Ubiquitous Environment

Zhe Wang, Haijian Zhang, Guoan Bi
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Email: [email protected], [email protected], [email protected]

Xiumei Li
School of Information Science and Engineering, Hangzhou Normal University, China
Email: [email protected]

Abstract—In this paper, an automatic speech recognition (ASR) system for ubiquitous environments is proposed and successfully implemented in a personalized voice command system under vehicle and living room conditions. The proposed ASR system uses a novel scheme that separates the speech sources of multiple speakers, detects speech presence/absence by tracking the upper portion of the speech power spectrum, and adaptively suppresses noise. An automatic recognition algorithm adapted to the multi-speaker task is designed and evaluated. Evaluation tests are carried out using the NOISEX-92 noise database and the YOHO Corpus speech database. Experimental results show that the proposed algorithm achieves substantial improvements.

I. INTRODUCTION

In a ubiquitous environment, especially one with multiple speakers, correctly adapting the speech recognition model to each speaker and thereby improving recognition accuracy is a challenging problem. Here we propose a spatial short-time Fourier transform based blind source separation method for mixed speech signals in noisy environments. The algorithm is specially designed to achieve separation even when the number of speakers is unknown and larger than the number of microphones. Blind source separation introduces additive noise, together with environmental noise, into each separated speech signal to be recognized. The recovery of clean speech from noisy speech is of vital importance for speech enhancement, speech recognition and many other areas. In real life there are numerous sources of noise, such as environmental noise, channel distortion and speaker variability. Therefore, many algorithms have been designed to remove noise from speech. However, many of these algorithms need an additional noise estimation scheme and are tuned for auditory quality rather than for ASR. This paper


describes a novel self-optimized VAD algorithm together with a simple but effective noise removal scheme, applied to the signals after separation in order to improve the speech recognition rate. The key feature of the self-optimized VAD is that no prior estimation of the clean speech variance is needed. Owing to the design of the ASR system, further discrimination information is utilized so that the VAD keeps more of the information that is essential for feature extraction. Besides, the threshold used for the speech/non-speech decision is generated from the noisy speech itself, which can be taken as a kind of self-optimizing process. For the multi-frequency noise removal scheme, the key feature is the simplicity of the algorithm: no additional model or training is needed. It is based on one of the widely accepted algorithms, spectral subtraction (SS) [1]. Comparisons are made among the SS method [2], the Zero-Crossing-Energy VAD method (ZCE) [3], the Entropy-Based method [4], and the proposed VAD based algorithm. To realize the recognition task simultaneously, a modified recognition process based on the Bhattacharyya distance and a universal speaker profile is proposed. For the evaluation tests, the NOISEX-92 and YOHO Corpus databases are used, which are widely used databases containing noise and speech. Experimental results show the excellent performance of the proposed ASR system.

The rest of the paper is organized as follows. In Section 2, a speech separation algorithm is provided to separate the signals from multiple speakers, which works even when the number of speakers is unknown. In Section 3, the self-optimized VAD algorithm is presented with its detailed derivation. Section 4 provides an adaptive soft decision algorithm using discrimination information. Section 5 describes the modified MFCC based ASR system used in the multi-speaker adaptation experiment. The experimental results and the additional comparison methods are described in Section 6. Conclusions are summarized in Section 7.


II. BLIND SOURCE SEPARATION (BSS) BASED ON SPATIAL SHORT-TIME FOURIER TRANSFORM (SSTFT)

Let sn(t), n = 1, ..., N, be the unknown speech sources, where N is the number of speakers. The M-sensor microphone array is arranged linearly. The output vector is modeled as

x(t) = A s(t) + n(t),   (1)

where A denotes the mixing matrix, x(t) = [x1(t), ..., xM(t)]^T are the received mixtures, s(t) = [s1(t), ..., sN(t)]^T are the speech sources, and n(t) is an additive white noise vector; T is the transpose operator. The procedure of the SSTFT-based BSS algorithm based on the above signal model is as follows:

• Calculating the STFT of the mixtures x(t) in (1), we obtain an M × 1 vector Sx(t, f) at each time-frequency (TF) point (t, f):

Sx(t, f) = A Ss(t, f) + Sn(t, f),   (2)

where S denotes the STFT operator.

• Next, we detect the auto-source TF points (i.e. the auto-term locations of the speech sources in the TF domain) by setting a small energy threshold ε0, implemented through the following criterion at each time instant:

||Sx(t, f)|| / max_v ||Sx(t, v)|| > ε0,   (3)

where || · || denotes the norm operator and ε0 is an empirical threshold value for the selection of auto-source TF points. All TF points which satisfy the criterion in Equation (3) are included in the set Ω.

• The premise of the SSTFT-based algorithm is to estimate the number of sources N as well as the mixing matrix A. We apply the method proposed in our previous work [5]. Specifically, we detect dominant TF points, i.e. points where one of the sources has dominant energy compared with the other sources and the noise power. The mean-shift clustering method [6] is applied to classify these dominant TF points without knowing the number of sources. The mixing matrix A is estimated by averaging the spatial vectors of all TF points in the same cluster, and the number of sources N is estimated by counting the resultant clusters.

• Based on the detected auto-source TF point set Ω and the estimated mixing matrix Â, we can apply the subspace-based method to estimate the STFT values of each source [7]. We assume that at most K (K < M) sources are present at each auto-source TF point in Ω. Thus, the expression in Equation (2) is simplified as

Sx(t, f) ≈ Ã S̃s(t, f),   (t, f) ∈ Ω,   (4)

where Ã denotes the steering vectors of the K sources at each point (t, f) ∈ Ω, and S̃s(t, f) contains the STFT values of these K sources. The matrix Ã at each auto-source TF point is determined by the following minimization:

Ã = arg min_{ãm1, ..., ãmK} || P Sx(t, f) ||,   (5)

where P = I − Ãm (Ãm^H Ãm)^{-1} Ãm^H denotes the orthogonal projection matrix onto the noise subspace, and Ãm = [ãm1, ..., ãmK], with ãm1, ..., ãmK being K candidate columns of the estimated mixing matrix Â. The K STFT values at each auto-source TF point are then obtained by

S̃s(t, f) ≈ Ã† Sx(t, f),   (t, f) ∈ Ω,   (6)

where † denotes the Moore-Penrose pseudoinverse operator. At this stage the energy at each auto-source TF point in Ω has been separated into K STFT values and assigned to the corresponding sources.

• Each source is recovered by the inverse STFT [8] using the STFT values estimated via Equation (6).

Fig. 1 and Fig. 2 illustrate a demo of the time-frequency speech separation algorithm. In this case, there are four speakers/sources and only two microphones are used.

Fig. 1. Mixed Speech Signal (noisy mixtures 1 and 2; amplitude vs. time in seconds)

Fig. 2. Separated Speech Signal (estimated sources; amplitude vs. time in seconds)
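To make the flow of the SSTFT-based separation concrete, the following minimal NumPy/SciPy sketch illustrates the auto-source TF point selection of Equation (3) and the pseudoinverse recovery of Equation (6), followed by the inverse STFT. It is a simplified illustration rather than the full method: the array shapes, the threshold value eps0 and the single K-column submatrix A_sub are assumptions, and the per-point column selection of Equation (5), the mean-shift clustering and the mixing-matrix estimation of [5] are not reproduced here.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_tf(x, A_sub, fs=16000, nperseg=512, eps0=0.05):
    """Sketch of the TF-domain separation: x is an (M, T) array of mixtures,
    A_sub is an assumed (M, K) column submatrix of the estimated mixing matrix."""
    # STFT of every mixture channel: Sx has shape (M, F, frames), as in Eq. (2).
    _, _, Sx = stft(x, fs=fs, nperseg=nperseg)

    # Auto-source TF point selection, Eq. (3): keep points whose norm over the
    # channels exceeds eps0 times the per-time-instant maximum over frequency.
    norms = np.linalg.norm(Sx, axis=0)                       # (F, frames)
    omega = norms > eps0 * norms.max(axis=0, keepdims=True)  # the set Omega

    # Pseudoinverse recovery on Omega, Eq. (6); energy outside Omega is zeroed.
    Ss = np.einsum('km,mft->kft', np.linalg.pinv(A_sub), Sx) * omega

    # Each source is recovered by the inverse STFT, as in the last step above.
    _, s_hat = istft(Ss, fs=fs, nperseg=nperseg)
    return s_hat
```

With two microphones, for instance, A_sub would reduce to a single estimated column per dominant source; in the paper itself the column selection is made per TF point via Equation (5).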


III. NOISE ESTIMATION

Noise and speech are usually statistically independent and possess different statistical properties. Noise is more symmetrically distributed, while speech is more non-stationary due to its active/non-active periods; noise, in contrast, is present all the time. The active/non-active transitions of speech make the energy of speech more concentrated in the speech-active periods.

A. General Description

The different behavior of noise and speech makes it possible to track speech or noise based on the minimum/maximum of the speech spectrum. High energy is more likely to be speech, while the low-energy part is more likely to be noise. That means detecting speech by analyzing the maximum of the noisy speech is possible. The speech amplitude has a higher chance of being larger than the noise amplitude. Compared with noise, the probability distribution function of clean speech is flatter in the 'tail' part, which means clean speech is more likely to be far from its average; the real clean speech amplitude can be much larger than the average. Even at 0 dB SNR, the peak portion of the spectrum can be shown to be more likely to come from clean speech.

B. Algorithm Derivation

Assume that speech is degraded by uncorrelated additive noise. The two hypotheses for VAD are

H0: speech absent: Y = N
H1: speech present: Y = S + N

where Y, N, S denote the spectral-domain noisy speech, noise and clean speech, respectively. The probability density functions under H0 and H1 are given by

H0: PN(Y) = f(Y | H0) = (1 / (2π σN²)) · exp( −|Y|² / (2σN²) ),   (7)

H1: P(Y) = f(Y | H1) = (1 / (2π (σN² + σS²))) · exp( −|Y|² / (2(σN² + σS²)) ),   (8)

where σN² and σS² denote the noise and clean speech variances. By requiring PN(Y) / P(Y) < ε, we make sure that PS ≫ PN. ε is a heuristic parameter set between 0.01 and 0.2; the calculation of this parameter is further presented in Section 4.5. The condition expands to

((σS² + σN²) / σN²) · exp( −(|Y|² / 2) · (1/σN² − 1/(σN² + σS²)) ) < ε.   (9)

Define σS² / σN² = k; then the condition can be written as

√(k + 1) · exp( −|Y|² (k + 2) / (2(σS² + σN²)) ) < ε.   (11)

Then define the threshold

|Ȳ|² = −(2(σS² + σN²) / (k + 2)) · ln( ε / (1 + k) ).   (13)
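As a quick numerical companion to the derivation, the sketch below evaluates the threshold of Equation (13) as reconstructed above and applies it to a vector of bin powers, as in the bin-level flag of Equation (14) that follows. The per-bin variances σN² and σS² are assumed to be tracked elsewhere (see Section 4), and the value of ε here is only an example.

```python
import numpy as np

def vad_threshold(sigma_n2, sigma_s2, eps):
    """Decision threshold |Y_bar|^2 of Eq. (13)."""
    k = sigma_s2 / sigma_n2
    return -(2.0 * (sigma_s2 + sigma_n2) / (k + 2.0)) * np.log(eps / (1.0 + k))

def bin_vad_flags(Y_power, sigma_n2, sigma_s2, eps=0.05):
    """Frequency-bin VAD flags as in Eq. (14): 1 where |Y|^2 exceeds the threshold."""
    return (np.asarray(Y_power) > vad_threshold(sigma_n2, sigma_s2, eps)).astype(int)
```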

The quantity |Ȳ|² can serve as a more direct threshold. The frequency-bin-level VAD flag is then obtained by

flag = 1, if |Y|² > |Ȳ|²;   flag = 0, if |Y|² ≤ |Ȳ|².   (14)

The speech pdf is calculated and used to obtain |Ȳ|² through Equation (13); the binary VAD flag then follows from Equation (14). The VAD algorithm described above is good at suppressing noise and can distinguish noise from voiced speech to some extent. To improve the automatic speech recognition rate, we still need to detect the unvoiced speech. The modified algorithm uses the discrimination information to update the noise energy in all frames, including the speech frames, so it can better trace changes in the noise energy.

IV. NOISE SUPPRESSION

In the design of a VAD algorithm, the so-called soft decision algorithms appear to be superior to binary decision ones, because the speech signal is highly non-stationary: there is no clear boundary that marks the beginning or ending of a pronunciation. In this section, the discrimination information is used as a soft decision threshold.

A. Energy calculation on sub-bands

The energy calculation works on a frame-by-frame basis. Each frame is multiplied by a window (e.g. a cosine window) before it is transformed by the Fast Fourier Transform (FFT), reducing the spectral leakage caused by the FFT. Adjacent frames are overlapped by half, so that time-domain modulation (due to processing) is minimized. Note that the 50% overlap incurs an initial delay of half the frame size. The frame size needs to be selected carefully. Suppose the sample rate is Fs and the frame size is N = 2^m (a power of 2 so that the FFT can be used). The time resolution is N/Fs and the frequency resolution is Fs/N. Obviously, a larger frame size allows better frequency resolution but has poorer time resolution;


the opposite is true for a smaller frame size. In general, for Fs = 8000 Hz and Fs = 16000 Hz, frame sizes of N = 256 and N = 512 are found appropriate, respectively. To divide the bands, a psychoacoustic scale based on the Bark scale can be used, dividing the spectrum into a fixed number of 16 bands (1st Bark band to 16th Bark band). So, when the frame size is 256, the energy of each band is

Si = Σ_{k=16i−15}^{16i} R²_{i,k},   (15)

where Ri,k is the absolute value of the k-th Fourier transform coefficient of the i-th band. The proportion of the sub-band energy in the whole frame energy is

Pi = Si / Σ_{j=1}^{16} Sj.   (16)

The frame energy and the sub-band energies are used to calculate the discrimination information based on the sub-band energy distribution probabilities, both for the current frame and for the noise frame.
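A minimal sketch of the per-frame sub-band energies is given below. It transcribes the uniform 16-bin grouping written in Equation (15) directly; the Hann window and the FFT length are assumptions of this sketch (a true Bark-scale mapping would use non-uniform band edges).

```python
import numpy as np

def subband_energies(frame, n_fft=256, n_bands=16):
    """Per-band energies S_i (Eq. 15) and their proportions P_i (Eq. 16)."""
    windowed = frame * np.hanning(len(frame))     # cosine-type window per frame
    R = np.abs(np.fft.fft(windowed, n_fft))       # |FFT| coefficients R_{i,k}
    width = n_fft // n_bands                      # 16 bins per band for N = 256
    S = np.array([np.sum(R[i * width:(i + 1) * width] ** 2)
                  for i in range(n_bands)])       # S_i, Eq. (15)
    return S, S / S.sum()                         # (S_i, P_i)
```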



B. Calculation of discrimination information

Assume the random variable Y takes the values a1, a2, ..., aK, and that the probability distribution of Y depends on the hypotheses H0 and H1. Setting P0(ak) = P(ak | H0) and P1(ak) = P(ak | H1), the discrimination information is defined as

I(P1, P0; Y) = Σ_{k=1}^{K} P1(ak) log( P1(ak) / P0(ak) ).   (17)

The discrimination information can be calculated from the sub-band energy distributions to measure the similarity between the current frame and the noise frame:

P0(ak) = Nk / Σ_{j=1}^{8} Nj,   (18)

P1(ak) = Sk / Σ_{j=1}^{8} Sj,   (19)

where Nk and Sk denote the sub-band energies of the noise frame and of the current frame, respectively. With these three equations we obtain the discrimination information d and the updating factor λ, where λ is defined as

λ = 1 − α^p,  if d < γ;    λ = β^(5d),  if d ≥ γ,   (20)

where α, β and γ are constants; here we take α = 0.8, β = 0.2 and γ = 0.2, and p counts the number of consecutive frames satisfying d < γ.

C. Threshold update by discrimination information

• The noise threshold is updated recursively as

T Hn = T Hn−1 (1 − λ) + |Y|² λ,   (21)

where T Hn represents the updated noise threshold for the n-th frame, |Y|² is the power spectral value of the current frame, and λ is the noise update factor calculated by Equation (20).
• Once all the data have been processed, the adaptive adjustment ends.

D. Modified VAD and noise suppression

The first 5 frames are selected as noise/non-speech frames, and the frame preceding a period of speech is considered a noise frame.

When the previous frame is judged as a noise frame: if the current frame satisfies |Y|² ≤ |Ȳ|², it is judged as a noise frame. If the current frame satisfies |Y|² > |Ȳ|² and d > Tr, it is judged as a start-position frame, and the comparison continues with the following 6 frames. If those 6 frames also satisfy |Y|² > |Ȳ|² and d > Tr, the start-position frame is taken as the start of a period of speech; otherwise the current frame is still a noise frame.

When the previous frame is judged as a speech frame: if the current frame satisfies |Y|² > |Ȳ|², it remains a speech frame. If the current frame satisfies |Y|² ≤ |Ȳ|² and d ≤ Tr, it is judged as an end-position frame, and the comparison continues with the following 6 frames. If those 6 frames also satisfy |Y|² ≤ |Ȳ|² and d ≤ Tr, the end-position frame is taken as the end of the speech period (and the start of the following non-speech period); otherwise the current frame is still a speech frame.

Tr is the edge value of the discrimination information, equal to the average discrimination value of the 5 most recent noise frames. During each step of the judgment, the noise threshold is updated.
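The discrimination-information bookkeeping of Equations (17)-(21) is simple enough to state directly; the sketch below is a literal transcription, with a small floor added to the probabilities only to guard against log(0), which is not part of the paper's formulation.

```python
import numpy as np

ALPHA, BETA, GAMMA = 0.8, 0.2, 0.2           # the constants of Eq. (20)

def discrimination_info(P1, P0, floor=1e-12):
    """d = I(P1, P0; Y), Eq. (17), from the sub-band distributions of Eqs. (18)-(19)."""
    P1 = np.asarray(P1, dtype=float) + floor
    P0 = np.asarray(P0, dtype=float) + floor
    return float(np.sum(P1 * np.log(P1 / P0)))

def update_factor(d, p):
    """Noise update factor lambda of Eq. (20); p counts consecutive frames with d < gamma."""
    return 1.0 - ALPHA ** p if d < GAMMA else BETA ** (5.0 * d)

def update_threshold(th_prev, y_power, lam):
    """Recursive noise-threshold update of Eq. (21)."""
    return th_prev * (1.0 - lam) + y_power * lam
```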

When the speech signal Y(w) is corrupted by additive noise N(w), which is assumed to be independent of the speech, the noise can in theory be optimally removed by estimating its power and filtering the noisy signal with the following filter:

H(w) = ( |Y(w)| − |N(w)| ) / Y(w).   (22)

Phase is typically ignored in the filtering process, as it is generally regarded as less important for speech than magnitude; this is in fact the well-known Wiener filtering concept. The proposed VAD detects the noise frames and subtracts the noise spectrum from the speech signal, trying to keep more of the information used during the feature extraction process of ASR while eliminating the noise that would provide wrong information during feature extraction and template matching.

E. Voice Activity Detection Procedure

As the speech signal is always non-stationary, making a binary voice/noise decision is quite risky. Therefore, we designed a module that rates the likelihood of voice being present by computing a voice activity score (VAS). This way, the processing can transition smoothly when the derived VAS indicates a mixture of voice and noise. The VAS for a frame is determined by two aspects.
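A minimal sketch of the filtering step in Equation (22) is shown below. It reads the filter as a magnitude-subtraction gain applied to the noisy spectrum so that the noisy phase is retained, in line with the remark that phase is ignored; the flooring of the gain at zero is an added safeguard rather than part of the paper.

```python
import numpy as np

def suppress_noise(Y, noise_mag):
    """Apply the magnitude-subtraction filter of Eq. (22) to one frame.

    Y         : complex spectrum of the current (noisy) frame
    noise_mag : estimated noise magnitude spectrum |N(w)|
    """
    Y_mag = np.abs(Y)
    gain = np.maximum(Y_mag - noise_mag, 0.0) / np.maximum(Y_mag, 1e-12)
    return gain * Y          # noisy phase is kept; only the magnitude is reduced
```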


The first aspect concerns the intelligibility of the voice, which is approximately quantified by counting the number of Bark bands in the speech band whose power exceeds that of the corresponding Bark band of the estimated noise; the speech band ranges from the 4th Bark band to the 14th Bark band (inclusive). The second aspect is the power of the current frame relative to the estimated noise power, the reasoning being that the higher the relative power of a frame, the more likely it is to contain voice. The final VAS is simply the sum of the two aspect scores. The parameter ε that appears in Section 3.2 is set as the reciprocal of the VAS and is updated for each frame. The continuous VAS offers much more flexibility than a fixed parameter. Even when it is necessary to make a binary decision as to whether or not the frame is noise-only, we can still make the processing change gradually and converge to a certain value. Then, for each frame, the judgment described in Section 4.3 is conducted; with the judgment result, the σN and σS mentioned in Section 3.2 are updated correspondingly.

V. ASR SYSTEM DESCRIPTIONS

This section elaborates on the full system, from the front-end feature extraction, through the training process consisting of sub-word unit and word template generation, to the final recognition process. After VAD and noise suppression, the processed speech signal is evaluated by this ASR system.

A. Front End Feature Extraction

The feature vector used for this recognition task consists of 24 MFCCs. The window size per frame is 20 ms with 10 ms overlap. All the speech data used in this experiment are 16 bit, sampled at 16 kHz.

B. Sub-word Unit Generation

This is the first part of the training process, where the user is required to record approximately two minutes of speech. It is recommended to read phonetically rich sentences in order to obtain a more comprehensive set of sub-word units; in this experiment, the user is asked to read a series of Harvard sentences. After going through the front-end feature extraction, the resulting MFCCs are clustered (using the c-means algorithm) into 64 distinct units, roughly corresponding to a collection of sub-words. Each of these clusters is then modeled using a Gaussian Mixture Model with 4 mixtures. In this experiment, no re-clustering was done during word template generation. To simplify the model, a further calculation is performed to generate the 64-by-64 Bhattacharyya distance matrix.
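A compact sketch of the sub-word unit generation described above is given below, with scikit-learn's KMeans standing in for the c-means clustering and diagonal-covariance GMMs as an assumption; it also presumes every cluster receives enough frames to fit a 4-mixture model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_subword_units(mfcc, n_units=64, n_mix=4, seed=0):
    """Cluster enrolment MFCC frames (num_frames x 24) into 64 sub-word units
    and fit one 4-mixture GMM per unit, as in the training step above."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=seed).fit(mfcc)
    units = []
    for u in range(n_units):
        frames = mfcc[km.labels_ == u]
        units.append(GaussianMixture(n_components=n_mix, covariance_type='diag',
                                     random_state=seed).fit(frames))
    return units
```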

C. Word Template Generation

In this step, the words that the user wants the system to recognize are registered. The user is asked to pronounce the words, and the template generation converts those words into a sequence of sub-word unit indices (obtained from the previous step) based on maximum likelihood. To avoid over-segmentation of a word, a transitional heuristic is employed, allowing a change of sub-word index only when there is a significant margin of likelihood difference with the neighboring state. This process is repeated for each word that the user wants to introduce to the system.

D. Matching Process - the Recognition

Given that there are M word templates in the system, the recognition process calculates the probability of the user input feature vector Xinput being generated by each template. The chosen word is the one which gives the highest likelihood:

m* = argmax_m p_m(Xinput).   (23)

Note that a template can be viewed as a sequence of GMMs, so the p_m(Xinput) calculation becomes increasingly expensive as the number of word templates grows, and it becomes very hard to observe the effect of the proposed VAD algorithm [9]. We propose to convert the input features to a sub-word unit index sequence using the Bhattacharyya distance. The Bhattacharyya distance between two probability distributions p1 and p2 is defined by

Bhatt(p1 | p2) = −ln( ∫ √( p1(x) p2(x) ) dx ).   (24)

Each sub-word unit in the testing experiment is modeled using a 4-mixture GMM, so the distance between two units is given by

Bhatt(p1 | p2) = −ln( ∫ √( Σ_{mix=0}^{3} p1,mix(x) ) · √( Σ_{mix=0}^{3} p2,mix(x) ) dx ).   (25)
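Since Equation (25) is evaluated for every pair of the 64 sub-word units, a simple way to illustrate the resulting 64-by-64 matrix is a Monte-Carlo estimate of the overlap integral in Equation (24), shown below; it is an illustrative substitute for the paper's 4-mixture evaluation, and the sample count is arbitrary.

```python
import numpy as np

def bhattacharyya(gmm1, gmm2, n_samples=5000):
    """Monte-Carlo estimate of Eq. (24), using
    E_{x~p1}[sqrt(p2(x)/p1(x))] = integral of sqrt(p1(x) p2(x)) dx."""
    X, _ = gmm1.sample(n_samples)            # draw samples from p1
    log_p1 = gmm1.score_samples(X)           # log p1(x)
    log_p2 = gmm2.score_samples(X)           # log p2(x)
    overlap = np.mean(np.exp(0.5 * (log_p2 - log_p1)))
    return -np.log(max(overlap, 1e-300))

def bhattacharyya_matrix(units):
    """The 64-by-64 distance matrix mentioned in the sub-word unit generation step."""
    n = len(units)
    return np.array([[bhattacharyya(units[i], units[j]) for j in range(n)]
                     for i in range(n)])
```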

The distance is calculated between all 64 sub-word units, and matching is performed with a Levenshtein (edit) distance method. The average run time of the recognition task with the original pattern matching algorithm increases proportionally with the number of templates, but for the Bhattacharyya edit distance method the running time stays quite stable as the number of templates increases, which makes it particularly suitable for a real-time recognition system.

E. Stranger scenario

The system description above requires a user to register both his voice and the words to be recognized; each user has his or her own sub-word unit profile and word templates. Apart from that, the system also builds one guest profile and set of templates to be used by non-registered users. For example, in a voice-triggered TV remote control use case in a living room environment, each member of a household can register a voice profile. However, if a guest arrives and wants to use the system without registering a voice, this guest profile is the one that will be invoked. The guest sub-word unit profile is built by combining all registration data from all the users. The main issue is then how to build the word templates. Using the word generation data from each user is not optimal for two reasons. Firstly, it is not generic enough, as the template will be too closely tied to the way that particular user pronounces the word. Secondly, the size of the word templates will be huge, because it needs to cover each word repeated by every registered user. Instead of keeping multiple inputs, the scheme here aggregates the multiple patterns of each word template, and the centroid is then used as the guest word template.
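The paper does not spell out how the centroid of the aggregated patterns is computed; one plausible reading, sketched below purely as an assumption, is to pick the pattern whose total edit distance to all other users' patterns is smallest (a medoid under the Levenshtein distance used elsewhere in the system).

```python
def levenshtein(a, b):
    """Edit distance between two sub-word index sequences."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, start=1):
        cur = [i]
        for j, bj in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ai != bj)))  # substitution
        prev = cur
    return prev[-1]

def guest_template(patterns):
    """Aggregate every registered user's pattern of a word and return the one
    closest on average to the rest, used here as the guest word template."""
    totals = [sum(levenshtein(p, q) for q in patterns) for p in patterns]
    return patterns[totals.index(min(totals))]
```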


VI. ALGORITHM EVALUATION

In this section we present the results and an objective evaluation of the proposed ASR system. We use the global SNR to describe the noise environment:

SNR = 10 log10( Σ_{k=0}^{n−1} S²(k) / Σ_{k=0}^{n−1} N²(k) ),   (26)

where S(k) is the speech signal and N(k) is the noise signal; the noise amplitude is adjusted accordingly. In this ASR experiment, the vehicle motor noise signal is taken from the NOISEX-92 database [10], the living room noise environment from the PASCAL 'CHiME' Speech Separation and Recognition Challenge [12], and the speech signal from the YOHO Corpus database [11].

A. Speech Recognition Tests

The speech recognition test is carried out using the ASR system of Section 5. The proposed blind source separation is implemented first, and then the VAD algorithm is applied. In general, fewer speech sources with more microphones give better separation results [5]. Here, to keep the setup simple, we use only 2 microphones to separate 4 speakers' signals, among which one speaker's speech is recognized under the stranger scenario. The SS, ZCE and Entropy-Based methods are chosen for comparison with the proposed VAD noise suppression method under motor noise. Recognition accuracy results are given under different SNRs (0 dB, 5 dB, 10 dB).

TABLE I
RECOGNITION ACCURACY (%) UNDER MOTOR NOISE

Method          0 dB    5 dB    10 dB
SS              60.43   79.86   92.23
ZCE             76.77   85.42   93.91
Entropy-Based   84.09   87.18   93.95
Proposed VAD    86.59   88.53   93.98

The advantage of the proposed algorithm is clear. Compared with the Entropy-Based method, which achieves the highest accuracy among the reference VAD algorithms, the improvement at 0 dB reaches 2.5%, while at 5 dB SNR it reaches 1.4%. The whole ASR system can work in a frame-by-frame scheme, which satisfies the real-time requirement of most embedded electronic applications. The recognition-rate experiment can also be conducted with other recognition processing methods; similar results are obtained, but longer processing time is needed. Besides the motor noise, other kinds of noise from NOISEX-92, such as tank noise, were also tested, and similar results were achieved.
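For completeness, the sketch below shows how a noise recording can be scaled so that the mixture matches a target global SNR as defined in Equation (26); truncating the noise to the speech length is an assumption of this sketch.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(sum(S^2)/sum(N^2)) equals snr_db (Eq. 26)."""
    noise = noise[:len(speech)]
    scale = np.sqrt(np.sum(speech ** 2) /
                    (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```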


VII. CONCLUSION

In this paper, a complete ASR system is proposed which can be successfully implemented in a ubiquitous environment. The key feature of the proposed ASR system is that neither the number of sources nor an estimate of the clean speech variance is needed in advance. The threshold used to suppress noise is generated from the speech itself, which leads to another advantage: the ability to adapt to changing environments. Moreover, the proposed source separation and noise suppression methods do not need any additional training process, which keeps the computational burden very small. Finally, the proposed system can be easily deployed in a ubiquitous environment. Experimental results demonstrate a significant improvement in recognition accuracy.

ACKNOWLEDGMENT

The authors would like to thank STMicroelectronics Asia Pacific Pte Ltd for providing the speech dataset and the experiment environment.

REFERENCES

[1] Boll, S., "Suppression of acoustic noise in speech using spectral subtraction", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, pp. 113-120, 1979.
[2] Junqua, J.C., Mak, B., Reaves, B., "A robust algorithm for word boundary detection in the presence of noise", IEEE Transactions on Speech and Audio Processing, 2(3), pp. 406-421, 1994.
[3] Beritelli, F., Casale, S., Ruggeri, G., et al., "Performance evaluation and comparison of G.729/AMR/fuzzy voice activity detectors", IEEE Signal Processing Letters, 9(3), pp. 85-88, 2002.
[4] Abdallah, I., Montresor, S., Baudry, M., "Robust speech/non-speech detection in adverse conditions using an entropy based estimator", International Conference on Digital Signal Processing, Santorini, pp. 757-760, 1997.
[5] H. Zhang, G. Bi, Sirajudeen Gulam Razul, and C.-M. See, "Estimation of underdetermined mixing matrix with unknown number of overlapped sources in short-time Fourier transform domain", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6486-6490, 2013.
[6] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603-619, May 2002.
[7] A. Aïssa-El-Bey, N. Linh-Trung, K. Abed-Meraim, A. Belouchrani, and Y. Grenier, "Underdetermined blind separation of nondisjoint sources in the time-frequency domain", IEEE Trans. Signal Process., vol. 55, no. 3, pp. 897-907, Mar. 2007.
[8] D. Griffin and J.S. Lim, "Signal estimation from modified short-time Fourier transform", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 2, pp. 236-243, Apr. 1984.
[9] Chang, H.Y., Lee, A.K., and Li, H.Z., "A GMM supervector kernel with Bhattacharyya distance for SVM based speaker recognition", IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4221-4224, 2009.
[10] Varga, A., Steeneken, H.J.M., "Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems", Speech Communication, 12, pp. 247-251, 1993.
[11] Li, Q., Reynolds, D.A., "Corpora for the evaluation of speaker recognition systems", IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 829-832, 1999.
[12] "The PASCAL 'CHiME' Speech Separation and Recognition Challenge", http://www.dcs.shef.ac.uk/spandh/chime/challenge.html, cited 15 Dec 2013.
