Voice Command II: A DSP Implementation of Robust Speech Recognition in Real-World Noisy Environments

Soo-Young Lee, Doh-Suk Kim, Ki-Hwan Ahn, Jae-Hoon Jeong, Hoon Kim, Sam-Yook Park
Computation and Neural Systems Laboratory, Department of Electrical Engineering
Korea Advanced Institute of Science and Technology
373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Korea
E-mail: [email protected]

Lag-Young Kim, Jong-Seok Lee, Hee-Youn Lee
Information Technology Laboratory, LG Corporate Institute of Technology
16 Woomyeon-dong, Seocho-gu, Seoul 137-140, Korea
E-mail: [email protected]

Abstract

The \Voice Command" system, designed for isolated word recognition tasks in real-world noisy environments, was implemented on a xed-point DSP board to operate in real-time. Simple auditory model, i.e., zero-crossings with peak amplitudes (ZCPA) model, is used for noise-robust feature extraction, and neural network classi er recognizes input patterns. The system performance is further improved by incorporating speaker adaptation and out-of-vocabulary word rejection capabilities. The radial basis function (RBF) classi er provides better rejection performance than multi-layer perceptron (MLP) classi ers.

1. Introduction
Automatic speech recognition is a leading technology for human-computer interfaces in real-world applications. However, the various types of background noise in real environments degrade the performance of speech recognition systems and keep them from widespread use. In [Lee et al., 1996] we introduced the digital neuro-chip "Voice Command" for speaker-independent, small-vocabulary speech recognition tasks in noisy environments, especially for stand-alone consumer and car electronics applications which do not use high-performance CPUs. The system is currently implemented in real time on a TMS320C54x fixed-point DSP. In this paper we report the improved "Voice Command II" system, in which the robustness is further extended to adaptation to speaker-specific variability and rejection of out-of-vocabulary (OOV) words.

In general the performance of a speaker-independent system, which is trained with a large amount of speech data from many speakers, is poorer than that of a speaker-dependent system when tested on the same speaker. This may be caused by speaker-specific speech differences arising from anatomical differences and speaking habits. The aim of speaker adaptation is to adapt the recognition system to a particular speaker for improved recognition performance. Also, the ability to distinguish out-of-vocabulary words and non-speech sounds such as door slams and coughs from in-vocabulary words, and to reprompt users, is becoming essential for speech recognition systems in real-world applications.

The architecture of the "Voice Command II" speech recognition system, in which the speaker adaptation and rejection functions are incorporated, is shown in Fig. 1. The sampled speech signal is fed into the auditory feature extraction module, the "Zero-Crossings with Peak Amplitudes (ZCPA)" module [Kim et al., 1996], which generates noise-robust speech features every 10 msec. In parallel with the feature extraction process, the word boundary detection module determines word boundaries based on energy and zero-crossing rates. When the word boundary points are detected, the feature vectors between them are sent to the neural network recognizer module, which classifies the input word as one of the predefined classes. The OOV word rejection module operates as a post-processor of the recognizer, and users can adapt the system to their own specific characteristics to improve recognition performance.
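As an illustration of the word boundary detection step just described, here is a minimal endpoint-detector sketch based on short-time energy and zero-crossing rate. The frame size follows the paper's 10 msec analysis rate, but the sampling rate and thresholds are illustrative assumptions, not the system's actual values.

```python
import numpy as np

def detect_word_boundaries(x, fs=8000, frame_ms=10,
                           energy_thr=0.01, zcr_thr=0.25):
    """Mark frames whose short-time energy or zero-crossing rate exceeds
    a threshold, then take the first and last active frames as the word
    boundaries. Threshold values are assumptions for illustration."""
    n = int(fs * frame_ms / 1000)                     # samples per 10-msec frame
    frames = x[: len(x) // n * n].reshape(-1, n)
    energy = (frames ** 2).mean(axis=1)               # short-time energy
    signs = np.signbit(frames).astype(np.int8)
    zcr = (np.diff(signs, axis=1) != 0).mean(axis=1)  # zero-crossing rate
    active = (energy > energy_thr) | (zcr > zcr_thr)  # speech-like frames
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return None                                   # no word detected
    return idx[0] * n, (idx[-1] + 1) * n              # start/end sample indices
```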

[Figure 1: Architecture of the "Voice Command II" speech recognition system. The speech input feeds the ZCPA module and, in parallel, the word boundary detection module; the RBF neural recognizer classifies the features, the OOV word rejection module post-processes the result, and a speaker adaptation module tunes the recognizer, yielding the recognition result.]

2. Robust Feature Extraction - ZCPA Analysis
The human auditory system is robust to background noise, and much work has been devoted to modeling the functional roles of the peripheral auditory system to obtain robust front-ends for speech recognition systems in noisy environments. Among these, the ZCPA model is a simple and efficient auditory model which has demonstrated good performance in real-world noisy environments [Kim et al., 1996, Kim et al., 1997]. The ZCPA model consists of a bank of bandpass cochlear filters and a nonlinear stage at the output of each cochlear filter, as shown in Fig. 2.

[Figure 2: Block diagram of the zero-crossings with peak amplitudes (ZCPA) model. The input x(t) passes through cochlear filters 1..M, modeling the basilar membrane; each filter output drives a zero-crossing detector (timing information) and a peak detector followed by a saturating nonlinearity (intensity information), modeling auditory nerve fibers; the per-channel interval histograms are summed to give ZCPA(t,f).]

The cochlear filterbank represents the frequency selectivity at various locations along the basilar membrane in the cochlea, and is implemented with a bank of 16 Hamming bandpass filters whose center frequencies are distributed between 200 and 4000 Hz according to the frequency-position relationship on the basilar membrane [Greenwood, 1990]. Auditory nerve fibers tend to fire in synchrony with the stimulus, and this synchronous neural firing is simulated as the upward-going zero-crossing events of the signal at the output of each bandpass filter; the inverse of the time interval between adjacent neural firings is accumulated in a frequency histogram whose bins are spaced by one Bark [Zwicker and Terhardt, 1980]. Further, each peak amplitude between successive zero-crossings is detected and used as a nonlinear weighting factor for the corresponding frequency bin, simulating the relationship between stimulus intensity and the degree of phase-locking of auditory nerve fibers; by contrast, Ghitza utilized multiple level-crossing detectors in the ensemble interval histogram (EIH) model [Ghitza, 1994]. The histograms across all filter channels are combined to represent the pseudo-spectrum of the auditory model. As a result, frequency information of the signal is obtained from the zero-crossing intervals of the subband signals, and intensity information is incorporated by the peak detector followed by the saturating nonlinearity. In the EIH model, however, one has to determine several parameters, such as the number of levels and the level values, which are extremely critical for reliable performance, and there is no elegant method to determine them except by trial and error. The use of zero-crossings for frequency estimation makes the ZCPA model free from such unknown level parameters, more efficient to compute, and more robust to noise than the EIH model. Further, we have shown that higher level values result in higher sensitivity of the interval measurements to additive noise [Kim et al., 1996].
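To make the ZCPA computation concrete, here is a minimal sketch under stated assumptions: the Bark-spaced bin edges are passed in precomputed, the saturating nonlinearity is taken to be log(1+x) as a stand-in (the paper does not spell out the exact function), and the cochlear filterbank is represented by FIR impulse responses supplied by the caller.

```python
import numpy as np

def zcpa_channel(subband, fs, bin_edges):
    """One ZCPA channel: upward-going zero-crossings act as neural firings,
    the inverse interval between adjacent firings selects a frequency bin,
    and the peak amplitude between the crossings, passed through a
    saturating nonlinearity (log1p here, an assumed choice), weights that
    bin. bin_edges are Bark-spaced histogram edges."""
    hist = np.zeros(len(bin_edges) - 1)
    up = np.flatnonzero((subband[:-1] < 0) & (subband[1:] >= 0))
    for t0, t1 in zip(up[:-1], up[1:]):
        freq = fs / (t1 - t0)                    # inverse firing interval
        peak = subband[t0:t1 + 1].max()          # peak between crossings
        b = np.searchsorted(bin_edges, freq) - 1
        if 0 <= b < hist.size:
            hist[b] += np.log1p(peak)            # intensity-weighted count
    return hist

def zcpa(frame, fir_bank, fs, bin_edges):
    """Sum the per-channel histograms over the cochlear filterbank
    (16 Hamming bandpass filters covering 200-4000 Hz in the paper)
    to form the ZCPA pseudo-spectrum for one analysis frame."""
    return sum(zcpa_channel(np.convolve(frame, h, mode="same"), fs, bin_edges)
               for h in fir_bank)
```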

3. RBF Neural Network Recognizer
It is quite common to use both static and time-derivative features in speech recognition. However, only the spectral representation of ZCPA is used as the feature vector, since time-derivative features do not improve recognition accuracy, at least for neural network recognizers [Kim et al., 1997]. The time-frequency joint representation of the speech signal, normalized in the number of time frames by trace segmentation [Silverman and Dixon, 1980], is fed into an RBF network which classifies it as one of the vocabulary words. Utterances from 40 speakers are used as training data for the RBF network, where each speaker uttered 50 vocabulary words 2 times. Another 19 speakers, not represented in the training data, uttered the 50 words 2 times each; these are used as speaker-independent test data. The recognition rate obtained by the RBF network is 97.2 %, and most errors come from highly confusable one-syllable words. Factory noise and military operations room noise from the NOISEX-92 CD-ROMs [Varga and Steeneken, 1993], as well as car noise, were added to the test data sets to evaluate performance in realistic situations. The reduction in recognition rate is less than 15 % even when the signal-to-noise ratio (SNR) is 10 dB.

The RBF network has two advantages in our system, for speaker adaptation (Section 4) and out-of-vocabulary (OOV) rejection (Section 5). Our speaker adaptation method requires reference word patterns, and these are provided by the centers of the hidden neurons of the RBF network: each hidden neuron is assigned to one word pattern class, so no extra storage of reference patterns is required. In addition, the activation value of each output neuron can be used as a confidence measure for OOV rejection.
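A minimal sketch of such a recognizer, assuming one Gaussian hidden unit per word class with the class-mean pattern as its center; the shared width sigma and the direct use of hidden activations as outputs are simplifying assumptions, since the paper does not detail its training procedure.

```python
import numpy as np

class RBFRecognizer:
    """Sketch of the RBF word recognizer: one Gaussian hidden neuron per
    vocabulary word, centered on that word's mean time-normalized ZCPA
    pattern. sigma is an assumed shared width parameter."""
    def __init__(self, centers, sigma=1.0):
        self.centers = np.asarray(centers)   # shape (n_words, T * n_bins)
        self.sigma = sigma

    def activations(self, pattern):
        """Gaussian activation per word; doubles as the confidence
        measure used for OOV rejection (Section 5)."""
        d2 = ((self.centers - pattern.ravel()) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * self.sigma ** 2))

    def classify(self, pattern):
        return int(np.argmax(self.activations(pattern)))
```

The same centers serve as the reference patterns for the speaker adaptation of Section 4, which is why no separate reference storage is needed.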

4. Speaker Adaptation
Our approach to speaker adaptation is based on a nonlinear feature space mapping by which speaker-specific spectra are transformed into speaker-independent ones. In the training stage, some of the vocabulary words are uttered by a new speaker, and each word is normalized in time by the trace segmentation algorithm to constitute the training data of the MLP adaptation network. The MLP is trained so that a nonlinear mapping between the feature vector at time $t$, $x_t$, and the reference feature vector, $y_t$, is formed by minimizing the mean square error (MSE) between $y_t$ and $x_t^{new} = F(x_t)$ for all $t$, where $F$ denotes the mapping function of the MLP. Since the center vector $u_i$ of the $i$-th hidden neuron of the RBF recognizer represents the speaker-independent mean pattern of the $i$-th word, the reference feature vector for an input word belonging to the $i$-th class is taken from the appropriate time slot of $u_i$. In the recognition stage, each feature vector in the time-normalized input word pattern $X = [x_1, x_2, \ldots, x_T]$ is transformed by the MLP adaptation network into a new feature vector, constituting a new word pattern $X^{new} = [x_1^{new}, x_2^{new}, \ldots, x_T^{new}]$, which is fed to the RBF recognizer. A block diagram of speaker adaptation with the MLP is shown in Fig. 3.

[Figure 3: Speaker adaptation with the MLP. Each feature vector $x_t$ of the time-normalized input pattern $X$ (a time-frequency pattern) is mapped by the MLP to $x_t^{new}$, forming the speaker-adapted pattern $X^{new}$.]

In the experiments, 50 vocabulary words are uttered 5 times by a new speaker, and 10 words which show low recognition rates are selected as adaptation words. Five experiments are conducted with different selections of training data and the results are averaged; the whole procedure is repeated for 7 additional new speakers and averaged again. The base system shows average recognition rates of 74.3 % and 98.9 % for the 10 adapted words and for the other 40 non-adapted words, respectively, i.e., 94.0 % in total.
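A compact sketch of the adaptation mapping under stated assumptions: a one-hidden-layer MLP trained by plain gradient descent on the MSE between the mapped vector and the reference time slot. The layer sizes, learning rate, and tanh nonlinearity are illustrative choices; the paper does not specify its training configuration.

```python
import numpy as np

class AdaptationMLP:
    """One-hidden-layer MLP F mapping a speaker's feature vector x_t to a
    speaker-independent one, trained to minimize ||y_t - F(x_t)||^2 where
    y_t is taken from the matching time slot of the RBF center u_i."""
    def __init__(self, dim, hidden=32, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, dim))
        self.b2 = np.zeros(dim)
        self.lr = lr

    def forward(self, x):
        self.h = np.tanh(x @ self.W1 + self.b1)
        return self.h @ self.W2 + self.b2

    def train_step(self, x, y):
        """One gradient-descent step on the squared error between y and F(x)."""
        out = self.forward(x)
        err = out - y                                 # error at the output
        gW2 = np.outer(self.h, err)
        gh = (err @ self.W2.T) * (1 - self.h ** 2)    # backprop through tanh
        gW1 = np.outer(x, gh)
        self.W2 -= self.lr * gW2; self.b2 -= self.lr * err
        self.W1 -= self.lr * gW1; self.b1 -= self.lr * gh
        return float((err ** 2).mean())
```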

Even though the recognition rate for the adapted words is increased by 15.4 % on average after adaptation, that for the other non-adapted words is decreased by 7.9 %; as a result, the total recognition rate is decreased by 3.3 %. This is because only 10 words out of 50 are used in training the MLP, and they carry insufficient phonetic mapping information. Since the systems with and without the MLP adaptation network show higher fidelity for the adapted and the non-adapted words, respectively, a judge scheme which passes its verdict by taking advantage of each path is a possible solution. Although the judge network may also be trained [Kim and Lee, 1994], a much simpler algorithm is adopted here. Let $v_i$ and $v_i'$ denote the activation values of the $i$-th output neuron of the RBF recognizer obtained by bypassing and by using the MLP adaptation network, respectively. The decision rule is then to select the maximum argument of $z_i$, given by

$$z_i = \begin{cases} \dfrac{1}{1+\lambda}\, v_i + \dfrac{\lambda}{1+\lambda}\, v_i', & \text{for } i \in I, \\[2mm] \dfrac{\lambda}{1+\lambda}\, v_i + \dfrac{1}{1+\lambda}\, v_i', & \text{for } i \notin I, \end{cases}$$

where $\lambda > 1$ is a weighting factor for the path of higher fidelity and $I$ is the set of indices of the adapted words. When the judge scheme is applied with the empirically determined weighting factor $\lambda = 2.0$, the recognition rate for the adapted words becomes 98.2 %, which is very close to the recognition rate when the adaptation scheme is not involved. The recognition rate for the non-adapted words is also increased, to 92.3 %. The total recognition accuracy is increased to 97.1 %, which demonstrates that the judge scheme used in this system is very effective.
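A direct transcription of this decision rule; the symbol $\lambda$ is our reading of the weighting factor in the typographically damaged original equation, and the function assumes the two activation vectors are NumPy arrays of equal length.

```python
import numpy as np

def judge_decision(v, v_adapted, adapted_idx, lam=2.0):
    """Judge scheme of Section 4: blend the RBF activations from the
    bypass path (v) and the MLP-adapted path (v_adapted), giving weight
    lam/(1+lam) to the higher-fidelity path for each word, then pick the
    argmax. lam=2.0 is the paper's empirical value."""
    v, v_adapted = np.asarray(v, float), np.asarray(v_adapted, float)
    z = (lam * v + v_adapted) / (1.0 + lam)     # non-adapted words: favor bypass
    z[adapted_idx] = (v[adapted_idx] + lam * v_adapted[adapted_idx]) / (1.0 + lam)
    return int(np.argmax(z))                    # recognized word index
```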

5. Out-of-Vocabulary Word Rejection
There are two goals in OOV word rejection. The first is to increase the correct recognition rate, which can be accomplished by minimizing the incorrect rejection rate. The second is to detect and reject OOV words, which is accomplished by maximizing the OOV word rejection rate while minimizing the incorrect rejection rate. Most rejection approaches represent OOV words explicitly, and there has been much research on different representations of OOV words [Wilpon et al., 1990, Rose, 1993]. In our system, however, OOV word rejection is performed by directly inspecting the activation values of the RBF recognizer rather than modeling OOV words explicitly, which keeps the hardware implementation simple. Each output activation value of the RBF recognizer indicates how far the input word pattern is from the trained center of a word pattern class, and thus plays the role of a confidence measure. The rejection decision is made by comparing the confidence measure against a threshold $\theta$.

Rejection accuracy is measured on an additional 48 OOV words from the speaker-independent test speakers while varying the threshold value. The correct rejection rate of OOV words is increased up to 60.1 % while the reduction in recognition rate is 5.2 %. We also evaluated rejection performance after the system is adapted to a specific speaker. Eight speakers uttered the 48 OOV words as in Section 4, and seven kinds of garbage sounds such as door slams, coughs, and slaps were collected. As already described, the recognition rate is 97.1 %. When the OOV word rejection rate is 64.7 %, the recognition rate reaches 94.2 %, the garbage rejection rate reaches 94.4 %, and the false rejection rate (an in-vocabulary word being rejected) is only 2.8 %.
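A minimal sketch of the thresholded decision; the symbol theta is our reading of the garbled threshold in the original text, and the sweep helper merely illustrates how the recognition/rejection trade-off reported above would be traced by varying the threshold.

```python
import numpy as np

def recognize_or_reject(activations, theta):
    """Section 5 rule: the winning RBF activation serves as the confidence
    measure; if it falls below the threshold theta, the input is rejected
    as out-of-vocabulary (None), otherwise the word index is returned."""
    best = int(np.argmax(activations))
    return best if activations[best] >= theta else None

def sweep_threshold(iv_scores, oov_scores, thresholds):
    """Trade-off curve: iv_scores are winning activations on in-vocabulary
    inputs, oov_scores on OOV inputs. Returns, per threshold, the fraction
    of in-vocabulary words kept and of OOV inputs rejected."""
    return [(t, float(np.mean(iv_scores >= t)),
                float(np.mean(oov_scores < t)))
            for t in thresholds]
```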

6. Conclusions
The "Voice Command II" neural speech recognition system is implemented on a DSP for robust speech recognition. Robustness of the system is enhanced in the context of speaker adaptation and rejection capabilities as well as environmental noise. In summary, the capabilities for robustness which make the system more usable are:

- Noise-robust feature extraction based on the human auditory system
- An RBF neural network speech recognizer
- MLP-based speaker adaptation
- Out-of-vocabulary word rejection

The whole system was first designed on workstations with floating-point arithmetic. It was then ported to the fixed-point DSP system, during which careful consideration was given to the resolution required at each stage of the system. In the end, the reduction in recognition rate was kept within 1 %.
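To illustrate the kind of resolution decision this porting involves, here is a hedged sketch of float-to-Q15 conversion for a 16-bit fixed-point DSP; the per-stage scale factor is a stand-in for the values the authors chose, which the paper does not report.

```python
import numpy as np

def float_to_q15(x, stage_scale):
    """Quantize a stage's floating-point values to Q15 (16-bit fixed
    point): pre-scale so the stage's dynamic range fits [-1, 1), map to
    the int16 range, and saturate. stage_scale is an illustrative
    per-stage normalization constant."""
    y = np.asarray(x, dtype=np.float64) * stage_scale
    q = np.clip(np.round(y * 32768.0), -32768, 32767)
    return q.astype(np.int16)

def q15_error(x, stage_scale):
    """Worst-case reconstruction error introduced by the quantization,
    the quantity one would monitor per stage to keep recognition loss small."""
    xq = float_to_q15(x, stage_scale).astype(np.float64) / (32768.0 * stage_scale)
    return float(np.max(np.abs(np.asarray(x) - xq)))
```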

Acknowledgment: This research was supported by LG Corporate Institute of Technology, Korea.

7. References

[Ghitza, 1994] Ghitza, O. (1994). Auditory models and human performances in tasks related to speech coding and speech recognition. IEEE Trans. Speech and Audio Processing, 2(1, part II):115-132.

[Greenwood, 1990] Greenwood, D. (1990). A cochlear frequency-position function for several species - 29 years later. J. Acoust. Soc. America, 87(6):2592-2605.

[Kim et al., 1996] Kim, D.-S., Jeong, J.-H., Kim, J.-W., and Lee, S.-Y. (1996). Feature extraction based on zero-crossings with peak amplitudes for robust speech recognition in noisy environments. In Proc. ICASSP, pages 61-64, Atlanta, USA.

[Kim and Lee, 1994] Kim, D.-S. and Lee, S.-Y. (1994). Intelligent judge neural network for speech recognition. Neural Processing Letters, 1(1):17-20.

[Kim et al., 1997] Kim, D.-S., Lee, S.-Y., Kil, R. M., and Zhu, X. (1997). Simple auditory model for robust speech recognition in real world noisy environments. Electronics Letters, 33(1):12.

[Lee et al., 1996] Lee, S.-Y., Ahn, K.-H., Kim, D.-S., Cho, J.-W., Jeong, J.-H., Kim, J.-W., Kwon, S.-O., and Kil, R. M. (1996). Voice command: A digital neuro-chip for robust speech recognition in real-world noisy environments (invited talk). In Proc. ICONIP, pages 283-287, Hong Kong.

[Rose, 1993] Rose, R. C. (1993). Definition of acoustic subword units for word spotting. In Proc. Eurospeech, pages 1049-1052.

[Silverman and Dixon, 1980] Silverman, H. F. and Dixon, N. R. (1980). State constrained dynamic programming (SCDP) for discrete utterance recognition. In Proc. ICASSP, pages 169-172.

[Varga and Steeneken, 1993] Varga, A. and Steeneken, H. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247-251.

[Wilpon et al., 1990] Wilpon, J. G., Rabiner, L. R., Lee, C. H., and Goldman, E. R. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoust., Speech, Signal Processing, 38(11):1870-1878.

[Zwicker and Terhardt, 1980] Zwicker, E. and Terhardt, E. (1980). Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. America, 68:1523-1525.
