Selective Attention for Noise Robust Speech Recognition

Ki-Young Park and Soo-Young Lee
Brain Science Research Center and Department of Electrical Engineering
Korea Advanced Institute of Science and Technology
373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Korea (South)
Tel: +82-42-869-3431 / Fax: +82-42-869-8570 / E-mail: [email protected]

Jong-Hye Han
Division of Social Welfare, Woosong Information College
226-2 Jayang-dong, Dong-gu, Taejon 300-715, Korea (South)
Tel: +82-42-629-6297 / Fax: +82-42-629-6293 / E-mail: [email protected]

ABSTRACT: Based on the biological selective attention mechanism, a new algorithm was developed to improve the recognition accuracy of isolated words in noisy environments. The attenuating "early filtering" model was implemented by inserting an attention layer just after the input layer. The attention gains, i.e., one-to-one synaptic weights between the original sensory input vector and the attended input vector at the attention layer, were adapted with the error backpropagation algorithm. After attention adaptation, the distance between the original sensory input vector and the attended vector serves as a confidence measure for classification. The developed algorithm demonstrated high recognition rates for isolated Korean words in noisy environments.

INTRODUCTION

Speech recognition has been investigated for decades and has demonstrated considerable success for clean speech. However, poor performance in noisy environments prevents current speech recognition systems from becoming popular in the real world. Three different approaches have been taken to overcome this important problem. One tries to extract noise-robust speech features (Kim et al., 1997), another develops classification algorithms that are robust against input perturbations (Jeong and Lee, 1996), and the third tries to improve signal-to-noise ratios (SNRs) by separating the signal from the noise before it is applied to the speech recognition system (Lee, 1998). Although noise-robust feature extraction based on auditory models has demonstrated improvements in recognition rates, it suffers from enormous computational requirements (Kim et al., 1997; Kim et al., 1999). Noise-robust classification algorithms may not work well under severe noise, and blind signal separation algorithms require assumptions about the characteristics of the signal and the noise.

Humans, however, utilize selective attention to improve recognition accuracy (Broadbent, 1958; Triesman, 1960; Anderson et al., 1993; Cowan, 1988). Even at a very noisy cocktail party, humans usually have little difficulty understanding the speech of their friends. It is therefore natural to incorporate the selective attention mechanism into speech recognition systems for noisy real-world applications. Fortunately, there have been extensive studies on selective attention, and a huge amount of psychological experimental data has been accumulated. However, there is also controversy among the different theories, and only a few models are defined precisely enough to serve as a starting point for real-world engineering applications (Cowan, 1997; Parasuraman, 1998; Pashler, 1998). Only recently have speech recognition researchers utilized the selective attention mechanism for the "cocktail party problem". Fukushima incorporated selective attention and attention-switching algorithms into his Neocognitron model and demonstrated good recognition performance on superimposed binary numerals, recognized one at a time (Fukushima, 1987). However, the Neocognitron model has many unknown parameters that must be determined by heuristics or from psychological experiments, and its performance is sensitive to these parameter values. Its computational requirements are also prohibitively expensive for many real-time applications, and binary input vectors are assumed. Therefore, only the recognition of small binary numerals was reported.

In this paper we report a new, simple selective attention algorithm based on this biological concept and demonstrate the superiority of the developed algorithm for isolated-word recognition in real-world noisy environments. We first review some of the theories of selective attention and present our engineering version of the selective attention mechanism. The developed speech recognition system is then described, and its performance on noisy Korean isolated-word recognition tasks is presented.


SELECTIVE ATTENTION MODELS

The earliest modern theory of selective attention is known as the "early selection" or "early filtering" theory (Broadbent, 1958). Broadbent observed that, although subjects could not recall most of the unattended channel of a dichotic tape, they often could recall the most recent few seconds of that channel. He therefore suggested that the brain temporarily retains information about all stimuli, but that this information fades and is neither admitted to the conscious mind nor encoded in a way that permits later recollection, unless attention is turned quickly to a particular memory trace. As shown in Figure 1, a selective filter blocks part of the sensory input signal before it reaches the limited-capacity channel for later processing.

[Figure 1 components: Senses → Sensory Input → Selective Filter → Limited Capacity Channel → Effectors, with a system for varying output until some input is secured and a store of conditional probabilities of past events.]
Figure 1. An "early filtering" model of Broadbent.

Several important modifications of Broadbent's model have been proposed. To account for early findings of automatic analysis of unattended input, Triesman proposed that the filter merely attenuates the input rather than totally eliminating it from further analysis (Triesman, 1960). Broadbent also modified his model to allow stimulus selection on the basis of semantic properties, not just physical properties (Broadbent, 1971). This amendment was supported by later evidence showing that selection on the basis of physical cues is less effortful than selection on the basis of semantic cues (Johnson and Heinz, 1978).

The most extreme alternative to Broadbent's "early selection" theory is the "late selection" theory. Some theorists noticed that recognition of familiar objects proceeds unselectively and without any capacity limitation, and proposed that all information is analyzed completely and automatically, with attention used only to determine which of the analyzed information enters the subject's response (Deutsch and Deutsch, 1963). However, later experiments provided several pieces of counter-evidence against the "late selection" theory. For example, attentional effects in neuromagnetic fields begin about 20 msec after the auditory stimulus (Woldorff et al., 1993).

Cowan suggested that the "selective filter" is actually intrinsic to the long-term memory activation process (Cowan, 1988). The activated memory is a subset of long-term memory, and the focus of attention is a subset of the activated memory. Any stimulus or any feature in long-term memory can be selected by a voluntary attentional focus or spotlight (Posner and Snyder, 1975). Cowan placed a central executive module in his model to direct attention and control voluntary processing (Cowan, 1988). Selection was also viewed as habituation of the attentional orienting response to a repeated pattern in a nonselected channel (Sokolov, 1963). Sokolov suggested that a neural model of a stimulus develops with repeated presentations of the stimulus, and that incoming stimuli are compared to that neural model. As the neural model develops, the orienting response habituates. When a subsequent stimulus differs from the current neural model, dishabituation of the orienting response occurs. Support for the hypothesized comparison process includes the finding that an orienting response results not only from an increase or onset of stimulation, but also from a decrease or omission of stimuli that would be expected according to the ongoing stimulation.

In summary, the attenuating "early filter" model is commonly agreed upon, but many unknown and controversial issues remain concerning how the filter is formed. We simply borrow this filter concept and develop an algorithm that finds the filter by adaptation.


[Figure 2 labels: sensory input x = (x1, ..., xN); attention gains a = (a1, ..., aN); attended input x̂ = (x̂1, ..., x̂N); hidden layer h = (h1, ..., hK) via weights W (Wkn); output layer y = (y1, ..., yM) via weights V (Vmk).]
Figure 2. Proposed neural network architecture for selective attention.

PROPOSED ATTENTION MODEL

As shown in Figure 1, the "early filtering" theory proposes a selective filter between the sensory input and working memory, so that unwanted stimuli are cut off or attenuated before passing to the recognition parts of the brain. The proposed network architecture for selective attention is shown in Figure 2. The dotted box is a standard one-hidden-layer Perceptron (Rumelhart, 1986), which represents the three boxes on the right of Figure 1. Although each box may be modeled separately, we are interested in selective attention, and one MLP (multilayer Perceptron) is used here for simplicity. An attention layer is added in front of the input layer, which represents the selective filter in Figure 1. Each input value x_k is gated to the k'th input by an attention gain a_k, and the overall signal flow is defined as

\hat{x}_k = a_k x_k, \qquad \hat{h}_j = \sum_k W_{jk} \hat{x}_k, \quad h_j = S(\hat{h}_j), \qquad \hat{y}_i = \sum_j V_{ij} h_j, \quad y_i = S(\hat{y}_i).    (1)

Here W and V are synaptic weights, and S(·) is a hyperbolic-tangent nonlinear function. The attention gain a_k is usually set to 1 during the training phase. By adapting the a_k's for a training data set, this architecture has previously been investigated for improving the learning performance of trained networks (Lee et al., 1991; Lee et al., 1988), for speaker adaptation (Lee et al., 1997), and for learning feature representations for natural language processing (Miikkulainen and Dyer, 1991). Here, the attention gains are adapted at the test phase only. Unless the selective attention process is turned on, all attention gains remain equal to one and the attention module plays no role. Since we only consider selective attention during the recall phase, the attention process is not turned on during the training phase and all a_k's are set to 1. When a training input-output data set {(x^s, t^s), s = 1, 2, ..., S} is given, the neural network is trained to minimize the output error defined as

E = \sum_s E^s = \sum_s \frac{1}{2} \left\| \mathbf{t}^s - \mathbf{y}(\mathbf{W}, \mathbf{V}, \mathbf{x}^s) \right\|^2 = \sum_s \frac{1}{2} \sum_i (t_i^s - y_i^s)^2.    (2)

Here, x^s is an input vector, i.e., a feature vector at a time frame or a time-frequency joint representation vector of a speech signal for speech recognition tasks. For the recognition of C words, the actual output vector y and the target output vector t^s each have C elements. Also, y(W, V, x^s) indicates that the actual output vector is a function of the synaptic weights and an input vector, and y_i^s denotes the i'th element of y(W, V, x^s). For each target output vector t^s, only the element corresponding to the specific word class is set to 1 and all the others are -1. By adjusting the synaptic weights W and V to minimize the output error in Eq. (2), the network learns the input-to-output mapping function defined by the training data set. A simple gradient-descent algorithm results in

V_{ij}[n+1] = V_{ij}[n] - \eta \frac{\partial E}{\partial V_{ij}}[n] \quad \text{and} \quad W_{jk}[n+1] = W_{jk}[n] - \eta \frac{\partial E}{\partial W_{jk}}[n],    (3)

where \eta is a learning rate and [n] denotes the n'th learning adaptation epoch. The gradients in Eq. (3) may be calculated by the error backpropagation (EBP) rule as

\delta_i^{(2)s} \equiv -\frac{\partial E}{\partial \hat{y}_i^s} = (t_i^s - y_i^s) S'(\hat{y}_i^s), \qquad \delta_j^{(1)s} \equiv -\frac{\partial E}{\partial \hat{h}_j^s} = S'(\hat{h}_j^s) \sum_i V_{ij} \delta_i^{(2)s},    (4a)

-\frac{\partial E}{\partial V_{ij}} = \sum_s h_j^s \delta_i^{(2)s}, \qquad -\frac{\partial E}{\partial W_{jk}} = \sum_s x_k^s \delta_j^{(1)s}.    (4b)
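To make the preceding equations concrete, the following is a minimal NumPy sketch of the forward pass of Eq. (1) and one gradient-descent epoch following Eqs. (2)-(4). The function names, layer shapes, and learning rate are illustrative assumptions, not the authors' original implementation.

```python
import numpy as np

def forward(x, a, W, V):
    """Forward pass of Eq. (1): attention-gated input, hidden layer, output layer."""
    x_hat = a * x                       # attended input, x^_k = a_k x_k
    h_hat = W @ x_hat                   # hidden pre-activations, W has shape (K, N)
    h = np.tanh(h_hat)                  # S(.) is a hyperbolic-tangent nonlinearity
    y_hat = V @ h                       # output pre-activations, V has shape (M, K)
    y = np.tanh(y_hat)
    return x_hat, h_hat, h, y_hat, y

def train_epoch(X, T, W, V, eta=0.01):
    """One gradient-descent epoch of Eqs. (2)-(4); attention gains are fixed at 1."""
    a = np.ones(X.shape[1])             # no attention during training
    dW, dV = np.zeros_like(W), np.zeros_like(V)
    for x, t in zip(X, T):              # accumulate gradients over all S samples
        x_hat, h_hat, h, y_hat, y = forward(x, a, W, V)
        delta2 = (t - y) * (1.0 - y ** 2)           # Eq. (4a); S'(u) = 1 - tanh(u)^2
        delta1 = (1.0 - h ** 2) * (V.T @ delta2)
        dV += np.outer(delta2, h)                   # Eq. (4b): -dE/dV and -dE/dW
        dW += np.outer(delta1, x_hat)
    return W + eta * dW, V + eta * dV               # Eq. (3): gradient-descent update
```

With bipolar targets in {-1, +1}^C as described above, W and V would be re-estimated by repeated calls such as W, V = train_epoch(X, T, W, V) until the error of Eq. (2) stops decreasing.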

The selective attention process is based on error backpropagation (EBP). Let us assume that attention is somehow given to a specific class, i.e., one word in an isolated-word recognition task. When the attention class is introduced, the corresponding target output vector t^s ≡ [t_1^s t_2^s ... t_M^s] is defined. For bipolar binary output systems, t_i^s = 1 for the attention class and -1 for the others. Then the attention gains a_k are adapted to minimize the output error

E^s = \frac{1}{2} \sum_i (t_i^s - y_i)^2    (5)

with the given input x ≡ [x_1 x_2 ... x_N]^T and the pre-trained synaptic weights W and V. Initial values of the attention gains a_k are set to 1. The update rule is again based on a gradient-descent algorithm with the error backpropagation rule. At the (n+1)'th attention adaptation epoch, the attention gain a_k is updated as

a_k[n+1] = a_k[n] - \eta_a \frac{\partial E^s}{\partial a_k}[n] = a_k[n] + \eta_a x_k \delta_k^{(0)}[n],    (6a)

\delta_k^{(0)} \equiv -\frac{\partial E^s}{\partial \hat{x}_k} = \sum_j W_{jk} \delta_j^{(1)},    (6b)

where E^s, \delta_j^{(1)}, and W_{jk} denote the attention output error, the j'th component of the back-propagated error at the first hidden layer, and the synaptic weight between the attention-filtered input x̂_k and the j'th neuron of the first hidden layer, respectively. Also, \eta_a denotes the attention adaptation rate. This is actually an extension of the EBP rule to adapt input vectors, and Eqs. (6) are also applicable to more general feed-forward neural networks. In keeping with the attenuating filter theory, the attention gains are limited between 0 and 1. The resulting x̂_k = a_k x_k is regarded as the expected input for the specific attended class, or Sokolov's "neural model". Due to the nonlinear transformation between the input and the output vectors, the adaptation process usually converges to a local minimum in close proximity to the input vector. Interestingly, this local minimum is actually a desired characteristic in this case.
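As an illustration only, a test-time attention adaptation loop following Eqs. (5)-(6) might look like the sketch below. It reuses the forward helper and NumPy import from the previous sketch, and the adaptation rate, iteration cap, stopping tolerance, and gain bounds are assumed values (the theory suggests [0, 1], while the experiments reported later use [0.5, 1.2]).

```python
def adapt_attention(x, t, W, V, eta_a=0.05, bounds=(0.0, 1.0), max_iter=100, tol=1e-3):
    """Test-time adaptation of the attention gains a_k for one attended class.

    x: original input vector; t: bipolar target of the attended class;
    W, V: pre-trained synaptic weights, kept fixed during adaptation.
    """
    a = np.ones_like(x)                               # initial attention gains a_k = 1
    E_s = np.inf
    for _ in range(max_iter):
        _, _, h, _, y = forward(x, a, W, V)
        E_s = 0.5 * np.sum((t - y) ** 2)              # attention output error, Eq. (5)
        if E_s < tol:                                 # stop at a small error
            break                                     # or after max_iter epochs
        delta2 = (t - y) * (1.0 - y ** 2)             # back-propagate the error
        delta1 = (1.0 - h ** 2) * (V.T @ delta2)
        delta0 = W.T @ delta1                         # Eq. (6b): error at attended input
        a = np.clip(a + eta_a * x * delta0, *bounds)  # Eq. (6a) plus the gain limits
    return a, E_s
```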

ISOLATED-WORD RECOGNITION SYSTEM

The proposed selective attention model can be used in a speech recognition system to enhance recognition performance in noisy environments. Currently HMMs (Hidden Markov Models) and neural networks are the most popular classification algorithms for speech recognition tasks (Lippmann, 1989). For isolated-word recognition of clean speech both algorithms perform well. In our experiments MLP (multilayer Perceptron) and RBF (Radial Basis Function) neural network models work slightly better than discrete or continuous HMMs. For continuous speech recognition tasks a standard MLP or RBF model requires pre-segmentation, which is still very problematic. More sophisticated neural network models have been proposed, with much greater computational requirements (Lippmann, 1989; Waibel et al., 1989). On the other hand, HMM classifiers easily accommodate continuous speech and demonstrate better performance. However, the proposed attention adaptation process is based on a gradient-descent algorithm with the error backpropagation rule, which is not easily incorporated into the HMM framework. Therefore, only MLP classifiers are applied here, to isolated-word recognition tasks.

As a pattern classifier, a multilayer Perceptron estimates a posteriori probabilities for trained patterns and performs well for most input patterns without noise (Richard and Lippmann, 1991). When input patterns are degraded by noise and perturbations, the network produces unreliable outputs. Although several learning algorithms have been reported to improve robustness against input perturbations (Jeong and Lee, 1996), feed-forward networks have basic limitations. Therefore, one needs a more elaborate decision measure, which may be derived from feedback signals. The proposed selective attention algorithm uses the attention error E^s as this feedback signal. As shown in Figure 3, the proposed algorithm is defined as follows:

Step 1: For a given input vector x, calculate the output vector y of the neural classifier.
Step 2: For the top Nc candidates, repeat:
  1. Set all a_k = 1.
  2. Adapt the a_k by Eqs. (6) until the termination conditions are satisfied.
  3. Calculate a confidence measure M for this candidate.
Step 3: Select the candidate with the maximum confidence M as the classification result.

Figure 3. Proposed isolated-word recognition system with selective attention.

In Figure 3 the attention adaptation process for one candidate is shown in detail inside the dotted box. The two rectangles represent the attention layer and the MLP of Figure 2, respectively. The figure also shows the error backpropagation path and the attention adaptation as dotted lines. The triangle denotes the distance between the original input x and the attended input x̂, which is regarded as the main component of the confidence measure M. All the other solid boxes perform the same process. The adaptation in Step 2 terminates when the output error becomes smaller than a preset value or remains at a local minimum. The top candidates are selected as the word classes with the largest output values for the original input vector x. The confidence measure M may be defined as a function of three parameters:
• A_0: output activation before attention adaptation,
• D_A: input distance before and after attention,
• E_A: output error before and after attention.
Here, A_0 ≡ y_a is the output of the attention class a without selective attention (SA) for the original input x. The newly introduced parameters D_A and E_A are defined as

D_A \equiv \frac{1}{2N} \sum_{k=1}^{N} (x_k - \hat{x}_k)^2 = \frac{1}{2N} \sum_{k=1}^{N} x_k^2 (1 - a_k)^2,    (7a)

E_A \equiv \frac{1}{2M} \sum_{i=1}^{M} \left[ t_i - y_i(\hat{\mathbf{x}}) \right]^2.    (7b)

With successful attention adaptation, D_A becomes smaller. However, if the input vector differs greatly from the training vectors of the attended class, the attention adaptation converges to a local minimum with a significant output error E_A. In this case, E_A becomes an important measure. The confidence M is now defined as

M \equiv \frac{A_0 + 1}{E_A^{\alpha} D_A^{\beta}},    (8)

where α and β are relative weighting factors between the three parameters. When α and β are both zero, the maximum-M criterion is the same as that of a standard MLP without selective attention. With a large α the decision at Step 3 relies mainly on the selective attention adaptation.
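Combining the pieces, a hedged sketch of Steps 1-3 of Figure 3 together with the confidence measure of Eqs. (7)-(8) is given below. It assumes the forward and adapt_attention helpers sketched earlier, and the default values of Nc, α, β, and the small eps guard are placeholders rather than values prescribed by the paper.

```python
def confidence(A0, D_A, E_A, alpha=5.0, beta=1.0, eps=1e-12):
    """Confidence measure of Eq. (8); eps guards against division by zero."""
    return (A0 + 1.0) / ((E_A + eps) ** alpha * (D_A + eps) ** beta)

def classify_with_attention(x, W, V, n_candidates=5, **adapt_kwargs):
    """Steps 1-3 of Figure 3: rank candidates, adapt attention, pick the max-M class."""
    n_classes = V.shape[0]
    _, _, _, _, y0 = forward(x, np.ones_like(x), W, V)     # Step 1: plain MLP outputs
    candidates = np.argsort(y0)[::-1][:n_candidates]       # top-Nc output classes
    best_class, best_M = None, -np.inf
    for c in candidates:                                   # Step 2: per-candidate SA
        t = -np.ones(n_classes)
        t[c] = 1.0                                         # bipolar target for class c
        a, _ = adapt_attention(x, t, W, V, **adapt_kwargs)
        x_hat = a * x
        _, _, _, _, y_att = forward(x, a, W, V)
        A0 = y0[c]                                         # output before attention
        D_A = 0.5 / x.size * np.sum((x - x_hat) ** 2)      # Eq. (7a)
        E_A = 0.5 / n_classes * np.sum((t - y_att) ** 2)   # Eq. (7b)
        M = confidence(A0, D_A, E_A)
        if M > best_M:                                     # Step 3: maximum confidence
            best_class, best_M = c, M
    return best_class, best_M
```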

EXPERIMENTAL RESULTS

Experiments were conducted with 630 word utterances, i.e., 30 Korean words each spoken three times by 7 male speakers. Each word is segmented, and 16 ZCPA (Zero Crossing with Peak Amplitude) features are calculated at each time frame (Kim et al., 1997; Kim et al., 1999). Each frame corresponds to a 20 msec time interval, shifted by 10 msec for the next frame. The ZCPA features are motivated by the mammalian auditory periphery and show excellent performance for speech recognition in noisy environments. For isolated-word recognition, the number of time frames is normalized to 64 by a simple trace algorithm, so 16 x 64 features are extracted for each word. A two-layer MLP with 1024 input neurons, 100 hidden neurons, and 30 output neurons is trained with the 630 training vectors. Test input vectors are obtained by adding random noise or military operating room noise to the training vectors at 4 different SNRs (Signal-to-Noise Ratios); a sketch of one possible SNR mixing procedure is given after the parameter list below. There are some parameters to be set manually:
• the number of candidate classes, Nc,
• the range of the attention gain values a_k, and
• the relative weighting factors α and β.
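The paper does not spell out its noise-mixing procedure, so the following is only a small sketch of one standard way to scale a noise segment to a target SNR in decibels before adding it to a clean training signal; the function name and arguments are assumptions.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale a noise segment so that clean + noise has the requested SNR in dB."""
    noise = noise[:len(clean)]                      # trim the noise to the signal length
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. noisy = add_noise_at_snr(clean_wave, operating_room_noise, snr_db=10)
```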

Figure 4. Best possible misclassification rates versus the number of candidate classes.

The more candidate classes are considered, the more accurate a decision can be made. However, the computational load also increases proportionally, so a compromise must be made. Figure 4 shows false recognition rates as functions of the number of top candidates Nc for white Gaussian noise. In this figure the false recognition rates are obtained by checking how often the correct word class falls outside the candidate classes; they are the best values achievable in the system by any search method. As shown in Figure 4, no further performance gain is achieved by using an excessive number of candidates. Therefore Nc = 5 is used throughout the following experiments.

Training of the attention layer is no different from that of an ordinary MLP except that it has local connections. Therefore it suffers from the same generalization problems, such as overfitting and underfitting. With too little flexibility, the attention layer cannot adapt to most input vectors. On the other hand, with too much flexibility, the attention adaptation tends to overfit and performance on the test vectors becomes worse. In our experiments 1024 attention gains are adapted, which is too many, so it is desirable to restrict the degrees of freedom at the attention layer. In the experiments, the best performance was obtained with the attention gain range between 0.5 and 1.2.
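If the attention adaptation is implemented along the lines of the earlier adapt_attention sketch, restricting the degrees of freedom in this way amounts to nothing more than passing the reported range as the (assumed) bounds argument:

```python
# hypothetical call, reusing the adapt_attention sketch above
a, E_s = adapt_attention(x, t, W, V, bounds=(0.5, 1.2))
```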

Figure 5. Recognition rates versus the relative weighting factors α and β.

Figures 5 and 6 show recognition rates achieved by the method described in Figure 3. In Figure 5 the recognition rates with 15 dB SNR white Gaussian noise are shown as functions of the relative weighting factors α and β in Eq. (8). As the weighting factor for the attention output error E_A increases, performance declines rather than improves. The attention layer has 1024 adaptive elements, which is enough to reduce the attention output error in all of the experiments, so E_A does not provide reliable information for the decision. However, as shown in Figure 5, the input attention distance D_A is a good parameter for correct classification. Although the exact optimum value of α depends on the noise level, α = 5 gives reasonably good results at SNRs between 10 and 20 dB, and the performance is not sensitive to the value of α.

Figure 6 summarizes the achieved false recognition rates with white Gaussian noise and with military operating room noise from the NOISEX data. Because of its speech-like character, the latter is especially interesting for real-world applications. Results with and without the selective attention adaptation are compared; 'SA' and 'MLP' denote results with and without the selective attention process, respectively. In these figures the proposed speech recognition system with selective attention greatly reduces false recognition rates in noisy environments. About a 30% reduction of false recognition rates is achieved for both random and speech-like noise. With the noise-robust feature extraction of the ZCPA auditory model, the false recognition rates are already very small compared to those of the popular LPC (Linear Predictive Coding) and MFCC (Mel-Frequency Cepstral Coefficients) features, so the 30% reduction is a significant improvement. It is also interesting that the improvements are quite similar for both random and speech-like noise. In general, speech-like noise is critical to recognition performance and not easy to reduce, yet the proposed selective attention algorithm handles this difficult problem well.


[Figure 6 axes: y-axis "False Recognition Rate (%)", 0-50; x-axis "Signal-to-Noise Ratio": 20 dB, 15 dB, 10 dB, 5 dB; two panels, (a) and (b).]

Figure 6. False recognition rates with several SNR levels. The SA and MLP denote results with and without the selective attention adaptation. (a) White Gaussian noise; (b) Military operating room noise.

CONCLUSION

Selective attention is a brain information process that improves accuracy in pattern recognition tasks, especially in noisy environments. In this paper a simple selective attention algorithm is reported. The proposed algorithm is based on Broadbent's early selection model and is constructed by augmenting an MLP classifier with an attention layer at its front. The attention gains can be trained with an extension of the standard EBP learning algorithm. The developed algorithm was tested on isolated-word speech recognition tasks and demonstrated much better recognition rates in noisy environments. Although only 30 words were tested, the algorithm is based on a local search near the test input vector and is expected to work for larger vocabularies.

Acknowledgement: This research was supported as a Brain Science & Engineering Research Program by the Korean Ministry of Science and Technology (MOST). At the initial stage of the research, S.Y. Lee was also supported by an Overseas Research Fund for Excellent Researchers from the MOST.

REFERENCES

Anderson, C., Olshausen, B., and Essen, D. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information, The Journal of Neuroscience, 13(11):4700-4719.
Broadbent, D.E. (1958). Perception and Communication, Pergamon Press.
Broadbent, D.E. (1971). Decision and Stress, Pergamon Press.
Cherry, E.C. (1953). Some experiments on the recognition of speech, with one and with two ears, Journal of the Acoustical Society of America, 25:975-979.
Cowan, N. (1988). Evolving conceptions of memory storage, selective attention, and their mutual constraints within the human information processing system, Psychological Bulletin, 104:163-191.
Cowan, N. (1997). Attention and Memory: An Integrated Framework, Oxford Univ. Press.
Desimone, R., and Duncan, J. (1995). Neural mechanisms of selective visual attention, Annual Review of Neuroscience, 18:193-222.
Deutsch, J., and Deutsch, D. (1963). Attention: some theoretical considerations, Psychological Review, 70:80-90.


Duncan, J. (1984). Selective attention and the organization of visual information, Journal of Experimental Psychology: General, 113:501-517.
Fukushima, K. (1987). Neural network model for selective attention in visual pattern recognition and associative recall, Applied Optics, 26:4985-4992.
Jeong, D.G., and Lee, S.Y. (1996). Merging backpropagation and Hebbian learning rules for robust classification, Neural Networks, 9:1213-1222.
Johnson, W.A., and Heinz, S.P. (1978). Flexibility and capacity demands of attention, Journal of Experimental Psychology: General, 107:420-435.
Kim, D.-S., Lee, S.-Y., Kil, R.-M., and Zhu, X. (1997). Simple auditory model for robust speech recognition in real world noisy environments, Electronics Letters, 33:12.
Kim, D.-S., Lee, S.-Y., and Kil, R.-M. (1999). Auditory processing of speech signals for robust speech recognition in real-world noisy environments, IEEE Trans. Speech and Audio Processing, 7:55-69.
Lee, T.W. (1998). Independent Component Analysis, Kluwer Academic Press.
Lee, H.J., Lee, S.Y., Shin, S.Y., and Koh, B.Y. (1991). TAG: A neural network model for large-scale optical implementation, Neural Computation, 3:135-143.
Lee, S.Y., Jang, J.S., Shin, S.Y., and Shim, C.S. (1988). Optical implementation of associative memory with controlled bit significance, Applied Optics, 27:1921-1923.
Lee, S.Y., Kim, D.S., Ahn, K.H., Jeong, J.H., Kim, H., Park, S.Y., Kim, L.Y., Lee, J.S., and Lee, H.Y. (1997). Voice Command II: a DSP implementation of robust speech recognition in real-world noisy environments, International Conference on Neural Information Processing, pp. 1051-1054, Nov. 24-27, 1997, Dunedin, New Zealand.
Lippmann, R.P. (1989). Review of neural networks for speech recognition, Neural Computation, 1:1-38.
Miikkulainen, R., and Dyer, M.G. (1991). Natural language processing with modular PDP networks and distributed lexicon, Cognitive Science, 15:343-399.
Norman, D.A. (1968). Toward a theory of memory and attention, Psychological Review, 75:522-536.
Parasuraman, R. (ed.) (1998). The Attentive Brain, MIT Press.
Pashler, H.E. (1998). The Psychology of Attention, MIT Press.
Posner, M.I., and Snyder, C.R.R. (1975). Attention and cognitive control. In R.L. Solso (Ed.), Information Processing and Cognition, Erlbaum.
Richard, M., and Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation, 3:461-483.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press.
Sokolov, E.N. (1963). Perception and the Conditioned Reflex, Pergamon Press.
Triesman, A. (1960). Contextual cues in selective listening, Quarterly Journal of Experimental Psychology, 12:242-248.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:328-339.
Woldorff, M.G., Gallen, C.C., Hampson, S.A., Hillyard, S.A., Pantev, C., Sobel, D., and Bloom, F.E. (1993). Modulation of early sensory processing in human auditory cortex, Proceedings of the National Academy of Sciences, 90:8722-8726.
