MODEL OPTIMIZATION FOR NOISE DISCRIMINATION IN HOME ENVIRONMENT

Agnieszka Betkowska, Koichi Shinoda, and Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology
(This work is sponsored by NEC Corporation.)

ABSTRACT

In this paper we present two methods for improving the performance of automatic noise recognition systems. The first method minimizes the number of hidden Markov model parameters according to the training data available, by using the minimum description length criterion. The second method combines the scores from multiple recognition systems by applying the Mixture of Experts architecture. Experimental results showed the effectiveness of these methods.

1. INTRODUCTION

In automatic noise recognition (ANR), noise sources included in acoustic signals are detected and classified into predefined categories. In our daily life, we encounter different types of environmental noise, such as home noise, office noise, and traffic noise. Although these noises cause performance degradation in speech recognition, they often carry useful information for understanding the surrounding environment. The applications of ANR are various: robust speech recognition, acoustic database queries, hearing aids, and the development of personal robots. Promising techniques include statistical pattern recognition methods such as hidden Markov models (HMMs) and neural networks [1]. Gaunard et al. [2] built an HMM-based classifier for five noises (car, train, aircraft, moped, and truck). They claimed that, with the use of LPC-cepstral features and five-state HMMs, the classifier achieved better results than human listeners.

In this paper we present two different ANR methods. One is a method for optimizing the complexity of the noise models, and the other is a method for combining the recognition results from more than one ANR system by using the Mixture of Experts (MoE) architecture. It is well known that recognition performance depends on an appropriate choice of model structure [3], in which the amount of training data available and the characteristics of the input signals should be taken into consideration.

We investigate a method for adjusting the complexity of HMMs according to the training data available, in which the number of mixtures in each state of the HMM is reduced by using the minimum description length (MDL) criterion. In addition, adaptation to each home environment is carried out to enhance the ANR performance. We also investigate a noise discrimination method based on the combination of the recognition results from more than one ANR system with different topologies. First, the topology of each system is optimized to score in favor of a specific noise class. Then, the outputs of these systems are given to a neural network responsible for separating verbal sounds from non-verbal sounds. Next, a more detailed classification is carried out within each of these two categories. We applied the two proposed methods to ANR in the home environment. They were evaluated on a database recorded by a personal robot used in the homes of twelve families.

This paper is organized as follows. In Section 2, the two proposed methods are explained. In Section 3, the results of their evaluation are shown. Section 4 concludes the paper.

2. PROPOSED METHOD

2.1. Optimization of the complexity of noise models

2.1.1. Reduction of Gaussian components in HMM

We use the Gaussian reduction method [4] for optimizing the complexity of the noise models. A well-trained, large-size HMM with a fixed number of Gaussians in every state is prepared as the initial model. Then, the Gaussian mixtures of each state are clustered to form a tree structure by using the k-means algorithm, with the Kullback-Leibler divergence as the measure of distance between two Gaussians. A subset of Gaussian components is then chosen from the tree structure by using the MDL criterion, which has been proven effective for selecting, from various candidate probabilistic models, the one that best fits the given data. The total number of Gaussians is additionally controlled by a penalty coefficient for large-size models (for details, see [4]).
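To make the selection step concrete, the following Python sketch shows the Kullback-Leibler distance between two diagonal-covariance Gaussians (the clustering distance) and an MDL-style score used to pick how many components to keep. This is a minimal sketch under simplifying assumptions: diagonal covariances, a flat list of candidate pruned models rather than the full tree of [4], and an illustrative parameter count and penalty formulation; the function names are our own, not those of [4].

```python
import numpy as np

def kl_divergence_diag(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal-covariance Gaussians,
    used as the distance when clustering Gaussian components."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def description_length(log_likelihood, num_components, dim, num_frames, penalty=1.0):
    """Two-part MDL code length: data term plus a penalized parameter-count term.
    Each diagonal Gaussian contributes roughly (2 * dim + 1) free parameters
    (mean, variance vector, and mixture weight)."""
    num_params = num_components * (2 * dim + 1)
    return -log_likelihood + penalty * 0.5 * num_params * np.log(num_frames)

def select_by_mdl(candidates, dim, num_frames, penalty=1.0):
    """candidates: list of (num_components, log_likelihood) pairs, e.g. one per
    cut of the Gaussian tree. Returns the candidate with minimal code length."""
    return min(
        candidates,
        key=lambda c: description_length(c[1], c[0], dim, num_frames, penalty),
    )
```

A larger penalty value biases the selection toward smaller models, which mirrors the role of the penalty coefficient for large-size models mentioned above.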

2.1.2. Home Adaptation

It is generally observed that noise characteristics differ from home to home. Therefore, a home-dependent model, which is adapted to the noises in a specific home, is expected to yield better performance than a home-independent model, which represents the common characteristics of home noises. For the adaptation, we use the method proposed in [5]. In the adaptation procedure, the mean of each Gaussian component in the home-independent model is mapped to the mean of the home-dependent model according to the following equation:

$$\hat{\mu}_{sm} = \mu_{sm} + \delta_{sm}, \qquad s = 1, \dots, S, \; m = 1, \dots, M,$$

where $\delta_{sm}$ is a shift parameter for a particular mean $\mu_{sm}$ of the home-independent model, $\hat{\mu}_{sm}$ is the adapted mean of the home-dependent model, $S$ is the number of states, and $M$ is the number of Gaussian components in each state. For each mean $\mu_{sm}$, a shift $\delta_{sm}$ must be estimated. However, if the total number of Gaussian components is large, there may not be enough adaptation data to estimate those shifts correctly. Therefore, in order to control the number of free parameters, a tree structure for the shifts is introduced. In this tree, each leaf node represents a single Gaussian component and each non-leaf node represents a subset of Gaussian components [5]. A shift $\delta_n$ is assigned to each leaf node $n$, and a tied shift $\Delta_k$ is assigned to each non-leaf node $k$. Using this tree structure, we can control the number of free parameters according to the amount of data available. If we do not have enough data, the tied shift $\Delta_k$ of a node $k$ in the upper part of the tree is applied to all Gaussian components in the corresponding subset $G_k$ as follows:

$$\hat{\mu}_n = \mu_n + \Delta_k \qquad \text{for } n \in G_k,$$

where $G_k$ is the set of descendant leaf nodes of node $k$. As the amount of data increases, tied shifts from nodes in the lower part of the tree are used for adaptation. To control this process, we use a threshold that defines the minimum amount of data needed to estimate $\Delta_k$.
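The following sketch illustrates the tied-shift idea in simplified form: each node of the tree stores the amount of adaptation data assigned to its subset of Gaussians, and the threshold decides whether a child's own shift can be used or the parent's tied shift must be applied. The class and function names are illustrative, and this is not the exact estimation procedure of [5]; the shifts themselves are assumed to have been estimated beforehand.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class ShiftNode:
    """A node of the shift tree: a leaf covers one Gaussian, an inner node a subset."""
    gaussian_ids: List[int]              # Gaussian components covered by this node
    shift: np.ndarray                    # (tied) shift estimated from this node's data
    frame_count: int                     # adaptation frames aligned to this subset
    children: List["ShiftNode"] = field(default_factory=list)

def apply_shifts(node: ShiftNode, means: Dict[int, np.ndarray], threshold: int) -> None:
    """Recursively apply the deepest shift that still has enough adaptation data."""
    ready = [child for child in node.children if child.frame_count >= threshold]
    for child in ready:
        apply_shifts(child, means, threshold)
    # Gaussians whose subtrees lack sufficient data fall back to this node's tied shift.
    covered = {g for child in ready for g in child.gaussian_ids}
    for gid in node.gaussian_ids:
        if gid not in covered:
            means[gid] = means[gid] + node.shift
```

Calling `apply_shifts(root, means, threshold)` with a large threshold reproduces the data-poor case (one global shift at the root), while a small threshold lets every leaf use its own shift.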

2.2. Mixture of Experts and neural networks

In the second method, we employ more than one ANR system to improve the ANR performance. Each ANR system gives its own score for an input signal. The Mixture of Experts (MoE) technique, implemented with neural networks, is applied to combine these scores and make a final decision about the category of the signal. Let $S$ be the number of recognition systems and $C$ be the number of noise classes. The a posteriori probability (score) $P_{sc}(x)$ that the noise sample $x$ belongs to class $c$ in system $s$ is expressed as follows:

$$P_{sc}(x) = \frac{p_{sc}(x)}{\sum_{c'=1}^{C} p_{sc'}(x)}, \qquad s = 1, \dots, S, \; c = 1, \dots, C,$$

where $p_{sc}(x)$ is the likelihood of $x$ given the model for class $c$ in system $s$.

In order to combine the scores of the $S$ systems, several expert networks are created instead of one global network. A gating network that decides which expert network should be used is also built [6] (Fig. 1(a)).

Fig. 1. (a) The architecture of the Mixture of Experts (MoE). (b) An expert network for noise class $c$ in the MoE.

The neural network for each expert consists of an input layer with $S$ neurons, one hidden layer with 4 neurons, and an output layer with a single neuron (Fig. 1(b)). The likelihood $L_c(x)$ for each class $c$ is calculated as follows:

$$L_c(x) = \sigma\!\left( \sum_{j=1}^{4} w_{cj}\, \sigma\!\left( \sum_{s=1}^{S} v_{cjs} P_{sc}(x) + b_{cj} \right) + b_c \right),$$

where $\sigma(\cdot)$ is a sigmoid function. The weights $v_{cjs}$ and $w_{cj}$ are trained in such a way that $L_c(x)$ is close to one if the correct noise class is $c$ and zero otherwise.
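To make the score combination concrete, the following sketch implements the per-system score normalization and one expert network of the form given above ($S$ inputs, 4 sigmoid hidden units, one sigmoid output). It is a minimal NumPy illustration: the class name, the weight initialization, and the final argmax decision rule are assumptions rather than details stated in the paper, and the training loop (backpropagation toward 0/1 targets) is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normalize_scores(likelihoods):
    """likelihoods: array of shape (S, C) holding p_{sc}(x) for one sample x.
    Returns the posteriors P_{sc}(x), normalized over the C classes per system."""
    return likelihoods / likelihoods.sum(axis=1, keepdims=True)

class ExpertNetwork:
    """One expert per noise class c: S inputs -> 4 hidden sigmoid units -> 1 output."""

    def __init__(self, num_systems, num_hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.V = rng.normal(scale=0.1, size=(num_hidden, num_systems))  # input-to-hidden
        self.b = np.zeros(num_hidden)
        self.w = rng.normal(scale=0.1, size=num_hidden)                 # hidden-to-output
        self.c = 0.0

    def forward(self, scores_for_class):
        """scores_for_class: the S posteriors P_{sc}(x) for this expert's class."""
        hidden = sigmoid(self.V @ scores_for_class + self.b)
        return sigmoid(self.w @ hidden + self.c)                        # L_c(x)

# Hypothetical decision rule: pick the class whose expert gives the highest L_c(x).
def classify(experts, posteriors):
    outputs = [expert.forward(posteriors[:, c]) for c, expert in enumerate(experts)]
    return int(np.argmax(outputs))
```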

3. EXPERIMENTAL RESULTS

In this section, we describe the evaluation of the two ANR methods.

3.1. Database

For the evaluation, we used a database recorded by the personal robot PaPeRo developed by NEC [7], which was used in the homes of twelve families. The whole database contains 74,640 sounds, each of which was detected by the speech detection algorithm built into PaPeRo. The recorded samples were labelled manually and classified into three kinds: speech without noise, noisy speech, and noise without speech. In this study, we used the noises without speech. While the database contains various combinations of different noises, for simplicity we used data samples that contain only one kind of noise. In our experiments, we use five categories: TV (label T), human distant speech (label A), kitchen sounds (label D), footsteps (label F), and sudden noise (label X).

3.2. First experiment - adjusting the complexity of HMMs

3.2.1. Experimental conditions

In this experiment, the data from six families were used. We divided these data into test and training sets in the following manner: the data from one family were chosen as the test set, and the remaining data from the other families were used as the training set. We repeated this procedure six times and averaged the results. For adaptation, we divided each test set into two parts: 1/3 was used for adaptation and 2/3 for evaluation. We trained a three-state left-to-right HMM for each noise. We prepared six initial (baseline) models with different numbers of Gaussians per state (32, 64, 128, 256, 512, and 1024).

3.2.2. Results

The MDL clustering and adaptation results are shown in Fig. 2. The penalty coefficient in the MDL model selection was chosen to give the best accuracy when used with adaptation. As can be seen, the performance of the baseline model and that of the clustered model were similar. However, after applying the adaptation process, the clustered model gave better accuracy than the adapted baseline model. It can therefore be said that MDL clustering is especially effective when adaptation is also utilized.

Fig. 2. MDL clustering and adaptation.

We also performed another experiment. Large-size models (with different initial sizes) were reduced by MDL clustering to the size of an HMM with 32 mixtures per state. The comparison of performance among these models and the baseline HMM is shown in Fig. 3. The clustered models showed an improvement of 2-4% over the baseline model, even though their sizes were the same as the baseline model. This result shows that the models created by the proposed method perform better than HMMs constructed in the conventional way.

Fig. 3. MDL clustering for different initial models. No clustering was performed for the baseline model.

3.3. Second experiment - Mixture of Experts

3.3.1. Experimental conditions

In this experiment we used the data from all twelve families. We divided the data into three sets: a training set, a development set, and a test set. They were divided in the following way: the data from eight families were used for training the MoE systems, and the remaining data from the other four families were distributed between the development and test sets. The topology of each ANR system was obtained by optimizing it for the noises in a particular noise class. Hence, each ANR system performs in favor of one kind of noise. The search for the optimal topology was conducted over the following four types of HMMs:

Type A: ergodic HMM
Type B: left-to-right HMM
Type C: left-to-right HMM in which a skip of one state is allowed
Type D: left-to-right HMM in which a skip of more than one state is allowed

For each HMM type, the number of states was varied from 1 to 11, and the number of mixtures per state was chosen from 1, 2, 4, 8, 16, 32, and 64. For each noise class, the topology of the model was chosen by using the MDL criterion (a selection sketch is given after Table 1). The resulting topologies of the five ANR systems are given in Table 1.

Table 1. Topologies of the noise HMMs.

Noise                  Topology
TV                     Type C, 11 states, 2 mixtures
Human distant speech   Type C, 4 states, 16 mixtures
Sudden noise           Type D, 3 states, 8 mixtures
Footsteps              Type A, 2 states, 2 mixtures
Kitchen sounds         Type A, 2 states, 16 mixtures
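The following sketch shows, under assumptions, how the grid search over HMM type, number of states, and number of mixtures could be organized when every trained candidate is scored by an MDL-style criterion. The helpers `train_hmm` and `count_parameters` are hypothetical placeholders for an HMM toolkit; only the selection loop itself is illustrated.

```python
import itertools
import numpy as np

HMM_TYPES = ["A", "B", "C", "D"]          # ergodic, left-to-right, skip-1, skip > 1
STATE_COUNTS = range(1, 12)               # 1 to 11 states
MIXTURE_COUNTS = [1, 2, 4, 8, 16, 32, 64]

def select_topology(train_data, num_frames, train_hmm, count_parameters, penalty=1.0):
    """Pick the (type, states, mixtures) combination with the smallest MDL score.

    train_hmm(data, hmm_type, n_states, n_mix) -> (model, log_likelihood) and
    count_parameters(model) -> int are assumed to be supplied by an HMM toolkit.
    """
    best, best_score = None, np.inf
    for hmm_type, n_states, n_mix in itertools.product(
            HMM_TYPES, STATE_COUNTS, MIXTURE_COUNTS):
        model, log_lik = train_hmm(train_data, hmm_type, n_states, n_mix)
        score = -log_lik + penalty * 0.5 * count_parameters(model) * np.log(num_frames)
        if score < best_score:
            best, best_score = (hmm_type, n_states, n_mix), score
    return best
```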

The recognition of the development data and the test data was performed with all five ANR systems, and the best result among these five systems was taken as the baseline. We built two kinds of MoE systems:

MoE A: a single network that separates all noise classes.

MoE B: a two-step discrimination network. First, a gating network separates verbal sounds (TV and human distant speech) from non-verbal sounds (kitchen sounds, sudden noise, and footsteps). If a verbal sound is detected by the gating network, the sound is passed to an expert network responsible for separating TV from human distant speech. Similarly, if a non-verbal sound is detected, the sound is passed to an expert network responsible for separating kitchen sounds, sudden noise, and footsteps.

3.3.2. Results

The recognition results are shown in Fig. 4. When the development set was used, the recognition accuracy of the one-step segregation of the noises (MoE A) was improved by 2.4%. The accuracy increased further when the separation into non-verbal and verbal sounds was performed first (MoE B). The segregation into these two groups was conducted with an error rate of 3.2%. The expert network responsible for separating TV from human distant speech had an accuracy of 78.6%, and the network responsible for separating sudden noise, kitchen sounds, and the remaining sounds had an accuracy of 81.0%.

Fig. 4. Results on the development set and the test set for the two kinds of Mixture of Experts (MoE).

When the test set was used, however, MoE A performed better than MoE B. For MoE B, the classification error between verbal and non-verbal sounds was 9.1%, the segregation of verbal sounds was performed with an accuracy of 65.4%, and sudden noise, kitchen sounds, and footsteps were separated with an accuracy of 77.4%. All of these rates were worse than those for the development set, which might be due to a mismatch between the development set and the test set. An evaluation with more complicated noises should be carried out to confirm the effectiveness of the proposed MoE B method.

4. CONCLUSION

In this paper we presented two different ANR methods: adjustment of HMM complexity, and noise discrimination based on the combination of results from more than one recognition system by using MoE. The former method succeeded in reducing the complexity of the model and increasing the accuracy of the discrimination system by 4.5%. In the latter method, the neural network responsible for separating the noises improved the performance by 2.0%. While the two-step segregation (MoE B) did not show an improvement in this study, where the number of noise classes was small, it might be effective when there is a large variety of noises. Research in this direction seems promising.

5. REFERENCES

[1] J. Sillanpää, A. Klapuri, J. Seppänen, and T. Virtanen, "Recognition of acoustic noise mixtures by combined bottom-up and top-down processing," in Proc. EUSIPCO 2000, vol. I, pp. 335-338, 2000.
[2] P. Gaunard, G. C. Mubikangiey, C. Couvreur, and V. Fontaine, "Automatic classification of environmental noise events by hidden Markov models," in Proc. ICASSP, vol. 6, pp. 3609-3612, 1998.
[3] L. Ma, D. Smith, and B. Milner, "Environmental noise classification for context-aware applications," in Proc. Eurospeech 2003, pp. 2237-2240, 2003.
[4] K. Shinoda and K. Iso, "Efficient reduction of Gaussian components using MDL criterion for HMM-based speech recognition," in Proc. ICASSP 2002, vol. I, pp. 869-872, 2002.
[5] K. Shinoda and T. Watanabe, "Speaker adaptation with autonomous control using tree structure," in Proc. Eurospeech 95, pp. 1143-1146, 1995.
[6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[7] T. Iwasawa, S. Ohnaka, and Y. Fujita, "A speech recognition interface for robots using notification of ill-suited conditions," in Proceedings of the 16th Meeting of the Special Interest Group on AI Challenges, pp. 33-38, 2002.
