Fourth International Multi-Conference on Systems, Signals & Devices March 19-22, 2007 – Hammamet, Tunisia

Volume III : Conference on Communication & Signal Processing

Towards an optimal feature set for robustness improvement of sounds classification in a HMM-based classifier adapted to real world background noise

Asma Rabaoui (1), Zied Lachiri (2) and Noureddine Ellouze (1)

(1) Unité de recherche Signal, Image et Reconnaissance des formes, ENIT, BP 37, Campus Universitaire, 1002 le Belvédère, Tunis, Tunisia. e-mails: [email protected], [email protected]

(2) Département Physique et Instrumentation, INSAT, BP 676, Centre Urbain Cedex, 1080, Tunis, Tunisia.

e-mail: [email protected]

Abstract — This paper deals with the classification of auditory scenes into predefined classes. The auditory scenes comprise nine classes of everyday indoor environments. Generally, the main goal of an automatic sound recognition system is to analyze in real time the sound environment of a habitat and to detect abnormal sounds that could indicate a distress situation. In a surveillance application, the most common problem is the background noise captured along with the sounds to be identified. We therefore propose a Multi-Style training system based on HMMs: one recognizer is trained on a database including different levels of background noise and is used as a universal recognizer for every environment. To enhance the system's robustness, we explore different adaptation algorithms that reduce environmental variability. The efficiency of different acoustic features in that system is also studied. Experimental evaluation shows that a rather good recognition rate can be reached, even under severe noise degradation, when the classifier is fed with a suitable set of features.

Keywords — HMM classifier, Multi-style training, Adaptation, Feature selection.

I. INTRODUCTION

Environmental sound retrieval has a wide range of applications. The use of a sound recognition system can offer concrete potential for surveillance and security applications [5]. Furthermore, these functionalities can also be used in portable tele-assistive devices, to inform disabled and elderly persons with impaired hearing about relevant environmental sounds (warning signals, etc.). Such a system, able to classify a number of different sounds found in everyday environments, tends to be better accepted by users than video camera monitoring. This work forms part of a larger investigation into the integration of sound surveillance in a monitoring application. Previous work [10], [6], [3] has focused on recognizing single sound events. The developed Automatic Sounds Recognition (ASR) systems were very sensitive to variations between training and testing conditions, whether these variations were related to changes in the acoustic environment or to incorrect modelling assumptions [11]. Hence, to successfully develop ASR applications, it is crucial to take such discrepancies into account. This can be achieved using different kinds of

ISBN 978-9973-959-06-5 / SSD © 2007

techniques [12], aiming essentially at finding robust and invariant signal features [7], improving the modeling techniques, modifying recognition parameters or features using adaptation or compensation techniques [9], and using robust decision strategies [7].

Discrimination between different classes of environmental sounds is the goal of our work. The first author's contribution to this research field is a thorough investigation of the applicability of state-of-the-art audio features to environmental sound recognition. Traditional features developed for speech recognition as well as features applied in audio segmentation and music retrieval are considered. Additionally, a set of novel features obtained by combining the basic parameters is introduced. The quality of the investigated features is evaluated with an HMM-based classifier.

This paper first treats the applicability of a range of audio features to environmental sound classification, and then focuses on environment adaptation of the acoustic Hidden Markov Models (HMMs) [14] through a particular training mode called Multi-Style training. The objective of acoustic model adaptation techniques is to derive a new set of acoustic models from the reference models, given some adaptation data reflecting the test acoustic conditions. In that system, the quality of the features is examined with the HMM-based classifier, and a set of novel audio features is compared to established features.

The remainder of this paper is organized as follows. In section 2, an overview of the adapted HMM-based classifier is given. Section 3 discusses the applicability of the selected audio features. The experimental set-up and results are presented in section 4. Section 5 concludes the paper with a summary and discussion.

II. AN OVERVIEW OF THE ADAPTED HMM-BASED CLASSIFIER

A. The HMM-based classifier

To begin, we summarize the main steps in designing an automatic sound recognition system. In the standard pattern recognition approach, the classification of a signal is usually

performed in two steps. First, a pre-processor employs signal processing techniques to generate a set of features characterizing the signal to be classified. These features form a feature vector. A decision rule is then used by the classifier to assign the pattern to a particular class.

Our system uses a Hidden Markov Model (HMM) framework for classifying a range of different sounds. Its originality resides in the HMM training mode, which consists in using both clean and noisy sets. In fact, two training modes can be defined: training on clean data only, or training on clean and noisy (multi-condition) data [1]. The advantage of training on clean data only is that sounds are modelled without distortion by any type of noise. Such models should be best suited to represent all available sound information, and the highest performance can be obtained with this type of training when testing on clean data only. But these models contain no information about possible distortions. One possible solution is to build a library of recognizers for various environmental conditions; the recognizer "closest" to the operating conditions is then picked out of the library. Though training a recognizer for every noisy environment is conceivable, this approach remains time-consuming. Hence, the advantage of multi-condition training is that distorted sound signals are taken as training data, which extends the method to practical applications. In this paper, we propose a multi-style training approach: the training database includes different levels of environmental noise added to the original signals, and the recognizer can be successfully tested in every noisy environment. Moreover, in order to enhance the robustness of our system, the proposed solution uses environmental adaptation techniques in the multi-condition training system.

B. Adaptation techniques

The adaptation algorithms closely examined are Maximum Likelihood Linear Regression (MLLR) [11], [13], Maximum A Posteriori (MAP), and the MAP/MLLR algorithm that combines the two [1]. MLLR was originally developed for speaker adaptation [13] but can equally be applied to situations of environmental mismatch. In MLLR adaptation, an initial set of environment-independent models is adapted to the new environment by transforming the mean and variance parameters of the models with a set of linear transforms. The transformations are trained so as to maximise the likelihood of the adaptation data under the transformed model set. Originally, transformations were estimated only for the mean parameters, but the approach has since been extended so that the Gaussian variances can also be updated [11]. In our work, for computational reasons, MLLR is implemented only for diagonal-covariance, single-stream, continuous-density HMMs.

Model adaptation can also be accomplished using a maximum a posteriori (MAP) approach, sometimes referred to as Bayesian adaptation. MAP adaptation involves the use of prior knowledge about the model parameter

distribution. Hence, if we know what the parameters of the model are likely to be (before observing any adaptation data), we may be able to make good use of limited adaptation data and obtain a decent MAP estimate. This type of prior is often termed an informative prior [1]. MAP adaptation is defined at the component level, and it requires more adaptation data than MLLR to be effective. When larger amounts of adaptation data become available, MAP begins to outperform MLLR, thanks to its detailed update of each component (rather than the pooled Gaussian transformation of MLLR). In fact, the two adaptation processes can be combined to improve performance still further, by using the MLLR-transformed means and/or variances as the priors for MAP adaptation. In this case, components that have low occupation likelihood in the adaptation data (and hence would change little under MAP alone) are adapted using a regression class transform in MLLR [1].

III. AN OPTIMAL FEATURE COMBINATION

Feature extraction is the most important part of a recognizer. If the features are ideally good, the type of classification architecture matters little [2]. Conversely, if the features cannot discriminate between the concerned classes, no classifier will be efficient. Ideally good features should present the following properties:

• they have to emphasize the differences between classes;
• they have to be robust to noise disturbance, preserving class separability as far as possible;
• high correlation between features should be avoided as much as possible.
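To make the discussion concrete, here is a minimal pure-Python sketch of four of the short-time features examined in this section: zero-crossing rate, short-time energy, spectral centroid, and spectral roll-off. The 0.93 roll-off threshold matches the paper; the toy signal and all function names are illustrative, not the authors' implementation.

```python
import math

def zero_crossing_rate(frame):
    """Number of zero-voltage crossings (sign changes) within a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def short_time_energy(frame):
    """Short-time average energy of a frame."""
    return sum(x * x for x in frame) / len(frame)

def spectral_centroid(magnitudes, freqs):
    """Amplitude-weighted mean frequency: first moment of the spectrum."""
    total = sum(magnitudes)
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total if total else 0.0

def spectral_rolloff(powers, freqs, threshold=0.93):
    """Frequency below which `threshold` of the total spectral power resides."""
    target = threshold * sum(powers)
    acc = 0.0
    for f, p in zip(freqs, powers):
        acc += p
        if acc >= target:
            return f
    return freqs[-1]

# Toy example: one analysis frame containing 5 cycles of a sinusoid.
frame = [math.sin(2 * math.pi * 5 * (n + 0.5) / 100) for n in range(100)]
print(zero_crossing_rate(frame))  # 9 sign changes here (about two per cycle)
```

A spectrum concentrated at a single bin yields a centroid at that bin's frequency, which is a quick sanity check for the centroid implementation.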

The experiments were first conducted with each feature separately. Starting with the basic signal processing transforms and ending with the cepstral coefficients, this section lists only the optimal features that led to the best results achieved with the adapted classifier described above. It is noteworthy that several features designed for speech processing perform fairly well in the domain of environmental sounds. Several fundamental acoustic features were investigated for the classification of auditory scenes. In addition, the variance and delta features of the basic features were also studied. We provide here a very short description of each feature. The features are grouped into categories according to their processing domain [6].

• Time-domain features:
  – Zero-crossing rate (ZCR) is defined as the number of zero-voltage crossings within a frame.
  – Short-time average energy is the energy of a frame.
• Frequency-domain features:
  – Spectral centroid represents the balancing point of the spectral power distribution. It is calculated as the sum of the frequencies weighted by the amplitudes, divided by the sum of the amplitudes, i.e. the first moment of the spectrum with respect to frequency.
  – Spectral roll-off point measures the frequency below which a certain amount of the power spectrum resides. It is calculated by summing the power spectrum samples until the desired percentage (threshold) of the total energy is reached. The threshold in our experiments was 0.93.
• Linear prediction and cepstral features: these features are used for estimating the rough shape of the spectrum of a signal.
  – Linear prediction coefficients (LPC) were extracted using the autocorrelation method.
  – Cepstral coefficients were derived from the LPC.
  – Mel-frequency cepstral coefficients (MFCC) were extracted by applying the discrete cosine transform to the log-energy outputs of a mel-scaled filter bank.
• Wavelet features:
  – Discrete Wavelet Coefficients (DWC), extracted by applying a (dyadic) wavelet decomposition to the signal.
  – Mel Frequency Discrete Wavelet Coefficients (MFDWC), derived from MFCC coding by applying the DWT (Discrete Wavelet Transform) to the log filter-bank energies instead of the DCT [1].

IV. EXPERIMENTAL SET-UP AND EVALUATIONS

A. Database description

The major part of the sound samples used in the recognition experiments are impulsive and are taken from different sound libraries available on the market [15]. Considering several sound libraries is necessary for building a representative, large, and sufficiently diversified database. Some particular classes of sounds have been built up or completed with hand-recorded signals. All signals in the database have 16-bit resolution and are sampled at 44100 Hz. In this way, all possible audio spectrum components can be exploited for recognition purposes. This point is very important for impulsive sounds, whose frequency bandwidth can be rather extended because of their sharp temporal attacks (guns, explosions). Furthermore, some of the considered sounds show an important energy content in the highest frequencies, such as glass breaks. The selected impulsive sounds belong to 10 classes (human screams, explosions, glass breaks, phone rings, door slams, dog barks, low bangs, gunshots, children voices and machines). As we can see, all categories are typical of surveillance, outdoor or domestic applications. The number of samples considered for each sound category is taken randomly. Furthermore, some non-impulsive classes of sounds (machines, children voices) are also integrated into the experiments; their utility arises when evaluating robustness.
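The noisy scenes used below are obtained by adding a background noise to each clean recording at a prescribed SNR (as described in the next section). The following is a minimal sketch of such mixing, assuming equal-length clean and noise sample sequences; it is an illustration, not the authors' exact tooling.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals
    `snr_db`, then add it to the clean signal sample by sample."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power: P_clean / P_noise' = 10^(snr_db / 10)
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """Check the SNR actually realised in the mixture."""
    residual = [y - c for c, y in zip(clean, noisy)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_res = sum(x * x for x in residual) / len(residual)
    return 10 * math.log10(p_clean / p_res)

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4410)]
noisy = mix_at_snr(clean, noise, snr_db=5)
print(round(measured_snr_db(clean, noisy), 1))  # → 5.0
```

Because the gain is computed from the empirical powers, the realised SNR matches the requested one exactly (up to floating-point error).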

B. The environment independent (EI) system

The baseline recognizer is trained using the previously described database. One real-world background noise is added to each scene at Signal-to-Noise Ratios (SNRs) ranging from -10 dB to 30 dB. The resulting environment-independent system is called the baseline recognizer and is trained on more than 5500 different scenes. In real applications, the EI system is expected to classify various sounds recorded in different environmental conditions. For preliminary experiments on that system, it is necessary to choose a particular environment (for example a real-world background noise at SNR = 5 dB) and to adapt the EI models with adaptation data contaminated by the same noise at the same level. Later we will demonstrate that the resulting adapted system outperforms the baseline recognizer.

C. The adapted system parameters

Our objective is to adapt the current well-trained environment-independent models to the characteristics of a particular environment using a small amount of adaptation data. Thus, we trained acoustic models on artificially perturbed sound material. Experimental evaluation of environmental adaptation using the MAP, MLLR and MAP/MLLR techniques shows a recognition improvement over the baseline system results (Table 2). For MLLR, we update both the mean and the variance model parameters. The three algorithms are applied to the database previously described, for supervised adaptation experiments using various amounts of adaptation data. The smallest set contains 10 scenes and the largest one 156 scenes.

D. Results with individual features

The objective of our investigations is to evaluate the quality of the features. The retrieval ability of the selected features is evaluated by supervised classification with an HMM-based classifier. In fact, classification performs well if feature extraction provides low variance inside classes and high variance between classes.
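The within-class/between-class variance criterion just mentioned can be made concrete with a simple one-dimensional separability score, akin to a Fisher ratio. This is an illustrative sketch, not a metric used in the paper.

```python
def fisher_ratio(classes):
    """Variance of the class means (between-class) divided by the mean
    within-class variance, for 1-D feature values grouped by class."""
    means = [sum(c) / len(c) for c in classes]
    grand = sum(means) / len(means)
    between = sum((m - grand) ** 2 for m in means) / len(means)
    within = sum(
        sum((x - m) ** 2 for x in c) / len(c) for c, m in zip(classes, means)
    ) / len(classes)
    return between / within

separable = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8]]    # tight classes, far apart
overlapping = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]  # spread classes, close means
print(fisher_ratio(separable) > fisher_ratio(overlapping))  # → True
```

A feature with a high ratio separates the classes well; a feature whose classes overlap yields a ratio near zero and contributes little to classification.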
Features are computed over entire sample files. The analysis window length for all features was 25 ms and the windowing function was a Hamming window. The overlap between successive frames was 50% of the frame length. Based on preliminary experiments, we noticed that adding the first-derivative and acceleration parameters had no effect on performance. Evaluations of the adapted system with individual features are compared to the results obtained by the baseline (non-adapted) system. Table 1 shows the confusion matrix obtained when the first 12 mel-frequency cepstral coefficients (MFCCs) are fed to the HMM classifier, using 3 hidden states and 3 Gaussian components (G) per state with 5 iterations of the Baum-Welch algorithm [14], [8]. Other tests are performed using features traditionally applied in speech recognition, namely LPC, PLP, and PLP-RASTA. Besides, features based on wavelet transforms (DWC and

Table 1. Confusion matrix: 12 cepstral coefficients (MFCC), HMMs with 5 states, G=3. Rows give the true class, columns the assigned class (in %).

                 human    gun-    glass   explo-  door    dog     phone   children  machines
                 screams  shots   breaks  sions   slams   barks   rings   voices
human screams      80       0       0       0       0       0       0       20         0
gunshots            0      86.9     0       8.3     5       0       0        0         0
glass breaks        0       0      86.7     0      13.3     0       0        0         0
explosions          0       9.5     0      90.5     0       0       0        0         0
door slams          0       2.8     4.7     0      92.4     0       0        0         0
dog barks           0       0       0       0       0     100       0        0         0
phone rings         0       0       0       0       0       0     100        0         0
children voices     0       0       0       0       0       0       0      100         0
machines            0       0       0       0       0       0       0        0       100

Total Recognition Rate = 92.9%
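The total rate reported in Table 1 is consistent with the unweighted mean of the per-class (diagonal) recognition rates, as a quick check shows:

```python
# Diagonal (correct-classification) rates from Table 1, in %.
diagonal = [80, 86.9, 86.7, 90.5, 92.4, 100, 100, 100, 100]
total_rate = sum(diagonal) / len(diagonal)
print(round(total_rate, 1))  # → 92.9
```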

Table 2. Effects of adaptation (156 adaptation scenes) using various individual features. Recognition rates in %.

Features     Non-Adapted System           Adapted System
             Clean Data   Noisy Data      MAP       MLLR      MAP/MLLR
             SNR=40dB     SNR=5dB         SNR=5dB   SNR=5dB   SNR=5dB
PLP            91.38        77.08         77.08     83.15     88.23
LPC            91.69        76.77         80.23     83.15     88.54
MFCC           92.9         77.15         84.69     89.02     90.15
DWC            89.46        70.23         77.08     77.08     78.69
MFDWC          91.54        76            80.23     89        90.15
PLP-RASTA      89.23        78.46         84.69     84.69     84.69

MFDWC) are also included in the experiments. All the results are illustrated in Table 2. As we can see, all the adaptation methods lead to an important improvement in recognition accuracy compared with the results obtained by the baseline system (without adaptation). With 156 adaptation scenes, the adapted recognizer improves over the baseline system by more than 30% in error rate for nearly all the considered algorithms. The use of wavelet coefficients is motivated by their ability to capture important time and frequency features; we therefore used wavelets as a coding technique. RASTA-PLP is an improvement of the traditional PLP method and consists in a special filtering of the different frequency channels of a PLP analyzer. The RASTA method replaces the conventional critical-band short-term spectrum in PLP with a less sensitive spectral estimate, making PLP more robust to linear spectral distortions.

E. Effects of features combination

In this section, results of feature combinations are discussed. Table 3 shows the results for the best combinations of features. Empirically, we search for an optimal solution by starting from a well-performing feature based on unitary signal processing transforms such as the DFT and DWT, and then adding other features shown to be independent in previous works [6], [2]. Features that do not improve retrieval quality are removed from the combination.
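The empirical search described above, where a well-performing base feature is grown by adding candidates and dropping any that does not help, amounts to greedy forward selection. The sketch below is schematic: `evaluate` stands in for a full train-and-test cycle of the HMM classifier, and the toy scoring function is hypothetical.

```python
def forward_select(base, candidates, evaluate):
    """Greedily grow a feature combination: keep a candidate only
    if adding it improves the evaluation score."""
    selected = list(base)
    best = evaluate(selected)
    for feat in candidates:
        score = evaluate(selected + [feat])
        if score > best:
            selected.append(feat)
            best = score
    return selected, best

# Toy stand-in for classifier accuracy: rewards a known-good subset
# and slightly penalizes everything else.
GOOD = {"MFCC", "Energy", "RF", "centroid", "ZCR"}
def toy_evaluate(features):
    return len(GOOD & set(features)) - 0.1 * len(set(features) - GOOD)

combo, score = forward_select(
    ["MFCC"], ["Energy", "DWC", "RF", "centroid", "ZCR"], toy_evaluate
)
print(combo)  # DWC is rejected: it does not improve the toy score
```

In the paper's setting, each call to `evaluate` would be an expensive retraining of the classifier, which is why the search is greedy rather than exhaustive.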

In [6], by applying data analysis, the author showed that adding temporal features can improve classification performance. Thus, we added the ZCR, a one-dimensional feature related to the fundamental frequency of a signal. In the case of environmental sounds, the fundamental frequency may be similar across different classes; this is why ZCR is not usable for classification as a single feature. Due to the low dimension of the tested temporal features (ZCR and the average energy) and frequency features (RF and centroid), they fail to represent the data on their own. Nevertheless, these features may improve retrieval quality in combination with the basic features. In general, a combination of spectral and time-based features is promising because it captures different aspects of the signal: spectral features characterize frequency content, while time-based features incorporate temporal information and loudness. Experiments show that some features are not able to discriminate the various classes successfully. Coefficients of the Discrete Wavelet Transform (DWT) (using a Daubechies mother wavelet and 50 coefficients) are not discriminative features for our database. These coefficients contain only information from the low frequency bands; the high frequencies necessary to characterize certain environmental sounds are neglected. This explains the poor retrieval quality obtained with the transform coefficients and justifies why we added the first 12 MFCC coefficients together with temporal- and frequency-based features.
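A single level of a Haar decomposition illustrates this point: the approximation coefficients keep the low-band content, while a high-frequency component survives only in the detail coefficients, so keeping only the leading approximation coefficients discards it. Haar is chosen here purely for brevity; the paper used a Daubechies dyadic decomposition.

```python
import math

def haar_step(signal):
    """One level of the Haar DWT: pairwise sums (approximation, low band)
    and pairwise differences (detail, high band), orthonormally scaled."""
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    return approx, detail

def energy(xs):
    return sum(x * x for x in xs)

# A near-Nyquist component: alternating +1/-1, the highest representable frequency.
high = [(-1) ** n for n in range(64)]
approx, detail = haar_step(high)
print(energy(approx), energy(detail))  # all energy falls in the detail band
```

For this signal the approximation band carries no energy at all, which is exactly the information lost when only low-band coefficients are retained.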

Table 3. Effects of features combinations.

Features                                                  Number of   Recognition
                                                          features    Rate (%)
MFCC + Energy + RF + centroid + ZCR                          16         93.73
LPC + Energy + log energy + RF + centroid + ZCR              17         92.9
MFDWC + Energy + RF + centroid + ZCR                         16         91.48
DWC + MFCC + Energy + log energy + RF + centroid + ZCR       67         93.8
PLP-RASTA + Energy + RF + centroid + ZCR                     16         93.2

V. CONCLUSION

In this paper, we have addressed the problem of automatic recognition of environmental scenes and described a classification system able to classify auditory scenes under severe noise degradation using a feature vector consisting of multiple features. There are many interesting directions in which to continue this research: other classifiers can be studied, and more research is needed on how to select the best feature combinations for each type of classifier.

REFERENCES

[1] A. Rabaoui, Z. Lachiri and N. Ellouze. Hidden Markov Model Environment Adaptation for Noisy Sounds in a Supervised Recognition System. International Symposium on Communication, Control and Signal Processing (ISCCSP), Marrakech, 13-15 March 2006.
[2] D. Mitrovic. Discrimination and Retrieval of Environmental Sounds. PhD thesis, Vienna University of Technology, December 2005.
[3] K. El-Maleh. Frame Level Noise Classification in Mobile Environments. PhD thesis, McGill University, Montreal, Canada, January 2004.
[4] D. Istrate. Détection et reconnaissance des sons pour la surveillance médicale [Detection and recognition of sounds for medical monitoring]. PhD thesis, INPG, France, December 2003.
[5] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi and T. Sorsa. Computational Auditory Scene Recognition. International Conference on Acoustics, Speech and Signal Processing, Orlando, Florida, May 2002.
[6] V. Peltonen. Computational Auditory Scene Recognition. M.Sc. thesis, Tampere University of Technology, Finland, 2001.
[7] C. H. Lee. Adaptive classification and decision strategies for robust speech recognition. Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland, May 1999.
[8] J. A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian Mixture and Hidden Markov Models. Technical Report, International Computer Science Institute, Berkeley, April 1998.
[9] C. H. Lee. On stochastic feature and model compensation approaches to robust speech recognition. Speech Communication, vol. 25, pp. 29-47, 1998.
[10] C. Couvreur. Environmental Sound Recognition: A Statistical Approach. PhD thesis, Faculté Polytechnique de Mons, Belgium, June 1997.
[11] M. J. F. Gales and P. C. Woodland. Variance compensation within the MLLR framework. Technical Report CUED, Cambridge University, 1996.
[12] Y. Gong. Speech recognition in noisy environments: A survey. Speech Communication, vol. 16, no. 3, pp. 261-291, Apr. 1995.
[13] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density HMMs. Computer Speech and Language, 9:171-186, 1995.
[14] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.
[15] Real World Computing Partnership. CD-Sound Scene Database in Real Acoustical Environments. http://tosa.mri.co.jp/sounddb/indexe.html
