GA-BASED FEATURE SELECTION FOR SYNCHRONOUS AND ASYNCHRONOUS APPLAUSE DETECTION

Ján Olajec 1, Cumhur Erkut 2, Roman Jarina 3
[email protected], [email protected], [email protected]

ABSTRACT

Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. For instance, asynchronous applause can mark the breaks between the songs in a live performance, whereas synchronous applause can indicate both the end of the performance and the appreciation of the audience. In this paper, we introduce a framework for the automatic selection of a feature subspace from a common feature vector. The selected features build a new representation that is better suited to a given learning and recognition task. To solve this problem, we propose a method based on the Genetic Algorithm (GA) to improve the representativeness and robustness of the features in a generic audio recognition task. Combining only those coefficients selected by the GA increases the classification accuracy from F = 0.7961 to F = 0.9412 (an improvement of about 18.2%).

1. INTRODUCTION

Content-based audio/video analysis aims at obtaining a structured organization of the original content and at understanding its embedded semantics as humans do. An automatic content analysis of digital audio streams can be beneficial to many multimedia applications, such as context-aware computing [5], [1] and audio/video content parsing and summarization [15], [20], [21]. The fundamental step in audio content analysis and understanding is the segmentation of the input audio stream into different audio elements such as speech, music, specific sounds (e.g. applause, laughter, screams, explosions) and various audio effects. In particular, the specific sounds and effects may be very helpful for understanding the high-level semantics of the content [13]. For instance, a bursting applause can indicate the enthusiasm
of the audience in a live broadcast, whereas a synchronous applause for an encore can point out the appreciation of a performance. The detection of such key sounds is one of the challenges in the intelligent management of multimedia information and in content understanding. For example, in [13], [10], elements such as applause, cheering, ball hits, and whistling are used to detect the highlights in sports videos. In film indexing [13], [14], humor, excitement, or violence scenes are categorized by detecting specific audio sounds such as laughter, gunshots, and explosions. Various techniques are applied to detect generic audio elements. Some of them follow the conventional approaches used in automatic speech recognition (MFCCs with HMMs) [12], while others use a wide range of temporal and spectral features classified by GMMs [1], [20], [21], k-nearest neighbors [1], neural networks, and support vector machines (SVMs) [14], [20]. In pattern classification problems it is often desirable to find a relevant subspace in order to reduce the dimensionality of the data before clustering. For the speech/non-speech discrimination task, frequently used techniques include distortion discriminant analysis (DDA) [3], [11] and classification with an SVM [18], [16]. In music genre classification tasks [22], [17], forward and backward feature selection [2], principal component analysis (PCA), linear discriminant analysis (LDA) [17], and independent component analysis (ICA) [8] are also used to obtain smaller feature sets.
Figure 1. Block diagram of an AR system (audio s(n) → AR front-end → feature vectors → classifier → classes).
1 J. Olajec is with the Department of Telecommunications, University of Zilina, Slovak Republic (phone: +421-41-513-2229; fax: +421-41-513-1520; e-mail: [email protected]), and is currently an Erasmus visitor at the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland.
2 C. Erkut is with the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland.
3 R. Jarina is with the Department of Telecommunications, University of Zilina, Slovak Republic.
1.1 Automatic Recognition

Current automatic recognition (AR) systems fall into two primary groups: systems for speech recognition and systems for the recognition of specific sounds. This article focuses on an AR system for specific sounds, namely synchronous and asynchronous applause. Such systems are simpler to design than speech recognition systems with grammatical support. A basic AR system is typically composed of an AR front-end (preprocessing and feature extraction) and a classifier, as shown in Figure 1. The classifier, in our case a GMM or an HMM, maps the feature patterns to the desired sound classes. In our experiments, we follow the standard approach commonly used in speaker recognition [21]: the signal is parameterized by Mel Frequency Cepstral Coefficients (MFCCs) that are classified by a Gaussian Mixture Model (GMM). We investigate the discriminative properties of various MFCC sets and show that if only some of the coefficients are selected, the performance of the detector increases.

1.2 Introduction to the Genetic Algorithm

In this paper, we explore the selection of audio features by a Genetic Algorithm (GA), with an application to clapping sound (applause) detection. The basic strategy for feature subset selection is to incorporate methods such as accelerated search or Monte Carlo methods (e.g. simulated annealing and genetic algorithms). The genetic algorithm [19] is a technique based on the concept of evolution. The GA uses three fundamental operations: mutation, selection, and crossover. It uses the population principle, which means that it scans several points of the feature space at the same step. The search is realized in a pseudo-random form, guided by a fitness function [19]. The GA is highly adaptable, offers innovative solutions, and has been applied successfully to several classification tasks (e.g., [6], [12], [4]). It has been shown that the GA is capable of globally exploring a solution space and is suitable for the task of selecting a small number of robust features out of a large feature set.
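For illustration, a minimal sketch of such a GMM back-end is given below, assuming precomputed MFCC feature matrices (one row per frame) and scikit-learn; the variable names and the single-component mixtures are placeholders, not the authors' implementation.

```python
from sklearn.mixture import GaussianMixture

def fit_class_gmm(train_frames, n_components=1):
    """Fit one GMM to all MFCC frames (rows) of one sound class."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(train_frames)
    return gmm

def classify_clip(clip_frames, gmm_sync, gmm_async):
    """Assign a clip to the class with the higher mean frame log-likelihood."""
    if gmm_sync.score(clip_frames) > gmm_async.score(clip_frames):
        return "synchronous"
    return "asynchronous"

# Hypothetical usage with precomputed (n_frames, n_features) arrays:
# gmm_sync = fit_class_gmm(sync_train_frames)
# gmm_async = fit_class_gmm(async_train_frames)
# label = classify_clip(test_clip_frames, gmm_sync, gmm_async)
```

The clip-level decision by comparing per-class model likelihoods matches the two-model setup mentioned in the figure captions (one GMM per class), although the paper does not spell out the exact decision rule.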
2. AUDIO FEATURE SELECTION BY THE GENETIC ALGORITHM

Let us denote the feature vector by c ∈ R^n, where R^n is an n-dimensional space (in our case n = 63). The GA selects only a subset of the coefficients of this n-dimensional vector. The selected coefficients form a new vector c' ∈ R^m, where m < n. There are 2^n − 1 candidate subsets for the vector c' (more than 9.2234 × 10^18 when a 63-dimensional vector is searched). Obviously, it is not feasible to evaluate all combinations to find the optimum subset for sound-pattern representation; the GA is a good way to solve this problem.

Figure 2. Block diagram of an AR system with feature selection: the GA selects which AR front-end features are passed to the classifier.

The 63-dimensional MFCC feature vectors {c} are represented as binary individuals

s = (g_1, g_2, ..., g_K)^T,

where K = 63 in our case. The components (genes) g_i take the values 0 or 1. If g_i = 1, the i-th cepstral coefficient is selected as a candidate for pattern representation; otherwise it is rejected. In the beginning, a population S = {s_1, s_2, s_3, ..., s_50} of 50 K-dimensional individuals is created randomly. The suitability v of the individuals is evaluated by the fitness function V(s) as v_i = V(s_i), where i is the index of an individual. The type of fitness function V is crucial for finding an optimal solution to the selection problem; in our experiments we use the F-measure (F) as the fitness function

V = F = \frac{2 \cdot P \cdot R}{P + R},    (1)

where the Precision (P) and Recall (R) are defined as
P = \frac{N_{\mathrm{correctly\ detected}}}{N_{\mathrm{total\ detected}}},    (2)

R = \frac{N_{\mathrm{correctly\ detected}}}{N_{\mathrm{total\ correct}}}.    (3)

The Precision (2) and Recall (3) are widely used measures for evaluating detector performance. All the parameters P, R, and F lie within the range [0, 1], where 1 is the best value in each case. We have chosen this fitness function based on our previous experience with clapping sounds; see [7] for other alternatives and the full details.

As the next step, the genetic operators mutation and crossover (recombination) are applied. The operator is chosen randomly at a ratio of 0.2 : 0.8 (mutation : crossover). For a mutation, one individual of the current population is chosen as a parent. The parent is selected quasi-randomly with respect to its suitability: individuals with greater suitability v_i have a greater chance to participate in reproduction than individuals with lower suitability. The number of mutations applied to the individual is determined randomly (up to 5), and so is the gene that is flipped (0 to 1 or 1 to 0). This procedure forms a new individual (a "child"). In the case of crossover, two parents are chosen according to the same fitness rule, and the number of crossover points is again determined randomly (up to 5). The child that originates from the two parents may additionally be mutated with a probability of 0.05. By this procedure, four new individuals (children) are created in each generation. Their suitability is evaluated by the fitness function V and they are added to the current set of individuals. The new population is then created by removing the four individuals with the lowest suitability, so the size of the population remains N = 50.
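To make the selection procedure concrete, the following sketch (not the authors' code) implements a GA of the kind described above: individuals are 63-bit masks, the fitness is the F-measure of eq. (1) obtained by a detector trained on the masked features, four children are created per generation by crossover (probability 0.8) or mutation, and the least fit individuals are discarded. The function evaluate_f_measure is a placeholder for the full train-and-test cycle.

```python
import random

N_FEATURES = 63        # MFCC + delta + delta-delta coefficients
POP_SIZE = 50
N_CHILDREN = 4
P_CROSSOVER = 0.8      # otherwise a single parent is mutated
P_MUTATE_CHILD = 0.05  # extra mutation applied to a crossover child

def evaluate_f_measure(mask):
    """Placeholder fitness: in the paper this trains a GMM detector on the
    features enabled by `mask` and returns the F-measure (eq. 1) on the
    test set.  A random value keeps this skeleton runnable."""
    return random.random()

def mutate(individual):
    child = list(individual)
    for _ in range(random.randint(1, 5)):      # up to 5 flipped genes
        i = random.randrange(N_FEATURES)
        child[i] = 1 - child[i]
    return child

def crossover(parent_a, parent_b):
    child = list(parent_a)
    for _ in range(random.randint(1, 5)):      # up to 5 crossover points
        i = random.randrange(N_FEATURES)
        child[i] = parent_b[i]
    return child

def pick_parent(population, fitness):
    """Fitness-proportional (roulette-wheel) parent selection."""
    return random.choices(population, weights=fitness, k=1)[0]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(POP_SIZE)]
fitness = [evaluate_f_measure(ind) for ind in population]

for generation in range(600):                  # Exp. 1 ran for 600 generations
    for _ in range(N_CHILDREN):
        if random.random() < P_CROSSOVER:
            child = crossover(pick_parent(population, fitness),
                              pick_parent(population, fitness))
            if random.random() < P_MUTATE_CHILD:
                child = mutate(child)
        else:
            child = mutate(pick_parent(population, fitness))
        population.append(child)
        fitness.append(evaluate_f_measure(child))
    # Keep the POP_SIZE fittest individuals for the next generation.
    order = sorted(range(len(population)), key=fitness.__getitem__, reverse=True)
    population = [population[i] for i in order[:POP_SIZE]]
    fitness = [fitness[i] for i in order[:POP_SIZE]]

best_mask = population[0]                      # fittest selected feature mask
```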
3. EXPERIMENTS

We carried out three types of experiments. The first experiment focuses on the detection of synthetic synchronous and asynchronous applause; the synthetic applause is generated with ClaPD [9], where the number of people, the synchronization strength, and the room characteristics can be controlled parametrically. In the second experiment we used recordings of synchronous and asynchronous applause from TV programs and concerts. In the last experiment we mixed the databases of the first and second experiments. In all experiments we tested the effect of different combinations of the MFCC {c}, delta MFCC {Δ}, and delta-delta MFCC {Δ2} feature vectors on the classification performance, in order to find the combination of coefficients from {c, Δ, Δ2} with the best discrimination properties for synchronous and asynchronous applause. For the classification of the feature-vector combinations we used two statistical classifiers: Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM).
3.1. Synthetic applause detection

In this experiment we used ClaPD to generate the database. The database consists of three parts: (1) the training set, (2) the test set, and (3) the validation set (Table 1). Synchronous and asynchronous applause is generated for 5, 10, 15, 20, 30, 40, 50, and 60 clappers by controlling the reverberation parameter and the synchronization strength. The audio signal is sampled at 22 kHz. A sliding 25 ms window with a 10 ms shift is used for the signal analysis. For each window, 512 FFT coefficients are warped into 20 mel-scale bands. Finally, the signal is represented by 20 MFCCs plus the zeroth coefficient (d.c. component), together with their first and second time derivatives; thus each frame is represented by 63 features.

First, we conducted experiments with seven combinations of the feature vectors {c, Δ, Δ2}, namely c, Δ, Δ2, c+Δ, c+Δ2, Δ+Δ2, and c+Δ+Δ2, classified with a GMM. The results for all combinations were identical (P = 0.83, R = 0.8, F = 0.82); this performance is indicated by the legend "GMM" in Fig. 3. With the feature combinations {c+Δ+Δ2} and {c} we also classified the synchronous and asynchronous synthetic applause with HMMs with different numbers of states: 2, 4, 8, 16, 32, 64, and 128. The best classification with an HMM was achieved by an eight-state model (see Fig. 3), corresponding to P = 0.81, R = 1, and F = 0.89.
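A sketch of this front-end, assuming librosa, is shown below; note that a 512-point FFT at 22.05 kHz corresponds to a window of about 23 ms rather than the 25 ms stated above, so the parameters are only an approximation of the authors' setup.

```python
import numpy as np
import librosa

def extract_features(path):
    """Return an (n_frames, 63) array: 21 MFCCs + deltas + delta-deltas."""
    y, sr = librosa.load(path, sr=22050)            # mono, ~22 kHz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=21,                                  # 0th (d.c.) + 20 coefficients
        n_mels=20,                                  # 20 mel-scale bands
        n_fft=512,                                  # 512-point FFT (~23 ms window)
        hop_length=int(0.010 * sr))                 # 10 ms shift
    delta = librosa.feature.delta(mfcc)             # first time derivative
    delta2 = librosa.feature.delta(mfcc, order=2)   # second time derivative
    return np.vstack([mfcc, delta, delta2]).T       # one 63-D vector per frame
```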
Table 1. Structure of the synthetic-applause database.

            | Asynchronous        | Synchronous
            | T [m:s] | No. clips | T [m:s] | No. clips
Train set   | 6:57    | –         | 12:21   | –
Test set    | 2:05    | 25        | 2:05    | 25
Valid. set  | 2:05    | 25        | 2:05    | 25
Fig. 3. Performance of various classifiers (GMM and HMM) and feature vectors (combinations of {c}, {Δ}, and {Δ2}), and of the feature vector determined by the GA.
Table 2. Parameters of the GA.

Parameter                                 | Exp. 1   | Exp. 3
Vector dimension                          | 63       | 63
No. of mixtures                           | 1        | 1
Type of covariance matrix                 | diagonal | diagonal
No. of individuals in a generation        | 50       | 50
No. of new individuals                    | 4        | 4
Crossover probability                     | 0.8      | 0.8
No. of crossover points                   | 1-5      | 1-5
No. of mutation points                    | 1-5      | 1-5
Probability of mutation after crossover   | 0.05     | 0.05
No. of generations                        | 600      | 330
No. of evaluations                        | 2446     | 1366
Dimension of the new vectors              | 21-33    | 24-38
From these experiments it is not clear which feature vector (c, Δ, or Δ2) contributes more markedly to the discrimination properties. Hence, we applied the GA to the feature vector {c+Δ+Δ2} and used a GMM classifier. The resulting recognition rates for synchronous and asynchronous synthetic applause on the validation set were P = 0.96, R = 1, F = 0.98 (in Fig. 3 these results are indicated by the legend "GA GMM"). The parameters of the GA are shown in Table 2.

3.2. Real applause detection

In the second experiment, the detection of synchronous and asynchronous applause is performed on recordings of applause from TV programs and music concerts. Audio processing and classification are the same as in experiment 1, and the structure of the database is given in Table 3. In this experiment we used GMM classifiers for the different combinations of feature vectors, and HMM classifiers for the vectors {c} and {c+Δ+Δ2}. The results for detection with the GMM are shown in Fig. 4. Classification with HMMs of 4, 8, 16, 32, 64, and 128 states gave the same results as the GMM classification of the feature vector {c+Δ+Δ2}; classification with a two-state HMM gave worse results. Misclassification of asynchronous applause was caused by a close-miked clapper; because of this dominant contribution, the system detects the asynchronous applause as synchronous. Conversely, synchronous applause is classified as asynchronous when a big group of people claps in sync for only a short period of time.
Table 3. Structure of the database for experiment 2.

            | Asynchronous        | Synchronous
            | T [m:s] | No. clips | T [m:s] | No. clips
Train set   | 2:51    | –         | 1:29    | –
Test set    | 2:08    | 25        | 1:40    | 25
Valid. set  | 2:06    | 25        | 1:40    | 25
The reason why we did not use the GA in this experiment is that the recognition results were already sufficient: only one clip from each of the synchronous and asynchronous applause classes was misclassified.
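A minimal sketch of the HMM classification used for comparison in these experiments, assuming the hmmlearn package and lists of per-clip feature arrays, could look as follows; it is an illustration, not the authors' implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_hmm(clips, n_states):
    """Fit one HMM to a class from a list of (n_frames, n_features) clips."""
    X = np.concatenate(clips)
    lengths = [len(c) for c in clips]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def classify_clip(clip, hmm_sync, hmm_async):
    """Assign the clip to the model with the higher log-likelihood."""
    if hmm_sync.score(clip) > hmm_async.score(clip):
        return "synchronous"
    return "asynchronous"

# Hypothetical sweep over the numbers of states tested in the paper:
# for n_states in (2, 4, 8, 16, 32, 64, 128):
#     hmm_sync = train_class_hmm(sync_train_clips, n_states)
#     hmm_async = train_class_hmm(async_train_clips, n_states)
#     ... compute Precision, Recall and F on the test clips ...
```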
Fig. 4. Precision and Recall of sync/async detection for different combinations of feature vectors in the recorded-applause detection. Classification is performed with two GMMs.

3.3. Hybrid applause detection

In the third experiment we use a combined database consisting of the recorded and the synthetic synchronous and asynchronous applause (Table 4). As in the previous experiments, we perform classification with a GMM for different combinations of feature vectors and classification with an HMM for the vectors {c} and {c+Δ+Δ2}. The results for the GMM and HMM are shown in Figs. 5 and 6, respectively. In the experiments with combinations of the feature vectors {c, Δ, Δ2} and GMM classification, the MFC coefficients {c} play an important role; as illustrated in Fig. 5, the vectors {c}, {c+Δ}, {c+Δ2}, and {Δ} give the best results. The number of states of the HMM also plays an important role: increasing the number of states improves the performance of the system; specifically, the precision improves while the recall stays at the same level (see Fig. 6). The best recognition results were attained by using the genetic algorithm to select only some coefficients from the feature vector {c+Δ+Δ2}, as shown in Fig. 6 by the legend "GA GMM". With this selection both precision and recall are the best: P = 0.9231, R = 0.96, F = 0.9412.
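As a quick arithmetic check, these reported figures are consistent with eq. (1):

F = \frac{2 \cdot 0.9231 \cdot 0.96}{0.9231 + 0.96} = \frac{1.7724}{1.8831} \approx 0.9412.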
Table 4. Structure of the database for experiment 3.

            | Asynchronous        | Synchronous
            | T [m:s] | No. clips | T [m:s] | No. clips
Train set   | 9:49    | –         | 13:50   | –
Test set    | 13:50   | 50        | 3:45    | 50
Valid. set  | 4:11    | 50        | 1:40    | 50
Fig. 5. Precision and Recall of sync/async detection for different combinations of feature vectors on the mixed database. Classification is performed with two GMMs.
Fig. 6. Classification results for HMMs with different numbers of states, and for the GA-reduced feature vector classified with a GMM.

4. SUMMARY AND CONCLUSION

A GA-based feature selection for synchronous/asynchronous applause classification has been described and tested
with two statistical classification methods. We have investigated the discrimination properties of MFCC feature vectors and their first and second time derivatives, and we have applied the GA for feature selection. We conclude that the performance increases when only the selected features are provided to the classifier. Using the GA, the 24 most discriminative features were selected automatically from the 63-dimensional feature vector consisting of MFCCs, delta MFCCs, and delta-delta MFCCs (Fig. 7). The MFC coefficients seem to play a more important role in the classification of recorded applause than the delta and delta-delta coefficients. However, selecting some coefficients from the first and second derivatives of the MFCCs and adding them to the MFCCs can markedly contribute to the separation of the classes. Contrary to the LDA or PCA methods, the GA-based feature extraction can be considered a nonlinear method. In the past, GA-based methods were considered computationally too demanding, but current resources such as parallel and distributed computing make them a feasible choice for tasks similar to those reported in this paper.
No. | {c} | {Δ} | {Δ2}
 0  |  1  |  0  |  1
 1  |  0  |  0  |  1
 2  |  1  |  0  |  0
 3  |  1  |  0  |  0
 4  |  0  |  0  |  0
 5  |  0  |  0  |  0
 6  |  0  |  1  |  0
 7  |  1  |  0  |  0
 8  |  0  |  0  |  0
 9  |  0  |  1  |  0
10  |  1  |  0  |  1
11  |  0  |  1  |  0
12  |  0  |  1  |  0
13  |  1  |  1  |  0
14  |  1  |  0  |  1
15  |  1  |  0  |  0
16  |  1  |  0  |  1
17  |  0  |  1  |  0
18  |  1  |  0  |  0
19  |  1  |  0  |  0
20  |  1  |  1  |  0

Sum (selected coeff.) = 24

Fig. 7. Features selected by the GA ('1') from the 3×21-dimensional MFCC feature set.
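Assuming feature matrices whose 63 columns are ordered as [c_0..c_20, Δ_0..Δ_20, Δ2_0..Δ2_20], the selection in Fig. 7 can be applied as a simple column mask; the flags below are transcribed from the figure, while the ordering convention is an assumption for illustration.

```python
import numpy as np

# Selection flags from Fig. 7, one per coefficient index 0..20.
mask_c  = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
mask_d  = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
mask_d2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]

mask = np.array(mask_c + mask_d + mask_d2, dtype=bool)  # 63 flags in total
assert mask.sum() == 24                                  # 24 selected coefficients

def reduce_features(frames_63):
    """Keep only the GA-selected columns of an (n_frames, 63) feature matrix."""
    return frames_63[:, mask]
```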
5. ACKNOWLEDGMENT

The research leading to this work was partially supported by the Scientific Grant Agency of the Ministry of Education of the Slovak Republic under contract No. 1/4066/07.

6. REFERENCES

[1] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, 2006, pp. 321-329.
[2] A. Webb, Statistical Pattern Recognition, Wiley, 2002.
[3] C. J. C. Burges, J. C. Platt, and S. Jana, "Distortion discriminant analysis for audio fingerprinting," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 165-174, May 2003.
[4] C.-T. Lin, H.-W. Nein, and J.-Y. Hwu, "GA-based noisy speech recognition using two-dimensional cepstrum," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, Nov. 2000, pp. 664-675.
[5] D. Ellis and K. Lee, "Minimal-impact audio-based personal archives," in Proc. of the ACM Workshop on Continuous Archival and Retrieval of Personal Experiences, 2004, pp. 39-47.
[6] H. Guo, L. B. Jack, and A. K. Nandi, "Feature generation using genetic programming with application to fault classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 35, no. 1, pp. 89-99, Feb. 2005.
[7] J. Olajec, R. Jarina, and M. Kuba, "GA-based feature extraction for clapping sound detection," in Proc. IEEE NEUREL 2006, Belgrade, Serbia and Montenegro, Sept. 2006, pp. 21-25.
[8] L. De Lathauwer, B. De Moor, and J. Vandewalle, "Blind source separation by higher-order singular value decomposition," in Signal Processing VII: Theories and Applications, Proc. EUSIPCO-94, Edinburgh, UK, 1994, pp. 175-178.
[9] L. Peltola, C. Erkut, P. R. Cook, and V. Välimäki, "Synthesis of hand clapping sounds," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, March 2007.
[10] M. Xu, N. Maddage, C.-S. Xu, M. Kankanhalli, and Q. Tian, "Creating audio keywords for event detection in soccer video," in Proc. of the 4th IEEE International Conference on Multimedia and Expo, 2003, vol. 2, pp. 281-284.
[11] N. Mesgarani, M. Slaney, and S. A. Shamma, "Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, May 2006.
[12] P. Day and A. K. Nandi, "Robust text-independent speaker verification using genetic programming," IEEE Transactions on Audio, Speech and Language Processing, accepted for publication.
[13] R. Cai, L. Lu, A. Hanjalic, H.-J. Zhang, and L.-H. Cai, "A flexible framework for key audio effects detection and auditory context inference," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, May 2006, pp. 1026-1039.
[14] S. Moncrieff, C. Dorai, and S. Venkatesh, "Detecting indexical signs in film audio for scene interpretation," in Proc. of the 2nd IEEE International Conference on Multimedia and Expo, 2001, pp. 989-992.
[15] S. Tsekeridou and I. Pitas, "Content-based video parsing and indexing based on audio-visual interaction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, Apr. 2001, pp. 522-535.
[16] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[17] T. Li and G. Tzanetakis, "Factors in automatic musical genre classification of audio signals," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[18] V. N. Vapnik, The Nature of Statistical Learning Theory, New York: Springer, 1995.
[19] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction, Morgan Kaufmann, 1998.
[20] Y. Li and C. Dorai, "Instructional video content analysis using audio information," IEEE Transactions on Audio, Speech and Language Processing, accepted for publication.
[21] Y. Li, S. Narayanan, and C.-C. Jay Kuo, "Content-based movie analysis and indexing based on audio-visual cues," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 8, Aug. 2004, pp. 1073-1085.
[22] Y. Yaslan and Z. Cataltepe, "Audio music genre classification using different classifiers and feature selection methods," in Proc. of the 18th International Conference on Pattern Recognition, vol. 2, 2006, pp. 573-576.