8th Seminar on Neural Network Applications in Electrical Engineering, NEUREL-2006 Faculty of Electrical Engineering, University of Belgrade, Serbia, September 25-27, 2006 http://neurel.etf.bg.ac.yu, http://www.ewh.ieee.org/reg/8/conferences.html


GA-Based Feature Extraction for Clapping Sound Detection

Ján Olajec, Roman Jarina, Michal Kuba

Abstract — Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. In this paper, we introduce a framework for the automatic selection of a feature subspace from a common feature vector. The selected features build a new representation that is better suited to a given learning and recognition task. To solve this problem, we propose a GA-based (Genetic Algorithm) method that improves the representativeness and robustness of the features for a generic audio recognition task.

Keywords — Audio classification, audio content analysis, feature extraction, genetic algorithm and programming.

I. INTRODUCTION

CONTENT-BASED audio/video analysis aims at obtaining a structured organization of the original content and understanding its embedded semantics as humans do. Automatic content analysis of digital audio streams can benefit many multimedia applications, such as context-aware computing [4], [5] and audio/video content parsing and summarization [1], [2], [3]. The fundamental step in audio content analysis and understanding is the segmentation of the input audio stream into different audio elements such as speech, music, specific sounds (e.g. applause, laughter, screams, explosions), and various audio effects. The specific sounds and effects in particular can be very helpful for understanding the high-level semantics of the content [7]. Hence, the detection of these key specific sounds is one of the challenges in the area of intelligent management of multimedia information and content understanding. For example, in [7], [6], elements such as applause, cheering, ball hits, and whistling are used to detect the highlights in sports videos. In film indexing [7], [8], humour, excitement, or violence scenes are categorized by detecting specific sounds such as laughter, gun shots, and explosions.

J. Olajec is with the Department of Telecommunications, University of Zilina, Slovak Republic (phone: +421-41-513-2229; fax: +421-41-513-1520; e-mail: [email protected]). R. Jarina is with the Department of Telecommunications, University of Zilina, Slovak Republic (e-mail: [email protected]). M. Kuba is with the Department of Telecommunications, University of Zilina, Slovak Republic (e-mail: [email protected]).


Various techniques have been applied to detect generic audio elements. Some follow the conventional approach used in ASR (MFCC features with HMM classifiers) [7]; others use a wide range of temporal and spectral features classified by GMM [5], [2], [3], k-nearest neighbour [5], neural networks, or support vector machines [8], [2]. In this paper, we explore the selection of audio features by a Genetic Algorithm (GA), with an application to clapping sound (applause) detection. In the experiment, we follow a standard approach commonly used in speaker recognition: the signal is parameterised by Mel Frequency Cepstral Coefficients (MFCCs), which are classified by a Gaussian Mixture Model (GMM). We investigate the discriminative properties of various MFCC sets and show that the performance of the detector may increase if only some of the coefficients are selected.

A genetic algorithm [10] is a technique based on the concept of evolution. A GA uses three fundamental operations: mutation, selection, and crossover. It operates on a population, which means it scans several points of the search space in parallel. The search proceeds in a pseudorandom manner, guided by a "fitness" function [10]. GAs have been shown to be highly adaptable and to offer innovative solutions. They have been applied successfully to several classification tasks (e.g. [9], [12], [13]) and have proven capable of globally exploring a solution space, which makes them suitable for selecting a small number of robust features out of a large feature set.

II. AUDIO FEATURE SELECTION BY GENETIC ALGORITHM

Let us denote a feature vector c ∈ R^n, where R^n is an n-dimensional space (in our case n = 24). The GA selects only a subset of the coefficients of this n-dimensional vector. The selected coefficients form a new vector c' ∈ R^m, where m < n. There are 2^n − 1 candidates (more than 16 million when a 24-D MFCC vector is searched) for forming the vector c'. Obviously, it is not feasible to evaluate all combinations to find the optimum subset for sound pattern representation, and a GA is a good option for solving this problem. The procedure of the GA applied in the experiment is as follows (Figure 1). The 24-dimensional MFCC feature vectors {c} are represented by binary individuals s = (g_1, g_2, ..., g_K)^T, K = 24. The components (genes) g_i take the values 0 or 1. If g_i = 1, the i-th cepstral coefficient is selected as a candidate for pattern representation; otherwise it is rejected. At the beginning, a population S = {s_1, s_2, s_3, ..., s_50} of 50 K-dimensional individuals is randomly created.
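To make the representation concrete, here is a minimal sketch (not from the paper; the array names and dummy data are ours) of how such a binary individual acts as a feature mask over the MFCC vectors:

```python
import numpy as np

K = 24                                   # genes per individual = number of MFCCs
rng = np.random.default_rng(0)

# One random individual: a binary mask s = (g_1, ..., g_24) over the coefficients
s = rng.integers(0, 2, size=K)

def select_features(frames, s):
    """Keep the i-th cepstral coefficient only where gene g_i = 1."""
    return frames[:, s.astype(bool)]

frames = rng.standard_normal((100, K))   # dummy data: 100 frames of 24 MFCCs
reduced = select_features(frames, s)     # shape (100, m), with m = s.sum() < 24
```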

The suitability v_i of each individual is evaluated by the fitness function V(s) as

$$ v_i = V(s_i), \qquad (1) $$

where i is the index of the individual. Then the genetic operators, mutation and crossover (recombination), are applied. The operator is chosen at random at a ratio of 0.3 mutations to 0.7 crossovers. For mutation, one individual of the current population is chosen. A parent is selected quasi-randomly with respect to the suitability of the individual: individuals with greater suitability v_i have a greater chance to participate in reproduction than individuals with lower suitability. The probability P_i that the individual with index i is chosen for the reproductive process is directly proportional to v_i, that is,

$$ P_i = \frac{v_i}{\sum_{j=1}^{N} v_j}, \qquad (2) $$

where N is the number of individuals in the population (N = 50). The number of mutations applied to an individual is determined randomly (up to 3), and so is the gene that is changed (0 to 1 or 1 to 0). A new individual (a "child") is formed by this procedure. In the case of crossover, two parents are chosen according to the fitness rule. The number of crossover points is again determined randomly (up to 3). The child that originates from the two parents may also be mutated, with probability P = 0.05. By this procedure, 5 new individuals (children) are created. Their suitability is evaluated by the fitness function V, and they are added to the original set of individuals. The new population is then created by removing the 5 individuals with the lowest suitability, so the size of the population remains N = 50.

The use of an appropriate fitness function V is very important for finding an optimal solution to the selection problem. We propose four fitness functions. Three of them are estimated from histograms of log-likelihoods for the two following classes: the first class represents the specific sound, clapping (or applause), and the second class represents all other sounds. A GMM was built for the class of clapping sounds, and the histograms of likelihood values result from testing this GMM on both classes of sounds. The distributions of the logarithmic likelihoods are described by the mean values µ_A and µ_O and the standard deviations σ_A and σ_O, where index A indicates the class of clapping sounds and O the class of all other sounds. The first fitness function V_1 maximizes the difference of the mean values µ and favours small standard deviations σ in both classes:

$$ V_1 = (\mu_A - \mu_O)^2 + \frac{1}{\sigma_A^2} + \frac{1}{\sigma_O^2}. \qquad (3) $$
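A compact sketch of one generation of this procedure follows. It is an illustration under our own naming, with a dummy placeholder fitness, since the paper's actual fitness functions require training and scoring a GMM for every individual:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 24, 50                       # genes per individual, population size

def fitness(s):
    """Placeholder for V(s); the paper evaluates V1..V4 on GMM
    log-likelihood statistics, which we omit here for brevity."""
    return 1.0 + s.sum()            # dummy positive score, as Eq. (2) requires

def roulette_pick(pop, v):
    """Fitness-proportional parent selection, Eq. (2): P_i = v_i / sum_j v_j."""
    return pop[rng.choice(len(pop), p=v / v.sum())]

def mutate(s):
    """Flip up to 3 randomly chosen genes (0 <-> 1)."""
    child = s.copy()
    for _ in range(rng.integers(1, 4)):
        child[rng.integers(K)] ^= 1
    return child

def crossover(a, b):
    """Multi-point crossover: swap parent tails at up to 3 random points."""
    a, b = a.copy(), b.copy()
    for i in sorted(rng.integers(0, K, size=rng.integers(1, 4))):
        a[i:], b[i:] = b[i:].copy(), a[i:].copy()
    return a                        # keep one of the two resulting children

pop = rng.integers(0, 2, size=(N, K))          # random initial population
v = np.array([fitness(s) for s in pop])

for generation in range(100):
    children = []
    for _ in range(5):                         # 5 children per generation
        if rng.random() < 0.3:                 # 0.3 mutation : 0.7 crossover
            child = mutate(roulette_pick(pop, v))
        else:
            child = crossover(roulette_pick(pop, v), roulette_pick(pop, v))
            if rng.random() < 0.05:            # child also mutated with P = 0.05
                child = mutate(child)
        children.append(child)
    pop = np.vstack([pop, children])
    v = np.array([fitness(s) for s in pop])
    keep = np.argsort(v)[5:]                   # drop the 5 least fit individuals
    pop, v = pop[keep], v[keep]                # population size stays N = 50
```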

Fig. 1. Flow diagram of genetic algorithm.

The second fitness function originates from the Kullback-Leibler information measure in the distribution density estimation problem [11]:

$$ V_2 = \sum_{ll} P_A(ll)\,\log\frac{P_A(ll)}{P_O(ll)}, \qquad (4) $$

where for approximately Gaussian distributions, we can write

$$ V_2 = \frac{1}{2}\left[\left(\frac{\mu_A - \mu_O}{\sigma_O}\right)^2 + \left(\frac{\sigma_A}{\sigma_O}\right)^2 - 1 + \ln\left(\frac{\sigma_O}{\sigma_A}\right)^2\right]. \qquad (5) $$
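Before turning to the numerically robust form derived next, Eq. (5) itself is straightforward to evaluate from the four statistics; a minimal sketch (our naming):

```python
import numpy as np

def v2_gaussian_kl(mu_a, sigma_a, mu_o, sigma_o):
    """Eq. (5): KL divergence between the two 1-D Gaussians fitted to the
    log-likelihood distributions of the clapping (A) and other (O) classes."""
    return 0.5 * (((mu_a - mu_o) / sigma_o) ** 2
                  + (sigma_a / sigma_o) ** 2
                  - 1.0
                  + np.log((sigma_o / sigma_a) ** 2))
```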

To avoid numerical underflow during the calculation, we rewrote the function V_2 (Eqs. 4 and 5) in the following form:

$$ V_2 = \frac{1}{2}\left[10^{\,2\{\log\mu_A + \log(1 - 10^{(\log\mu_O - \log\mu_A)}) - \log\sigma_O\}} + 10^{\,2\{\log\sigma_A - \log\sigma_O\}} - 1 - 2(\log\sigma_A - \log\sigma_O)\cdot\ln 10\right]. \qquad (6) $$
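A sketch of Eq. (6) under the assumption that the four statistics are supplied as base-10 logarithms of positive quantities (e.g. magnitudes, with µ_A > µ_O); the function name and argument convention are ours:

```python
import numpy as np

LN10 = np.log(10.0)

def v2_log_domain(lmu_a, lmu_o, lsig_a, lsig_o):
    """Eq. (6): Eq. (5) rewritten with base-10 logs of the statistics,
    so that very small variances do not underflow during evaluation."""
    # ((mu_A - mu_O) / sigma_O)^2, assembled as a power of 10
    t1 = 10.0 ** (2.0 * (lmu_a
                         + np.log10(1.0 - 10.0 ** (lmu_o - lmu_a))
                         - lsig_o))
    t2 = 10.0 ** (2.0 * (lsig_a - lsig_o))     # (sigma_A / sigma_O)^2
    t3 = -2.0 * (lsig_a - lsig_o) * LN10       # ln(sigma_O / sigma_A)^2
    return 0.5 * (t1 + t2 - 1.0 + t3)
```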

The third fitness function V_3 is based on the criterion used in linear discriminant analysis:

$$ V_3 = \frac{(\mu_A - \mu_O)^2}{\sigma_A^2 + \sigma_O^2}. \qquad (7) $$

The F measure, calculated from the precision P and recall R, is applied as the fourth fitness function V_4:

$$ V_4 = F = \frac{2 \cdot P \cdot R}{P + R}. \qquad (8) $$

Precision and recall are standard and widely used measures for the evaluation of detector performance.
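Both remaining criteria are one-line computations; the following sketch (our naming; the counts tp, fp, fn are assumed to come from the detector's clip-level decisions) shows them side by side:

```python
def v3_lda(mu_a, sigma_a, mu_o, sigma_o):
    """Eq. (7): Fisher-style linear discriminant criterion."""
    return (mu_a - mu_o) ** 2 / (sigma_a ** 2 + sigma_o ** 2)

def v4_f_measure(tp, fp, fn):
    """Eq. (8): F measure from precision P and recall R of the detector."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```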

III. EXPERIMENT

The database consists of more than 6.5 hours of sounds, divided into two classes: 1) clapping; 2) other sounds. The database was created from selections of different TV programs. The audio streams were manually segmented, and each segment (clip) belongs to one of the two classes. The duration of the clips ranges from 3 to 10 s. The recordings are sampled at 22 050 Hz with 16-bit resolution. The durations of generic sounds and clapping sounds follow the ratio 1 : 0.12, which approximately corresponds to the average occurrence of clapping sounds in TV programs. Details about the database are summarised in Table 1.


Fig. 2. Distribution of log-likelihoods of the baseline model a) at frame level; b) at clip level.

TABLE 1: STRUCTURE OF THE DATABASE

Class               | Database type | No. of clips | Duration [h:min:sec]
Clapping (Applause) | Train set     | 366          | 0:24:29
Clapping (Applause) | Test set      | 346          | 0:23:41
Other sounds        | Test set      | 2788         | 5:52:35

For signal analysis, a 23 ms window with 50% overlap is used. From each frame, 24 MFCCs are computed. The train set of the clapping sound class was used to create a 3-component GMM used for classification. The parameters of the classifier are optimised by the Expectation-Maximization algorithm. Each frame of a test sound is evaluated by the logarithmic likelihood log P(c | Θ). The GMM classifier that uses all 24 MFCCs serves as the baseline system. The distributions of the baseline log-likelihoods for both classes are shown in Figure 2. Each clip is classified into the class "Applause" or "Other Sound" by averaging the relevant log-likelihoods (Figure 2b). After adapting the thresholds between the classes, the baseline reaches a maximal F measure of 0.6972.

The components of the new feature vectors are selected from the standard cepstral vector consisting of 24 MFCCs by applying the Genetic Algorithm. All four fitness functions, defined by Eqs. (3), (6), (7), and (8), are used as the assessment criterion of the GA. Figure 3 shows the maximal and average fitness among all 50 individuals in the population for each generation of the genetic algorithm (with fitness function V4). In addition, the distributions of the log-likelihoods when the GA is applied are shown in the Appendix.
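The paper does not state which tools were used. As a rough sketch of the same front end and model with present-day libraries (librosa and scikit-learn; the file names and the 512-sample window, approximately 23 ms at 22 050 Hz, are our assumptions, as is the diagonal covariance type):

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

SR = 22050
WIN = 512                 # ~23 ms window at 22.05 kHz
HOP = WIN // 2            # 50% overlap

def mfcc_frames(path):
    """24 MFCCs per ~23 ms frame, mirroring the paper's front end."""
    y, _ = librosa.load(path, sr=SR)
    m = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=24, n_fft=WIN, hop_length=HOP)
    return m.T                                # shape: (frames, 24)

# Train a 3-component GMM of the clapping class (EM runs inside fit);
# the file list is hypothetical.
train = np.vstack([mfcc_frames(p) for p in ["clap1.wav", "clap2.wav"]])
gmm = GaussianMixture(n_components=3, covariance_type="diag").fit(train)

def clip_log_likelihood(path):
    """Average frame log-likelihood log P(c|Theta) over a clip (cf. Figure 2b);
    thresholding this score yields the Applause / Other decision."""
    return gmm.score_samples(mfcc_frames(path)).mean()
```

For GA-based selection, the chosen columns of the MFCC matrix would be masked out before both training and scoring, so each individual is evaluated with its own retrained GMM.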

Fig. 3. Maximal and average fitness per generation of the GA with fitness function V4.

The results of the experiment are summarised in Table 2. When the function V_1 was applied in the GA, the highest fitness was reached when all 24 MFCCs were used; that is, all coefficients were selected by this criterion. When the functions V_2, V_3, or V_4 were used, the maximal fitness was reached when only a subset of the features was selected. These three approaches gave different subsets of features, but in all cases the performance of the detector was greater than with all 24 features. The cepstral coefficient c1 was selected in all cases, the coefficients c2 ÷ c4 in two cases, and the last coefficient c24 in three cases. It is interesting that the highest precision and recall were reached when only the first and last coefficients were used.

TABLE 2: PERFORMANCE OF THE DETECTOR FOR DIFFERENT SUBSETS OF FEATURES

Fitness Func. | Selected MFCC         | Max(V) | F     | P     | R
V1            | All (c1 ÷ c24)        | 25.677 | 0.697 | 0.578 | 0.879
V2            | c1 ÷ c5               | 1.198  | 0.737 | 0.619 | 0.910
V3            | c1 ÷ c4, c7, c16, c24 | 0.942  | 0.785 | 0.692 | 0.908
V4            | c1, c24               | 0.833  | 0.833 | 0.754 | 0.931

APPENDIX

Fig. A1. Histogram of logarithmic likelihood when GA with fitness function V2 is applied, at a) frame level; b) clip level.

IV. CONCLUSION

These mixed results of the experiment do not give a clear answer as to which coefficients are more or less discriminative for detecting clapping sounds among other sounds. It is nevertheless worth noting that: a) the performance increases when only selected features are fed to the classifier; b) the highest and lowest MFCCs are clearly more discriminative than the coefficients from the central part of the MFCC feature vector. The MFC coefficients with small indices carry information about the rough contour of the spectrum, while the coefficients with the highest indices may carry more information about the fine structure of the spectrum and about long-term correlation in the temporal domain. This long-term temporal correlation may correspond to the rhythmicity present in clapping sounds. Thus the selection of features seems to be sound-specific. Future work on generic sound detection will be oriented toward applying the GA to a wide range of spectro-temporal audio features to obtain a clearer answer to the feature selection problem.


Fig. A2. Histogram of logarithmic likelihood when GA with fitness function V3 is applied, at a) frame level; b) clip level.


Fig. A3. Histogram of logarithmic likelihood when GA with fitness function V4 is applied, at a) frame level; b) clip level.

ACKNOWLEDGMENT

This work was supported by the Research and Development Assistance Agency under contract No. APVT-20-044102.

REFERENCES

[1] S. Tsekeridou and I. Pitas, "Content-based video parsing and indexing based on audio-visual interaction," IEEE Trans. Circuits Syst. Video Technol., vol. 11, Apr. 2001, pp. 522-535.
[2] Y. Li and C. Dorai, "Instructional video content analysis using audio information," IEEE Trans. Audio, Speech, and Language Processing, accepted for publication.
[3] Y. Li, S. Narayanan, and C.-C. J. Kuo, "Content-based movie analysis and indexing based on audio-visual cues," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, Aug. 2004, pp. 1073-1085.
[4] D. Ellis and K. Lee, "Minimal-impact audio-based personal archives," in Proc. ACM Workshop on Continuous Archival and Retrieval of Personal Experiences, 2004, pp. 39-47.
[5] V. Peltonen, J. Tuomi, A. P. Klapuri, J. Huopaniemi, and T. Sorsa, "Computational auditory scene recognition," in Proc. 27th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2002, vol. 2, pp. 1941-1944.
[6] M. Xu, N. Maddage, C.-S. Xu, M. Kankanhalli, and Q. Tian, "Creating audio keywords for event detection in soccer video," in Proc. 4th IEEE Int. Conf. on Multimedia and Expo, 2003, vol. 2, pp. 281-284.
[7] R. Cai, L. Lu, A. Hanjalic, H.-J. Zhang, and L.-H. Cai, "A flexible framework for key audio effects detection and auditory context inference," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, May 2006, pp. 1026-1039.
[8] S. Moncrieff, C. Dorai, and S. Venkatesh, "Detecting indexical signs in film audio for scene interpretation," in Proc. 2nd IEEE Int. Conf. on Multimedia and Expo, 2001, pp. 989-992.
[9] H. Guo, L. B. Jack, and A. K. Nandi, "Feature generation using genetic programming with application to fault classification," IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 1, Feb. 2005, pp. 89-99.
[10] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction. New York: Morgan Kaufmann, 1998.
[11] V. M. Kondakov and P. N. Sapozhnikov, "Kullback-Leibler informational measure in the distribution density estimating problem," Springer New York, ISSN 1072-3374, vol. 56, no. 3, Sep. 1991, pp. 2415-2418.
[12] P. Day and A. K. Nandi, "Robust text-independent speaker verification using genetic programming," IEEE Trans. Audio, Speech, and Language Processing, accepted for publication.
[13] C.-T. Lin, H.-W. Nein, and J.-Y. Hwu, "GA-based noisy speech recognition using two-dimensional cepstrum," IEEE Trans. Speech Audio Processing, vol. 8, no. 6, Nov. 2000, pp. 664-675.
