Video salient event classification using audio features

Silvia Corchs, Gianluigi Ciocca, Massimiliano Fiori, and Francesca Gasparini
Department of Informatics, Systems and Communication, University of Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy

ABSTRACT
The aim of this work is to detect the events in video sequences that are salient with respect to the audio signal. In particular, we focus on the audio analysis of a video, with the goal of finding which features are significant for detecting audio-salient events. We have extracted the audio tracks from videos of different sport events. For each video, we have manually labeled the salient audio events using binary markings. On each frame, features in both the time and frequency domains have been considered. These features have been used to train different classifiers: Classification and Regression Trees, Support Vector Machine, and k-Nearest Neighbor. The classification performances are reported in terms of confusion matrices.

Keywords: audio event, audio saliency, CART, SVM, kNN
1. INTRODUCTION

Since the amount of multimedia data is constantly increasing, mining and analyzing the content of such data is a challenging task. The integration of computational human attention models within multimedia data mining techniques can certainly improve their performance. The visual attention mechanism has already been investigated and incorporated within data mining methods. Different saliency-based models, such as the well-known model by Itti et al.,1 have been considered in the case of static images and also for video (spatio-temporal saliency models).2–4 However, when observing a video sequence, humans are driven not only by visually salient stimuli but also by auditory salient ones. Several works in the literature model auditory saliency,5 but the problem of integrating audio-visual saliency to generate a multimodal saliency map is not yet solved. Within this context, the aim of this work is to detect the events in video sequences that are salient with respect to the audio signal. While many computational models are available for visual saliency maps, fewer models exist for acoustic attention. Itti et al.1 proposed the concept of the saliency map to understand and represent bottom-up visual attention. A set of low-level features (intensity, color and orientation) is extracted in parallel from the image at multiple scales to produce topographic feature maps. These maps are then combined into a single saliency map which indicates the perceptual influence of each part of the image. The saliency map is scanned to find the locations that attract attention, and it was verified using eye movement data that the model could replicate several properties of human attention. Analogous to visual saliency maps, a saliency map for audition was proposed by Kayser et al.5 The structure of this map is similar to the visual saliency map by Itti et al.1 Feature maps in this model are represented by means of a spectrogram, a visual representation of how the frequencies in a sound change over time. This auditory saliency map extracts individual features, such as spectral or temporal modulation, in parallel. Following Itti et al.1 and Kayser et al.,5 Kalinli and Narayanan6 proposed an auditory saliency map based on a model of bottom-up, stimulus-driven auditory attention that was successfully applied to speech processing. Schauerte et al.7 introduced a different definition of auditory attention based on Bayesian surprise. Their proposal relies on a probabilistic model of the signal's frequency distribution to calculate the surprise, which in principle measures how unexpected an observed signal is given the preceding observations. Recently, Tsuchida and Cottrell8 proposed an auditory saliency model based on natural statistics. The model is an extension of the visual saliency model by Zhang et al.9

Send correspondence to: Francesca Gasparini E-mail:
[email protected]
Over the last years, the tasks of identifying auditory scenes and detecting and classifying individual sound events within a scene have seen a particular rise in research interest. Recently, Giannoulis et al.10 presented a public evaluation challenge for the classification of acoustic scenes and the detection of acoustic events. The authors addressed acoustic events like cough, door slam, drawer, keyboard, keys, door knock, laughter, mouse click, page turn, pen drop, phone, printer, speech, and switch. Acoustic scene detection and recognition has also been the focus of the international collaborative effort called CLEAR (CLassification of Events, Activities and Relationships).11 In particular, the case of environments like classrooms or meeting rooms was analyzed, where speech usually is the most informative acoustic event, but other kinds of sounds may also carry useful information (applause or laughter inside a speech, a strong yawn in the middle of a lecture, a chair moving or door noise, etc.). Hunter et al.12 analyzed the acoustic signals recorded during a tennis championship. The authors used the video to mark up notable sound events including various types of tennis strokes, echoes, bounces of the ball, audience applause, footsteps, speech and other vocalizations. In order to classify the salient sounds occurring during the tennis matches, they created spectrographic templates and followed a standard procedure used in speech processing. Focusing on the sport category, we differ from the above cited articles since we consider only a binary classification task: salient or non-salient acoustic events. That is, in this paper we are interested in classifying temporal intervals (called frames) of videos of several different sports as salient or non-salient, analyzing only the audio tracks. To this end, we have first manually labeled 10 audio streams corresponding to videos of seven different sports. Using this information as ground truth, we have then applied different machine learning methods to classify the data as salient or non-salient. The main contributions of this work are as follows:

• Labeling salient audio events. We have extracted the audio files from videos of 7 different sport events: Cricket, Formula 1, Football, Poker, Basket, Kendo and Tennis. The videos used in our work were chosen from the database collected by The DIEM Project,13 freely available online. For each video, we have manually labeled the salient audio events. By salient acoustic event within a sport context we mean all the sounds related to important actions, applause of the audience, shouting of the supporters, the speakers' shouted comments, etc. For example, within a Tennis video we have labeled as salient events the ball impact on the racket, audience applause, and the excited speaker comments on significant actions.

• Building a classifier for detecting salient and non-salient acoustic events. On each frame, features in both the time and frequency domains have been considered14, 15 and used to train different classifiers. The classifier output is a binary indicator function of 0s (no event) and 1s (audio event), and the classification performance is reported in terms of confusion matrices. To solve the classification problem, we have considered three different kinds of classifiers: Classification and Regression Trees (CART),16 Support Vector Machine (SVM)17 and the k-Nearest Neighbors algorithm (kNN).
The paper is organized as follows: the audio features and the dataset are presented in sections 2 and 3 respectively. In section 4 CART, SVM and kNN methods are briefly described. In section 5 the experiments performed are listed and the corresponding results are shown in terms of confusion matrices for each of the three different classifiers. Finally, the conclusions are drawn in section 6.
2. AUDIO FEATURES

Each audio file has been segmented into overlapping frames weighted by a Hamming window. On each frame, features in both the time and frequency domains have been evaluated. Using x(i) to denote the i-th amplitude sample of the audio signal frame considered (of length N), we have evaluated the following features in the time domain:15

• Volume (VL): detects the variation of signal amplitude:
$V(n) = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} x(i)^2}$   (1)
• Low Short Time Energy Ratio (LSTER): is an effective feature in discriminating between speech and music signals. It is defined starting from the Short Time Energy (STE) for a given frame ω of length K:
$STE_\omega = \sum_{i=0}^{K-1} x(i)^2$   (2)
Let I be the total number of frames ($\omega_i$, $i = 0, \ldots, I-1$) and $\mu_{STE}$ the average STE; LSTER is then

$LSTER = \frac{1}{2I}\sum_{i=0}^{I-1}\left[\mathrm{sign}(0.5\,\mu_{STE} - STE_{\omega_i}) + 1\right]$   (3)
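As an illustration of Eqs. (1)-(3), the following Python sketch (not the authors' implementation, which was developed in MATLAB) frames a signal into 0.2-second Hamming-windowed frames with 50% overlap, as described in Section 5, and computes Volume and LSTER; the function names are illustrative.

```python
import numpy as np

def frame_signal(x, fs, frame_dur=0.2, overlap=0.5):
    """Split signal x (1-D array) into overlapping Hamming-windowed frames."""
    n = int(frame_dur * fs)                  # samples per frame (N)
    hop = max(1, int(n * (1.0 - overlap)))   # hop size for 50% overlap
    win = np.hamming(n)
    return np.array([x[s:s + n] * win for s in range(0, len(x) - n + 1, hop)])

def volume(frame):
    """Eq. (1): root-mean-square amplitude of one frame."""
    return np.sqrt(np.mean(frame ** 2))

def lster(frames):
    """Eq. (3): ratio of frames whose STE is below half the average STE."""
    ste = np.sum(frames ** 2, axis=1)        # Eq. (2): one STE value per frame
    return np.mean(np.sign(0.5 * ste.mean() - ste) + 1) / 2.0
```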
In order to obtain the features in the frequency domain, we compute the DFT (Discrete Fourier Transform) of the audio signal. Using X to denote the DFT of the time-domain signal and M to denote the index of the sample having the highest frequency, we compute the following features:

• Signal Energy (E): is defined as the area under the squared signal:
$E = \sum_{n=0}^{M} |X(n)|^2$   (4)
• Sub-Band Energy (EB): is the signal energy in a given sub-band:

$EB = \sum_{n=0}^{M_B} |X(n)|^2$   (5)

where $M_B$ is the highest frequency index in sub-band B.

• Frequency Centroid (FC): represents the average point of the spectral power distribution:

$FC = \frac{\sum_{n=0}^{M} n\,|X(n)|^2}{\sum_{n=0}^{M} |X(n)|^2}$   (6)
• Frequency Bandwidth (FB): is defined as the size of the frequency interval assigned to a signal:

$FB = \sqrt{\frac{\sum_{n=0}^{M} (n - FC)^2\,|X(n)|^2}{\sum_{n=0}^{M} |X(n)|^2}}$   (7)
• Spectral Flux (SF): is a measure of how quickly the power spectrum of a signal changes, calculated by comparing the power spectrum of one frame against the power spectrum of the previous one:

$SF = \frac{1}{K}\sum_{n=0}^{K-1} f(X, n)^2$   (8)

where

$f(X, n) = \log(|X(n)| + \epsilon) - \log(|X(n-1)| + \epsilon)$   (9)

$\epsilon$ is a small positive value and K is the number of DFT samples.
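The frequency-domain features of Eqs. (4)-(9) can be sketched in Python as follows (again only an illustration, not the authors' code; the sub-band edges and the value of ε are arbitrary choices, and the spectral flux compares the current frame with the previous one, as stated in the text).

```python
import numpy as np

def spectral_features(frame, prev_frame, fs, band=(0.0, 1000.0), eps=1e-10):
    """Compute E, EB, FC, FB and SF for one Hamming-windowed frame."""
    X = np.abs(np.fft.rfft(frame))            # |X(n)|, one-sided DFT magnitudes
    P = X ** 2                                # power spectrum
    bins = np.arange(len(P))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    energy = P.sum()                                          # Eq. (4)
    eb = P[(freqs >= band[0]) & (freqs <= band[1])].sum()     # Eq. (5)
    fc = (bins * P).sum() / P.sum()                           # Eq. (6), in bin units
    fb = np.sqrt(((bins - fc) ** 2 * P).sum() / P.sum())      # Eq. (7)

    X_prev = np.abs(np.fft.rfft(prev_frame))
    f = np.log(X + eps) - np.log(X_prev + eps)                # Eq. (9)
    sf = np.mean(f ** 2)                                      # Eq. (8)
    return energy, eb, fc, fb, sf
```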
3. DATABASE

The videos used in our work were chosen from the database collected by The DIEM Project,13 freely available online. Within that experiment, the eye movements of 250 participants watching 85 different videos were collected. The videos belong to the following categories: advertisements, film clips, real-world scenes, social scenes, film trailers, video game trailers, music videos, documentaries, sports highlights, and news clips. In our work we have extracted 10 audio files from videos of 7 different sport events, as follows:

• 1 audio track of Cricket, 1.26 minutes long.
• 1 audio track of Formula 1, 1.15 minutes long.
• 1 audio track of Poker, 1.56 minutes long.
• 1 audio track of Basket, 3.12 minutes long.
• 1 audio track of Kendo, 1.32 minutes long.
• 1 audio track of Football, 1.23 minutes long.
• 4 audio tracks of Tennis, 3.14, 1.32, 0.59, and 1.28 minutes long respectively.

We refer to these tracks as heterogeneous data in the sense that they represent different sport categories, and as noisy data since different audio contents are present (e.g. typical sport sounds, applause, music, speech, ...). For each video, we have manually labeled the salient audio events every 0.2 seconds of the audio track, using binary markings (1 for salient, 0 otherwise), for a total of 1442 salient labelings and 3930 non-salient ones (5372 in total).
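As an illustration of this labeling scheme, the following sketch converts manually annotated salient intervals (the start/end times below are hypothetical, not taken from the dataset) into one binary marking per 0.2-second step of a track.

```python
import numpy as np

def intervals_to_labels(salient_intervals, duration, step=0.2):
    """Return one 0/1 label per `step` seconds; 1 = salient, 0 = non-salient."""
    times = np.arange(0.0, duration, step)
    labels = np.zeros(len(times), dtype=int)
    for start, end in salient_intervals:
        labels[(times >= start) & (times < end)] = 1
    return labels

# e.g. a 90-second tennis track with two annotated salient events
labels = intervals_to_labels([(12.4, 13.0), (47.8, 50.2)], duration=90.0)
```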
4. CLASSIFICATION

To solve the classification problem, we have considered three different kinds of classifiers: CART,16 SVM17 and kNN.18 In the CART methodology the classifiers are produced by recursively partitioning the feature space, each split being formed by conditions related to the feature values. In tree terminology, subsets are called nodes: the feature space is the root node, terminal subsets are terminal nodes, and so on. Once a tree has been built, a class is assigned to each of the terminal nodes, and when a new case is processed by the tree, its predicted class is the class associated with the terminal node into which the case finally moves on the basis of its feature values. The construction process is based on training sets of cases of known class. Tree classifiers provide fairly comprehensible predictors in situations where there are many variables which interact in complicated, nonlinear ways. Moreover, they imply no distributional assumptions for the features. A detailed description of the CART methodology can be found in Ref. 16. In our experiments we have used the MATLAB R2013a implementation. The SVM methodology comes from the application of statistical learning theory to separating hyperplanes for binary classification problems.17 The central idea of SVM is to adjust a discriminating function so that it makes optimal use of the separability information of boundary cases. Given a set of cases which belong to one of two classes, training a linear SVM consists in searching for the hyperplane that leaves the largest number of cases of the same class on the same side, while maximizing the distance of both classes from the hyperplane. When the training set is not linearly separable, the optimal separating hyperplane is found by solving a constrained quadratic optimization problem. In this work we have used one of the most common SVM solvers, LIBSVM.19 We have also considered a kNN classifier to compare the results with a simpler classifier. kNN is a non-parametric classifier where a test instance is classified by a majority vote of its neighbors, being assigned to the class most common amongst its K nearest neighbors as measured by a distance function. If K = 1, the instance is simply assigned to the class of its nearest neighbor.
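A minimal sketch of the three classifiers applied to the frame feature vectors is given below, using scikit-learn only for illustration (the paper used the MATLAB R2013a tree implementation and LIBSVM; the placeholder data, the depth limit that stands in for pruning, and the SVM parameters are assumptions, not values from the paper).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)                       # placeholder feature vectors and labels
X_train, y_train = rng.normal(size=(200, 7)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 7)), rng.integers(0, 2, 50)

classifiers = {
    "CART": DecisionTreeClassifier(max_depth=5),     # depth limit as a stand-in for pruning
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),  # RBF kernel as in Section 5
    "kNN": KNeighborsClassifier(n_neighbors=8),      # k = 8 as in Section 5
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```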
5. EXPERIMENTS

Each audio file has been segmented into overlapping frames. These frames are 0.2 seconds long, weighted by a Hamming window, with an overlap factor of 50%. On each frame, the features of Section 2 are computed. Since in the original labeled tracks the number of salient and non-salient frames is strongly unbalanced, in our classification task we have taken into account a subset of 1426 salient frames and 1862 non-salient ones. We have evaluated each classifier listed in Section 4 under three different experimental setups:

a. Heterogeneous and noisy data, using all the 10 tracks;
b. Heterogeneous and clean data, using a subset of 5 tracks selected by removing those containing mostly music background: 3 tracks of Tennis, 1 of Cricket and 1 of Poker;
c. Homogeneous and clean data, using 3 Tennis tracks.

Each experiment has been performed using two training strategies:

1. Cross-validation: we have performed n rounds (where n is the number of the considered tracks), partitioning the n tracks into n different couples of complementary subsets of n − 1 and 1 track respectively, to avoid data snooping. For each round, the training phase was performed on the subset with n − 1 tracks and the test on the remaining one. The classification performance is evaluated considering the validation results over the n rounds (a sketch of this procedure is given after Table 1).

2. Half of the n tracks are used as training data and the remaining half as test set.

In the case of the CART classifiers, since the classification trees in general grow too large and thus tend to over-fit the data, we have pruned them. The classification performances are reported in terms of confusion matrices. Therefore the total number of experiments performed for each classifier is 6, hereafter labeled as: a1 (i.e. 10 tracks, cross-validation), a2 (10 tracks, half of the tracks for training, half for test), b1, b2, and c1, c2. In Table 1 the results for the CART classifications are shown in terms of confusion matrices and accuracy.

Table 1. Confusion matrices for CART classification.
ten tracks (a1)
real \ predicted     0      1
0                 1236    626
1                  573    853
accuracy = 64%

five tracks (b1)
real \ predicted     0      1
0                  833    200
1                  316    505
accuracy = 73%

tennis tracks (c1)
real \ predicted     0      1
0                  306    121
1                   30    255
accuracy = 79%

ten tracks (a2)
real \ predicted     0      1
0                  661    298
1                  256    421
accuracy = 67%

five tracks (b2)
real \ predicted     0      1
0                  377    157
1                   88    301
accuracy = 74%

tennis tracks (c2)
real \ predicted     0      1
0                  212     39
1                    6     97
accuracy = 88%
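The leave-one-track-out cross-validation of training strategy 1 and the confusion-matrix evaluation can be sketched as follows (scikit-learn is shown only for illustration; `track_ids`, which assigns each frame to its source track, is assumed to be available from the dataset, and the tree depth limit is an illustrative stand-in for pruning).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

def leave_one_track_out(X, y, track_ids):
    """Train on n-1 tracks, test on the held-out track, pooling all predictions."""
    y_true, y_pred = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=track_ids):
        clf = DecisionTreeClassifier(max_depth=5).fit(X[train_idx], y[train_idx])
        y_true.extend(y[test_idx])
        y_pred.extend(clf.predict(X[test_idx]))
    cm = confusion_matrix(y_true, y_pred)     # rows: real class, columns: predicted class
    return cm, np.trace(cm) / cm.sum()        # confusion matrix and overall accuracy
```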
Comparing the three experimental setups (a, b, c), whatever training strategy is applied (1 or 2), the best classification performance is obtained for the homogeneous and clean tracks, as expected. With respect to the training strategies, approach 2 produces slightly better results than approach 1 on setups a and b. This is mainly
due to the heterogeneous nature of the data. In the case of setup c, where the data is homogeneous, the performance difference is more evident. To better understand the role of the audio features considered in our work, a selection of the CART trees is reported in Figures 1-3, corresponding to experiments a1, b1 and c1 respectively.

Figure 1. CART classifier obtained for experiment a1 (root split on SF; subsequent splits on FC, VL, FB and EB).
Figure 2. CART classifier obtained for experiment b1 (root split on SF; subsequent splits on FC, VL and FB).
Figure 3. CART classifier obtained for experiment c1 (root split on VL; subsequent splits on SF, FC, FB and VL).
Observe that the two trees of experiments a1 and b1 are very similar and the first feature used in the splitting
is the Spectral Flux (SF). On the other hand, the tree obtained considering only the audio tracks of tennis (c1) chooses the Volume (VL) as first splitting feature. When training an SVM, a very important step is the choice of the kernel function and the setting of the corresponding parameters. We have followed the proposal of Hsu et al.,19 and after scaling the data, the Radial Basis Function (RBF) kernel was chosen. The penalty term and the RBF parameter used in the training and testing phases are found using a coarse-to-fine greedy search algorithm within a cross-validation procedure (a sketch of this search is given after Table 2). In Table 2 the results for the SVM classifications are shown in terms of confusion matrices and accuracy.

Table 2. Confusion matrices for SVM classification.
ten tracks (a1)
real \ predicted     0      1
0                 1143    719
1                  581    845
accuracy = 61%

five tracks (b1)
real \ predicted     0      1
0                  807    226
1                  294    527
accuracy = 72%

tennis tracks (c1)
real \ predicted     0      1
0                  345     82
1                   63    222
accuracy = 81%

ten tracks (a2)
real \ predicted     0      1
0                  725    234
1                  296    381
accuracy = 68%

five tracks (b2)
real \ predicted     0      1
0                  361    173
1                   79    310
accuracy = 73%

tennis tracks (c2)
real \ predicted     0      1
0                  213     38
1                   34     69
accuracy = 81%
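The coarse-to-fine parameter search mentioned above can be sketched as follows (scikit-learn is shown for illustration instead of LIBSVM; the exponential grid is the one suggested by Hsu et al., while the cross-validation fold count and the refinement step are assumptions, not values from the paper).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Coarse grid search over the penalty C and RBF gamma after scaling the data."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    coarse_grid = {"svc__C": 2.0 ** np.arange(-5, 16, 2),
                   "svc__gamma": 2.0 ** np.arange(-15, 4, 2)}
    search = GridSearchCV(pipe, coarse_grid, cv=5).fit(X, y)
    # A finer grid centered on search.best_params_ would follow in the same way.
    return search.best_params_
```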
We have also evaluated the performance of a kNN classifier with k = 8. In Table 3 the results for the kNN classifications are shown in terms of confusion matrices and accuracy.

Table 3. Confusion matrices for kNN classification.
ten tracks (a1)
real \ predicted     0      1
0                 1236    626
1                  663    763
accuracy = 61%

five tracks (b1)
real \ predicted     0      1
0                  805    228
1                  360    461
accuracy = 69%

tennis tracks (c1)
real \ predicted     0      1
0                  365     62
1                   94    191
accuracy = 76%
ten tracks (a2)
real \ predicted     0      1
0                  748    211
1                  322    355
accuracy = 68%

five tracks (b2)
real \ predicted     0      1
0                  390    144
1                  119    270
accuracy = 72%

tennis tracks (c2)
real \ predicted     0      1
0                  198     53
1                   23     80
accuracy = 78%
6. CONCLUSIONS

In this work we have focused on classifying audio frames as salient or non-salient events in different sports contexts. In total, ten audio tracks have been analyzed. These tracks can be considered heterogeneous in the sense that they represent different sport categories, and noisy since different audio contents are present. We have considered three different experimental setups: heterogeneous and noisy data; heterogeneous and clean data; and homogeneous and clean data. Three classifiers (CART, SVM and kNN) have been used for our classification task, and features in both the time and frequency domains build the feature space. Similar classification results have been found for all three classifiers when heterogeneous and noisy data are considered; the accuracy found is around 67%. This accuracy improves when homogeneous and clean data are taken into account. The results of the research presented here could be integrated with spatio-temporal saliency maps to design a multimodal saliency strategy to detect video salient events, taking into account both visual and audio information.
REFERENCES

[1] Itti, L., Koch, C., and Niebur, E., "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Machine Intell. 20, 1254–1259 (1998).
[2] Corchs, S., Ciocca, G., and Schettini, R., "Video summarization using a neurodynamical model of visual attention," in [Multimedia Signal Processing, 2004 IEEE 6th Workshop on], 71–74, IEEE (2004).
[3] Boccignone, G., Marcelli, A., Napoletano, P., Di Fiore, G., Iacovoni, G., and Morsa, S., "Bayesian integration of face and low-level cues for foveated video coding," IEEE Transactions on Circuits and Systems for Video Technology 18(12), 1727–1740 (2008).
[4] Yubing, T., Cheikh, F. A., Guraya, F. F. E., Konik, H., and Trémeau, A., "A spatiotemporal saliency model for video surveillance," Cognitive Computation 3(1), 241–263 (2011).
[5] Kayser, C., Petkov, C. I., Lippert, M., and Logothetis, N. K., "Mechanisms for allocating auditory attention: an auditory saliency map," Current Biology 15(21), 1943–1947 (2005).
[6] Kalinli, O. and Narayanan, S. S., "A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech," in [INTERSPEECH], 1941–1944 (2007).
[7] Schauerte, B. and Stiefelhagen, R., ""Wow!" Bayesian surprise for salient acoustic event detection," in [Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)], (May 26-31, 2013).
[8] Tsuchida, T. and Cottrell, G. W., "Auditory saliency using natural statistics," in [CogSci 2012 Proceedings], 1048–1053 (2012).
[9] Zhang, L., Tong, M. H., Marks, T. K., Shan, H., and Cottrell, G. W., "SUN: A Bayesian framework for saliency using natural statistics," Journal of Vision 8(7) (2008).
[10] Giannoulis, D., Benetos, E., Stowell, D., Rossignol, M., Lagrange, M., and Plumbley, M., "Detection and classification of acoustic scenes and events," an IEEE AASP Challenge (2013).
[11] Stiefelhagen, R., Bernardin, K., Travisrose, R., Michel, M., and Garofolo, J., "The CLEAR 2007 evaluation," (2007).
[12] Hunter, G. J., Zienowicz, K., and Shihab, A. I., "The use of mel cepstral coefficients and Markov models for the automatic identification, classification and sequence modelling of salient sound events occurring during tennis matches," Journal of the Acoustical Society of America 123(5), 3431–3431 (2008).
[13] "http://thediemproject.wordpress.com/," (2013).
[14] Artese, M., Bianco, S., Gagliardi, I., and Gasparini, F., "Audio stream classification for multimedia database search," in [IS&T/SPIE Electronic Imaging], 86670G–86670G, International Society for Optics and Photonics (2013).
[15] Lu, L., Zhang, H.-J., and Jiang, H., "Content analysis for audio classification and segmentation," Speech and Audio Processing, IEEE Transactions on 10(7), 504–516 (2002).
[16] Breiman, L., Friedman, J., Olshen, R., and Stone, C., [Classification and Regression Trees], Wadsworth and Brooks/Cole (1984).
[17] Vapnik, V., [The Nature of Statistical Learning Theory], Springer Verlag (1995).
[18] Altman, N., "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician 46(3), 175–185 (1992).
[19] Hsu, C.-W., Chang, C.-C., and Lin, C.-J., "A practical guide to support vector classification," Bioinformatics 1, 1–16 (2010).