Video Scene Description: An Audio Based Approach

Hadi Harb, Liming Chen
Dept. Mathématiques - Informatique, Ecole Centrale de Lyon
36, avenue Guy de Collongue, B.P. 163, 69131 Ecully, France
{Hadi.Harb, Liming.Chen}@ec-lyon.fr
RÉSUMÉ. The growth of digital audio-visual data requires methods for describing this data, in order to facilitate search and navigation in documents. While most researchers concentrate on the visual stream of a video to obtain scene descriptions, in this paper we introduce a new approach to scene description in a video using the audio stream exclusively. We propose to give each scene a signature based on a classification of the sound into music/speech/silence/noise. For the classification, we propose a new technique based on four features extracted from the signal's spectrum and a neural network as a classifier. Experiments on 71 minutes of the movie "Gladiator" yield 90% description accuracy and 88% classification accuracy.
ABSTRACT. The increase in the amount of digital audio-visual data requires automatic methods for content-based description of this material, in order to facilitate the search for data and the navigation in multimedia files. While most researchers focus on the visual stream only for scene description, we propose using the audio stream only for automatic video scene description. In this paper we introduce a novel method for describing the content of a video scene based on a Speech/Music/Silence/Noise signature for each scene. We show how this signature can be helpful for search and navigation in a video file. For the classification of the audio signal into Speech/Music/Silence/Noise we propose a novel technique based on four special features extracted from the spectrum, and a neural network as a classifier. 90% description accuracy and 88% classification accuracy are achieved in experiments on 71 minutes of the movie "Gladiator".

MOTS-CLÉS: Neural Networks, video indexing, audio classification, scene description, video signature

KEYWORDS: Neural Networks, Video Indexing, Audio Classification, scene description, video analysis, fingerprinting
1. Introduction

The number of digital audio-visual documents increases every day as a result of the fusion of information technology with audio-visual technologies. Users can now access several hours of digital video locally or over networks such as LANs, WANs, and the Internet. This growth calls for tools that help users search for a document and navigate within an audio-visual file. A lot of work has been done on video analysis with the goal of video indexing, such as shot detection (M. Ardebilian, 2001) and scene segmentation (M. Walid et al., 2001), (John M. Gauch et al., 1999). Most researchers have concentrated on the visual stream (images) and have given little or no attention to the audio stream. Yet the audio stream of a video is very important for understanding its semantics. For example, when we only listen to a movie we can still follow its story without any visual information. Even when we do not understand the language spoken in the film, we can still recognize the type of each scene (action, romance, discussion...).

In this paper we address the problem of scene analysis and description in a video based on speech, music, silence and noise classification. We propose to give each video segment a signature, or fingerprint, based on the percentage of each audio class in the segment: %M (% Music), %S (% Speech), %N (% Noise). The %M%S%N signature is useful for searching and navigating in audio-visual documents (a small illustrative sketch follows this list):

1- Search: by excluding documents that do not have the required signature. For example, when searching for news or meeting videos we can exclude documents that are not rich in speech.

2- Navigation: when navigating within a file, we can directly access speech, music or noise segments without a linear search. For example, when looking for a combat scene in a movie, we can jump directly to the noisy scenes.

3- Description: using the S, M, N signature, information about the genre of a movie or scene can be obtained by comparing N and M with S. Clearly, a movie containing 50% of N is likely an action movie, while a movie containing 90% of S is likely a dialogue one.

Audio analysis has been addressed in different contexts: music-speech discrimination (Saunders J, 1996), (E. Scheirer et al., 1997), (H. Harb et al., 2001), as a pre-processing phase to exclude music before applying an Automatic Speech Recognizer (ASR); general audio data analysis (M. S. Spina, 2000), where the ASR is adapted to acoustic conditions based on the classification results; and audio segmentation (H. Harb et al., 2001b), (H. Sundaram et al., 2000), (Y. Wang et al., 2000), to be combined with visual-based scene segmentation.
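To illustrate how the %M%S%N signature can support search (point 1 above), the following minimal Python sketch filters a video collection by signature thresholds; the file names, signature values and thresholds are hypothetical examples, not data from the paper.

```python
# Hypothetical %M/%S/%N signatures (fractions of music, speech, noise)
# for a small video collection; the numbers are illustrative only.
collection = {
    "news_broadcast.mpg":    {"music": 0.05, "speech": 0.90, "noise": 0.05},
    "action_movie.mpg":      {"music": 0.20, "speech": 0.30, "noise": 0.50},
    "concert_recording.mpg": {"music": 0.95, "speech": 0.03, "noise": 0.02},
}

def search_by_signature(collection, min_speech=0.0, min_music=0.0, min_noise=0.0):
    """Keep only the documents whose signature satisfies the given minima."""
    return [name for name, sig in collection.items()
            if sig["speech"] >= min_speech
            and sig["music"] >= min_music
            and sig["noise"] >= min_noise]

# Example: searching for speech-rich documents (news, meetings).
print(search_by_signature(collection, min_speech=0.7))   # -> ['news_broadcast.mpg']
# Example: searching for noise-rich documents (likely action scenes).
print(search_by_signature(collection, min_noise=0.4))    # -> ['action_movie.mpg']
```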
In this paper the goal of the audio analysis is video scene description, supposing that scene boundaries are already given. The description is based on a music/speech/noise/silence classification. For the classification we propose a set of features, extracted from the spectrum, that aims to describe the dynamics of an audio signal; as a classifier we propose a Neural Network (a Multi Layer Perceptron with one hidden layer), and we include the context, the past and the future, in the classification procedure.

2. Audio classification

Audio classification consists of extracting a set of high-level information, or semantic features, from the low-level signal. It ranges from speech recognition (classifying speech signals as phonemes or words), to speaker recognition (classifying speech signals by their speakers), to general audio classification such as speech-music discrimination. Classifying an audio signal requires choosing a set of features extractable from the signal and a classifier for these features. Figure 1 and Figure 2 show two audio segments in the time domain, one for a speech signal and the other for a music signal; they clearly show the difficulty of classifying these signals into music or speech based on this representation alone. This problem can be addressed by extracting features that discriminate between the audio classes, as we discuss in section 2.1.
Figure 1. A speech signal in the amplitude vs. time representation.
Figure 2. A music signal in the amplitude vs. time representation.

The approaches used in (Saunders J, 1996), (E. Scheirer et al. 1997) are based on short-time features such as MFCCs and short-time energy, some long-time features such as the 4 Hz power modulation, and a Gaussian Mixture Model for classification. These methods were evaluated on radio programs such as meetings or news, and they need large training databases. In general, authors did not test on very large databases and used the same type of data for training and testing; for example, in (G. Williams et al. 1999) the authors used ¾ of the data for training and ¼ for testing. But for a real-world video indexing application, video data is very diverse, ranging from movies to commercials to news. This variety of video types implies great variability within each audio class, and as a consequence the amount of training data needed by these techniques grows accordingly. To tackle the diversity and variability of videos, classification methods must generalize from a small amount of learning data. In this paper we propose the use of an MLP (Multi Layer Perceptron), with a set of features extracted on a relatively long-term basis. The features we use are based on the spectrum, motivated by human perception (B. Moore, 1995), and are extracted every 2 s.

2.1 Features

We use 4 features extracted from the spectrum of the audio signal: Silence Crossing Ratio (SCR), Frequency Tracking (FT), Distance Between Audio Images (DBAI), and Frequency Centroid Variation (FCV). SCR and FT are described in detail in (H. Harb et al. 2001). The idea behind the SCR is to measure the number of times silence occurs in a window of 2 seconds using the energy envelope (Figure 4). The aim of FT is to measure the harmonicity of the signal, also in a window of 2 s (Figure 3 and Figure 5). Thus the SCR and FT are extracted every 2 s; a sketch of an SCR-like computation is given below.
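Since SCR and FT are defined in detail in (H. Harb et al. 2001), the sketch below only illustrates the general idea of an SCR-like measure, counting how often the energy envelope of a 2 s window drops into a silence region. The frame length, hop size, sampling rate and silence threshold are our assumptions, not the paper's values.

```python
import numpy as np

def energy_envelope(signal, frame_len=160, hop=160):
    """Short-time energy of the signal, one value per frame.
    Frame/hop sizes (10 ms at 16 kHz) are assumptions, not taken from the paper."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f.astype(float) ** 2) for f in frames])

def silence_crossing_ratio(signal, silence_threshold=0.05):
    """Count how many times the energy envelope enters a 'silence' region
    (drops below a fraction of its maximum) within the analysis window."""
    env = energy_envelope(signal)
    is_silent = env < silence_threshold * env.max()
    # A 'crossing' is a transition from non-silence to silence.
    crossings = np.sum((~is_silent[:-1]) & (is_silent[1:]))
    return crossings / len(env)

# Usage on 2 s of audio sampled at 16 kHz (random noise as a stand-in signal).
rng = np.random.default_rng(0)
window_2s = rng.normal(size=2 * 16000)
print(silence_crossing_ratio(window_2s))
```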
Figure 3. Spectrogram (bottom) and frequency tracking (top) for 40 s of music.
Figure 5. Spectrogram (bottom) and frequency tracking (top) for 40 s of speech.

Distance Between Audio Images (DBAI)
Another feature we propose to capture changes in the acoustic signal is based on distances between images extracted from the spectrum. We decompose the spectrum into consecutive images: the coordinates of a pixel are its frequency index in the FFT vector and the time at which the FFT vector is extracted, and the intensity of the pixel is the value of the frequency spectrum at that point. Every FFT vector thus constitutes a column of pixels. We group every 40¹ consecutive columns into one image, producing images of 40x40 pixels. We then apply a distance to two consecutive images to capture changes in the spectrum. The distance we use is a special-purpose one: we first decompose each image into 8x8 blocks, then compute a distance between corresponding blocks (it can be interpreted as a Hausdorff-like distance). The distance between two corresponding blocks is the sum, over the pixels of the first block, of the minimum distance between that pixel and all pixels of the corresponding block. Finally, the distance between two images is the sum of the block distances. We extract this feature every 2 s.
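The following sketch illustrates our reading of DBAI under stated assumptions: spectrogram columns are grouped into 40x40 images, each image is split into non-overlapping blocks of 8x8 pixels, and the per-pixel distance inside the Hausdorff-like block distance is taken to be the absolute intensity difference; the authors' exact formulation may differ.

```python
import numpy as np

def spectrogram_images(spec, size=40):
    """Group every `size` spectrogram columns (FFT vectors) into a size x size image.
    `spec` has shape (n_freq_bins, n_frames); the first `size` bins are kept."""
    n_images = spec.shape[1] // size
    return [spec[:size, i * size:(i + 1) * size] for i in range(n_images)]

def block_distance(block_a, block_b):
    """Hausdorff-like distance between two corresponding blocks: for every pixel
    of the first block, take the minimum distance to any pixel of the second block
    (here the absolute intensity difference, an assumption), and sum these minima."""
    a = block_a.ravel()[:, None]          # (64, 1)
    b = block_b.ravel()[None, :]          # (1, 64)
    return np.sum(np.min(np.abs(a - b), axis=1))

def dbai(image_a, image_b, block=8):
    """Distance Between Audio Images: sum of distances over corresponding 8x8 blocks."""
    total = 0.0
    for i in range(0, image_a.shape[0], block):
        for j in range(0, image_a.shape[1], block):
            total += block_distance(image_a[i:i + block, j:j + block],
                                    image_b[i:i + block, j:j + block])
    return total

# Usage: DBAI between two consecutive images of a random spectrogram.
rng = np.random.default_rng(0)
spec = np.abs(rng.normal(size=(40, 80)))   # 40 frequency bins, 80 frames
img1, img2 = spectrogram_images(spec)
print(dbai(img1, img2))
```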
Figure 4. The power envelope of 40 s of a music signal (top) and 40 s of a speech signal (bottom).
Frequency Centroid Variation (FCV)
The frequency centroid is the center of gravity of each spectral vector. It can be calculated by the following formula:

G = \frac{\sum_{f} X_f \cdot f}{\sum_{f} X_f}

where X_f is the coefficient corresponding to frequency f in the vector X, and f is the frequency index.
¹ 40 columns correspond to 400 ms; each FFT vector is extracted every 10 ms.
The FCV is used to capture the dynamics of the spectrum of the audio signal: it is the mean of the derivative of the frequency centroid with respect to time over a 2 s window.
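A minimal sketch of the frequency centroid G and of an FCV-like measure follows. The 10 ms frame period comes from footnote 1; treating the bin index as the frequency and averaging the absolute derivative are our assumptions.

```python
import numpy as np

def frequency_centroid(X):
    """Center of gravity G of one spectral vector X (one magnitude per frequency bin)."""
    f = np.arange(len(X))                      # frequency indices
    return np.sum(X * f) / np.sum(X)

def fcv(spectral_vectors, frame_period=0.01):
    """Frequency Centroid Variation over a window: mean derivative of the centroid
    with respect to time. One FFT vector every 10 ms (footnote 1); using the
    absolute value of the derivative is an assumption."""
    centroids = np.array([frequency_centroid(X) for X in spectral_vectors])
    derivative = np.diff(centroids) / frame_period
    return np.mean(np.abs(derivative))

# Usage on a 2 s window: 200 spectral vectors of 40 bins each (random stand-ins).
rng = np.random.default_rng(0)
window = np.abs(rng.normal(size=(200, 40)))
print(fcv(window))
```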
Figure 5. FCV for music. Figure 6. FCV for noise. Figure 7. FCV for speech.
Figure 8. FCV histogram for 400 s of noise.
Figure 9. SCR histogram for 400 s of speech.
Figure 10. DBAI histogram for 400 s of music.
3. The Classifier

As a classifier we use a neural network, a Multi Layer Perceptron with one hidden layer. First the video stream is passed to a Scene Segmenter (SS) (M. Walid et al., 2001); the SS gives the start and end time of each "scene". For each 2 s block of the scene the features SCR, FT, DBAI, and FCV are extracted, and the block is classified as Music, Speech, Noise or Silence. For each scene the number of Music, Silence, Speech and Noise blocks is counted and the percentage of each class is obtained. Describing an audio scene therefore amounts to giving, for each scene, %Sp, %Si, %M, and %N (% speech, % silence, % music, and % noise); the class of an audio scene is the dominant class in the scene. A minimal sketch of this description step is given below.
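The description step can be summarized by the following minimal sketch, which assumes the per-block class labels have already been produced by the classifier; the label names are ours.

```python
from collections import Counter

CLASSES = ("speech", "silence", "music", "noise")

def scene_signature(block_labels):
    """Given the class label of every 2 s block of a scene, return the percentage
    of each class (%Sp, %Si, %M, %N) and the dominant class of the scene."""
    counts = Counter(block_labels)
    total = len(block_labels)
    percentages = {c: 100.0 * counts.get(c, 0) / total for c in CLASSES}
    dominant = max(percentages, key=percentages.get)
    return percentages, dominant

# Usage: a hypothetical scene made of 10 classified 2 s blocks.
labels = ["speech"] * 6 + ["music"] * 2 + ["noise"] * 1 + ["silence"] * 1
print(scene_signature(labels))
# ({'speech': 60.0, 'silence': 10.0, 'music': 20.0, 'noise': 10.0}, 'speech')
```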
Figure 11. The audio classification procedure.
3.1. The Feature Vector

The classification is based on 2-second blocks (or frames). For every 2 s block the features SCR, FT, DBAI, and FCV are extracted. The feature vector of each block is obtained by concatenating the features of the 2 blocks from the past, the features of the current block in the middle, and the features of the 2 blocks from the future. This means that the feature vector covers a window containing 4 s from the past, 2 s from the present and 4 s from the future, used to classify the present 2 s block. The output layer of the Neural Network has 4 neurons, one neuron per class, and the number of neurons in the input layer is 4*(Past_Blocks + 1 + Future_Blocks). For the hidden layer we use 80 neurons; we did not observe any improvement when increasing the number of hidden neurons.
Figure 12. The feature vector, based on past, present and future features.
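The sketch below shows one way to assemble the context feature vector of Figure 12 and a matching network configuration with 4*(Past_Blocks+1+Future_Blocks) inputs, 80 hidden neurons and 4 outputs; scikit-learn's MLPClassifier is used here as a stand-in for the authors' MLP, and the random features and labels are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FEATURES = 4        # SCR, FT, DBAI, FCV per 2 s block
PAST = FUTURE = 2     # context blocks on each side

def context_vectors(block_features):
    """Stack the features of 2 past blocks, the current block and 2 future blocks
    into one vector per block (edge blocks are skipped for simplicity, an assumption)."""
    vectors = []
    for i in range(PAST, len(block_features) - FUTURE):
        window = block_features[i - PAST:i + FUTURE + 1]    # 5 blocks of 4 features
        vectors.append(np.asarray(window).ravel())          # 5 * 4 = 20 values
    return np.array(vectors)

# Hypothetical per-block features and labels for training (random stand-ins).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, N_FEATURES))
labels = rng.integers(0, 4, size=300)        # 0..3 = speech/silence/music/noise

X = context_vectors(features)
y = labels[PAST:len(labels) - FUTURE]

# One hidden layer of 80 neurons and 4 output classes, as in the paper's setup.
clf = MLPClassifier(hidden_layer_sizes=(80,), max_iter=500)
clf.fit(X, y)
print(clf.predict(X[:5]))
```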
An important observation from our experiments is the relation between the number of context blocks (past and future blocks) and the convergence time² of the Neural Network. A network that hardly converges with no context blocks converges dramatically faster as the number of context blocks increases, and the convergence time increases again when the number of context blocks exceeds 2 (Figure 13).

Figure 13. MLP's convergence time (in s) vs. number of context blocks.

This relation between the number of context blocks and the convergence time reflects the correlation between past, present and future moments in real-world audio signals.
² Convergence time is the time needed for the network's general error to fall below a given threshold.