ROBUST SPEECH MUSIC DISCRIMINATION USING SPECTRUM’S FIRST ORDER STATISTICS AND NEURAL NETWORKS

Hadi Harb, Liming Chen
Ecole Centrale de Lyon, Dépt. Mathématiques Informatiques, 36 avenue Guy de Collongue, 69134 Ecully, France
{Hadi.Harb, Liming.Chen}

ABSTRACT

Most speech/music discrimination techniques proposed in the literature need a large amount of training data in order to provide acceptable results. Besides, they are usually context-dependent. In this paper, we propose a novel technique for speech/music discrimination which relies on first order statistics of the sound spectrum as the feature vector and on a neural network for classification. Experiments conducted on 20000 seconds of various audio data show that the proposed technique generalizes well, since a classification accuracy of 96% was achieved after a training phase on only 80 seconds of audio data. Furthermore, the proposed technique is context-independent, as it can be applied to various audio sources.

1. INTRODUCTION

Speech/music discrimination is an important task in multimedia indexing. It is usually the basic step for further processing of audio data. For instance, when a sound stream comes from unstructured audiovisual sources, speech detection is needed before the application of an Automatic Speech Recognizer (ASR). In order to cope with the huge amount of online and/or offline audio streams, semantic audio classifiers for multimedia indexing purposes, in particular speech/music ones, need to be fast, reliable, and adaptable. Due to the variability of speech and music signals, a speech/music classifier must be able to generalize from a small amount of learning data. Furthermore, the definition of music and speech can differ from one application to another; for instance, speech with background music can be considered as speech in one application and as music in another.
Therefore, an audio classifier must be able to adapt to new conditions, which implies that the training process must be fast and simple.

The major drawback of existing speech/music classification techniques is the need for a large amount of training data in order to achieve reasonable performance [3][4][5][6][7][8]. There also exist other systems dedicated to broadcast news sound segmentation [1][2]. However, they cannot be applied effectively to applications with different audio conditions where speech/music classification is needed. In this paper, we introduce a new technique for speech/music classification that achieves good performance for several applications in different audio conditions and generalizes well from a small amount of training data, and is thus easily adaptable to new conditions.

2. SPECTRUM’S STATISTICS

2.1. Speech Music Perception

For audio classification problems, one must choose the duration of the frame, or time window, used for feature extraction. A duration of 10 ms is generally used by researchers to extract spectral-like features. Hence 10 ms is the standard duration used for the decision of the classifier. However, since humans are, so far, the best audio classifiers, one can rely on the modest knowledge available about human perception when designing a speech/music classifier. Two conclusions can be drawn from what is known about human speech/music classification.
1- Humans need durations larger than 10 ms (about 200 ms) to easily achieve the classification.
2- If several 10 ms speech segments are concatenated, the perceived class is not always speech. Several 10 ms segments of speech ordered in a special way in time can give the impression of a non-speech sound.
Therefore, the relation between neighbouring short-term spectral features is critical for human speech/music discrimination. One can argue that using relatively large windows (200 ms) for speech/music classification can be advantageous.

2.2. Audio Signal Modeling

Gaussian Mixture Models (GMM) have been used for the classification of speech and music. A GMM tries to model the distribution of a set of features, for instance spectral/cepstral vectors. The relation between neighbouring feature vectors is not taken into account when modelling with a GMM. The results reported in [10] and [7] show that cepstral features are important for the classification but are not sufficient. One solution to this drawback of GMM is the use of Hidden Markov Models (HMM).
HMM can model the relation in time between spectral/cepstral vectors in addition to the classical GMM capability. However, the performance of an HMM depends on the size of the training data: because of the great variability of short-term spectral features, a good estimation of the transition probabilities requires more than several hours of audio for each class.

To include the time information when modelling the sound spectrum, and to minimize the variability of features within each class, one can choose to model every set of neighbouring spectral vectors in a long-term time window (T) using one model. We propose the use of the first order statistics of spectral vectors in relatively large time windows (T > 250 ms). The statistics are the mean and the variance of each frequency bin. Hence, each time window (T) is modelled by a mean vector and a variance vector. This modelling scheme can be seen as modelling the spectral vectors in each T window by a single Gaussian model. Therefore, a speech or a music segment is modelled by a mixture of N Gaussians. However, we do not use the classic expectation maximisation algorithm or the Gaussian probability density function to estimate mixture parameters and to calculate likelihoods. Instead, we use a Neural Network to estimate the probability that each Mean/Variance model belongs to one class or the other (speech or music). The topology of the Neural Network trained on a set of Mean/Variance models can be seen as analogous to the parameters of a GMM. The Neural Network is trained to classify the T windows using their mean/variance models. In contrast to conventional GMM methods, where short-term features (spectral/cepstral vectors) are the basic frames for training and testing, in the proposed approach each “T” window is the basic frame in the training and recognition processes.

Figure 1 illustrates the behaviour of speech and music samples in the proposed feature space, based on the modelling scheme presented above.
Each point in the plot (+ speech, x music) corresponds to 1 s of audio, for which one mean vector and one variance vector of the FFT spectrum are calculated. The abscissa of each point is the magnitude of its variance vector, and the ordinate is the magnitude of its mean vector. One can notice that the decision boundary between the two classes is quite simple in this simplified feature space, which suggests that the proposed modelling scheme can be effective.
Figure 1. A plot of 1000 s of speech (+) and 1000 s of music (x)
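The mean/variance modelling and the two coordinates plotted in Figure 1 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the variance is taken as the population variance, and the "magnitude" of a vector is assumed to be its Euclidean norm. Plain Python is used to keep the example self-contained.

```python
import math
from statistics import fmean, pvariance


def window_statistics(frames, frames_per_window):
    """Split a list of spectral vectors into non-overlapping windows and
    return one (mean_vector, variance_vector) pair per complete window."""
    models = []
    last_start = len(frames) - frames_per_window
    for start in range(0, last_start + 1, frames_per_window):
        window = frames[start:start + frames_per_window]
        bins = list(zip(*window))  # one tuple of values per frequency bin
        mean_vec = [fmean(b) for b in bins]
        var_vec = [pvariance(b) for b in bins]  # assumption: population variance
        models.append((mean_vec, var_vec))
    return models


def magnitude(vec):
    """Euclidean norm (assumed meaning of 'magnitude' in Figure 1)."""
    return math.sqrt(sum(v * v for v in vec))


def figure1_point(model):
    """Return (abscissa, ordinate) = (|variance vector|, |mean vector|)."""
    mean_vec, var_vec = model
    return magnitude(var_vec), magnitude(mean_vec)
```

With one spectral vector every 10 ms, a 1 s window as used for Figure 1 would correspond to `frames_per_window=100`.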
3. ARCHITECTURE OVERVIEW OF THE SPEECH/MUSIC CLASSIFIER

Based on the modeling scheme presented in the previous section, we propose a speech/music classifier containing three main steps: Spectral Feature extraction, Normalization/Statistics computing, and Neural Network based classification.

4. SPECTRAL FEATURE EXTRACTION

During this step, spectral components of the audio signal are extracted using the Fast Fourier Transform (FFT) with a Hamming window of 30 ms width and a 20 ms overlap. The spectrum is further filtered according to the Mel Scale to obtain a vector of 20 spectral coefficients every 10 ms: the Mel Frequency Spectral Coefficients (MFSC). MFSC are the basic features used in our system. However, as mentioned in section 2, using these features directly is not sufficiently effective. Therefore, a further step performs normalization and computes statistics on these basic spectral features, providing the feature vectors for classification.

5. NORMALIZATION/STATISTICS

Using a Neural Network as a classifier, with the sigmoid function as activation function, necessitates some kind of normalization of the feature vector. Generally, optimal values in the feature vectors are in the [0, 1] range. The Neural Network risks saturation if feature vectors contain values higher than 1. Saturation means that synaptic weights change very slowly when training the neural network, implying a very long training time [9]. Normalization is also needed for the classification system to be robust to loudness and channel changes, for instance from CD quality to telephone channel quality. Two normalization schemes were investigated: (1) a channel-based (ch) normalization, and (2) a whole-spectrum (wh) normalization. In (1), each FFT coefficient is normalized by the local maximum within the same frequency channel in a time interval of 4 seconds, while in (2) each FFT coefficient is normalized by the local maximum over all frequency channels.
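The framing and the two normalization schemes can be sketched as below. Several points are assumptions rather than details given in the text: the 4 s interval is implemented as non-overlapping blocks of frames (400 frames at a 10 ms hop), scheme (2) is applied over the same blocks, all function names are hypothetical, and the Mel filtering step is omitted. A naive DFT stands in for the FFT so that the example depends only on the standard library.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, sample_rate, width_s=0.030, hop_s=0.010):
    """30 ms Hamming frames with a 20 ms overlap, i.e. one spectrum every 10 ms."""
    width = round(width_s * sample_rate)
    hop = round(hop_s * sample_rate)
    window = hamming(width)
    return [
        magnitude_spectrum([s * w for s, w in zip(signal[i:i + width], window)])
        for i in range(0, len(signal) - width + 1, hop)
    ]


def normalize_channel(spec, span_frames=400):
    """(ch) Divide each coefficient by the maximum of its own frequency
    channel inside each block of span_frames frames."""
    out = []
    for start in range(0, len(spec), span_frames):
        block = spec[start:start + span_frames]
        peaks = [max(channel) or 1.0 for channel in zip(*block)]  # avoid /0
        out.extend([[v / p for v, p in zip(frame, peaks)] for frame in block])
    return out


def normalize_whole(spec, span_frames=400):
    """(wh) Divide each coefficient by the maximum over all frequency
    channels inside each block of span_frames frames."""
    out = []
    for start in range(0, len(spec), span_frames):
        block = spec[start:start + span_frames]
        peak = max(max(frame) for frame in block) or 1.0  # avoid /0
        out.extend([[v / peak for v in frame] for frame in block])
    return out
```

Both schemes map non-negative magnitudes into the [0, 1] range; the channel-based variant additionally equalizes the level of each frequency channel, which is one plausible reason for its robustness to channel changes.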
The statistics part of the Normalization/Statistics module partitions the audio signal into non-overlapping windows with a duration of “T” seconds. Experimental results show that the accuracy is not significantly affected by the choice of “T” if it is between 0.2 and 4 s. In each “T” window, the mean and the variance of the MFSC vectors across time are calculated. The concatenation of a subset of these statistics constitutes the feature vector of the “T” window. Accordingly, the classification is based on frames of duration “T”.

6. NEURAL NETWORK BASED CLASSIFICATION

As mentioned above, from each time window “T” one model expressed by one mean vector and one variance vector is obtained. The combination of the mean and the variance values constitutes the input of a neural network which is used as a classifier (Figure 2). A Neural Network is a suitable classifier for our problem, though we have also implemented a k-NN classifier for comparison purposes. Once trained, a Neural Network is very fast for classification, which satisfies the real-time constraint of our audio stream indexing. Also, the compact representation of Neural Networks facilitates a potential hardware implementation of the classifier. The Neural Network we have used is a Multi Layer Perceptron (MLP) with the error back-propagation training algorithm and the sigmoid function as activation function.

[Figure 2 diagram blocks: Spectral features; Statistics/normalization; the mean/variance model (µ1 ν1, µ2 ν2, µ3 ν3, ..., µk νk)]
Figure 2. The architecture of the classifier

7. EXPERIMENTS

When a robust and general speech/music classifier is needed, the choice of the testing database is of great importance. To be general, the testing database must contain audio signals from a great variety of sources such as TV programs, radio programs, movies, songs, live recordings, and so on. In addition, the choice of the training data with regard to the test data is very important. Namely, the ratio of training data to test data is an indicator of the generalization capability of a classifier.

The database we have collected for the evaluation of our technique comes from three main datasets. Set 1 contains 1000 seconds of speech and 1000 seconds of music extracted from a French movie. This set was used as a test bed for different normalization strategies, and for the analysis of the effect of the amount of training data on the classification accuracy. Set 2 contains 7176 seconds of speech and 7212 seconds of music collected from TV programs, radio programs, movies, songs, and telephone conversations. This set was used to analyze the effectiveness of the proposed system as a general context-independent speech/music classifier. Set 3 is a collection of recordings from LCI, a French TV news channel, from France Info, a French news radio channel, and from several online music channels. This set was used to analyze the classification accuracy when the context is known. Namely, speech from the two defined channels constitutes the speech class, and songs from the defined music channels constitute the music class. This set contains 10000 seconds of speech and 10000 seconds of music.

Table 1 shows the composition of the evaluative database.

Table 1. The composition of the evaluative database.
        Source                  Time (s)
Set1    movie                   2000
Set2    TV, radio, telephone    14388
Set3    TV, radio               20000
Total                           36388

7.1. Normalization

One experiment was carried out on Set 1 to analyze the effectiveness of the normalization schemes. The system was trained on 40 seconds of speech extracted from a news program, and 40 seconds of music extracted from two songs. The two normalization schemes presented in section 5 were investigated. No accuracy results are reported for unnormalized FFT coefficients, since some normalization scheme is essential in our system. The results shown in Table 2 demonstrate a slight advantage for the channel-based normalization.

Table 2. Classification accuracy for the two normalization techniques.
Normalization   Training data (s)   Test data (s)   Classification accuracy %
Wh              80                  2000            92.12
Ch              80                  2000            93.17

7.2. Training data effect

In this experiment the effect of the amount of training data was studied. Set 1 was used both for extracting training data and for testing, and the system with channel-based normalization and MFSC features was evaluated. As one can expect, the classification error rate on the test data (training data is excluded from the evaluation) can be decreased by increasing the amount of training data. The plot of the error rate as a function of the amount of training data is shown in Figure 3.

[Figure 3 plot "classification error vs training data": x-axis, training data in seconds; y-axis, error rate % (0 to 16)]
Figure 3. A plot of the evolution of the error rate when increasing the amount of training data

7.3. Context-independent classification

The system with channel-based normalization and MFSC features was evaluated as a general context-independent speech/music classifier. That is, the system was trained on the data used in the previous experiments and tested on Set 2. The results shown in Table 3 demonstrate that the proposed approach is effective as a general speech/music classifier. Moreover, a k-NN classifier achieves a classification accuracy of 85.30% on the same dataset, demonstrating the effectiveness of the Neural Network in this classification problem.

Table 3. Context-independent classification accuracy.
               Speech   Music   Total
Training (s)   40       40      80
Test (s)       7176     7212    14388
Accuracy %     96.30    89.00   92.65
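Section 6 describes the classifier as a Multi Layer Perceptron with sigmoid activations trained by error back-propagation. A minimal sketch of such a network is given below. The single hidden layer of three units, the squared-error loss, the learning rate of 0.5, the number of epochs, and the synthetic two-dimensional inputs standing in for mean/variance feature vectors are all assumptions made for illustration, and `TinyMLP` is a hypothetical name; none of these settings come from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer of sigmoid units and one sigmoid output unit."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the bias is the last entry.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # zip stops at the shorter sequence, so the bias entry is added separately.
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h)) + self.output[-1])
        return h, y

    def train_step(self, x, target, lr):
        """One back-propagation update for a single (input, target) pair."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed with the output weights before their update.
        delta_hid = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= lr * delta_out * hj
        self.output[-1] -= lr * delta_out
        for j, row in enumerate(self.hidden):
            for i, xi in enumerate(x):
                row[i] -= lr * delta_hid[j] * xi
            row[-1] -= lr * delta_hid[j]


def mean_squared_error(net, samples):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Synthetic, well-separated stand-ins for the two classes (0 = music, 1 = speech).
rng = random.Random(1)
samples = [([rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)], 0.0) for _ in range(10)]
samples += [([rng.uniform(0.6, 1.0), rng.uniform(0.6, 1.0)], 1.0) for _ in range(10)]

net = TinyMLP(n_in=2, n_hidden=3)
error_before = mean_squared_error(net, samples)
for _ in range(500):
    for x, t in samples:
        net.train_step(x, t, lr=0.5)
error_after = mean_squared_error(net, samples)
```

The output of `forward` lies in (0, 1) and can be read as the estimated probability of the speech class; inputs are kept in the [0, 1] range, in line with the normalization requirement stated in section 5.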
7.4. Context-dependent classification

One can expect an improvement of the classification accuracy if the audio sources are limited and known. In the majority of content-based multimedia indexing applications the problem is context-dependent. Also, a context-dependent experiment is needed to make a fair comparison with existing speech/music classification systems. In this experiment we trained the system on 40 seconds of speech and 40 seconds of music extracted from the known TV/radio channels to be analyzed. Thus, the system is faced with test data from the same sources as the training data but from different time intervals (the recordings were made over a 3-week interval). As expected, the classification accuracy increased considerably, from 93% to 96% (Table 4). These results are comparable to the results reported in the literature, even though the training data amounts to only 80 s while the test data amounts to 20000 s, demonstrating that the proposed feature space based on the spectrum’s statistics is suitable for this classification problem.

Table 4. Context-dependent classification accuracy.
                            Speech   Music   Total
Training data (s)           40       40      80
Test data (s)               10000    10000   20000
Classification accuracy %   96.06    95.75   95.90
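The arithmetic behind the Total columns and the train/test ratios can be checked with a few lines. The helper names are hypothetical, and the reading that the totals are plain averages of the two class accuracies is an inference from the reported numbers, not something the paper states.

```python
def unweighted_total(speech_acc, music_acc):
    """Plain average of the two per-class accuracies (in %)."""
    return (speech_acc + music_acc) / 2


def weighted_total(speech_acc, speech_s, music_acc, music_s):
    """Average of the per-class accuracies weighted by test duration."""
    return (speech_acc * speech_s + music_acc * music_s) / (speech_s + music_s)


def train_test_ratio(train_s, test_s):
    """Fraction of training data relative to test data."""
    return train_s / test_s


table3_plain = unweighted_total(96.30, 89.00)                 # about 92.65
table3_weighted = weighted_total(96.30, 7176, 89.00, 7212)    # about 92.64
table4_plain = unweighted_total(96.06, 95.75)                 # about 95.905
ratio_set2 = train_test_ratio(80, 14388)                      # about 0.0056
ratio_set3 = train_test_ratio(80, 20000)                      # about 0.004
```

The reported totals (92.65 and 95.90) agree with the plain averages to within rounding, and in both experiments the 80 s of training data amount to well under 1% of the test data.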
8. CONCLUSION

Many techniques have been proposed in the literature for speech/music classification. In order to achieve acceptable performance, most of them require a large amount of training data, making them very difficult to retrain and adapt to new conditions. Other techniques are rather context-oriented, since they were tested only on specific application conditions, such as speech/music classification in radio programs or in the context of broadcast news transcription. In this paper, we introduced a novel modeling scheme for the audio signal based on first order statistics of the spectrum and on neural networks. Based on this modeling scheme, a new technique for speech/music classification was presented. Experimental results on a test database containing speech and music from a wide variety of sources show the effectiveness of the presented technique for both context-dependent and context-independent speech/music classification problems. Moreover, the proposed technique requires a very small amount of training data; for the experiments we used only 80 seconds of training data. This advantage is extremely important in content-based multimedia indexing, since retraining our system on new material is very simple and can occur frequently. Experiments show that a classification accuracy of 96% was achieved for context-dependent problems, as compared to 93% for context-independent ones.

9. REFERENCES

[1] J.L. Gauvain, L. Lamel, G. Adda, “Partitioning and Transcription of Broadcast News Data”, Proc. ICSLP'98, 5, pp. 1335-1338, Dec. 1998.
[2] T. Hain, S.E. Johnson, A. Tuerk, P.C. Woodland, S.J. Young, “Segment Generation and Clustering in the HTK Broadcast News Transcription System”, Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, pp. 133-137, 1998.
[3] E. Scheirer, M. Slaney, “Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator”, Proc. of ICASSP97, Munich, Germany, April 1997.
[4] Gethin Williams, Daniel Ellis, “Speech/music discrimination based on posterior probability features”, Proc. Eurospeech99, 1999.
[5] Lie Lu, Hao Jiang, Hong-Jiang Zhang, “A Robust Audio Classification and Segmentation method”, Proc. of ACM Multimedia Conference, 2001.
[6] K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, McGill, “Speech/music discrimination for multimedia applications”, Proc. ICASSP00, pp. 2445-9, 2000.
[7] Seck M., Magrin-Chagnolleau I., Bimbot F., “Experiments on speech tracking in audio documents using Gaussian mixture modeling”, Proc. ICASSP01, vol. 1, pp. 601-604, 2001.
[8] Michael J. Carey, Eluned S. Parris, Harvey Lloyd-Thomas, “A comparison of features for speech, music discrimination”, Proc. of ICASSP99, pp. 149-152, 1999.
[9] Simon Haykin, “Neural Networks: A Comprehensive Foundation”, Macmillan College Publishing Company, 1994.
[10] Tzanetakis G., Cook P., “Musical genre classification of audio signals”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002.