Automatic Classification of Speech and Music Using Neural Networks

M. Kashif Saeed Khan
Dept. of Information and Computer Science, King Fahd Univ. of Petroleum and Minerals, Dhahran, Saudi Arabia 31261
[email protected]

Wasfi G. Al-Khatib
Dept. of Information and Computer Science, King Fahd Univ. of Petroleum and Minerals, Dhahran, Saudi Arabia 31261
[email protected]

Muhammad Moinuddin
Dept. of Electrical Engineering, King Fahd Univ. of Petroleum and Minerals, Dhahran, Saudi Arabia 31261
[email protected]
ABSTRACT
The importance of automatic discrimination between speech signals and music signals has grown as a research topic over recent years. The ability to classify audio into categories such as speech or music is an important component of many multimedia document retrieval systems. Several approaches have previously been used to discriminate between speech and music data. In this paper, we propose using the mean and variance of the discrete wavelet transform in addition to other features that have previously been used for audio classification. We use a Multi-Layer Perceptron (MLP) neural network as the classifier. Our initial tests show encouraging results that indicate the viability of our approach.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – indexing methods. H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing – signal analysis, synthesis and processing.

General Terms
Management, Design.

Keywords
Speech/music classification, content-based indexing, audio signal processing, audio features, neural networks.

1. INTRODUCTION
With the fast growth of multimedia repositories in general, and audio data in particular, the development of technologies for spoken document indexing and retrieval is in full expansion. Audio data sources range from broadcast radio and television to the vast volumes of recorded material in other forms, such as tapes and digital audio stored on the Web. Speech/music classification is an important task in multimedia indexing. After classification, one can attach meaningful descriptions such as speech, music, or silence to different segments of the audio data. Such indexing supports querying audio data based on segment type, which is one form of content-based retrieval in multimedia databases. In addition, speech/music classification constitutes a major first step in converting audio data to other forms that are suitable for more conventional content-based retrieval systems, such as textual databases.

In this paper, we present a neural network-based framework for automatic classification of audio data. The paper is organized as follows. Section 2 gives a brief literature review. Section 3 describes the features examined in our system. Section 4 discusses the proposed classification framework. Section 5 reports our experiments and results. Finally, we present our conclusions and future work.

2. BACKGROUND
Many researchers have addressed problems related to speech/music classification, speech/music features, feature extraction, and classification frameworks. Scheirer et al. [15] and Saad et al. [13] have both examined the following five features, intended to measure conceptually distinct properties of speech and music signals:

1. Percentage of low-energy frames
2. Spectral roll-off point
3. Spectral flux
4. Zero-crossing rate
5. Spectral centroid

Scheirer et al. [15] have additionally used the following eight features:
1. Four Hz modulation energy
2. Variance of the spectral roll-off point
3. Variance of the spectral centroid
4. Variance of the spectral flux
5. Variance of the zero-crossing rate
6. Cepstral residual
7. Variance of the cepstral residual
8. Pulse metric

Five of these are "variance" features, consisting of the variance, over a one-second window, of an underlying measure that is calculated on a single frame. Scheirer et al. applied log transformations to all thirteen features. As a classification framework, they investigated four different classifiers:

• A multi-dimensional Gaussian maximum a posteriori (MAP) estimator
• Gaussian Mixture Model (GMM) classification
• A spatial partitioning scheme based on k-d trees
• A k-nearest-neighbor classifier

They report that the MAP Gaussian classifier does a much better job of rejecting music from the speech class than vice versa, and that among the four classifiers the k-nearest-neighbor classifier gave the best results, with around 92.2% accuracy.

Saad et al. [13] have proposed an algorithm in which speech/music classification is performed using the average percentage deviation, computed as the percentage deviation of each feature relative to the maximum deviation of that feature; if it is less than a particular threshold, the segment is labeled as speech, and otherwise as music. They report 94.25% accuracy. Saunders [14] has described a technique for discriminating speech from music on broadcast FM radio based on the Zero-Crossing Rate (ZCR) of the time-domain waveform. His technique emphasizes detecting characteristics of speech such as:

1. Limited bandwidth
2. Alternating voiced and unvoiced sections
3. Energy variations between high and low levels

It indirectly uses the amplitude, pitch, and periodicity estimate of the waveform to carry out the detection, using a multivariate Gaussian classifier. He reports an average accuracy of 95%. Carey et al. [1] present a comparison of several features, some of them already used in [14, 15], tested on the same data using Gaussian Mixture Models (GMM) as the classifier and the Expectation Maximization (EM) algorithm for training. The following features were used for classification:

1. Cepstral coefficients
2. Delta cepstral coefficients
3. Amplitude
4. Delta amplitude
5. Pitch
6. Delta pitch
7. Zero-crossing rate
8. Delta zero-crossing rate

Separate experiments were carried out on each feature in combination with its derivative. The best performance resulted from using the cepstra and delta cepstra, which gave an equal error rate (EER) of 1.2%. Parris et al. [9] have used cepstral coefficients, amplitude, and pitch features along with a GMM and report an equal error rate of 0.7%. Chou and Gu [2] have proposed an approach for robust singing signal detection applied to audio indexing in multimedia databases. The following set of features was used along with a GMM for classification:

1. MFCC
2. Log energy
3. 4 Hz modulation energy
4. Harmonic coefficients
5. 4 Hz harmonic coefficients

To index the soundtrack of multimedia documents efficiently, it is necessary to extract elementary and homogeneous acoustic segments. Pinquier et al. [10, 11, 12] have explored such prior partitioning, which consists of labeling audio segments as speech, music, speech-music, or other using GMMs. For speech detection, cepstral coefficients, entropy modulation, and 4 Hz modulation energy were used, and for music detection, spectral coefficients, the number of segments, and segment durations were used. They report 90.1% accuracy.

Harb and Chen [4] used first-order statistics of the sound spectrum as feature vectors, extracted using the Fast Fourier Transform (FFT), and then used a Neural Network (NN) to estimate the probability of each mean/variance model. The NN used is a Multi-Layer Perceptron (MLP) trained with the error back-propagation algorithm and using the sigmoid activation function. They achieved 96% classification accuracy for context-dependent problems and 93% for context-independent ones. In [5], Harb and Chen investigated audio for indexing purposes and proposed an algorithm that needs no training phase, unlike algorithms based on Gaussian Mixture Models. It classifies audio signals into four classes: speech, music, silence, and other. Different features were used for different classes; for example, energy level and ZCR for silence, and the Silence Crossing Rate (SCR) and Frequency Tracking (FT) for speech/music. Classification is achieved by thresholding these features. They report 90% classification accuracy.

The possibility of discriminating between speech and music signals using features based on low-frequency modulation has been investigated by Karnebäck [7]. Three different low-frequency modulation parameters, the 4 Hz amplitude and standard deviation, the 4 Hz normalized amplitude, and the 2-4 Hz normalized amplitude, have been extracted and tested using GMMs. A classification accuracy of 93.6% has been reported. Wang et al. [16] present a simple and effective approach in which a modified low energy ratio is first extracted as the only feature, and the system then applies the Bayes maximum a posteriori (MAP) classifier to decide the audio class. Around 97% classification accuracy has been achieved. El-Maleh et al. [3] have focused on frame-level narrowband speech/music discrimination, experimenting with four feature sets:

1. Line Spectral Frequencies (LSF)
2. Differential LSF (DLSF), the successive differences of the LSF
3. LSF with Higher Order Crossings (HOC), the zero-crossing count (ZCC) of the filtered input signal
4. LSF with the Linear Prediction zero-crossing ratio (LP-ZCR), the ratio of the ZCC of the input to the ZCC of the output of the LP analysis filter

They used two different classification algorithms, a quadratic Gaussian classifier and a k-nearest-neighbor classifier; the k-nearest-neighbor classifier gave the best result, 80.85% accuracy. Panagiotakis and Tziritas [8] have developed a system that first segments audio signals and then classifies each segment into one of three main categories: speech, music, or silence. They propose a classification algorithm based on RMS and zero crossings and report around 95% classification accuracy.
3. AUDIO FEATURES USED IN CLASSIFICATION
What makes audio recognition in the form of speech/music discrimination a hard task is that it is not evident how to find features or mathematical models that efficiently describe all the variability of the classes. In this paper, we propose some new features and use them along with features that have already been used by other researchers to carry out the classification.
3.1 Previously Used Features
Percentage of "low-energy" frames: This measures the proportion of frames whose RMS power is less than 50% of the mean RMS power within a one-second window. The energy distribution of speech is more left-skewed than that of music, since speech contains more quiet frames, so this measure is higher for speech than for music.

RMS of a low-pass response: This is the root mean square value of the signal after it has been passed through a low-pass filter. The RMS of the low-pass response is higher for speech than for music.

Spectral flux: This feature measures the frame-to-frame spectral difference and thus characterizes changes in the shape of the spectrum. Speech goes through more drastic frame-to-frame changes than music: it alternates periods of transition (consonant-vowel boundaries) with periods of relative stasis (vowels), whereas music typically has a more constant rate of change. As a result, the spectral flux is higher for speech, particularly unvoiced speech, than for music.
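As an illustration of how these previously used measures might be computed, the following Python sketch (our own, not the authors' implementation) frames a mono signal and computes the three features; the 20 ms frame length and the low-pass cutoff frequency are assumed values.

```python
# Illustrative sketch of the three previously used features, assuming a
# mono signal `x` sampled at `sr` Hz and 20 ms analysis frames.
import numpy as np
from scipy.signal import butter, lfilter

def frame_signal(x, frame_len):
    """Split x into non-overlapping frames of frame_len samples."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def low_energy_ratio(x, sr, frame_ms=20):
    """Fraction of frames whose RMS power is below 50% of the mean
    RMS power of the surrounding one-second window."""
    frames = frame_signal(x, int(sr * frame_ms / 1000))
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    per_sec = int(1000 / frame_ms)              # frames per one-second window
    flags = []
    for i in range(0, len(rms), per_sec):
        win = rms[i:i + per_sec]
        flags.extend(win < 0.5 * np.mean(win))
    return float(np.mean(flags))

def rms_of_lowpass(x, sr, cutoff_hz=1000.0):
    """RMS of the signal after a low-pass filter (the cutoff is an
    assumed value; the paper does not specify one)."""
    b, a = butter(4, cutoff_hz, btype="low", fs=sr)
    return float(np.sqrt(np.mean(lfilter(b, a, x) ** 2)))

def spectral_flux(x, sr, frame_ms=20):
    """Mean frame-to-frame difference of the normalized magnitude spectrum."""
    frames = frame_signal(x, int(sr * frame_ms / 1000))
    mags = np.abs(np.fft.rfft(frames, axis=1))
    mags /= (np.linalg.norm(mags, axis=1, keepdims=True) + 1e-12)
    return float(np.mean(np.linalg.norm(np.diff(mags, axis=0), axis=1)))
```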
3.2 Newly-Proposed Features
Mean and variance of the discrete wavelet transform (DWT): The DWT gives frequency estimates of a signal localized in time. The mean and variance of the DWT coefficients provide good discriminating feature vectors. Among the different wavelet families, we have used the Haar wavelet. Figures 1 and 2 show the discriminating patterns of speech, music, and mixed speech-music for the mean and variance of the DWT.

Figure 1. Mean of the discrete wavelet transform.

Figure 2. Variance of the discrete wavelet transform.

Difference of maximum and minimum zero crossings: The zero-crossing count is the number of time-domain zero crossings within a frame. In essence, it indicates the dominant frequency during the time period of the frame and is a correlate of the spectral centroid. Rather than using zero crossings directly, we use the difference between the maximum and minimum zero-crossing counts as a feature. As Figure 3 shows, it yields discriminating patterns for the different classes of audio signal.

Figure 3. Difference of maximum and minimum zero crossings.

Linear Predictor Coefficients (LPC): Linear prediction is one of the most powerful speech analysis techniques. It provides accurate estimates of speech parameters and is relatively efficient to compute. Figure 4 shows the viability of using the LPC as discriminating features.

Figure 4. Linear predictive coefficients.
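The sketch below illustrates one plausible way to compute the newly proposed features for a single frame or sample. It is our own illustration: the paper does not specify which DWT coefficients or decomposition level the statistics are taken over, nor the LPC order, so those choices (detail coefficients, order 10) are assumptions.

```python
# Illustrative sketch (not the authors' code) of the newly proposed features.
import numpy as np
from scipy.linalg import toeplitz, solve

def haar_dwt(frame):
    """Single-level Haar DWT: approximation and detail coefficients."""
    if len(frame) % 2:                       # make the length even
        frame = frame[:-1]
    pairs = frame.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return approx, detail

def dwt_mean_var(frame):
    """Mean and variance of the Haar DWT detail coefficients (our choice
    of coefficients; the paper does not state which are used)."""
    _, detail = haar_dwt(np.asarray(frame, dtype=float))
    return float(np.mean(detail)), float(np.var(detail))

def zero_crossings(frame):
    """Number of sign changes within the frame."""
    signs = np.sign(frame)
    return int(np.sum(signs[:-1] != signs[1:]))

def zc_max_min_difference(frames):
    """Difference between the largest and smallest zero-crossing counts
    over a sequence of frames (e.g., the frames of one 3-second sample)."""
    counts = [zero_crossings(f) for f in frames]
    return max(counts) - min(counts)

def lpc(frame, order=10):
    """LPC via the autocorrelation method (Yule-Walker equations)."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve(toeplitz(r[:order]), r[1:order + 1])
```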
4. CLASSIFICATION FRAMEWORK
After feature extraction, a classifier needs to be employed in order to discriminate between the various classes of audio signals. The classification process needs to be fast, reliable, and adaptable. Due to the variability of speech and music signals, a speech/music classifier must be able to generalize from a small amount of learning data and must be capable of adapting to new conditions, which implies that the training process must be fast and simple. In this work, multilayer feed-forward neural networks are used to carry out the classification, as explained in the following subsections.
4.1 Multilayer Feed-Forward Neural Networks
Neural networks are typically organized in layers, each made up of a number of interconnected nodes that contain an activation function. Patterns are presented to the network via the input layer, which communicates with one or more hidden layers where the actual processing is done via a system of weighted connections [6]. The hidden layers then link to an output layer where the answer is produced, as shown in Figure 5.

Figure 5. A feed-forward neural network structure.

4.2 Learning Algorithm
The back-propagation (BP) algorithm is the most widely used learning procedure for supervised neural networks. Prior to training, each connection weight is usually initialized to a small random number. BP requires pre-existing training patterns and involves a forward-propagation step followed by a backward-propagation step. The forward-propagation step begins by sending the input signals through the nodes of each layer; a nonlinear activation function is applied at each node output. This process repeats until the signals reach the output layer and an output vector is calculated. The backward-propagation step calculates the error vector by comparing the calculated and target outputs. The weights are then updated iteratively until an overall minimum error is reached, based on the following update equation:

$$
\underbrace{w_{ji}^{(l)}(n+1)}_{\text{new weight}} \;=\; \underbrace{w_{ji}^{(l)}(n)}_{\text{old weight}} \;+\; \mu\, \underbrace{\delta_j^{(l)}(n)}_{\text{local gradient}}\, y_i^{(l-1)}(n) \;+\; \alpha\, \underbrace{\Delta w_{ji}^{(l)}(n-1)}_{\text{old change in weight}}
$$

where $\mu$ is the learning rate, $\alpha$ is the momentum constant, $y_i^{(l-1)}(n)$ is the output of neuron $i$ in the preceding layer, and the local gradient is given by

$$
\delta_j^{(l)}(n) =
\begin{cases}
\bigl(d_j(n) - y_j^{(L)}(n)\bigr)\, \varphi'_j\bigl(v_j^{(L)}(n)\bigr) & \text{for neuron } j \text{ in the output layer } L,\\[4pt]
\varphi'_j\bigl(v_j^{(l)}(n)\bigr) \sum_k \delta_k^{(l+1)}(n)\, w_{kj}^{(l+1)}(n) & \text{for neuron } j \text{ in hidden layer } l,
\end{cases}
$$

with $d_j(n)$ the desired output, $v_j^{(l)}(n)$ the induced local field of neuron $j$, and $\varphi'_j$ the derivative of its activation function.
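As a concrete illustration of this update rule, the following minimal numpy sketch trains a one-hidden-layer network with sigmoid activations and momentum. It is a sketch under assumed hyper-parameters (learning rate, momentum, initialization), not the network configuration used in Section 5.

```python
# Minimal sketch of pattern-by-pattern back-propagation with momentum:
# w(n+1) = w(n) + mu * delta * y + alpha * dw(n-1).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class MLP:
    def __init__(self, n_in, n_hidden, n_out, mu=0.1, alpha=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.mu, self.alpha = mu, alpha
        # small random initial weights (bias handled as an extra input of 1)
        self.w1 = rng.normal(0.0, 0.1, (n_hidden, n_in + 1))
        self.w2 = rng.normal(0.0, 0.1, (n_out, n_hidden + 1))
        self.dw1 = np.zeros_like(self.w1)    # previous weight changes (momentum)
        self.dw2 = np.zeros_like(self.w2)

    def forward(self, x):
        x1 = np.append(x, 1.0)               # input plus bias
        h1 = np.append(sigmoid(self.w1 @ x1), 1.0)
        y = sigmoid(self.w2 @ h1)
        return x1, h1, y

    def train_step(self, x, d):
        """One forward pass and one back-propagation update for pattern x."""
        x1, h1, y = self.forward(x)
        # local gradients; sigmoid'(v) = y * (1 - y)
        delta_out = (d - y) * y * (1.0 - y)
        delta_hid = h1[:-1] * (1.0 - h1[:-1]) * (self.w2[:, :-1].T @ delta_out)
        # momentum updates, as in the equation above
        self.dw2 = self.mu * np.outer(delta_out, h1) + self.alpha * self.dw2
        self.dw1 = self.mu * np.outer(delta_hid, x1) + self.alpha * self.dw1
        self.w2 += self.dw2
        self.w1 += self.dw1
        return float(np.sum((d - y) ** 2))   # squared error for monitoring
```

For the experiments described below, such a network would have one input per feature value, 5 or 10 hidden neurons, and one output per class (speech, music, speech-music), trained by repeatedly calling train_step over the training vectors.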
5. EXPERIMENTATION AND RESULTS
When a robust and general speech/music classifier is needed, the choice of a test database is of great importance. In addition, the choice of training data as opposed to test data is another important issue. In particular, the ratio of training data to test data is an indicator of the generalization capability of a classifier.

The database we have collected for the evaluation of our technique comes from three different documentaries. This database has the advantage of presenting long periods of speech, music, and mixed zones containing both speech and music. The speech data contains a single male speaker. The audio is stored in wave file format, sampled at 44.1 kHz, 16-bit mono. For feature vector extraction, each 3-second sample was divided into 150 non-overlapping frames at 20-millisecond intervals. Two sets of experiments have been carried out: in the first, the data was divided into 60% training and 40% testing; in the second, into 70% training and 30% testing. We considered only whole audio samples for both training and testing; in other words, the frames from a single sample were never split into partially training and partially testing data. This is important because there is a good deal of frame-to-frame correlation in the sample values, so splitting up audio samples would give an incorrect estimate of the classifier's performance on truly novel data. Seven features were used for experimentation: the percentage of low-energy frames, the RMS of a low-pass response, spectral flux, the mean and variance of the DWT, the difference of maximum and minimum zero crossings, and the linear predictive coefficients (LPC).
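The whole-sample constraint can be expressed as a grouped split, where the group identifier is the audio sample a frame came from. A minimal sketch, assuming a frame-level feature matrix, per-frame labels, and per-frame sample identifiers (the names are ours):

```python
# Sample-level split: all frames from one audio sample stay on the same
# side of the train/test divide.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# X: (n_frames, n_features) feature vectors, y: class label per frame,
# sample_ids: which 3-second audio sample each frame came from.
def sample_level_split(X, y, sample_ids, test_size=0.4, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=sample_ids))
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```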
Table 1. Percentage accuracy of classification.

No. of Neurons   Training Set   Accuracy
5                60%            83.3%
10               60%            90%
5                70%            95%
10               70%            96.6%

Table 1 summarizes the percentage classification accuracy for the aforementioned feature set under different parameters. Figure 6 shows the sum of squared errors, Figure 7 shows the classification accuracy for all four experiments, and details of the misclassifications are shown in Figures 8 and 9. One can see that as the neural network gets trained, the classification error decreases. Several observations can be made from Figure 7 and Table 1. When the training set was 60% and the hidden layer had 5 neurons, the accuracy was around 83.3%; the misclassified 50% of the speech-music data was recognized as 10% music and 40% speech, as shown in Figure 9, since speech is usually dominant in documentaries. When we increased the number of neurons to 10, the classification accuracy increased to 90%, and the 30% misclassification of the speech-music data was recognized as speech (Figure 9). Similarly, when we increased the training set to 70%, the accuracy with 5 neurons was around 95%, with 13.3% of the speech-music data recognized as speech (Figure 9). Misclassifying the speech-music data as speech may be attributed to the dominance of speech in those segments. When the number of neurons was increased to 10 with the 70% training set, 96.6% accuracy was achieved, with 6.6% of the speech data recognized as speech-music (Figure 9). It is worth noting that music classification achieved 100% accuracy in all four experiments.
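For reference, per-class accuracies and misclassification percentages of the kind plotted in Figures 7 through 9 could be tabulated from the test predictions with a confusion matrix; the sketch below is illustrative and uses our own function names.

```python
# Tabulate per-class accuracy and confusions from test predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["speech", "music", "speech-music"]

def per_class_report(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    rates = cm / cm.sum(axis=1, keepdims=True)       # row-normalized rates
    for i, name in enumerate(CLASSES):
        confusions = {CLASSES[j]: round(100.0 * rates[i, j], 1)
                      for j in range(len(CLASSES)) if j != i and rates[i, j] > 0}
        print(f"{name}: {100.0 * rates[i, i]:.1f}% correct, "
              f"misclassified as {confusions}")
```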
Figure 6. Sum of squared errors.

Figure 7. Classification accuracy in percentage recognition.

Figure 8. Percentage misclassification of speech.

Figure 9. Percentage misclassification of speech-music.

6. CONCLUSION AND FUTURE WORK
Many techniques have been proposed in the literature for speech/music classification. In order to achieve acceptable performance, most of them require a large amount of training data, which makes retraining and adaptation to new conditions difficult. Other techniques are rather context-oriented, as they have been tested under specific application conditions, such as speech/music classification in radio programs or in the context of broadcast news transcription. In this paper, we introduced a simple but robust classification scheme for audio signals based on four newly proposed features and neural networks. Experimental results show the effectiveness of the presented technique, which achieved an overall classification accuracy of 96.6%, with music being classified correctly 100% of the time.
Our future work includes studying the contribution of each feature considered in this work to the classification performance. In other words, one needs to look into how much each of these features contributes positively to the classification process, and whether there are any dependencies and/or redundancies in the use of this set of features.
7. ACKNOWLEDGMENTS
The authors are thankful to the anonymous referees for their valuable comments. The authors also acknowledge the support of King Fahd University of Petroleum and Minerals in the development of this work.

8. REFERENCES
[1] Carey, M. J., Parris, E. S. and Lloyd-Thomas, H., A Comparison of Features for Speech, Music Discrimination. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 99), Vol. 1, 1999.
[2] Chou, W. and Gu, L., Robust Singing Detection in Speech/Music Discriminator Design. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 01), Vol. 2, 2001.
[3] El-Maleh, K., Klein, M., Petrucci, G. and Kabal, P., Speech/Music Discrimination for Multimedia Applications. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 00), Vol. 6, 2000.
[4] Harb, H. and Chen, L., Robust Speech Music Discrimination Using Spectrum's First Order Statistics and Neural Networks. In Proceedings of the Seventh International Symposium on Signal Processing and Its Applications, Vol. 2, 2003.
[5] Harb, H., Chen, L. and Auloge, J. Y., Speech/Music/Silence and Gender Detection Algorithm. In Proceedings of the 7th International Conference on Distributed Multimedia Systems (DMS 01), 2001.
[6] Haykin, S., Neural Networks: A Comprehensive Foundation. Prentice Hall, 1999.
[7] Karnebäck, S., Discrimination Between Speech and Music Based on a Low Frequency Modulation Feature. In Proceedings of the European Conference on Speech Communication and Technology, 2001.
[8] Panagiotakis, C. and Tziritas, G., A Speech/Music Discriminator Based on RMS and Zero-Crossings. IEEE Transactions on Multimedia, 2004.
[9] Parris, E. S., Carey, M. J. and Lloyd-Thomas, H., Feature Fusion for Music Detection. In Proceedings of the European Conference on Speech Communication and Technology, 1999.
[10] Pinquier, J., Rouas, J.-L. and André-Obrecht, R., A Fusion Study in Speech/Music Classification. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 03), Vol. 2, 2003.
[11] Pinquier, J., Rouas, J.-L. and André-Obrecht, R., Robust Speech/Music Classification in Audio Documents. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 02), Vol. 3, 2002.
[12] Pinquier, J., Sénac, C. and André-Obrecht, R., Speech and Music Classification in Audio Documents. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 02), Vol. 4, 2002.
[13] Saad, E. M., El-Adawy, M. I., Abu-El-Wafa, M. E. and Wahba, A. A., A Multifeature Speech/Music Discrimination System. In Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE 02), Vol. 2, 2002.
[14] Saunders, J., Real-Time Discrimination of Broadcast Speech/Music. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 96), Vol. 2, 1996.
[15] Scheirer, E. and Slaney, M., Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 97), Vol. 2, 1997.
[16] Wang, W. Q., Gao, W. and Ying, D. W., A Fast and Robust Speech/Music Discrimination Approach. In Proceedings of the International Conference on Information, Communications and Signal Processing, 2003.