2009 World Congress on Computer Science and Information Engineering
978-0-7695-3507-4/08 $25.00 © 2008 IEEE. DOI 10.1109/CSIE.2009.664

Enhanced Class-dependent Classification of Audio Signals

Zhou Nina, Ser Wee, Yu Zhuliang, Yu Jufeng, Chen Huawei
Center for Signal Processing, Nanyang Technological University
BLK S1, 50 Nanyang Avenue, Singapore 639798

Abstract

The process of Audio Signal Classification (ASC) involves extracting features from a sound and using these features to identify the class to which it belongs. ASC has many possible applications, including speech recognition, audio database creation and information retrieval, health condition monitoring and audio scene analysis. While relevant features have been well studied and identified for speech signals, they are relatively less studied for other types of audio signals. Considering the fact that different classes of audio signals have their own unique characteristics, the idea of class-dependent feature selection and classification is examined in this paper. In particular, the paper uses a class-dependent method based on a proposed scatter-matrix-based class separability ranking measure to select a highly relevant feature subset for each type of audio signal. An effective training model is also incorporated into the proposed method. The support vector machine with a radial basis function kernel is then used as the classifier. Experiments have been conducted on speech and two other types of audio sounds, i.e., coughing and the sound generated when a cup touches a plate or vice versa. Compared to some recently published methods, the proposed class-dependent ASC method requires fewer features and is able to achieve the same or better classification accuracy.

1. Introduction

In the area of audio information recognition (or audio signal classification [1], [2]), the classification of speech and music is well studied. The recognition process generally involves preprocessing the audio signal, extracting relevant features, and classifying the sound into one of a set of classes [1], [2]. As in other classification tasks, feature extraction is critically important for audio signal classification. Non-speech audio signal classification, however, is not as well studied as the classification of speech and music. Since a common method for audio signal classification has not emerged, most reported audio signal classification methods use features developed for speech recognition [3]. For example, in the non-speech audio signal classification methods reported in [3], three commonly used feature extraction techniques, namely mel-frequency cepstral coefficients (MFCC), the continuous wavelet transform (CWT) and the short-time Fourier transform (STFT), are applied. The simulation results in [3] indicate that the combination of MFCC and dynamic time warping (DTW) gives the best recognition performance. In the classification of cough signals reported in [4], some unique spectral coefficients are used as the input features to a probabilistic neural network (PNN) based classifier. For the classification of speech and non-speech audio signals, Kimber et al. [5] used cepstral coefficients as feature inputs to Hidden Markov Models (HMMs) [6] to classify speech, silence, laughter and other audio signals. Abu-El-Quran et al. [7] proposed to first distinguish speech from acoustic sounds using a pitch ratio parameter, and then further classify the non-speech audio segments using a time delay neural network with MFCC and ΔMFCC features. In addition, variations of these features over time and the relationships among them have also been exploited to distinguish speech, music and silence. Existing research pays little attention to the specific characteristics of each class of audio signals. Considering the fact that different features may have different distinguishing abilities for different sound signals, this paper proposes to apply the idea of class-dependent feature selection to the classification of speech and other audio signals: features that are less relevant are abandoned to improve the classification performance. In our daily activity monitoring project, speech, coughing and the cup-plate touching sound are the target signals for classification. Instead of using the pitch ratio value to separate speech from other non-speech signals as in [7], we use the commonly used feature extraction techniques, e.g., MFCC, STFT and DWT [8], in the first part of our proposed method. We then propose a new feature ranking measure modified from [16] and use it to rank and select a small number of class-dependent features [9] for each type of the target audio signals. In addition, a novel training model is also incorporated into the proposed method to further enhance the classification accuracy.

2. Problem formulation

2.1. Preprocessing

[Figure 1: block diagram of the three processing stages in sequence: Preprocessing → Feature selection → Classification.]

Figure 1. Block diagram of audio signal classification.

Audio signal classification consists of three basic steps, i.e., data preprocessing, feature extraction and classification [8] (see Figure 1). According to [8], [14], data preprocessing originally involves sampling and loading the audio signals into computers. In this paper, sampling and loading are not discussed further; more detail is given to the preprocessing tasks of sound segmentation [14] and extension. The speech data are downloaded from the Harvard-Haskins Database of Regularly-Timed Speech [11]; coughing and the cup-plate touching sound are recorded by us. The sampling frequency is set to 8 kHz and the recording time to 5 seconds. Each sample obtained contains several periods of the target sound together with some background sound that approximates silence, so sound segmentation is necessary. For speech, owing to the characteristics of voiced and unvoiced speech, short-time energy and zero-crossing rates are combined to realize the segmentation. For coughing and the cup-plate touching sound, we use the energy method, where the energy is calculated by Eq. (1):

E(t) = \alpha E(t-1) + (1 - \alpha) x(t)^2    (1)

Here E(t) denotes the energy of the signal x(t) at time t, and α is a regulating parameter within the range [0, 1); in this paper α is set to 0.99. If T_e denotes the energy threshold and max(E(t)) the maximum value of the energy E(t), each single target sound whose energy is above T_e * max(E(t)) is extracted.
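As a concrete illustration, the sketch below implements this segmentation rule in Python (the paper's own experiments use MATLAB, so this is an assumed re-implementation, not the authors' code). The recursive update follows Eq. (1) with α = 0.99; the threshold T_e is a free parameter that the paper does not fix, so the default value here is only a placeholder.

import numpy as np

def segment_by_energy(x, alpha=0.99, te=0.1):
    """Extract target-sound segments from a 1-D signal using the
    recursive energy estimate of Eq. (1):
        E(t) = alpha * E(t-1) + (1 - alpha) * x(t)^2
    A sample belongs to a segment when E(t) > te * max(E).
    te is a placeholder value; the paper does not specify it."""
    energy = np.empty(len(x), dtype=float)
    e = 0.0
    for t, sample in enumerate(x):
        e = alpha * e + (1.0 - alpha) * sample ** 2
        energy[t] = e
    active = energy > te * energy.max()
    # Group consecutive above-threshold samples into (start, end) pairs.
    segments, start = [], None
    for t, flag in enumerate(active):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(x)))
    return segments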

2.2. Feature extraction and feature selection

[Figure 2: Welch power spectral density estimates (Power/frequency in dB/Hz versus Frequency in kHz, 0-4 kHz) for coughing, the cup-plate touching sound and speech.]

Figure 2. Comparison of Welch power spectral density estimates among coughing, cup-plate and speech. (a) Coughing. (b) Cup-plate. (c) Speech.

In this paper, three commonly used feature extraction techniques, i.e., MFCC, STFT and DWT, are applied to coughing, the cup-plate touching sound and speech. The details of these feature extraction techniques can be found in [3], [8]. Different audio signals have their own characteristics. For example, Figure 2 shows that coughing concentrates most of its power at around 400 Hz, whereas the cup-plate touching sound concentrates most of its power at around 3 kHz. Therefore, it is possible to select a class-dependent feature subset [6] for each type of audio signal.
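Spectra of the kind shown in Figure 2 can be produced with a standard Welch estimator. The following sketch, written in Python with SciPy rather than the MATLAB environment used in the paper, computes a Welch PSD at the paper's 8 kHz sampling rate and reports the peak-power frequency used above to contrast the classes; the synthetic test tone stands in for a real recording.

import numpy as np
from scipy.signal import welch

FS = 8000  # sampling rate used in this paper (8 kHz)

def dominant_band(x, fs=FS):
    """Welch PSD of a signal and the frequency carrying peak power,
    as used to contrast coughing (~400 Hz) with cup-plate (~3 kHz)."""
    freqs, psd = welch(x, fs=fs, nperseg=256)
    return freqs, psd, freqs[np.argmax(psd)]

# Example with a synthetic stand-in signal: a 400 Hz tone plus noise.
t = np.arange(FS) / FS
x = np.sin(2 * np.pi * 400 * t) + 0.1 * np.random.randn(FS)
_, _, peak = dominant_band(x)
print(f"peak power near {peak:.0f} Hz")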

In the process of feature selection, we modify the scatter-matrix-based class separability measure of [16] to obtain a new ranking measure. According to [16], the trace of S_W^{-1} S_B can be used to measure the separability of data. Here, S_W is the within-class scatter matrix

S_W = \sum_{c=1}^{C} P_c \sum_{j=1}^{n_c} (x_j^c - m_c)^T (x_j^c - m_c)    (2)

and S_B is the between-class scatter matrix

S_B = \sum_{c=1}^{C} P_c (m_c - m)^T (m_c - m)    (3)

Here C refers to the number of classes and P_c to the probability of the c-th class; n_c is the number of samples in the c-th class and x_j^c is the j-th sample of the c-th class; m_c denotes the mean vector of the c-th class and m the mean vector of all the training samples. Given two different feature subsets, the better subset yields the larger trace value; among all possible feature subsets, the most important one is the one with the biggest trace value of S_W^{-1} S_B.

Here, we propose to evaluate each individual feature by the trace of S_W^{-1} S_B. Each time, we remove one feature and calculate the trace value for the feature subset consisting of all features except the removed one, denoted trace(S_W^{-1} S_B^{(R_i)}). Sequentially, each feature is removed once. If an important feature is removed, the trace value of S_W^{-1} S_B^{(R_i)} becomes small; therefore, the smaller the value of trace(S_W^{-1} S_B^{(R_i)}), the more important the removed feature. In this way, all features are ranked for each class. Since the proposed class separability measure is based on scatter matrices, in contrast to the distance-based CSM [9], we call it the scatter-matrix-based CSM. The next task is to select a desirable feature subset for each class of sound signals.
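To make the ranking procedure concrete, the sketch below computes Eqs. (2) and (3) and the removal-based ranking in Python with NumPy; it is an illustration, not the paper's MATLAB code. Samples are treated as row vectors, and the pseudo-inverse stands in for S_W^{-1} as a guard against a singular scatter matrix, which is an implementation choice the paper does not discuss.

import numpy as np

def scatter_trace(X, y):
    """trace(pinv(S_W) @ S_B) for data X (n_samples x n_features)
    with labels y, following Eqs. (2) and (3)."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    sw = np.zeros((d, d))
    sb = np.zeros((d, d))
    for c, pc in zip(classes, priors):
        xc = X[y == c]
        mc = xc.mean(axis=0)
        diff = xc - mc
        sw += pc * diff.T @ diff              # within-class scatter, Eq. (2)
        sb += pc * np.outer(mc - m, mc - m)   # between-class scatter, Eq. (3)
    # Pseudo-inverse guards against a singular S_W (implementation choice).
    return np.trace(np.linalg.pinv(sw) @ sb)

def rank_features(X, y):
    """Remove each feature in turn; the smaller trace(S_W^-1 S_B^(Ri))
    becomes, the more important the removed feature i."""
    traces = [scatter_trace(np.delete(X, i, axis=1), y)
              for i in range(X.shape[1])]
    return np.argsort(traces)  # feature indices, most important first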

The process of selecting a specific feature subset for each type of sound signal consists of three steps. Firstly, we convert the multi-class problem into several two-class problems: given a classification problem with C classes {y_1, y_2, ..., y_{C-1}, y_C}, adopting the "one-against-all" strategy [9] yields C two-class problems. For the first problem, class y_1 is one of the two classes and all the other classes {y_2, y_3, ..., y_C} are treated as the second class; sequentially, for the last problem, class y_C is one class and {y_1, y_2, ..., y_{C-1}} is the other. In this way, C problems are formed. Secondly, for each two-class problem, the features are ranked by the proposed scatter-matrix-based class separability ranking measure; each ranking list is specific to separating the current two classes, so C feature importance ranking lists are obtained for the C classes. Finally, the number of features for each class is determined according to a predefined threshold or the accuracy [9]. Here, we determine the feature subsets again according to trace(S_W^{-1} S_B): the desirable feature subset is the one with the maximal trace value. For convenience of representation, each selected feature subset is denoted by a feature mask [9], which uses 1 to represent the presence of a feature and 0 to represent its absence.

2.3. Classification with class-dependent feature subsets

Considering the many advantages of the SVM [9] and its wide application in audio signal classification [10], [12], we use the SVM classifier with a radial basis function (RBF) kernel in our experiments to test all the different feature vectors. For each type of audio signal, we construct a binary classification model. Taking speech as an example, the binary model is trained with two classes, one class being speech and the other consisting of coughing and the cup-plate touching sound; all samples, represented with the speech feature mask, are input into the model. Likewise, two models for coughing and the cup-plate touching sound are trained. A given test sample is input into the three models in turn, and each model produces an output [9], a value within the range [0, 1] that denotes the probability that the sample, under the current feature subset, belongs to the current class. According to [9], the maximal output determines the class the sound belongs to.
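A minimal sketch of this one-model-per-class scheme follows, using scikit-learn's SVC as a stand-in classifier (the paper's experiments were run in MATLAB, so this is assumed purely for illustration). Each class carries its own Boolean feature mask from the selection step, and the largest probability output decides the label.

import numpy as np
from sklearn.svm import SVC

class ClassDependentSVC:
    """One binary RBF-SVM per class, each trained on its own
    class-dependent feature mask; the largest class-probability
    output decides the label."""

    def __init__(self, masks):
        self.masks = masks        # {label: boolean feature mask}
        self.models = {}

    def fit(self, X, y):
        for label, mask in self.masks.items():
            model = SVC(kernel="rbf", probability=True)
            # One-against-all target: 1 for this class, 0 for the rest.
            model.fit(X[:, mask], (y == label).astype(int))
            self.models[label] = model
        return self

    def predict(self, X):
        labels = sorted(self.models)
        # Probability that each sample belongs to each class.
        probs = np.column_stack(
            [self.models[l].predict_proba(X[:, self.masks[l]])[:, 1]
             for l in labels])
        return np.array(labels)[probs.argmax(axis=1)]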

3. Experimental setup and results

The experiments are implemented in MATLAB R2006b. The MFCC algorithm is taken from Malcolm Slaney's toolbox [3], the STFT algorithm from Timothy D. Dorney's toolbox [3], and the DWT algorithm is available in the MATLAB toolbox. After sound segmentation, we obtain 292 samples: 68 coughing samples, 76 cup-plate touching samples and 148 speech samples. For each kind of sample, we randomly choose 2/3 for training (195 training samples in total) and the remaining 1/3 for testing (97 in total). Each simulation is repeated 10 times. For each sound signal, we first separate the signal into frames and extract 20 MFCC coefficients per frame; the mean value of the 20 MFCCs over all frames is used as the feature input to the SVM [7], [15]. We utilize the proposed scatter-matrix-based class separability measure to rank the features and select some top-ranked features for the classification, and we also run the distance-based CSM [9] for comparison. In addition, we apply the proposed method to the extracted STFT and DWT features [3], [8], respectively. Finally, we conduct a simulation under noisy conditions (see Figure 3), where white Gaussian noise with different SNRs is added.

In Table 1, we list the top ten features selected by the proposed method and by the distance-based CSM. The results in Table 1 show that the top-ranked MFCC features for the three types of audio signals are very different. For coughing, the important features are those close to the low frequency band. This agrees with the characteristics of coughing (see Figure 2), where the first plot shows that the power spectrum of coughing has an obvious maximum at around 0.4 kHz, i.e., a comparatively low frequency band. For the cup-plate sound, the important features are those close to a higher frequency band than for coughing; this too can be explained by the power spectrum estimate in Figure 2, which shows that the cup-plate sound has its maximal power at around 3.2 kHz. For speech, the power spectrum estimate shows a wide frequency band, and among the top-ranked features for speech some belong to the high frequency band and others to the low frequency band. Besides the proposed method, the features ranked by the distance-based CSM [9] also reflect these characteristics.

When we use the trace of S_W^{-1} S_B^{(R_i)} to determine the feature subsets from the ranking list, we find that the top-ranked 10 features have nearly the same trace value as all 20 features. In Table 2, we present the classification results for different numbers of features. The top-ranked 5 features produce a mean classification accuracy of 96.91%, which is slightly higher than the 96.50% obtained with the CSM [9]; this difference arises from the different feature ranking lists (see Table 1) produced by the proposed scatter-matrix-based class separability measure and the distance-based CSM [9]. With the top 8 features, the mean classification accuracy increases by about 1% for both methods. The top 10 features produce the same classification accuracy as all the features, which shows that using the trace of S_W^{-1} S_B^{(R_i)} to select feature subsets is feasible for our problem.
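For illustration, the per-sample MFCC feature vector described above can be computed as follows; librosa is assumed here purely for convenience, whereas the paper uses Malcolm Slaney's MATLAB toolbox, so details such as framing defaults may differ.

import librosa  # assumed; the paper used a MATLAB MFCC toolbox

def mfcc_mean_features(path, sr=8000, n_mfcc=20):
    """Frame the signal, extract 20 MFCCs per frame and average
    over frames, giving one 20-dimensional vector per sample."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (20, n_frames)
    return mfcc.mean(axis=1)                                # (20,)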

Table 1. Feature ranking of three types of audio signals using the proposed class-dependent feature selection method and the CSM method in [9].

Class       Top 10 features (proposed)      Top 10 features (CSM [9])
cough       6 8 4 3 7 2 15 17 11 20         2 3 4 7 6 9 10 8 12 20
cup-plate   12 5 18 17 11 19 10 8 20 16     6 12 15 20 17 19 18 16 1
speech      17 1 5 3 19 8 18 6 7 4          1 17 7 19 5 8 14 20 10

Table 2. Comparative classification accuracies for different numbers of features; (.) denotes the standard deviation.

            Accuracies (%) for different numbers of features
Method      5             8             10            20 (all feats.)
Proposed    96.91 (1.03)  97.94 (1.63)  98.97 (1.03)  98.97 (1.03)
CSM [9]     96.50 (0.92)  97.53 (1.17)  98.97 (1.03)  98.97 (1.03)

For the DWT features, we conduct a 3-level decomposition and obtain 3044 coefficients as feature inputs. Using all the features produces a mean classification accuracy of 84.54%, whereas the top-ranked 100 features improve the mean classification accuracy by nearly 10%, to 94.23% (see Table 3). This may indicate that some coefficients are irrelevant to the sound classification. In calculating the STFT features, we first separate each signal into short time frames, with a frame length of 256 and a step of 80. For each frame, we select the first 129 Fourier transform coefficients as features. Using all 129 features produces a mean classification accuracy of 98.56% (Table 3). Using the top-ranked 10 features and 15 features both produce a mean classification accuracy of 98.35%, while the top-ranked 50 features improve the mean classification accuracy by 0.20% (Table 3). This improvement also shows that not all Fourier transform coefficients are useful for the sound classification; some may even degrade the classification performance.

Table 3. Classification accuracies of three types of audio signals with the proposed class-dependent feature selection method.

            All feats.                      Proposed method
Feats       Feat. num   mean acc. (std) %   Feat. num   mean acc. (std) %
DWT         3044        84.54 (2.75)        100         94.23 (1.47)
STFT        129         98.56 (0.92)        50          98.76 (1.69)
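A sketch of the STFT feature computation described above: frames of length 256 with a step of 80, keeping the first 129 one-sided FFT coefficients per frame (129 = 256/2 + 1). How the per-frame coefficients are pooled into a single vector is not stated in the paper, so averaging the magnitudes over frames, mirroring the MFCC treatment, is an assumption made here.

import numpy as np

def stft_features(x, frame_len=256, hop=80):
    """First 129 one-sided FFT magnitudes per frame, averaged over
    frames (assumes len(x) >= frame_len)."""
    frames = [x[s:s + frame_len]
              for s in range(0, len(x) - frame_len + 1, hop)]
    mags = np.abs(np.fft.rfft(np.asarray(frames), axis=1))  # (n_frames, 129)
    return mags.mean(axis=0)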

Since the data used in [7], [15] are not publicly available, we cannot compare our method directly with theirs. The simulations above were conducted under nearly ideal conditions, i.e., with no added noise. Here, we add white Gaussian noise with different SNRs to our audio signals to simulate non-ideal conditions. In Figure 3, white Gaussian noise at -5 dB, 0 dB, 5 dB and 10 dB SNR is added, and we compare the classification accuracies obtained with no feature selection, with the proposed feature selection and with the distance-based CSM [9], for the MFCC features. The added noise slightly degrades the classification accuracy, and with the proposed feature selection the classification accuracy can be improved.

So does the distance-based CSM method [9]. Comparing the two, the proposed method works better at some SNRs and is comparable to the distance-based CSM [9] at the others.
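The noisy test condition can be reproduced by scaling white Gaussian noise to a target SNR, as in the sketch below; this is an illustrative Python version, as the exact noise-generation code used for Figure 3 is not given in the paper.

import numpy as np

def add_awgn(x, snr_db):
    """Add white Gaussian noise so the result has the requested SNR,
    e.g. -5, 0, 5 or 10 dB as in Figure 3."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(x)) * np.sqrt(noise_power)
    return x + noise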

[Figure 3: mean classification accuracy versus SNR for the MFCC features, comparing no feature selection, the proposed feature selection and the distance-based CSM; accuracies range from roughly 0.88 to 0.98 over SNRs from -5 dB to 10 dB.]

Figure 3. Comparative classification accuracy for MFCC features in the presence of white Gaussian noise.

4. Conclusions

This paper proposes to utilize class-dependent feature selection for audio signal classification. We proposed a new class separability ranking measure based on scatter matrices, which differs from the distance-based CSM [9]. The proposed method first extracts MFCC, DWT and STFT features for speech, coughing and the cup-plate sound, and then ranks the features for each class. Experimental results showed that the proposed class-dependent feature selection is able to effectively exploit the characteristics of each type of audio signal and to select very different feature subsets for the different classes. Compared with some existing feature selection methods and with methods without feature selection, the proposed method is able to achieve the same or even better classification accuracy.

5. References

[1] Klapuri, "Audio Signal Classification", Technical Report, ISMIR Graduate School, 2004.

[2] D. Gerhard, "Audio Signal Classification: History and Current Techniques", Technical Report, Department of Computer Science, University of Regina, Regina, Canada, Nov. 2007.

[3] M. Cowling, "Non-Speech Environmental Sound Classification Systems for Autonomous Surveillance", Ph.D. Thesis, School of Information Technology, Griffith University, Gold Coast Campus, 2004.

[4] S. J. Barry, A. D. Dane, A. H. Morice and A. D. Walmsley, "The Automatic Recognition and Counting of Cough", Cough, vol. 2, no. 8, 2006.

[5] D. Kimber and L. Wilcox, "Acoustic Segmentation for Audio Browsers", Proceedings of the Interface Conference, Australia, 1996.

[6] L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1999.

[7] A. R. Abu-El-Quran, R. A. Goubran and A. D. C. Chan, "Security Monitoring Using Microphone Arrays and Audio Classification", IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 4, pp. 1025-1032, Aug. 2006.

[8] M. Cowling and R. Sitte, "Comparison of Techniques for Environmental Sound Recognition", Pattern Recognition Letters, vol. 24, no. 15, pp. 2895-2907, 2003.

[9] L. P. Wang, N. N. Zhou and F. Chu, "A General Wrapper Approach to Selection of Class-Dependent Features", IEEE Transactions on Neural Networks, vol. 19, no. 2, pp. 1267-1278, 2008.

[10] L. T. Chen, M. J. Wang, C.-J. Wang and H.-M. Tai, "Audio Signal Classification Using Support Vector Machines", Springer-Verlag Berlin Heidelberg, pp. 188-193, 2006.

[11] The Harvard-Haskins Database of Regularly-Timed Speech, http://vesicle.nsi.edu/users/patel/download.html.

[12] G. D. Guo and S. Z. Li, "Content-Based Audio Classification and Retrieval by Support Vector Machines", IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209-215, 2003.

[13] X. D. Zhang, "Modern Signal Processing", Springer, 2002.

[14] E. Dorken, S. Nawab and S. Milios, "Knowledge-Based Signal Processing Applications", Prentice Hall, 1992.

[15] C.-C. Lin, S.-H. Chen, T.-K. Truong and Y. Chang, "Audio Classification and Categorization Based on Wavelets and Support Vector Machine", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 644-651, Sep. 2005.

[16] P. A. Devijver and J. Kittler, "Pattern Recognition: A Statistical Approach", Prentice Hall, New York, 1982.