AUDIO CLASSIFICATION BASED ON MAXIMUM ENTROPY MODEL

Zhe Feng, Yaqian Zhou, Lide Wu, Zongge Li
Department of Computer Science and Engineering, Fudan University, Shanghai, P. R. China

ABSTRACT

Audio classification has been investigated for several years and is a key component of many audio and video applications. In prior work, the accuracy under complicated conditions is not satisfactory and the results depend heavily on the dataset. In this paper, we present a novel audio classification method based on the maximum entropy model. By applying this model to some widely used features, different feature combinations are considered during training and better performance is achieved. When evaluated on the speech/music feature extraction task of the TREC 2002 Video Track, the method works well for both speech and music among the participating systems.

1. INTRODUCTION

Audio classification, especially speech/music discrimination, has become a focus of audio processing and pattern recognition research over the past decade. It is useful in many areas and applications, such as audio segmentation, audio indexing and retrieval, and automatic speech recognition. In video processing, especially content-based video retrieval, many studies show that semantic information can be extracted more easily with the help of speech and audio. Audio classification can therefore serve both as a preprocessing step for video retrieval and as a component of video feature extraction, such as MPEG-7 descriptor extraction.

Generally speaking, audio classification is a pattern classification problem. First, the required features are extracted from the audio signal; then audio snippets are classified with a learning algorithm. Early work selected a few audio features, such as zero-crossing rate, short time energy and pitch, and classified the audio with simple thresholds [1][2][3]. Later, more sophisticated models were applied in this field. Scheirer and Slaney compared many frequently used features and applied GMM, BP-ANN and k-NN classifiers [4]. These methods work well under simple conditions.



For example, using only zero-crossing rate and short time energy, with a 2.4-second window and no mixing between speech and music, the reported accuracy is 98% [4]. As soon as the conditions become more complicated, however, the performance of most methods degrades dramatically. Recent work has made the discrimination more robust: Lu et al. introduced band periodicity and a two-step discrimination scheme to discriminate speech, music, silence and environment sound in one-second windows [5], and Srinivasan et al. tried to classify audio consisting of mixed classes [6]. Still, the accuracy under complicated conditions is not satisfactory and the results depend heavily on the dataset.

In this paper, a novel audio classification method based on the maximum entropy model is presented. Owing to the principles of the maximum entropy model, the method is more flexible across datasets and can weigh different feature combinations. The results given in Section 5 show that the method is efficient and robust under complicated conditions. Compared with other audio classification systems, it obtained more satisfactory results on the speech/music feature extraction task of the TREC 2002 Video Track.

The rest of this paper is organized as follows. The maximum entropy model is introduced in Section 2. Section 3 lists the features used in our method. The system details are described in Section 4. We present experimental results and evaluations in Section 5, and conclusions in Section 6.

2. MAXIMUM ENTROPY MODEL

2.1. Basic Idea

The basic idea of the maximum entropy model is to model everything that is known and to assume nothing about what is unknown. That is, we should find a distribution that satisfies all the facts known to the classifier and assumes nothing beyond them. Let X be the input feature vector of the classifier and y the output class label; the distribution p(y|X) is then estimated following this idea. In the maximum entropy model, this principle is implemented by maximizing the conditional entropy

    H(p) = -\sum_{X,y} \tilde{p}(X)\, p(y \mid X) \log p(y \mid X)

while remaining consistent with the known facts, where \tilde{p}(X) is the empirical distribution of X in the training data.




The general way to represent the known facts is to define ME features of the form

    f_i(X, y) = \begin{cases} 1, & \text{if } (X, y) \text{ appears in the training data} \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, \ldots, N,

where N is the size of the training sample set. An ME feature thus represents a relationship between the input feature vector X and the output audio class label y. To constrain the distribution to satisfy the known facts, a constrained distribution set is defined as

    \mathcal{P} = \{\, p(y \mid X) : E_p\{f_i\} = E_{\tilde{p}}\{f_i\},\ 1 \le i \le N \,\}

where

    E_p\{f_i\} = \sum_{X,y} f_i(X, y)\, \tilde{p}(X)\, p(y \mid X)
    E_{\tilde{p}}\{f_i\} = \sum_{X,y} f_i(X, y)\, \tilde{p}(X)\, \tilde{p}(y \mid X)

and \tilde{p}(X), \tilde{p}(y \mid X) are the distributions observed in the training data. The problem is then to find the distribution with maximum conditional entropy within the constrained set \mathcal{P}:

    p^*(y \mid X) = \arg\max_{p(y \mid X) \in \mathcal{P}} \Big( -\sum_{X,y} \tilde{p}(X)\, p(y \mid X) \log p(y \mid X) \Big)

It can be proved that the solution of this problem has the exponential form [7]

    p^*(y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_i \lambda_i f_i(X, y) \Big), \qquad Z(X) = \sum_y \exp\Big( \sum_i \lambda_i f_i(X, y) \Big),

where Z(X) is a normalization factor and \lambda_i is the weight of ME feature f_i. The weights \lambda_i are estimated from the training data.

2.2. Build Maximum Entropy Model

In our method, the audio is segmented into snippets of a fixed duration. The input feature vector X is extracted from each audio snippet, and y is its output audio class label. The maximum entropy model formulated above requires discrete features, so features with continuous values are discretized first. The different features in the input vector are obviously not fully independent and are not equally important for audio classification; the ME feature space is therefore constructed by considering different feature combinations.
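To make the classification rule concrete, the following sketch shows how a model of this exponential form could score a single audio snippet once the weights are known. It is a minimal illustration under stated assumptions rather than the authors' implementation: the combination templates, the discretized feature names and the weight values are hypothetical, and the weights \lambda_i are assumed to have been estimated beforehand (for example with an iterative scaling procedure).

    import math

    CLASSES = ["speech", "music", "noise"]

    def me_features(x, y, templates):
        """List the binary ME features that fire for one (X, y) pair.

        x is a dict of already-discretized window features; each ME feature is
        an indicator that a particular combination of feature values co-occurs
        with class label y (the combination templates are hypothetical).
        """
        fired = []
        for names in templates:                      # e.g. ("lster",) or ("hzcrr", "nfr")
            values = tuple(x[n] for n in names)
            fired.append((names, values, y))         # identity of the fired indicator
        return fired

    def classify(x, weights, templates):
        """Return p(y|X) for each class: p(y|X) = exp(sum_i lambda_i f_i(X, y)) / Z(X)."""
        unnorm = {}
        for y in CLASSES:
            s = sum(weights.get(f, 0.0) for f in me_features(x, y, templates))
            unnorm[y] = math.exp(s)
        z = sum(unnorm.values())                     # normalization factor Z(X)
        return {y: v / z for y, v in unnorm.items()}

    # toy usage with made-up discretized window features and weights
    templates = [("lster",), ("hzcrr",), ("hzcrr", "nfr")]
    weights = {
        (("lster",), (2,), "music"): 1.3,
        (("hzcrr",), (0,), "music"): 0.7,
        (("hzcrr", "nfr"), (0, 1), "speech"): -0.4,
    }
    x = {"lster": 2, "hzcrr": 0, "nfr": 1}
    probs = classify(x, weights, templates)
    print(max(probs, key=probs.get), probs)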

3. FEATURE EXTRACTION

Many features have been considered and evaluated in prior work. In our method, features are first extracted for each frame; we call them frame features. Audio type classification is applied to one-second windows, so window features are then extracted from the frame features. The window features are the final features used for classification.

3.1. Frame Feature

In our system, the frame features are: zero-crossing rate, short time energy, autocorrelation coefficients, DFT coefficients, brightness [4], pitch, band spectrum energy and bandwidth. The length of each frame is 25 ms. Since these are widely used low-level features of audio signals, we omit their detailed definitions.
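As a concrete illustration of the frame-level stage, the sketch below computes two of the listed frame features, zero-crossing rate and short time energy, on 25 ms frames. It follows the standard definitions rather than the authors' code; the sampling rate, the non-overlapping frame layout and the NumPy-based implementation are assumptions.

    import numpy as np

    def frame_signal(x, sr, frame_ms=25):
        """Split a mono signal into non-overlapping 25 ms frames (hop assumed equal to frame length)."""
        frame_len = int(sr * frame_ms / 1000)
        n_frames = len(x) // frame_len
        return x[:n_frames * frame_len].reshape(n_frames, frame_len)

    def zero_crossing_rate(frames):
        """Fraction of adjacent sample pairs whose sign differs, per frame."""
        return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    def short_time_energy(frames):
        """Mean squared amplitude per frame."""
        return np.mean(frames.astype(np.float64) ** 2, axis=1)

    # toy usage: one second of noise at 16 kHz -> 40 frames of 25 ms
    sr = 16000
    x = np.random.randn(sr)
    frames = frame_signal(x, sr)
    print(frames.shape, zero_crossing_rate(frames).mean(), short_time_energy(frames).mean())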

3.2. Window Feature

Window features are extracted from the frame features of Section 3.1 over all frames in a one-second window. In the following formulas, DFT(n, k) is the k-th DFT coefficient of frame n.

- Mean and variance of zero-crossing rate.
- High Zero-Crossing Rate Ratio (HZCRR): the ratio of frames whose zero-crossing rate is above 1.5 times the average zero-crossing rate in the window [5].
- Mean and variance of short time energy.
- Low Short Time Energy Ratio (LSTER): the ratio of frames whose short time energy is below 0.5 times the average short time energy in the window [5].
- Noise Frame Ratio (NFR): the ratio of noise frames in a given audio snippet, where a frame is considered a noise frame if the maximum of its autocorrelation coefficients is below a preset threshold [5].
- Mean and variance of brightness.
- Spectral flux (SF): following [4][5], the average variation of the spectrum between adjacent frames in the window,

      SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \big[ \log|DFT(n, k)| - \log|DFT(n-1, k)| \big]^2

- Spectral Roll-off Point (SRP): as described by Scheirer et al. [4], the 95th percentile of the power spectral distribution,

      SRP = \frac{1}{K-1} \mathop{\arg\min}_{i = 0, \ldots, K-1} \left\{ \frac{\sum_{n=0}^{N-1} \sum_{k=0}^{i} |DFT(n, k)|}{\sum_{n=0}^{N-1} \sum_{k=0}^{K-1} |DFT(n, k)|} > 0.95 \right\}

- Mean and variance of pitch.
- Pitch ratio: the ratio of frames in the window whose pitch is greater than 0.
- Mean and variance of band spectrum energy.
- Mean and variance of bandwidth.

Among these window features, HZCRR, LSTER, NFR, pitch ratio and SRP are already discrete; the other window features are discretized using uniform quantization, as sketched below.
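The sketch below derives a few of the window features above (HZCRR, LSTER and spectral flux) from per-frame values, together with a simple uniform quantizer for the continuous features. It is a minimal reading of the definitions cited from [4][5], not the authors' implementation; the number of quantization levels and the value ranges are hypothetical.

    import numpy as np

    def hzcrr(zcr):
        """High Zero-Crossing Rate Ratio: frames with ZCR above 1.5x the window average."""
        return np.mean(zcr > 1.5 * np.mean(zcr))

    def lster(ste):
        """Low Short Time Energy Ratio: frames with energy below 0.5x the window average."""
        return np.mean(ste < 0.5 * np.mean(ste))

    def spectral_flux(dft_mag, eps=1e-10):
        """Average squared log-spectrum difference between adjacent frames.

        dft_mag has shape (N frames, K bins); eps avoids log(0).
        """
        diff = np.diff(np.log(dft_mag + eps), axis=0)   # frame n minus frame n-1
        n, k = dft_mag.shape
        return np.sum(diff[:, 1:] ** 2) / ((n - 1) * (k - 1))

    def uniform_quantize(value, lo, hi, levels=8):
        """Map a continuous window feature to one of `levels` equal-width bins."""
        idx = int((value - lo) / (hi - lo) * levels)
        return min(max(idx, 0), levels - 1)

    # toy usage on one 1-second window of 40 frames
    zcr, ste = np.random.rand(40), np.random.rand(40)
    dft_mag = np.abs(np.random.randn(40, 257))
    window_feature = {
        "hzcrr": hzcrr(zcr),
        "lster": lster(ste),
        "sf_q": uniform_quantize(spectral_flux(dft_mag), lo=0.0, hi=5.0),
    }
    print(window_feature)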

4. DETAILS OF SYSTEM

4.1. Training Data

The 3-hour training set contains 15 videos selected from the TREC 2001 Video Track dataset. These videos contain several audio types: pure speech, pure music, background noise, speech under music, speech under noise, speech under music with noise, music with noise and silence. They were labeled manually, and a maximum entropy model is trained for each audio class except silence.

4.2. Classification Procedure

The whole classification procedure consists of four steps (a sketch follows):
Step 1. Detect silence based on energy.
Step 2. Extract and discretize window features for each one-second non-silence window.
Step 3. Apply the maximum entropy model of each audio class and pick the class with the maximum probability.
Step 4. Smooth the results with some rules.
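A compact outline of how these four steps could be chained is given below. It is a hypothetical sketch, not the authors' code: the energy threshold for silence detection, the input layout and the isolated-label smoothing rule are assumptions, and classify_fn stands for any classifier returning class probabilities, such as the maximum entropy scoring sketched in Section 2.

    def classify_windows(windows, classify_fn, silence_energy=1e-4):
        """Label each one-second window, then smooth the label sequence.

        windows is a list of dicts holding a window's mean short time energy
        ("mean_ste") and its discretized window features ("features");
        classify_fn maps a feature dict to {class: probability}.
        """
        labels = []
        for w in windows:
            # Step 1: silence detection based on energy
            if w["mean_ste"] < silence_energy:
                labels.append("silence")
                continue
            # Step 2: window features are assumed already extracted and discretized
            # Step 3: apply the model of each audio class, keep the most probable one
            probs = classify_fn(w["features"])
            labels.append(max(probs, key=probs.get))
        # Step 4: smooth the results with a simple rule (fix isolated one-window labels)
        for i in range(1, len(labels) - 1):
            if labels[i - 1] == labels[i + 1] != labels[i]:
                labels[i] = labels[i - 1]
        return labels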

4.3. Video Feature Extraction

A Feature Extraction Task was added in the TREC 2002 Video Track: speech and music are two of the features to be extracted for each shot of the Feature Test video set. After applying our audio classification method, two formulas are used to compute the ranking scores of the speech and music features for each shot [8]:

    RankingScore_speech = (# of windows classified as speech) / (# of windows in the shot)
    RankingScore_music  = (# of windows classified as music) / (# of windows in the shot)
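The per-shot scoring can be stated directly in code. The sketch below is an assumed reading of the two formulas, with shot boundaries given as window index ranges (a hypothetical input format).

    def shot_ranking_scores(window_labels, shot_bounds):
        """Compute the speech and music ranking scores for each shot.

        window_labels: class label of each one-second window of the video.
        shot_bounds:   list of (start, end) window indices, end exclusive.
        """
        scores = []
        for start, end in shot_bounds:
            shot = window_labels[start:end]
            n = max(len(shot), 1)                    # guard against empty shots
            scores.append({
                "speech": sum(lbl == "speech" for lbl in shot) / n,
                "music": sum(lbl == "music" for lbl in shot) / n,
            })
        return scores

    # toy usage: two shots of three windows each
    labels = ["speech", "speech", "music", "silence", "music", "music"]
    print(shot_ranking_scores(labels, [(0, 3), (3, 6)]))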

5. EXPERIMENTS AND EVALUATIONS

In Experiment 1 we compared the accuracy on speech, music and background noise of NN, k-NN, GMM and the maximum entropy model on our own dataset. Thirteen videos were used for evaluation: 3 from the TREC 2002 Video Track dataset, 9 from the TREC 2001 Video Track dataset, and one Shanghai TV News video, for a total duration of 2.46 hours. For the k-NN model we tried k = 3, 5, 7, 11, 25 and obtained the best performance with k = 11. For the GMM we tried 8 and 16 mixtures; the best performance is achieved with 8 mixtures. Table 1 shows the results.

Table 1. Comparison with NN, k-NN and GMM

                  Speech    Music    Background Noise    Total
    Length (s)    6969      1012     245                 8878
    NN            0.906     0.344    0.371               0.809
    11-NN         0.922     0.342    0.246               0.821
    8-mix GMM     0.929     0.493    0.259               0.843
    ME            0.946     0.532    0.415               0.869

In the test data, pure speech is only a small part of all the speech segments. In Experiment 2 we therefore compared the accuracy on pure speech and on speech mixed with other audio types. Table 2 shows the results of this experiment.

Table 2. Accuracy on complicated speech types

                  Pure Speech    Speech + Music    Speech + Noise    Speech + Noise + Music
    Length (s)    1429           3437              1331              977
    NN            0.946          0.890             0.924             0.883
    11-NN         0.965          0.894             0.959             0.910
    8-mix GMM     0.980          0.892             0.972             0.926
    ME            0.987          0.926             0.962             0.938

Experiments 1 and 2 show that the maximum entropy model achieves better performance than k-NN and GMM when using the same features. In this paper we use some widely used audio features for classification. Although different feature combinations have been considered during training, feature selection is still an open problem; we believe more effective features and feature selection methods could further increase the accuracy of our method.

To compare this method with other relevant work, in Experiment 3 we evaluated it on the TREC 2002 Feature Extraction test data set (5.07 hours) with its pooled answers and evaluation standards. The results are given in Table 3 and Table 4. In Table 3, PSWF denotes the precision at shots with the feature and A-Pre the non-interpolated average precision; these are the two measures used in the TREC evaluation [9]. The table lists all submitted systems. In our official submissions we used the NN and GMM methods; after applying the maximum entropy model to the Feature Test dataset, we used the trec_eval program to evaluate its results. Table 4 gives the detailed evaluation of the maximum entropy method using the same measures as TREC. The results show that the maximum entropy method performs better than NN and GMM on the Feature Extraction Task, and we note that it works well for both speech and music.
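The two evaluation measures just mentioned can be illustrated in a few lines. The sketch below shows how non-interpolated average precision and precision at n shots are computed for one feature from a ranked shot list; it is only an illustration of the standard definitions, not the trec_eval program used for the official scoring, and the shot identifiers are hypothetical.

    def average_precision(ranked_shots, relevant):
        """Non-interpolated average precision of a ranked shot list.

        ranked_shots: shot ids ordered by decreasing ranking score.
        relevant:     set of shot ids that truly contain the feature.
        """
        hits, precisions = 0, []
        for rank, shot in enumerate(ranked_shots, start=1):
            if shot in relevant:
                hits += 1
                precisions.append(hits / rank)       # precision at each relevant shot
        return sum(precisions) / len(relevant) if relevant else 0.0

    def precision_at_n(ranked_shots, relevant, n):
        """Fraction of the top-n ranked shots that contain the feature."""
        return sum(shot in relevant for shot in ranked_shots[:n]) / n

    # toy usage
    ranked = ["s3", "s1", "s7", "s2"]
    print(average_precision(ranked, {"s1", "s3"}), precision_at_n(ranked, {"s1", "s3"}, 2))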




Table 3. Video Feature Extraction Evaluation

                   Speech               Music
                   PSWF      A-Pre      PSWF      A-Pre
    System 1       0.6773    0.6420     0.5119    0.2571
    System 2       0.6686    0.6304     0.1720    0.0568
    System 3       0.6686    0.6486
    System 4       0.7214    0.7208
    System 5       0.7142    0.7103
    System 6       0.7164    0.7127
    System 7       0.7019
    System 8
    System 9       0.6758    0.6448     0.7183    0.6370
    Fudan_NN       0.6744    0.6748     0.5625    0.5211
    Fudan_GMM      0.6881    0.6632     0.5633    0.5638
    Fudan_ME       0.7098    0.6925     0.7093    0.6816

Table 4. Detailed Evaluation of the ME Model

    Interpolated recall-precision
    Recall    Speech    Music
    0.0       1.0000    1.0000
    0.1       0.9839    1.0000
    0.2       0.9839    1.0000
    0.3       0.9839    0.9925
    0.4       0.9839    0.9705
    0.5       0.9839    0.9329
    0.6       0.9835    0.8940
    0.7       0.9810    0.8751
    0.8       0.0000    0.0000
    0.9       0.0000    0.0000
    1.0       0.0000    0.0000

    Precision at n shots
    n         Speech    Music
    5         1.0000    1.0000
    10        1.0000    1.0000
    15        1.0000    0.9333
    20        1.0000    0.9500
    30        1.0000    0.9333
    100       1.0000    0.9600
    200       1.0000    0.9700
    500       0.9700    0.9780
    1000      0.8660    0.9810

6. CONCLUSION

In this paper we have presented a novel audio type classification method based on the maximum entropy model. Experimental evaluation shows that the maximum entropy model gives better accuracy than the k-NN and GMM models when using the same features. Combined with simple ranking rules, it is well suited to judging whether a shot in a video contains speech or music. Although different feature combinations have been considered during training, feature selection is still an open problem; in the future we plan to investigate it further to improve the accuracy.

7. ACKNOWLEDGEMENT

The research described in this paper has been supported by Fujitsu Research and Development Center Co., Ltd. and the NSF of China (69935010).

8. REFERENCES

[1] J. Saunders, "Real-Time Discrimination of Broadcast Speech/Music", Proc. of ICASSP '96, vol. II, pp. 993-996, Atlanta, 1996.
[2] N. Patel and I. Sethi, "Audio Characterization for Video Indexing", Proc. of SPIE Storage and Retrieval for Still Image and Video Databases, vol. 2670, pp. 373-384, San Jose, 1996.
[3] T. Zhang and C.-C. J. Kuo, "Content-Based Classification and Retrieval of Audio", Proc. of SPIE Advanced Signal Processing Algorithms, Architectures and Implementations VIII, vol. 3461, pp. 432-443, San Diego, 1998.
[4] E. Scheirer and M. Slaney, "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator", Proc. of ICASSP '97, vol. II, pp. 1331-1334, 1997.
[5] L. Lu, H. Jiang and H. Zhang, "A Robust Audio Classification and Segmentation Method", Proc. of the 9th ACM International Conference on Multimedia, pp. 203-211, 2001.
[6] S. Srinivasan, D. Petkovic and D. Ponceleon, "Towards Robust Features for Classifying Audio in the CueVideo System", Proc. of the 7th ACM International Conference on Multimedia, pp. 393-400, 1999.
[7] S. Della Pietra, V. Della Pietra and J. Lafferty, "Inducing Features of Random Fields", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 380-393, 1997.
[8] L. D. Wu, X. J. Huang et al., "FDU at TREC 2002: Filtering, Q&A, Web and Video Tasks", TREC 2002 Notebooks.
[9] A. Smeaton and P. Over, "The TREC-2002 Video Track Report", TREC 2002 Notebooks.