Feature Selection for Automatic Classification of Musical Instrument Sounds

Mingchun Liu and Chunru Wan
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
Tel: +65 790 6298
E-mail: {P147508078 | Ecrwan}@ntu.edu.sg
ABSTRACT
In this paper, we carry out a study on the classification of musical instruments using a small set of features, selected from a broad range of extracted features by the sequential forward feature selection method. First, we extract 58 features for each record in a music database of 351 sound files. Then, sequential forward selection is adopted to choose the feature set that gives the highest classification accuracy. Experiments on classifying the musical instruments into their families have been conducted using the nearest neighbor (NN) classifier, a modified k-nearest neighbor (k-NN) classifier, and the Gaussian mixture model (GMM), based on the selected best features. An accuracy of up to 93% can be achieved using 19 features.
KEYWORDS:
Sequential forward feature selection, musical instrument classification, feature extraction.
INTRODUCTION
A collection of musical instrument sounds is an obligatory part of a comprehensive music digital library. Automatic musical instrument classification can be very helpful for indexing such a database, as well as for annotation and transcription. In [2], four instruments (guitar, piano, marimba, and accordion) were identified using an artificial neural network or a nearest neighbor classifier. The results of this preliminary work were encouraging, although only temporal features were utilized. In [3], polyphonic music was separated into monophonic parts using comb filters, and the musical instruments were estimated by frequency analysis. More recently, a system for musical instrument recognition was presented that used a wide set of features to model the temporal and spectral characteristics of sounds [1].

Given the large number of audio features available, this paper studies how to choose or combine them to achieve higher classification accuracy. Simply using all available features often does not yield the best performance, because some features give poor separability among classes and others are highly correlated; such features degrade performance when added to the feature vector. Therefore, a sequential forward selection method is adopted to select the best feature set.

THE DATABASE
Musical instruments are commonly sorted into five families according to the nature of their vibration: string, brass, percussion, woodwind, and keyboard. Currently, there are 351 files in the musical instrument database. A brief description of the sound files is given in Table 1.
Table 1. The musical instrument collection

Classes      Instruments
Brass        Fanfare, French horn, trombone, trumpet, tuba
Keyboard     Piano
Percussion   Bell, bongo, chime, conga, cymbal, drum, gong, maraca, tambourine, triangle, timbales, tomtom, tympani
String       Guitar, violin
Woodwind     Oboe, saxophone
FEATURE EXTRACTION
The lengths of the sound files range from 0.1 second to around 10 seconds. Each audio file is divided into frames of 256 samples, with 50% overlap between adjacent frames, and each frame is Hamming-windowed. A set of frame-based measures is extracted from each frame, and the means and standard deviations of these measures over all frames are computed as the final 58 features for each audio file. The 58 features, drawn from three categories, are shown in Table 2: features 1-8 are temporal features, 9-32 are spectral features, and 33-58 are coefficient features.
Table 2. Feature description

Features   Descriptions
1-2        Mean and standard deviation of volume root mean square
3          Volume dynamic ratio
4          Silence ratio
5-6        Mean and standard deviation of frame energy
7-8        Mean and standard deviation of zero crossing ratio
9-10       Mean and standard deviation of centroid
11-12      Mean and standard deviation of bandwidth
13-20      Means and standard deviations of four subband energy ratios
21-22      Mean and standard deviation of pitch
23-24      Mean and standard deviation of salience of pitch
25-28      Means and standard deviations of first two formant frequencies
29-32      Means and standard deviations of first two formant amplitudes
33-58      Means and standard deviations of first 13 Mel-frequency cepstral coefficients
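To make the extraction pipeline concrete, the following is a minimal sketch of the frame-based scheme described above, computing two illustrative measures from Table 2 (zero crossing ratio, features 7-8, and spectral centroid, features 9-10). This is our illustration under the paper's stated parameters (256-sample frames, 50% overlap, Hamming window), not the authors' code; the function names are hypothetical.

```python
import numpy as np

FRAME_LEN = 256          # samples per frame, as stated in the paper
HOP = FRAME_LEN // 2     # 50% overlap between adjacent frames

def frame_features(signal, sample_rate):
    """Split a signal into Hamming-windowed frames and compute two
    illustrative frame-based measures: zero crossing ratio and
    spectral centroid (features 7-10 in Table 2)."""
    window = np.hamming(FRAME_LEN)
    zcr, centroid = [], []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        frame = signal[start:start + FRAME_LEN] * window
        # Zero crossing ratio: fraction of sign changes within the frame.
        signs = np.sign(frame)
        zcr.append(np.mean(signs[:-1] != signs[1:]))
        # Spectral centroid: magnitude-weighted mean frequency.
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(FRAME_LEN, d=1.0 / sample_rate)
        centroid.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return np.array(zcr), np.array(centroid)

def file_features(signal, sample_rate):
    """Aggregate frame-based measures into one vector per audio file,
    as the paper does: the mean and standard deviation of each measure."""
    zcr, centroid = frame_features(signal, sample_rate)
    return np.array([zcr.mean(), zcr.std(), centroid.mean(), centroid.std()])
```

Repeating this for all the measures in Table 2 and concatenating their means and standard deviations yields one 58-dimensional vector per sound file.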
FEATURE SELECTION
The extracted features are normalized by their means and standard deviations. Then, a sequential forward selection (SFS) method is used to select the best feature subset. First, the best single feature is selected based on the classification accuracy it provides. Next, a new feature is added from the remaining features, in combination with the already selected ones, so as to minimize the classification error rate. This process continues until all features have been selected. The SFS method quickly provides a suboptimal feature set; in comparison, an exhaustive search is impractical because of the exorbitant computation time it would require in this application.
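The greedy search might be sketched as follows (our illustration, with hypothetical names; `score` stands for whatever classifier accuracy estimate is being optimized, e.g. NN accuracy on a validation split):

```python
import numpy as np

def sequential_forward_selection(X, y, score):
    """Greedy SFS: start from the empty set and repeatedly add the single
    remaining feature that, combined with those already selected, gives
    the best score. Returns the best subset found along the way.
    X: (n_samples, n_dims) feature matrix; y: class labels;
    score: callable(X_subset, y) -> classification accuracy."""
    # Normalize each feature by its mean and standard deviation,
    # as the paper does before selection.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    selected, remaining = [], list(range(X.shape[1]))
    history = []  # (feature subset, accuracy) after each addition
    while remaining:
        best = max(remaining, key=lambda f: score(X[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(X[:, selected], y)))
    return max(history, key=lambda item: item[1])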
EXPERIMENTS
The database is split into two equal parts: one for training and the other for testing. Three classifiers (NN, modified k-NN, and GMM) are used to classify the musical instruments. In the modified k-NN, we first find the k (k = 3 in this paper) nearest neighbors within each class, instead of within the whole training set; the mean distances of these neighbors are calculated and sorted, and the test feature vector is assigned to the class corresponding to the smallest mean. The classification accuracy versus feature dimension for the three classifiers, using the combination of temporal, spectral, and coefficient features, is shown in Figure 1. The performance increases rapidly as features are added at the beginning, then remains more or less constant and may even decrease beyond a certain number of features. Other experiments using temporal, spectral, or coefficient features alone, or any combination of them, show a similar phenomenon. The classification accuracies of the best feature sets and the corresponding numbers of features are listed in Table 3. The best feature sets differ across classifiers. Among all the experiments, the modified k-NN classifier using 19 features (6 temporal, 8 spectral, and 5 coefficient) achieves the highest accuracy, 93%.
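The modified k-NN decision rule described above can be sketched as follows (our illustration; the Euclidean distance metric is an assumption, as the paper does not state one):

```python
import numpy as np

def modified_knn(x, train_X, train_y, k=3):
    """Modified k-NN as described in the paper: find the k nearest
    neighbors within each class (rather than over the whole training
    set), then assign the class whose mean k-neighbor distance is
    smallest."""
    best_class, best_mean = None, np.inf
    for c in np.unique(train_y):
        class_vecs = train_X[train_y == c]
        dists = np.linalg.norm(class_vecs - x, axis=1)
        mean_k = np.sort(dists)[:k].mean()  # mean of the k smallest distances
        if mean_k < best_mean:
            best_class, best_mean = c, mean_k
    return best_class
```

With k = 1, each class contributes only its single nearest neighbor, so the rule reduces to ordinary NN; the paper uses k = 3.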
Table 3. Classification performance: classification accuracy (feature number)

Feature set                          NN         k-NN       GMM
Time                                 0.73 (7)   0.73 (6)   0.71 (6)
Frequency                            0.80 (8)   0.82 (10)  0.80 (11)
Coefficient                          0.86 (17)  0.85 (13)  0.80 (17)
Time and coefficient                 0.90 (14)  0.86 (9)   0.84 (14)
Frequency and coefficient            0.89 (29)  0.91 (15)  0.85 (17)
Time and frequency                   0.85 (12)  0.87 (7)   0.81 (8)
Time, frequency, and coefficient     0.91 (13)  0.93 (19)  0.87 (22)
Figure 1. Classification accuracy versus feature dimension for the testing patterns (curves for NN, k-NN, and GMM; feature dimension from 0 to 60 on the x-axis, classification accuracy from 0.5 to 1 on the y-axis).
CONCLUSION
In this paper, we use a sequential forward feature selection scheme to pick the best feature set, drawn from the temporal, spectral, and coefficient feature spaces singly or in any combination, for classifying musical instruments into five families. A simple classifier using a small set of features can achieve satisfactory results. Since the number of features is reduced, less computation time is required for classifying musical instrument sounds, which is beneficial for real-time applications such as sound retrieval from large databases.

REFERENCES
[1] Eronen, A.; Klapuri, A. Musical instrument recognition using cepstral coefficients and temporal features. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 63-71, 2000.
[2] Kaminsky, I.; Materka, A. Automatic source identification of monophonic musical instrument sounds. IEEE International Conference on Neural Networks, Vol. 1, pp. 189-194, 1995.
[3] Miwa, T.; Tadokoro, Y. Musical pitch estimation and discrimination of musical instruments using comb filters for transcription. 42nd Midwest Symposium on Circuits and Systems, Vol. 1, pp. 105-108, 1999.