XII INTERNATIONAL CONFERENCE - SYSTEM MODELLING and CONTROL SMC'2007, OCTOBER 17-19, 2007, Zakopane, Poland
Application of a Fast Orthogonal Neural Network in Content-based Music Genre Classification
Bartłomiej Stasiak and Mykhaylo Yatsymirskyy
Institute of Computer Science, Technical University of Lodz, ul. Wolczanska 215, 93-005 Lodz, Poland
[email protected],
[email protected]
Abstract

The paper presents a novel approach to automatic music genre recognition based on a fast orthogonal neural network. The details of the construction of a hybrid neural network performing both the feature extraction and classification tasks are presented. The proposed solution is successfully tested on a dataset containing 500 music examples from five different categories.
Introduction

The problem of automatic audio material classification has been studied for decades, and speech recognition is one of the first and most prominent examples. A more general approach involves determining the type of the analyzed audio data on the basis of a few predefined broad categories such as speech, silence, laughter, music, animal, mechanical or other environmental sounds [1]. For this purpose several multi-stage systems have been proposed [2]. Determining the genre of music is the next step of a detailed audio analysis which, on account of the rapidly growing on-line accessibility of huge multimedia collections, has been gaining much attention recently [1, 3].

A standard general approach in most classification problems is to reduce the dimensionality of the input space by defining a relatively small number of information-rich features. This is the most crucial stage of a classification system, as the choice of features usually influences its overall performance much more than the actual classification method [3]. A feature set that is too compact may miss important discriminative properties of the analyzed data, thus leading to poor classification results even on the dataset used to train the classifier. On the other hand, too many features, while enabling a very good representation of the training data, may at the same time dramatically increase misclassification of the testing set. This phenomenon, known as the curse of dimensionality, results from the rapidly growing sparseness of the groups of data in a highly dimensional feature space [4]. High generalization error values may also result from applying too complex a classifier to a relatively simple problem. In such a situation the classifier memorizes the details of the training samples rather than learning a proper general representation of each class. In the case of neural networks with hidden layers this usually happens when the number of hidden neurons is too high. One possible solution to this problem is to apply a pruning technique that removes some neural connections, which usually lowers the generalization error [4].
The most important issue in the field of feature extraction is, however, of a qualitative rather than quantitative nature. In most cases human expert knowledge and experience are necessary to define a truly descriptive feature set that is compact, easily computable and comprehensive. For audio classification tasks, the feature extraction methods presented and successfully applied so far include frequency-domain techniques based on the short-time Fourier transform (STFT), such as the spectral centroid, spectral rolloff, spectral flux and Mel-frequency cepstral coefficients (MFCC). Narrowing the audio category enables the application of other, more specific features, such as rhythmic and pitch content descriptors for music analysis [1, 3].

The most convenient conceivable audio data classifier would work directly on raw sound samples, without the need to extract any features. This may seem unrealistic, mainly due to the aforementioned input space dimensionality problem, where the raw sound data would play the role of an extremely long feature vector. It should be noted, however, that a typical classification system works just this way, performing a mapping from the original input space to the output space of class labels, with the feature space as an intermediate representation. In adaptable systems usually only the second part of the mapping, from feature space to class labels, is adapted, while the first part, from input space to feature space, is fixed according to some preset, more or less sophisticated and domain-dependent feature extraction algorithms.

In this paper we propose a new approach to fully automatic and adaptable feature extraction and music genre classification based on a fast orthogonal neural network (FONN) [5]. We show that the reduced neural architecture of the FONN makes it possible to substantially lower the generalization error in comparison with a standard multilayer perceptron (MLP) and to reach a classification accuracy comparable to classical feature-based solutions. We also show a method of building a system with a hybrid neural network performing both the feature extraction and classification tasks.
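For illustration, the following minimal numpy sketch computes one of the STFT-derived features mentioned above, the spectral centroid, i.e. the magnitude-weighted mean frequency of each analysis frame; the frame length and hop size are arbitrary illustrative choices, not values taken from the cited systems:

    import numpy as np

    def spectral_centroid(signal, sample_rate, frame_len=512, hop=256):
        # Magnitude-weighted mean frequency of each STFT frame.
        window = np.hanning(frame_len)
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
        centroids = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            magnitude = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
            if magnitude.sum() > 0:
                centroids.append((freqs * magnitude).sum() / magnitude.sum())
        return np.array(centroids)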
The architecture of the proposed and the reference systems

The proposed system consists of two main blocks: a feature extractor and a classifier. The first block is implemented as a fast orthogonal network of size N, where N is the number of processed sound samples. The most important characteristics of this type of neural network are [5]: 1) a sparse neural connection scheme based on a fast algorithm of an orthogonal transform; 2) a computational complexity of N·log(N), where N is the size of the input space, as compared to N² for a typical, densely interconnected neural layer; 3) the applicability of efficient gradient backpropagation algorithms for adaptation. Fast orthogonal neural networks, first proposed in [5], consist of basic operation orthogonal neurons (BOONs) with two outputs orthogonal to each other. The BOONs are connected in a special way which guarantees that the network is able to learn a fast orthogonal transform such as the fast Fourier transform (FFT) or the fast cosine transform (FCT). In this work we applied a FONN based on a two-stage algorithm of the fast cosine transform, type II, with tangent multipliers (mFCT2). The BOONs used in both stages are presented in Fig. 1, and the gradient and error values for the second stage are computed during the teaching process with formulas (1) and (2) given in [5].
In these formulas, a and b denote the inputs of the BOON, t is the adaptable weight, and e1, e2 correspond to the error values propagated back from the next layer to the current one (superscript n) and from the current layer to the previous one (superscript n−1). More details concerning the application of formulas (1) and (2) may be found in [5].

Fig. 1. The types of basic operation orthogonal neurons used in the proposed system
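Since the exact forms of (1) and (2) depend on the BOON definitions of Fig. 1 and [5], which are not reproduced here, the following sketch only illustrates the general scheme under an assumed representative forward operation u = a + t·b, v = t·a − b, whose output directions (1, t) and (t, −1) are mutually orthogonal; the backward pass is then the plain chain rule for this assumed map:

    class BOON:
        # Basic operation orthogonal neuron with one adaptable weight t.
        # Assumed forward map (illustrative, not the exact form of [5]):
        #   u = a + t*b,  v = t*a - b
        def __init__(self, t=0.0):
            self.t = t

        def forward(self, a, b):
            self.a, self.b = a, b               # cache inputs for backward
            return a + self.t * b, self.t * a - b

        def backward(self, e1, e2):
            # e1, e2: errors arriving at outputs u, v from the next layer
            grad_t = e1 * self.b + e2 * self.a  # dE/dt
            e_a = e1 + self.t * e2              # error passed back to input a
            e_b = self.t * e1 - e2              # error passed back to input b
            return grad_t, e_a, e_b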
Each neuron in the second stage therefore contains one adaptable weight t determining the values of both of its outputs; the first stage is excluded from adaptation. The architecture of the FONN constituting the first block of the proposed classification system is presented in Fig. 2, part a). This is only an example for the eight-dimensional input space, which may easily be extended to N = 16, 32, ... . Note that the sequence of elements of the input vectors should be decimated. Both stages contain three layers, and the number of adaptable BOONs in the second stage equals 12, which is far less than would be needed by a fully connected dense layer with 8 inputs and outputs. The set of all realizable linear transforms includes the cosine transform, type II, given as [5, 6]:
$$ y_k = \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N}, \qquad n, k = 0, 1, \ldots, N-1 \qquad (3) $$

The second block, i.e. the classifier shown in Fig. 2, part b), does not actually constitute a separate module. Since the FONN-based feature extractor is a multi-layer feed-forward neural network trained to minimize the mean square error of its outputs, it is possible to connect it to additional output layers of different architecture performing the data classification task. Both parts of the network are then jointly trained, with error values passed back from the classifier to the feature extractor.

Fig. 2. The architecture of the proposed network for N = 8 and K = 2. The division between the feature extractor a) and the classifier b) is marked with a dotted line
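For reference, transform (3) can be computed directly in O(N²) time; a trained extractor can be compared against this naive implementation to check how closely it has converged to the cosine transform. The unnormalized form of (3) is used here, so results from library routines such as scipy.fft.dct may differ by a constant factor:

    import numpy as np

    def dct2_direct(x):
        # Unnormalized DCT-II computed straight from formula (3).
        N = len(x)
        n = np.arange(N)
        return np.array([np.sum(x * np.cos(np.pi * (2 * n + 1) * k / (2 * N)))
                         for k in range(N)])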
The first layer of the classifier contains N simple bias nodes with two inputs, connected to a feature extractor output and to a constant bias value +1, respectively. The output values of the bias nodes are then transformed with a non-linear, unipolar sigmoid function. The next layer consists of K biased neurons with a sigmoidal activation function, each connected to all outputs of the preceding layer. This is the last layer of the classifier, and each of its outputs corresponds to one of the K target classes.
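A forward pass through this classifier head can be sketched as follows; the weight names are illustrative (w_node and b_node are the two weights of each bias node, W_out and b_out the dense output layer):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def classifier_forward(features, w_node, b_node, W_out, b_out):
        # features: (N,) outputs of the FONN feature extractor
        # w_node, b_node: (N,) per-node weights for the input and the +1 bias
        # W_out: (K, N) dense output layer, b_out: (K,) biases
        hidden = sigmoid(w_node * features + b_node)   # N two-input bias nodes
        return sigmoid(W_out @ hidden + b_out)         # K class outputs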
The feature extractor and the classifier jointly constitute a neural network similar to a non-linear multilayer perceptron, with the difference lying in the sparseness of the hidden layer, implemented as a FONN block in our approach. This similarity may be further observed by comparison with the MLP diagram shown in Fig. 3.

Fig. 3. The architecture of the reference network (MLP) with eight hidden nodes
The neural network presented in Fig. 3 has also been implemented and applied as the point of reference for the proposed FONN-based solution. It should be noted that in a classical multilayer perceptron the input vectors are passed to the dense layer directly. We decided to apply the initial sparse layers as a preliminary block for two reasons. Firstly, they are not subjected to adaptation, so they may simply be seen as a fixed data preprocessing block, identical in both compared neural networks; in this way the adaptable parts of both systems work under the same conditions. Secondly, the simulation tests clearly showed that, in the case of direct audio data analysis, this block substantially enhances the classification results on the training dataset. The most important difference between the two systems, i.e. the FONN-based and the MLP-based classifiers, is the number of weights N_w subjected to adaptation, given as:
$$ N_w = \frac{N}{2}\log_2 N + 2N + K(N+1) \qquad (4) $$

in the case of the FONN, and:

$$ N_w = H(N+1) + K(H+1) \qquad (5) $$

in the case of the MLP, where H is the number of hidden nodes. For example, assuming N = 256 and K = 5, the number of weights of an MLP-based network as a function of the number of hidden neurons H is presented in Table 1. The number of weights in the case of a FONN-based network is fixed and equals 2821 for the same values of N and K, which is less than is needed by an MLP with as few as 11 hidden nodes (a short computational check of (4) and (5) follows Table 1).

Table 1. The number of weights of a multilayer perceptron for different numbers of hidden neurons
H      Weights in hidden layer   Weights in output layer   Total weights
2      514                       15                        529
4      1028                      25                        1053
8      2056                      45                        2101
12     3084                      65                        3149
16     4112                      85                        4197
24     6168                      125                       6293
32     8224                      165                       8389
48     12336                     245                       12581
64     16448                     325                       16773
100    25700                     505                       26205
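The following short check, assuming the decomposition of (4) into (N/2)·log2(N) second-stage BOON weights, 2N bias-node weights and K(N+1) output-layer weights, reproduces both the stated total of 2821 and the values of Table 1:

    import math

    def fonn_weights(N, K):
        # Eq. (4): second-stage BOONs + bias nodes + dense output layer
        return (N // 2) * int(math.log2(N)) + 2 * N + K * (N + 1)

    def mlp_weights(N, K, H):
        # Eq. (5): dense hidden layer + dense output layer, both biased
        return H * (N + 1) + K * (H + 1)

    print(fonn_weights(256, 5))            # 2821
    for H in (2, 4, 8, 12, 16, 24, 32, 48, 64, 100):
        print(H, mlp_weights(256, 5, H))   # reproduces Table 1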
The goal of the supervised learning process was to minimize the mean square error of the outputs of the network, where each target vector contained all zeros except for a single position, set to 1, defining the proper class. A conjugate gradient algorithm with directional minimization based on third-order polynomial approximation [4] was used for training both the proposed and the reference classifier systems.
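The target encoding and error measure amount to the following minimal sketch; the conjugate gradient optimizer itself is described in [4] and not reproduced here:

    import numpy as np

    def one_hot_targets(labels, K):
        # Target vectors: all zeros except a 1 at the correct class position.
        t = np.zeros((len(labels), K))
        t[np.arange(len(labels)), labels] = 1.0
        return t

    def mean_square_error(outputs, targets):
        return np.mean((outputs - targets) ** 2)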
Audio dataset and classification procedure

The audio material used for all the tests is composed of 500 short (60 s) examples of music, equally divided into five classes, i.e. orchestral, piano, vocal, chamber (string quartet) and jazz, where each excerpt comes from a different piece or part. Of this number, 400 examples (80 per class) constitute the training set and the remaining 100 (20 per class) are used for testing. All the examples, downloaded from a site offering free excerpts of classical music on-line [7], are mono recordings with a sampling frequency of 22050 Hz and 16 bits per sample. Every training excerpt is represented by 12 stretches, each containing 256 samples in the time domain (approximately 12 ms), and every testing example is represented by 99 stretches of the same length. This makes a training input set of size 4800x256 and a testing input set of size 9900x256. All the stretches are cut out from random positions in the audio files, excluding 5-second margins at the beginning and at the end. Two additional stretches are cut from every training example to form a validation set of size 800x256, which is used to visualize the generalization error during learning. All the stretches are normalized and the constant component is removed. All the stretches from a single class of the training set were treated equally during the training process, irrespective of whether they came from the same or from different excerpts. The classification of the test set involved the analysis of all the 99 stretches representing a single example. For each example the mean vector of all the 99 network output vectors was computed, and the position of the element with the highest value determined the assigned class label. Two groups of classification tests were performed in order to compare the abilities of the FONN-based classifier with those of a traditional multilayer perceptron (MLP). The second group, investigating the perceptron, involved several tests for different numbers of hidden neurons.
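The preparation of a single example and the test-time decision rule can be sketched as follows; the exact normalization used in the experiments is not specified, so peak normalization is assumed here for illustration:

    import numpy as np

    def cut_stretches(samples, n_stretches, length=256,
                      sample_rate=22050, margin_s=5, rng=None):
        # Fixed-length stretches cut from random positions,
        # excluding 5-second margins at both ends.
        rng = rng or np.random.default_rng()
        margin = margin_s * sample_rate
        out = []
        for _ in range(n_stretches):
            start = rng.integers(margin, len(samples) - margin - length)
            s = samples[start:start + length].astype(float)
            s -= s.mean()                    # remove the constant component
            peak = np.abs(s).max()
            if peak > 0:
                s /= peak                    # normalization (assumed: peak)
            out.append(s)
        return np.stack(out)

    def classify_example(network_outputs):
        # network_outputs: (99, K) outputs for one example's stretches;
        # the assigned class is the argmax of the mean output vector.
        return int(np.mean(network_outputs, axis=0).argmax())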
Experimental results and discussion

The basic problem encountered during all the simulation experiments is inherently connected with the properties of gradient optimization methods, namely the risk of getting stuck in a local minimum. Therefore, of the total number of 20 tests performed for every classifier and every number of MLP hidden neurons, only the best 30% of the results, in the sense of testing-dataset classification, were taken into account. In most of the remaining tests the learning error stopped decreasing quite early, indicating a local minimum, and the classification outcome was close to random classification (20%). Table 2 contains the mean values of the selected results, and the best case obtained for each classifier is presented in Table 3.

Table 2. The results of the classification tests (mean values)
Classifier     Error (train)   Epochs    Epoch length [s]   Recognition rate (train)   Recognition rate (test)
MLP (H = 24)   0.364           209.17    1.417              48.31%                     32.33%
MLP (H = 32)   0.341           275.67    2.009              58.11%                     32.17%
MLP (H = 48)   0.319           300.17    3.089              60.93%                     27.50%
MLP (H = 64)   0.258           405.67    4.075              72.84%                     27.67%
FONN           0.275           1589.17   2.138              76.22%                     76.17%
Table 3. The results of the classification tests (the best case)

Classifier     Error (train)   Epochs   Epoch length [s]   Recognition rate (train)   Recognition rate (test)
MLP (H = 24)   0.364           199      1.433              49%                        37%
MLP (H = 32)   0.337           339      2.110              60%                        36%
MLP (H = 48)   0.362           80       3.709              50%                        32%
MLP (H = 64)   0.238           499      4.022              78%                        34%
FONN           0.277           1879     2.507              76%                        78%
The basic observation resulting from the performed tests was that the FONN-based classifier was able to learn the training data with an accuracy exceeding 75%. In line with our expectations, the training results of the multilayer perceptron depended greatly on the number of hidden neurons applied. It appeared impossible to reach even a 50% level of recognition with H < 24. Training-set recognition comparable to that achieved by the FONN required at least 64 hidden units. However, in this case the validation error tended to increase significantly after reaching a minimum as the learning process continued. This was inevitably reflected in the poor recognition of the testing set, falling to an almost random level of 27.67%. Surprisingly, the result of the testing-set classification was usually not better when the training was stopped at the minimum of the validation error. On the contrary, reaching the lowest possible error on the training dataset most often ensured the best classification of the test dataset, despite the relatively high validation error. This results from the fact that more polarized network output values, i.e. closer to 0 or 1, which occur more often near the end of the learning process, may yield a substantially higher mean square error with an identical or even better classification result. For this reason we decided to train the network up to the saturation point for all MLP and FONN-based classifiers.

The most important observation was that the effect of growing validation error did not take place in the case of the FONN classifier (Fig. 4). In most cases the validation error decreased initially and stayed almost constant for the rest of the learning process. Although only the learning error continued to decrease, usually reaching much lower final values than the validation error, the classification accuracy on the testing set continued to improve and reached a level comparable to the results obtained for the training set (over 75%).

Fig. 4. The typical error curves on the training set (left) and on the validation set (right) for the MLP with 64 hidden neurons (top) and the FONN (bottom)
This result is comparable to the work of Tzanetakis and Cook [1], who reached 88% classification on a dataset containing only four classes (choir, orchestra, piano, string quartet) by applying a rich and complex feature set based on both the time and the frequency domain. It is worth noting that the dataset used in our research additionally contained a quite heterogeneous jazz class, which also includes vocal and piano fragments. The confusion matrix for the best trained FONN classifier is presented in Table 4. Each row of the confusion matrix corresponds to the actual class, each column shows the classification result, and the diagonal contains the numbers of correctly recognized examples. It may be seen that the results roughly reflect the diversity of sound generators present in each class. Indeed, the most homogeneous piano excerpts are recognized perfectly, while the jazz and orchestral examples, containing a variety of instrumental sounds from percussion to stringed and wind instruments, pose some classification problems.

Table 4. Confusion matrix for the FONN-based classifier

             chamber   jazz   orchestral   piano   vocal
chamber        15        3        1           1       0
jazz            3       13        2           2       0
orchestral      1        2       12           4       1
piano           0        0        0          20       0
vocal           1        0        0           1      18
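Each row of Table 4 sums to the 20 test examples per class; a short numpy check recovers the best-case test accuracy of 78% reported in Table 3, together with the per-class results discussed above:

    import numpy as np

    confusion = np.array([
        [15,  3,  1,  1,  0],   # chamber
        [ 3, 13,  2,  2,  0],   # jazz
        [ 1,  2, 12,  4,  1],   # orchestral
        [ 0,  0,  0, 20,  0],   # piano
        [ 1,  0,  0,  1, 18],   # vocal
    ])

    accuracy = confusion.trace() / confusion.sum()            # 0.78
    per_class = confusion.diagonal() / confusion.sum(axis=1)  # e.g. piano: 1.0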
It should be stressed that the proposed classification system performs a blind analysis, assuming no a priori knowledge of the properties of the analyzed data. Moreover, no long-term audio characteristics are taken into account, as the randomly cut input vectors are relatively short. The main disadvantages of the proposed solution are the long training time and the risk of hitting a local minimum. It is worth noting, however, that in most cases it is quite easy to detect a probable local minimum at the very beginning of the learning process and to restart it. As for the long computation time, it may be counterbalanced in many possible applications by the very short time of forward propagation in the phase of testing the previously trained network, resulting from the lack of any feature computation procedures. The values obtained during the tests performed on a computer with an Intel Celeron M 1.40 GHz processor were usually close to 19 ms for a whole music example, which gives about 0.19 ms for a single stretch.
Conclusion

A fast orthogonal neural network was applied as part of a neural classification system performing successful, fully automatic feature extraction and classification of an audio database containing 500 examples representing five different music genres. The proposed system was compared to a standard multilayer perceptron and showed definite superiority in terms of generalization error. The classification results obtained on the testing dataset were on the same level as for the training data, which indicates that the presented architecture of the FONN-based classifier is probably close to optimal for the given audio classification task, and there seems to be no need for any additional pruning of neural connections. This may be a consequence of the applicability of the presented neural connection scheme to computing the cosine transform, which is closely related to the Fourier transform, one of the most powerful audio analysis tools used in signal processing [6]. The classification capabilities of the presented neural network, reaching a total classification accuracy exceeding 75%, are comparable to the results obtained by other authors with the aid of systems based on manually defined feature sets. Owing to the absence of feature computation, the presented system, while requiring relatively much time for the training process, offers quick operation in the testing phase.

Future work will concentrate on applications of fast orthogonal neural networks to classification tasks involving other types of multimedia content, such as still images. Investigating the possibility of constructing a biometric identification system based on voice samples and face images is also planned as part of future research.

References

[1] G. Tzanetakis, P. Cook, Musical Genre Classification of Audio Signals, IEEE Trans. Speech Audio Process., Vol. 10, No. 5, pp. 293-302, 2002
[2] L. Lu, H. J. Zhang, Content Analysis for Audio Classification and Segmentation, IEEE Trans. Speech Audio Process., Vol. 10, No. 7, pp. 504-516, 2002
[3] T. Li, M. Ogihara, Q. Li, A Comparative Study on Content-based Music Genre Classification, in Proc. of the 26th Annual International ACM Conference on Research and Development in Information Retrieval SIGIR'03, ACM Press, 2003, pp. 282-289
[4] S. Osowski, Neural Networks for Information Processing (in Polish), OWPW, Warsaw, 2000
[5] B. Stasiak, M. Yatsymirskyy, Fast Orthogonal Neural Networks, Lecture Notes in Artificial Intelligence No. 4029, ISBN 3-540-35748-3, Springer Verlag, 2006, pp. 142-149
[6] A. Czyżewski, Digital Sound (in Polish), EXIT Academic Publishing House, Warsaw, 2001
[7] http://www.classical.com/