2004 IEEE Workshop on Machine Learning for Signal Processing

A HIERARCHICAL MODULAR CLASSIFIER FOR MUSICAL INSTRUMENTS

A.M. Fanelli, L. Caponetti, G. Castellano, C.A. Buscicchio
Università degli Studi di Bari, Dipartimento di Informatica
Via E. Orabona 4, 70126, Bari, Italy
e-mail: {fanelli, caponetti, castellano, buscicchio}@di.uniba.it

ABSTRACT

A hierarchical modular classifier is proposed for the classification of musical instruments from the sounds they produce. In order to increase the classification performance, each module of our classifier operates on a distinct subset of input features. Each module is a fuzzy inference system with the capability of learning fuzzy rules from data: during training it encodes the knowledge learned from data in the form of fuzzy if-then rules. Before the classification process, sound data are pre-processed and a compact time-frequency representation is extracted. Preliminary experimental results compare favorably with human performance on the same classification task and demonstrate the utility of both the modular architecture and the hierarchical classification process.

1. INTRODUCTION

Recognizing sound sources in a complex environment is arguably the primary function of the human auditory system. Recognition is possible, in part, because acoustic features of sounds often betray physical properties of their sources. While humans can become skilled at identifying the types of sound sources, no artificial system, at present, has been built that can demonstrate the same competence. As a consequence, much research attention is devoted to creating artificial systems that can learn to recognize sound sources in complex auditory environments [1]. There are many applications in which automatic sound source identification would be useful. For example, it would be useful to build intelligent systems that can annotate [2] or transcribe music [3][4], to build safety systems based on the recognition of particular sound sources (e.g. the human voice), to compress sound data and, finally, to investigate the human processes of sound source recognition. Current research in this area is still mainly based on Helmholtz's study [5] concerning the definition of musical "timbre" and the relative perceptual importance of various acoustic features of musical instrument sounds. The goal of this work is to develop an artificial system that automatically classifies and recognizes musical instruments from the sounds they produce. In particular, in this paper we extend the classification of families of orchestral musical instruments, introduced in [6], to the classification of each instrument belonging to an instrument family. To this aim we design a classifier which identifies the sound of musical instruments using a hierarchical classification process.



According to this approach, a 'many-class' problem is decomposed into a hierarchy of 'few-class' sub-problems, grouped at different levels of abstraction. The hierarchical approach recalls the taxonomic model used by human beings to perform classification. Indeed, when we recognize an object in the environment, first we recognize it at a high level of abstraction, and then we proceed to more concrete levels of recognition. As recognition proceeds down the branches of the taxonomy, each decision is relatively simple, requiring discrimination among a small number of classes [4]. To address the problem of high input dimensionality, each classifier in the hierarchy (both at higher levels and at lower levels) has a modular structure, i.e. it is an ensemble of neuro-fuzzy networks. Each neuro-fuzzy network is trained to classify patterns using only a separate region of the input space. After learning, each network encodes in the form of fuzzy rules the mapping between its sub-region of the input space and the classes. Then, the classification responses of the neuro-fuzzy networks in the same ensemble are properly combined to provide the response of the modular classifier. The use of such a modular approach, also justified on neurobiological grounds [7], permits the formation of high-order computational units that can perform complex tasks such as musical instrument identification. Moreover, one key advantage is the reduction of the computational complexity of the learning process, which is globally more affordable than training a single large network to solve the task as a whole. In addition, the integration of a fuzzy reasoning scheme and a neural network helps to develop explicit rather than implicit classification schemes and to quantify the vagueness that can exist both in musical sounds themselves and in the rules governing the classification mechanism. According to the proposed approach, both the classification problem (using a hierarchical classification process) and the input dimensionality problem (using a modular structure) are divided into sub-problems. The paper is organized as follows. Section 2 describes the pre-processing of musical data and feature extraction. Section 3 illustrates the architecture of the proposed hierarchical modular neuro-fuzzy classifier. Section 4 presents the experimental results, followed by conclusions in Section 5.

2. DATA PRE-PROCESSING AND FEATURE EXTRACTION

We consider the sound waveform x(t) obtained from the sound of a musical instrument sampled over a time T. Typically, four regions can be identified in a sound waveform according to its energy (as shown in Figure 1):

A. attack: energy increases from zero to its maximum value
B. decay: energy decreases to a stable value
C. sustain: energy remains stable
D. release: energy decreases to zero

Since the sound waveform is not very "regular" in time, it is important to take into account spectral features and the time at which they occur in the signal. Hence, both temporal and spectral features should be extracted from the sampled signal. In Figure 1, a sampled sound waveform is shown.
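The paper gives no algorithm for locating these four regions; as a rough, hypothetical illustration only, the energy envelope on which an attack/decay/sustain/release segmentation could be defined can be estimated with a short-time RMS computation such as the following sketch (the frame length, hop size and the 10% threshold are arbitrary assumptions, not taken from the paper).

```python
import numpy as np

def energy_envelope(x, frame_len=512, hop=256):
    """Short-time RMS energy of a sampled waveform x (illustrative only)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env

# Hypothetical usage: locate the attack as the rise from near-silence to the energy peak.
x = np.random.randn(32000) * np.hanning(32000)        # stand-in for a real instrument sound
env = energy_envelope(x)
peak = int(np.argmax(env))
attack_start = int(np.argmax(env > 0.1 * env[peak]))  # first frame above 10% of the peak
print(f"attack spans frames {attack_start}..{peak} of {len(env)}")
```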



Figure 1. Waveform and energy regions of a trumpet sound.

To perform time-frequency analysis we use the short-time Fourier transform (STFT), which is equivalent to a filterbank where the filter channels are linearly spaced and all channels have the same bandwidth. The STFT is computed by dividing the original signal x(t) into S segments (called frames), and then applying the Fourier transform to each segment s = 1,..., S. The result X_s(f) is a spectrogram, which provides the frequency spectrum of the signal in every frame. The spectrogram provides a very large amount of frequency-time information, so for each frame a smaller number of spectral features is extracted by integrating X_s(f) over some particular frequency bands. These bands are selected on the basis of their biological relevance in simulating the frequency response of the human cochlea. We divide the frequency range [100, 16000] Hz into B bands (see Figure 2) using the Equivalent Rectangular Bandwidth (ERB) scale [8][9][10]. Then, spectral information is reduced by integrating the spectral magnitude |X_s(f)| of the s-th segment over the frequencies within each ERB band. Hence, for each frame s = 1,..., S and for each ERB band b = 1,..., B we compute the band loudness:

L_s^b = \int_{f_b - \Delta_b}^{f_b + \Delta_b} |X_s(f)| \, df        (1)

where f_b and \Delta_b are the center and the half-width of the b-th band, respectively. L_s^b can be regarded as the intensity of the sound signal within a window of frequencies having the width of an ERB band. Moreover, we consider only 1 sec of the sampled signal x(t) (i.e., since the sampling frequency is 32 kHz, we take the first 32000 sound samples) and compute the STFT with a window width of 1/20 sec (i.e. 1600 samples). An overlap of 6.25 msec (i.e. 200 samples) between windows is also used. As a result, we obtain S = 20 frames. This is not very restrictive, because typically a classical musical instrument sound reaches its "steady state" energy in less than 1 sec.
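The exact band layout used in the paper follows the ERB scale of [8][9][10]. As a minimal sketch, the following code lays out B = 24 bands with edges equally spaced on the ERB-rate scale over [100, 16000] Hz, using the common Glasberg-Moore approximation of the ERB-rate function; the specific constants and the choice of half-widths are assumptions made for illustration, not the paper's exact filterbank.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """ERB-rate (number of ERBs below f), Glasberg-Moore approximation."""
    return 21.4 * np.log10(1.0 + 4.37e-3 * f_hz)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def erb_bands(f_lo=100.0, f_hi=16000.0, n_bands=24):
    """Return (centers f_b, half-widths Delta_b) of bands equally spaced on the ERB-rate scale."""
    edges = erb_rate_to_hz(np.linspace(hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi), n_bands + 1))
    centers = 0.5 * (edges[:-1] + edges[1:])
    half_widths = 0.5 * (edges[1:] - edges[:-1])
    return centers, half_widths

centers, half_widths = erb_bands()
print(centers[:3], half_widths[:3])   # the lowest bands are narrow, the highest are wide
```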


The result is a cochleagram (Figure 3), which represents, in the time domain, the loudness of the sound signal within all the 24 critical bands. Hence, the total number of resulting features is 24 x 20 = 480.
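Putting the pieces together, a minimal sketch of the cochleagram computation (equation (1) applied frame by frame) could look as follows. It assumes a 32 kHz signal truncated to 32000 samples, 1600-sample Hann windows with a 200-sample overlap, and simply keeps the first S = 20 frames; the band edges are again laid out on the ERB-rate scale as in the sketch above, and the integral in (1) is approximated by a sum over FFT bins. None of these implementation details beyond the stated parameters are confirmed by the paper.

```python
import numpy as np

def cochleagram(x, fs=32000, win=1600, overlap=200, n_frames=20, n_bands=24):
    """24 x 20 matrix of band loudness values L[b, s] (a sketch of eq. (1))."""
    x = x[:fs]                                   # first 1 second of the signal
    hop = win - overlap
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    # Band edges equally spaced on the ERB-rate scale (Glasberg-Moore approximation);
    # the paper's exact band layout follows [8][9][10].
    erb = lambda f: 21.4 * np.log10(1.0 + 4.37e-3 * f)
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    edges = erb_inv(np.linspace(erb(100.0), erb(16000.0), n_bands + 1))
    L = np.zeros((n_bands, n_frames))
    for s in range(n_frames):
        frame = x[s * hop : s * hop + win]
        if len(frame) < win:                     # zero-pad a short final frame
            frame = np.pad(frame, (0, win - len(frame)))
        spectrum = np.abs(np.fft.rfft(frame * window))
        for b in range(n_bands):
            mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
            L[b, s] = spectrum[mask].sum()       # discrete version of the integral in (1)
    return L

x = np.random.randn(32000)                       # stand-in for a sampled instrument sound
L = cochleagram(x)
print(L.shape)                                   # (24, 20) -> 480 features
```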

Figure 2. Equivalent Rectangular Bandwidth critical bands.

Figure 3. Cochleagram of the signal in Figure 1, where s is the frame number, b is the band number, and L is the band loudness. Each square region of the surface represents a feature.


3. MUSICAL INSTRUMENTS CLASSIFIER

Through the pre-processing phase described in the previous section, we formulated the problem of sound recognition as a classification problem with 480 inputs and 12 classes. According to the proposed hierarchical approach, the 12-class classification task is first decomposed into a hierarchy of few-class sub-problems, grouped into two levels. At the higher level, classes are grouped into three "meta-classes" that correspond to instrument families (strings, woodwinds, brass), hence a three-class classification problem is solved. At the lower level, three few-class sub-problems are solved, aimed at recognizing the different instruments belonging to the same family. Overall, the whole classification system is composed of four classifiers arranged in a two-level hierarchy:

High level:
- the classifier FAM, which classifies musical sounds into three meta-classes that correspond to instrument families (strings, woodwinds, brass).

Low level:
- the classifier STR, which classifies musical sounds belonging to the meta-class 'strings' into two classes: viola, violin;
- the classifier WOD, which classifies musical sounds belonging to the meta-class 'woodwinds' into five classes: bassoon, oboe, clarinet, flute, piccolo;
- the classifier BRS, which classifies musical sounds belonging to the meta-class 'brass' into five classes: tuba, horn, trumpet, flugelhorn, muted trumpet.

To cope with the high dimensionality of the input space, a modular structure is adopted for each classifier in the hierarchy. Each classifier (FAM, STR, WOD, BRS) is an ensemble of neuro-fuzzy networks, and each network module classifies a sound using only a sub-region of the input space. Ten separate regions of the input space are considered, derived as follows. The cochleagram is first split into two parts: a "low" part comprising the first 12 bands (from 1 to 12), and a "high" part comprising the last 12 bands (from 13 to 24). Then, both the low and the high part are further divided into 5 regions, each comprising 4 frames, as depicted in Figure 4. As a result, ten sub-regions of the input space (the cochleagram) are derived, each made of 48 features, as shown in Figure 4. Summarizing, each classifier is an ensemble of 10 neuro-fuzzy networks, where each network learns to classify a sound using only one region of the input space. To combine the responses of the ten networks, two integrating units are used. One unit combines the outputs of the five sub-networks processing the "low" part of the feature space; the other unit combines the sub-networks processing the "high" part. The outputs of the two integrating units are obtained as follows:


y_j^{Low} = \sum_{r=1}^{5} \alpha^{(r)} y_j^{(r,Low)}        (2)

y_j^{High} = \sum_{r=1}^{5} \alpha^{(r)} y_j^{(r,High)}        (3)

where y_j^{(r,Low)} (resp. y_j^{(r,High)}) is the j-th output of the sub-network processing the r-th sub-region of the low (resp. high) part of the cochleagram, and the \alpha^{(r)} are the associated weights. Such weights are chosen so as to give more importance to the output of the modules processing the first frames. Indeed, it has been demonstrated experimentally that the attack has a great influence on the recognition of a sound source (if we discard the onset of a sound, recognition becomes difficult even for humans). In fact, many timbre characteristics of a sound are stronger in the early vibrations than in the later ones (i.e. more in the sound onset than in its "steady state"). As a consequence, the weights have been chosen according to the following decreasing function:

\alpha^{(r)} = c / r, \quad r = 1, \dots, 5        (4)

where c is a constant value, here set to 3.0. However, such weights could also be adapted as part of the learning process by implementing each integrating unit as a gating network [11]. The final output of each modular classifier is obtained as a simple average of the partial outputs produced by the two groups of sub-networks.
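The following is a minimal sketch of the modular combination described by equations (2)-(4): the 24 x 20 cochleagram is cut into the ten 48-feature sub-regions, each sub-network produces a class-membership vector, the low and high groups are combined with the weights \alpha^{(r)} = c/r, and the two partial outputs are averaged. The sub-network objects and their predict method are hypothetical placeholders for the trained neuro-fuzzy modules of [6].

```python
import numpy as np

def split_cochleagram(L):
    """Split a 24 x 20 cochleagram into 10 sub-regions of 12 bands x 4 frames = 48 features."""
    parts = {"low": L[:12, :], "high": L[12:, :]}
    return {p: [parts[p][:, 4 * r : 4 * (r + 1)].ravel() for r in range(5)] for p in parts}

def modular_response(L, subnets, c=3.0):
    """Combine sub-network outputs as in eqs. (2)-(4); subnets[part][r].predict is hypothetical."""
    regions = split_cochleagram(L)
    alpha = np.array([c / (r + 1) for r in range(5)])       # eq. (4): alpha^(r) = c / r
    partial = {}
    for part in ("low", "high"):
        outputs = np.stack([subnets[part][r].predict(regions[part][r]) for r in range(5)])
        partial[part] = alpha @ outputs                      # eqs. (2)-(3): weighted sums
    return 0.5 * (partial["low"] + partial["high"])          # final output: simple average

# Hypothetical usage: predicted class = argmax of the combined membership vector.
# y = modular_response(L, subnets); predicted_class = int(np.argmax(y))
```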

Figure 4. Architecture of the modular neuro-fuzzy network. The cochleagram is split into 10 regions that are the inputs of the sub-networks (here represented by circles). Their responses are averaged to give the final response.
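The two-level decision process can then be sketched as a simple routing step: the FAM classifier picks a family meta-class, and the corresponding family classifier picks the instrument. The classifier callables and the class lists below are illustrative placeholders standing in for the trained FAM, STR, WOD and BRS modules, not the paper's implementation.

```python
# Hypothetical two-level routing: FAM -> {STR, WOD, BRS}.
FAMILY_CLASSES = {
    "strings":   ["viola", "violin"],
    "woodwinds": ["bassoon", "oboe", "clarinet", "flute", "piccolo"],
    "brass":     ["tuba", "horn", "trumpet", "flugelhorn", "muted trumpet"],
}

def classify_sound(L, fam_classifier, family_classifiers):
    """Two-level hierarchical classification of a cochleagram L (sketch).

    fam_classifier(L) is assumed to return 'strings', 'woodwinds' or 'brass';
    family_classifiers[family](L) is assumed to return an instrument name.
    Both stand for trained modular neuro-fuzzy classifiers (FAM, STR, WOD, BRS).
    """
    family = fam_classifier(L)                    # high level: 3-class decision
    instrument = family_classifiers[family](L)    # low level: 2- or 5-class decision
    return family, instrument
```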


4. EXPERIMENTAL RESULTS

To perform the task of recognizing musical instruments, we used a dataset of 500 sound samples of 12 orchestral instruments played over their entire pitch ranges, belonging to three different families: strings (viola, violin), woodwinds (bassoon, oboe, clarinet, flute, piccolo), brass (tuba, horn, trumpet, flugelhorn, muted trumpet). The sound dataset was supplied by K. D. Martin of the MIT Media Laboratory Machine Listening Group, Cambridge MA, USA. To perform simulations, we considered 20 different splits (called trials) of the whole dataset into a training set (70% of the 500 samples) and a test set (the remaining 30%). Each training set had 350 samples. The hierarchical modular classifier was designed as follows:

- The FAM modular classifier was composed of 10 neuro-fuzzy networks having 48 inputs and 3 outputs. Each neuro-fuzzy network was trained for 1000 epochs, providing a classification rate on the 20 training sets ranging from 90% to 98%.
- The STR modular classifier was composed of 10 neuro-fuzzy networks having 48 inputs and 2 outputs. Each neuro-fuzzy network was trained for 600 epochs, providing a classification rate on the 20 training sets ranging from 98% to 100%.
- The WOD modular classifier was composed of 10 neuro-fuzzy networks having 48 inputs and 5 outputs. Each neuro-fuzzy network was trained for 600 epochs, providing a classification rate on the 20 training sets ranging from 97% to 99%.
- The BRS modular classifier was composed of 10 neuro-fuzzy networks having 48 inputs and 5 outputs. Each neuro-fuzzy network was trained for 600 epochs, providing a classification rate on the 20 training sets ranging from 93% to 99%.

For each modular classifier, all sub-networks were initialized by applying the FCM algorithm with 15 clusters, so they all have the same structure (15 rule nodes). The learning algorithm we used is described in [6]. To assess the benefit of the hierarchical process, we compared the hierarchical modular classifier described above with a modular classifier that directly classifies the musical instrument, without introducing meta-classes, i.e. with no family classification. Such a non-hierarchical classifier, labeled DIR, is composed of 10 neuro-fuzzy networks having 48 inputs and 12 outputs. Each neuro-fuzzy network was trained for 4000 epochs, providing a classification rate on the 20 training sets ranging from 72% to 90%. The results of this comparison are shown in Figure 5: the improved classification ability of the hierarchical classifier over the non-hierarchical one is evident.
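The evaluation protocol (20 random 70/30 splits, classification rate averaged over trials) can be reproduced with a loop like the sketch below; the feature matrix X, the label vector y, and the train/predict callables are placeholders for the cochleagram features and the neuro-fuzzy training procedure of [6], not the original code.

```python
import numpy as np

def evaluate_over_trials(X, y, train_fn, n_trials=20, train_frac=0.7, seed=0):
    """Average test classification rate over n_trials random 70/30 splits (sketch).

    X: (n_samples, 480) cochleagram features; y: (n_samples,) instrument labels.
    train_fn(X_tr, y_tr) is assumed to return a predict(X_te) -> labels callable.
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_trials):
        idx = rng.permutation(len(X))
        n_train = int(train_frac * len(X))       # 350 of the 500 samples
        tr, te = idx[:n_train], idx[n_train:]
        predict = train_fn(X[tr], y[tr])
        rates.append(np.mean(predict(X[te]) == y[te]))
    return float(np.mean(rates)), float(np.std(rates))
```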


Figure 5. Classification rate of the classifiers, averaged over 20 trials, with a breakdown of the results by musical instrument.

5. CONCLUSIONS

A classifier with a hierarchical modular architecture for high-dimensional classification tasks has been proposed. The classifier is composed of neuro-fuzzy networks, each of which can find its optimal structure and parameters automatically. Preliminary experimental results showed that the proposed hierarchical classifier is able to classify instruments into the correct family with a success rate that compares favorably with human performance. Moreover, the hierarchical classification process gives better performance than the non-hierarchical one. Further work is in progress to improve the classification results. For example, we are studying the effect of using different structure sizes for the sub-networks and different types of integration among the neuro-fuzzy modules. Moreover, we are trying different pre-processing techniques.

REFERENCES

[1] G. Peeters, S. McAdams, P. Herrera, "Instrument sound description in the context of MPEG-7", in Proc. International Computer Music Conference, pp. 166-169, Berlin, Germany, Aug.-Sept. 2000.
[2] E. Wold, T. Blum, D. Keislar and J. Wheaton, "Content-based classification, search, and retrieval of audio", IEEE Multimedia (Fall), pp. 21-36, 1996.
[3] B.L. Vercoe, W.G. Gardner and E.D. Scheirer, "Structured audio: the creation, transmission and rendering of parametric sound representations", Proc. IEEE, vol. 85, no. 5, pp. 922-940.
[4] K.D. Martin, "Toward automatic sound source recognition: identifying musical instruments", in Proc. NATO Computational Hearing Advanced Study Institute, Il Ciocco, Italy, 1998.


[5] H. Helmholtz, "On the Sensations of Tone as a Physiological Basis for the Theory of Music", A.J. Ellis, Trans., Dover, 1954.

[6] A.M. Fanelli, G. Castellano, C.A. Buscicchio, "A modular neuro-fuzzy network for musical instrument classification", in Lecture Notes in Computer Science: Multiple Classifier Systems, pp. 372-382, Springer Verlag, Berlin, 2000.
[7] J.C. Houk, "Learning in modular networks", in Proc. of the Seventh Yale Workshop on Adaptive and Learning Systems, pp. 80-84, New Haven, CT: Yale University, 1992.
[8] B.C.J. Moore and B.R. Glasberg, "A revision of Zwicker's loudness model", Acta Acustica, 82:335-345, 1996.
[9] M. Slaney, "An efficient implementation of the Patterson-Holdsworth auditory filter bank", Apple Computer Technical Report #35, 1993.
[10] E. Zwicker and H. Fastl, "Psychoacoustics: Facts and Models", Berlin: Springer Verlag, 1990.
[11] R.A. Jacobs, M.I. Jordan, S.J. Nowlan and G.E. Hinton, "Adaptive mixtures of local experts", Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.

