A Matlab Toolbox for Music Information Retrieval
Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola
University of Jyväskylä, PL 35(M), FI-40014, Finland
[email protected]
Abstract. We present MIRToolbox, an integrated set of functions written in Matlab, dedicated to the extraction of musical features from audio files, related among others to timbre, tonality, rhythm and form. The objective is to offer an overview of state-of-the-art computational approaches in the area of Music Information Retrieval (MIR). The design is based on a modular framework: the different algorithms are decomposed into stages, formalized using a minimal set of elementary mechanisms, and integrating different variants proposed by alternative approaches (including new strategies we have developed) that users can select and parametrize. These functions can adapt to a large variety of input objects. This paper offers an overview of the set of features that can be extracted with MIRToolbox, illustrated with the description of three particular musical features. The toolbox also includes functions for statistical analysis, segmentation and clustering. One of our main motivations for the development of the toolbox is to facilitate investigation of the relation between musical features and music-induced emotion. Preliminary results show that a substantial part of the variance in emotion ratings can be explained by a small set of acoustic features.
1 Motivation and approach

MIRToolbox is a Matlab toolbox dedicated to the extraction of musically-related features from audio recordings. It has been designed in particular with the objective of enabling the computation of a large range of features from databases of audio files, which can then be subjected to statistical analyses. We chose to base the design of the toolbox on the Matlab computing environment, as it offers good visualisation capabilities and gives access to a large variety of other toolboxes. In particular, MIRToolbox makes use of functions available in public-domain toolboxes such as the Auditory Toolbox (Slaney, 1998), NetLab (Nabney, 2002), or SOMtoolbox (Vesanto, 1999). Other toolboxes, such as the Statistics toolbox or the Neural Network toolbox from MathWorks, can be directly used for further analyses of the features
extracted by MIRToolbox, without having to export the data from one software environment to another. Because of its general objectives, such a computational framework could be useful not only to the Music Information Retrieval (MIR) research community, but also for teaching. For that reason, particular attention has been paid to the ease of use of the toolbox. The functions are called using a simple and adaptive syntax; more expert users can specify a large range of options and parameters.

The different musical features extracted from the audio files are highly interdependent: in particular, as can be seen in figure 1, some features are based on the same initial computations. In order to improve the computational efficiency, it is important to avoid redundant computations of these common components. Each intermediary component, as well as each final musical feature, is therefore considered as a building block that can be freely combined with the others. Besides, in keeping with the objective of optimal ease of use, each building block has been conceived so that it can adapt to the type of input data. Hence, in figure 1, the computation of the MFCCs can be based on the waveform of the initial audio signal, or on intermediary representations such as the spectrum or the mel-scale spectrum. Similarly, the autocorrelation operator behaves differently when applied to an audio signal or to an envelope, and can adapt to frame decompositions. This decomposition of the whole set of feature extraction algorithms into a common set of building blocks also has the advantage of offering a synthetic overview.
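As an illustration of this building-block design, the following Matlab sketch shows how an intermediary representation can be computed once and then passed to several feature extractors. The function names and options used here (miraudio, mirframe, mirspectrum, mirmfcc, mirbrightness, 'Mel') are given as assumptions about the toolbox syntax and are not a definitive reference.

% Hypothetical usage sketch of the building-block design; function names
% and options are assumptions, not a definitive description of the API.
a = miraudio('mysong.wav');       % load the audio waveform
f = mirframe(a);                  % decompose it into successive frames
s = mirspectrum(f);               % one spectrum per frame
% The same intermediary representation feeds several feature extractors,
% so the FFT is not recomputed for each of them.
m  = mirspectrum(s, 'Mel');       % mel-scale spectrum derived from s
cc = mirmfcc(m);                  % MFCCs derived from the mel-scale spectrum
br = mirbrightness(s);            % brightness derived from the same spectrum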
2 Feature extraction

Figure 1 shows an overview of the main features considered in the toolbox. All the different processes start from the audio signal (on the left) and form a chain of operations developed horizontally, from left to right. The vertical disposition of the processes indicates an increasing order of complexity of the operations, from the simplest computations (top) to more detailed auditory modelling (bottom). Each musical feature is related to one of the broad musical dimensions traditionally defined in music theory. In bold are highlighted features related to pitch, to tonality (chromagram, key strength and key Self-Organising Map, or SOM) and to dynamics (Root Mean Square, or RMS, energy). In bold italics are indicated features related to rhythm: namely tempo, pulse clarity and fluctuation. In plain italics is highlighted a large set of features that can be associated with timbre. Among them, all the operators in grey italics can in fact be applied to many other representations: for instance, statistical moments such as centroid, kurtosis, etc., can be applied not only to spectra or envelopes, but also to histograms based on any given feature.

One of the simplest features, the zero-crossing rate, is based on a description of the audio waveform itself: it counts the number of sign changes of the waveform. Signal energy is computed using the root mean square, or RMS (Tzanetakis and Cook, 1999).
Fig. 1. Overview of the musical features that can be extracted with MIRToolbox.
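As a minimal illustration of the two simplest descriptors just mentioned, the following sketch computes the zero-crossing rate and the RMS energy of one frame of samples, using basic Matlab operations only; it is not taken from the toolbox itself.

% Minimal sketch (not toolbox code): zero-crossing rate and RMS energy
% of one frame of audio samples x, sampled at rate sr.
sr = 44100;
x  = randn(1, 1024);               % stand-in for one audio frame
zc  = sum(abs(diff(sign(x))) > 0); % number of sign changes in the frame
zcr = zc * sr / length(x);         % zero-crossing rate, in crossings per second
rms_energy = sqrt(mean(x .^ 2));   % root of the mean of the squared samples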
The envelope of the audio signal offers timbral characteristics of isolated sonic events. Many features are based on the spectral representation, where the energy is decomposed along the frequency range:
• Basic statistics of the spectrum give some timbral characteristics (such as spectral centroid, roll-off (Tzanetakis and Cook, 1999), brightness, flatness, etc.).
• An estimation of roughness, or sensory dissonance, can be obtained by summing the beating produced by each pair of energy peaks in the spectrum (Terhardt, 1974).
• A particular descriptor used in timbre analysis is based on Mel-Frequency Cepstral Coefficients (MFCC) (cf. example 2.1).
• Tonality can also be estimated (cf. example 2.2).
The estimation of pitch is usually based on the spectrum, the autocorrelation, the cepstrum, or a mixture of these strategies (Peeters, 2006). A distinct approach consists of designing a complete chain of processes based on the modelling of auditory perception of sound and music (Slaney, 1998). This can be used in particular for the computation of rhythmic pulsation (cf. example 2.3).

2.1 Example: Timbre analysis

One common way of describing timbre is based on MFCCs (Rabiner and Juang, 1993; Slaney, 1998). Figure 2 shows the diagram of operations. First, the audio sequence is described in the spectral domain, using an FFT. The spectrum is then converted from the frequency domain to the Mel-scale domain:
the frequencies are rearranged into 40 frequency bands called Mel-bands. The envelope of the Mel-scale spectrum is then described through a Discrete Cosine Transform. The values obtained through this transform are the MFCCs. Usually only a restricted number of them (for instance the first 13) are selected.
Fig. 2. Successive steps for the computation of MFCCs, illustrated with the analysis of an audio excerpt decomposed into frames.
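The following Matlab sketch summarises these steps for a single frame. The triangular mel filterbank is simplified (band edges linearly spaced on the mel scale) and the code illustrates the principle rather than the toolbox implementation.

% Sketch of the MFCC computation for one frame x sampled at rate sr.
% This illustrates the principle; it is not the toolbox implementation.
sr = 22050;
x  = randn(1, 512);                         % stand-in for one audio frame
N  = length(x);
X  = abs(fft(x));                           % magnitude spectrum
X  = X(1 : floor(N/2) + 1);                 % keep the non-negative frequencies
f  = (0 : floor(N/2)) * sr / N;             % frequency of each bin, in Hz
% Conversion to the mel scale and grouping into 40 triangular Mel-bands.
mel     = @(f) 2595 * log10(1 + f / 700);
imel    = @(m) 700 * (10 .^ (m / 2595) - 1);
nbands  = 40;
edges   = imel(linspace(0, mel(sr / 2), nbands + 2));  % band edges, in Hz
melspec = zeros(1, nbands);
for b = 1 : nbands
    lo = edges(b); ce = edges(b + 1); hi = edges(b + 2);
    w  = max(0, min((f - lo) / (ce - lo), (hi - f) / (hi - ce)));
    melspec(b) = w * X';                    % triangular weighting of the bins
end
% DCT of the log Mel-scale spectrum; the first 13 coefficients are the MFCCs.
n    = 0 : nbands - 1;
D    = cos(pi / nbands * (n' + 0.5) * n);   % DCT-II basis (unnormalised)
mfcc = log(melspec + eps) * D;
mfcc = mfcc(1 : 13);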
The computation can be carried out on a window sliding through the audio signal, resulting in a series of MFCC vectors, one for each successive frame, which can be represented column-wise in a matrix. Figure 2 shows an example of such a matrix. The MFCCs do not convey a very intuitive meaning per se, but are generally used for computing distances between frames, and therefore for segmentation tasks.

2.2 Example: Tonality analysis

The spectrum is converted from the frequency domain to the pitch domain by applying a log-frequency transformation. The distribution of the energy along the pitches is called the chromagram. The chromagram is then wrapped, by fusing the pitches belonging to the same pitch class. The wrapped chromagram therefore shows the distribution of the energy with respect to the twelve possible pitch classes. Krumhansl and Schmuckler (Krumhansl, 1990) proposed a method for estimating the tonality of a musical piece (or an extract thereof) by computing the cross-correlation of its pitch class distribution with the distribution associated with each possible tonality. These distributions have been established through listening experiments (Krumhansl and Kessler, 1982). The most prevalent tonality is considered to be the tonality candidate with the highest correlation (or key strength). This method was originally designed for the analysis of symbolic representations of music, but has been extended to audio analysis through an adaptation of the pitch class distribution to the chromagram representation (Gomez, 2006). Figure 3 displays the successive steps of this approach.
Fig. 3. Successive steps for the calculation of chromagram and estimation of key strengths, illustrated with the analysis of an audio excerpt, this time not decomposed into frames.
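The key-strength computation described above can be sketched compactly: the wrapped chromagram is correlated with each of the 24 key profiles, obtained by rotating the major and minor profiles of Krumhansl and Kessler (1982). The chromagram below is a placeholder, and the code is an illustration rather than the toolbox routine.

% Sketch of the Krumhansl-Schmuckler key estimation from a wrapped chromagram.
% chroma is a 12-element energy distribution over pitch classes C, C#, ..., B.
chroma = rand(1, 12);                        % placeholder wrapped chromagram
% Major and minor key profiles (Krumhansl and Kessler, 1982).
major = [6.35 2.23 3.48 2.33 4.38 4.09 2.52 5.19 2.39 3.66 2.29 2.88];
minor = [6.33 2.68 3.52 5.38 2.60 3.53 2.54 4.75 3.98 2.69 3.34 3.17];
% Key strength = correlation between the chromagram and each rotated profile.
strengths = zeros(1, 24);
for k = 0 : 11
    cmaj = corrcoef(chroma, circshift(major, [0 k]));
    cmin = corrcoef(chroma, circshift(minor, [0 k]));
    strengths(k + 1)  = cmaj(1, 2);          % key strength of the k-th major key
    strengths(k + 13) = cmin(1, 2);          % key strength of the k-th minor key
end
[beststrength, bestkey] = max(strengths);    % most prevalent key candidate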
A richer representation of the tonality estimation can be drawn with the help of a self-organizing map (SOM) trained with the 24 tonal profiles (Toiviainen and Krumhansl, 2003). The configuration of the 24 key classes obtained after training the SOM corresponds to representations proposed in music theory. The tonality of the musical piece under study is then estimated by projecting its wrapped chromagram onto the SOM.
Fig. 4. Activity pattern of a self-organizing map representing the tonal configuration of the first two seconds of Mozart's Sonata in A major, K. 331. High activity is represented by bright nuances.
2.3 Example: Rhythm analysis

One common way of estimating the rhythmic pulsation, described in figure 5, is based on auditory modelling (Tzanetakis and Cook, 1999). The audio signal is first decomposed into auditory channels using a bank of filters. The envelope of each channel is extracted. As pulsation is generally related to increases of energy only, the envelopes are differentiated and half-wave rectified, before being finally summed together again. This gives a precise description of the variation of energy produced by each note event in the different auditory channels. After this onset detection, the periodicity is estimated through autocorrelation. However, if the tempo varies throughout the piece, an autocorrelation of the whole sequence will not show clear periodicities.
In such cases it is better to compute the autocorrelation on a moving window. This yields a periodogram that highlights the different periodicities, as shown in figure 5. In order to focus on the periodicities that are most perceptible, the periodogram is filtered using a resonance curve (Toiviainen and Snyder, 2003), after which the best tempos are estimated through peak picking, and the results are converted into beats per minute. Due to the difficulty of choosing among the possible multiples of the tempo, several candidates (three, for instance) may be selected for each frame, and a histogram of all the candidates over all the frames, called the periodicity histogram, can be drawn.
Fig. 5. Successive steps for the estimation of tempo illustrated with the analysis of an audio excerpt. In the periodogram, high autocorrelation values are represented by bright nuances.
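The last stages of figure 5 can be sketched as follows, assuming an onset detection curve has already been computed; the resonance weighting used here is a simple placeholder rather than the Toiviainen and Snyder (2003) model.

% Sketch of tempo estimation from an onset detection curve o sampled at rate fs.
% This is an illustration of the principle, not the toolbox implementation.
fs = 100;                                   % onset curve sampling rate, in Hz
t  = (0 : 999) / fs;
o  = max(0, sin(2 * pi * 2 * t));           % placeholder onset curve (2 Hz pulse)
% Autocorrelation of the onset curve for lags up to 2 seconds.
maxlag = 2 * fs;
ac = zeros(1, maxlag);
for lag = 1 : maxlag
    ac(lag) = sum(o(1 : end - lag) .* o(1 + lag : end));
end
% Weighting of the lags with a resonance-like curve emphasising periods around
% 0.5 s (120 bpm); a placeholder for the Toiviainen and Snyder (2003) curve.
lags = (1 : maxlag) / fs;                   % lag values, in seconds
w    = exp(-0.5 * (log2(lags / 0.5)) .^ 2);
acw  = ac .* w;
% Peak picking and conversion of the best lag into beats per minute.
[~, best] = max(acw);
tempo_bpm = 60 / lags(best);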
3 Data analysis

The toolbox includes diverse tools for data analysis, such as a peak extractor, and functions that compute histograms or various statistical moments. One particularity of these functions is that they can automatically adapt to the various types of input data. More elaborate tools have also been implemented that can carry out higher-level analyses and transformations. In particular, audio files can be automatically segmented into a series of homogeneous sections, through the estimation of temporal discontinuities in timbral features (Foote and Cooper, 2003). The resulting segments can then be clustered into classes, suggesting a formal analysis of the musical piece. Supervised classification of musical samples can also be performed, using techniques such as K-Nearest Neighbours or Gaussian Mixture Models. One possible application is the classification of audio recordings into musical genres.
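As an illustration of the segmentation principle cited above (Foote and Cooper, 2003), the following sketch computes a self-similarity matrix from a sequence of feature vectors and derives a novelty curve whose peaks suggest segment boundaries. It is a simplified illustration, not the toolbox routine.

% Sketch of novelty-based segmentation (after Foote and Cooper, 2003).
% F is a (features x frames) matrix, for instance MFCCs frame by frame.
F = randn(13, 200);                         % placeholder feature matrix
% Self-similarity matrix built from the cosine similarity between frames.
Fn = F ./ repmat(sqrt(sum(F .^ 2)), size(F, 1), 1);
S  = Fn' * Fn;
% A checkerboard kernel slid along the diagonal yields the novelty curve.
L = 8;                                      % half-width of the kernel
K = [ones(L) -ones(L); -ones(L) ones(L)];
n = size(S, 2);
novelty = zeros(1, n);
for i = L + 1 : n - L
    novelty(i) = sum(sum(S(i - L : i + L - 1, i - L : i + L - 1) .* K));
end
% Local maxima of the novelty curve suggest segment boundaries.
isPeak = [false, novelty(2 : end - 1) > novelty(1 : end - 2) & ...
                 novelty(2 : end - 1) > novelty(3 : end), false];
boundaries = find(isPeak & novelty > 0.5 * max(novelty));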
4 Application to the study of music and emotion

The toolbox is conceived in the context of a project investigating the interrelationships between perceived emotions and acoustic features.
In a first study, musical features of musical materials collected from a large number of recent empirical studies on music and emotion (15 so far) are systematically reanalysed. For emotion ratings based on interval scales, using emotion dimensions such as emotional valence (liking, preference) and activity, the mapping applies linear models, where ridge regression can handle the highly collinear variables. If, on the contrary, the emotion ratings contain categorical data (happy, sad, angry, scary or tender), supervised classification and linear methods such as discriminant analysis or logistic regression are used. The early results suggest that a substantial part (50-70%) of the variance in the emotion ratings can be explained by a small set of acoustic features, although the exact set of features depends on the musical genre. The existence of several data sets representing different genres and data types makes the selection of appropriate statistical measures a challenging task.

A second study focuses on musical timbre. Listeners' ratings of 110 short instrument sounds (of constant pitch height, loudness and duration, but varying timbre) on four bipolar scales (valence, energy arousal, tension arousal, and preference) were correlated with the acoustic features. We found that positively valenced sounds tend to be dark in sound colour (see figure 6). Energy arousal, on the other hand, is more related to high spectral flux and high brightness (R² = .62). The tension arousal ratings (R² = .70) are explained by a mixture of high brightness, roughness and inharmonicity. These observations extend earlier findings relating timbre and emotions (Scherer and Oshinsky, 1977; Juslin, 1997).
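The ridge regression step mentioned for the first study can be written compactly using its closed-form solution on standardised predictors; the sketch below is a generic illustration with placeholder data, not the exact procedure used in the study.

% Sketch of ridge regression mapping acoustic features X (observations by
% features) to emotion ratings y; a generic illustration with placeholder data.
X = randn(100, 20);                         % placeholder feature matrix
y = randn(100, 1);                          % placeholder valence ratings
% Standardise the (possibly highly collinear) predictors and centre the ratings.
Xs = (X - repmat(mean(X), size(X, 1), 1)) ./ repmat(std(X), size(X, 1), 1);
yc = y - mean(y);
% Closed-form ridge solution: beta = (X'X + lambda*I) \ X'y.
lambda = 1;
beta   = (Xs' * Xs + lambda * eye(size(Xs, 2))) \ (Xs' * yc);
% Proportion of variance explained by the ridge model (in-sample R^2).
yhat = Xs * beta;
R2   = 1 - sum((yc - yhat) .^ 2) / sum(yc .^ 2);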
Fig. 6. Correlation between listener ratings of valence and acoustic brightness of short instrument sounds (R² = .733).
Following our first Matlab toolbox, called MIDItoolbox (Eerola and Toiviainen, 2004), dedicated to the analysis of symbolic representations of music, MIRToolbox will be available for free, from September 2007, at the following address: http://www.cc.jyu.fi/~lartillo/mirtoolbox

This work has been supported by the European Commission (NEST project “Tuning the Brain for Music”, code 028570).
References

EEROLA, T. and TOIVIAINEN, P. (2004): MIR in Matlab: The Midi Toolbox. Proceedings of the 5th International Conference on Music Information Retrieval, Barcelona, 22–27.
FOOTE, J. and COOPER, M. (2003): Media Segmentation using Self-Similarity Decomposition. Proceedings of SPIE Storage and Retrieval for Multimedia Databases, 5021, 167–175.
GOMEZ, E. (2006): Tonal Description of Polyphonic Audio for Music Content Processing. INFORMS Journal on Computing, 18(3), 294–304.
JUSLIN, P. N. (1997): Emotional Communication in Music Performance: A Functionalist Perspective and Some Data. Music Perception, 14, 383–418.
KRUMHANSL, C. (1990): Cognitive Foundations of Musical Pitch. Oxford University Press, New York.
KRUMHANSL, C. and KESSLER, E. J. (1982): Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychological Review, 89, 334–368.
NABNEY, I. (2002): NETLAB: Algorithms for Pattern Recognition. Springer Advances in Pattern Recognition Series, Springer-Verlag, New York.
PEETERS, G. (2006): Music Pitch Representation by Periodicity Measures Based on Combined Temporal and Spectral Representations. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
RABINER, L. and JUANG, B. H. (1993): Fundamentals of Speech Recognition. Prentice-Hall.
SCHERER, K. R. and OSHINSKY, J. S. (1977): Cue Utilization in Emotion Attribution from Auditory Stimuli. Motivation and Emotion, 1(4), 331–346.
SLANEY, M. (1998): Auditory Toolbox Version 2. Technical Report 1998-010, Interval Research Corporation.
TERHARDT, E. (1974): On the Perception of Periodic Sound Fluctuations (Roughness). Acustica, 30(4), 201–213.
TOIVIAINEN, P. and KRUMHANSL, C. (2003): Measuring and Modeling Real-Time Responses to Music: The Dynamics of Tonality Induction. Perception, 32(6), 741–766.
TOIVIAINEN, P. and SNYDER, J. S. (2003): Tapping to Bach: Resonance-Based Modeling of Pulse. Music Perception, 21(1), 43–80.
TZANETAKIS, G. and COOK, P. (1999): Multifeature Audio Segmentation for Browsing and Annotation. Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York.
VESANTO, J. (1999): Self-Organizing Map in Matlab: the SOM Toolbox. Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, 35–40.