
Eur. Phys. J. Special Topics 222, 473–485 (2013) © EDP Sciences, Springer-Verlag 2013 DOI: 10.1140/epjst/e2013-01853-8

THE EUROPEAN PHYSICAL JOURNAL SPECIAL TOPICS

Regular Article

Quantitative characterisation of audio data by ordinal symbolic dynamics

T. Aschenbrenner1,a, R. Monetti1, J.M. Amigó2, and W. Bunk1

1 Max-Planck-Institut für extraterrestrische Physik, Giessenbachstr. 1, 85748 Garching, Germany
2 Centro de Investigación Operativa, Universidad Miguel Hernandez, Avda. de la Universidad s/n, 03202 Elche, Spain

Received 27 March 2013 / Received in final form 25 April 2013
Published online 25 June 2013

Abstract. Ordinal symbolic dynamics has developed into a valuable method to describe complex systems. Recently, using the concept of transcripts, the coupling behaviour of systems was assessed, combining the properties of the symmetric group with information theoretic ideas. In this contribution, methods from the field of ordinal symbolic dynamics are applied to the characterisation of audio data. Coupling complexity between frequency bands of solo violin music, as a fingerprint of the instrument, is used for classification purposes within a support vector machine scheme. Our results suggest that coupling complexity is able to capture essential characteristics, sufficient to distinguish among different violins.

1 Introduction

In 2002, Bandt and Pompe [1] published their ideas on ordinal symbolic dynamics and its application to quantifying the complexity of time series. They applied ordinal patterns to speech signals and demonstrated that voiced segments can be identified by their lower complexity, an important task in, e.g., speech recognition. In this context, measures based on ordinal symbolic dynamics proved robust against background noise and tolerant of changes in the sampling frequency. Since then, considerable effort has been put into the further development of measures based on ordinal symbolic representations. The introduction of transcripts [2] opened a new framework for studying the relation between two or more time series. In Amigó et al. [3], the concept of coupling complexity was formulated by combining group theoretic considerations and information measures. The common notion of complexity, where monotonic and random time series are associated with low complexity, is consistently extended to coupling complexity: two (sub)systems have low coupling complexity if their signals are identical, independent and random, or monotonic. In this contribution, we use the definition of transcripts and coupling complexity to differentiate between the acoustical properties of violins built by different luthiers. In this respect, violins are considered as complex systems, composed of subsystems,

e-mail: [email protected]


which interact in a specific and unique way in frequency space. The hypothesis is that the coupling complexity between different frequency bands is a signature of the respective violin. Our data consist of two sets of solo violin recordings. Each set is composed of a number of recordings of the same musical piece, performed by the same musician on different violins. Each recording is decomposed into several frequency bands, and the coupling complexity between these channels is calculated. A multi-class classification approach using a support vector machine (SVM) is applied. A high classification rate would indicate that coupling complexity is suitable to capture differences in the timbre of violins.

2 Violin as complex system

The violin belongs to the family of bowed string instruments. In standard technique, the player uses a bow with a ribbon of hairs to excite the strings. This exploits the “stick-slip effect” of the string [5, 6]: the string sticks to the moving bow until the restoring force exceeds the static friction and the string is suddenly released. The string is then captured again by the moving bow and the cycle starts anew. The resulting oscillations of the string are (in part) determined by its geometric and elastic properties. Besides the lowest frequency, i.e. the fundamental frequency, higher harmonics (integer multiples of the fundamental) are also excited. Owing to the relatively small diameter of the strings, the direct energy transfer from string to air, which determines the sound volume, is quite low. Therefore, the body of the violin is needed to enhance the string-air coupling. The design of the violin is optimised to create a pleasant, far-carrying, bright sound over the whole pitch range of the instrument (from 196 Hz for the open G string up to ≈ 4000 Hz). This purpose is served by a refined combination of components acting like a filter (bridge [7], bass bar), of constructional elements transporting vibrations to the back plate (sound post), and of a sophisticated thickness distribution of the top and back plates. As a result, numerous vibrational eigenmodes associated with certain eigenfrequencies define the timbre of a violin. For example, a certain Stradivarius [8] shows 48 eigenmodes between 120 Hz and 2340 Hz. Violins often exhibit a considerable number of higher-order eigenmodes with narrowly spaced eigenfrequencies; higher pitches excite these modes with highly irregular amplitudes. The most important amplifying mechanism for lower frequencies (around 280 Hz [7, 8]) is the resonance of the air in the body of the violin.

In the following, we interpret a violin as a complex system which consists of several non-linearly coupled sub-systems acting in different frequency bands. The interdependencies between these bands, which can be traced back to the coupling characteristics of the instrument components, are the subject of our analysis.

3 Data

We consider two sets of recordings, in each of which the same violinist plays the very same piece for solo violin on different violins. For a compilation of the data, see Table 1. In the first recording, Ruggiero Ricci plays the cadenza by the violinist Hubert Léonard for the second movement of Beethoven's Violin Concerto, op. 61, on 18 contemporary violins [12]. Each performance lasts about 36 s. The second recording is the Sarabande from the second Partita of J.S. Bach, performed by Saschko Gawriloff on six classic violins and one modern violin [13]. Each performance takes about 3 min. In the CD booklet, it is explicitly stated that the recording parameters were as standardised as

Symbolic Dynamics and Permutation Complexity


Table 1. Compilation of the data. Two sets of solo violin recordings are used. In both sets a violinist plays one piece under the same setting with different violins. The first set compares the sound of 18 contemporary violins [12] while the second one presents six classic and one modern violin [13].

Set 1: Ruggiero Ricci, L.v. Beethoven: Violin Concerto D major, op. 61, Cadenza 2nd movement (Hubert Léonard)

Luthier              Duration [s]
G. Aif               34
S. Zygmuntowicz      34
P. Greiner           37
P. Pistoni           33
W. Muller            38
J. Dilworth          36
P. Robin             38
A. Giordano          35
D. Baqué             35
R. Hargrave          36
Ch. Götting          41
L. Bellini           35
F. Chaudière         37
D. Fantin            37
J. Curtin            33
R. Regazzi           37
G.C. Guicciardi      36
R. Scrollavezza      37

Set 2: Saschko Gawriloff, J.S. Bach: Partita II D Minor, Sarabande, BWV 1004

Luthier                   Duration [s]
N. Amati (1640)           159
G.B. Guadagnini (1771)    156
A. Guaneri (1671)         153
P. Guaneri                161
A. Horvath (1992)         156
A. Stradivari (1683)      165
J.-B. Vuillaume (1870)    158

possible: the same recording conditions (recording studio, technical equipment, position of the microphones) were used. Each violin is tuned to the identical pitch. Gawriloff even plays with the same bow (made by François Tourte). Both sets are treated as multi-class classification problems with $N_c = 18$ (first set) and $N_c = 7$ (second set) classes. Both recording sets are intended to present each violin at its best. Because of this, and because each instrument responds differently to the player's action, the chosen tempo and articulation differ slightly. This implies that, e.g., the temporal occurrence of musical patterns or motifs does not match exactly between different performances.

4 Coupling complexity

Bandt and Pompe [1] introduced ordinal symbolic dynamics, a novel approach to the analysis of complex systems. Ordinal symbols are defined by the rank ordering of L > 2 data points. Each symbol may be interpreted as a permutation with ordinal pattern $\pi = \langle \pi_0, \pi_1, \ldots, \pi_{L-1} \rangle$, where $\pi_0, \pi_1, \ldots, \pi_{L-1}$ is a permutation of $0, 1, \ldots, L-1$ fulfilling $x_{t+\pi_0} \le x_{t+\pi_1} \le \ldots \le x_{t+\pi_{L-1}}$. In this way, a time series $\{x_t\}_{t=0}^{n-1}$ of n points in $\mathbb{R}$ is represented by $n - L + 1$ symbols¹ π. For a length L, there exist L! different permutations or symbols.

1 In general, symbols are generated using delay vectors, i.e. with a “spacing” T between the data points considered for rank ordering. For consecutive data points, T = 1.
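The symbol generation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ordinal_patterns` is hypothetical, and ties between equal values are resolved by temporal order (Python's stable sort), a convention the text does not specify.

```python
from typing import List, Sequence, Tuple


def ordinal_patterns(x: Sequence[float], L: int, T: int = 1) -> List[Tuple[int, ...]]:
    """Ordinal pattern of every delay vector (x[t], x[t+T], ..., x[t+(L-1)T]).

    Each pattern pi satisfies x[t + pi[0]*T] <= x[t + pi[1]*T] <= ...
    """
    n_symbols = len(x) - (L - 1) * T          # number of complete delay vectors
    patterns = []
    for t in range(max(n_symbols, 0)):
        window = [x[t + k * T] for k in range(L)]
        # sorted() is stable: equal values keep their temporal order (assumed tie rule)
        patterns.append(tuple(sorted(range(L), key=window.__getitem__)))
    return patterns


if __name__ == "__main__":
    series = [0.3, 1.2, 0.7, 2.0, 1.5]
    print(ordinal_patterns(series, L=3))
```

Because only the rank order inside each window enters, any positive linear rescaling of the series leaves the returned symbols unchanged.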



The robustness of symbols and, hence, of derived measures of ordinal symbolic dynamics against (small amounts of) noise and “slow” artifacts (slow compared to the time horizon $T_h$ of a symbol, see footnote 1) is attributed to the locality of the symbol generation. In particular, any positive linear scaling of the data does not affect the resulting symbolic representation at all. A description of the dynamics of coupled complex systems in the framework of ordinal patterns was suggested in [2] by introducing the concept of transcripts. In general, transcripts give the transition from one ordinal pattern to another in terms of permutations. Thus, transcripts may be employed for a quantitative assessment of the relation between symbolic representations of time series. They exploit the fact that permutations of length L form a non-commutative group, the symmetric group $S_L$ [3], where the group operation is the composition of permutations
\[
\pi\sigma = \langle \pi_0, \pi_1, \ldots, \pi_{L-1} \rangle \langle \sigma_0, \sigma_1, \ldots, \sigma_{L-1} \rangle = \langle \sigma_{\pi_0}, \sigma_{\pi_1}, \ldots, \sigma_{\pi_{L-1}} \rangle . \tag{1}
\]

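The neutral element ID of the group is the identity permutation. The inverse element is defined by the order operation O as
\[
\sigma^{-1} = \langle \sigma_0, \sigma_1, \ldots, \sigma_{L-1} \rangle^{-1} = O\left(\langle \sigma_0, \sigma_1, \ldots, \sigma_{L-1} \rangle\right) = \langle \sigma_{O_0}, \sigma_{O_1}, \ldots, \sigma_{O_{L-1}} \rangle , \tag{2}
\]
with
\[
ID = \sigma\sigma^{-1} = \sigma^{-1}\sigma . \tag{3}
\]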

Given two symbols σ and π, a transcript τ is defined as
\[
\tau\sigma = \pi . \tag{4}
\]
This leads to
\[
\tau(\pi, \sigma) = \pi\sigma^{-1} . \tag{5}
\]
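Equations (1) to (5) can be checked numerically on a small group. The following sketch, with hypothetical helper names `compose`, `inverse` and `transcript`, represents permutations as Python tuples and implements the inverse as a rank ordering (an argsort), which is one reading of the order operation O in Eq. (2).

```python
from collections import Counter
from itertools import permutations


def compose(pi, sigma):
    """Eq. (1): the i-th entry of the product pi sigma is sigma[pi[i]]."""
    return tuple(sigma[p] for p in pi)


def inverse(sigma):
    """Eq. (2): inverse permutation, obtained by rank-ordering the entries of sigma."""
    return tuple(sorted(range(len(sigma)), key=sigma.__getitem__))


def transcript(pi, sigma):
    """Eq. (5): tau = pi sigma^{-1}, so that tau sigma = pi (Eq. (4))."""
    return compose(pi, inverse(sigma))


L = 3
S_L = list(permutations(range(L)))            # all L! symbols of length L
identity = tuple(range(L))

# Count how many pairs (pi, sigma) share each transcript.
pairs_per_transcript = Counter(transcript(pi, sigma) for pi in S_L for sigma in S_L)

if __name__ == "__main__":
    print(compose((1, 0, 2), (0, 2, 1)), compose((0, 2, 1), (1, 0, 2)))  # order matters
    print(pairs_per_transcript)
```

For L = 3 every one of the six transcripts is produced by exactly six pairs (π, σ), and swapping the factors in `compose` generally changes the result, as expected for a non-commutative group.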

For every transcript τ of length L, there are L! different pairs (π, σ) which satisfy Eq. (4). For a detailed depiction of the properties of symbols and transcripts see [2] and [3]. With symbols, the time series is partitioned in a natural way into a finite set, which facilitates the application of information theoretic techniques. Assume we have two time series $\{x_t\}_{t=0}^{n-1}$ and $\{y_t\}_{t=0}^{n-1}$ with symbolic representations π and σ, respectively. The associated marginal probability distributions are $P_L^\pi$ and $P_L^\sigma$, and $P_L^J$ denotes the joint probability distribution $P_L^J = P_L(\pi, \sigma)$. In addition, the transcripts τ relate x to y via Eq. (4), with probabilities $P_L^\tau$. Bandt and Pompe used the Shannon entropy of symbols, called permutation entropy [1], as a measure of the complexity of the system. This permutation entropy or marginal entropy is
\[
H(P_L^\pi) = -\sum_{\pi \in S_L} P_L^\pi \log P_L^\pi , \tag{6}
\]
and the joint entropy is given by
\[
H(P_L^J) = H_L(\pi, \sigma) = -\sum_{\pi,\sigma \in S_L} P_L(\pi, \sigma) \log P_L(\pi, \sigma) = -\sum_{\pi,\sigma \in S_L} P_L^J \log P_L^J . \tag{7}
\]

The transcript entropy may be calculated as
\[
H(P_L^\tau) = -\sum_{\tau \in S_L} P_L^\tau \log P_L^\tau \tag{8}
\]
with
\[
P_L^\tau = \sum_{\pi,\sigma \in S_L \,\wedge\, \tau\sigma = \pi} P_L(\pi, \sigma) . \tag{9}
\]

1 (continued) A delay T > 1 reduces the number of symbols to $n - (L-1)\,T$ for n data points. The time horizon $T_h$ of a symbol of length L is $T_h = (L-1)\,T$.

In Eq. (9), the sum for $P_L^\tau$ is restricted to all pairs of symbols (π, σ) which are connected by the same transcript τ. In [3], several ways to quantify the complexity of the relation between pairs of time series were studied, and two coupling complexity indices were introduced. One of them, called $C_1$ therein, is used in the following:
\[
C(L) = \min\left\{ H(P_L^\pi), H(P_L^\sigma) \right\} - \left[ H(P_L^J) - H(P_L^\tau) \right] . \tag{10}
\]
Several properties of the coupling complexity C(L) are summarised here. The coupling complexity is non-negative, $C(L) \ge 0$. It is symmetric with respect to π and σ. It is designed in such a way that C(L) = 0 for two identical signals (I), for two independent and uniformly distributed random time series (II), and if one of the time series is monotonic in time (III). In the first case (I), $H(P_L^\pi) = H(P_L^\sigma) = H(P_L^J)$ and the transcript entropy $H(P_L^\tau) = 0$, since τ ≡ ID. In (II), the marginal and the transcript entropies are equal, $H(P_L^\pi) = H(P_L^\sigma) = H(P_L^\tau)$, while the joint entropy $H(P_L^J)$ takes twice this value, so that the coupling complexity vanishes. A monotonic series (III) is represented by a single symbol, so the minimum of the marginal entropies is zero; in addition, the transcript entropy equals the joint entropy, so their difference vanishes as well. Coupling complexity is bounded by $C(L) \le \log_2 L!$. If $H(P_L^J) - H(P_L^\tau) \neq 0$, this upper bound can be narrowed down to $C(L) \le \log_2 L! - \frac{2}{L!}$ (proof in [4]). The maximum value of coupling complexity occurs for a bijective relationship between the transcripts τ and the associated pairs (π, σ). Coupling complexity decreases when this condition is not fulfilled. For further details see [4].

4.1 Frequency partitioning

In the simplest approach, the frequency range of interest $[\nu_L, \nu_U]$ is partitioned into bins of equal width. This ensures that each frequency enters the classification method with the same weight. The lower frequency $\nu_l(i)$ and the upper frequency $\nu_u(i)$ of the ith bin are given by
\[
\begin{aligned}
\nu_l(i) &= \nu_L + \frac{\nu_U - \nu_L}{n_{bin}}\, i , \\
\nu_u(i) &= \nu_L + \frac{\nu_U - \nu_L}{n_{bin}}\, (i + 1) , \qquad \text{with } i = 0, \ldots, n_{bin} - 2 .
\end{aligned}
\tag{11}
\]

There are two well established techniques for the partitioning of the frequency space which are motivated by human perception and are used in the following. It is known that the human ear processes auditory input in distinct frequency bands. Within each band, the ear is not able to distinguish any two frequencies. The perception of pitch is not linear with frequency across the whole range accessible to humans. To account for this, the erb- and mel-scale, which are closely related to each other, were introduced. Their relation to the frequency ν in Hz is given by [9]
\[
\nu_{erb/mel} = a \, \log_b\left( c + \frac{d\,\nu}{e} \right) . \tag{12}
\]
The coefficients for the erb- and mel-scale are a = 21.4, b = 10, c = 1, d = 4.37, e = 1000 and a = 1000, b = 2, c = 1, d = 1, e = 1000, respectively.
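As a minimal sketch of how Eqs. (11) and (12) combine, the following Python code maps frequencies to the erb- or mel-scale with the coefficients quoted above, places equally spaced band edges on that scale and maps them back to Hz. The function names and the `scale` argument are illustrative assumptions; how a reserved last bin or the index range of Eq. (11) is handled is left to the caller.

```python
from math import log

# Coefficients (a, b, c, d, e) of Eq. (12) as quoted in the text.
ERB = dict(a=21.4, b=10.0, c=1.0, d=4.37, e=1000.0)
MEL = dict(a=1000.0, b=2.0, c=1.0, d=1.0, e=1000.0)


def to_scale(nu, a, b, c, d, e):
    """Eq. (12): frequency nu in Hz mapped to the perceptual scale."""
    return a * log(c + d * nu / e, b)


def from_scale(s, a, b, c, d, e):
    """Inverse of Eq. (12): scale value mapped back to Hz."""
    return (b ** (s / a) - c) * e / d


def band_edges(nu_low, nu_up, n_bands, scale=None):
    """Edges of n_bands bands of equal width, in Hz (scale=None) or on the given scale."""
    if scale is None:
        lo, up = nu_low, nu_up
    else:
        lo, up = to_scale(nu_low, **scale), to_scale(nu_up, **scale)
    edges = [lo + (up - lo) * i / n_bands for i in range(n_bands + 1)]
    return edges if scale is None else [from_scale(s, **scale) for s in edges]


if __name__ == "__main__":
    for name, scale in (("equi", None), ("erb", ERB), ("mel", MEL)):
        print(name, [round(f, 1) for f in band_edges(0.0, 4000.0, 5, scale)])
```

On both perceptual scales the bands become wider in Hz towards higher frequencies, which is the qualitative behaviour shown in Fig. 1(a).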

[Figure 1: two panels, (a) and (b). Vertical axis: binning (bin number, 0 to 80); horizontal axis: Frequency [Hz], 0 to 4000 Hz in (a) and 0 to 2.0·10^4 Hz in (b). Curves in (a): erb, mel, equi-partitioned.]

Fig. 1. Comparison of different frequency partitioning methods. A simple equi-width partitioning is displayed along with psycho-acoustically motivated erb- and mel-scaled versions. In all three cases the whole frequency space is partitioned into $n_{bin} = 90$ bins. In Fig. 1(a) the first $n_{bin} - 1$ bins are used for ν = [0, 4000] Hz and the last bin is reserved for the remaining frequencies ν = [4000, 22050] Hz. Figure 1(b) displays the partitioning where fundamental frequencies and their $2^n$-harmonics are pooled in the same bin. A shaded rectangle labels the frequency range up to 4000 Hz.

Table 2. Example of the filtering condition of the highest bin for $2^n$-harmonics partitioning with a lower frequency threshold of ≈ 5 Hz and $n_{bin} = 5$.

n     lower frequency [Hz]    upper frequency [Hz]
0     19845.00                22050.00
1     9922.50                 11025.00
2     4961.25                 5512.50
3     2480.63                 2756.25
4     1240.31                 1378.13
5     620.16                  689.06
6     310.08                  344.53
7     155.04                  172.27
8     77.52                   86.13
9     38.76                   43.07
10    19.38                   21.53
11    9.69                    10.77

After transforming to these scales, the partitioning is carried out according to Eq. (11). The mapping from the linear scale to the erb- and mel-scale is shown in Fig. 1(a). An alternative partitioning method, hereafter called $2^n$-harmonics, divides the frequency range $[\frac{1}{2}\nu_{nyq}, \nu_{nyq}]$ according to Eq. (11), where $\nu_{nyq}$ is the Nyquist frequency corresponding to the sampling rate of the digital recording. The resulting minimum and maximum frequencies of each bin and their $2^{-n}$-multiples (n ≥ 1) define the filtering conditions (see Fig. 1(b)). The maximum value of n is determined by a lower frequency threshold. As an example, Table 2 lists the filtering conditions of the highest bin for a lower frequency threshold of ≈ 5 Hz and $n_{bin} = 5$. For $2^n$-harmonics, in contrast to the previously described partitioning methods, there is no linear correlation between bin number and frequency. Every band contains frequencies from the whole spectral range.
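The $2^n$-harmonics construction can be reproduced with a short sketch. The function name and the stopping rule (keep halving while the lower edge stays at or above the threshold) are assumptions; with a threshold of 5 Hz this rule yields the twelve rows of Table 2 for the highest of five bins.

```python
def harmonic_bin_bands(i, n_bin, nu_nyq=22050.0, nu_min=5.0):
    """Frequency intervals pooled into bin i of the 2^n-harmonics partitioning.

    The base interval lies in [nu_nyq / 2, nu_nyq]; its 2^-n multiples are added
    as long as the lower edge stays at or above nu_min (assumed stopping rule).
    Returns a list of (n, lower_frequency, upper_frequency) tuples in Hz.
    """
    width = (nu_nyq / 2) / n_bin              # equal-width split of the top octave
    lower = nu_nyq / 2 + width * i
    upper = lower + width
    bands = []
    n = 0
    while lower / 2 ** n >= nu_min:
        bands.append((n, lower / 2 ** n, upper / 2 ** n))
        n += 1
    return bands


if __name__ == "__main__":
    for n, lo, up in harmonic_bin_bands(i=4, n_bin=5):
        print(f"{n:2d} {lo:10.2f} {up:10.2f}")
```

Each bin therefore collects one narrow interval per octave, which is why every band contains contributions from the whole spectral range.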


[Figure 2: (a) Original data sequence, Amplitude [a.u.] versus Time [s] (0.00 to 0.50 s). (b) Effects of the partitioning methods, one column per method: Power [a.u.] versus Frequencies [Hz] with bins 0 to 4 indicated; band-filtered Amplitude [a.u.] versus Time [s] for each bin; matrices of Coupling Complexity C [bit] between Frequency Bands 0 to 4.]

Fig. 2. Example of the coupling complexity C(L) between the frequency-band filtered signals (L = 4, delay T = 100). Figure 2(a): Original data, a 0.5 s segment from the Cadenza of Beethoven's concerto played on a violin by William Muller (recording set 1). Figure 2(b): Four different frequency partitioning methods are compared. The first row shows power spectra with the frequency binning indicated by shaded regions. The second row presents the resulting filtered signals for every bin. Note that the filtered data have exactly the length of the original data. In the lowest row, the matrices of coupling complexity between the filtered signals are displayed. The columns of Fig. 2(b) correspond to equi-, erb-, mel- and $2^n$-harmonic partitioning, respectively.

To the best of our knowledge, this is the first study of the coupling complexity between frequency bands. As an illustration, we show results derived from the violin data set 1 with R. Ricci (see Fig. 2 and Table 1): a 0.5 s sequence from the Cadenza of Beethoven's concerto played on a violin by William Muller. The power spectrum is dominated by a fundamental frequency of ≈ 900 Hz and its harmonics at 1800, 2700 and 3600 Hz. The frequency range up to the highest pitch frequency which can be achieved with the violin is partitioned according to the methods described above (equi, erb, mel and $2^n$-harmonics) into $n_{bin} = 5$ bins. The data are filtered according to these frequency bands and the coupling complexity is assessed. Since coupling complexity is symmetric and vanishes for $\{\sigma_i\} \equiv \{\pi_i\}$, the coupling of the frequency bands may in this case be represented as a symmetric 5 × 5 matrix with vanishing diagonal elements. Whether the complexity values are a signature of the violins, and which frequency partitioning approach is best suited, will be checked by the classification procedure.
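The coupling matrix described above can be sketched as follows. The code computes C(L) of Eq. (10) from two already band-filtered signals and collects the values for all pairs of bands; the filtering itself is not shown. All function names are illustrative, logarithms are taken to base 2 (an assumption consistent with the unit bit in Fig. 2), and ties between equal values are resolved by temporal order.

```python
from collections import Counter
from itertools import combinations
from math import log2


def symbols(x, L=4, T=1):
    """Ordinal patterns of the delay vectors (x[t], x[t+T], ..., x[t+(L-1)T])."""
    n_sym = len(x) - (L - 1) * T
    return [tuple(sorted(range(L), key=lambda k: x[t + k * T])) for t in range(n_sym)]


def entropy(counts):
    """Shannon entropy in bits of a Counter of occurrences."""
    n = sum(counts.values())
    return sum(c / n * log2(n / c) for c in counts.values())


def coupling_complexity(x, y, L=4, T=1):
    """Eq. (10): C(L) = min{H(pi), H(sigma)} - [H(joint) - H(tau)]."""
    pi_seq, sigma_seq = symbols(x, L, T), symbols(y, L, T)
    joint = Counter(zip(pi_seq, sigma_seq))
    tau = Counter()
    for (pi, sigma), c in joint.items():
        inv = sorted(range(L), key=sigma.__getitem__)    # sigma^{-1}
        tau[tuple(inv[p] for p in pi)] += c              # Eq. (9): pool pairs sharing tau
    h_min = min(entropy(Counter(pi_seq)), entropy(Counter(sigma_seq)))
    return h_min - (entropy(joint) - entropy(tau))


def coupling_matrix(bands, L=4, T=1):
    """Symmetric matrix of C(L) between all pairs of band signals, zero diagonal."""
    n = len(bands)
    C = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        C[i][j] = C[j][i] = coupling_complexity(bands[i], bands[j], L, T)
    return C


def feature_vector(C):
    """Upper triangle of the coupling matrix: n (n - 1) / 2 features."""
    return [C[i][j] for i, j in combinations(range(len(C)), 2)]
```

For five bands this gives a 5 × 5 matrix with ten independent entries; only the upper triangle needs to be stored because the diagonal vanishes and the matrix is symmetric.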

5 Classification

Support vector machines, a technique from the field of machine learning, are employed for classification. $N_a$ feature vectors $f_i = (f_{qi})_{q=1,\ldots,N_f} \in \mathbb{R}^{N_f}$ with $i = 1, \ldots, N_a$ ($N_f$ is the number of features) are generated, each with a known class membership $Cl_i$ out of $N_c$ different classes. SVMs determine the hyperplane which separates feature vectors of different classes in a higher dimensional space by maximising the separation margins between these classes. Only the points close to the hyperplane are decisive for its location; these are called support vectors. The classification performance can be enhanced by considering non-linear transformations of the feature vectors using special kernel functions. By applying kernel functions, for example radial basis functions (RBF) $K_{RBF}(f_i, f_j) = \exp(-\gamma \, \|f_i - f_j\|^2)$ with γ > 0, the data are transformed to a new space in which they can be linearly separable even when the original data are not. In this study the library LIBSVM was used. For details, see [10, 11]. For classification purposes, the data have to be divided into a training and a testing set. In the training set, the class label of each feature vector is “known” to the algorithm, and the classification rate is estimated and optimised by cross validation. During n-fold cross validation, the training set is further divided into n parts, and the class membership of each feature vector of one part is “predicted” by a model based on the vectors of the (n − 1) remaining parts. In the case of RBF, there are two SVM-parameters which are determined using n-fold cross validation: γ, and a value C > 0 associated with the penalty parameter of the error term [11]. For the assessment of the classification, a contingency table is used. Its elements are $c_{lm}$, where l denotes the correct class and m indicates the assigned class. $N_a = \sum_{l,m} c_{lm}$ is the total number of classified feature vectors.
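Two ingredients of this procedure are simple enough to sketch directly: the RBF kernel and the index bookkeeping of n-fold cross validation. The sketch below does not train an SVM (in the study this is done by LIBSVM); the function names, the shuffling and the fold assignment are illustrative assumptions.

```python
from math import exp
import random


def rbf_kernel(f_i, f_j, gamma):
    """K_RBF(f_i, f_j) = exp(-gamma * ||f_i - f_j||^2) with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Index pairs (fit, held_out) for n-fold cross validation.

    Every sample is held out exactly once; the remaining n_folds - 1 parts
    would be used to fit the model that predicts the held-out part.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    splits = []
    for k, held_out in enumerate(folds):
        fit = [i for m, fold in enumerate(folds) if m != k for i in fold]
        splits.append((fit, held_out))
    return splits


if __name__ == "__main__":
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
    for fit, held_out in k_fold_splits(10, n_folds=5):
        print(sorted(held_out), len(fit))
```

A parameter search would repeat the cross validation for each candidate pair (γ, C) and keep the pair with the highest correctness.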
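For a class k the following quantities are defined:
\[
\begin{aligned}
\text{True Positives:} \quad & TP_k = c_{kk} \\
\text{False Negatives:} \quad & FN_k = \sum_{j \neq k} c_{kj} \\
\text{False Positives:} \quad & FP_k = \sum_{j \neq k} c_{jk} \\
\text{True Negatives:} \quad & TN_k = N_a - TP_k - FN_k - FP_k .
\end{aligned}
\tag{13}
\]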

The best parameter combination in the training set is determined by the highest value of correctness
\[
\rho = \frac{\sum_k TP_k}{N_a} , \qquad 0 \le \rho \le 1 , \tag{14}
\]
where ρ = 1 means perfect classification. Based on these parameters a model of the data is generated, which can then be applied to the feature vectors of the testing set. Besides the resulting correctness ρ, the following measures are used as classification metrics:
\[
\begin{aligned}
\text{sensitivity:} \quad & SE_k = \frac{TP_k}{TP_k + FN_k} \\
\text{specificity:} \quad & SP_k = \frac{TN_k}{TN_k + FP_k} \\
\text{positive predictive value:} \quad & PPV_k = \frac{TP_k}{TP_k + FP_k} \\
\text{negative predictive value:} \quad & NPV_k = \frac{TN_k}{FN_k + TN_k} .
\end{aligned}
\tag{15}
\]
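The quantities of Eqs. (13) to (15) follow directly from the contingency table. The sketch below assumes the convention stated in the text, i.e. the first index of the table is the correct class and the second the assigned class; the function names and the small example table are made up for illustration.

```python
def correctness(c):
    """Eq. (14): fraction of correctly classified feature vectors."""
    return sum(c[k][k] for k in range(len(c))) / sum(map(sum, c))


def class_metrics(c, k):
    """Eqs. (13) and (15) for class k; c[l][m] counts vectors of class l assigned to m.

    Assumes that none of the denominators is zero.
    """
    n_a = sum(map(sum, c))
    tp = c[k][k]
    fn = sum(c[k]) - tp                       # class k assigned to another class
    fp = sum(row[k] for row in c) - tp        # other classes assigned to class k
    tn = n_a - tp - fn - fp
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (fn + tn),
    }


if __name__ == "__main__":
    table = [[8, 1, 1],
             [2, 6, 2],
             [0, 1, 9]]                       # hypothetical 3-class example
    print(correctness(table))
    for k in range(3):
        print(k, class_metrics(table, k))
```

Weighting each class sensitivity by the relative size of its class and summing gives back the correctness ρ.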



From Eqs. (14) and (15), we see that the correctness ρ is the weighted mean of the class-specific sensitivities. A simple feature reduction method is applied to select the features which are most decisive with respect to the classes. Features q are reordered according to their mutual information $I_q$ with the classes Cl,
\[
I_q = M(f_q, Cl) = H(f_q) + H(Cl) - H^J(f_q, Cl) \qquad \text{with } q = 1, \ldots, N_f , \tag{16}
\]
and those with the highest information content are considered for classification. $H(f_q)$ denotes the Shannon entropy of feature $f_q$, H(Cl) is the Shannon entropy of the classes, and the joint entropy of feature and classes is abbreviated as $H^J(f_q, Cl)$. This ansatz only considers the importance of single features, disregarding the possible relevance of special combinations of features for the classification task. Feature reduction is needed to compare the classification performance of two measures with different numbers of features. Later on, we will apply this technique to compare the classification results of power spectra with those of coupling complexity.
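A possible implementation of the ranking in Eq. (16) is sketched below. Since the feature values are continuous, they have to be discretised before entropies can be estimated; the equal-width binning with `n_levels` levels used here is an assumption, as the text does not state how the entropies were estimated. Function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return sum(c / n * log2(n / c) for c in counts.values())


def mutual_information(feature, labels, n_levels=8):
    """Eq. (16): I = H(f) + H(Cl) - H(f, Cl) after equal-width binning of f."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_levels or 1.0       # constant feature: one single level
    binned = [min(int((v - lo) / width), n_levels - 1) for v in feature]
    return entropy(binned) + entropy(labels) - entropy(list(zip(binned, labels)))


def rank_features(X, labels, n_levels=8):
    """Feature indices sorted by decreasing mutual information with the classes."""
    scores = [mutual_information([row[q] for row in X], labels, n_levels)
              for q in range(len(X[0]))]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)


if __name__ == "__main__":
    X = [[0.1, 5.0], [0.2, 5.0], [0.9, 5.0], [1.0, 5.0]]   # hypothetical features
    labels = [0, 0, 1, 1]
    print(rank_features(X, labels))
```

Keeping only the first entries of the returned ranking reduces the number of features while scoring each feature on its own, as noted in the text.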

6 Results

Coupling complexity is calculated with a sliding window technique based on just one channel (the left channel) of the stereo recording. The size of the non-overlapping windows is chosen to guarantee the statistical validity of the symbol representation. For a length L = 4, there exist $(L!)^2 = 576$ symbol pairs. To be on the safe side, a window size of w = 22050 data points is selected, corresponding to 0.5 s of data (sampling rate $r = 44100\ \mathrm{s}^{-1}$). The feature vectors are defined by the coupling complexity between the filtered signals. For classification with the SVM, the windows of every recording are randomly but uniquely divided into training and testing sets of about the same size without considering the musical structure of the piece: the separation takes neither melodic themes nor rhythmic aspects into account. It may happen that a musical motif is represented in both sets or enters just one of the two sets. The training set is used to check the correctness ρ for the different parameters, that is, delay time and partitioning method. Using five-fold cross validation provides a better estimate of the classification results in the testing set. For every time window, a feature vector is generated and stored together with its class membership, i.e. the luthier of the violin. In a short test, varying the number of bins $n_{bin}$, the cross validation results of the linear kernel $K_{lin}(f_i, f_j) = f_i^T f_j$ are compared with those of the RBF-kernel $K_{RBF}$. As expected, the values of the correctness with the RBF-kernel are consistently superior. These results support the choice of the RBF-kernel and a value of $n_{bin} = 90$. The number of features $n_f$ is defined by the number of bins through $n_f = (n_{bin} - 1)\, n_{bin} / 2$. For the chosen $n_{bin} = 90$, the number of features is $n_f = 4005$. With the best set of parameters in the training set, an SVM-model is generated and applied to the testing set. This approach is used with both recording sets.
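The bookkeeping of this setup can be illustrated with a short sketch. The function names, the shuffling with a fixed seed and the exact half split are assumptions; the sketch only cuts a signal into non-overlapping windows, splits the window indices into two halves, and reports how many symbols per possible symbol pair a window provides on average.

```python
import random
from math import factorial


def non_overlapping_windows(signal, w=22050):
    """Consecutive windows of w samples; an incomplete last window is dropped."""
    return [signal[s:s + w] for s in range(0, len(signal) - w + 1, w)]


def random_halves(n_windows, seed=0):
    """Random, disjoint split of window indices into training and testing sets."""
    idx = list(range(n_windows))
    random.Random(seed).shuffle(idx)
    half = n_windows // 2
    return sorted(idx[:half]), sorted(idx[half:])


def symbols_per_pair(w=22050, L=4, T=100):
    """Average number of symbols per possible pair (pi, sigma) in one window."""
    return (w - (L - 1) * T) / factorial(L) ** 2


if __name__ == "__main__":
    n_windows = len(non_overlapping_windows([0] * (36 * 44100)))   # a 36 s recording
    train, test = random_halves(n_windows)
    print(n_windows, len(train), len(test), round(symbols_per_pair(), 1))
```

For a recording of 36 s this gives 72 windows per violin, and with L = 4 and T = 100 each window contributes 21750 symbols, i.e. roughly 38 per possible symbol pair.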
For comparison, the same classification scheme is tested, with the same training set and window selection, using just the power spectra. To benefit from the performance of the fast Fourier transform, the data of each window are zero-padded to $2^{15}$ points, yielding $n_f = 16385$ spectral entries. In Table 3 the resulting correctness rates are listed for the two recording sets. Coupling complexity is calculated using ordinal patterns of length L = 4 and delay times T = 100 and T = 1000. These correspond to time horizons of $T_h = 0.0068$ s and $T_h = 0.068$ s, related to frequencies $\nu_h = 147.0$ Hz and $\nu_h = 14.7$ Hz, respectively. In general, it turns out that classification rates are significantly enhanced with the

Table 3. Classification results for the training set. The correctness ρ of the five-fold cross validation is listed for both recording sets (see Table 1) (L = 4, window size w = 22050, $n_{bin} = 90$) for delay T = 100 and T = 1000. The highest value of ρ for each set is marked with an asterisk. For comparison, the classification based only on raw power spectra is presented at the bottom.

Correctness

Partitioning method    Set 1                   Set 2
                       T = 100    T = 1000     T = 100    T = 1000
equi                   0.818*     0.590        0.893*     0.807
erb-scale              0.761      0.707        0.866      0.735
mel-scale              0.810      0.670        0.880      0.733
2^n-harmonic           0.638      0.623        0.850      0.854
power spectrum         0.342                   0.547

[Figure 3: correctness (0.5 to 0.9) versus # of features (1000 to 4000); curves: Recording Set 1, Recording Set 2.]

Fig. 3. Classification rates of coupling complexity as a function of the number of features retained by the information based feature reduction approach (see Eq. (16)).

shorter delay T = 100. An exception to this is the $2^n$-harmonic frequency partition, where the classification rates for the two tested delay times are similar. This weaker dependence on T might be due to the mixing of frequencies from the whole frequency range in every bin of the $2^n$-harmonic partitioning. Classification based on power spectra yields relatively low correctness values, indicating the suitability of coupling complexity to differentiate between the instruments. Reducing the number of features, i.e. the number of spectral elements, according to Eq. (16) from its initial value $n_f = 16385$ to $n_f = 4005$ lowers the classification rate based on power spectra further to ρ = 0.500 (recording set 2). In both recording sets, the best classification in the training set is achieved with equi-partitioning of the frequencies and delay time T = 100. Applying the feature reduction approach to these cases in the training sets reveals that the classification results rely on a relatively low number of features. Even a reduction by a factor of two down to $n_f \approx 2000$ still yields considerable correctness rates (see Fig. 3). For both recording sets, equi-partitioning, delay time T = 100, and the corresponding SVM-parameters (γ, C) are used for the testing set. For recording set 1 (ρ = 0.841, see Table 4) as well as recording set 2 (ρ = 0.900, see Table 5), the values of correctness in the testing set are slightly higher than those of the cross validation procedure in the training sets. In recording set 1 with $N_c = 18$ different violins, sensitivity ranges from 0.622 to 0.947 with a standard deviation of 0.085. All specificity values are at least 0.979. Recording set 2 with $N_c = 7$ yields an even better classification, with sensitivity ranging from 0.826 to 0.970 and a standard deviation of 0.052.



Table 4. Classification results for the testing set of recording 1 (see Table 1). The best parameter combination of the training set is used (see Table 3).

Ruggiero Ricci, L.v. Beethoven: Violin Concerto D major, op. 61, Cadenza 2nd movement (Hubert Léonard)

Luthier              SE       SP       PPV      NPV
G. Aif               0.794    0.987    0.771    0.989
S. Zygmuntowicz      0.735    0.993    0.862    0.985
P. Greiner           0.865    0.980    0.727    0.992
P. Pistoni           0.879    0.998    0.967    0.994
W. Muller            0.895    0.989    0.829    0.993
J. Dilworth          0.861    0.987    0.795    0.992
P. Robin             0.947    0.993    0.900    0.997
A. Giordano          0.829    0.984    0.744    0.990
D. Baqué             0.943    0.987    0.805    0.997
R. Hargrave          0.861    0.997    0.939    0.992
Ch. Götting          0.878    0.992    0.878    0.992
L. Bellini           0.886    0.997    0.939    0.994
F. Chaudière         0.892    0.979    0.717    0.993
D. Fantin            0.703    0.998    0.963    0.982
J. Curtin            0.909    0.997    0.937    0.995
R. Regazzi           0.622    0.990    0.793    0.977
G.C. Guicciardi      0.861    0.995    0.912    0.992
R. Scrollavezza      0.784    0.989    0.806    0.987

correctness ρ = 0.841

Table 5. Classification results for the testing set of recording 2 (see Table 1). The best parameter combination of the training set is used (see Table 3).

Saschko Gawriloff, J.S. Bach: Partita II D Minor, Sarabande, BWV 1004

Luthier              SE       SP       PPV      NPV
N. Amati             0.951    0.991    0.945    0.992
G.B. Guadagnini      0.844    0.982    0.882    0.975
A. Guaneri           0.899    0.978    0.866    0.984
P. Guaneri           0.970    0.984    0.910    0.995
A. Horvath           0.919    0.985    0.908    0.987
A. Stradivari        0.893    0.980    0.883    0.982
J.-B. Vuillaume      0.826    0.986    0.908    0.971

correctness ρ = 0.900

7 Discussion In the last decade, ordinal symbolic dynamics has become an important new methodology for the quantitative characterisation of coupled dynamical systems. In this contribution, a measure based on ordinal symbolic dynamics was applied to sound data. Motivated by the sound generating mechanisms of violins, where the different parts of the instrument act in a complex manner in frequency space, we proposed a measure describing the relationship between the different frequency bands to characterise the

484


sound of the violin. The objective was to test, whether the coupling complexity is a suitable tool to distinguish the acoustical characteristics of different violins. For this purpose, two sets of recordings with solo violin music were selected. In each set, the same violinist plays the same piece with different violins under the very same conditions. Due to this setting, it is reasonable to attribute any difference in sound (between the various versions) to the specific instruments. In previous applications [3], coupling complexity C(L) was applied to two signals of different subsystems. In this contribution, C(L) was used to describe the relationship between two signals generated by different frequency filters of the same time series, thus providing a signature of the violin. Four different partitioning methods in frequency space were implemented and the corresponding classification rate was assessed. For both recording sets, the data were divided into training and testing set. All the different parameters, that is delay time T in the generation of symbolic representations, and partition methods were ranked according to the resulting correctness (see Eq. (14)). The correctness was calculated in the training sets using a SVM with five-fold cross validation. The set of parameters leading to the highest correctness was used to build a SVM model, which was then applied to the testing set. Best results were achieved in the training sets with the simplest technique, i.e. an equi-partitioning of the violin’s pitch range, up to 4000 Hz, putting the remaining frequencies (those beyond 4000 Hz till Nyquist frequency νnyq = 22050 Hz) in one bin. It is surprising that any psycho-acoustically motivated frequency binning does not outperform simple equi-partitioning. One may speculate that all the violins are designed to satisfy the ear and that mel- or erb-scaled based partitioning therefore tend to level out the differences. 
Moreover, combining the fundamental frequency and its 2n-harmonic does not improve the classification rate, implying that the different instruments behave more similarly in the coupling of these signals. Only for recording set 2 and delay time T = 1000 is the 2n-harmonic partitioning convincing, yielding the highest value ρ = 0.854. In the training sets, correctness values of 0.818 (recording set 1) and 0.893 (recording set 2) were achieved. Recording set 2 consists of only Nc = 7 classes (compared to Nc = 18 for the first one) and of pieces of longer duration. These facts may explain the higher classification rates. To put these classification rates into perspective, the same classification method was applied to the raw power spectra in the training set, yielding significantly lower rates. For a random classifier, a correctness of ρ = 0.056 (recording set 1) and ρ = 0.143 (recording set 2) would be expected. This strongly suggests that the different instruments have a specific signature of coupling complexity reflecting the characteristic interplay of frequencies in the violin.
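The chance levels quoted above are consistent with a classifier that picks each of the Nc classes with equal probability, assuming that the classes are equally represented and that the correctness is the fraction of correctly assigned samples. Under these assumptions the expected correctness is 1/Nc. A minimal check, with the hypothetical helper name `chance_correctness`:

```python
def chance_correctness(n_classes):
    """Expected fraction of correct assignments for uniform random guessing
    over n_classes equally represented classes (an assumption of this sketch)."""
    if n_classes < 1:
        raise ValueError("n_classes must be positive")
    return 1.0 / n_classes


# Class counts and training-set correctness values as reported in the text.
for name, n_classes, reported in [("recording set 1", 18, 0.818),
                                  ("recording set 2", 7, 0.893)]:
    chance = chance_correctness(n_classes)
    print(f"{name}: chance = {chance:.3f}, reported = {reported:.3f}, "
          f"ratio = {reported / chance:.1f}")
```

Rounded to three decimals, this reproduces ρ = 0.056 for Nc = 18 and ρ = 0.143 for Nc = 7.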

8 Conclusion

Coupling complexity was introduced to provide a new perspective on the relation between sub-systems. In this study, time series were decomposed according to their spectral properties and the resulting filtered data were subjected to a coupling complexity analysis. The coupling complexity indices were considered as a characteristic feature of the system under investigation. Specifically, the complex interplay of the constructional elements of the violin, acting in different frequency ranges, was investigated. We tested the hypothesis that coupling complexity between frequency-filtered signals of solo violin music is a signature of the instrument. A classification approach using a support vector machine was performed with strict separation of the training and testing sets.
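The symbolic step that precedes any such analysis can be illustrated with a short Python sketch. It assumes one common ordinal-pattern convention in the spirit of Bandt and Pompe [1] (the pattern is the index order that sorts the delayed window); the convention defined earlier in the paper may differ. The function names, the parameter names `order` and `delay`, and the reading of `delay` as the delay time T are assumptions. Coupling complexity itself, which additionally relates the symbol sequences of two signals, is not reproduced here.

```python
from collections import Counter
from math import log2


def ordinal_patterns(series, order, delay):
    """Map a series to ordinal patterns of length `order` with time delay `delay`.

    Convention assumed here: the pattern is the tuple of window positions
    that sorts the delayed window in ascending order (ties keep their
    original order because Python's sort is stable).
    """
    if order < 2 or delay < 1:
        raise ValueError("order must be >= 2 and delay >= 1")
    span = (order - 1) * delay
    patterns = []
    for t in range(len(series) - span):
        window = [series[t + k * delay] for k in range(order)]
        patterns.append(tuple(sorted(range(order), key=window.__getitem__)))
    return patterns


def pattern_entropy(patterns):
    """Shannon entropy (in bits) of the empirical pattern distribution."""
    counts = Counter(patterns)
    total = len(patterns)
    return -sum(c / total * log2(c / total) for c in counts.values())


x = [4, 7, 9, 10, 6, 11, 3]
symbols = ordinal_patterns(x, order=2, delay=1)
print(Counter(symbols))
print(pattern_entropy(symbols))
```

For the short example series, four of the six neighbouring pairs are increasing and two are decreasing, so the pattern entropy is below its maximum of one bit. In the setting of this paper, such symbol sequences would be generated for each frequency-filtered signal before the coupling between bands is assessed.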



For both recording sets, the best classification in the training sets was achieved with the same parameter combination. These parameters yielded correctness values well beyond 0.80 in the testing sets. This confirms that coupling complexity between frequency-filtered signals may be interpreted as a fingerprint of the violin. There are many possible applications of this approach, ranging from medical diagnostics (e.g. the evaluation of brain or cardiac signals) to quality management in industrial processes, for example the analysis or classification of structure-borne sound.

J.M.A. was financially supported by the Spanish Ministry of Science and Innovation, Grant No. MTM2012-31698.

References

1. C. Bandt, B. Pompe, Phys. Rev. Lett. 88, 174102 (2002)
2. R. Monetti, W. Bunk, T. Aschenbrenner, F. Jamitzky, Phys. Rev. E 79, 046207 (2009)
3. J.M. Amigó, R. Monetti, T. Aschenbrenner, W. Bunk, Chaos 22, 013105 (2012)
4. R. Monetti, J.M. Amigó, T. Aschenbrenner, W. Bunk, Eur. Phys. J. Special Topics 222, 421 (2013)
5. H. von Helmholtz, Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik (Vieweg & Sohn, Braunschweig, 1913)
6. J.C. Schelleng, Scientific American, 87 (1973)
7. G. Bissinger, J. Acoust. Soc. Amer. 120, 482 (2006)
8. M. Schleske, Catgut Acoust. Soc. J. 4, 50 (2002)
9. T. Kinnunen, Spectral Features for Automatic Text-Independent Speaker Recognition, Licentiate’s Thesis, University of Joensuu, Finland, 2003
10. C.-C. Chang, C.-J. Lin, ACM Trans. Intell. Syst. Technol. 2, 27:1 (2011)
11. C.-W. Hsu, C.-C. Chang, C.-J. Lin, A practical guide to support vector classification, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (2010)
12. R. Ricci, N. Shiozaki, The Legacy of Cremona – Ruggiero Ricci plays 18 Contemporary Violins (CD), Dynamic, Genoa (2001)
13. S. Gawriloff, K. Ratner, What about this, Mr. Paganini? (CD), Tacet Musikproduktion, Stuttgart (1995)
