Chord Recognition with Stacked Denoising Autoencoders
Author: Nikolaas Steenbergen
Supervisors: Prof. Dr. Theo Gevers and Dr. John Ashley Burgoyne
A thesis submitted in fulfilment of the requirements for the degree of Master of Science in Artificial Intelligence in the Faculty of Science
July 2014
Abstract

In this thesis I propose two different approaches for chord recognition based on stacked denoising autoencoders working directly on the FFT. These approaches do not use any intermediate targets such as pitch class profiles/chroma vectors or the Tonnetz, in an attempt to remove any restrictions that might be imposed by such an interpretation. It is shown that these systems can significantly outperform a reference system based on state-of-the-art features. The first approach computes chord probabilities directly from an FFT excerpt of the audio data. In the second approach, two additional inputs, filtered with a median filter over different time spans, are added to the input. Hereafter, in both systems, a hidden Markov model is used to perform temporal smoothing after pre-classifying chords. It is shown that using several different temporal resolutions can increase classification ability in terms of weighted chord symbol recall. All algorithms are tested in depth on the Beatles Isophonics and Billboard datasets, on a restricted chord vocabulary containing major and minor chords and an extended chord vocabulary containing major, minor, 7th and inverted chord symbols. In addition to the weighted chord symbol recall, a post-hoc Friedman multiple-comparison test of the statistical significance of the performance differences is also presented.
Acknowledgements

I would like to thank Theo Gevers and John Ashley Burgoyne for supervising my thesis. Thanks to Ashley Burgoyne for his helpful, thorough advice and guidance. Thanks to Amogh Gudi for all the fruitful discussions about deep learning techniques while lifting weights and sweating in the gym. Special thanks to my parents, Brigitte and Christiaan Steenbergen, and my brothers Alexander and Florian; without their help, support and love, I would not be where I am now.
Contents

1 Introduction
2 Musical Background
  2.1 Notes and Pitch
  2.2 Chords
  2.3 Other Structures in Music
3 Related Work
  3.1 Preprocessing / Features
    3.1.1 PCP / Chroma Vector Calculation
    3.1.2 Minor Pitch Changes
    3.1.3 Percussive Noise Reduction
    3.1.4 Repeating Patterns
    3.1.5 Harmonic / Enhanced Pitch Class Profile
    3.1.6 Modelling Human Loudness Perception
    3.1.7 Tonnetz / Tonal Centroid
  3.2 Classification
    3.2.1 Template Approaches
    3.2.2 Data-Driven Higher Context Models
4 Stacked Denoising Autoencoders
  4.1 Autoencoders
  4.2 Autoencoders and Denoising
  4.3 Training Multiple Layers
  4.4 Dropout
5 Chord Recognition Systems
  5.1 Comparison System
    5.1.1 Basic Pitch Class Profile Features
    5.1.2 Comparison System Simplified Harmony Progression Analyzer
    5.1.3 Harmonic Percussive Sound Separation
    5.1.4 Tuning and Loudness-Based PCPs
    5.1.5 HMMs
  5.2 Stacked Denoising Autoencoders for Chord Recognition
    5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders
    5.2.2 Stacked Denoising Autoencoders for Chord Recognition
    5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders
6 Results
  6.1 Reduction of Chord Vocabulary
  6.2 Score Computation
    6.2.1 Weighted Chord Symbol Recall
  6.3 Training Systems Setup
  6.4 Significance Testing
  6.5 Beatles Dataset
    6.5.1 Restricted Major-Minor Chord Vocabulary
    6.5.2 Extended Chord Vocabulary
  6.6 Billboard Dataset
    6.6.1 Restricted Major-Minor Chord Vocabulary
    6.6.2 Extended Chord Vocabulary
  6.7 Weights
7 Discussion
  7.1 Performance on the Different Datasets
  7.2 SDAE
  7.3 MR-SDAE
  7.4 Weights
  7.5 Extensions
8 Conclusion
A Joint Optimization
  A.1 Basic System Outline
  A.2 Gradient of the Hidden Markov Model
  A.3 Adjusting Neural Network Parameters
  A.4 Updating HMM Parameters
  A.5 Neural Network
  A.6 Hidden Markov Model
  A.7 Combined Training
  A.8 Joint Optimization
  A.9 Joint Optimization Possible Interpretation

List of Figures

1 Piano keyboard and MIDI note range
2 Conventional autoencoder training
3 Denoising autoencoder training
4 Stacked denoising autoencoder training
5 SDAE for chord recognition
6 MR-SDAE for chord recognition
7 Post-hoc multiple-comparison Friedman tests for the Beatles restricted chord vocabulary
8 Whisker plot for the Beatles restricted chord vocabulary
9 Post-hoc multiple-comparison Friedman tests for the Beatles extended chord vocabulary
10 Whisker plot for the Beatles extended chord vocabulary
11 Post-hoc multiple-comparison Friedman tests for the Billboard restricted chord vocabulary
12 Post-hoc multiple-comparison Friedman tests for the Billboard extended chord vocabulary
13 Visualization of weights of the input layer of the SDAE
14 Plot of sum of absolute values for the input layer of the SDAE
15 Absolute training error for joint optimization
16 Classification performance of joint optimization while training

List of Tables

1 Semitone steps and intervals
2 Intervals and chords
3 WCSR for the Beatles restricted chord vocabulary
4 WCSR for the Beatles extended chord vocabulary
5 WCSR for the Billboard restricted chord vocabulary
6 WCSR for the Billboard extended chord vocabulary
7 Results for chord recognition in MIREX 2013
1 Introduction
The increasing amount of digitized music available online has given rise to a demand for automatic analysis methods. A new subfield of information retrieval has emerged that concerns itself only with music: music information retrieval (MIR). Music information retrieval spans different subcategories, from analyzing features of a music piece (e.g., beat detection, symbolic melody extraction, and audio tempo estimation) to exploring human input methods (like "query by tapping" or "query by singing/humming") to music clustering and recommendation (like mood detection or cover song identification).

Automatic chord estimation is one of the open challenges in MIR. Chord estimation (or recognition) describes the process of extracting musical chord labels from digitally encoded music pieces. Given an audio file, the specific chord symbols and their temporal positions and durations have to be determined automatically. The main evaluation programme for MIR is the annual "Music Information Retrieval Evaluation eXchange" (MIREX) challenge.1 It consists of challenges in different sub-tasks of MIR, including chord recognition. Often improving one task can influence the performance in other tasks: e.g., finding a better beat estimate can improve the detection of the temporal positions of chord changes, or improve the task of querying by tapping. The same is the case for chord recognition. It can improve the performance of cover song identification, in which, starting from an input song, cover songs are retrieved: chord information is a useful if not vital feature for discrimination. Chord progressions also have an influence on the "mood" transmitted through music. Thus being able to retrieve the chords used in a music piece accurately could also be helpful for mood categorization, e.g., for personalized Internet radios.

Chord recognition is also valuable in itself. It can aid musicologists as well as hobby and professional musicians in transcribing songs. There is a great demand for chord transcriptions of well-known and also lesser-known songs. This manifests itself in the many Internet pages that hold manual transcriptions of songs, especially for guitar.2 Unfortunately, these mostly contain transcriptions of only the most popular songs, and often several different versions of the same song exist. Furthermore, they are not guaranteed to be correct. Chord recognition is a difficult task which requires a lot of practice even for humans.
1 http://www.music-ir.org/mirex
2 E.g. Ultimate Guitar: http://www.ultimate-guitar.com/, 911Tabs: http://www.911tabs.com/, Guitartabs: http://www.guitaretab.com/
2 Musical Background
In this section I give an overview of important musical terms and concepts used later in this thesis. I first describe how musical notes relate to physical sound waves in section 2.1, then how chords relate to notes in section 2.2, and finally other aspects of music that play a role in automatic chord recognition in section 2.3.
2.1 Notes and Pitch
Pitch describes the perceived frequency of a sound. In Western tonality, pitches are labelled with the letters A to G. The transcription of a musically relevant pitch and its duration is called a note. Pitches can be ordered by frequency, whereby a pitch is said to be higher if the corresponding frequency is higher. The human auditory system works on a logarithmic scale, which also manifests itself in music: musical pitches are ordered in octaves, repeating the note names, usually denoted in ascending order from C to B: C, D, E, F, G, A, B. We can denote the octave of a pitch with an additional number as a subscript added to the letter, so the pitch A0 is one octave lower than the corresponding pitch A1. Two pitches one octave apart differ by a factor of two in frequency, and humans typically perceive those two pitches as equivalent (Shepard, 1964).

In music an octave is split into twelve roughly equal semitones. By definition the letters C to B are each two semitone steps apart, except for the steps from E to F and from B to C, which are only one semitone apart. To denote the notes that lie in between the named letters, the additional symbols ♯ (one semitone up in frequency) and ♭ (one semitone down) are used. For example, we can describe the musically relevant pitch between C and D both as C♯ and as D♭. Because this system only defines the relationships between pitches, we need a reference frequency. In modern Western tonality the reference frequency of A4 at 440 Hz is the standard (Sikora, 2003); in practice slight deviations from this reference tuning may occur, e.g., due to instrument mistuning. This reference pitch defines the frequencies of all other notes implicitly through the octave and semitone relationships. Given a reference pitch, we may compute the corresponding frequency of any other note with the following equation:

$$f_n = 2^{n/12} \cdot f_r, \qquad (1)$$

where $f_n$ is the frequency of the note $n$ semitone steps away from the reference pitch with frequency $f_r$.

The human ear can perceive a frequency range of approximately 20 Hz to 20 000 Hz. In practice this frequency range is not fully used in music. For example the MIDI standard, which is more than sufficient for musical purposes in terms of octave range, covers only notes in semitone steps from C−1, corresponding to about 8.17 Hz, to G9, which is 12 543.85 Hz. A standard piano keyboard covers the range from A0 at 27.5 Hz to C8 at 4186 Hz. Figure 1 depicts a standard piano keyboard in relation to the range of frequencies of MIDI standard notes, with the corresponding physical sound frequencies indicated.
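As a quick illustration of equation (1), here is a minimal Python sketch that computes note frequencies relative to the A4 = 440 Hz reference; the function name and the example calls are mine, not part of the thesis:

    def note_frequency(semitones_from_a4: int, reference_hz: float = 440.0) -> float:
        """Frequency of the note `semitones_from_a4` semitone steps away from A4 (eq. 1)."""
        return 2.0 ** (semitones_from_a4 / 12.0) * reference_hz

    # A4 itself, one octave up (A5), and middle C (C4, nine semitones below A4)
    print(note_frequency(0))    # 440.0
    print(note_frequency(12))   # 880.0
    print(note_frequency(-9))   # ~261.63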
Figure 1: Piano keyboard and MIDI note range. White keys depict the range of the standard piano for those notes that are described by letters; black keys deviate one semitone from a note described by a letter. The gray area depicts the extension of the MIDI standard (C−1 at about 8 Hz, MIDI note 0, up to G9 at 12 544 Hz, MIDI note 127) beyond the note range of a piano (keys 1–88, A0 to C8).
2.2 Chords
For the purpose of this thesis we define a chord as three or more notes played simultaneously. The distance in frequency between two notes is called an interval. In a musical context we can describe an interval as the number of semitone steps two notes are apart (Sikora, 2003). A chord consists of a root note, usually the lowest note in terms of frequency; the interval relationships of the other notes played at the same time define the chord type. Thus a chord can be defined by a root note and a type. In the following we use the notation <root-note>:<chord-type>, proposed by Harte (2010). We refer to the notes of a chord in order of ascending frequency as root note, third, fifth, and, if there is a fourth note, seventh. Table 1 lists the intervals for the chords considered in this thesis and the semitone distances for those intervals. The root note and fifth have fixed intervals. For the third and the seventh we differentiate between major and minor intervals, which differ by one semitone step.

    interval        semitone steps
    root-note       0
    minor third     3
    major third     4
    fifth           7
    minor seventh   10
    major seventh   11

Table 1: Semitone steps and intervals.

For this thesis we restrict ourselves to two different chord vocabularies to be recognized, the first one containing only major and minor chord types. Both major and minor chords consist of three notes: the root note, the third and the fifth. The interval between the root note and the third distinguishes major from minor chord types (see tables 1 and 2): a major chord contains a major third, while a minor chord contains a minor third. We distinguish between twelve root notes for each chord type, for a total of 24 possible chords. Burgoyne et al. (2011) propose a dataset which contains songs from the Billboard charts from the 1950s through the 1990s; this major-minor chord vocabulary accounts for 65% of the chords in that dataset. We can extend the chord vocabulary to take into account 83% of the chord types in the Billboard dataset by including variants of the seventh chords, i.e., by adding an optional fourth note to a chord. Hereby, in addition to simple major and minor chords, we add 7th, major 7th and minor 7th chord types to our chord-type vocabulary. Major 7th and minor 7th chords are essentially major and minor chords in which the added fourth note forms a major seventh or minor seventh interval with the root, respectively.

    chord-type   intervals
    major        1, 3, 5
    minor        1, ♭3, 5
    7            1, 3, 5, ♭7
    major7       1, 3, 5, 7
    minor7       1, ♭3, 5, ♭7

Table 2: Intervals and chords. The root note is denoted as 1, the third as 3, the fifth as 5 and the seventh as 7; minor intervals are marked with ♭.

In addition to different chord types, it is possible to change the frequency order of the notes by "pulling" one note below the root note in terms of frequency. This is called a chord inversion. Our extended chord vocabulary, containing major, minor, 7th, major 7th and minor 7th chords, therefore also contains all possible inversions. We denote this through an additional identifier in our chord syntax, <root-note>:<chord-type>/<inversion>, where the inversion identifier can be either 3, 5, or 7, indicating which chord note is played below the root note. For example, E:maj7/7 is a major 7th chord, consisting of the root note E, a major third, a fifth and a major seventh, where the major seventh is played below the root note in terms of frequency.

It is possible, however, that in parts of a song no instruments, or only non-harmonic instruments (e.g., percussion), are playing. To be able to interpret this case we define an additional non-chord symbol, adding one symbol to the 24 chord symbols of the restricted chord vocabulary, leaving us with 25 different symbols. The extended chord vocabulary contains the major, minor, 7th, major 7th and minor 7th chord types (depicted in table 2) and all possible inversions. So, for each root note, this leaves us with three different chord symbols for each of the major and minor chord types and four different chord symbols for each of the seventh chord types, thus 216 different symbols plus an additional non-chord symbol. Furthermore, we assume that chords cannot overlap, although this is not strictly true, for example due to reverb, multiple instruments playing chords, etc. In practice, however, this overlap is negligible and reverb is often not that long. Thus we regard a chord as a continuous entity with a designated start point, end point and chord symbol (either consisting of root note, chord type and inversion, or the non-chord symbol).
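To make the interval arithmetic of tables 1 and 2 concrete, here is a small illustrative Python sketch (the data structures and names are my own) that expands a root note and chord type into the pitch classes the chord contains:

    # Semitone offsets from table 1 and chord-type interval lists from table 2.
    NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    INTERVAL_SEMITONES = {'1': 0, 'b3': 3, '3': 4, '5': 7, 'b7': 10, '7': 11}
    CHORD_TYPES = {
        'maj':  ['1', '3', '5'],
        'min':  ['1', 'b3', '5'],
        '7':    ['1', '3', '5', 'b7'],
        'maj7': ['1', '3', '5', '7'],
        'min7': ['1', 'b3', '5', 'b7'],
    }

    def chord_pitch_classes(root: str, chord_type: str) -> list:
        """Pitch classes of <root>:<chord-type>, e.g. E:maj7 -> ['E', 'G#', 'B', 'D#']."""
        root_pc = NOTE_NAMES.index(root)
        return [NOTE_NAMES[(root_pc + INTERVAL_SEMITONES[i]) % 12]
                for i in CHORD_TYPES[chord_type]]

    print(chord_pitch_classes('E', 'maj7'))  # ['E', 'G#', 'B', 'D#']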
2.3 Other Structures in Music
A musical piece has several other components, some contributing additional harmonic content, for example vocals, which might also carry a linguistically interpretable message. Since a music piece has an overall harmonic structure and an inherent set of music-theoretical harmonic rules, this information also influences the chords played at any time and vice versa, but it does not necessarily contribute directly to the chord played. The duration and the start and end points in time of a played chord are influenced by rhythmic instruments, such as percussion. These do not contribute to the harmonic content of a music piece but are nonetheless interdependent with the other instruments in terms of timing, and thus with the beginning and end of a played chord. These additional harmonic and non-harmonic components occupy the same frequency range as the components that directly contribute to the chord played. From this viewpoint, if we do not explicitly take these additional components into account, we are dealing with the additional task of filtering out the "noise" they introduce, on top of the task of recognizing the chords themselves.
3 Related Work
Most musical chord estimation methods can broadly be divided into two subprocesses: preprocessing of features from wave-file data, and higher-level classification of those features into chords. I first describe, in section 3.1, the preprocessing steps applied to the raw waveform data, as well as extensions and refinements of these computation steps that take more properties of musical waveform data into account. An overview of higher-level classification, organized by the methods applied, is given in section 3.2. These approaches differ not only in the methods per se, but also in what kind of musical context they take into account for the final classification. More recent methods take more musical context into account and seem to perform better. Since the methods proposed in this thesis are based on machine learning, I have decided to organize the description of other higher-level classification approaches from a technical perspective rather than from a music-theoretical perspective.
3.1 Preprocessing / Features
The most common preprocessing step for feature extraction from waveform data is the computation of so-called pitch class profiles (PCPs), a human-perception-based concept coined by Shepard (1964). He conducted a human perceptual study in which he found that humans are able to perceive notes that are in octave relation as equivalent. A similar representation can be computed from waveform data for chord recognition. A PCP in a music-computational sense is a representation of the frequency spectrum wrapped into one musical octave, thus an aggregated 12-dimensional vector of the energy of the respective input frequencies. This is often called a chroma vector. A sequence of chroma vectors over time is called a chromagram. The terms PCP and chroma vector are used interchangeably in the chord recognition literature. It should be noted, however, that only the physical sound energy is aggregated: this is not purely music-harmonic information. Thus the chromagram may contain additional non-harmonic noise, such as drums, harmonic overtones and transient noise. In the following I will give an overview of the basics of calculating the chroma vector and of different extensions proposed to improve the quality of these features.

3.1.1 PCP / Chroma Vector Calculation
In order to compute a chroma vector, the input signal is broken into frames and converted to the frequency domain, which is most often done through a discrete Fourier transform (DFT), using a window function to reduce spectral leakage. Harris (1978) compares 23 different window functions and finds that their performance depends very much on the properties of the data. Since musical data is not homogeneous, there is no single best-performing window function. Different window functions have been used in the literature, and often the specific window function is not stated. Khadkevich and Omologo (2009a) compare the performance impact of using Hanning, Hamming and Blackman window functions on musical waveform data applied to the chord estimation domain. They state that the results are very similar for those three types. However, the Hamming window performed slightly better for window lengths of
1024 and 2048 samples (for a sampling rate of 11025 Hz), which are the most common lengths in automatic chord recognition systems today.

To convert from the Fourier domain to a chroma vector, two different methods are used. Wakefield (1999) sums the energies of the frequencies in the Fourier space closest to the pitch of a chroma vector bin (and its octave multiples), aggregating the energy in a discrete mapping from the spectral frequency domain to the corresponding chroma vector bin and thus converting the input directly to a chroma vector. Brown (1991) developed the so-called constant-Q transform, using a kernel matrix multiplication to convert the DFT spectrogram into a logarithmic frequency space. Each bin of the logarithmic frequency representation corresponds to the frequency of a musical note. After conversion into the logarithmic frequency domain, we can then simply sum up the respective bins to obtain the chroma vector representation. For both methods the aggregated sound energy in the chroma vector is usually normalized, either to sum to one or with respect to the maximum energy in a single bin. Both methods lead to similar results and are used in the current literature.
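As an illustration of the first (direct-mapping) method, here is a minimal Python sketch that folds FFT bin energies into a 12-bin chroma vector; the frequency range, window choice and normalization are simplifying assumptions of mine, and tuning and overtones are ignored:

    import numpy as np

    def chroma_from_frame(frame: np.ndarray, sample_rate: int = 11025,
                          fmin: float = 100.0, fmax: float = 5000.0) -> np.ndarray:
        """Map FFT bin energies to a 12-bin pitch class profile (chroma vector)."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        chroma = np.zeros(12)
        for f, energy in zip(freqs, spectrum):
            if fmin <= f <= fmax:
                # distance in semitones from A4 (440 Hz), folded into one octave
                semitone = int(round(12 * np.log2(f / 440.0))) % 12
                pitch_class = (semitone + 9) % 12      # shift so that index 0 = C
                chroma[pitch_class] += energy
        total = chroma.sum()
        return chroma / total if total > 0 else chroma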
3.1.2 Minor Pitch Changes
In Western tonal music, instruments are tuned to the reference frequency of the A4 above middle C (MIDI note 69), whose standard frequency is 440 Hz. In some cases the tuning of the instruments can deviate slightly from this standard tuning, usually by less than a quartertone: 415–445 Hz (Mauch, 2010). Most humans are unable to determine an absolute pitch without a reference pitch. We can hear the mistuning of one instrument with some practice, but it is difficult to determine a slight deviation of all instruments from the usual reference frequency described above. The bins of the chroma vector are relative to a fixed pitch, so minor deviations in the input will affect its quality. Minor deviations of the reference pitch can be taken into account by shifting the pitch of the chromagram bins. Several different methods have been proposed. Harte and Sandler (2005) use a chroma vector with 36 bins, 3 per semitone. Computing a histogram of energies with respect to frequency for one chroma vector and for the whole song, and examining the peak positions in the extended chroma vector, enables them to estimate the true tuning and derive a 12-bin chroma vector, under the assumption that the tuning does not deviate during the piece of music. This takes a slightly changed reference frequency into account. Gómez (2006) first restricts the input frequencies to the range from 100 to 5000 Hz to reduce the search space and to remove additional overtone and percussive noise. She uses a weighting function which aggregates spectral peaks not to one but to several chromagram bins; the spectral energy contributions of these bins are weighted according to a squared cosine distance in frequency. Dressler and Streich (2007) treat minor tuning differences as an angle and use circular statistics to compensate for minor pitch shifts, which was later adapted by Mauch and Dixon (2010b). Minor tuning differences are quite prominent in Western tonal music, and adjusting the chromagram can lead to a performance increase, such that several other systems make use of one of the former methods, e.g., Papadopoulos and Peeters (2007, 2008), Reed et al. (2009), Khadkevich and Omologo (2009a), and Oudre et al. (2009).
3.1.3 Percussive Noise Reduction
Music audio often contains noise that cannot directly be used for chord recognition, such as transient or percussive noise. Percussive and transient noise is normally short, in contrast to harmonic components, which are rather stable over time. A simple way to reduce it is to smooth subsequent chroma vectors through filtering or averaging, and different filters have been proposed. Some researchers, e.g., Peeters (2006), Khadkevich and Omologo (2009b), and Mauch et al. (2008), use a median filter over time, after tuning and before aggregating the chroma vectors, to remove transient noise. Gómez (2006) uses several different filtering methods and derivatives based on a method developed by Bonada (2000) to detect transient noise and leaves a window of 50 ms before and after transient noise out of the chroma vector calculation, reducing the input space. Catteau et al. (2007) calculate a "background spectrum" by convolving the log-frequency spectrum with a Hamming window with a length of one octave, which they subtract from the original chroma vector to reduce noise. Because there are methods to estimate the beat from the audio signal (Ellis, 2007), and chord changes are more likely to appear at these metric positions, several systems aggregate or filter the chromagram only in between the detected beats. Ni et al. (2012) use a so-called harmonic percussive sound separation algorithm, described in Ono et al. (2008), which attempts to split the audio signal into percussive and harmonic components; after that they use the median chroma feature vector as the representation of the complete chromagram between two beats. A similar approach is used by Weil et al. (2009), who also use a beat tracking algorithm and average the chromagram between two consecutive beats. Glazyrin and Klepinin (2012) calculate a beat-synchronous smoothed chromagram and propose a modified Prewitt filter, borrowed from edge detection in image recognition, to suppress non-harmonic spectral components.
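A short sketch of the simplest of these smoothing schemes, median filtering a chromagram along the time axis; the window length is an arbitrary illustrative value, not a setting taken from any of the cited systems:

    import numpy as np
    from scipy.ndimage import median_filter

    def smooth_chromagram(chromagram: np.ndarray, window: int = 9) -> np.ndarray:
        """Median-filter each of the 12 chroma bins over time to suppress transients.

        chromagram: array of shape (n_frames, 12).
        """
        # filter along the time axis only; size=(window, 1) keeps pitch classes independent
        return median_filter(chromagram, size=(window, 1), mode='nearest')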
3.1.4 Repeating Patterns
Musical pieces exhibit a very repetitive structure: in popular music, higher-level structures such as verse and chorus are repeated, and usually those are themselves repetitions of different harmonic (chord) patterns. These structures can be exploited to improve the chromagram by recognizing and averaging or filtering those repetitive parts to remove local deviations. Repetitive parts can also be estimated and used later in the classification step to increase performance. Mauch et al. (2009) first perform a beat estimation and smooth the chroma vectors in a prefiltering step. Then a frame-by-frame similarity matrix is computed from the beat-synchronous chromagram and the song is segmented into an estimation of verse and chorus. This information is used to average the beat-synchronous chromagram. Since beat estimation is a current research topic itself and often does not work perfectly, there might be errors in the beat positions. Cho and Bello (2011) argue that it is advantageous to use recurrence plots with a simple threshold operation to find similarities on the chord level for later averaging, thus leaving out the segmentation of the song into chorus and verse and the beat detection. Glazyrin and Klepinin (2012) build upon and alter the system of Cho and Bello: they use a normalized self-similarity matrix on the computed chroma vectors with the Euclidean distance as a comparison measure.
3.1.5 Harmonic / Enhanced Pitch Class Profile
One problem in the computation of PCPs in general is to find an interpretation for overtones (energy at integer multiples of the fundamental frequency), since these might generate energy at frequencies that contribute to chroma vector bins other than those of the actual notes of the respective chord. For example, the overtones of A4 (440 Hz) are at 880 Hz and 1320 Hz, the latter being close to E6 (MIDI note 88) at approximately 1318.51 Hz. Several different ways to deal with this have been proposed. In most cases the frequency range that is taken into account is restricted, e.g., to approximately 100 Hz to 5000 Hz (Lee, 2006; Gómez, 2006); most of the harmonic content is contained in this interval. Lee (2006) refines the chroma vector by computing the so-called "harmonic product spectrum", in which the product of the energies at octave multiples (up to a certain number) of each bin is calculated; the chromagram is then computed on the basis of this harmonic product spectrum. He states that multiplying the fundamental frequency with its octave multiples can decrease noise on notes that are not contained in the original piece of music. Additionally he finds a reduction of noise induced by "false" harmonics compared to conventional chromagram calculation. Gómez (2006) proposes an aggregation function for the computation of the chroma vector in which the energies of the frequency multiples are summed, but first weighted by a "decay" factor that depends on the multiple. Mauch and Dixon (2010a) use a non-negative least-squares method to find a linear combination of "note profiles" in a dictionary matrix to compute the log-frequency representation, similar to the constant-Q transform mentioned earlier.
3.1.6 Modelling Human Loudness Perception
Human loudness perception is not directly proportional to the power or amplitude spectrum (Ni et al., 2012), thus the different representations described above do not model human perception accurately. Ni et al. (2012) describe a method to incorporate this through a log10 scale for the sound power with respect to frequency. Pauws (2004) uses a tangential weighting function to achieve a similar goal for key detection. They find an improvement in the quality of the resulting chromagram compared to non-loudness-weighted methods.
3.1.7 Tonnetz / Tonal Centroid
Another representation of harmonic content is the so-called Tonnetz, attributed to Euler in the 18th century. It is a planar representation of musical notes on a 6-dimensional polytope, where pitch relations are mapped onto its vertices; close harmonic relations (e.g., fifths and thirds) have a small Euclidean distance. Harte et al. (2006) describe a way to compute a Tonnetz from a 12-bin chroma vector and report a performance increase for a harmonic change detection function compared to standard methods. Humphrey et al. (2012) use a convolutional neural network to model a projection function from the FFT of the waveform input to a Tonnetz. They perform experiments on the task of chord recognition with a Gaussian mixture model and report that the Tonnetz output representation outperforms state-of-the-art chroma vectors.
3.2 Classification
The majority of chord recognition systems compute a chromagram using one or a combination of the methods described above. Early approaches use predefined chord templates and compare them with the computed frame-wise chroma features from the audio, which are then classified. With the supply of more and more hand-annotated data, more data-driven learning approaches have been developed. The most prominent data-driven model, the hidden Markov model (HMM), is adopted from speech recognition. Bayesian networks, which are a generalization of HMMs, are also used frequently. Recent approaches propose to take more musical context into account to increase performance, such as local key, bass note, beat and song structure segmentation. Although most chord recognition systems rely on the computation of a single chroma vector per frame, more recent approaches compute two chroma vectors for each frame: a bass and a treble chromagram (differing in frequency range), as it is reasoned that the sequence of bass notes plays an important role in the harmonic development of a song and can colour the treble chromagram due to harmonics.
3.2.1 Template Approaches
The chroma vector, as an estimate of the harmonic content of a frame of a music piece, should contain peaks at the bins that correspond to the chord notes played. Chord template approaches use chroma-vector-like templates, which can be either predefined through expert knowledge or learned from data. Those templates are then compared to the computed chroma vector of each frame with a fitting function, and the frame is classified as the chord symbol corresponding to the best-fitting template. The first research paper explicitly concerned with chord recognition, by Fujishima (1999), constitutes a non-machine-learning system. Fujishima first computes simple chroma vectors as described above. He then uses predefined 12-dimensional binary chord patterns (1 or 0 for notes present or not present in the chord) and computes the inner product with the chroma vector. For real-world chord estimation, the set of chords consists of schemata for "triadic harmonic events, and to some extent more complex chords such as sevenths and ninths". Fujishima's system was only used on synthesized sound data, however. Binary chord templates with an enhanced chroma vector using harmonic overtone suppression were used by Lee (2006). Other groups use a more elaborate chromagram with tuning (36 bins) to account for minor pitch changes, while reducing the set of chord types to be recognized (Harte and Sandler, 2005; Oudre et al., 2009). Oudre et al. (2011) extend the methods already mentioned by comparing different filtering methods as described in section 3.1.3 and different measures of fit (Euclidean distance, Kullback-Leibler divergence and Itakura-Saito divergence) to select the most suitable chord template. They also take harmonic overtones of chord notes into account, such that bins in the templates for notes not occurring in the chord do not necessarily have to be zero. Glazyrin and Klepinin (2012) use quasi-binary chord templates, in which the tonic and the fifth are enhanced and the template is normalized afterwards; the templates are compared to smoothed and fine-tuned chroma vectors. Chord templates do not have to be in the form of chroma vectors. They can also
be modelled as a Gaussian, or as a mixture of Gaussians as used by Humphrey et al. (2012), in order to get a probabilistic estimate of a chord likelihood. To eliminate short spurious chords that only last a few frames, they use a Viterbi decoder. They do not use the chroma vector for classification, but a Tonnetz as described in section 3.1.7; the transformation function is learned from data by a convolutional neural network. It should be noted that basically all chord template approaches can model chord "probabilities" that can in turn be used as input for higher-level classification methods or for temporal smoothing, such as the hidden Markov models described in section 3.2.2, as shown by Papadopoulos and Peeters (2007).
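A minimal sketch of the binary-template idea in Fujishima's spirit, restricted to major and minor triads; the inner-product scoring follows the description above, while the code organization and names are my own:

    import numpy as np

    NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

    def binary_templates() -> dict:
        """12-dimensional binary templates for all major and minor triads."""
        templates = {}
        for root in range(12):
            for name, intervals in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
                template = np.zeros(12)
                template[[(root + i) % 12 for i in intervals]] = 1.0
                templates[f'{NOTE_NAMES[root]}:{name}'] = template
        return templates

    def classify_frame(chroma: np.ndarray, templates: dict) -> str:
        """Return the chord label whose template has the largest inner product with the chroma vector."""
        return max(templates, key=lambda label: float(np.dot(templates[label], chroma)))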
3.2.2 Data-Driven Higher Context Models
The recent increase in the availability of hand-annotated data for chord recognition has spawned new machine-learning-based methods. In the chord-recognition literature, different approaches have been proposed, from neural networks to systems adopted from speech recognition, support vector machines and others. More recent machine learning systems seem to capture more and more of the context of music. In this section I describe the higher-level classification models found in the literature, organized by the machine learning methods used.

Neural Networks  Su and Jeng (2001) try to model the human auditory system with artificial neural networks. They perform a wavelet transform (as an analogy to the ear) and feed the output into a neural network (as an analogy to the cerebrum) for classification. They use a self-organizing map to determine the style of the chord and the tonality (C, C#, etc.). The system was tested on classical music to recognize 4 different chord types (major, minor, augmented, and diminished). Zhang and Gerhard (2008) propose a system based on neural networks to detect basic guitar chords and their voicings (inversions) with the help of a voicing vector and a chromagram. The neural network in this case is first trained to identify and output the basic chords; a later post-processing step determines the voicing. Osmalsky et al. (2012) build a database with several different instruments playing single chords individually, partly recorded in a noisy and partly in a noise-free environment. They use a feed-forward neural network with a chroma vector as input to classify 10 different chords and experiment with different subsets of their training set.

HMM  Neural networks do not take time dependencies between subsequent inputs into account. In music pieces there is a strong interdependency between subsequent chords, which makes it difficult to model the classification of chords for a whole music piece based solely on neural networks. Since template- and neural-network-based approaches do not explicitly take the temporal properties of music into account, a widely adopted method is to use a hidden Markov model, which has proven to be a good tool in the related field of speech recognition. The chroma vector is treated as the observation, which can be modelled by different probability distributions, and the states of the HMM are the chord symbols to be extracted. Sheh and Ellis (2003) pioneered HMMs for real-world chord recognition. They propose that the emission distribution be a single Gaussian with 24 dimensions, trained from data with expectation maximization. Burgoyne et al.
(2007) state that a mixture of Gaussians is more suitable as the emission distribution. They also compare the use of Dirichlet distributions as the emission distribution and conditional random fields as the higher-level classifier. HMMs are used with slightly different chromagram computations and training initialisations according to prior music-theoretic knowledge by Bello and Pickens (2005). Lee (2006) builds upon the systems of Bello and Pickens and of Sheh and Ellis, generates training data from symbolic (MIDI) files and uses an HMM for chord extraction. Papadopoulos and Peeters (2007) compare several different methods of determining the parameters of the HMM and the observation probabilities. They conclude that a template-based approach combined with an HMM with a "cognitive based transition matrix" shows the best performance. Later, Papadopoulos and Peeters (2008, 2011) propose an HMM approach focusing on (and extracting) beat estimates to take musical beat addition, beat deletion or changes in meter into account to enhance recognition performance. Ueda et al. (2010) use harmonic percussive sound separation chromagram features and an HMM for classification. Chen et al. (2012) cluster "song-level duration histograms" to take time duration explicitly into account in a so-called duration-explicit HMM. Ni et al. (2012) present the best performing system of the 2012 MIREX challenge in chord estimation; it works on the basis of an HMM, bass and treble chroma, and beat and key detection.

Structured SVM  Weller et al. (2009) compare the performance of HMMs and support vector machines (SVMs) for chord recognition and achieve state-of-the-art results using support vector machines.

n-grams  Language and music are closely related; both spoken language and music rely on audio data. Thus it makes sense to apply spoken-language-recognition approaches to music analysis and chord recognition. A dominant approach in language recognition is the n-gram model. A bigram model (n = 2) is essentially a hidden Markov model, in which one state only depends on the previous one. Cheng et al. (2008) compare 2-, 3-, and 4-grams, thus making one chord dependent on multiple previous chords, and use the result for song similarity after a chord recognition step. In their experiments the simple 3- and 4-grams outperform the basic HMM system of Harte and Sandler (2005); they state that n-grams are able to learn the basic rules of chord progressions from hand-annotated data. Scholz et al. (2009) use a 5-gram model, compare different smoothing techniques and find that modelling more complex chords with 7ths and 9ths should be possible with n-grams. They do not state how the features are computed and interpreted.

Dynamic Bayesian Networks  Musical chords develop meaning in their interplay with other characteristics of a music piece, such as bass note, beat and key: they cannot be viewed as isolated entities. These interdependencies are difficult to model with a standard HMM approach. Bayesian networks are a generalization of HMMs in which the musical context can be modelled more intuitively. Bayesian networks give the opportunity to model interdependencies simultaneously, creating a sounder model of music pieces from a music-theoretic perspective. Another advantage of a Bayesian network is that it can
directly extract multiple types of information, which may not be a priority for the task of chord recognition, but is an advantage for the extended task of general transcription of music pieces. Cemgil et al. (2006) were among the first to introduce Bayesian networks for music computation. They do not apply the system to chord recognition but to polyphonic music transcription (transcription on a note-by-note basis), implementing a special case of the switching Kalman filter. Mauch (2010) and Mauch and Dixon (2010b) make use of a Bayesian network and incorporate beat detection, bass note and key estimation; the observations of the Bayesian network in their system are treble and bass chromagrams. Dixon et al. (2011) compare a similar system to a logic-based system.

Deep Learning Techniques  Deep learning techniques have beaten the state of the art in several benchmark problems in recent years, although for the task of chord recognition they are a relatively unexplored method. There are three recent publications using deep learning techniques. Humphrey and Bello (2012) call for a change in the conventional approach of using a variation of the chroma vector and a higher-level classifier, since they state that recent improvements seem to bring only "diminishing return". They present a system consisting of a convolutional neural network with several layers, trained to learn a Tonnetz from a constant-Q-transformed FFT, and subsequently classify it with a Gaussian mixture model. Boulanger-Lewandowski et al. (2013) make use of deep learning techniques with recurrent neural networks. They use different techniques, including a Viterbi-like algorithm from HMMs and beam search, to take temporal information into account. They report upper-bound results comparable to the state of the art using the Beatles Isophonics dataset (see section 6.5 for a dataset description) for training and testing. Glazyrin (2013) uses stacked denoising autoencoders with a 72-bin constant-Q transform input, trained to output chroma vectors. A self-similarity algorithm is applied to the neural network output, which is later classified with a deterministic algorithm similar to the template approaches mentioned above.
4 Stacked Denoising Autoencoders
In this section I describe the theoretical background of the stacked denoising autoencoders used for the two chord recognition systems in this thesis, following Vincent et al. (2010). First a definition of autoencoders and their training method is given in section 4.1; section 4.2 then describes how this can be extended to form a denoising autoencoder. Section 4.3 describes how we can stack denoising autoencoders and train them in an unsupervised manner, possibly obtaining a useful higher-level abstraction of the data by training several layers.
4.1 Autoencoders
Autoencoders or autoassociators try to find an encoding of the given data in their hidden layers. Similar to Vincent et al. (2010) we define the following. We assume a supervised learning scenario with a training set of $n$ tuples of inputs $x$ and targets $t$, $D_n = \{(x_1, t_1), \ldots, (x_n, t_n)\}$, where $x \in \mathbb{R}^d$ if the input is real-valued, or $x \in [0, 1]^d$ if it is binary. Our goal is to infer a new, higher-level representation $y$ of $x$. The new representation is again $y \in \mathbb{R}^{d'}$ or $y \in [0, 1]^{d'}$, depending on whether a real-valued or a binary representation is assumed.

Encoder  A deterministic mapping $f_\theta$ that transforms the input $x$ into a hidden representation $y$ is called an encoder. It can be described as follows:

$$y = f_\theta(x) = s(Wx + b), \qquad (2)$$

where $\theta = \{W, b\}$, with $W$ a $d \times d'$ weight matrix and $b$ an offset (or bias) vector of dimension $d'$. The function $s$ is a non-linear mapping, e.g., the sigmoid activation function $\frac{1}{1 + e^{-x}}$. The output $y$ is called the "hidden representation".

Decoder  A deterministic mapping $g_{\theta'}$ that maps the hidden representation $y$ back to input space by constructing a vector $z = g_{\theta'}(y)$ is called a decoder. Typically this is either a linear mapping,

$$z = g_{\theta'}(y) = W'y + b', \qquad (3)$$

or a mapping followed by a non-linearity,

$$z = g_{\theta'}(y) = s(W'y + b'), \qquad (4)$$

where $\theta' = \{W', b'\}$, with $W'$ a $d' \times d$ weight matrix and $b'$ an offset (or bias) vector of dimension $d$. Often the restriction $W^\top = W'$ is imposed on the weights. The vector $z$ can be regarded as an approximation of the original input data $x$, reconstructed from the hidden representation $y$.
Figure 2: Conventional autoencoder training. Vector $x$ from the training set is projected by $f_\theta(x)$ to the hidden representation $y$, and hereafter projected back to input space using $g_{\theta'}(y)$ to compute $z$. The loss function $L(x, z)$ is calculated and used as the training objective for minimization.

Training  The idea behind such a model is to obtain a good hidden representation $y$, from which the decoder is able to reconstruct the original input as closely as possible. It can be shown that finding the optimal parameters for such a model can be viewed as maximizing a lower bound on the mutual information between the input and the hidden representation in the first layer (Vincent et al., 2010). To estimate the parameters we define a loss function. For a binary input $x \in [0, 1]^d$ this can be the cross-entropy,

$$L(x, z) = -\sum_{k=1}^{d} \big[ x_k \log(z_k) + (1 - x_k) \log(1 - z_k) \big], \qquad (5)$$

or for real-valued input $x \in \mathbb{R}^d$ the "squared error objective",

$$L(x, z) = \lVert x - z \rVert^2. \qquad (6)$$

Since we use real-valued input data, the squared error objective is used as the loss function in this thesis. Given this loss function we want to minimize the average loss (Vincent et al., 2008):

$$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L(x^{(i)}, z^{(i)}) = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, g_{\theta'}(f_\theta(x^{(i)}))\big), \qquad (7)$$

where $\theta^*, \theta'^*$ denote the optimal parameters of the encoding and decoding functions for which the loss is minimized (these parameters might be tied), and $n$ is the number of training samples. The minimization can be carried out iteratively by backpropagation. Figure 2 visualizes the training procedure for an autoencoder.

If the hidden representation $y$ has the same dimensionality as the input $x$, it is trivial to construct a mapping that yields zero reconstruction error: the identity mapping. This constitutes a problem, since merely learning the identity mapping does not lead to any higher level of abstraction. To evade this problem a bottleneck is introduced, for example by using fewer nodes for the hidden representation, thus reducing its dimensionality. It is also possible to impose a penalty on the network activations to form a bottleneck, and thus train a sparse network. These additional restrictions force the neural network to focus on the most "informative" parts of the data, leaving out noisy "uninformative" parts. Several layers can be trained in a greedy manner to achieve a yet higher level of abstraction.

Enforcing Sparsity  To prevent autoencoders from learning the identity mapping, we can penalize activation. This is described by Hinton (2010) for restricted Boltzmann machines, but can be used for autoencoders as well. The general idea is that nodes that fire very frequently are less informative; a node that is always active does not add any useful information and could be left out. We can enforce sparsity by adding a penalty term for large average activations over the whole dataset to the backpropagated error. We can compute the average activation of hidden unit $j$ over all training samples as

$$\hat{p}_j = \frac{1}{n} \sum_{i=1}^{n} f_\theta^j(x^{(i)}). \qquad (8)$$

In this thesis the following addition to the loss function is used, which is derived from the KL divergence:

$$L_p = \beta \sum_{j=1}^{h} \Big[ p \log\frac{p}{\hat{p}_j} + (1 - p) \log\frac{1 - p}{1 - \hat{p}_j} \Big], \qquad (9)$$

where $\hat{p}_j$ is the average activation of hidden unit $j$ over the complete training set, $n$ is the number of training samples, $p$ is a target activation parameter and $\beta$ a penalty weighting parameter, both specified beforehand. The bound $h$ is the number of hidden nodes. For a sigmoidal activation function $p$ is usually set to a value close to zero, for example 0.05; a frequent setting for $\beta$ is 0.1. This ensures that units have a large activation only on a limited number of training samples and otherwise have an activation close to zero. This weighted activation error term is simply added to $L(x, z)$, described above.
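A compact numpy sketch of equations (2)–(9) — encoder, decoder with tied weights, squared-error loss and the KL-derived sparsity penalty — with illustrative layer sizes and parameter values of my own choosing:

    import numpy as np

    rng = np.random.default_rng(0)
    d, d_hidden, n = 512, 128, 64            # input dim, hidden dim, batch size (illustrative)
    W = rng.normal(scale=0.01, size=(d_hidden, d))
    b, b_prime = np.zeros(d_hidden), np.zeros(d)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def encode(x):                            # eq. (2): y = s(Wx + b)
        return sigmoid(x @ W.T + b)

    def decode(y):                            # eq. (3) with tied weights W' = W^T
        return y @ W + b_prime

    x = rng.random((n, d))
    y = encode(x)
    z = decode(y)

    reconstruction_loss = np.mean(np.sum((x - z) ** 2, axis=1))   # eq. (6), averaged as in eq. (7)

    p, beta = 0.05, 0.1                       # target activation and penalty weight
    p_hat = y.mean(axis=0)                    # eq. (8): average activation per hidden unit
    sparsity_penalty = beta * np.sum(p * np.log(p / p_hat)
                                     + (1 - p) * np.log((1 - p) / (1 - p_hat)))   # eq. (9)
    total_loss = reconstruction_loss + sparsity_penalty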
4.2 Autoencoders and Denoising
Vincent et al. (2010) propose another training criterion in addition to the bottleneck. They state that an autoencoder can also be trained to "clean a partially corrupted input", also called denoising. If noisy input is assumed, it can be beneficial to corrupt (parts of) the input of the autoencoder during training and use the uncorrupted input as the target. The autoencoder is hereby encouraged to reconstruct a "clean" version of the corrupted input. This can make the hidden representation of the input more robust to noise and can potentially lead to a better higher-level abstraction of the input data. Vincent et al. (2010) state that different types of noise can be considered: "masking noise", i.e., setting a random fraction of the input to 0; "salt and pepper noise", i.e., setting a random fraction of the input to either 0 or 1; and, especially for real-valued input, isotropic additive Gaussian noise, i.e., adding noise from a Gaussian distribution to the input.

To achieve this, we corrupt the initial input $x$ into $\tilde{x}$ according to a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$. This corrupted input is then projected to the hidden representation as described before by means of $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$, and we reconstruct $z = g_{\theta'}(y)$. The parameters $\theta$ and $\theta'$ are trained to minimize the average reconstruction error between the output $z$ and the uncorrupted input $x$, but in contrast to "conventional" autoencoders, $z$ is now a deterministic function of $\tilde{x}$ rather than of $x$. For our purpose, using additive Gaussian noise, we can train the denoising autoencoder with the squared error loss function $L_2(x, z) = \lVert x - z \rVert^2$. The parameters can be initialized at random and then optimized by backpropagation. Figure 3 depicts the training of a denoising autoencoder.

Figure 3: Denoising autoencoder training. Vector $x$ from the training set is corrupted with $q_D$ and converted to the hidden representation $y$. The loss function $L(x, z)$ is calculated from the output and the uncorrupted input and used for training.
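Building on the notation above, corruption with $q_D$ for a denoising autoencoder might look as follows; the noise levels are arbitrary illustrative values, not settings used in the thesis:

    import numpy as np

    def corrupt(x, noise_std=0.1, masking_fraction=0.0, rng=None):
        """Corrupt input with additive Gaussian noise and optional masking noise (q_D)."""
        rng = rng or np.random.default_rng(1)
        x_tilde = x + rng.normal(scale=noise_std, size=x.shape)   # isotropic Gaussian noise
        if masking_fraction > 0:
            mask = rng.random(x.shape) < masking_fraction         # zero a random fraction of inputs
            x_tilde = np.where(mask, 0.0, x_tilde)
        return x_tilde

    # training pair: the corrupted version is fed to the encoder,
    # while the loss L(x, z) is computed against the clean x
    x_clean = np.random.default_rng(0).random((64, 512))
    x_noisy = corrupt(x_clean)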
4.3 Training Multiple Layers
If we want to train deep networks (or initialize their parameters for supervised backpropagation), we need a way to extend the approach from a single layer, as described in the previous sections, to multiple layers. As described by Vincent et al. (2010), this can easily be achieved by repeating the process for each layer separately. Figure 4 depicts such a greedy layer-wise training. First we propagate the input $x$ through the already trained layers; note that no additional corruption noise is used in this step. Next we use the uncorrupted hidden representation of the previous layer as input for the layer we are about to train, and train this specific layer as described in the previous sections: the input to the layer is first corrupted by $q_D$ and then projected into latent space using $f_\theta^{(2)}$. We then project it back to the "input" space of the specific layer with $g_{\theta'}^{(2)}$. Using an error function $L$, we can optimize the projection functions with respect to the defined error, and thereby possibly obtain a useful higher-level representation. This process can be repeated several times to initialize a deep neural network structure, circumventing the usual problems that arise when initializing deep networks at random and then applying backpropagation. Next we can apply a classifier to the output of this deep neural network trained to suppress noise. Alternatively, we can add another layer of hidden nodes for classification purposes on top of the previously unsupervised-trained network structure and apply standard backpropagation to fine-tune the network weights according to our supervised training targets $t$.

Figure 4: Training of several layers in a greedy unsupervised manner (as shown in Vincent et al., 2010). The input is propagated without corruption. To train an additional layer, the output of the first layer is corrupted by $q_D$ and the weights are adjusted with $f_\theta^{(2)}$, $g_{\theta'}^{(2)}$ and the respective loss function. After training for this layer is completed, we can train subsequent layers.
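A schematic sketch of this greedy layer-wise procedure; `train_denoising_layer` is a hypothetical helper that trains one denoising autoencoder layer on clean input and returns its encoder function, so the sketch only shows the data flow, not a tuned implementation:

    def pretrain_stack(data, layer_sizes, train_denoising_layer):
        """Greedy unsupervised pre-training of a stack of denoising autoencoder layers.

        data:        array of shape (n_samples, input_dim)
        layer_sizes: list of hidden-layer sizes, e.g. [1024, 512, 256]
        train_denoising_layer(clean_input, n_hidden) -> encoder function for that layer
        """
        encoders = []
        representation = data                  # propagated *without* corruption
        for n_hidden in layer_sizes:
            # each layer is trained on the clean output of the layers below it;
            # corruption happens inside train_denoising_layer, only for this layer's input
            encoder = train_denoising_layer(representation, n_hidden)
            encoders.append(encoder)
            representation = encoder(representation)
        return encoders                        # can be fine-tuned with backpropagation afterwards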
4.4 Dropout
Hinton et al. (2012) were able to improve performance on several other recognition tasks, including MNIST for handwritten digit recognition and TIMIT, a database for speech recognition, by randomly omitting a fraction of the hidden nodes from training for each sample. This is, in essence, training a different model for each training sample, with each of these models trained on only that one sample in one iteration. According to Hinton et al., this prevents the network from overfitting. In the testing phase we make use of the complete network again. Thus what we are effectively doing with dropout is averaging: averaging many models, each trained on one training sample. This has yielded an improvement in different modelling tasks (Hinton et al., 2012).
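A one-function illustration of the dropout idea during training; the 50% rate is only an example value:

    import numpy as np

    def dropout(activations, rate=0.5, rng=None):
        """Randomly omit a fraction of hidden-node activations during training.

        At test time the complete network is used again (activations or weights are
        scaled by the keep probability 1 - rate to compensate).
        """
        rng = rng or np.random.default_rng(2)
        keep_mask = rng.random(activations.shape) >= rate
        return activations * keep_mask

    hidden = np.random.default_rng(0).random((64, 128))   # stand-in for a hidden layer's activations
    hidden_dropped = dropout(hidden)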
5
Chord Recognition Systems
In this section I describe the structure of three different approaches to classifying chords.

1. We first describe the structure of a comparison system: a simplified version of the Harmony Progression Analyzer as proposed by Ni et al. (2012). The features computed can be considered state of the art. We discard, however, additional context information like key, bass and beat tracking, since the neural network approaches developed in this thesis do not take this into account (although it should be noted that in principle the approaches developed in this thesis could be extended to take this additional context information into account as well). The simplified version of the Harmony Progression Analyzer will serve as a reference system for performance comparison.

2. A neural network initialized by stacked denoising autoencoder pretraining with later backpropagation fine-tuning can be applied to an excerpt of the FFT to estimate chord probabilities directly, which can then be smoothed with the help of an HMM to take temporal information into account. We substitute the emission probabilities with the output of the stacked denoising autoencoders.

3. This approach can be extended by adding filtered versions of the FFT over different time spans to the input. We extend the input to include two additional vectors, median-smoothed over different time spans. Here again additional temporal smoothing is applied in a post-classification process.

In section 5.1 we describe the comparison system and briefly the key ideas incorporated in the computation of state-of-the-art features. Since the two other approaches described in this thesis make use of stacked denoising autoencoders that interpret the FFT directly, we describe beneficial pre-processing steps in section 5.2.1. In section 5.2.2 we describe a stacked denoising autoencoder approach for chord recognition in which the outputs are chord symbol probabilities directly, and in section 5.2.3 we propose an extension of this approach, inspired by a system developed for face recognition and phone recognition by Tang and Mohamed (2012) using a so-called multi-resolution deep belief network, and apply it to chord recognition with stacked denoising autoencoders. Appendix A describes the theoretical foundation of applying a joint optimization of the HMM and neural network for chord recognition.
5.1
Comparison System
In this section we describe a basic comparison system for the other approaches implemented. It reflects the structure of most current approaches and uses state-of-the-art features for chord recognition. Most recent chord recognition systems rely on an improved computation of the PCP vector and take extra information into account such as bass notes or key information. This extra information is usually incorporated into a more elaborate higher-level framework, such as multiple HMMs or a Bayesian network.
The comparison system consists of the computation of state-of-the-art PCP vectors for all frames, but only a single HMM for later classification and temporal alignment of chords, which allows for a fairer comparison to the stacked denoising autoencoder approaches. The basic computation steps described in the following are used in the approach described by Ni et al. (2012). They split the computation of features into a bass chromagram and a treble chromagram, and track them with two additional HMMs. The computed frames are aligned according to a beat estimate. To make this more elaborate system comparable, we only compute one chromagram containing both bass and treble, use a single HMM for temporal smoothing, and do not align frames according to a beat estimate. In section 5.1.1 we first describe the basic steps of computing PCP features, which have been the predominant features in chord recognition for 15 years; in section 5.1.2 we then describe the extensions of the basic PCP used in the comparison system.
5.1.1
Basic Pitch Class Profile Features
The basic pipeline for computing a pitch class profile as a feature for chord recognition consists of three steps:

1. The signal is projected from the time to the frequency domain through a Fourier transform. Often files are downsampled to 11 025 Hz to allow for faster computation. This is also done in the reference system. The range of frequencies is restricted through filtering, to only analyse frequencies below, e.g., 4000 Hz (about the range of the keyboard of a piano, see figure 1) or similar, since other frequencies carry less information about the chord notes played and introduce more noise to the signal. In the reference system a frequency range from approximately 55 Hz to 1661.2 Hz is used, as this interval is proposed in the original system (Ni et al., 2012).

2. The second step consists of a constant-Q transform, which projects the amplitude of the signal in the linear frequency space to a logarithmic representation of signal amplitude, in which each constant-Q transform bin represents the spectral energy with respect to the frequency of a musical note.

3. In a third step the bins representing one musical note and its octave multiples are summed and the resulting vector is sometimes normalized.

In the following section we describe the constant-Q transform and the computation of the PCP in more detail.

Constant-Q transform After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-Q bins. Originally it was proposed as an additional term in the Fourier transform, but Brown and Puckette (1992) have shown that it is computationally more efficient to filter the signal in Fourier space, thus applying the set of filters, transformed into Fourier space, to the signal also in Fourier space. This
can be realized with a matrix multiplication. This transformation process to logarithmically spaced bins is called the constant-Q transform (Brown, 1991). The name stems from the factor $Q$, which describes the relationship between the center frequency of each filter and the filter width, $Q = \frac{f_k}{\Delta f_k}$. $Q$ is a so-called quality factor which stays constant, $f_k$ is the center frequency and $\Delta f_k$ the width of the filter. We can choose the filters such that they filter out the energy contained in musically relevant frequencies (i.e., frequencies corresponding to musical notes):

$$f_{k_{cq}} = \left(2^{1/B}\right)^{k_{cq}} f_{\min}, \qquad (10)$$

where $f_{\min}$ is the frequency of the lowest musical note to be filtered and $f_{k_{cq}}$ the center frequency corresponding to constant-Q bin $k_{cq}$. $B$ denotes the number of constant-Q frequency bins per octave, usually $B = 12$ (one bin per semitone). Setting $Q = \frac{1}{2^{1/B} - 1}$ establishes a link between musically relevant frequencies and the filter width of our filterbank. Different types of filters can be used to aggregate the energy in relevant frequencies and to reduce spectral leakage. For the comparison system we make use of a Hamming window, as described by Brown and Puckette (1992):

$$w(n, f_{k_{cq}}) = 0.54 + 0.46 \cos\left(\frac{2\pi n}{M(f_{k_{cq}})}\right), \qquad (11)$$

where $n = -\frac{M(f_{k_{cq}})}{2}, \ldots, \frac{M(f_{k_{cq}})}{2} - 1$, and $M(f_{k_{cq}})$ is the window size, computable from $Q$, the corresponding center frequency $f_{k_{cq}}$ of constant-Q bin $k_{cq}$, and the sampling rate $f_s$ of the input signal (Brown, 1991):

$$M(f_{k_{cq}}) = Q \, \frac{f_s}{f_{k_{cq}}}. \qquad (12)$$
We can now compute the filters and thus the respective sound power in the signal filtered according to a musically relevant set of center frequencies. Instead of applying these filters in the time domain, it is computationally more efficient to do so in the spectral domain, by projecting the window functions to Fourier space first. We can then apply the filters through a matrix multiplication in frequency space. As denoted by Brown and Puckette (1992), for bin $k_{cq}$ of the constant-Q transform we can write:

$$X^{cq}[k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, K[k, k_{cq}], \qquad (13)$$

where $k_{cq}$ describes the constant-Q transform bin, $X[k]$ the signal amplitude at bin $k$ in the Fourier domain, $N$ the number of Fourier bins, and $K[k, k_{cq}]$ the value of the Fourier transform of our filter $w(n, f_{k_{cq}})$ for constant-Q bin $k_{cq}$ at Fourier bin $k$. Choosing the right minimum frequency and quality factor will result in constant-Q bins corresponding to harmonically relevant frequencies. Having transformed the linearly spaced amplitude per frequency to musically spaced constant-Q transform bins, we can now continue to aggregate notes that are one octave apart, hereby reducing the dimension of the feature vector significantly.
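The following NumPy sketch builds spectral kernels from Hamming-windowed complex exponentials and applies equation (13) as a matrix product. The minimum frequency, FFT length and the use of the conjugated kernel follow common practice for this construction and are assumptions here, not parameters taken from the reference system.

```python
import numpy as np

def cqt_kernels(fs, fmin, n_bins, bins_per_octave=12, n_fft=4096):
    """Spectral kernels K[k, k_cq] for the constant-Q transform."""
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
    K = np.zeros((n_fft, n_bins), dtype=complex)
    for kcq in range(n_bins):
        fk = fmin * 2 ** (kcq / bins_per_octave)        # eq. (10)
        M = int(round(Q * fs / fk))                     # eq. (12); assumes M <= n_fft
        n = np.arange(M)
        # Hamming window (equivalent, up to centring, to eq. 11) times a complex
        # exponential at the bin's centre frequency
        kernel = np.hamming(M) / M * np.exp(2j * np.pi * Q * n / M)
        temporal = np.zeros(n_fft, dtype=complex)
        temporal[:M] = kernel
        K[:, kcq] = np.fft.fft(temporal)
    return K

def constant_q(frame, K):
    """Constant-Q transform of one time-domain frame via the Fourier-space product (eq. 13)."""
    N = K.shape[0]
    X = np.fft.fft(frame, n=N)
    # Conjugated kernel, following Brown and Puckette's spectral-kernel formulation
    return (X @ np.conj(K)) / N
```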
PCP Aggregation Shepard’s (1964) experiments on human perception of music suggest that humans perceive notes one octave apart as belonging to the same group of notes, known as pitch classes. Given these results we compute pitch class profiles based on the signal energy in logarithmic spectral space. As described by Lee (2006):

$$PCP[k] = \sum_{m=0}^{N_{cq}-1} \left| X^{cq}(k + mB) \right|, \qquad (14)$$

where $k = 1, 2, \ldots, B$ is the index of the PCP bin and $N_{cq}$ is the number of octaves in the frequency range of the constant-Q transform. Usually $B = 12$, so that one bin for each musical note in one octave is computed. For pre-processing, e.g., correction of minor tuning differences, $B = 24$ or $B = 36$ are also sometimes used. Hereafter the resulting vector is usually normalized, typically with respect to the $L_1$, $L_2$ or $L_\infty$ norm.
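A minimal sketch of this aggregation and normalization step, assuming the constant-Q bins start at the reference pitch class and use B bins per octave:

```python
import numpy as np

def pcp_from_cqt(X_cq, bins_per_octave=12, norm="L2"):
    """Aggregate constant-Q magnitudes into a pitch class profile (eq. 14)."""
    mags = np.abs(X_cq)
    n_octaves = mags.shape[0] // bins_per_octave
    # Sum each pitch class over all octaves
    pcp = mags[: n_octaves * bins_per_octave].reshape(n_octaves, bins_per_octave).sum(axis=0)
    if norm == "L1":
        pcp /= max(pcp.sum(), 1e-12)
    elif norm == "L2":
        pcp /= max(np.linalg.norm(pcp), 1e-12)
    elif norm == "Linf":
        pcp /= max(pcp.max(), 1e-12)
    return pcp
```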
5.1.2
Comparison System: Simplified Harmony Progression Analyzer
In this section I describe the refinements made to the very basic chromagram computation defined above. The state-of-the-art system proposed by Ni et al. (2012) takes additional context into account. They state that tracking the key and the bass line provides important context and useful additional information for recognizing musical chords. For a more accurate comparison with the stacked denoising autoencoder approaches, which cannot easily take such context into account, we discard the musical key, bass and beat information that is used by Ni et al. We compute the features with the code that is freely available from their website5 and adjust it to a fixed step size of 1024 samples at a sampling rate of 11 025 Hz, i.e., a step size of approximately 0.09 s per frame, instead of a beat-aligned step size. In addition to a so-called harmonic percussive sound separation algorithm as described by Ono et al. (2008), which attempts to split the signal into a harmonic and a percussive part, Ni et al. implement a loudness-based PCP vector and correct for minor tuning deviations.
5.1.3
Harmonic Percussive Sound Separation
Ono et al. (2008) describe a method to discriminate between the percussive contribution to the Fourier transform and the harmonic one. This can be achieved by exploiting the fact that percussive sounds most often manifest themselves as bursts of energy spanning a wide range of frequencies but only during a limited time. On the other hand, harmonic components span a limited frequency range but are more stable over time. Ono et al. present a way to estimate the percussive and harmonic parts of the signal contribution in Fourier space as an optimization problem which can be solved iteratively: $F_{h,i}$ is the short-time Fourier transform of an audio signal $f(t)$ and $W_{h,i} = |F_{h,i}|^2$ is its power spectrogram. We minimize the $L_2$ norm of the power spectrogram gradients, $J(H, P)$, with $H_{h,i}$ the harmonic component and $P_{h,i}$ the percussive component, with $h$ the frequency bin and $i$ the time in Fourier space:

$$J(H, P) = \frac{1}{2\sigma_H^2} \sum_{h,i} (H_{h,i-1} - H_{h,i})^2 + \frac{1}{2\sigma_P^2} \sum_{h,i} (P_{h-1,i} - P_{h,i})^2, \qquad (15)$$

subject to the constraints that

$$H_{h,i} + P_{h,i} = W_{h,i}, \qquad (16)$$
$$H_{h,i} \geq 0, \qquad (17)$$
$$P_{h,i} \geq 0, \qquad (18)$$

where $W_{h,i}$ is the original power spectrogram, as described above, and $\sigma_H$ and $\sigma_P$ are parameters to control the smoothness vertically and horizontally. Details of an iterative optimization procedure can be found in the original paper.

5 https://patterns.enm.bris.ac.uk/hpa-software-package
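To illustrate what is being minimized, the sketch below performs a simple projected gradient descent on $J(H, P)$ with the constraint $H + P = W$ enforced by construction. It is not the iterative update rule of Ono et al. (2008); the smoothness parameters, step size and wrap-around boundary handling are assumptions made only for this illustration.

```python
import numpy as np

def hpss_objective_descent(W, sigma_h=1.0, sigma_p=1.0, n_iter=100, step=0.1):
    """Split a power spectrogram W (freq x time) into harmonic H and percussive P.

    Plain projected gradient descent on the objective J(H, P) of eq. (15), with
    P = W - H enforcing eq. (16) and clipping enforcing eqs. (17)-(18).
    """
    H = 0.5 * W
    for _ in range(n_iter):
        P = W - H
        # Gradient of the temporal-smoothness term (second difference along time)
        grad_h = (2 * H - np.roll(H, 1, axis=1) - np.roll(H, -1, axis=1)) / sigma_h ** 2
        # Gradient of the spectral-smoothness term (second difference along frequency)
        grad_p = (2 * P - np.roll(P, 1, axis=0) - np.roll(P, -1, axis=0)) / sigma_p ** 2
        # Since P = W - H, the total gradient with respect to H is grad_h - grad_p
        # (boundary rows/columns wrap around here, purely for brevity)
        H = np.clip(H - step * (grad_h - grad_p), 0.0, W)
    return H, W - H
```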
5.1.4
Tuning and Loudness-Based PCPs
Here we describe further refinements of the PCP vector: first how to take minor deviations (less than a semitone) from the reference tuning into account, and then an addition proposed by Ni et al. (2012) to model human loudness perception.

Tuning To take into account minor pitch shifts in the tuning of a specific song, features are fine-tuned as described by Harte and Sandler (2005). Instead of computing a 12-bin chromagram directly, we can compute multiple bins for each semitone, as described in section 5.1.1, by setting B > 12 (e.g., B = 36). We can then compute a histogram of sound power peaks with respect to frequency and select a subset of constant-Q bins to compute the PCP vectors, shifting our reference tuning according to the small deviations of a song.

Loudness-Based PCPs Since human loudness perception is not linear with respect to frequency, Ni et al. (2012) propose a loudness weighting function. First we can compute a “sound power level matrix”:

$$L_{s,t} = 10 \log_{10} \frac{\|X_{s,t}\|^2}{p_{ref}}, \quad s = 1, \ldots, S, \; t = 1, \ldots, T, \qquad (19)$$

where $p_{ref}$ indicates the fundamental reference power, and $X_{s,t}$ the constant-Q transform of our input signal as described in the previous section ($s$ denoting the constant-Q transform bin and $t$ the time). They propose to use A-weighting (Talbot-Smith, 2001), in which we add a specific value depending on the frequency. An approximation to the human sensitivity of loudness perception with respect to frequency is then given by:

$$L'_{s,t} = L_{s,t} + A(f_s), \quad s = 1, \ldots, S, \; t = 1, \ldots, T, \qquad (20)$$

where

$$A(f_s) = 2.0 + 20 \log_{10}(R_A(f_s)), \qquad (21)$$

and

$$R_A(f_s) = \frac{12200^2 f_s^4}{(f_s^2 + 20.6^2)\sqrt{(f_s^2 + 107.7^2)(f_s^2 + 737.9^2)}\,(f_s^2 + 12200^2)}. \qquad (22)$$

Having calculated this we can proceed to compute the pitch class profiles as described above, using $L'_{s,t}$. Ni et al. normalize the loudness-based PCP vector after aggregation according to:

$$X_{p,t} = \frac{X'_{p,t} - \min_{p'} X'_{p',t}}{\max_{p'} X'_{p',t} - \min_{p'} X'_{p',t}}, \qquad (23)$$

where $X'_{p,t}$ denotes the value for PCP bin $p$ at time $t$. Ni et al. state that due to this normalization, specifying the reference sound power level $p_{ref}$ is not necessary.
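A small sketch of equations (19)–(23), with the reference power $p_{ref}$ set to 1 (which, as noted above, the normalization makes irrelevant):

```python
import numpy as np

def a_weight(f):
    """A-weighting in dB for frequency f in Hz (eqs. 21-22)."""
    ra = (12200.0 ** 2 * f ** 4) / (
        (f ** 2 + 20.6 ** 2)
        * np.sqrt((f ** 2 + 107.7 ** 2) * (f ** 2 + 737.9 ** 2))
        * (f ** 2 + 12200.0 ** 2)
    )
    return 2.0 + 20.0 * np.log10(ra)

def loudness_weighted(X_cq, centre_freqs, p_ref=1.0):
    """Loudness-based constant-Q values (eqs. 19-20); X_cq has shape (bins, frames)."""
    power = np.maximum(np.abs(X_cq) ** 2, 1e-12)
    L = 10.0 * np.log10(power / p_ref)                 # eq. (19)
    return L + a_weight(centre_freqs)[:, None]          # eq. (20)

def normalise_frames(X):
    """Per-frame min-max normalisation of the aggregated PCP (eq. 23)."""
    mn, mx = X.min(axis=0, keepdims=True), X.max(axis=0, keepdims=True)
    return (X - mn) / np.maximum(mx - mn, 1e-12)
```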
5.1.5
HMMs
In this section we give a brief overview of the hidden Markov model (HMM), as far as it is important for this thesis. It is a widely used model for speech as well as chord recognition. A musical song is highly structured in time – certain chord sequences and transitions are more common than others – but PCP features do not take any time dependencies into account by themselves. A temporal alignment can increase the performance of a chord recognition system. Additionally, since we compute the PCP features from the amplitude of the signal alone, which is noisy with regard to chord information due to percussion, transient noise or other sources, the resulting feature vector is not clean. HMMs in turn are used to deal with noisy data, which adds another argument for using HMMs for temporal smoothing.

Definition There exist several variants of HMMs. For our comparison system we restrict ourselves to an HMM with a single Gaussian emission distribution for each state. For the stacked denoising autoencoders we use the output of the autoencoders directly as a chord estimate and as emission probability. An HMM with a Gaussian emission probability is a so-called continuous-densities HMM. It is capable of interpreting multidimensional real-valued input such as the PCP vectors we use as features, described above in section 5.1.1. An HMM estimates the probability of a sequence of latent states corresponding to a sequence of lower-level observations. As described by Rabiner (1989), an HMM can be defined as a 5-tuple consisting of:

1. $N$, the number of states in the model.

2. $M$, the number of distinct observations, which in the case of a continuous-densities HMM is infinite.

3. $A = \{a_{ij}\}$, the state transition probability distribution, where $a_{ij} = P(q_{t+1} = S_j | q_t = S_i)$, $1 \leq i, j \leq N$, and $q_t$ denotes the current state at time $t$. If the HMM is ergodic (i.e., all transitions to every state from every state are possible), $a_{ij} > 0$ for all $i$ and $j$. Transition probabilities satisfy the stochastic constraints $\sum_{j=1}^{N} a_{ij} = 1$ for $1 \leq i \leq N$.
4. $B = \{b_j(O)\}$, the set of observation probabilities, which in our case is infinite. $b_j(O_t) = P(O_t | q_t = S_j)$ is the observation probability in state $j$, where $1 \leq j \leq N$, for observation $O_t$ at time $t$. If we assume a continuous-density HMM, i.e., we have a real-valued, possibly multidimensional input, we can use a (mixture of) Gaussian distributions for the probability distribution $b_j(O)$: $b_j(O_t) = \sum_{m=1}^{M} Z_{jm} \mathcal{N}(O_t, \mu_{jm}, \Sigma_{jm})$, with $1 \leq j \leq N$. Here $O_t$ is the input vector at time $t$, $Z_{jm}$ the mixture weight (coefficient) for the $m$th mixture in state $j$ and $\mathcal{N}(O, \mu_{jm}, \Sigma_{jm})$ the Gaussian probability density function, with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$ for state $j$ and component $m$.

5. $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = S_i)$, with $1 \leq i \leq N$. This is the initial state probability.

Parameter Estimation We can define the states to be the 24 chord symbols and the non-chord symbol for the simple major-minor chord discrimination task, and 217 different symbols for the extended chord vocabulary, including major, minor, 7th and inverted chords and the non-chord symbol. The features in the case of the baseline system are computed as a 12-bin PCP vector, with a single Gaussian as emission model for the HMM. In the case of the stacked denoising autoencoder systems, we can use the output of the networks directly as emission probabilities. Since we are dealing with a fully annotated dataset, it is trivial to estimate the initial state probabilities and the transitions by computing relative frequencies with the help of the supplied ground truth. In the case of a Gaussian emission model, we can estimate the parameters from training data with the EM algorithm (McLachlan et al., 2004).

Likelihood of a Sequence To compute the likelihood of given observations belonging to a certain chord sequence, we can compute:

$$P(q_1, q_2, \ldots, q_t, O_1, O_2, \ldots, O_t | \lambda) = \pi_1 b_1(O_1) \prod_{t=2}^{T} a_{t,t-1} b_t(O_t), \qquad (24)$$

where $\pi_1$ is the initial state probability for the state at time 1, $b_1(O_1)$ the emission probability for the first observation, $a_{t,t-1}$ the transition probability from the state at time $t-1$ to the state at time $t$, and $b_t(O_t)$ the emission probability at time $t$ for observation $O_t$. $\lambda$ denotes the parameters of our HMM. The most likely sequence of hidden states for given observations can be computed efficiently with the help of the Viterbi algorithm (see Rabiner, 1989, for details).
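For completeness, a standard log-space Viterbi decoder is sketched below. The emission matrix can hold either the Gaussian emission densities of the comparison system or the per-frame chord probabilities produced by the stacked denoising autoencoders; all variable names are illustrative.

```python
import numpy as np

def viterbi(emission, transition, initial):
    """Most likely state sequence for an HMM.

    emission   : (T, N) per-frame emission scores for each state
    transition : (N, N) transition matrix, transition[i, j] = P(j at t+1 | i at t)
    initial    : (N,)   initial state distribution
    """
    T, N = emission.shape
    log_e = np.log(emission + 1e-12)
    log_a = np.log(transition + 1e-12)
    delta = np.log(initial + 1e-12) + log_e[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a      # best score of reaching each state at time t
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_e[t]
    # Backtrack the best path
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```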
5.2
Stacked Denoising Autoencoders for Chord Recognition
A piece of music contains additional non-harmonic information, or harmonic information which does not directly contribute to the chord played at a certain time in the song. This can be considered noise for the objective of estimating the correct chord progressions of a song. Since stacked denoising autoencoders are trained to reduce artificially added noise, they seem to be a suitable choice for application to noisy data, and they have been shown to achieve state-of-the-art performance on several benchmark tests, including audio genre classification (Vincent et al., 2010). Moreover, deep learning architectures can be partly trained in an unsupervised manner, which might prove useful for a field like chord recognition, since there is a huge amount of unlabeled digitized musical data available, but only a very limited fraction of it is annotated. In this section I describe two systems relying on stacked denoising autoencoders for chord recognition. The preprocessing of the input data follows the same basic steps for the two stacked denoising autoencoder approaches, described in section 5.2.1. All approaches make use of an HMM to smooth and interpret the neural network output as a post-classification step. Since the chord ground truth is given, we are also able to calculate a “perfect” PCP and train stacked denoising autoencoders to approximate it from the given FFT input. A description of how to apply a joint optimization procedure for the HMM and neural network for chord recognition, taken from speech recognition, is given in appendix A (this did not yield any further improvements, however). Furthermore, it is possible to train a stacked denoising autoencoder to model chord probabilities directly, which are then smoothed by an HMM, as described in section 5.2.2. Hereafter I propose an extension to this approach by extending the input of the stacked denoising autoencoders to cover multiple resolutions, smoothed over different time spans, in section 5.2.3.
5.2.1
Preprocessing of Features for Stacked Denoising Autoencoders
In all approaches described below, we apply the stacked denoising autoencoders directly to the Fourier-transformed signal. This minimizes the preprocessing steps and the restrictions they impose, but some preprocessing of the input can still increase performance:

1. To restrict the search space, only the first 1500 FFT bins are used. This restricts the frequency range to approximately 0 to 3000 Hz. Most of the frequencies emitted by harmonic instruments are still contained in this interval.

2. Since values taken directly from the FFT contain high-energy peaks, we apply a square-root compression, as done by Boulanger-Lewandowski et al. (2013) for deep belief networks.

3. We then normalize the FFT frames according to the $L_2$ norm in a final preprocessing step.
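These three steps amount to only a few lines; a sketch, assuming the FFT magnitudes are already computed:

```python
import numpy as np

def preprocess_fft_frame(fft_frame, n_bins=1500):
    """Preprocess one FFT magnitude frame for the stacked denoising autoencoders."""
    x = np.abs(fft_frame[:n_bins])      # 1. keep the first 1500 bins (~0-3000 Hz)
    x = np.sqrt(x)                      # 2. square-root compression of energy peaks
    norm = np.linalg.norm(x)            # 3. L2 normalisation
    return x / norm if norm > 0 else x
```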
5.2.2
Stacked Denoising Autoencoders for Chord Recognition

Figure 5: Stacked denoising autoencoder for chord recognition, single resolution (pipeline: FFT input, one time frame → preprocessing → SDAE → chord symbols).

Humphrey et al. (2012) state that the performance of chord recognition systems has not improved significantly recently, and suggest that one reason could be the widespread usage of PCP features. They try to find a different representation by modelling a Tonnetz using convolutional neural networks. Cho and Bello (2014), who evaluate the influence of different parts of chord recognition systems on performance, also come to the conclusion that the choice of feature computation has a great influence on the overall performance and suggest the exploration of other types of features differing from the PCP. A nice property of deep learning approaches is that they are often able to find a higher-level representation of the input data by themselves and do not rely on predefined feature computation. When classifying data, we can train a neural network to output pseudo-probabilities for each class given an input. This is done through a final logistic regression (or softmax) layer on the output of the neural network. We use a softmax output and a 1-of-K encoding, such that we have K outputs, each of which can be interpreted as the probability of a certain chord being played. Thus we can use the output of such a softmax output layer directly as a substitute for the emission probability of the HMM and further process it with temporal smoothing to compute a final chord symbol output. Since deep learning provides us with a powerful strategy for neural network training, we are able to discard all steps of the conventional PCP vector computation and the restrictions that might be imposed by them – apart from the FFT – and train the network to classify chords. This differs from previous approaches
like Boulanger-Lewandowski et al. (2013) and Glazyrin (2013), who use deep learning techniques but still model PCPs either as intermediate target or as output of the neural network. Figure 5 depicts the processing pipeline of the system. This system, with a single input frame is referred to as stacked denoising autoencoder (SDAE). 5.2.3
Multi-Resolution Input for Stacked Denoising Autoencoders

Figure 6: Stacked denoising autoencoder for chord recognition, multi-resolution (pipeline: FFT input, multiple time frames → median filters → preprocessing → concatenate frames → SDAE → chord symbols).

Glazyrin (2013), who uses stacked denoising autoencoders (with and without recurrent layers) to estimate PCP vectors from the constant-Q transform, states that he suspects it to be beneficial to take multiple subsequent frames into account, but also writes that informal experiments did not show any improvements in recognition performance. Boulanger-Lewandowski et al. (2013) also make use of a recurrent layer with a deep belief network to take temporal information into account before additional (HMM) smoothing. Both approaches thus reason that it might be beneficial to take temporal information into account before using an HMM as a final computation step. We can find a similar paradigm in Tang and Mohamed (2012), used with deep learning. They propose a system in which images of faces are analyzed by a deep belief network. In addition to the original image, they propose extending the input with differently subsampled versions of the image for face recognition and report improved performance over a single-resolution input. They also report improved performance for extending the classifier input to several inputs
with different subsampling ranges applied to phone recognition and temporal smoothing with deep belief networks on the TIMIT dataset. The proposed system in this thesis is designed to take additional temporal information into account before the HMM post-processing as well. Following the intuition of Glazyrin and the idea of Tang et al., we extend the input of the stacked denoising autoencoder, computing two different time resolutions of the FFT and concatenating them with the original input of the stacked denoising autoencoders. In addition to the original FFT vector, we apply a median filter for different ranges of subsequent frames around the current frame. After median filtering each vector is preprocessed as indicated in section 5.2.1. Hereafter we join the resulting vectors and use them as frame-wise input for the stacked denoising autoencoders. Cho and Bello (2014) conduct experiments to evaluate the influence on performance of different parts of the most prevalent constituents of chord recognition systems. They find that pre-smoothing has a significant impact on chord recognition performance in their experiments. They state that through filtering we can eliminate or reduce transient noise, which is generated by short bursts of energy such as percussive instruments, although this has the disadvantage to also “smear” chord boundaries. However, in the proposed system we supply both the original input in which the chord boundaries are “sharp”, but with transient noise, and a version that is smoothed. Cho and Bello (2014) compare average filtering and median filtering and find that there is little to no difference in terms of recognition performance. We use a median filter instead of an average filter since it is a prevalent approach in chord recognition. Median filters are applied in several other approaches, e.g., Peeters (2006), or Khadkevich and Omologo (2009b), to reduce transient noise. The stacked denoising autoencoders are again trained to output chord probabilities by fine tuning with traditional backpropagation. In the following we refer to this as a multi resolution stacked denoising autoencoder (MR-SDAE). Figure 6 illustrates the processing pipeline of the MR-SDAE.
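A sketch of how such a multi-resolution input vector can be assembled for one frame, reusing the per-frame preprocessing from section 5.2.1; the window sizes default to the ±3 and ±9 frames used later in the experiments:

```python
import numpy as np

def multi_resolution_input(fft_frames, t, preprocess, spans=(3, 9)):
    """Build the MR-SDAE input for frame t from a (T, n_fft) magnitude matrix.

    `preprocess` is the per-frame preprocessing function; `spans` gives the number
    of previous/subsequent frames for each median-filtered input.
    """
    parts = [preprocess(fft_frames[t])]
    for span in spans:
        lo, hi = max(0, t - span), min(len(fft_frames), t + span + 1)
        # Median over the window around frame t, then the same preprocessing
        smoothed = np.median(fft_frames[lo:hi], axis=0)
        parts.append(preprocess(smoothed))
    return np.concatenate(parts)
```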
6
Results
Finding suitable training and testing sets for chord estimation is difficult because transcribing chords in songs requires a significant amount of training, even for humans. Only experts are able to transcribe chord progressions of songs accurately and in full detail. Furthermore, most musical pieces are subject to strict copyright laws. This poses the problem that ground truth and audio data are delivered separately. Different recordings of the same song might not fit exactly to the ground truth available due to minor temporal deviations. There are, fortunately, tools to align ground truth data and audio files. For the following experiments, Dan Ellis’ AUDFPRINT tool was used to align audio files with publicly available ground truth.6 We report results on two different datasets: a transcription of 180 Beatles songs, and the publicly available part of the McGill Billboard dataset, containing 740 songs. The Beatles dataset has been available for several years, and as other training data is scarce, many algorithms published in the MIREX challenge have been pretrained on this dataset. Because of the same scarcity of good data, the MIREX challenge has also used the Beatles dataset (with a small number of additional songs) to evaluate the performance of chord recognition algorithms, and thus the “official” results on the Beatles dataset might be biased. We report a cross-validation performance, in which we train the algorithm on a subset of the data and test it on the remaining unseen part. This we repeat ten times for different subsets of the dataset, and report the average performance and 95% confidence interval. This is done to give an estimate of how the proposed methods might perform on unseen data. However, the Beatles dataset contains music by only one group of musicians, which itself might bias the results, since musical groups tend to have a certain style of playing music. Therefore we also conduct experiments on the Billboard dataset, which is not restricted to one group of musicians, but rather contains popular songs from the Billboard Hot 100 charts from 1958 to 1991. Additionally, the Billboard dataset contains more songs, thus providing us with more training examples. To compare the proposed methods to other methods, we use the training and testing set of the MIREX 2013 challenge, a subset of the McGill Billboard dataset that was unpublished before 2012 but is available now. Although there are more recent results on the Billboard dataset (MIREX 2013), the test set ground truth for that part of the dataset has not yet been released. Deep learning neural network training was implemented with the help of Palm's deep learning MATLAB toolbox (Palm, 2012). HMM smoothing was realized with functions of Kevin Murphy's Bayes net MATLAB toolbox.7 Computation of state-of-the-art features was done using Ni et al.'s code.8 In the following I first give an explanation of how we measure the performance of the algorithms in section 6.2. Training algorithms to learn the set of all possible chords is infeasible at this point in time due to the number of possible chords and the relative frequencies of chords appearing in the publicly available datasets. Certain chords appear in popular songs more frequently than others, and so we train the algorithms to recognize a set of these chord symbols
6 http://www.ee.columbia.edu/~dpwe/resources/matlab/audfprint/
7 https://github.com/bayesnet/bnt
8 https://patterns.enm.bris.ac.uk/hpa-software-package
containing only major and minor chords, which we call the restricted chord vocabulary, and a set of chords containing major, minor, 7th and inverted chords, which we call the extended chord vocabulary. In section 6.1, I describe how to interpret chords that are not part of these sets. Results are reported for both chord symbol subsets on the Beatles dataset in section 6.5 for the reference system, SDAEs and MR-SDAEs. Results for both subsets on the Billboard set are reported in section 6.6. The results of other algorithms submitted to MIREX 2013 for the Billboard test set used in this thesis are stated in section 7.5.
6.1
Reduction of Chord Vocabulary
As described in section 2.2, the chords considered in this thesis consist of three or four notes with distinct interval relationships to the root note. We work with two chord symbol sets. The first contains only major and minor chords with three notes; the second is an extension of this chord symbol set that also contains 7th and inverted chords. For the Billboard dataset these two subsets are already supplied. For the Beatles dataset, we need to reduce the chords in the ground truth to match the chord symbol sets we want to recognize, since those are fully detailed transcriptions which contain chord symbols not in our defined subsets. Some chords are an extension of other chords, e.g., C:maj7 can be seen as an extension of C:maj, since the former contains the same notes as the latter plus an additional fourth note with interval 7 above the root note C. We thus reduce all other chords in the ground truth according to the following set of rules:

1. If the ground truth chord symbol is in the subset of chord symbols to be recognized, leave it unchanged.

2. If a subset of its notes matches a chord symbol in the recognition set, denote the symbol from the recognition set instead of the original ground truth symbol (e.g., C:maj7 is mapped to C:maj for the restricted vocabulary).

3. If no symbol in the recognition set matches a subset of the chord notes of the original ground truth symbol, denote it as a non-chord (e.g., C:dim is mapped to the non-chord symbol).
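These rules can be expressed as a small lookup over pitch-class sets; the dictionaries in the example below are hypothetical and only illustrate the mapping.

```python
def reduce_chord(label, chord_notes, vocabulary, non_chord="N"):
    """Map a detailed ground-truth chord label onto the recognition vocabulary.

    chord_notes : dict mapping a ground-truth label to its set of pitch classes
    vocabulary  : dict mapping each recognizable label to its set of pitch classes
    """
    # Rule 1: already in the vocabulary
    if label in vocabulary:
        return label
    notes = chord_notes[label]
    # Rule 2: a vocabulary chord whose notes are a subset of the ground-truth chord
    for symbol, symbol_notes in vocabulary.items():
        if symbol_notes <= notes:
            return symbol
    # Rule 3: otherwise map to the non-chord symbol
    return non_chord

# Hypothetical example: C:maj7 = {C, E, G, B} contains C:maj = {C, E, G}
chord_notes = {"C:maj7": {0, 4, 7, 11}, "C:dim": {0, 3, 6}}
vocabulary = {"C:maj": {0, 4, 7}, "C:min": {0, 3, 7}}
print(reduce_chord("C:maj7", chord_notes, vocabulary))  # -> C:maj
print(reduce_chord("C:dim", chord_notes, vocabulary))   # -> N
```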
6.2
Score Computation
The results reported use a method of measurement that has been proposed by Harte (2010) and Mauch (2010): the weighted chord symbol recall (WCSR). In the following a description of how it is computed is provided. 6.2.1
Weighted Chord Symbol Recall
Since most chord recognition algorithms, including the ones proposed here, work on a discretized input space, while the ground truth is measured in continuous segments with a start time, an end time and a distinct chord symbol, we need a measure to estimate the performance of any proposed algorithm. This could be achieved by simply discretizing the ground truth according to the discretization of the estimation, and hereafter performing a frame-wise comparison. However,
Harte (2010) and Mauch (2010) propose a more accurate measure. The frame-wise comparison measure can be enhanced by computing the relative overlap of matching chord segments between the continuous-time ground truth and the frame-wise estimation of chord symbols by the recognition system. This is called the chord symbol recall (CSR):

$$CSR = \frac{\sum_i \sum_j \left| S_i^A \cap S_j^E \right|}{\sum_i \left| S_i^A \right|}, \qquad (25)$$

where $S_i^A$ is one segment of the hand-annotated ground truth, and $S_j^E$ one segment of the machine estimation. The test set for musical chord recognition usually contains several songs, which each have a different length and contain a different number of chords. Thus we can extend the CSR to a corpus of songs by summing the results for each song weighted by its length. This is the weighted chord symbol recall (WCSR), used for evaluating performance on a corpus containing several songs:

$$WCSR = \frac{\sum_{i=0}^{N} L_i \, CSR_i}{\sum_{i=0}^{N} L_i}, \qquad (26)$$

where $L_i$ is the length of song $i$ and $CSR_i$ the chord symbol recall between the machine estimation and the hand-annotated segments for song $i$.
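A direct implementation of equations (25) and (26) over segment lists, assuming each segment is given as a (start, end, label) triple in seconds:

```python
def chord_symbol_recall(annotation, estimate):
    """CSR (eq. 25): annotation and estimate are lists of (start, end, label) segments."""
    overlap = 0.0
    for a_start, a_end, a_label in annotation:
        for e_start, e_end, e_label in estimate:
            if a_label == e_label:
                # Duration of the overlap between the two matching segments
                overlap += max(0.0, min(a_end, e_end) - max(a_start, e_start))
    total = sum(end - start for start, end, _ in annotation)
    return overlap / total if total > 0 else 0.0

def weighted_csr(songs):
    """WCSR (eq. 26): songs is a list of (length, annotation, estimate) tuples."""
    num = sum(length * chord_symbol_recall(ann, est) for length, ann, est in songs)
    den = sum(length for length, _, _ in songs)
    return num / den if den > 0 else 0.0
```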
6.3
Training Systems Setup
The following parameters were found to be suitable in the experiments. The stacked denoising autoencoders are trained with 30 iterations of unsupervised training with additive Gaussian noise with variance 0.2 and a fraction of corrupted inputs of 0.7. The autoencoders have 2 hidden layers with 800 hidden nodes each and a sigmoid activation function; the output layer contains as many nodes as there are chord symbols. To enforce sparsity, an activation penalty weighting of β = 0.1 and a target activation of p = 0.05 are used. The dropout is set to 0.5, and batch training with batches of 100 samples is used. The learning rate is set to 1 and the momentum to 0.5. For the MR-SDAE, the 3 previous and subsequent frames are used for the second input vector, and the 9 previous and subsequent frames for the third input vector. Due to memory restrictions, only a subset of the frames of the complete training set is employed for training the stacked denoising autoencoder based systems. 10% of the training data is separated for validation during training. Additionally, I extended Palm's deep-learning library with an early stopping mechanism, which stops supervised training after the performance on the validation set does not improve for 20 iterations, or else after 500 iterations, to restrict computation time. It then returns the best performing weight configuration according to the training validation. For the comparison system, since not all chords of the extended chord vocabulary are included in all datasets, missing chords are substituted with the mean PCP vector of the training set. Malformed covariance matrices are corrected by adding a small amount of random noise.
6.4
Significance Testing
Similar to Mauch and Dixon (2010a), a Friedman multiple comparison test is used to test for significant differences in performance between the proposed algorithms and the reference system. This tests the performance of the different algorithms on a song level, but differs from the WCSR, which takes the song length into account in the final score. The Friedman multiple comparison test measures the statistical significance of ranks, thus indicating whether an algorithm outperforms another algorithm with statistical significance on a song level, without regard to the WCSR of the songs in general. For the purpose of testing for statistical significance of performance, we select one fold of the cross-validation on the Beatles dataset on which the performance is close to the mean, and one test run for the Billboard dataset, which is close to the mean as well for the SDAE based approaches. All plots for the post hoc Friedman multiple comparison test for significance show the mean rank and the 95% confidence interval in terms of ranks.
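The omnibus Friedman test itself can be computed, for example, with SciPy, taking one list of per-song scores per system over the same songs. The numbers below are purely illustrative, and the post hoc multiple comparison of mean ranks shown in the following figures requires an additional procedure not sketched here.

```python
from scipy.stats import friedmanchisquare

# Per-song scores for the same songs under each system (illustrative numbers only)
s_hpa   = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60]
sdae    = [0.66, 0.73, 0.58, 0.70, 0.77, 0.63]
mr_sdae = [0.67, 0.72, 0.60, 0.71, 0.78, 0.65]

stat, p = friedmanchisquare(s_hpa, sdae, mr_sdae)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```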
6.5
Beatles Dataset
The Beatles Isophonics dataset9 contains songs by the Beatles and Zweieck. We only use the Beatles songs for evaluating the performance of the algorithms, since it is difficult to come by the audio data of the Zweieck songs. The Beatles-only subset of this dataset consists of 180 songs. In section 6.5.1 and section 6.5.2, the results for the restricted and extended chord vocabularies are reported for the comparison system, SDAE and MR-SDAE. The cross-validation performance across ten folds is shown. We partition the dataset into ten subsets, where we use one for testing and nine for training. For the first fold we use every tenth song from the Beatles dataset starting from the first, as ordered in the ground truth, for the second fold every tenth song starting from the second, etc. We train ten different models, one for each testing partition. Since we use an HMM smoothing step, for the neural network approaches we show “raw” results without HMM smoothing as well as the final performance of the systems with temporal smoothing. The reference system uses the HMM even for classification, and thus we only report a single final performance statistic. All results are reported as WCSR, as described above and used in the MIREX challenge. Since there are ten different results, one for testing on each partition, I report the average WCSR as well as a 95% confidence interval of the aggregated results. To get an insight into the distribution of performance results, I also plot box-and-whisker diagrams. Finally, I perform Friedman multiple comparison tests for statistical significance across algorithms. Since the implementation of the learning algorithms in MATLAB is memory intensive, I subsample the training partitions for the SDAEs. For the SDAE, I use every 3rd frame for training, and for the MR-SDAE, every 4th frame, resulting in approximately 95 000 and 71 000 training samples for each fold.
6.5.1
Restricted Major-Minor Chord Vocabulary
Friedman multiple comparison tests Values are computed on fold five of the Beatles dataset, which yields a result close to the mean performance for all algorithms tested. In figure 7 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the restricted chord vocabulary task are depicted. The algorithms showed significantly different performance, with p < 0.001.

9 http://isophonics.net/datasets
Figure 7: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).
Whisker Plot and Mean Performance In this section results for the proposed algorithms, SDAE, MR-SDAE and the reference system on the reduced major-minor chord symbol recognition task are presented. Figure 8 depicts a box-and-whisker diagram for the performance of the algorithms with and without temporal smoothing and the performance of the reference system. The upper and lower whiskers depict the maximum and minimum performance of all results of the ten-fold cross validation, while the upper and lower boundaries of the boxes represent the upper and lower quartiles. We can see the median of all runs as a dotted line inside the box. The average WCSR together with 95% confidence intervals over folds before and after temporal smoothing can be found in table 3.
Figure 8: Results for the simplified HPA, SDAE and MR-SDAE for the restricted chord vocabulary 10-fold cross-validation on the Beatles dataset with and without HMM smoothing. Results after smoothing are highlighted in bold.
System     Not smoothed    Smoothed
S-HPA      –               68.92 ± 9.32
SDAE       65.40 ± 6.94    69.69 ± 7.41
MR-SDAE    67.13 ± 7.06    70.05 ± 7.92

Table 3: Average WCSR for the restricted chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary In the Friedman multiple comparison test in figure 7, we observe that the mean ranks of the post-smoothing SDAE and MR-SDAE are significantly higher than the mean ranks of the reference system (S-HPA), and also that smoothing significantly improves the performance. The mean ranks for SDAE and MR-SDAE without smoothing are lower than that of the reference system, however not significantly. The MR-SDAE has a slightly higher mean rank than the SDAE, but not significantly.

In figure 8 we can observe that the pre- and post-smoothing SDAE and MR-SDAE distributions are negatively skewed. The S-HPA, however, is skewed positively. The skewness of the distributions does not change much for the SDAE and MR-SDAE when comparing before and after smoothing; however, smoothing improves the performance in general.

In table 3, we can see that the mean performance of the MR-SDAE slightly outperforms the SDAE and that both achieve a higher mean performance than the reference system after HMM smoothing. The means for the results before HMM smoothing for SDAE and MR-SDAE are lower, however.
6.5.2
Extended Chord Vocabulary
Friedman Multiple Comparison Tests Again values are computed for fold five of the Beatles dataset. In figure 9 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the extended chord vocabulary task are depicted. The algorithms showed significantly different performance, with p < 0.001.
Figure 9: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the extended chord vocabulary for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).
Whisker Plots and Means Similar to above, we depict box-and-whisker diagrams for the unsmoothed and smoothed results of the ten-fold cross-validation for the extended chord symbol set in figure 10. Table 4 depicts the average WCSR and 95% confidence interval over folds for the smoothed and unsmoothed results.
Figure 10: Whisker plot for simplified HPA, SDAE, and MR-SDAE using the extended chord vocabulary and 10-fold cross-validation on the Beatles dataset, with and without smoothing. Results after smoothing are highlighted in bold.
System     Not smoothed    Smoothed
S-HPA      –               48.54 ± 7.89
SDAE       55.93 ± 7.12    59.73 ± 7.37
MR-SDAE    57.52 ± 6.52    60.02 ± 6.81

Table 4: Average WCSR for the simplified HPA, SDAE and MR-SDAE using the extended chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary In the Friedman multiple comparison tests in figure 9, we can observe that again the post-smoothing performance in terms of ranks of the SDAE and MR-SDAE is significantly better than that of the reference system. In comparison to the restricted chord vocabulary recognition task, the margin is even larger. A peculiar thing to note is that with the extended chord vocabulary the pre-smoothing performance of the MR-SDAE is not significantly worse than the post-smoothing performance of both SDAE based chord recognition systems. The SDAE shows lower mean ranks before smoothing than the reference system, and the MR-SDAE seems to perform slightly better than the reference system, although not significantly so before smoothing.

In figure 10, we can see similarly negatively skewed distributions of cross-validation results for SDAE and MR-SDAE, as in the restricted chord vocabulary setting. Again we can observe that the skewness of the distributions does not change much after smoothing, but we can observe an increase in performance.
However, in the extended chord vocabulary task, the medians of the SDAE and MR-SDAE are higher than that of the reference system, showing values even higher than the best performance of the reference system. The reference system on the extended chord vocabulary does not show a distinct skew. The better performance is also reflected in table 4, where the proposed systems achieve higher means before and after HMM smoothing compared to the reference system.
6.6
Billboard Dataset
The McGill Billboard dataset10 consists of songs randomly sampled from the Billboard Hot 100 charts from 1958 to 1991. This dataset currently contains 740 songs, of which we separate 160 songs for testing and use the remainder for training the algorithms. The selected test set corresponds to the official test set of the MIREX 2012 challenge. Although there are results for algorithms of the MIREX 2013 challenge on the Billboard dataset, the ground truth of that specific test set has not been publicly released at this point in time. Similar to the Beatles dataset, the audio files are not publicly available, but there are several different audio recordings for the songs in the dataset. We again use Dan Ellis’ AUDFPRINT tool to align the audio data with the ground truth. For this dataset the ground truth is already available in the right format for the restricted major-minor chord vocabulary and the extended 7th and inverted chord vocabulary, so we do not need to reduce the chords ourselves. Since the Billboard dataset is much larger than the Beatles dataset, we sample every 8th frame for the SDAE training and every 16th for the MR-SDAE, resulting in approximately 170 000 and 85 000 frames respectively. Algorithms were run five times.
6.6.1
Restricted Major-Minor Chord Vocabulary
Friedman Multiple Comparison Tests In figure 11 the results for the post hoc Friedman multiple comparison test for the Billboard restricted chord vocabulary task for the reference system and smoothed and unsmoothed SDAE and MR-SDAE are presented. The algorithms showed significantly different performance, with p < 0.001.
10 http://ddmal.music.mcgill.ca/billboard
Figure 11: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance In this section, results for the MIREX 2012 test partition of the Billboard dataset for the restricted major-minor chord vocabulary are depicted. Table 5 shows the performance of the SDAEs with and without smoothing. Since we do not perform a cross-validation on this dataset and the comparison system does not have any randomized initialization, we report the 95% confidence interval for the SDAEs only, with respect to multiple random initialisations (note that these are not directly comparable to the confidence intervals over cross-validation folds as reported for the Beatles dataset).

System     Not smoothed    Smoothed
S-HPA      –               66.04
SDAE       61.19 ± 0.32    66.35 ± 0.31
MR-SDAE    62.97 ± 0.16    66.46 ± 0.40

Table 5: Average WCSR for the restricted chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.
Summary Figure 11, depicting the Friedman multiple comparison test for significance, reveals that in the Billboard restricted chord vocabulary task, the reference system does not perform significantly worse than the post-smoothing SDAE and MR-SDAE. It is also notable that in this setting the pre-smoothing MR-SDAE significantly outperforms the pre-smoothing SDAE. Similar to the restricted chord vocabulary task on the Beatles dataset, the means before smoothing on the Billboard dataset are lower than those of the reference system. However, we can still observe a better pre-smoothing mean performance for the MR-SDAE in comparison with the SDAE. Comparing mean performance after HMM smoothing, we see no significant differences.
6.6.2
Extended Chord Vocabulary
Friedman Multiple Comparison Tests In figure 12 the results for the post hoc Friedman multiple comparison test for the Billboard extended chord vocabulary task for the reference system and smoothed and unsmoothed SDAE and MR-SDAE are presented. The algorithms showed significantly different performance, with p < 0.001.
Figure 12: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).
Mean Performance Table 6 depicts the performance of the reference system and the SDAEs on the extended chord vocabulary containing major, minor, 7th and inverted chord symbols. Again, no confidence interval is reported for the reference system, since there is no random component and the results are the same over multiple runs.

System     Not smoothed    Smoothed
S-HPA      –               46.44
SDAE       46.74 ± 0.19    50.23 ± 0.32
MR-SDAE    47.77 ± 0.49    50.81 ± 0.50

Table 6: Average WCSR for the extended chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.
Summary The Friedman multiple comparison test in figure 12 again shows significantly better performance for the post-smoothing SDAE systems in comparison to the pre-smoothing performance, and also to the reference system. The MR-SDAE again seems to achieve a higher mean rank in comparison with the SDAE; however, this is not statistically significant. In terms of mean performance in WCSR, depicted in table 6, the pre-smoothing performance figures for SDAE and MR-SDAE are higher than those of the reference system. Again the MR-SDAE outperforms the SDAE in mean WCSR. The same is the case after smoothing: the MR-SDAE outperforms the SDAE slightly, and both perform better than the reference system.
6.7
Weights
In this section we visualize the weights of the input layer of the neural network trained on the Beatles dataset. Figure 13 shows an excerpt of the input layer of the neural network, with weights depicted as a grayscale image, where black denotes negative weights and white corresponds to positive weights. In figure 14 the sum of the absolute values of all weights for each FFT input is plotted. The vertical lines depict FFT bins which correspond to musically important frequencies, i.e., musical notes.
Figure 13: Excerpt of the weights of the input layer. Black denotes negative weights, and white positive.
Figure 14: Sum of absolute values for each input of the trained neural network. Vertical gray lines indicate bins of the FFT that correspond to musically relevant frequencies.
7
Discussion
7.1
Performance on the Different Datasets
All algorithms tested seem to perform better on the Beatles dataset. This seems counter-intuitive, given that the Billboard dataset contains about four times more songs, and thus we would expect the algorithms to be able to find a better estimate of the data. A possible explanation could be that it is easier to learn chords from one single artist, or group of artists, either due to a preference of certain chords, or a preference of certain types of instruments. Furthermore the distribution of chords is not uniform: there are certain chord types or even distinct chord symbols that are more common than others. The Billboard dataset contains about 29% extended chords (7th and inverted chords) compared to 15% for the Beatles dataset. These chords are more difficult to model, which is an explanation for the difference in performance for the extended chord vocabulary recognition task on both datasets. Usually any given song contains only a very limited number of chords that are used repeatedly, which limits the amount of useful information that can be extracted from training data. This distribution of chords might not be the same in training and test sets for a small dataset, which might also explain the high variance we observe performing cross-validation on the Beatles dataset in tables 3 and 4.
7.2
SDAE
In this section I evaluate the performance of the SDAE in comparison with the simplified HPA reference system. First an examination of the final performance is given, followed by the distribution of results over the different folds of the cross-validation test. A description of the effects of HMM smoothing is given hereafter, followed by an analysis of the “raw” performance without smoothing and concluding remarks.

Post-Smoothing In the experiments the smoothed SDAE significantly outperforms the reference system with state-of-the-art features on the extended chord vocabulary task. This is true for both the Beatles and the Billboard datasets. These results are important since working with bigger chord vocabularies is the direction in which the field of chord recognition is moving. On the restricted chord vocabulary the smoothed SDAE shows at least comparable performance, significantly outperforming the reference system on the Beatles dataset, and performing not significantly worse on the Billboard dataset.

Distribution for Cross-Validation We can also observe in figures 8 and 10 that the distribution of results for the cross-validation is negatively skewed in both the restricted and extended chord vocabulary tests. Thus results cluster towards higher-than-average performance. The reference system for the restricted chord vocabulary behaves inversely. It is positively skewed, and its upper outliers have better performance than those of the SDAE. The whisker plots for the extended chord vocabulary show no apparent skewness of the reference system. However, here the lower outliers of the unsmoothed SDAE perform similarly to the median of the reference system. Thus the superiority of the stacked denoising
autoencoders is reflected here as well. In all cases smoothing increases the performance of the SDAE, but does not change the skewness of the distributions much.

Effects of Smoothing We can observe that temporal smoothing increases the performance in terms of ranks and mean WCSR in all cases. However, it is not the case for all songs in the test sets that HMM smoothing is beneficial. There are cases in which HMM smoothing can lead to a decrease in WCSR in comparison to the “raw” SDAE estimation. This indicates that an HMM may not constitute a perfect temporal model of chords and supports the findings of Boulanger-Lewandowski et al. (2013), who propose a dynamic programming extension to beam search for temporal smoothing instead of Viterbi decoding. This might have to do with the way the HMM models the temporal duration of chords for the frame-wise feature computation used in this thesis, which does not match the behaviour of chords well, since they usually last a certain amount of time (according to the length of the notes).

Pre-Smoothing It is notable that the proposed SDAE system shows comparable performance even without temporal HMM smoothing for the extended chord vocabulary test. On the restricted chord vocabulary test on the Beatles dataset, the reference system yields a higher mean WCSR; however, we do not observe that it significantly outperforms the unsmoothed SDAE. In the restricted chord vocabulary test on the Billboard dataset, however, the reference system outperforms the unsmoothed SDAE significantly and yields a higher WCSR as well.

Concluding Remarks These experiments show that stacked denoising autoencoders can be applied directly to the FFT, extending Glazyrin's (2013) approach from the constant-Q transform, yielding comparable performance to a system with a conventional framework interpreting state-of-the-art features, or in the case of the extended chord vocabulary even yielding significantly better performance. Unlike Glazyrin (2013), Boulanger-Lewandowski et al. (2013) and Humphrey et al. (2012), we do not try to model PCPs or a Tonnetz as an intermediate target. The only restriction imposed on the system is the preprocessing of the FFT data: otherwise there are no restrictions on the computation of the features, in an attempt to circumvent a possible “glass ceiling” as suggested by Humphrey et al. (2012) and Cho and Bello (2014). The mean performance increases in WCSR in comparison to the simplified HPA system are about 1.12% and 0.47% (0.77 and 0.31 percentage points) for the restricted chord vocabulary. For the extended chord vocabulary, however, the improvement is 23.05% and 8.16% (11.19 and 3.79 percentage points) for the Beatles and Billboard datasets, respectively. Although we can already observe a better mean performance, one could expect a further performance increase if it were possible to use more training data, which could be realized with a memory-optimized implementation. In addition, removing the maximum number of training iterations while keeping the early stopping mechanism would also likely improve the performance at least slightly.
7.3 MR-SDAE
In the following I compare the MR-SDAE with the SDAE and the reference system. As above for the SDAE, a description of the final performance is given first, followed by a comparison of the distribution of cross-validation results on the Beatles dataset between SDAE and MR-SDAE. An examination of the pre-smoothing performance, the effects of smoothing on the performance, and concluding remarks follow.

Post-Smoothing  Like the smoothed SDAE, the smoothed MR-SDAE outperforms the reference system in all cases except the Billboard restricted chord vocabulary test. Even on that test it does not perform significantly worse and still shows a slightly better mean WCSR. Comparing the smoothed SDAE to the smoothed MR-SDAE, we always observe a slightly higher mean WCSR for the MR-SDAE, but no statistically significant difference in the Friedman multiple comparison test.

Distribution of Cross-Validation Performance  SDAE and MR-SDAE seem to have a similar distribution of results over the different folds of the cross-validation tests on the Beatles dataset. However, the MR-SDAE appears to be slightly more negatively skewed than the SDAE, especially in the pre-smoothed case.

Pre-Smoothing  In comparison to the unsmoothed SDAE, the unsmoothed MR-SDAE performs better in terms of WCSR; however, this difference is only statistically significant for the restricted chord vocabulary on the Billboard dataset.

Effects of Smoothing  In spite of the median smoothing being performed on two different temporal levels in the multi-resolution case, HMM smoothing still increases the mean performance and thus seems beneficial. However, an improved recognition rate before smoothing does not necessarily yield the same improvement in WCSR after smoothing. The improvements over the SDAE after smoothing are smaller than the pre-smoothing increase in WCSR, which is similar to what Cho and Bello (2014) observe for a single input with median smoothing. No post-smoothing improvements over the SDAE are significant in the Friedman multiple comparison test. As with the SDAE, MR-SDAE performance on some songs is diminished by temporal smoothing.

Concluding Remarks  These experiments show that taking temporal information into account, as suggested (but not shown) by Glazyrin (2013), either by adding more frames to the input or by using a recurrent layer as done by Boulanger-Lewandowski et al. (2013), is beneficial to the classification performance, at least before HMM smoothing in terms of mean WCSR. Nonetheless, using an unsmoothed version of the input together with median-smoothed ones, we find that post-classification HMM smoothing can increase the performance more than pre-classification smoothing, as Cho and Bello (2014) found for a single smoothed input in their experiments.
This also supports the findings of Peeters (2006), Khadkevich and Omologo (2009b), Mauch et al. (2008), and Cho and Bello (2014) that the use of median smoothing before further processing or classification can be beneficial. Compared to the SDAE, the "raw" pre-smoothing classification performance of the MR-SDAE, measured by the mean WCSR, increases by 2.65% and 2.91% (1.73 and 1.78 percentage points) for the restricted chord vocabulary and by 2.84% and 2.20% (1.59 and 1.03 percentage points) for the extended chord vocabulary on the Beatles and Billboard datasets, respectively. After smoothing, the performance is further increased by 0.52% (0.36 percentage points) for the Beatles restricted vocabulary and by 0.17% (0.11 percentage points) for the Billboard set. For the extended chord vocabulary the improvement is 0.49% and 1.15% (0.29 and 0.58 percentage points) for the Beatles and Billboard datasets, respectively. However, it has to be noted that the MR-SDAE is trained on less data in these experiments: for the Beatles dataset only every fourth frame instead of every third, and for the Billboard dataset only every 16th frame instead of every 8th, is used for training. Further improvements could be achieved by improving the implementation so that it can train on more frames of the dataset. It could also be evaluated whether the proposed system works equally well with a more restricted input frequency range, which would in turn allow more frames to be taken into account. Other smoothing techniques could also be evaluated, similar to the work of Boulanger-Lewandowski et al. (2013).
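To make the multi-resolution idea discussed above more concrete, the following sketch builds the input for a single frame by concatenating the raw FFT frame with two median-filtered versions computed over different temporal spans. The window lengths, helper names, and spectrogram shape are illustrative assumptions, not the exact settings used in the thesis.

```python
import numpy as np

def median_smooth(spectrogram: np.ndarray, span: int) -> np.ndarray:
    """Median-filter a (frames x bins) spectrogram along the time axis."""
    half = span // 2
    padded = np.pad(spectrogram, ((half, half), (0, 0)), mode="edge")
    return np.stack(
        [np.median(padded[t:t + span], axis=0) for t in range(len(spectrogram))]
    )

def multi_resolution_frame(spectrogram: np.ndarray, t: int,
                           spans=(9, 27)) -> np.ndarray:
    """Concatenate the raw frame with median-smoothed frames over two spans."""
    parts = [spectrogram[t]]
    parts += [median_smooth(spectrogram, s)[t] for s in spans]
    return np.concatenate(parts)

# Example: random stand-in for a preprocessed FFT magnitude spectrogram.
spec = np.abs(np.random.randn(100, 512))
x = multi_resolution_frame(spec, t=50)   # input vector for the MR-SDAE
```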
7.4 Weights
Figure 14 shows the sum of absolute weights for the input layer of the SDAEs; vertical lines highlight input bins that correspond to harmonically relevant frequencies. For the MR-SDAE the picture looks similar, replicated for the three input parts with different temporal smoothing. The network appears to emphasize the weights on musically relevant frequencies. In figure 13 we can see that, for some nodes, some frequencies have negative weights, which might indicate that these nodes block certain frequencies; thus weights at musically relevant frequencies are not positively emphasized for all nodes. In addition, the sum of the absolute input weights seems to diminish for frequencies that correspond to higher tones, as can be seen in figure 14. This might be caused by the clustering of fundamental frequencies at the lower end of the spectrum and the decaying amplitude of overtones. It resembles the PCP vector computation by Ni et al. (2012), whose sensitivity decreases for higher frequencies in accordance with human perception.
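As a rough sketch of how such a plot can be produced, the snippet below sums the absolute input-layer weights per frequency bin; the weight-matrix shape and variable names are assumptions for illustration, not the trained weights themselves.

```python
import numpy as np

# Hypothetical input-layer weight matrix of a trained SDAE:
# shape (n_hidden, n_input_bins), i.e. one row of weights per hidden node.
W_input = np.random.randn(1024, 512)

# Total absolute weight assigned to each input frequency bin (figure-14 style).
bin_importance = np.abs(W_input).sum(axis=0)

# Per-node weights (figure-13 style) can be inspected directly, e.g.:
node_0_weights = W_input[0]          # may contain negative values
print(bin_importance.argmax())       # most strongly weighted input bin
```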
7.5 Extensions
Table 7 shows results from the MIREX 2013 challenge (http://www.music-ir.org/mirex/wiki/2013:Audio_Chord_Estimation_Results_Billboard_2012) on the test set used in this thesis, for both the restricted and the extended chord vocabulary tasks that also form the basis of the experiments evaluated here. Since this specific Billboard test set was released before 2013, it is possible that
some algorithms were handed in pretrained on this part of the Billboard data, although this is unlikely. The features of the reference system used in this thesis are based on the system NMSD2 by Ni et al. (2012), highlighted in bold. Algorithms are denoted by the first letters of the authors' names, and since it is possible to hand in several algorithms, a number is added to the identifier. Some submissions were handed in multiple times, one version to be trained by MIREX and another pretrained. For algorithms submitted both pretrained and untrained we show the version trained by MIREX; for algorithms submitted only pretrained we show the best-performing version. The full version of the reference system performs about 15% (10 percentage points) better on the restricted chord vocabulary task and about 24% (12 percentage points) better on the extended task than the MR-SDAE, in terms of WCSR. This can be attributed to the extension of that system to take other musical structures into account, as described in section 2.3. It performs a beat estimation, which can align the temporal duration of the HMM states more closely to the temporal duration of chords. Another problem that arises when using a single HMM to model chord progressions is that it assumes conditional independence between chord states. This does not correspond to the chord progressions we find in the "real world"; thus, as described by Cho and Bello (2014), we have no guarantee that a single HMM can model chord progressions sufficiently accurately. However, by combining several HMMs (or by using a Bayesian network), we are able to break this conditional independence and take more structures of a musical piece into account. Ni et al. also use information about other musical structures, for example musical key estimation and bass note estimation, to improve chord recognition performance, tracked by a multi-stream HMM. Despite the relatively worse performance of the systems proposed in this thesis, it would be possible to fully integrate the proposed methods into the system developed by Ni et al. (2012). It is also in principle possible to estimate other musical qualities, such as beat and bass, with the help of stacked denoising autoencoders, although it would have to be evaluated whether they could offer a comparable performance increase. Since it was shown that the SDAE system yielded better performance in terms of mean WCSR for chord recognition and, more importantly, significantly better performance on a rank comparison level, it is to be expected that, once integrated, a system based on stacked denoising autoencoders could compete with or even outperform the system by Ni et al. (2012).
System      major minor    major minor 7th inv
NMSD2       76             63
CB4**       76             63
KO2**       76             60
PP4         70             53
CF2         72             53
PP3         73             51
MR-SDAE*    66             51
SDAE*       66             50
NG1         71             50
Table 7: MIREX 2013 results on the Billboard train and test set used for evaluation in this thesis. * denotes systems proposed in this thesis; ** denotes systems trained specifically for the MIREX evaluation; all other algorithms were submitted pretrained. The system from which the feature computation of the reference system was taken is highlighted in bold. Performance values are sorted by the extended chord vocabulary recognition task.
8 Conclusion
In this thesis I presented two deep learning approaches to chord recognition based on stacked denoising autoencoders. Both were evaluated against a basic HMM system with state-of-the-art PCP features. The first system works on a truncated version of the FFT, applying square-root compression and normalizing with the L2 norm, and outputs chord probabilities directly. Recognition performance is enhanced with the help of an HMM that performs post-classification temporal smoothing. The second algorithm uses two additional temporally subsampled versions of the input, obtained with a median filter. Again it estimates the chord probabilities directly, without any further restrictions on the computation, and the post-classification chord probabilities are again smoothed in time by an HMM. The reference system, the stacked denoising autoencoder working on the FFT, and its extension with multi-resolution input were tested extensively on the Beatles and Billboard datasets. Two chord vocabularies were used: a conventional major-minor vocabulary and an extended vocabulary containing 7th and inverted chords as proposed in MIREX 2012. Results for the Beatles dataset are reported with ten-fold cross-validation, while the Billboard train and test split of MIREX 2012 is used for the Billboard data. Post hoc Friedman tests are performed to test for statistically significant differences in performance. It is shown that the multi-resolution approach can lead to better results in mean WCSR before HMM smoothing, although these improvements become smaller after HMM smoothing. It is also shown that the SDAE and MR-SDAE achieve performance comparable to the reference system on the restricted major-minor chord recognition task and superior performance, both in mean WCSR and in the Friedman tests, on the extended chord vocabulary on both datasets. It would be possible to fully integrate the SDAE or MR-SDAE into a system similar to that of Ni et al. (2012) to take more musical information into account, potentially outperforming state-of-the-art systems.
References

Bello, J. P. and Pickens, J. (2005). A robust mid-level representation for harmonic content in music signals. In Proceedings of the International Society for Music Information Retrieval Conference, pages 304–311.
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2):252–259.
Bonada, J. (2000). Automatic technique in frequency domain for near-lossless time-scale modification of audio. In Proceedings of the International Computer Music Conference, pages 396–399.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference, pages 335–340.
Brown, J. C. (1991). Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, page 425.
Brown, J. C. and Puckette, M. S. (1992). An efficient algorithm for the calculation of a constant Q transform. Journal of the Acoustical Society of America, 92(5):2698–2701.
Burgoyne, J. A., Kereliuk, C., Pugin, L., and Fujinaga, I. (2007). A cross-validated study of modelling strategies for automatic chord recognition in audio. In Proceedings of the International Society for Music Information Retrieval Conference, pages 251–254.
Burgoyne, J. A., Wild, J., and Fujinaga, I. (2011). An expert ground-truth set for audio chord recognition and music analysis. In Proceedings of the International Conference on Music Information Retrieval, pages 633–638.
Catteau, B., Martens, J.-P., and Leman, M. (2007). A probabilistic framework for audio-based tonal key and chord recognition. In Advances in Data Analysis, pages 637–644. Springer, Berlin Heidelberg.
Cemgil, A. T., Kappen, H. J., and Barber, D. (2006). A generative model for music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):679–694.
Chen, R., Shen, W., Srinivasamurthy, A., and Chordia, P. (2012). Chord recognition using duration-explicit hidden Markov models. In Proceedings of the International Society for Music Information Retrieval Conference, pages 445–450.
Cheng, H.-T., Yang, Y.-H., Lin, Y.-C., Liao, I.-B., and Chen, H. H. (2008). Automatic chord recognition for music classification and retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1505–1508.
Cho, T. and Bello, J. P. (2011). A feature smoothing method for chord recognition using recurrence plots. In Proceedings of the International Society for Music Information Retrieval Conference, pages 651–656.
Cho, T. and Bello, J. P. (2014). On the relative importance of individual components of chord recognition systems. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(2):477–492.
Dixon, S., Mauch, M., and Anglade, A. (2011). Probabilistic and logic-based modelling of harmony. In Exploring Music Contents, pages 1–19. Springer, Berlin Heidelberg.
Dressler, K. and Streich, S. (2007). Tuning frequency estimation using circular statistics. In Proceedings of the International Conference on Music Information Retrieval, pages 357–360.
Ellis, D. P. (2007). Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60.
Fujishima, T. (1999). Realtime chord recognition of musical sound: a system using Common Lisp Music. In Proceedings of the International Computer Music Conference, pages 464–467.
Glazyrin, N. (2013). Mid-level features for audio chord estimation using stacked denoising autoencoders. Russian Summer School in Information Retrieval. http://romip.ru/russiras/doc/2013_for_participant/russirysc2013_submission_13_1.pdf [Online; accessed 12-July-2014].
Glazyrin, N. and Klepinin, A. (2012). Chord recognition using Prewitt filter and self-similarity. In Proceedings of the Sound and Music Computing Conference, pages 480–485.
Gómez, E. (2006). Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing, 18(3):294–304.
Harris, F. J. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1):51–83.
Harte, C. (2010). Towards automatic extraction of harmony information from music signals. PhD thesis, University of London.
Harte, C. and Sandler, M. (2005). Automatic chord identification using a quantised chromagram. In Proceedings of the Audio Engineering Society Convention.
Harte, C., Sandler, M., and Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the ACM Workshop on Audio and Music Computing Multimedia, pages 21–26. ACM.
Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Humphrey, E. J. and Bello, J. P. (2012). Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications, volume 2, pages 357–362.
Humphrey, E. J., Cho, T., and Bello, J. P. (2012). Learning a robust Tonnetz-space transform for automatic chord recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 453–456.
Khadkevich, M. and Omologo, M. (2009a). Phase-change based tuning for automatic chord recognition. In Proceedings of the Digital Audio Effects Conference.
Khadkevich, M. and Omologo, M. (2009b). Use of hidden Markov models and factored language models for automatic chord recognition. In Proceedings of the International Conference on Music Information Retrieval, pages 561–566.
Lee, K. (2006). Automatic chord recognition from audio using enhanced pitch class profile. In Proceedings of the International Computer Music Conference, pages 306–313.
Mauch, M. (2010). Automatic chord transcription from audio using computational models of musical context. PhD thesis, Queen Mary University of London.
Mauch, M. and Dixon, S. (2010a). Approximate note transcription for the improved identification of difficult chords. In Proceedings of the International Society for Music Information Retrieval Conference, pages 135–140.
Mauch, M. and Dixon, S. (2010b). Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1280–1289.
Mauch, M., Dixon, S., and Mary, Q. (2008). A discrete mixture model for chord labelling. In Proceedings of the International Society for Music Information Retrieval Conference, pages 45–50.
Mauch, M., Noland, K., and Dixon, S. (2009). Using musical structure to enhance automatic chord transcription. In Proceedings of the International Society for Music Information Retrieval Conference, pages 231–236.
McLachlan, G. J., Krishnan, T., and Ng, S. K. (2004). The EM algorithm. Technical report, Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE).
Ni, Y., McVicar, M., Santos-Rodriguez, R., and De Bie, T. (2012). An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1771–1783.
Ono, N., Miyamoto, K., Le Roux, J., Kameoka, H., and Sagayama, S. (2008). Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In Proceedings of the European Signal Processing Conference, pages 240–244.
Osmalsky, J., Embrechts, J.-J., Van Droogenbroeck, M., and Pierard, S. (2012). Neural networks for musical chord recognition. In Journées d'informatique musicale.
Oudre, L., Grenier, Y., and Févotte, C. (2009). Template-based chord recognition: Influence of the chord types. In Proceedings of the International Conference on Music Information Retrieval, pages 153–158.
Oudre, L., Grenier, Y., and Févotte, C. (2011). Chord recognition by fitting rescaled chroma vectors to chord templates. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2222–2233.
Palm, R. B. (2012). Prediction as a candidate for learning deep hierarchical models of data. Master's thesis, Technical University of Denmark.
Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing, pages 53–60.
Papadopoulos, H. and Peeters, G. (2008). Simultaneous estimation of chord progression and downbeats from an audio file. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 121–124.
Papadopoulos, H. and Peeters, G. (2011). Joint estimation of chords and downbeats from an audio signal. IEEE Transactions on Audio, Speech, and Language Processing, 19(1):138–152.
Pauws, S. (2004). Musical key extraction from audio. In Proceedings of the International Society for Music Information Retrieval Conference, pages 96–99.
Peeters, G. (2006). Musical key estimation of audio signal based on hidden Markov modeling of chroma vectors. In Proceedings of the International Conference on Digital Audio Effects, pages 127–131.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Reed, J., Ueda, Y., Siniscalchi, S. M., Uchiyama, Y., Sagayama, S., and Lee, C.-H. (2009). Minimum classification error training to improve isolated chord recognition. In Proceedings of the International Conference on Music Information Retrieval, pages 609–614.
Scholz, R., Vincent, E., and Bimbot, F. (2009). Robust modeling of musical chord sequences using probabilistic n-grams. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 53–56.
Sheh, A. and Ellis, D. P. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the International Society for Music Information Retrieval Conference, pages 185–191.
Shepard, R. N. (1964). Circularity in judgments of relative pitch. Journal of the Acoustical Society of America, 36:2346.
Sikora, F. (2003). Neue Jazz-Harmonielehre. Schott Musik International, Mainz, 3rd edition.
Su, B. and Jeng, S.-K. (2001). Multi-timbre chord classification using wavelet transform and self-organized map neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3377–3380.
Talbot-Smith, M. (2001). Audio Engineer's Reference Book. Taylor & Francis, Oxford.
Tang, Y. and Mohamed, A.-r. (2012). Multiresolution deep belief networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1203–1211.
Ueda, Y., Uchiyama, Y., Nishimoto, T., Ono, N., and Sagayama, S. (2010). HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5518–5521.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the International Conference on Machine Learning, pages 1096–1103.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 9999:3371–3408.
Wakefield, G. H. (1999). Mathematical representation of joint time-chroma distributions. In Proceedings of SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, pages 637–645.
Weil, J., Sikora, T., Durrieu, J.-L., and Richard, G. (2009). Automatic generation of lead sheets from polyphonic music signals. In Proceedings of the International Society for Music Information Retrieval Conference, pages 603–608.
Weller, A., Ellis, D., and Jebara, T. (2009). Structured prediction models for chord transcription of music audio. In Proceedings of the International Conference on Machine Learning and Applications, pages 590–595.
Zhang, X. and Gerhard, D. (2008). Chord recognition using instrument voicing constraints. In Proceedings of the International Society for Music Information Retrieval Conference, pages 33–38.
A Joint Optimization
In the following, we describe the application to chord recognition of a joint neural network-HMM optimization approach proposed by Bengio et al. (1992) for speech recognition. This was the original goal of this thesis, but it did not yield any improvement beyond the basic initialization of the components.
A.1 Basic System Outline
The system consists of two main components:
1. A continuous-density HMM, which models the temporal correlation of chord progressions and performs the final classification.
2. A neural network (initialized as a stacked denoising autoencoder) with softmax activation, which is trained to approximate "perfect" normalized PCPs computed from the ground-truth chord symbols.
Both are trained separately at first: the neural network on precomputed training data, and the HMM on the basis of the neural network output for the emission probabilities and the ground-truth chord data for the transition probabilities. After this, a joint optimization is performed, based on the gradient of the HMM with respect to a global optimization criterion (maximum likelihood), and the neural network's weights are adjusted. In turn, the emission probabilities of the HMM are updated on the basis of the new neural network output on the training data, until the system does not improve further.
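To make the neural network's training target concrete, the sketch below builds a normalized "perfect" PCP vector from a ground-truth chord label. The chord-label parsing and the simple triad templates are illustrative assumptions, not the exact target definition used in the thesis.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def perfect_pcp(chord_label: str) -> np.ndarray:
    """Binary PCP for a major/minor triad, L2-normalized (illustrative only)."""
    pcp = np.zeros(12)
    if chord_label == "N":                 # non-chord symbol
        return pcp
    root_name, _, quality = chord_label.partition(":")
    root = PITCH_CLASSES.index(root_name)
    third = 3 if quality == "min" else 4   # minor vs. major third
    for interval in (0, third, 7):         # root, third, fifth
        pcp[(root + interval) % 12] = 1.0
    norm = np.linalg.norm(pcp)
    return pcp / norm if norm > 0 else pcp

print(perfect_pcp("A:min"))   # ones at A, C and E, scaled to unit length
```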
A.2 Gradient of the Hidden Markov Model
We define the emission probability $b_t$ of the HMM as follows:
$$b_t = P(Y_t \mid S_t), \tag{27}$$
the probability of emitting the neural network output $Y_t$ in state $S_t$ at time $t$ according to the state sequence determined by the training data. The joint probability of the state and observation sequences is defined as:
$$P(q_1, q_2, \ldots, q_T, O_1, O_2, \ldots, O_T \mid \lambda) = \pi_1 b_1 \prod_{t=2}^{T} b_t\, a_{t-1,t}, \tag{28}$$
with $\pi_1$ being the initial state probability, $b_t$ the emission probability as stated in equation (27), and $a_{t-1,t}$ the transition probability from state $S_{t-1}$ to $S_t$, where $t$ indicates the time step; $q_t$ is the state at time step $t$, $\lambda$ denotes the parameters of the HMM, and $O_t$ is the observation at time $t$. We want to maximize the log-likelihood of the model according to the following optimization criterion:
$$C = \log\!\left(\pi_1 b_1 \prod_{t=2}^{T} b_t\, a_{t-1,t}\right), \tag{29}$$
similar to Bengio et al. (1992). Since the transition probabilities are fixed by the provided ground truth, we take the partial derivative with respect to $b_t$:
$$\frac{\partial C}{\partial b_t} = \frac{\partial}{\partial b_t} \log\!\left(\pi_1 b_1 \prod_{t=2}^{T} b_t\, a_{t-1,t}\right). \tag{30}$$
We rewrite the logarithm of the product as a sum of logarithms. Since the derivative with respect to $b_t$ does not affect the initial state probability distribution, the transition probabilities, or the emission probabilities of the other states, these terms can be dropped, leaving us with:
$$\frac{\partial C}{\partial b_t} = \frac{\partial \log(b_t)}{\partial b_t} = \frac{1}{b_t}. \tag{31}$$
Since we are using a continuous-density HMM, the emission probability $b_{i,t}$ can be represented as a mixture of Gaussians, as described in Bengio et al. (1992):
$$b_{i,t} = \sum_k \frac{Z_k}{\sqrt{(2\pi)^n\,|\Sigma_k|}} \exp\!\left(-\tfrac{1}{2}\,(Y_t - \mu_k)\,\Sigma_k^{-1}\,(Y_t - \mu_k)^\top\right), \tag{32}$$
where $n$ is the number of Gaussian components per state of the HMM, and $Z_k$, $\mu_k$ and $\Sigma_k$ are the gain (or mixture weight), mean and covariance matrix of Gaussian component $k$, respectively.
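As a small numerical illustration of the mixture-of-Gaussians emission in equation (32), the sketch below evaluates such an emission probability for one output vector; the dimensionality and parameter values are arbitrary placeholders, not trained HMM parameters.

```python
import numpy as np

def gmm_emission(y, gains, means, covs):
    """Mixture-of-Gaussians emission probability of output vector y (eq. 32)."""
    dim = y.shape[0]
    b = 0.0
    for Z_k, mu_k, Sigma_k in zip(gains, means, covs):
        diff = y - mu_k
        norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(Sigma_k))
        expo = -0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
        b += Z_k / norm * np.exp(expo)
    return b

# Two-component mixture over a 12-dimensional output (placeholder parameters).
rng = np.random.default_rng(0)
gains = np.array([0.6, 0.4])
means = rng.normal(size=(2, 12))
covs = np.stack([np.eye(12), 2.0 * np.eye(12)])
print(gmm_emission(rng.normal(size=12), gains, means, covs))
```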
A.3 Adjusting Neural Network Parameters
Since we are aiming to change the neural network parameters according to the HMM optimization gradient, we need to adjust them as described in Bengio et al. (1992). Using the chain rule, we take the partial derivative of the optimization criterion $C$ with respect to the neural network output $Y_{j,t}$, the $j$th component of the output at time $t$:
$$\frac{\partial C}{\partial Y_{j,t}} = \frac{\partial C}{\partial b_{i,t}}\,\frac{\partial b_{i,t}}{\partial Y_{j,t}}, \tag{33}$$
where $\frac{\partial b_{i,t}}{\partial Y_{j,t}}$ follows by differentiating equation (32) and can be written as:
$$\frac{\partial b_{i,t}}{\partial Y_{j,t}} = \sum_k \frac{Z_k}{\sqrt{(2\pi)^n\,|\Sigma_k|}} \sum_l d_{k,lj}\,(\mu_{kl} - Y_{l,t}) \exp\!\left(-\tfrac{1}{2}\,(Y_t - \mu_k)\,\Sigma_k^{-1}\,(Y_t - \mu_k)^\top\right), \tag{34}$$
where $d_{k,lj}$ is element $(l,j)$ of the inverse covariance matrix $\Sigma_k^{-1}$ for the $k$th Gaussian distribution and $\mu_{kl}$ is the $l$th element of the $k$th Gaussian mean vector $\mu_k$.
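A minimal numerical sketch of equation (34) follows, in the same placeholder style as the previous snippet; the mixture parameters, dimensions and variable names are illustrative assumptions only.

```python
import numpy as np

def emission_gradient(y, gains, means, covs):
    """d b_{i,t} / d Y_{j,t} as in eq. (34): one partial derivative per j."""
    dim = y.shape[0]
    grad = np.zeros(dim)
    for Z_k, mu_k, Sigma_k in zip(gains, means, covs):
        inv = np.linalg.inv(Sigma_k)          # entries d_{k,lj}
        diff = y - mu_k
        norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(Sigma_k))
        expo = np.exp(-0.5 * diff @ inv @ diff)
        # sum over l of d_{k,lj} * (mu_{kl} - Y_{l,t}), for every j at once
        grad += Z_k / norm * (inv.T @ (mu_k - y)) * expo
    return grad

rng = np.random.default_rng(1)
gains = np.array([0.6, 0.4])
means = rng.normal(size=(2, 12))
covs = np.stack([np.eye(12), 2.0 * np.eye(12)])
print(emission_gradient(rng.normal(size=12), gains, means, covs))
```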
A.4 Updating HMM Parameters
Rabiner (1989) provides the standard methods to update continuous-density HMMs. We can update the gain $Z_{jk}$ for state $j$ and component $k$ as follows:
$$Z'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}. \tag{35}$$
The mean $\mu_{jk}$ for state $j$ and component $k$ can be computed with:
$$\mu'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, Y_t}{\sum_{t=1}^{T} \gamma_t(j,k)}, \tag{36}$$
where $Y_t$, the observation, is the neural network output at time $t$. The covariance $\Sigma_{jk}$ for state $j$ and component $k$ can be computed with:
$$\Sigma'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(Y_t - \mu_{jk})(Y_t - \mu_{jk})^\top}{\sum_{t=1}^{T} \gamma_t(j,k)}, \tag{37}$$
with $\gamma_t(j,k)$ describing the probability of being in state $j$ at time $t$ with the $k$th Gaussian mixture component:
$$\gamma_t(j,k) = \delta_{tj}\,\frac{Z_{jk}\,\mathcal{N}(Y_t, \mu_{jk}, \Sigma_{jk})}{\sum_m Z_{jm}\,\mathcal{N}(Y_t, \mu_{jm}, \Sigma_{jm})}, \tag{38}$$
where the term $\delta_{tj}$ is 1 if $j$ is equal to the state in our ground-truth data at time $t$ and 0 otherwise.
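The sketch below implements these re-estimation formulas for one state, given per-frame responsibilities $\gamma_t(j,k)$; the observations and responsibilities are random placeholders, and the code follows equations (35)–(37) as reconstructed above rather than the thesis implementation itself.

```python
import numpy as np

def reestimate_state(Y, gamma):
    """Update gains, means and covariances of one state's Gaussian mixture.

    Y     : (T, dim) neural-network outputs (observations).
    gamma : (T, M) responsibilities gamma_t(j, k) for this state j.
    """
    T, dim = Y.shape
    M = gamma.shape[1]
    denom = gamma.sum(axis=0)                    # sum_t gamma_t(j, k)
    Z = denom / denom.sum()                      # eq. (35)
    mu = (gamma.T @ Y) / denom[:, None]          # eq. (36)
    Sigma = np.zeros((M, dim, dim))
    for k in range(M):                           # eq. (37)
        diff = Y - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / denom[k]
    return Z, mu, Sigma

rng = np.random.default_rng(2)
Y = rng.normal(size=(200, 12))
gamma = rng.random((200, 2))
Z, mu, Sigma = reestimate_state(Y, gamma)
print(Z, mu.shape, Sigma.shape)
```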
A.5 Neural Network
The neural network can be pretrained as a stacked denoising autoencoder directly on a preprocessed excerpt of the FFT as described in section 5.2.1. To approximate “perfect” PCPs computed from the ground truth, we add an additional softmax output layer and finetune with backpropagation.
A.6 Hidden Markov Model
In the implemented system we estimate only major, minor and non-chord symbols, leaving us with 25 possible symbols. The emission probabilities of each state in the HMM are modelled by a mixture of two Gaussians. The HMM in turn is trained on the output of the pretrained neural network, applying the expectation-maximization algorithm to estimate the parameters of the Gaussian mixtures.
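One possible way to set up such a 25-state HMM with two-component Gaussian-mixture emissions is sketched below using the third-party hmmlearn package; this is only an illustration of the configuration, not the thesis implementation, and the unsupervised fit shown here differs from the thesis setup, where transition probabilities are fixed from ground-truth chord sequences. The input data are random placeholders.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party package, assumed available

# Placeholder "neural network outputs": 12-dimensional PCP-like vectors for
# two songs of 300 and 200 frames each.
rng = np.random.default_rng(3)
X = rng.random((500, 12))
lengths = [300, 200]

# 25 states (12 major + 12 minor + non-chord), 2 Gaussians per state;
# fit() estimates the mixture parameters with expectation maximization.
hmm = GMMHMM(n_components=25, n_mix=2, covariance_type="diag", n_iter=20)
hmm.fit(X, lengths)

state_path = hmm.predict(X, lengths)   # Viterbi decoding of chord states
print(state_path[:10])
```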
A.7 Combined Training
For the joint optimization of the neural network and the HMM we iteratively adjust the neural network weights according to the HMM gradient for the global optimization criterion (as described above). After the neural network weights are adjusted, we update the HMM with the methods described in section A.4. After each alternation of neural network weight adjustment and HMM update, a test is performed, and in theory the training is completed when the change in performance of the system falls below a previously specified threshold. A minimal sketch of this alternation is given below.
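The following skeleton shows only the control flow of the alternating optimization; the update and evaluation functions are trivial placeholders standing in for the actual gradient step of section A.3, the HMM re-estimation of section A.4, and the WCSR evaluation, and all sizes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def adjust_network_weights(weights, hmm_params, data):
    """Placeholder for one epoch of backpropagating dC/dY (sections A.2, A.3)."""
    return weights - 0.01 * rng.normal(size=weights.shape)

def update_hmm(hmm_params, weights, data):
    """Placeholder for re-estimating Z, mu and Sigma (section A.4)."""
    return dict(hmm_params)

def validation_wcsr(weights, hmm_params):
    """Placeholder validation score; the thesis uses WCSR on held-out songs."""
    return rng.random()

weights = rng.normal(size=(1024, 512))
hmm_params = {"Z": np.full((25, 2), 0.5)}
data = None
threshold, previous, epoch = 1e-3, -np.inf, 0

while True:
    weights = adjust_network_weights(weights, hmm_params, data)   # NN step
    hmm_params = update_hmm(hmm_params, weights, data)            # HMM step
    score = validation_wcsr(weights, hmm_params)                  # test step
    epoch += 1
    if abs(score - previous) < threshold or epoch >= 20:
        break
    previous = score
```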
A.8 Joint Optimization
For testing purposes we train the initialized neural network-HMM hybrid on an excerpt of the Beatles dataset to recognize the restricted set of major and minor chord symbols. We plot the training error that is backpropagated in the joint optimization, described in section A.3, and the performance on a validation set that is kept separate from the training samples. Since the SDAE is trained to model PCP vectors in this case, the joint optimization produces twelve distinct training errors per training sample (or per batch of such vectors for batch training). In figure 15, we plot the sum of the averaged absolute training errors for each output over one epoch of neural network training. Figure 16 depicts the performance on the validation set after each learning epoch of the joint optimization, reported as the percentage of accurately estimated chord symbol segments according to the WCSR. The figures cover 20 epochs (training iterations).
Figure 15: Average absolute backpropagated training error on the training set per epoch (x-axis: epochs; y-axis: average of absolute training error).
Figure 16: Classification performance on the validation set after each training epoch (x-axis: epochs; y-axis: WCSR in %).
A.9 Possible Interpretation of the Joint Optimization
In figures 15 and 16 we see the training error and the performance on the validation set for 20 iterations of joint training, after an initial pretraining phase. The average training error decreases with each iteration of joint optimization. This suggests that the neural network-HMM hybrid is learning according to the error measure defined above: the gradient with respect to the overall likelihood of the HMM is decreasing. However, as figure 16 shows, the performance on the validation set decreases as well. In other words, despite the likelihood of the HMM being maximized, the performance deteriorates; these symptoms are similar to those of overfitting. Further experiments were conducted in which the Euclidean distance to the aggregated weighted means of the Gaussian mixture model for the emission of the respective state was backpropagated, but to no avail. I conclude that this method maximizes the likelihood according to the defined criterion, but is unable to improve performance further after the initialization of the Gaussian-mixture-emission HMM and the stacked denoising autoencoders.