AUDIO ONSET DETECTION USING MACHINE LEARNING TECHNIQUES: THE EFFECT AND APPLICABILITY OF KEY AND TEMPO INFORMATION
CHING-HUA CHUAN Department of Computer Science, University of Southern California Viterbi School of Engineering, Los Angeles, California, USA
[email protected]
ELAINE CHEW Epstein Department of Industrial and Systems Engineering Hsieh Department of Electrical Engineering University of Southern California Viterbi School of Engineering, Los Angeles, California, USA
[email protected]
This paper explores the effect of musical context, namely key and tempo, on audio onset detection using machine learning techniques, with a focus on the changes in performance caused by mismatched key and tempo between training and test pieces, and the potential benefits of incorporating such musical information. We extract frequency energy information from audio as the input instance attributes for the machine learning techniques. The audio is synthesized from MIDI files, which provide exact onset times. We test two state-of-the-art machine learning algorithms, Support Vector Machines and Neural Networks, for their learning and classification of audio onsets. In the first experiment, testing is performed on the training piece, transposed to different keys and time-stretched to various tempi. The results show that audio onset detection performs significantly better when the key and tempo of the test and training sets concur than when they differ. We propose several ways to incorporate key and tempo in an onset detection system. In the second experiment, we use J. S. Bach's 24 Preludes from the Well-Tempered Clavier Book 1. The inclusion of tempo information improves the results on average. The positive performance changes support the usefulness of tempo and key knowledge in the design of onset detection systems, or in the prescription of confidence statistics to onset detection outcomes.
Keywords: Audio Onset Detection; Support Vector Machines; Neural Networks; Key; Tempo.
1. Motivation
The objective of this paper is to explore the effect of high-level musical context on audio onset detection, and to investigate the effectiveness of machine learning techniques for audio onset detection. Audio onset detection is the problem of identifying the time at which a note or a pitch is sounded in a music excerpt, from an audio recording. The input is some music audio signal, usually in wave format; the
music can be either monophonic or polyphonic, in any tempo, and sounded using any combination of instruments. Obtaining onset information from music audio is essential for applications such as automatic transcription, music phrasing structure analysis, music similarity assessment, expressive performance analysis, and music information retrieval.

Several challenges exist in audio onset detection. Audio signals can be noisy, and the amount of data gathered from the time and frequency domains can be overwhelming. A note onset does not always result in a distinct change in the time and frequency domains; for example, a pitch with the same frequency may be played repeatedly, or two notes may be played continuously without silence in between. Tempi may vary significantly from piece to piece, sometimes even within one piece; the tempo variation can cause difficulties in assigning an appropriate analysis window size. Notes may sound alone or simultaneously in chords (multiple notes), increasing the diversity of data manifestations. Finally, the timbre quality of the instrumentation can produce unpredictable harmonic series in the audio signals.

The use of machine learning techniques in audio onset detection allows us to build complex models from audio features. Machine learning has been proven to be an effective technique for handling large amounts of data in data mining. In addition to its ability to process data in higher-order dimensions, machine learning techniques can automatically generate decision boundaries appropriate for the training samples. The audio features and the music's structure interact to affect system performance; if the test samples vary too much from the training set in particular musical features, the test results could be far from those for the training set. In order to ensure the machine learning techniques' high performance, we need to better understand the effect of these musical structures, so as to account for them in the system.

In this paper, we focus on the effect of musical context on audio onset detection using machine learning techniques. We explore the importance and usefulness of musical knowledge by studying the performance differences due to the music structure disparity between training and test samples. We begin with three questions: which high-level musical knowledge parameters affect onset detection performance; how much does the performance change when the test samples differ from the training set in these parameters; and how can we use this knowledge to improve the performance of the system.

Our study focuses on two musical features, key and tempo. Key refers to the pitch context for the music excerpt; it provides information on the reference pitch for all other pitches in the segment, and the hierarchical organization of the pitches in relation to one another. Tempo refers to the speed of the beats in the music segment; the beats are the periodic pulses generated by the note material in the music.

The approach of audio onset detection using machine learning techniques often consists of three steps: audio signal processing, data representation, and data modeling for prediction. In the first step, we use the Fast Fourier Transform (FFT), a conventional signal processing tool. In step two, data representation, we select
audio features that have the potential to be affected by musical context to represent the audio signals, in this case, the frequency (musical pitch) information. For comparison, we use several different data representations: one agnostic to musical knowledge, and others containing varying degrees of musical information. In step three, data modeling, we apply two machine learning algorithms for classification: Support Vector Machines (SVM) and Neural Networks (NN).

We conduct two experiments to study the effect of key and tempo on onset detection. The first experiment, which we call the Effect Experiment, is designed to test the effect of mismatched key and tempo on system performance. In this first experiment, we train the system on one piece, and test the system on the same piece, transposed to different keys, and at different tempi, to observe the performance changes introduced by the mismatched keys and tempi. The second experiment, which we call the Applicability Experiment, focuses on the applicability of key and tempo information in an onset detection system for a more general data set. In this second experiment, we choose a data set containing a wide variety of keys and tempi, leave one piece out as the test sample, and use the rest for training. We apply different levels of key and tempo knowledge to the system to see if such information can improve the performance by reducing the diversity in musical context. The paper reports the average correct rate, precision and recall of the onset detection results.
1.1. Background
Audio onset detection has attracted much research attention in recent years. This task is fundamental for advancing music research in areas such as beat tracking, audio transcription, and performance analysis. However, many of these application systems are limited by the lack of a robust audio onset detection system.

There are two basic approaches to audio onset detection. The first approach uses purely signal processing techniques, and aims to find one or more features, or equations, that best describe the onset event. Researchers who focus on audio signal processing have proposed various audio features for audio onset detection. Candidate features include amplitude envelope [15, 20], energy [19, 9], phase [1], and their combinations [10, 2]. Issues such as multi-band filtering [4, 15, 12, 17], peak picking [3, 5], and system evaluation [6, 8] have been widely discussed. Most of these proposed systems share similar processing steps, with different parameter adjustments. The most effective approach to audio onset detection through feature description is still a topic of debate.

The second approach applies machine learning techniques to construct the best decision boundary from a given data set. The machine learning techniques are usually applied to the extracted audio features to predict the patterns of note onsets in a higher-dimensional domain. A feed-forward neural network (FNN) is used for piano onset detection in [18], and multiple networks trained with different hyper-parameters are constructed in [16]. SVMs are used for binary onset classification
[14], as well as percussion instrument recognition [11, 21]. Other machine learning methods, such as k-Nearest Neighbor [13] and boosting [7], have been proposed for the onset detection problem. Most studies demonstrate system performance by showing average statistics on a particular set of samples, but no analysis has been performed on the effect of musical characteristics (except timbre) of the data set on onset detection.

In this paper, we investigate the learning capabilities of two machine learning algorithms, and seek to determine the musical context parameters that affect their performance. We choose two common machine learning techniques, SVMs and NNs, for our investigation. In the Effect Experiment, we explore the effect of key and tempo on the onset detection system, showing how these musical structures affect system performance through the use of mismatched training and test samples. We propose several ways to incorporate key and tempo information in the machine learning techniques. The methods are tested, in the Applicability Experiment, on a data set containing a wide variety of keys and tempi in order to evaluate their usefulness.

The remainder of the paper is organized as follows: Section 2 describes the construction of the system, the details of each stage, including data preparation and audio feature extraction, and the methods for incorporating key and tempo information in the system. Section 3 describes the experiments and results: Section 3.1 presents the experiment designed to evaluate the effect of key and tempo on the system, while Section 3.2 shows the evaluations of the usefulness of such information on a broader test set. Section 4 discusses the conclusions of the paper.

2. System Description
Two types of systems are invoked in this paper to test the effect and applicability of musical context, namely, key and tempo, on audio onset detection. In the Effect Experiment, shown in Figure 1, the system is trained on one piece and tested on the same piece, transposed to different keys, and time-stretched to different tempi. In the Applicability Experiment, shown in Figure 2, the system is tested on one piece from the music database, while the system is trained on the remaining pieces in the database.

Each audio onset detection system consists of two sub-systems. The first sub-system, shown in the upper dashed boxes in Figures 1 and 2, is responsible for learning the hypotheses underlying the training data. The second sub-system, in the lower dashed boxes in the two figures, applies the learned classifier to predict the results for the incoming test data. The input audio is spliced into time chunks of equal length, each of which is regarded as an input instance for the machine learning algorithm and the classifier. The instance, labeled X and X′ in Figures 1 and 2, consists of several attributes corresponding to audio features, such as amplitude and frequency. For each instance, the system learns or predicts whether the instance contains a note onset via the f(X) and f(X′) functions shown in Figures 1 and 2.
Fig. 1. System diagram for the Effect Experiment: testing the effect of key/tempo on audio onset detection using machine learning.
Fig. 2. System diagram for the Applicability Experiment: verifying the applicability of key/tempo information to audio onset detection using machine learning.
The evaluation answer, the ground truth, given by Y and Y′ in Figures 1 and 2, is provided by the corresponding MIDI files.

2.1. Data preparation
We collect training and test pieces from the classical music archives website, www.classicalarchives.com, which contains a collection of western classical music pieces across a wide span of stylistic periods, in MIDI and/or audio recording formats. In order to perform inductive learning, and to evaluate the results with ease, we start with MIDI samples that contain exact onset time information. All MIDI samples are re-synthesized with acoustic grand piano sounds. We use the Winamp
software, with an 11.025 kHz sampling rate, to render the MIDI files into audio (wave format). By limiting the timbre variety of the output audio, we study only the effects of musical context on audio onset detection.
2.2. Audio feature extraction
We process the synthesized audio files using a half-overlapped sliding window. The size of the window is 1024 samples, which corresponds to approximately 92.9 ms at the sampling rate of 11.025 kHz. The half-overlapped sliding window provides a detection precision of up to 46.45 ms. We use the FFT implemented in Matlab to extract the signal intensity and frequency energy of the audio wave at each time chunk.

In order to generate the signal intensity of the audio wave, we first splice the audio wave into several time chunks by applying half-overlapped Hanning windows of size 1024 samples. Then, we use a Gaussian curve to smooth the amplitude. The signal intensity at each time chunk is computed as the sum of squared amplitudes of the samples in the window. Specifically, the signal intensity at time chunk t is calculated as:

SI(t) = \frac{1}{N} \sum_{n=N(t-\frac{1}{2})}^{N(t+\frac{1}{2})} \sum_{m=-N/2}^{N/2} \left[ x(n+m) \times g(m) \right]^2, \qquad (1)
where x(n) is the amplitude of sample n, N (= 1024) is the Hanning window size, and g(m) is the standard Gaussian smoothing function with zero mean and standard deviation half the window size:

g(m) = \frac{1}{\sqrt{2\pi}} \left( \frac{N}{2} \right)^{-1} \exp\left( \frac{-2m^2}{N^2} \right). \qquad (2)
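As a concrete illustration, the following NumPy sketch transcribes Eqs. (1) and (2) literally. The function name and looping structure are ours, not the paper's (whose implementation is in Matlab), and the double sum is left unoptimized for clarity:

```python
import numpy as np

def signal_intensity(x, N=1024):
    # Gaussian smoothing curve of Eq. (2): zero mean, standard deviation N/2.
    m = np.arange(-N // 2, N // 2 + 1)
    g = (1.0 / np.sqrt(2.0 * np.pi)) * (2.0 / N) * np.exp(-2.0 * m**2 / N**2)
    intensities = []
    t = 1
    # Chunk t covers samples N(t - 1/2) .. N(t + 1/2), as in Eq. (1).
    while N * (t + 0.5) + N // 2 <= len(x):
        n = np.arange(int(N * (t - 0.5)), int(N * (t + 0.5)))
        seg = x[n[:, None] + m[None, :]]            # x(n + m) for all n, m pairs
        intensities.append(np.sum((seg * g) ** 2) / N)
        t += 1
    return np.array(intensities)
```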
We use the FFT to extract the frequency information of the audio wave. The frequency energy at j Hz at time t, FE_t(j), is calculated by multiplying the squared frequency response magnitude, fy_t(j), with the frequency, fx_t(j) = j, and normalizing by the maximum of all such products:

FE_t(j) = \frac{fy_t(j)^2 \times fx_t(j)}{\max_i \{ fy_t(i)^2 \times fx_t(i) \}}. \qquad (3)
The weighting by fx_t(i) results in the high frequency content (HFC) function proposed by Masri [19], which gives rise to sharp peaks during onsets.

In order to set the detection precision to half steps (semitones), we generate the frequency energy for each of the 88 pitches from A0 to C8, based on a reference frequency of A4 = 440 Hz. The frequency energy of pitch k at time t, PE_t(k), is defined as the sum of the frequency energies over all frequencies that map to pitch k:

PE_t(k) = \sum_{j \in \{ j : p(fx_t(j)) = k \}} FE_t(fx_t(j)), \qquad (4)

p(fx_t(j)) = 12 \times \log_2 \frac{fx_t(j)}{440} + 49. \qquad (5)
Figure 3 illustrates the process of audio feature extraction, with the goal of generating frequency energies corresponding to the 88 pitches on a piano keyboard.
Fig. 3. Audio feature extraction corresponding to 88 pitches.
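A minimal NumPy sketch of this mapping, under our assumptions (an rfft of the Hanning-windowed frame, and rounding each bin's p(fx_t(j)) to the nearest key number; all names are ours, not the paper's):

```python
import numpy as np

def pitch_energies(frame, fs=11025, n_fft=1024):
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    fy = np.abs(spectrum)                    # magnitude response fy_t(j)
    fx = np.arange(len(fy)) * fs / n_fft     # bin frequencies fx_t(j) in Hz
    hfc = fy**2 * fx                         # HFC weighting of Eq. (3)
    fe = hfc / hfc.max() if hfc.max() > 0 else hfc   # normalized frequency energy
    pe = np.zeros(88)
    nonzero = fx > 0
    keys = np.round(12 * np.log2(fx[nonzero] / 440.0) + 49).astype(int)  # Eq. (5)
    for k, e in zip(keys, fe[nonzero]):
        if 1 <= k <= 88:                     # keep pitches A0 (k = 1) .. C8 (k = 88)
            pe[k - 1] += e                   # Eq. (4): sum energies per pitch
    return pe
```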
2.3. Basic representation of input attributes
This section describes how we pre-process the training and test data. Each instance is represented by 88 attributes and 1 class value. The 88 attributes represent frequency energy deviations. Each energy deviation is given by the difference between the current frequency energy and the average of the energies in the previous and the next chunks. The i-th attribute A_{t,i} of the input signal at the t-th window is generated by the equation:

A_{t,i} = PE_t(i) - \frac{1}{2} \left[ PE_{t-1}(i) + PE_{t+1}(i) \right], \quad i = 1, \ldots, 88. \qquad (6)
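In array form, Eq. (6) reduces to one vectorized line; this sketch (our naming) assumes PE is a T x 88 matrix of pitch energies and leaves the boundary chunks at zero:

```python
import numpy as np

def deviation_attributes(PE):
    A = np.zeros_like(PE)
    # Eq. (6): current energy minus the mean of the previous and next chunks.
    A[1:-1] = PE[1:-1] - 0.5 * (PE[:-2] + PE[2:])
    return A
```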
These 88 attributes are used in the basic representation, where the system does not consider any key or tempo information. The 88 attributes correspond to the fixed frequency ranges of the 88 pitches, from A0 to C8. In this paper, we use a binary representation (onset or non-onset) as the class attribute for classification.

2.4. Key information incorporation
In order to account for the key information of the input piece, the attributes are re-arranged according to the key. The two types of representations used to incorporate key information are described in the following sub-sections; a sketch of both appears after Table 1.

2.4.1. Key77 representation
In the first type of representation, called Key77, 77 attributes are used instead of the basic representation's 88. The attributes are shifted, according to the key, so that the {1st, 13th, 25th, ..., 73rd} attributes are related to the tonic, the most stable pitch in the key. For example, the attributes at the t-th window {A^{Key77}_{t,1}, A^{Key77}_{t,13}, A^{Key77}_{t,25}, ..., A^{Key77}_{t,73}} of a piece in C major are used to present the information related to the pitches {C1, C2, C3, ..., C7}. Similarly, attributes {A^{Key77}_{t,2}, A^{Key77}_{t,14}, A^{Key77}_{t,26}, ..., A^{Key77}_{t,74}} correspond to the frequency energy deviations of the pitches {C#1, C#2, ..., C#7}. In this way, the i-th attribute is defined by the key, and not the absolute frequency.

2.4.2. Key12 representation
In the second type of representation, called Key12, only 12 attributes are used. The 12 attributes correspond to the 12 pitch classes, where the first attribute is related to the tonic. The value of each attribute is calculated as the sum of the attributes corresponding to the same pitch name in the basic 88-attribute representation. For example, the first attribute at the t-th window, A^{Key12}_{t,1}, in C major is generated by the sum of the attributes corresponding to the pitches {C1, C2, C3, ..., C8} in the basic representation. The 12 attributes are normalized by their maximum:

A^{Key12}_{t,1} = \frac{1}{\max_j \{ A^{Key12}_{t,j} \}} \sum_{k=0}^{7} A_{t,12k+1}. \qquad (7)

Table 1 summarizes the meanings of the first attributes in the Basic, Key77, and Key12 representations.
Table 1. Meanings of the first attributes in the Basic, Key77, and Key12 representations.

Type    Meaning of the first attribute
Basic   Frequency energy deviation in the fixed frequency range corresponding to A0
Key77   Energy deviation of the lowest pitch corresponding to the tonic of the key
Key12   Total energy deviation of the pitch class corresponding to the tonic of the key
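The following sketch shows one way to realize both re-arrangements, under our reading of Section 2.4 (0-based arrays; tonic_pc is 0 for C, 1 for C#, ..., 11 for B; the helper names are ours, not the authors'):

```python
import numpy as np

def key77(A_t, tonic_pc):
    # Basic attribute i (0-based) holds the pitch A0 + i semitones, so its
    # pitch class is (i + 9) % 12 (A0 has pitch class 9). Start at the lowest
    # attribute whose pitch class is the tonic.
    start = (tonic_pc - 9) % 12
    return A_t[start:start + 77]     # 77 attributes, tonic at indices 0, 12, ...

def key12(A_t, tonic_pc):
    pcs = (np.arange(88) + 9) % 12   # pitch class of each basic attribute
    # Sum the attributes per pitch class, tonic first, then normalize by the
    # maximum as in Eq. (7).
    sums = np.array([A_t[pcs == (tonic_pc + r) % 12].sum() for r in range(12)])
    return sums / sums.max()
```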
2.5. Tempo information incorporation
Now, we turn our attention to the time structure of the music data, in particular the tempo. We demonstrate why tempo is important in onset detection through a quick example. Assume that the shortest duration between any two notes is x; this shortest duration is directly related to the tempo. In order to extract an onset clearly, the window size must be less than x, otherwise one window could contain two onsets. The shift-forward time between two adjacent windows needs to be x/n, where n is an integer, in order to align well with all the onsets in the piece. Figure 4 depicts the relations between the shortest duration and the window size and shift.
Fig. 4. The window size and shift-forward time in relation to audio onsets.
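A small sketch of this rule under our assumptions (the tempo given as tatums per minute, as in Section 3.1.2, and a half-overlap of n = 2; the function is ours):

```python
def window_from_tempo(tpm, fs=11025, n=2):
    x_sec = 60.0 / tpm        # shortest inter-onset duration x, in seconds
    win = int(fs * x_sec)     # window size in samples: at most one tatum long
    hop = win // n            # shift-forward time x / n for integer n
    return win, hop
```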
2.6. Machine learning algorithms
Two state-of-the-art machine learning algorithms are chosen for our study, namely the NN and the SVM. We use the software implementations of these methods in Weka [22], a collection of machine learning algorithms for data mining tasks, for the experiments. The NN method is chosen because of its ability to handle high-dimensional features in noisy data. In this paper, we explore different settings for the number of hidden layers, training time, and learning rates. The SVM method has been proven to be a particularly effective algorithm amongst machine learning techniques. The SVM derives its advantage from the optimization process, which maximizes the margin around the decision boundary. In this paper, three SVM parameters are adjusted for analysis: the complexity parameter, the kernel function, and the settings of the kernel function.
2.7. Classifier Selection
In order to select the best classifier for each machine learning algorithm, we compute 10-fold cross-validation errors for each training set. Figure 5 shows the cross-validation errors for the SVM algorithm. Three parameters are adjusted in the Sequential Minimal Optimization (SMO) algorithm, the implementation of SVM in Weka: the complexity parameter C (the tradeoff between fitting the training data and maximizing the separation margin), the kernel function (polynomial vs. Gaussian), and the kernel parameters (degree of the polynomial, or Gaussian variance γ). As shown in Figure 5, on the training set consisting of all 12 major-key Preludes in Bach's Well-Tempered Clavier Book 1, using the Key12 representation, the Gaussian kernel with γ = 10, at complexity C = 30, gives the lowest percentage error.
Fig. 5. 10-fold cross-validation errors for the SMO (SVM) method.
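For readers outside Weka, an analogous parameter search in scikit-learn might look like the sketch below. This is our stand-in for Weka's SMO, not the authors' setup, and X_train/y_train are hypothetical arrays of the 88 attributes and onset labels:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grids mirroring Section 2.7: complexity C, polynomial vs. Gaussian kernel,
# and the kernel parameters (polynomial degree, or Gaussian gamma).
param_grid = [
    {"kernel": ["poly"], "degree": [1, 2], "C": [10, 30, 50, 70, 90]},
    {"kernel": ["rbf"], "gamma": [1, 5, 10], "C": [10, 30, 50, 70, 90]},
]
search = GridSearchCV(SVC(), param_grid, cv=10)  # 10-fold cross-validation
# search.fit(X_train, y_train)                   # hypothetical training arrays
# print(search.best_params_)
```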
Fig. 6. 10-fold cross-validation errors for the NN method.
Figure 6 shows the training results of the NN method, with different parameter settings, for the training set comprising all 12 major-key Preludes in Bach's Well-Tempered Clavier Book 1, with the Key12 representation. Three parameters are adjusted: the number of hidden layers (ranging from 2 to 10), the number of training iterations (100 or 300), and the learning rate (0.25 or 0.4). The momentum used in all cases is 0.2. The best training result for the NN is obtained with 6 hidden layers, a learning rate of 0.4, and 300 training iterations.

3. Experiments and Results
This section presents the results of our experiments: the Effect Experiment, and the Applicability Experiment, as described in the previous sections.

3.1. Experiment 1: effect of key and tempo
In order to explore the effect of musical context, specifically the effects of key and tempo, on audio onset detection, we choose only one musical piece, and create several versions of the same piece, first in different keys, then at different tempi. We use J. S. Bach's Prelude No. 1 in C major, BWV 846, and Prelude No. 2 in C minor, BWV 847, as the training pieces in the major and minor keys respectively.

3.1.1. Testing pieces in different keys
In order to explore the effects of tonality on onset detection, we transpose the training pieces, BWV 846 and BWV 847, to the eleven other keys in the same mode (major or minor). We use the first 32 bars of the pieces as the training excerpts. We represent the keys to which the pieces are transposed according to tonal distance on the circle of fifths, shown in Figure 7. The circle cycles through all twelve keys by fifths, and adjacent keys differ from each other only by one note. For example, G is the fifth note in a C major scale, and G major's key signature contains only one accidental more than that of C major.
Fig. 7. Tonal distance of keys on the circle of fifths: (a) major; (b) minor.
F# and Gb majors are the keys furthest from C major; as a result, they are diametrically opposite C on the circle of fifths. Numbers inside the circle indicate the numbers of transposition half steps: the nearest and second-nearest ways of transposing the pitches in the piece. For example, lowering all the pitches in C major by five half steps, or raising them by seven half steps, transposes the piece to G major. Figure 7(b) presents the keys on the circle of fifths in relation to C minor.

Each test excerpt is spliced into 46.45 ms time chunks, generating 578 instances for BWV 846 and 407 instances for BWV 847, each labeled as either having an onset or not, serving as ground truth. The results are evaluated using three measurements: correctly classified instances (correct percentage rate), and the precision and recall of onset events.

We first report the results for Bach's C major Prelude, BWV 846. Figure 8 shows the onset detection correct rates for the twelve (transposed) test sets, with the results ordered by: (a) tonal distance (selecting the nearest transposition given in Figure 7); (b) tonal distance (the second-nearest transposition according to Figure 7); and (c) pitch distance. The correct rates range from 100%, using SVM on the original C major test piece, to 53.98%, using NN on the D major piece in the second-nearest transposition form. It is not surprising that the onset detection results for the C major piece, the one in the same key as the training data, have far fewer errors than those for the pieces in other keys. Since all the test pieces contain the same onset events at the same amplitude, the disparity between the C major piece results and those of the others suggests that the system is sensitive to tonal differences.

In Figure 8(a), the correct rate of the pieces in keys other than C major decreases with increasing tonal distance, when the nearest transposition is selected. This relatively well-behaved degradation of the correct rate with tonal distance is not as obvious when using the second-nearest transposition, as shown in Figure 8(b). Notice that the degradation in performance is apparent in the pitch distance plot, Figure 8(c), especially when the NN algorithm is used. Figures 8(d) and (e) give the precision and recall figures plotted against pitch distance. The precision for the SVM method falls at the extremes, either close to 1 or 0, while the recall drops significantly when the test piece is not in C major. The precision and recall for the NN algorithm show similar degradation with pitch distance.

We conduct a similar set of tests using Bach's C minor Prelude, BWV 847. Figure 9 shows the correct onset detection rate for the Prelude transposed to different keys, for both the SVM and NN methods. The correct rates range from 100%, for both the SVM and NN methods applied to the C minor piece, to 63.31%, for the NN on the D minor piece. As with the C major Prelude, the degradation of the correct rate with increasing tonal distance can be observed in the nearest transposition test set, as shown in Figure 9(a). Observe, in Figure 9(b), that when the transposition is greater than seven half steps (one perfect fifth), the correct rates for SVM stay mostly constant at 70%, while the NN results do not appear correlated to tonal distance.
Fig. 8. Onset detection results using both SVM and NN on Bach's Prelude in C major (BWV 846) in different keys: (a) correct rate (nearest transposition in tonal distance); (b) correct rate (second-nearest transposition in tonal distance); (c) correct rate in pitch distance; (d) precision in pitch distance; (e) recall in pitch distance.
In Figure 9(c), the correct rates using both SVM and NN decrease as the keys depart from C minor, in both directions according to pitch distance. The systematic degradation of the correct detection rates is more evident for the NN method, especially on the pieces transposed down by half steps. Similar behavior can be observed in the precision and recall figures shown in Figures 9(d) and (e).
Fig. 9. Onset detection results using both SVM and NN on Bach's Prelude in C minor (BWV 847) in different keys: (a) correct rate (nearest transposition in tonal distance); (b) correct rate (second-nearest transposition in tonal distance); (c) correct rate in pitch distance; (d) precision in pitch distance; (e) recall in pitch distance.
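For reference, the three measurements reported throughout this section can be computed from per-chunk labels as follows (a sketch; the paper's actual tooling is Weka, and the function name is ours):

```python
def onset_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # onsets correctly detected
    fp = sum(1 for t, p in pairs if not t and p)    # spurious detections
    fn = sum(1 for t, p in pairs if t and not p)    # missed onsets
    correct = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return correct, precision, recall
```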
3.1.2. Testing pieces at different tempi
This next experiment tests the effect of tempo on onset detection. In this experiment we change the tempo of the excerpts, for both the C major Prelude and the C minor Prelude, to fourteen different values. In this paper we use tatums per minute (tpm), instead of the more common beats per minute (bpm), as the tempo information for onset detection. A tatum is a temporal atom, the shortest duration between notes in a piece; it is usually some fraction of the beat. For Bach's Prelude
in C major, BWV 846, the tempo is changed to 222, 238, 248, 258, 268, 278, 282, 294, 298, 308, 318, 328, 338, and 354 tpm respectively; the original version was at 288 tpm, which corresponds to 36 bpm. The tatums per minute, beats per minute, and minimum inter-onset time interval (in ms) conversion numbers are displayed in Table 2, where the original tempo is marked with an asterisk. For Bach's Prelude in C minor, BWV 847, the tempo is modified to 350, 368, 374, 380, 386, 392, 396, 404, 408, 414, 420, 426, 432, and 450 tpm respectively; the original version was at 400 tpm, which corresponds to 50 bpm. The tempo is set in the MIDI files, which provide the ground-truth onsets for the modified pieces. The tatums per minute, beats per minute, and minimum inter-onset time interval conversion numbers for the C minor Prelude are shown in Table 3; again, the original tempo is marked with an asterisk.
Table 2. Tatums per minute, beats per minute, and minimum inter-onset-time interval conversions for the C major Prelude (original tempo marked *).

tpm          222     238     248     258     268     278     282     288*
bpm          27.75   29.75   31.00   32.25   33.50   34.75   35.25   36.00
minIOI (ms)  270.03  252.10  241.94  232.56  223.88  215.83  212.77  208.33

tpm          288*    294     298     308     318     328     338     354
bpm          36.00   36.75   37.25   38.50   39.75   41.00   42.25   44.25
minIOI (ms)  208.33  204.08  201.34  194.81  188.58  182.93  177.51  169.49
Table 3. Tatums per minute, beats per minute, and minimum inter-onset-time interval conversions for the C minor Prelude (original tempo marked *).

tpm          350     368     374     380     386     392     396     400*
bpm          43.75   46.00   46.75   47.50   48.25   49.00   49.50   50.00
minIOI (ms)  171.43  163.04  160.43  157.89  155.44  153.06  151.52  150.00

tpm          400*    404     408     414     420     426     432     450
bpm          50.00   50.50   51.00   51.75   52.50   53.25   54.00   56.25
minIOI (ms)  150.00  148.51  147.06  144.93  142.86  140.85  138.89  133.33
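The conversions in Tables 2 and 3 follow directly from the tatum rate; a short sketch (our helper, with eight tatums per beat, the factor implied by both Preludes, e.g. 288 tpm = 36 bpm and 208.33 ms):

```python
def tempo_conversions(tpm, tatums_per_beat=8):
    bpm = tpm / tatums_per_beat       # beats per minute
    min_ioi_ms = 60000.0 / tpm        # minimum inter-onset interval in ms
    return bpm, min_ioi_ms
```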
Figure 10 shows the correct rates, precision, and recall numbers for Bach's Preludes in C major (Figures 10(a), (c), and (e)) and C minor (Figures 10(b), (d), and (f)) at different tempi, for both the SVM and NN algorithms. Figures 10(a) and (b) show significantly higher correct rates at the original tempi, 288 and 400 tpm, respectively. As the tempo departs from the original, the correct rate gradually decreases. The worst correct rate is 77.28%, for the test piece in C major at 354 tpm (44.25 bpm), 66 tpm more than the original. For the test piece in C minor at 450 tpm (56.25 bpm), 50 tpm more than the original, the correct rate reaches a low of 78.7%. These results imply that the test result is sensitive to the tempo of the training piece. In Figures 10(e) and (f), the same degradation is observed in the recall measure when the tempo increases or decreases from the original.
Fig. 10. Correct rates, precision, and recall for the SVM and NN methods on Bach's BWV 846 and BWV 847 at different tempi: (a), (c), (e) C major; (b), (d), (f) C minor.
The best recall is found at the original tempo, with the value 1, while the worst recall for C major, 0.17, is found at 354 tpm, and the corresponding lowest recall for C minor, 0.3, is at 386 tpm. In Figures 10(c) and (d), the precision does not show any clear degradation pattern.
3.2. Experiment 2: applicability of key and tempo
To examine the applicability of key and tempo, we choose as our data set Bach's 24 Preludes from the Well-Tempered Clavier Book 1. The 24 Preludes contain pieces in all possible keys at various tempi, providing a dataset with the kind of variety that is closer to realistic situations. We extract the first 32 bars of the pieces as excerpts for the experiments. We treat each excerpt in turn as the test sample, while the system is trained on the remaining pieces, and report the average results over all test samples (a sketch of this protocol appears after Table 4). Detailed information such as the key, time signature, and tempo of the 24 pieces is shown in Table 4. We first apply only the key information; next we test the system with only the tempo information; and finally, we provide both key and tempo information, to examine the usefulness of such information in improving onset detection performance.
Table 4. Keys, time signatures, and tempi of Bach's 24 Preludes in the Well-Tempered Clavier Book 1 used in the Applicability Experiment.

No.   Key        Time Sig.   Tempo (bpm)   Tatum (ms)
1     C major    2/2         36.0          208.30
2     C minor    2/2         50.0          150.00
3     C# major   3/8         138.0         217.40
4     C# minor   6/4         126.0         238.00
5     D major    2/2         54.0          138.90
6     D minor    2/2         60.0          125.00
7     Eb major   2/2         36.0          208.30
8     Eb minor   3/2         55.0          136.40
9     E major    12/8        216.0         277.70
10    E minor    2/2         25.5          294.00
11    F major    12/8        144.0         208.30
12    F minor    2/2         30.0          250.00
13    F# major   12/16       216.0         277.80
14    F# minor   2/2         30.0          250.00
15    G major    24/16       336.0         178.58
16    G minor    2/2         48.0          156.25
17    Ab major   3/4         80.0          187.50
18    Ab minor   6/8         92.0          326.00
19    A major    2/2         29.0          258.60
20    A minor    9/8         180.0         166.70
21    Bb major   2/2         23.0          163.00
22    Bb minor   2/2         26.0          288.50
23    B major    2/2         31.5          238.10
24    B minor    2/2         49.5          303.00
Sections 3.2.1, 3.2.2, and 3.2.3 describe the experimental results with respect to the applicability of key only, tempo only, and both key and tempo, respectively.
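A minimal sketch of the leave-one-piece-out protocol, under our assumptions (scikit-learn in place of Weka; X, y, and piece_ids are hypothetical arrays of chunk attributes, onset labels, and per-chunk Prelude indices):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def leave_one_piece_out(X, y, piece_ids):
    rates = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=piece_ids):
        clf = SVC(kernel="rbf", gamma=10, C=30)  # the setting selected in Section 2.7
        clf.fit(X[train_idx], y[train_idx])
        rates.append(clf.score(X[test_idx], y[test_idx]))  # correct rate per held-out piece
    return np.mean(rates), np.std(rates)
```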
3.2.1. Incorporating key
In order to test the usefulness of the key information, we use three types of representations for input to the machine learning algorithms. The three representations are: Basic (no key information), Key77 (77 attributes, transposed so that the tonic is the lowest pitch), and Key12 (twelve pitch classes, transposed so that the tonic is the lowest pitch class). The construction and the meaning of these representations are described in Section 2.4. Figure 11 shows the average correct rate, precision, and recall using SVM and NN on the 24 Preludes from Bach's Well-Tempered Clavier Book 1. Average performance results are presented with the standard deviation across all test pieces. The 24 Preludes are organized into three datasets: All (all 24 pieces), Major (the 12 major-mode pieces), and Minor (the 12 minor-mode pieces). By separating the pieces into major and minor modes, we can observe the effect of mode as part of the key information analysis.
Fig. 11. Average correct rates, precision and recall using SVM and NN, with and without key information: (a) correct rate (SVM); (b) correct rate (NN); (c) precision (SVM); (d) precision (NN); (e) recall (SVM); (f) recall (NN).
The methods with the Basic representation perform better than those using the Key77 and Key12 representations. These results suggest that re-arranging the attributes as a way to incorporate key information may not be particularly useful for onset detection on this dataset. The results using the Key77 and Key12 representations do not differ much from each other, suggesting that removing register information may not significantly affect onset detection performance. The experiments on the major pieces generally report higher precision and recall than those on the minor pieces, as shown in Figures 11(c) through (f). This outcome may be caused by the greater diversity among the minor-mode pieces.
3.2.2. Incorporating tempo
In order to test the usefulness of the tempo information, we compare the results of the methods with and without incorporating such information. We apply the tempo information in the way described in Section 2.5, for better alignment of onset events. The test pieces, the Preludes from Bach's Well-Tempered Clavier Book 1, are further divided into two groups, Fast and Slow, according to the tempi of the MIDI files. When the tatum is less than 200 ms, i.e., when the tempo is more than 75 bpm given a sixteenth-note tatum with the quarter note as the beat, the piece is labeled as fast; otherwise, it is considered slow. Based on the tempo information listed in Table 4, nine excerpts are labeled as fast (the C minor, D major, D minor, Eb minor, G major, G minor, Ab major, A minor, and Bb major Preludes), and the remaining fifteen are labeled as slow.
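As a concrete check against the tatum column of Table 4 (a trivial sketch; the threshold is the paper's, the function ours):

```python
def tempo_label(tatum_ms):
    # Section 3.2.2: a tatum under 200 ms (above 75 bpm for a sixteenth-note
    # tatum with the quarter-note beat) counts as fast.
    return "fast" if tatum_ms < 200 else "slow"
```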
Fig. 12. Average correct rates, precision and recall using SVM and NN, with and without tempo information.
Figure 12 shows the average correct rate, precision and recall using SVM (Figures 12(a), (c), and (e)) and NN (Figures 12(b), (d), and (f)) on the 24 Preludes, separated into fast and slow sets, with and without tempo information. We use the Basic representation when no tempo information is given. The mean and standard deviation results are displayed in Figure 12. The results for the experiments with slow pieces and with tempo information (Tempo/Slow) outperform those without tempo information on the same dataset (Basic/Slow), regardless of which machine learning algorithm is used. These results imply that using tempo information for splicing audio into processing time chunks may improve the onset detection performance for pieces whose tatums are longer than 200 ms. For the pieces labeled as fast, the results with tempo information (Tempo/Fast) are slightly better than those without (Basic/Fast) when NN is used, as shown in Figures 12(b), (d), and (f). The higher average performance of the Tempo/Fast setting over the Basic/Fast setting is offset by the larger standard deviation errors. When using the SVM technique, tempo information does not appear to benefit onset detection for the fast pieces, as shown in Figures 12(a), (c), and (e).
Fig. 13. Average correct rates with and without key and tempo information using SVM: (a) major pieces; (b) minor pieces.
Fig. 14. Average correct rates with and without key and tempo information using NN: (a) major pieces; (b) minor pieces.
3.2.3. Incorporating both key and tempo
Next, we consider incorporating both key and tempo information in the machine learning systems. Figure 13 shows the average correct rates with and without key and tempo information using SVM on the twelve major-mode (Figure 13(a)) and the twelve minor-mode (Figure 13(b)) pieces. Since the addition of key information via the Key77 and Key12 representations did not improve the onset detection results in Section 3.2.1, it is not surprising that using both key and tempo, shown by the Key77/Tempo and Key12/Tempo bars, does not improve the performance either. Note that the correct rates of the Basic/Tempo, Key77/Tempo, and Key12/Tempo settings are significantly higher than those of Basic, Key77, and Key12, without tempo information, for the major-mode pieces, as shown in Figure 13(a). The improvements resulting from the added tempo information are more apparent in Figure 14(a) for the NN method.

4. Conclusions and Discussions
In this paper we provide sensitivity studies on the importance and usefulness of musical context in audio onset detection using machine learning techniques. We have demonstrated the effect of key and tempo by training the system on one piece each in a major key and a minor key, namely, Bach's Prelude in C major, BWV 846, and Prelude in C minor, BWV 847, and testing the system on the same pieces, transposed to all other possible keys in the same mode, and time-stretched to different tempi. The two machine learning algorithms, SVM and NN, show a consistent performance degradation as the test piece veers away from the original key or tempo.

In the experiments using test pieces in all keys, we have shown that onset detection performance is significantly improved when the training and test pieces are in the same key. The results show a clear correlation between performance degradation and key distance from the original training piece. In the experiments using test pieces at different tempi, we observed that the performance deteriorates with increasing tempo distance from the original, suggesting that onset detection performance is also sensitive to tempo change. This effect may have been enhanced by the fixed sliding window size; the window size was selected without regard to the tempi. Tables 5 and 6 summarize the best and worst results from this first set of experiments. The results indicate that tonal distance and tempo changes are important factors in audio onset detection, and thus merit future research attention.

In the later part of the paper, we propose several ways of incorporating key and tempo information for audio onset detection. In a second set of experiments, we tested the applicability of key and tempo information to improving onset detection. Key information, such as mode and pitch relations, is encoded in the order of the audio attributes. Tempo information is used to select the size of the overlapped sliding windows, so as to better align segments with onset events. We test the methods
Table 5. Summary of the best and worst results: tests on pieces transposed to different keys.

Piece          Key       Algorithm   Correct Rate (%)   Precision   Recall
Training       C major   SVM         91.35              0.89        0.70
Training       C major   NN          94.64              0.89        0.87
Test (best)    C major   SVM/NN      100.00             1.00        1.00
Test (worst)   D major   NN          53.98              0.21        0.39
Training       C minor   SVM         84.65              0.99        0.51
Training       C minor   NN          94.96              0.92        0.92
Test (best)    C minor   SVM/NN      100.00             1.00        1.00
Test (worst)   D minor   NN          63.31              0.41        0.33
Table 6. Summary of the best and worst results: tests on pieces at different tempi.

Piece          Tempo (tpm)   Algorithm   Correct Rate (%)   Precision   Recall
Training       288           SVM         91.35              0.89        0.70
Training       288           NN          94.64              0.89        0.87
Test (best)    288           SVM         100.00             1.00        1.00
Test (worst)   354           SVM         77.28              0.96        0.81
Training       400           SVM         84.65              0.99        0.51
Training       400           NN          94.96              0.92        0.92
Test (best)    400           SVM         100.00             1.00        1.00
Test (worst)   450           SVM         78.70              1.00        0.40
using the 24 Preludes from Bach's Well-Tempered Clavier Book 1; these pieces are composed in all keys, and in a variety of tempi. In the experiments testing the applicability of key, the results do not show significant improvement when key information is applied. We discovered that incorporating key information with or without register information produced similar results, and that higher correct rates were achieved on the test set with major-mode pieces than on the test set with minor-mode ones. In the experiments testing the applicability of tempo, the systems with tempo information incorporated performed better, in general, than those without tempo information; this was especially true for the pieces in slower tempi, where the tatums were over 200 ms long. Table 7 summarizes the best and worst results from this second set of experiments.
Table 7. Summary of the best and worst results: incorporating key and/or tempo information.

Test Set                  Model         Algorithm   Correct Rate (%)   Precision   Recall
12 major pieces (best)    Basic/Tempo   SVM         93.43              0.88        0.85
12 major pieces (worst)   Key12         NN          86.58              0.78        0.62
A general concern when applying machine learning techniques is the problem of over-fitting. In the experiments testing the effect of key and tempo, we use
the same piece in the testing stage, transposed to other keys and tempi, so as to constrain the problem to one of investigating system sensitivity to only key and tempo changes. In the experiments testing the applicability of key and tempo, we use 10-fold cross-validation errors to generate conservative accuracy scores that prevent the learning system from over-fitting in the training stage.

To understand more comprehensively the effect of musical context on audio onset detection, our future work will include more detailed investigations into the influence of model parameters on the results, deeper explorations into the error sources, and other ways to apply key and tempo information. Further work will examine the effect of timbre on audio onset detection by synthesizing wave files of the same pieces using different instruments.

Acknowledgements
The research was supported in part by the National Science Foundation (NSF) through the Integrated Media Systems Center, an NSF ERC, under Cooperative Agreement EEC-9529152, and by NSF grant No. 0347988. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect those of NSF.

References
[1] J. P. Bello and M. Sandler, Phase-based note onset detection for music signals, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-03), pp. 49-52, Hong Kong, 2003.
[2] J. P. Bello, C. Duxbury, M. Davies, and M. B. Sandler, On the use of phase and energy for musical onset detection in the complex domain, IEEE Signal Processing Letters, vol. 11, no. 6, pp. 553-556, 2004.
[3] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, A tutorial on onset detection in music signals, IEEE Trans. on Speech and Audio Processing, vol. 13, no. 5, pp. 1035-1047, 2005.
[4] N. Collins, Using a pitch detector for onset detection, in Proc. of 6th Int. Conf. on Music Information Retrieval, pp. 100-106, London, UK, 2005.
[5] N. Collins, A change discrimination onset detector with peak scoring peak picker and time domain correction, Extended abstract of the 1st Annual Music Information Retrieval Evaluation eXchange (MIREX 2005), London, U.K., 2005.
[6] N. Collins, A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions, in 118th Convention of the Audio Engineering Society, Barcelona, Spain, 2005.
[7] R. Dannenberg, Bootstrap learning for accurate onset detection, Machine Learning, vol. 65, no. 2-3, pp. 457-471, 2006.
[8] S. Dixon, Onset detection revisited, in Proc. of the 9th Int. Conf. on Digital Audio Effects (DAFx-06), Montreal, Canada, 2006.
[9] C. Duxbury, M. Sandler, and M. Davies, A hybrid approach to musical note onset detection, in Proc. of Digital Audio Effects Conf. (DAFX'02), pp. 33-38, Hamburg, Germany, 2002.
[10] C. Duxbury, J. P. Bello, M. Davies, and M. Sandler, A combined phase and amplitude based approach to onset detection for audio segmentation, in Proc. of 4th European
Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS-03), pp. 275-280, London, U.K., 2003.
[11] O. Gillet and G. Richard, Drum track transcription of polyphonic music using noise subspace projection, in Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR), London, U.K., 2005.
[12] M. Goto and Y. Muraoka, Beat tracking based on multiple-agent architecture - a real-time beat tracking system for audio signals, in Proc. of 2nd Int. Conf. on Multiagent Systems, 1996.
[13] F. Gouyon, G. Widmer, X. Serra, and A. Flexer, Acoustic cues to beat induction: a machine learning perspective, Music Perception, vol. 24, no. 2, pp. 177-188, 2006.
[14] E. Kapanci and A. Pfeffer, A hierarchical approach to onset detection, in Proc. of the International Computer Music Conference (ICMC'04), Miami, USA, 2004.
[15] A. Klapuri, Sound onset detection by applying psychoacoustic knowledge, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-99), pp. 115-118, Phoenix, USA, 1999.
[16] A. Lacoste and D. Eck, A supervised classification algorithm for note onset detection, EURASIP Journal on Applied Signal Processing, 2007.
[17] W. C. Lee and C.-C. J. Kuo, Improved linear prediction technique for musical onset detection, in Proc. of Int. Conf. on Intelligent Information Hiding and Multimedia, 2006.
[18] M. Marolt, A. Kavcic, and M. Privosnik, Neural networks for note onset detection in piano music, in Proc. of the International Computer Music Conference, Stockholm, Sweden, 2002.
[19] P. Masri, Computer modeling of sound for transformation and synthesis of musical signal, Ph.D. dissertation, University of Bristol, Bristol, UK, 1996.
[20] J. Ricard, An implementation of multi-band onset detection, Extended abstract of the 1st Annual Music Information Retrieval Evaluation eXchange (MIREX 2005), London, U.K., 2005.
[21] D. Van Steelant, K. Tanghe, S. Degroeve, B. De Baets, M. Leman, J.-P. Martens, and T. De Mulder, Classification of percussive sounds using support vector machines, in Proc. of the Annual Machine Learning Conference of Belgium and The Netherlands, Brussels, Belgium, 2004.
[22] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.