MULTIPITCH ESTIMATION AND INSTRUMENT RECOGNITION BY EXEMPLAR-BASED SPARSE REPRESENTATION

Ikuo Degawa, Kei Sato, Masaaki Ikehara
EEE Dept., Keio University, Yokohama, Kanagawa 223-8522, Japan
E-mail: {degawa, sato, ikehara}@tkhm.elec.keio.ac.jp

ABSTRACT

This paper investigates pitch estimation and instrument recognition for music signals. A note exemplar is a spectrum segment of a note of a specific pitch and instrument, stored in advance in a dictionary. We describe a method that reconstructs each frame of a music signal as a linear combination of exemplars drawn from a large exemplar dictionary, with a sparse (l1-minimized) coefficient vector. The reconstruction error is measured by the KL divergence between spectra, which we find produces better results than the Euclidean distance. Experiments show that the proposed algorithm can transcribe music pieces with relatively many notes per frame and separate the instruments explicitly.

Index Terms— pitch estimation, instrument recognition, l1-regularized minimization, note exemplar

1. INTRODUCTION
Content-based music information retrieval (MIR) has recently drawn increasing attention because of the explosive growth of digital music. Estimating multiple fundamental frequencies (pitches) and recognizing the instruments being played are important tasks for many applications, including automatic music transcription [9]. Many multiple-pitch detection systems have been proposed, such as methods based on spectral peaks and maximum-likelihood estimation [5] and on non-negative matrix factorization (NMF) [6]. On the other hand, few systems recognize the instrument of each estimated pitch. One of them is [4], which handles both pitch and instrument features for real-time estimation at low computational cost.

Exemplar-based sparse representation reconstructs an input signal vector y as a weighted sum of atoms of a dictionary matrix A, i.e., y = Ax. Assuming the weight vector x is sparse, its entries indicate which atoms of A are active, which makes it possible to extract the information needed to classify or decompose input signals. The technique applies in principle to various systems such as face recognition [2] and speech recognition [3]. Pitch estimation with this technique was attempted in [1], which is specialized to piano music and relies heavily on preprocessing such as selecting pitch candidates from spectral peaks. In our method, we apply exemplar-based sparse representation to multipitch estimation and then perform instrument recognition on the given musical excerpt. Since neither preprocessing nor retraining is necessary, the proposed method is simple, and thanks to the l1 minimization it is powerful enough to handle musically complicated signals.

2. EXEMPLAR-BASED SPARSE REPRESENTATION

We perform pitch and instrument estimation for each frame individually. Fig. 1 gives an overview of exemplar-based sparse representation. Given an observation vector y_t at frame t, a nonnegative coefficient vector x_t with a sparsity constraint is determined by the following l1-minimization problem:

\hat{x}_t = \arg\min_{x_t} \|x_t\|_1 \quad \text{s.t.} \quad y_t = A x_t, \; x_t \ge 0. \qquad (1)
In most cases y_t = A x_t is underdetermined (in other words, the dictionary A is overcomplete). We therefore reformulate (1) as

\hat{x}_t = \arg\min_{x_t} \|y_t - A x_t\|_2^2 + \lambda \|x_t\|_1 \quad \text{s.t.} \quad x_t \ge 0, \qquad (2)

where λ is a positive regularization parameter. Many methods can solve (2); one of them is the truncated Newton interior-point (TNIP) method [7], to which the nonnegativity constraint on the coefficients \hat{x}_t can easily be added. Note that the minimization is conducted on each frame individually. One advantage of exemplar-based sparse representation is that it requires no learning process and no retraining of the dictionary when new note exemplars are added. Furthermore, it exploits the pitch range of each instrument automatically, because pitch candidates that are impossible for a specific instrument are absent from the dictionary in the first place. Prior information, such as which instruments are played in a music piece, is also valuable in the proposed system, because it is easy to reassemble the dictionary A from the note exemplars of the instruments under consideration.
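To make the reformulation concrete, the following sketch solves (2) by projected (proximal) gradient descent; this is an illustrative stand-in for the TNIP solver [7] used in the paper, and the function name, step rule, and defaults are our own.

```python
import numpy as np

def sparse_code_l1(A, y, lam=0.1, n_iter=500):
    # Minimize ||y - A x||_2^2 + lam * ||x||_1 subject to x >= 0, as in (2).
    # Illustrative projected/proximal gradient; not the TNIP method of [7].
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)    # gradient of the quadratic term
        # With x >= 0, the l1 prox reduces to a shift followed by clipping.
        x = np.maximum(x - (grad + lam) / L, 0.0)
    return x
```

Since x is constrained to be nonnegative, the l1 penalty is linear (λ·1ᵀx), so each iteration only shifts the gradient step by λ/L and clips at zero.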
Fig. 1: Illustration of exemplar-based sparse representation of a music frame: (a) input spectrum y_t; (b) note-exemplar dictionary A; (c) sparse coefficient vector x_t, so that y_t ≈ A x_t.

3. KL DIVERGENCE MINIMIZATION

To obtain better results, we use the generalized Kullback-Leibler (KL) divergence d(·, ·) instead of the Euclidean distance:

d(y, \hat{y}) = \sum_{k=1}^{K} \left( y_k \log \frac{y_k}{\hat{y}_k} - y_k + \hat{y}_k \right). \qquad (3)
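For reference, (3) can be implemented directly; the small epsilon below is our own numerical guard and is not part of the paper's formulation.

```python
import numpy as np

def kl_div(y, y_hat, eps=1e-12):
    # Generalized KL divergence of (3) between nonnegative spectra.
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum(y * np.log((y + eps) / (y_hat + eps)) - y + y_hat)
```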
The KL divergence has been found to produce better accuracy than the Euclidean distance in many sound-processing methods, e.g., [3], [12]. The minimization is then reformulated as

\hat{x}_t = \arg\min_{x_t \ge 0} d(y_t, A x_t) + \lambda \|x_t\|_1. \qquad (4)
The cost function of (4) is minimized by first initializing the entries of the vector x to unity and then iteratively applying the update rule

x \leftarrow x .* \left( A^T (y ./ (Ax)) \right) ./ \left( A^T \mathbf{1} + \lambda \right), \qquad (5)

where .* and ./ denote element-wise multiplication and division, respectively, and \mathbf{1} is an all-one vector. The derivation of (5) is given in [3], [10].
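The update (5) translates almost verbatim into code. In the sketch below, only the epsilon guard against division by zero, the iteration count, and the names are our own additions:

```python
import numpy as np

def kl_sparse_code(A, y, lam=0.1, n_iter=200, eps=1e-12):
    # Multiplicative update (5): minimizes d(y, A x) + lam * ||x||_1
    # over x >= 0, with x initialized to all ones as stated in the text.
    x = np.ones(A.shape[1])
    denom = A.sum(axis=0) + lam           # A^T 1 + lam (column sums of A)
    for _ in range(n_iter):
        x *= (A.T @ (y / (A @ x + eps))) / denom
    return x
```

Because the update is multiplicative and all quantities are nonnegative, x stays nonnegative automatically, with no explicit projection step.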
4. NOTE NUMBER ESTIMATION

Once \hat{x}_t is available, the activation score S(p, t, i) for frame t, pitch p, and instrument i is calculated by summing the values of the elements of \hat{x}_t that correspond to the note under consideration.
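As a minimal sketch of this summation (the function and the column-to-note mapping are our own constructions, not from the paper):

```python
def activation_scores(x_hat, col_note):
    # S(p, t, i) for one frame: sum the sparse coefficients of all
    # dictionary columns belonging to the same (pitch, instrument) note.
    # `col_note` maps column index -> (pitch, instrument) identifier.
    S = {}
    for k, note in enumerate(col_note):
        S[note] = S.get(note, 0.0) + float(x_hat[k])
    return S
```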
However, the activation score cannot be used directly, because deciding the number of notes (pitches) in a frame and their instruments is a complex and challenging task. A musical note contains harmonics at integer multiples of its fundamental frequency, so a harmonic may be extracted as a spurious extra note, and deciding the number of notes with a single fixed threshold may cause octave errors and the like. To address this problem, we develop a dynamic thresholding algorithm that decides the number of notes in a manner similar to [5] (see Fig. 2). First, silent frames are detected by summing the activation scores over p and thresholding the sum at 0.1 times its maximum value; the note number of these frames is set to 0. Second, for a specific instrument and frame we select the M pitches with the largest activation scores S(p, t, i), in descending order. We then set the threshold to T·ΔS, where 0 < T < 1 is an experimentally learned constant and ΔS = S(1) − S(M). The note number is decided as the number of activation scores exceeding this threshold; in Fig. 2 the note number is decided to be 4. For all experiments in this paper, the maximum polyphony M is set to 10, and T is empirically determined to be 0.20.

Fig. 2: Illustration of the note number decision algorithm.
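A sketch of this decision rule for a single frame and instrument might look as follows; the function signature and the treatment of the global maximum are our own assumptions.

```python
import numpy as np

def decide_notes(S_frame, global_max, M=10, T=0.20, silence_ratio=0.1):
    # Dynamic thresholding of Section 4 for one frame and one instrument.
    # S_frame: activation scores over all pitches; global_max: maximum of
    # the per-frame score sums over the whole piece (for silence detection).
    if S_frame.sum() < silence_ratio * global_max:
        return []                           # silent frame: note number 0
    order = np.argsort(S_frame)[::-1][:M]   # top-M pitches by activation
    top = S_frame[order]
    thresh = T * (top[0] - top[-1])         # threshold T * (S(1) - S(M))
    return order[top > thresh].tolist()     # pitches counted as active notes
```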
5. TEMPORAL SMOOTHING

Polyphony estimation in a single frame is not robust; there are often deletion, insertion, and substitution errors. Moreover, if a note is found active in a certain frame, it is very likely also active in the subsequent frames. To exploit this, we adopt a smoothing technique based on a hidden Markov model (HMM), following [1]. For the p-th note (pitch), the problem is formulated as

\hat{S}^p = \arg\max_{S^p} \prod_{t=1}^{T} P(o_t \mid s_t^p) \, P(s_t^p \mid s_{t-1}^p), \qquad (6)
where \hat{S}^p is a state sequence, s_t^p is the state of the note at time t, o_t is the music frame beginning at time t, P(o_t | s_t^p) is the probability of observing o_t given s_t^p, and P(s_t^p | s_{t-1}^p) is the transition probability between states. Using Bayes' theorem, P(s_t^p | o_t) ∝ P(o_t | s_t^p) P(s_t^p), we obtain

\hat{S}^p = \arg\max_{S^p} \prod_{t=1}^{T} \frac{P(s_t^p \mid o_t)}{P(s_t^p)} \, P(s_t^p \mid s_{t-1}^p) \qquad (7)

instead of (6). P(s_t^p | o_t) is obtained by dividing the activation score of the note by the maximum activation score at time t. Both the prior P(s_t^p) and the state transition probability P(s_t^p | s_{t-1}^p) can be learned from training data in MIDI format. The Viterbi algorithm is then applied to find the state sequence that maximizes (7); implementation details are described in [11].
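A standard log-domain Viterbi recursion for the two-state (inactive/active) note HMM of (7) could look as follows; the handling of the initial frame and all names are our own, and [11] should be consulted for the exact implementation.

```python
import numpy as np

def viterbi_note(post, prior, trans, eps=1e-12):
    # Decode (7) for one pitch. post[t, s] = P(s_t^p | o_t), prior[s] = P(s^p),
    # trans[i, j] = P(s_t^p = j | s_{t-1}^p = i); states: 0 inactive, 1 active.
    T_frames = post.shape[0]
    logw = np.log(post + eps) - np.log(prior + eps)   # log P(s|o)/P(s)
    logA = np.log(trans + eps)
    delta = logw[0].copy()            # initial frame: no transition term here
    back = np.zeros((T_frames, 2), dtype=int)
    for t in range(1, T_frames):
        scores = delta[:, None] + logA                # candidate predecessors
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logw[t]
    path = np.zeros(T_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T_frames - 1, 0, -1):              # backtrace
        path[t - 1] = back[t, path[t]]
    return path                       # 0/1 activity of the note per frame
```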
Although it is also possible to take smoothness between notes (inter-note smoothness) into account by adopting a coupled HMM, we use (7) alone because a coupled HMM is computationally expensive. After the smoothing process, note estimates shorter than 100 ms (in this paper, 10 frames) are removed: since the pitches of music signals are locally stable, very short notes are considered to be noise.

6. EVALUATION

In this section we present the results of two experiments, single-instrument and multiple-instrument music transcription, which use different datasets (training and test samples). The sampling frequency of the music pieces is 44.1 kHz, and the frequency components below 8 kHz are used to produce the spectrogram. The STFT frame length is 100 ms, the hop size is 10% (10 ms), and the pitch range is MIDI 1 to MIDI 120. We use the first 30 s of each piece to reduce the amount of computation. Note samples are prepared from Logic Pro and the RWC music database [8]; for each note, 5 frames sampled at intervals of 2 frames from the onset serve as note exemplars.
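Under the settings just listed, a note-exemplar block for one isolated note recording could be extracted as in the sketch below; the function name and the use of scipy.signal.stft are our own choices.

```python
import numpy as np
from scipy.signal import stft

def note_exemplars(audio, fs=44100, n_exemplars=5, step=2):
    # Build note-exemplar dictionary columns from one isolated-note recording,
    # following Section 6: 100 ms frames, 10 ms hop, spectrum below 8 kHz,
    # 5 frames taken at intervals of 2 frames from the note onset.
    nper = int(0.1 * fs)                           # 100 ms frame
    hop = int(0.01 * fs)                           # 10 ms hop
    f, t, Z = stft(audio, fs=fs, nperseg=nper, noverlap=nper - hop)
    mag = np.abs(Z[f < 8000, :])                   # magnitude spectrum < 8 kHz
    cols = [mag[:, i * step] for i in range(n_exemplars)]
    return np.stack(cols, axis=1)                  # one dictionary block
```

Stacking such blocks over all pitches and instruments yields the dictionary A; each column is tagged with its (pitch, instrument) note identity for the activation-score summation of Section 4.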
Table 1: Results of monophonic music transcription (average over 60 music pieces).

(a) Frame-based results

      None    Gauss   HMM     Lee's [1]
P     65.4%   62.2%   64.1%   74.4%
R     65.7%   71.6%   73.3%   66.5%
F     64.6%   65.5%   67.3%   70.2%

(b) Note-based results

      None    Gauss   HMM
P     67.3%   62.1%   68.0%
R     79.4%   67.1%   79.3%
F     71.4%   63.2%   71.8%
Three standard metrics, namely precision (P), recall (R), and F-measure (F), are used for the evaluation:

P = \frac{N_{tp}}{N_{tp} + N_{fp}}, \qquad (8)

R = \frac{N_{tp}}{N_{tp} + N_{fn}}, \qquad (9)

F = \frac{2PR}{P + R}, \qquad (10)
where N_tp is the number of correct pitch estimates (true positives), N_fp is the number of inactive pitches estimated as active (false positives), and N_fn is the number of active pitches estimated as inactive (false negatives), counted over all frames. In addition to this frame-based evaluation, we also perform a note-based evaluation of the proposed system, following [1]: if a note's state is consecutively active for more than ten frames, the note onset is placed at the time tick 10 ms before the end of the first active frame (the hop size being 10 ms), and the same three metrics are then computed at the note level.

6.1. Single-instrument music transcription

In the single-instrument experiment we estimate only pitches, on test pieces from the MAPS database [13]: 60 music pieces generated by a Steinway D piano, whose underlying pitches are available from the corresponding (aligned) MIDI files. The note-exemplar dictionary covers 4 instruments, alto sax, piano, vibraphone, and bass, with 3 individual instruments each. Table 1 shows the performance of single-instrument music transcription for three types of post-processing: 'None' (no post-processing), 'Gauss' (convolving a Gaussian function along time), and 'HMM' (described in Section 5). The frame-based F-measure of 67.3% for HMM in Table 1(a) is the best of the three post-processing types, since this technique exploits the musical properties of the signal. The table also indicates that, although the proposed method uses no preprocessing steps such as tuning adjustment or note-candidate selection, and although its dictionary is not restricted to piano but contains spectra of several instruments, its accuracy is comparable to that of Lee's method [1] on the same test samples. In the note-based evaluation, the F-measures of None and HMM exceed 70%, whereas Gauss smoothing degrades the result; we attribute this to notes merging with adjacent notes under simple smoothing, so that their onsets vanish. An example piano-roll result is depicted in Fig. 3, with true positives in black, false positives in red, and false negatives in yellow. Each note is clearly separated from adjacent notes, which makes the accuracy significantly high.

Fig. 3: Estimation result of liz_et4_SptkBGCl (best note-based result) with HMM post-processing. P = 97.3%, R = 82.2%, F = 89.1%.
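For completeness, the frame-based metrics (8)-(10) can be computed from binary piano rolls as in this sketch (names are ours):

```python
import numpy as np

def frame_prf(est, ref):
    # Frame-based precision/recall/F-measure of (8)-(10).
    # est, ref: boolean piano rolls of shape (n_pitches, n_frames).
    tp = np.logical_and(est, ref).sum()         # correct active estimates
    fp = np.logical_and(est, ~ref).sum()        # inactive estimated as active
    fn = np.logical_and(~est, ref).sum()        # active estimated as inactive
    P = tp / (tp + fp)
    R = tp / (tp + fn)
    return P, R, 2 * P * R / (P + R)
```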
6.2. Multiple-instrument music transcription

In the multiple-instrument experiment we estimate pitches for multiple instruments. The note-exemplar dictionary is the same as in the single-instrument experiment. Since few music datasets come with corresponding score information such as aligned MIDI files, we mixed three music samples with Logic Pro from MIDI files of the RWC database for the evaluation. During estimation the number of instruments is assumed to be known; we take the activation scores of the two instruments i with the largest sums of S and binarize them as described in Section 4 to generate piano rolls. Figs. 4 and 5 show a result of the pitch and instrument estimation: the proposed approach discriminates well between multiple instruments whose pitch ranges partly overlap and estimates their pitches independently. With HMM post-processing, the F-measures in Tables 2 and 3 are not less than 60%; to the best of our knowledge, this is the highest reported accuracy for pitch estimation of multiple-instrument music.

Fig. 4: Estimation result for the vibraphone of Lounge Away with HMM post-processing. P = 37.4%, R = 74.7%, F = 49.8%.

Fig. 5: Estimation result for the piano of Lounge Away with HMM post-processing. P = 37.4%, R = 74.7%, F = 49.8%.

Table 2: Frame-based results of multiple-instrument music transcription (3 music pieces).

                 Crescent Serenade    Lounge Away        Jive
                 piano     bass       piano    vibra     piano    vibra
None    P        78.8%     74.2%      71.6%    55.9%     76.5%    49.7%
        R        61.3%     55.8%      56.1%    92.4%     57.7%    87.9%
        F        69.0%     63.7%      62.9%    69.7%     65.7%    63.5%
Gauss   P        75.1%     53.9%      66.7%    53.8%     71.2%    46.8%
        R        67.5%     73.8%      66.0%    94.9%     68.1%    91.8%
        F        71.1%     62.3%      66.3%    68.7%     69.6%    62.0%
HMM     P        78.5%     70.4%      71.4%    54.4%     75.9%    48.3%
        R        68.2%     70.0%      66.5%    96.2%     66.3%    94.6%
        F        73.0%     70.2%      68.8%    69.5%     70.8%    63.9%

Table 3: Note-based results of multiple-instrument music transcription (3 music pieces).

                 Crescent Serenade    Lounge Away        Jive
                 piano     bass       piano    vibra     piano    vibra
None    P        88.8%     54.8%      72.6%    93.2%     79.4%    91.4%
        R        83.3%     83.3%      82.0%    92.1%     87.5%    81.3%
        F        85.9%     66.1%      77.0%    92.7%     83.2%    86.0%
Gauss   P        77.7%     15.7%      57.0%    94.0%     58.8%    81.1%
        R        69.9%     45.8%      73.7%    87.6%     77.8%    65.9%
        F        73.6%     23.4%      64.3%    90.7%     67.0%    72.7%
HMM     P        88.8%     58.8%      72.9%    93.2%     79.8%    91.3%
        R        83.3%     83.3%      82.0%    92.1%     87.5%    80.2%
        F        85.9%     69.0%      77.2%    92.7%     83.5%    85.4%
7. CONCLUSION

In this paper we presented a system for the detection of multiple pitches produced by multiple instruments. The core of the proposed system relies on the l1-norm minimization (4) and the use of the KL divergence. By nature, the system is free of learning and retraining processes for the exemplar dictionary. The results show that the proposed system can estimate pitches and recognize the instruments of music samples with a relatively large number of notes. As future work, we plan to apply the method to non-harmonic musical notes (mainly percussion) and to evaluate the effectiveness of dictionary-based sparse coding on them.
8. REFERENCES

[1] C. Lee, Y. Yang, and H. H. Chen, "Multipitch estimation of piano music by exemplar-based sparse representation," IEEE Trans. Multimedia, vol. 14, no. 3, June 2012.

[2] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, Feb. 2009.

[3] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based sparse representations for noise robust automatic speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2067-2080, Sept. 2011.

[4] A. Cont and S. Dubnov, "Realtime multiple-pitch and multiple-instrument recognition for music signals using sparse non-negative constraints," in Proc. 10th Int. Conf. on Digital Audio Effects (DAFx-07), Bordeaux, France, Sept. 2007.

[5] Z. Duan, B. Pardo, and C. Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2121-2133, Nov. 2010.

[6] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Discriminative non-negative matrix factorization for multiple pitch estimation," in Proc. Int. Society for Music Information Retrieval Conf. (ISMIR), 2012.
[7] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An interior-point method for large-scale l1-regularized least squares," IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, Dec. 2007.

[8] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC Music Database: Popular, classical and jazz music databases," in Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2002.

[9] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, "Content-based music information retrieval: Current directions and future challenges," Proc. IEEE, vol. 96, no. 4, pp. 668-696, Apr. 2008.

[10] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Neural Information Processing Systems (NIPS), 2001, pp. 556-562.

[11] H.-L. Lou, "Implementing the Viterbi algorithm," IEEE Signal Processing Magazine, vol. 12, no. 5, pp. 42-52, 1995.

[12] T. Virtanen, "Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 1066-1074, 2007.

[13] MAPS Database - A piano database for multipitch estimation and automatic transcription of music. http://www.tsi.telecom-paristech.fr/aao/en/2010/07/08/maps-database-a-piano-database-for-multipitch-estimation-and-automatic-transcription-of-music/