a review on music source separation

A REVIEW ON MUSIC SOURCE SEPARATION

Ruolun Liu, Suping Li

Shandong University at Weihai, Weihai, Shandong, China

ABSTRACT

Music source separation is relevant to many applications, including automatic music transcription, studio remixing, and content-related music indexing. Although source separation has been discussed for decades, there is no systematic survey that focuses on music source separation. Different methods and techniques have been proposed from different points of view, reflecting the researchers' own specialties. Each has its advantages and limitations, and they are scattered across conference proceedings, journals in different fields, reports, dissertations, and book chapters. One purpose of this work is to collect them in one place and thus facilitate future research on this topic. The other is to point out possible directions for future research.

Index Terms: Music source separation, CASA, BSS

1. INTRODUCTION


When instruments are played together, their sounds are mixed. Music source separation is the problem of extracting each individual instrument sound from the mixture. It is a key technique in applications such as automatic music transcription, studio remixing, and content-related music indexing. A well-trained musician can accomplish this task easily. Unfortunately, a computer needs more information than people are aware of using to do the same job. In the past few decades, many algorithms have been proposed for source separation, such as target separation in radar or sonar, speaker separation in conversation, and speech separation from background noise or music, but few of them address the separation of a singer or an instrument from a mixture of music signals. Music signals have different spectral and temporal characteristics because of the different sound production mechanisms of different instruments, which often play simultaneously at consonant pitch intervals, in the same tempo, and with similar melodies. This breaks the statistical independence and orthogonality assumptions commonly required in sparse coding [42, 52] and Nonnegative Matrix Factorization (NMF) [14]. The other prominent difference between music and other audio signals is that music has a strong harmonic spectral structure, which is the main reason for using the sinusoidal harmonic model [47, 48]. The main obstacle to sinusoidal-model-based separation is that partials of different instruments may overlap and are then hard to split apart. Adaptive spectral filtering can extract individual notes using pitch and timing knowledge as a priori information [30]. This technique solves the problem of overlapping partials to some degree. So far, however, there is no approach that is reliable in the general case. One category of music source separation, which we call streaming, can be traced back to Computational Auditory Scene

Analysis (CASA), where perceived auditory events are grouped into auditory streams according to common psychoacoustic cues, including onset or offset, amplitude variation, frequency variation [47, 48, 49, 56, 57], spectral centroid, location similarity, and harmonicity [11, 19, 37, 44]. New cues are still being explored to segregate instruments playing in the same pitch range. We call the other category de-mixing; it relies on information about how the sources are mixed, that is, the mixing structure. Multi-microphone recordings or multiple observations are therefore needed. The typical algorithms are based on Independent Component Analysis (ICA) or Principal Component Analysis (PCA) [1]. In this paper, we review the existing algorithms for music source separation and classify them into these two categories, which concern the perceptual unit and the mixing structure respectively. The two can sometimes be combined. By highlighting the assumptions and limitations of each algorithm, we point out their validity and directions for future research.



Streaming methods exploit specific properties of the sources to decompose monaural mixtures into auditory stream units. Source separation can then be achieved by masking [39] the units into sources, provided each unit belongs to a single source only. Grouping rules include proximity, similarity, continuity, closure, and common fate [12]. These rules are based on the assumption that the sinusoidal partials of one source tend to have harmonic frequencies, a smooth spectral envelope, close onset or offset times, and correlated amplitude and frequency variations. This is also why most CASA methods take the sinusoidal music model as the basis of their analysis [13]. A CASA system tries to separate mixtures of sound in the same way that humans do, which is where it differs from the field of BSS. It is widely agreed that CASA algorithms generally consist of four steps: (1) transforming the mixture into a front-end representation such as a correlogram, which is sometimes replaced by the STFT magnitude for simplicity; (2) extracting a collection of sinusoidal partials from this representation according to period or principal component magnitude information; (3) iteratively re-organizing the partials into sources by certain grouping rules; and (4) extracting the sources by binary masking [16]. Maher addressed the problem of harmonic collision directly with sinusoidal models and made a best guess of the note(s) in each source pattern [40]. Baumann employed a similar approach, using a constant-Q sinusoidal analysis to convert musical recordings into MIDI note streams [54]. Researchers are still exploiting additional prior knowledge from the original score to guide the extraction of the precise parameters of a given piece of music, an approach referred to as “signal and knowledge processing”. Based on the common onset of the partials and pre-trained

timbre models, which describe the evolution of the spectral envelope, sources can be separated by grouping the tracks of the same instrument [23]. The timbre models are time-frequency (T-F) templates describing the spectral shape and its evolution in time. Methods based on the sinusoidal model extract sinusoidal tracks from a T-F representation of the signal and then apply grouping rules to assign these tracks to different sources [47, 48, 58]. They are typically used for monaural signals and may adopt psychoacoustic cues such as loudness [17]. The sinusoidal model provides a high-quality representation of pseudo-stationary musical signals. Another advantage is that different audio signals can easily be produced by varying the model parameters. Kashino et al. use a Bayesian network in music scene analysis to group perceptual events according to note, chord, rhythm, and timbre knowledge, but the output of their system can only be a MIDI score or a reconstructed source sound [24]. The main errors come from misidentification of instrument and pitch, because the hypothesis network can only have a hierarchical tree structure. Zhang et al. obtain the average harmonic structure (AHS) model from the mixed sound by unsupervised clustering using PCA, and then extract each source according to the learned AHS [44, 58, 59]. They assume that the music is played within a narrow pitch range, which rarely holds in real music. The method does not work on drums and singing voices, because the AHS cannot represent inharmonic sources. In addition, the AHS depends heavily on multipitch estimation, which is itself a difficult problem. Moreover, the separation experiments on mixed music signals cover only two instruments playing different melodies.
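The harmonic sinusoidal model behind these methods also makes it easy to see why partials of different notes collide so often in music. The following is a minimal sketch, assuming ideal harmonic partials at integer multiples of the fundamental. The function names, the 1 Hz tolerance, and the example pitches (220 Hz and 330 Hz, a just-intonation perfect fifth) are illustrative assumptions and are not taken from the cited works.

```python
def partial_frequencies(f0, n_partials):
    """Frequencies (Hz) of the first n_partials harmonics of fundamental f0."""
    return [f0 * h for h in range(1, n_partials + 1)]


def overlapping_partials(f0_a, f0_b, n_partials=10, tol_hz=1.0):
    """List (harmonic_a, harmonic_b, freq_a) for partials closer than tol_hz."""
    hits = []
    for ha, fa in enumerate(partial_frequencies(f0_a, n_partials), start=1):
        for hb, fb in enumerate(partial_frequencies(f0_b, n_partials), start=1):
            if abs(fa - fb) <= tol_hz:
                hits.append((ha, hb, fa))
    return hits


# Two notes a perfect fifth apart (frequency ratio 3:2).
collisions = overlapping_partials(220.0, 330.0)
for ha, hb, freq in collisions:
    print(f"partial {ha} of note A coincides with partial {hb} of note B at {freq} Hz")
```

For this interval every third partial of the lower note coincides with every second partial of the upper note, so a tracker that assigns each spectral peak to a single source has to decide how to share the energy of those peaks.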
Other models trained for a source at a given pitch include the hidden Markov model (HMM) [21, 43], a three-layer generic model for Bayesian estimation [15], a Bayesian harmonic model with perceptually motivated residual priors [16], and a harmonic structure library for the overlapping partials [28]. The harmonic structure, as a spectral basis vector, can also be obtained by unsupervised learning [34]. Some methods model the sources precisely from segments of the mixture where only one source is present, including sparse coding with a fixed dictionary [15], the factorial HMM [2], and instrument timbre templates [46]. However, which instruments are present and their timbre properties must be known in advance. Vincent and Plumbley provide a harmonicity extraction method using generic Bayesian inference; they acknowledge the difficulty of clustering the components into sources and bypass it [15]. Combining unsupervised learning with the sinusoidal model, Virtanen offers an alternative that uses least-squares fitting to interpolate between the spectral peaks of adjacent frames, based on the temporal continuity of the spectrum [50]. Using advanced features such as onset duration, frequency modulation, and the spectral envelope, these algorithms greatly improve segregation performance compared to HMM or spectral decomposition methods. They work well on the learned instruments, but different recordings of the same instrument may change the model parameters. In addition, the learning process requires that solo passages be present in the music.
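The last step of the CASA pipeline described in this section, binary masking of a T-F representation, can be sketched as follows. This is only an illustration under strong assumptions: the two "instruments" are synthetic sinusoids, the STFT helper and all parameter values are hypothetical, and the mask is an oracle computed from the true sources, whereas a real system would have to estimate it through the grouping rules.

```python
import numpy as np


def stft(x, frame_len=1024, hop=512):
    """Hann-windowed STFT; returns a complex array of shape (bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)


fs = 8000                                        # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                           # one second of signal
source_a = np.sin(2 * np.pi * 440.0 * t)         # stand-in for instrument A
source_b = 0.8 * np.sin(2 * np.pi * 1000.0 * t)  # stand-in for instrument B
mixture = source_a + source_b

A, B, X = stft(source_a), stft(source_b), stft(mixture)

# Oracle binary mask: a T-F cell is assigned to A where A dominates B.
mask_a = np.abs(A) > np.abs(B)
est_a = X * mask_a        # cells assigned to source A
est_b = X * ~mask_a       # remaining cells go to source B

rel_err_a = np.linalg.norm(est_a - A) / np.linalg.norm(A)
print("relative spectrogram error for source A:", rel_err_a)
```

Because the two sinusoids barely overlap in frequency, each T-F cell is dominated by one source and the mask recovers both almost exactly. With real instruments that share partials, this disjointness assumption is violated and the mask introduces audible artifacts.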



De-mixing methods generally rely on some form of independence of the sources, along with other assumptions. De-mixing can be done in the spatial domain, the time domain, or the spectral domain. The commonly used ICA technique decomposes the observation vector into

statistically independent variables by means of a de-mixing matrix. ICA cannot be applied directly to a monaural time-domain signal, because standard ICA requires that the number of sources not exceed the number of elements in the observation vector. It works well, however, for convolutive mixtures collected by multiple sensors when the number of sources does not exceed the number of observations [1]. Spatial domain methods describe the mixing process over multiple microphones, or the two ears for stereo music, using a probabilistic model. The estimated time-invariant de-mixing matrix is based on the independence of the sources. Since spatial information is important for long-term auditory streaming, others have proposed rearranging a monaural mixture into a multichannel measurement according to spatial and spectro-temporal cues. “These hybrid models also address the limitations of conventional multichannel source separation methods on reverberant or under-determined mixture” [16]. Geometric source separation (GSS) combines convolutive blind source separation with adaptive beamforming of a linear uniform array to separate sources by azimuth [25]. If the sources are assumed to be nearly disjoint in the T-F plane, they can be separated using T-F binary masks. The Inter-channel Intensity Difference (IID) and Inter-channel Time Difference (ITD) are used to derive the optimal mask from the source azimuths [4, 19]. However, these methods fail when sources overlap in azimuth, and binary masks lead to “burbling” noise in the de-mixed sources [4]. These models improve separation performance with direction masking under certain conditions at a small computational cost, but they appear to remain proofs of concept rather than approaches optimized for speed or performance. Time domain methods are those that use neither spatial nor spectral information about the mixing process. Moreover, a method that uses no prior information about the sources is called Blind Source Separation (BSS) [6].
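To make the idea of a de-mixing matrix concrete, here is a minimal sketch for the simplest case: an instantaneous (not convolutive) mixture of two sources observed on two channels. It is not the algorithm of any cited work. The observations are whitened, and a grid search over a rotation angle then maximizes the sum of squared excess kurtosis as a simple non-Gaussianity contrast. The sources, the mixing matrix, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def whiten(x):
    """Center x (channels, samples) and transform it to identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x))
    return (eigvecs / np.sqrt(eigvals)).T @ x


def excess_kurtosis(y):
    """Excess kurtosis of each row, assuming zero mean and unit variance."""
    return np.mean(y ** 4, axis=1) - 3.0


def demix_two_channels(x, n_angles=180):
    """Whiten, then pick the rotation that makes the outputs least Gaussian."""
    z = whiten(x)
    best_score, best_y = -np.inf, None
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = np.array([[c, s], [-s, c]]) @ z
        score = np.sum(excess_kurtosis(y) ** 2)
        if score > best_score:
            best_score, best_y = score, y
    return best_y


rng = np.random.default_rng(0)
n = 5000
sources = np.vstack([
    np.sin(2 * np.pi * 0.013 * np.arange(n)),   # tonal, non-Gaussian source
    rng.uniform(-1.0, 1.0, n),                  # noise-like, non-Gaussian source
])
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])     # assumed unknown to the algorithm
observed = mixing @ sources

estimates = demix_two_channels(observed)

# Absolute correlation between each estimate (rows) and each true source (columns).
corr = np.abs(np.corrcoef(np.vstack([estimates, sources]))[:2, 2:])
print(np.round(corr, 3))
```

The sources are recovered only up to order, sign, and scale, which is the usual ambiguity of this kind of separation. The search involves no prior knowledge of the sources, which is what makes the procedure blind.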
BSS relies on the assumption that the source signals are not correlated with each other; for example, the signals may be mutually statistically independent. BSS thus separates a set of mixed signals into another set of signals such that the regularity of each resulting signal is maximized and the regularity between the signals is minimized (that is, statistical independence is maximized). A number of approaches to blind source separation have been proposed [6, 18, 38, 53]. Early works studied the separation of instantaneous mixtures using various frameworks, such as neural networks and information theory [6]. The methods were then extended to convolutive mixtures [53]. Although these approaches come from different areas of interest, they are all built on similar principles. Usually, an iterative algorithm updates the coefficients of a reconstruction system, that is, the de-mixing matrix, by minimizing some error measure or maximizing some information measure. This matrix is then used to separate the sources. The basic assumption of these methods is that the sources are statistically independent. Even though this is not true in general for music, the methods can still perform quite well in certain situations. However, there are drawbacks: they work only for multichannel mixtures, and there can be no more sources than sensors. In addition, for real-world mixtures the FIR de-mixing filters need to be sufficiently long, which makes the computational complexity very high. Spectral domain methods are designed for monaural mixtures. They analyze the mixing structure in the STFT domain with ICA over a time-varying non-Gaussian source model [8, 29, 31, 33, 36]. Independent Subspace Analysis (ISA) decomposes the mixture

power spectrogram into a sum of time-varying typical spectra, so ISA is in effect ICA applied to the magnitude of the STFT. The typical spectra are either learned from solos or estimated from the mixture using ICA [29, 55], positive ICA [52], or NMF [37]. ISA has been used in many monaural separation tasks [8, 10, 11, 20, 29], and even in the MPEG-7 standardization framework [27]. ISA assumes that the magnitude or power spectrogram of the music is a phase-invariant projection of the music signal. By projecting the spectrogram onto source subspaces, ISA constructs source power spectrograms and obtains the source signals by inverse Fourier transform or adaptive Wiener filtering [1]. Good performance has been achieved for solo transcription [8, 11] and for separating percussion from other instruments [26]. However, the ability to separate pitched instruments has not been well explored. An HMM can accurately learn the priors of the logarithmic source power spectra from solos. It gives satisfactory results on speech mixtures, but overfitting can occur on music, as each source may have a large number of hidden states [7]. By contrast, an unconstrained generic source model is more widely applicable. For example, NMF decomposes the mixture short-term magnitude spectrum into a sum of components, each modeled by a fixed magnitude spectrum and a time-varying gain, with no constraints on the spectra and the gains except non-negativity [37]. It has been found effective for decomposing the spectrogram [3, 44, 50]. A sparseness constraint has also been added to the basis vectors [5, 42, 53]. Basis vectors are grouped by the similarity of their marginal distributions [26] or by timbre features [8], which are learned from solos using support vector machines (SVMs) [32], while other methods cluster them manually. These methods perform better on percussion than on pitched instruments or singing voices [44]. Moreover, such a model provides a more natural way to model notes with a nonstationary fundamental frequency.
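The component model described above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the cited works: X is the K-by-T magnitude spectrogram of the mixture, B holds J fixed basis spectra in its columns, and G holds the corresponding time-varying gains in its rows.

```latex
X \approx B\,G,
\qquad
[X]_{k,t} \approx \sum_{j=1}^{J} [B]_{k,j}\,[G]_{j,t},
\qquad
[B]_{k,j} \ge 0,\quad [G]_{j,t} \ge 0 .
```

Under this notation, one possible estimate of the magnitude spectrogram of a single source is the partial sum over the components j that have been assigned to it, which is why the grouping of basis vectors matters for separation.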
Since the magnitude or power spectrogram used in ISA is nonnegative by definition, it is natural to restrict the basis functions to be entry-wise nonnegative. If the sources are restricted to be purely additive, the gains are also restricted to be nonnegative. T-F representations of musical sound are often sparse, which favors a unique decomposition into nonnegative components. Proposed by Lee and Seung [14], NMF computes the factorization by minimizing a reconstruction error measure. It has been used in several unsupervised learning tasks and also in music source separation, where the non-negativity constraint is shown to be sufficient [37]. NMF-based approaches appear in many applications [14, 32, 35, 37]. Some of these methods even regard each note as a source, which may be appropriate for music transcription but works for source separation only in very limited cases. Assuming that the harmonic structure of an instrument remains approximately the same at different pitches, and that there are sections of the measured spectrogram where only a single note of a single source appears, Kim gave a monaural separation method that, rather than using a learning algorithm, selects a few appropriate nonnegative basis vectors using the sparseness of the spectral coefficients in the measured spectrogram [34]. Virtanen studied sound source separation by NMF with temporal continuity and sparseness criteria [50]. The magnitude or power spectrum of each frame of the input mixture is represented by an observation vector that is a weighted sum of basis functions. Some methods estimate both the basis functions and the time-varying weights from the mixed input signal, while others use pre-trained basis functions or prior information about the weight coefficients. The basis sets are assumed to be disjoint, and each

component belongs to one source only. The model is flexible in the sense that it is suitable for representing both harmonic and percussive sounds. Its main advantage is that it makes no strong assumptions about the signal generation mechanism. Sparse coding is an unsupervised learning technique used in vision modeling. If a signal is sparse, it can be represented with a few active elements. Sparse coding has been used for audio signal separation in many papers [41, 42, 45]. De-mixing matrices are estimated by combining the multiplicative update rule with projected gradient descent [14]. Together with a temporal continuity objective, this algorithm was used for drum separation [52]. Shift-invariant non-negative matrix and tensor factorizations have been another active topic for separating sources from single or multiple mixture observations [51]. Frequency-domain shift-invariance allows a pitched instrument to be modeled by the same frequency basis function at different notes, while time-domain shift-invariance allows the temporal evolution to be captured. This was improved by using an additive sinusoidal model in which a pitched instrument is modeled by a set of harmonic coefficients. A source-filter model was also incorporated to allow the timbre to change with pitch. However, the limitation that all the sources need to have the same parameters restricts the problem to score-assisted separation [22]. Varying the parameters for individual instruments would further improve the separation performance of shift-invariant factorization methods.
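The basic NMF decomposition used throughout this section can be sketched with multiplicative updates. The sketch below uses the Euclidean-cost form of the updates on a small synthetic "spectrogram". The cost functions and the extra sparseness or continuity terms of the cited works may differ, and the toy data, function name, and parameter values are assumptions made for illustration.

```python
import numpy as np


def nmf(X, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorize nonnegative X (bins, frames) as B @ G with B, G >= 0."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = X.shape
    B = rng.random((n_bins, n_components)) + eps    # basis spectra (columns)
    G = rng.random((n_components, n_frames)) + eps  # time-varying gains (rows)
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        G *= (B.T @ X) / (B.T @ B @ G + eps)
        B *= (X @ G.T) / (B @ G @ G.T + eps)
    return B, G


# Toy mixture: two fixed spectra whose gains vary over six frames.
B_true = np.array([[1.0, 0.0],
                   [0.5, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.5],
                   [0.25, 0.25]])
G_true = np.array([[1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0, 0.5]])
X = B_true @ G_true

B, G = nmf(X, n_components=2)
rel_err = np.linalg.norm(X - B @ G) / np.linalg.norm(X)

# One magnitude spectrogram per component; they sum to the model B @ G.
component_spectrograms = [B[:, [j]] @ G[[j], :] for j in range(2)]
print("relative reconstruction error:", rel_err)
```

Each component here corresponds to one source only because the toy data were built that way. With real music, several components typically belong to the same instrument, so the grouping step discussed above is still required.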



Despite the improvements mentioned above, music source separation is still a largely unsolved problem, and there are clear shortcomings in the existing algorithms. Future research topics include estimating the number of components, automatic clustering, and better ways to estimate overlapping partials. These may stimulate the development of new models and algorithms that are more efficient at incorporating domain-specific knowledge and useful prior information. Since machine learning can provide more information about the sources, the most significant trend could be mixed approaches that use unsupervised learning.

REFERENCES
[1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001.
[2] A. Ozerov, P. Philippe, R. Gribonval, and F. Bimbot, “One microphone singing voice separation using source-adapted models,” Proc. IEEE WASPAA05, pp. 90–93, 2005.
[3] B. Wang and M. D. Plumbley, “Musical audio stream separation by non-negative matrix factorization,” Proc. DMRN Summer Conf., pp. 23–24, Glasgow, U.K., 2005.
[4] C. Avendano and J.-M. Jot, “Frequency domain techniques for stereo to multichannel upmix,” Proc. AES 22nd VSEA, 2002.
[5] C. Fevotte and S. J. Godsill, “A Bayesian approach for blind separation of sparse sources,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 2174–2188, Nov. 2006.
[6] C. Jutten and J. Herault, “Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture,” Signal Processing, Elsevier, vol. 24, pp. 1–10, 1991.
[7] C. Raphael, “Automatic transcription of piano music,” Proc. ISMIR, 2002.
[8] C. Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from polyphonic music using independent subspace analysis,” Proc. 4th Int. Symp. ICA BSS, pp. 843–848, Nara, Japan, 2003.
[9] D. Ellis, “Prediction-driven computational auditory scene analysis,” Ph.D. dissertation, MIT, 1996.
[10] D. FitzGerald, E. Coyle, and B. Lawlor, “Independent subspace analysis using locally linear embedding,” Proc. DAFx, 2003.
[11] D. FitzGerald, E. Coyle, and B. Lawlor, “Sub-band independent subspace analysis for drum transcription,” Proc. DAFx, pp. 65–69, Hamburg, Germany, 2002.
[12] D. K. Mellinger, “Event formation and separation in musical sound,” Ph.D. dissertation, Stanford University, 1991.
[13] D. L. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms and applications, John Wiley and Sons Ltd, Oct. 2006.
[14] D. Lee and H. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
[15] E. Vincent and M. D. Plumbley, “Single-channel mixture decomposition using Bayesian harmonic models,” Proc. ICA06, pp. 722–730, 2006.
[16] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, “Model-based audio source separation,” Queen Mary Univ. of London, London, U.K., Tech. Rep. C4DM-TR-05-01, 2006.
[17] H. Purnhagen, N. Meine, and B. Edler, “Sinusoidal coding using loudness-based component selection,” Proc. ICASSP2002, Orlando, Florida, USA, pp. 2-2, May 13–17, 2002.
[18] H. Sawada, R. Mukai, S. Araki, and S. Makino, “A robust and precise method for solving the permutation problem of frequency-domain blind source separation,” Proc. ICA, 2003.
[19] H. Viste and G. Evangelista, “On the use of spatial cues to improve binaural source separation,” Proc. DAFx, 2003.
[20] J. Brown and P. Smaragdis, “Independent component analysis for automatic note extraction from musical trills,” J. Acoust. Soc. Amer., vol. 115, pp. 2295–2306, May 2004.
[21] J. Hershey and M. Casey, “Audio-visual sound separation via hidden Markov models,” Proc. NIPS’02, pp. 1173–1180, 2002.
[22] J. Woodruff, B. Pardo, and R. Dannenberg, “Remixing stereo music with score-informed source separation,” Proc. of the 7th ISMIR, 2006.
[23] J. Burred and T. Sikora, “Monaural source separation from musical mixtures based on time-frequency timbre models,” Proc. ISMIR2007, Vienna, Austria, Sept. 2007.
[24] K. Kashino, K. Nakadai, T. Kinoshita, and H. Tanaka, “Application of Bayesian probability network to music scene analysis,” Working notes of IJCAI Workshop on CASA, 1995.
[25] L. Parra and C. Alvino, “Geometric source separation: merging convolutive source separation with geometric beamforming,” IEEE Trans. SAP, vol. 10, no. 6, 2002.
[26] M. Casey and A. Westner, “Separation of mixed audio sources by independent subspace analysis,” Proc. ICMC’00, pp. 154–161, Berlin, Germany, 2000.
[27] M. Casey, “MPEG-7 sound-recognition tools,” IEEE Trans. CSVT, vol. 11, no. 6, pp. 737–747, Jun. 2001.
[28] M. Bay and J. W. Beauchamp, “Harmonic source separation using prestored spectra,” Proc. ICA06, pp. 561–568, 2006.
[29] M. Casey and A. Westner, “Separation of mixed audio sources by independent subspace analysis,” Proc. ICMC, 2000.
[30] M. Every and J. Szymanski, “Separation of synchronous pitched notes by spectral filtering of harmonics,” IEEE Trans. ASLP, vol. 14, no. 5, pp. 1845–1856, 2006.
[31] M. Every, “Separation of musical sources and structure from single-channel polyphonic recordings,” Ph.D. thesis, Department of Electronics, University of York, U.K., 2006.
[32] M. Helén and T. Virtanen, “Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine,” EUSIPCO, Antalya, Turkey, Sep. 2005.
[33] M. K. I. Molla and K. Hirose, “Single-mixture audio source separation by subspace decomposition of Hilbert spectrum,” IEEE Trans. ASLP, vol. 15, no. 3, pp. 893–900, Mar. 2007.
[34] M. Kim and S. Choi, “Monaural music source separation: Nonnegativity, sparseness, and shift-invariance,” Proc. ICA06, pp. 617–624, 2006.
[35] M. Plumbley, S. Abdallah, J. Bello, M. Davies, G. Monti, and M. Sandler, “Automatic transcription and audio source separation,” Cybernetics and Systems, pp. 603–627, 2002.
[36] N. Mitianoudis and M. Davies, “Audio source separation of convolutive mixtures,” IEEE Trans. SAP, vol. 11, no. 5, 2003.
[37] P. Smaragdis and J. Brown, “Non-negative matrix factorization for polyphonic music transcription,” Proc. WASPAA, pp. 177–180, New Paltz, NY, 2003.
[38] P. Vanroose, “Blind source separation of speech and background music for improved speech recognition,” Proc. 24th Symposium on Information Theory, pp. 103–108, May 2003.
[39] R. Weiss and D. Ellis, “Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking,” Proc. SAPA’06, pp. 31–36, Oct. 2006.
[40] R. Maher, “An approach for the separation of voices in composite music signals,” Ph.D. thesis, U Illinois Urbana-Champaign, 1989.
[41] S. Abdallah and M. Plumbley, “An independent component analysis approach to automatic music transcription,” Proc. AES 114th Convention, Amsterdam, The Netherlands, Mar. 2003.
[42] S. Abdallah and M. Plumbley, “Unsupervised analysis of polyphonic music using sparse coding,” IEEE Trans. Neural Network, vol. 17, no. 1, pp. 179–196, 2006.
[43] S. Roweis, “One microphone source separation,” Proc. NIPS’00, Denver, USA, pp. 793–799, Dec. 2000.
[44] S. Vembu and S. Baumann, “Separation of vocal from polyphonic audio recordings,” Proc. ISMIR05, pp. 337–344, 2005.
[45] T. Blumensath and M. Davies, “Sparse and shift-invariant representations of music,” IEEE Trans. ASLP, vol. 14, no. 1, pp. 50–57, Jan. 2006.
[46] T. Kinoshita, S. Sakai, and H. Tanaka, “Musical sound source identification based on frequency component adaptation,” Proc. IJCAI, pp. 18–24, Stockholm, Sweden, 1999.
[47] T. Tolonen, “Methods for separation of harmonic sound sources using sinusoidal modeling,” Proc. AES 106th Convention, Munich, Germany, pp. 3-4, May 1999.
[48] T. Virtanen and A. Klapuri, “Separation of harmonic sound sources using sinusoidal modeling,” Proc. IEEE ICASSP’00, pp. II765–II769, Istanbul, Turkey, June 2000.
[49] T. Virtanen, “Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint,” Proc. DAFx2003, pp. 35–40, 2003.
[50] T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. ASLP, vol. 15, no. 3, pp. 1066–1074, Mar. 2007.
[51] T. Virtanen, “Sound source separation in monaural music signals,” Doctoral dissertation, Tampere University of Technology, 2006.
[52] T. Virtanen, “Sound source separation using sparse coding with temporal continuity objective,” Proc. ICMC, Singapore, pp. 231–234, 2003.
[53] T. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, “Blind source separation of more sources than mixtures using overcomplete representations,” IEEE Signal Process. Lett., vol. 6, no. 4, pp. 87–90, Jun. 1999.
[54] U. Baumann, “Pitch and onset as cues for segregation of musical voices,” Proc. 2nd Int’l Conf. on Music Perception and Cognition, Los Angeles, 1992.
[55] Y. Feng, Y. Zhuang, and Y. Pan, “Popular music retrieval by independent component analysis,” Proc. ISMIR’02, pp. 281–282, 2002.
[56] Y. Li and D. L. Wang, “Separation of singing voice from music accompaniment for monaural recordings,” IEEE Trans. ASLP, vol. 15, no. 4, pp. 1475–1487, May 2007.
[57] Y. Sakuraba and H. Okuno, “Note recognition of polyphonic music by using timbre similarity and direction proximity,” Proc. ICMC, 2003.
[58] Y. Zhang and C. Zhang, “Separation of music signals by harmonic structure modeling,” Proc. NIPS2006, pp. 1617–1624, Vancouver, Canada, Dec. 2006.
[59] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, “Unsupervised single-channel music source separation by average harmonic structure modeling,” IEEE Trans. ASLP, vol. 16, no. 4, pp. 766–778, May 2008.