IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 2, FEBRUARY 2011


Sparse Music Representation With Source-Specific Dictionaries and Its Application to Signal Separation Namgook Cho, Member, IEEE, and C.-C. Jay Kuo, Fellow, IEEE

Abstract—We propose a source-specific dictionary approach to efficient music representation and apply it to the separation of music signals that coexist with background noise such as speech or environmental sounds. The basic idea is to determine a set of elementary functions, called atoms, that efficiently capture music signal characteristics. There are three steps in the construction of a source-specific dictionary. First, we decompose basic components of musical signals (e.g., musical notes) into a set of source-independent atoms (i.e., Gabor atoms). Second, we prioritize these Gabor atoms according to their approximation capability for the music signals of interest. Third, we use the prioritized Gabor atoms to synthesize new atoms and build a compact dictionary. The number of atoms needed to represent music signals using the source-specific dictionary is much smaller than that of the Gabor dictionary, resulting in a sparse music representation. For single-channel music signal separation, we project the mixture signal onto the source-specific atoms. Experimental results are given to demonstrate the efficiency and applications of the proposed approach.

Index Terms—Matching pursuit, musical signal processing, music signal separation, source-specific signal processing, sparse signal representation.

I. INTRODUCTION

Humans are often able to recognize an individual sound from a complex acoustic environment. It was suggested in [1] that the human auditory system might have a highly efficient coding mechanism that extracts the meaningful structure of audio signals for perception and conveys the information to the brain compactly. Generally speaking, redundancy reduction plays an important role in mammalian perceptual processing [2]. Mathematically, we can view this problem as finding a sparse (or compact) representation of an audio signal, namely, a combination of a small number of functions taken from a set of elementary functions. We call these elementary functions atoms, and the set formed by these atoms a dictionary. The technique of sparse signal representation finds applications in numerous audio processing problems such as audio structure analysis [3], automatic music transcription [4], and audio source separation [5].

Manuscript received August 02, 2009; revised December 23, 2009; accepted March 25, 2010. Date of publication April 08, 2010; date of current version October 27, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tomohiro Nakatani. N. Cho is with the Digital Media and Communications R&D Center, Samsung Electronics, Suwon 443-742, Korea (e-mail: [email protected]). C.-C. Jay Kuo is with the Ming Hsieh Department of Electrical Engineering and Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2010.2047810

Methods to represent real-valued audio signals can be roughly classified into two categories: the orthogonal basis expansion and the overcomplete representation. Orthogonal bases, such as Fourier and wavelet bases, provide a complete representation of signals with finite energy. However, this approach does not guarantee a compact representation for certain types of signals [6]. With the overcomplete representation, we find a dictionary of atoms to span the signal space, where the dictionary may include more atoms than necessary, leading to a non-orthogonal, linearly dependent set. For instance, the matching pursuit (MP) algorithm [7] approximates signals using parameterized time-frequency atoms from an overcomplete dictionary. It is known that MP finds a nearly optimal sparse representation of an arbitrary input signal; however, the complexity of the algorithm remains very high in general [8].

One of the aims of this paper is to study the sparse representation of music signals, which leads to the adoption of the overcomplete representation. Specifically, the following two issues for the sparse music representation will be addressed. First, we determine a source-specific dictionary tailored to music signals, assuming that music signals have unique characteristics that differentiate them from speech and environmental sounds [9]. As studied in [10] and [11], music signals tend to contain strong harmonic components, which can be used as prior knowledge for audio signal analysis. These properties will be exploited in the design of music source-specific atoms so that the atoms are highly correlated with the class of music signals but not with other classes of sounds. Second, although the overcomplete representation provides a concise representation of audio signals, it has one shortcoming: its computational complexity is high. It is desirable to reduce the complexity as much as possible.

To find a sparse representation for music signals, we build a source-specific dictionary that captures inherent music characteristics efficiently. There are three steps in the construction of a source-specific dictionary. First, we decompose basic components of musical signals (e.g., musical notes) into a set of source-independent atoms (i.e., Gabor atoms). Second, we prioritize these Gabor atoms according to their approximation capability for the music signals of interest. Third, we use the prioritized Gabor atoms to synthesize new atoms and build a compact dictionary. The number of atoms needed to represent music signals using the source-specific dictionary is much smaller than that of the Gabor dictionary, resulting in a sparse music representation. We apply the proposed representation technique based on source-specific dictionaries to two audio processing problems: approximation of music sounds and music signal separation from single-channel mixtures. Experimental results are given to demonstrate the efficiency and applications of the proposed approach.


The rest of this paper is organized as follows. Related previous work is reviewed in Section II, where the need for source-specific dictionaries is discussed. The three-step process to construct source-specific dictionaries for efficient music representation is described in Section III. The application to music signal separation is presented in Section IV. Experimental results are given in Section V to illustrate the performance of the proposed algorithm in terms of adapting to musical structures, approximation capability, and music signal separation. A possible future extension is discussed in Section VI. Finally, concluding remarks are given in Section VII.

II. REVIEW OF PREVIOUS WORK

In this section, we review two classes of elementary functions (or atoms) for signal representation: 1) source-specific atoms, which can represent a specific class of sounds in a compact way, and 2) source-independent atoms, which are used to approximate all classes of sounds. Both can be applied to audio source separation, but in different ways.

A. Source-Specific Atoms

Optimal auditory filters for different classes of natural sounds were studied by Lewicki in [1]. It was observed that different classes of sounds have different characteristics and, as a result, the derived optimal auditory filters have different shapes. For example, efficient coding of music signals results in sinusoidal filters, whose lengths extend over the entire analysis window, resembling a Fourier representation. In contrast, the coding of environmental sounds yields a set of filters that resemble a wavelet representation, where their amplitude envelopes are localized in time. Since speech signals share properties of music and environmental sounds, their coding yields a representation intermediate between those of music and environmental sounds.

Our research in [12] and [13] aimed to separate music signals from background noise such as speech and/or environmental sounds. Based on the observation in [1], we considered a source-specific representation of audio signals. It is assumed that a single-channel mixture signal can be represented in the following form:

y(t) = x_m(t) + x_s(t)

where x_m(t) denotes the music signal and x_s(t) the speech signal, as depicted in Fig. 1(a). The music subspace S_m and the speech subspace S_s are subsets of the universal audio space denoted by S. Since musical sound and speech have harmonic components in common, there exists some overlap between the two subspaces, as shown in Fig. 1(a). If a finite set of atoms of each subspace is known a priori such that

S_m = span{φ_γ : γ ∈ Γ_m},   S_s = span{φ_γ : γ ∈ Γ_s}   (1)

with music atoms {φ_γ : γ ∈ Γ_m}, speech atoms {φ_γ : γ ∈ Γ_s}, and their index sets Γ_m and Γ_s, respectively, we can extract the desired audio content by projecting the mixture onto the corresponding subspace via

x̂_m = P_{S_m} y,   x̂_s = P_{S_s} y   (2)

where P_{S_m} and P_{S_s} represent the orthonormal projections onto subspaces S_m and S_s, respectively. The main challenge of this approach is to find a finite set of atoms for the subspaces in (1) that is highly correlated with the class of music signals of interest, yet uncorrelated with other classes of sounds (i.e., speech and environmental sounds).

Fig. 1. The representations of a mixture of music and speech signals using (a) source-specific atoms and (b) source-independent atoms.

A similar idea was adopted in [14] for single-channel source separation, where time-domain independent component analysis (ICA) elementary functions were learned from a training dataset a priori and then used to separate unknown test sound sources. This method employed unsupervised learning from arbitrary musical sounds. Some of the learned ICA functions for music signals may be shared by speech, which corresponds to the overlap region in Fig. 1(a). The atoms in this overlap region degrade the source separation performance.

B. Source-Independent Atoms

An alternative approach to separating a source from a sound mixture is to use a set of source-independent atoms, as illustrated in Fig. 1(b), where the universal audio space is spanned by these atoms as

S = span{φ_γ : γ ∈ Γ}

with index set Γ. Source separation can be achieved as follows. First, the mixture is decomposed into source-independent atoms and their coefficients. Then, these atoms are grouped according to some similarity criteria. Afterwards, the grouped atoms are recombined to reconstruct the source signals. For example, the magnitude spectrogram of a single-channel mixture can be decomposed into the product of basis spectra and time-varying gains using ICA [15], [16] or nonnegative matrix factorization (NMF) [17]. Audio separation can then be accomplished by clustering the basis spectra into disjoint sets using a statistical distance measure [15], instrument-specific features [16], or the original sources [17]. Finally, the phase information of the original source is used to resynthesize time-domain estimates of the source. There are, however, several challenges associated with this approach. The clustering process is a nontrivial task, and the phase has to be estimated for the resynthesis process [17], [18]. To tackle these shortcomings of the source-independent representation, we employ a representation based on source-specific atoms and address the overlapping issue by synthesizing new atoms obtained by reorganizing source-independent atoms.
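To make the subspace projection in (2) concrete, the following minimal Python sketch (an illustration added here, not the authors' code) extracts one source by a least-squares projection of the mixture onto the span of a small set of known atoms; the atoms and signals below are toy placeholders.

```python
import numpy as np

def project_onto_subspace(mixture, atoms):
    """Orthogonal projection of `mixture` onto span{atoms}, as in (2).

    mixture : (T,) array, single-channel mixture signal.
    atoms   : (T, K) array whose K columns are atoms of one source subspace.
    """
    coeffs, *_ = np.linalg.lstsq(atoms, mixture, rcond=None)
    return atoms @ coeffs

# Toy example with made-up atoms: two sinusoids standing in for a "music" subspace.
T = 1024
t = np.arange(T) / 11025.0
music_atoms = np.stack([np.sin(2 * np.pi * 440 * t),
                        np.sin(2 * np.pi * 880 * t)], axis=1)
mixture = 0.8 * music_atoms[:, 0] + 0.1 * np.random.randn(T)
music_estimate = project_onto_subspace(mixture, music_atoms)
```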


Fig. 2. Constructing music source-specific atoms and the corresponding dictionary from musical notes.

Fig. 3. (a) Note signal G4 of the clarinet (top) and the decay of |a_m| as a function of the iteration number m (bottom), and (b) the accumulation values of c_m, where the predefined threshold τ is set to 0.99.

III. SPARSE MUSIC REPRESENTATION WITH SOURCE-SPECIFIC DICTIONARIES

In this paper, we adopt the approach of source-specific atoms and dictionaries and look for a sparse representation of recordings of harmonic musical instruments. The main challenge is to reduce the overlap region in Fig. 1(a) as much as possible. Our basic idea for the sparse music representation is to exploit an observation from the short-time Fourier transform (STFT) of music signals: musical notes have a harmonic structure, and most of the energy of a note signal is concentrated in a small set of Fourier functions that correspond to this harmonic structure. We can therefore use a small set of functions to represent music signals efficiently, without using all functions in the dictionary. As depicted in Fig. 2, the proposed scheme consists of three major steps, which are detailed in the following subsections.

A. Music Decomposition With Matching Pursuit

In the first step, we attempt to analyze the essential characteristics of a specific musical instrument from its audio waveforms. These waveforms can be easily obtained from a music database that contains various musical instruments. To analyze music signals, we adopt the overcomplete representation with redundant dictionaries. Its main advantage is that it enables a compact representation of complex signals by capturing essential signal characteristics with a small number of functions. Although signal representation with orthonormal bases such as Fourier or wavelets is simple, music signals might not be efficiently represented by them in a compact (or sparse) manner [6]. In the following, by using MP with a redundant set of Gabor atoms to decompose music signals, the signal energy is spread over a smaller number of atoms than with orthonormal bases, leading to a more compact representation.

Gabor atoms are obtained by dilation, translation, and modulation of a mother function g(t) of the following form [7]:

g_γ(t) = (1/√s) g((t − u)/s) exp(jξt)   (3)

where s, u, and ξ are the scale, position, and frequency parameters, respectively. The energy of the atom g_γ is concentrated in the neighborhood of time u and frequency ξ, with a spread proportional to s in time and to 1/s in frequency. The Gabor dictionary can be expressed by D = {g_γ : γ ∈ Γ}, where the parameter vector γ = (s, u, ξ) is drawn from an index set Γ.

Generally speaking, any representative audio waveforms from the same class of instruments can be used as the input to the first step in Fig. 2. In our implementation, we choose the ith musical note signal x_i as the representative one. We begin by setting the initial residual equal to the input signal, R^0 x_i = x_i. At step m, MP chooses the atom g_{γ_m} that maximizes the correlation with the residual R^{m−1} x_i:

γ_m = arg max_{γ ∈ Γ} |⟨R^{m−1} x_i, g_γ⟩|.   (4)

Then, it calculates the new residual as

R^m x_i = R^{m−1} x_i − ⟨R^{m−1} x_i, g_{γ_m}⟩ g_{γ_m}.   (5)

The note signal can thus be decomposed into a linear combination of M Gabor atoms chosen from the Gabor dictionary plus the residual term R^M x_i, as

x_i = Σ_{m=1}^{M} ⟨R^{m−1} x_i, g_{γ_m}⟩ g_{γ_m} + R^M x_i.   (6)
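As a concrete illustration of the decomposition in (4)–(6), the sketch below runs a basic matching pursuit loop over a small dictionary of unit-norm atoms. It is a simplified illustration (dense correlation search, fixed number of iterations, no FFT speed-up), not the authors' implementation.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter=100):
    """Greedy MP decomposition: signal ~ sum_m a_m * dictionary[:, idx_m].

    dictionary : (T, N) array whose columns are unit-norm atoms.
    Returns a list of (atom index, coefficient) pairs and the final residual.
    """
    residual = signal.astype(float).copy()
    decomposition = []
    for _ in range(n_iter):
        correlations = dictionary.T @ residual            # <R^{m-1}x, g_gamma> for all atoms
        best = int(np.argmax(np.abs(correlations)))       # atom selection, eq. (4)
        coeff = correlations[best]
        residual = residual - coeff * dictionary[:, best]  # residual update, eq. (5)
        decomposition.append((best, coeff))
    return decomposition, residual
```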

B. Atom Prioritization

With the decomposition of signal x_i into M Gabor atoms, we build the approximation

x̂_i = Σ_{m=1}^{M} a_m g_{γ_m}   (7)

where a_m = ⟨R^{m−1} x_i, g_{γ_m}⟩ is the coefficient for atom g_{γ_m}. It is observed that these atoms contribute differently to the signal representation. The top subfigure of Fig. 3(a) shows a clarinet signal representing note G4, which was obtained from the RWC Musical Instrument Sound Database [19] and downsampled to 11 025 Hz. The curve shown in Fig. 3(a) illustrates the decay of the magnitude correlation |a_m| as a function of the iteration number m, which was obtained by decomposing the whole note signal using MP with Gabor atoms¹, as discussed in Section III-A. From the curve, we see that |a_m| decays very fast during the first iterations. These atoms capture the inherent characteristics of the note signal well, and the energy of the residual decreases quickly. Therefore, atoms with a high energy correlation can be viewed as coherent components with respect to the given signal.

¹The parameters of the Gabor atoms used for the decomposition are described in detail in Section V-A.


Fig. 4. Comparison of prioritized atoms distributed in the parameter space (ξ, s): (a) 319 atoms for the clarinet sound; (b) 602 atoms for the drum sequence; and (c) overlapping atoms between the two prioritized dictionaries, with a total number of 61. Note that the scale parameter is logarithmic to the base 2, and each dot represents the center frequency and the scale parameter of a prioritized atom.

On the other hand, |a_m| decays slowly in the later iterations, where the corresponding atoms have low correlation values and the residues behave like random noise with no meaningful structure [7]. These atoms no longer reflect signal properties but simply decrease the energy of the residual. In the second step of Fig. 2, we select the coherent components (or atoms) of a note signal using the high energy correlation. The contribution of Gabor atom g_{γ_m} to the representation of the single note signal can be measured by the normalized squared magnitude correlation defined as

c_m = |a_m|^2 / Σ_{k=1}^{M} |a_k|^2.   (8)

Specifically, we compute the accumulation values of the normalized squared magnitude correlation after sorting the c_m in descending order. Fig. 3(b) shows the accumulation values of c_m with respect to the number of atoms, where the Gabor atoms chosen in the decomposition of the note signal are sorted by their normalized squared magnitude correlation. Then, we select as coherent components the first K Gabor atoms (in sorted order) that satisfy the condition

Σ_{m=1}^{K} c_m ≥ τ   (9)

which forms a small set of Gabor atoms, i.e., a subdictionary D_i ⊂ D with |D_i| ≪ |D|. The subdictionary can capture the harmonic structure of the note signal well. From experiments with various musical instruments, we found that setting the threshold τ to 0.95–0.99 yields good performance in capturing the inherent characteristics of their notes. In Fig. 3(b), we select 385 Gabor atoms by setting the threshold τ to 0.99. It should be emphasized that, since some of the Gabor atoms chosen during the decomposition may have the same scale and frequency parameters but different time shifts, the final set has fewer distinct atoms than 385, say 79 in this experiment².

²In this paper, we take into account only two parameters, i.e., scale and frequency, when constructing a Gabor dictionary. For the translation-invariant property of atoms in time and frequency, we use a fast Fourier transform to compute all scalar products with shifted atoms when finding the best atom. More details are given in Section V-A.
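The selection of coherent atoms in (8) and (9) can be summarized by the following sketch, which sorts the MP coefficients by normalized squared magnitude and keeps the smallest prefix whose accumulated value reaches the threshold τ. It is an illustrative reading of the procedure, assuming the normalization in (8) is taken over all selected coefficients.

```python
import numpy as np

def prioritize_atoms(coeffs, tau=0.99):
    """Select coherent atoms from the MP coefficients (a_1, ..., a_M).

    Implements eqs. (8)-(9): sort by normalized squared magnitude c_m and keep
    the top atoms whose accumulated c_m first reaches the threshold tau.
    Returns the indices of the prioritized atoms (into the original order).
    """
    c = np.abs(coeffs) ** 2
    c = c / c.sum()                                # contributions c_m, eq. (8)
    order = np.argsort(c)[::-1]                    # sort in descending order
    cumulative = np.cumsum(c[order])
    k = int(np.searchsorted(cumulative, tau)) + 1  # smallest K with sum >= tau, eq. (9)
    return order[:k]
```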

Furthermore, we use several signals, which correspond to variations of a musical note, to compute the union of the subdictionaries, each of which is obtained from one variation of the note. For example, the RWC Musical Instrument Sound Database provides three variations for each note of a specific musical instrument. Each variation features an instrument from a different manufacturer played by a different musician, which provides a large variety of sounds. Thus, a collection of the subdictionaries for the ith note of a specific musical instrument can be determined as

D_i = ∪_{v ∈ V_i} D_i^(v)   (10)

where D_i^(v) is the subdictionary obtained from the vth variation of the note and V_i is a finite set of variation indexes. The atoms determined by (10) are called dominant or prioritized atoms, and they represent the characteristics of the note signal. By repeating the above process for various note signals, we can form a prioritized dictionary, a collection of D_1, …, D_I, where I is the number of notes for a specific instrument.

For single-channel music separation from background sounds, we assume that each class of audio sounds can be represented by a prioritized dictionary. If the prioritized dictionaries of the various sources have little overlap with each other, i.e., a small number of commonly shared atoms, separating the music sources by projecting the mixture onto the music prioritized dictionary can be easily accomplished. On the other hand, if the overlap is significant, extracting the desired music source signals from the mixture is less trivial, and the source separation performance degrades significantly due to the overlap region between the audio sources.

Consider a simple example where a test signal is generated by mixing samples of a pitched musical instrument and a drum sequence³. To determine the characteristics of their prioritized atoms in the parameter space, we choose the center frequency ξ of the atoms from the interval [0, 0.5] of normalized frequencies, and the logarithmic (base-2) scale of the atoms from a dyadic range between 1 and the atom length. Figs. 4(a) and (b) show the distributions of the prioritized atoms for the clarinet and the drum, respectively. We see that the atoms of the clarinet sound are distributed in the lower frequency region, as shown in Fig. 4(a), while the atoms of the drum sequence are distributed over a wider range of frequency and scale components, as illustrated in Fig. 4(b).

³The analyzed signals correspond to excerpts from an eight-note melodic recording of clarinet [20] and an acoustic drum sequence [21].


The overlapping atoms between the two prioritized dictionaries are shown in Fig. 4(c); they number 61 in this example. This means that separating the clarinet signal from the mixture using its prioritized dictionary may also extract drum components due to the existence of the shared atoms. The overlapping atoms come from the low-frequency region, in which significant overlap occurs when the two sounds are mixed together. Note that these overlapping atoms represent the overlap region shared by the two different subspaces, as illustrated in Fig. 1(a). One simple way to improve the separation performance is to cluster atoms that come from different source signals into disjoint sets using a priori information, as discussed in Section II-B. In this paper, we instead exploit the harmonic structure of music sounds to address the overlapping issue, as discussed in Section III-C.

C. Source-Specific Atoms and Dictionaries

In Section III-B, we noted that the prioritized dictionaries of different instruments may lead to degraded source separation performance due to the overlap region between them. In the last step in Fig. 2, we address this problem by reorganizing prioritized atoms through linear combinations so as to generate a set of new atoms called source-specific atoms. The harmonic features of musical instruments enable redundancy reduction and a compact representation of source signals with a small number of atoms. For instance, non-percussive music sounds usually consist of a limited number of musical notes (i.e., 12 notes in each octave range). This implies that most of the energy of music signals is concentrated in a small set of atoms. Mathematically, we can express the source-specific atom ψ_i as a linear combination of prioritized atoms in the form of

ψ_i = Σ_{γ ∈ Γ_i} w_γ g_γ   (11)

where Γ_i is the index set of the prioritized atoms in D_i and w_γ is a weighting coefficient reflecting the importance of g_γ in the decomposition of note signal x_i. Note that all prioritized atoms have the same time localization, and the new atom is normalized such that ||ψ_i|| = 1. The entire set of atoms ψ_1, …, ψ_I forms a source-specific dictionary denoted by Ψ. The learning process for source-specific atoms will be detailed in Section V-A, where we will also show that the time–frequency representation of each new atom has a well-organized harmonic structure corresponding to its musical note. The synthesized source-specific atoms will be used for the approximation of real music sounds in Section V-B and for the separation of music signals from a single-channel mixture in Section V-C.
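A minimal sketch of the synthesis in (11) is given below: a source-specific atom is formed as a weighted sum of prioritized Gabor atoms and then normalized to unit energy. The weights are assumed here to be the correlation values obtained during the note decomposition, as described in Section V-A.

```python
import numpy as np

def synthesize_source_specific_atom(prioritized_atoms, weights):
    """Build one source-specific atom psi_i as in (11).

    prioritized_atoms : (T, K) array of prioritized Gabor atoms (same time localization).
    weights           : (K,) array, importance of each atom in the note decomposition.
    """
    psi = prioritized_atoms @ weights      # linear combination of prioritized atoms
    return psi / np.linalg.norm(psi)       # normalize so that ||psi_i|| = 1
```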


It is worthwhile to comment on the difference between the ith musical note signal x_i and the source-specific atom ψ_i. They are close but not identical. Taking the piano as an example, the audio waveforms of the same note from different pianos still vary, which is called intra-class variation. Furthermore, the matching pursuit decomposition of x_i will yield a large number of Gabor atoms, including many with very small weights. In contrast, atom ψ_i is a more robust representation, whose synthesizing Gabor atoms must come from the prioritized subdictionary. Thus, we may view the source-specific atom ψ_i as a denoised version of the musical note signal x_i, where the denoising process increases its robustness against intra-class variation.

We should point out that a similar idea was proposed in [22] to create the so-called harmonic dictionary. A signal model was used, assuming a priori knowledge of musical harmonics, i.e., the harmonicity relation between the frequency of the kth overtone partial and the fundamental frequency, where k is an integer ranging over the number of partials. In that model, each partial of the musical harmonics is represented by one Gabor atom, and a harmonic atom essentially has one peak per partial in its Fourier transform, located around the harmonic frequencies with a common width of the order of 1/s. Recently, Leveau et al. [23] proposed a mid-level music representation for musical instrument recognition, where instrument-specific harmonic atoms are designed specifically to capture timbre information by learning the amplitudes of harmonic partials from individual notes. In contrast, we need no harmonicity assumption to construct source-specific atoms; we obtain the information on the pitch and the strength of the overtone partials by learning from isolated notes.

IV. APPLICATION TO MUSIC SIGNAL SEPARATION

In this section, the source-specific representation discussed in Section III is applied to the music signal separation problem, where a music signal is mixed with different background sounds such as speech or environmental sounds. Here, we assume that the musical instruments being performed are known a priori. Let us assume that an audio mixture signal y(t) consists of a music signal x_m(t) and a background sound x_b(t) as

y(t) = x_m(t) + x_b(t).   (12)

We would like to extract the music signal from the observed mixture y(t), where the music signal will be represented by source-specific atoms. Since a music signal tends to have a broad spectrum, its time–frequency representation often overlaps with that of the background sound. Due to this overlap, it is more difficult to impose the sparsity assumption in the time–frequency plane (i.e., that only one source is active at a time–frequency point) [24]. It is well understood that violating the sparsity assumption results in distortion, known as the musical noise artifact, in the estimated sources. Here, we use the source-specific representation technique to extract the music signal by selecting the best approximating atoms, which have a stronger energy correlation with the music signal than with the background sound in the mixture.

To this end, we project the mixture signal y onto atoms from a music source-specific dictionary Ψ. After initialization by setting R^0 y = y in the matching pursuit technique, we perform the following computation at the nth step.
1) Compute ⟨R^{n−1} y, ψ⟩ for all ψ ∈ Ψ.
2) Select a (near) best atom of the dictionary, which yields the maximum projection value, by

ψ_{i_n} = arg max_{ψ ∈ Ψ} |⟨R^{n−1} y, ψ⟩|.   (13)


3) Compute the new residual according to

R^n y = R^{n−1} y − α_n ψ_{i_n}   (14)

where α_n = ⟨R^{n−1} y, ψ_{i_n}⟩.

Note that musical notes can be easily identified from the mixture by (13), since they yield large projection values. After N steps, the enhanced music signal can be reconstructed as a weighted sum of the source-specific atoms chosen from Ψ by

x̂_m = Σ_{n=1}^{N} α_n ψ_{i_n}   (15)

where α_n is the gain of ψ_{i_n} at the nth step. Finally, the residual background signal can be obtained by x̂_b = y − x̂_m.

V. EXPERIMENTAL RESULTS

In our experiments, source-specific atoms were obtained for each musical instrument from isolated notes in three music databases: the RWC Musical Instrument Sound Database [19], the McGill University Master Samples Library [25], and the University of Iowa Musical Instrument Samples [26]. These acoustic databases cover various musical instruments, providing individual notes of a specific instrument over the entire range of tones that can be produced by that instrument. Test signals were chosen from several excerpts of audio sounds: anechoic recordings of solo performances of monophonic and polyphonic instruments [20], speech signals [27], and environmental sounds [28]. Mixture signals were generated by mixing samples of one or more pitched musical instrument sources with speech or environmental sounds. All sounds used in the experiments were mono and downsampled to 11 025 Hz.

To measure the quality of the reconstructed sound with respect to the original one, the source-to-distortion ratio (SDR), expressed in decibels (dB), was used [29]. SDR is a global performance measure that accounts for three types of distortion: artifacts, interference, and noise. A higher value indicates a better reconstruction with less distortion.

A. Learning Musical Structure and Source-Specific Atoms

This subsection illustrates how to analyze the harmonic structure of music (i.e., the decomposition of musical note signals using MP with Gabor atoms, followed by prioritization) and how to construct music source-specific atoms, as discussed in Section III. We then present results that correspond to each step in Fig. 2. The discrete multiscale Gabor dictionary is the collection of atoms g_γ in (3) whose parameters (s, u, ξ) are sampled on a discrete grid determined by some constants. To analyze real-valued music signals, real Gabor atoms are usually used to construct the discrete Gabor dictionary, instead of the complex-valued atoms in (3). We begin with a source-independent dictionary consisting of real Gabor atoms of the following form [7], [30]:

g_γ(t) = (K_γ / √s) g((t − u)/s) cos(ξt + φ)   (16)

where g(t) is a truncated Gaussian envelope and the normalizing constant K_γ is chosen such that ||g_γ|| = 1. Atoms are characterized by their scale s, time position u, frequency ξ, and phase φ. The truncated Gaussian envelope is used to generate real Gabor atoms with the following discretized parameters.
• The scale s varies between 1 and the atom length L in dyadic steps, i.e., s = 2^j, where 2 is the base. In our experiments, L = 1024 samples (or 92.8 ms), and the atoms have a dyadic scale. Among these scales, the largest three are selected after the decomposition of a musical note to learn the music structure.
• For the learning process, 800 different frequencies are used, spread uniformly over the interval [0, 0.5] of normalized frequencies.
• The phase φ is set to 0.
• The Gabor dictionary is often built by considering the scale, frequency, and time-shift parameters so that its atoms can be translated to any place in the (residual) signal. In order to reduce the computational complexity, an effective MP implementation was proposed in [30], which uses the fast Fourier transform (FFT) to compute all scalar products with shifted atoms. Here, we build the Gabor dictionary with only two parameters (i.e., scale and frequency), while setting the time shift u to 0 for all atoms. At each MP iteration, we compute the best time position of each individual atom with respect to the residual signal. When MP selects the best atom in the decomposition of a residual signal, we use the FFT to compute the time shift of an atom from the correlation between the atom g_γ and the residual signal. That is, we calculate FFT^{-1}(Ĝ_γ* ⊙ R̂), where ⊙ denotes component-wise multiplication between two vectors, and Ĝ_γ and R̂ are the FFTs of the atom and the residual signal, respectively⁴. To sum up, we find the best time positions of all atoms in the Gabor dictionary and select, among these candidates, the atom that maximizes the correlation with the residual signal. This results in a Gabor dictionary of size

|D| = N_s × N_f   (17)

without taking all possible time shifts of the Gabor atoms into account, where N_s and N_f are the number of scales and the number of frequency bins, respectively.

After analyzing three variations of each note using MP with Gabor atoms, we find the prioritized atoms by setting the threshold τ in (9) to 0.99 empirically, as discussed in Section III-B. Then, we use the music signal model in (11) to synthesize a source-specific atom from the set of prioritized atoms.

⁴The approach is similar to the estimation of the cross-correlation between one signal and its time-delayed version to determine the time delay [31].
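The construction of real Gabor atoms in (16) and the FFT-based search for the best time shift can be sketched as follows. This is an assumed reading of the procedure: the cross-correlation between an atom and the residual is computed through FFTs (circular correlation, for simplicity), and the lag with the largest magnitude gives the atom's best time position.

```python
import numpy as np

def real_gabor_atom(length, scale, freq, phase=0.0):
    """Real Gabor atom of eq. (16) with a truncated Gaussian envelope.

    length : atom length in samples (e.g., 1024).
    scale  : scale s in samples; freq : normalized frequency in [0, 0.5].
    """
    t = np.arange(length) - length // 2
    envelope = np.exp(-np.pi * (t / scale) ** 2)
    atom = envelope * np.cos(2 * np.pi * freq * t + phase)
    return atom / np.linalg.norm(atom)        # K_gamma chosen so that ||g_gamma|| = 1

def best_time_shift(atom, residual):
    """Best time position of `atom` w.r.t. `residual` via FFT cross-correlation."""
    n = len(residual)
    corr = np.fft.ifft(np.conj(np.fft.fft(atom, n)) * np.fft.fft(residual, n)).real
    shift = int(np.argmax(np.abs(corr)))
    return shift, corr[shift]
```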


Fig. 5. (a)–(c) show atoms obtained from three variations of clarinet note G4, and (d) is the synthesized atom corresponding to the clarinet note G4, where each top subfigure presents the time-domain waveform and each bottom subfigure shows its time–frequency representation.

The coefficient w_γ in (11) is determined by the correlation value of the prioritized atom g_γ obtained in the decomposition of a note signal. In Fig. 5(a)–(c), we present three atoms and their time–frequency representations, which were obtained from the three variations of clarinet note G4. For each variation, the set of prioritized atoms obtained from that variation is used to synthesize a new atom with the music signal model. With the three sets of prioritized atoms, we can determine a prioritized subdictionary for the note by (10), which consists of a total of 221 Gabor atoms. Fig. 5(d) shows the source-specific atom synthesized from this subdictionary, as discussed in Section III-C. The time–frequency representation of the source-specific atom illustrates that it captures the inherent harmonic structure of the note well. Note that each partial of the note, including its fundamental frequency, is represented not by one Gabor atom but rather by several Gabor atoms with different scales.

Fig. 6(a)–(c) shows the time-domain waveforms and time–frequency representations of atoms obtained from clarinet note G4 at different scales, where the atom length is 1024 samples and the largest three scales, i.e., 8, 9, and 10, are selected for comparison. Each waveform is normalized so that its energy equals one. We observe a clear harmonic structure at all scales. One distinguishing feature is the different duration of the partials in time, e.g., the shortest duration at scale 8. The other feature is the different width of the partials, of the order of 1/s, e.g., the widest at scale 8. We observe in Fig. 6(a) that the energies of the first three partials, including the fundamental frequency, are similar to each other. On the other hand, the first and the third partials in Fig. 6(c) have more energy than the second partial. This information is important in synthesizing the new atoms for a specific musical instrument. For instance, Fig. 8(a) shows that the energy of the second partials of clarinet notes below atom index 20 is weaker than that of the fundamentals and the third partials, which is also observed in the spectrogram of a real clarinet sound in Fig. 11(a). In addition, we observe in Fig. 6(c) that several harmonic partials in the high-frequency region are missing. Thus, a single scale alone is not sufficient to represent the characteristics of a note in our work.


Fig. 6. Source-specific atoms synthesized using different scales: (a) scale 10, (b) scale 9, and (c) scale 8, each built from the prioritized atoms at the corresponding scale; (d) scales 8, 9, and 10 combined.

Fig. 7. Source-specific atoms that correspond to notes C4 and C6 of trumpet and their time-frequency representations, where f denotes the fundamental frequency of the notes.

However, the missing information can be compensated by information from other scales, as shown in Fig. 6(d), where the Fourier transform of the atom has several peaks located around the harmonic frequencies. Note in Fig. 6(d) that each overtone partial, including the fundamental frequency, consists of one or more prioritized atoms, which determines the relative amplitudes of the partials.

Fig. 7 shows source-specific atoms obtained from two note signals of the trumpet and their time–frequency representations. These two isolated notes, C4 and C6, were obtained from the McGill University Master Samples Library. In these examples, given a Gabor dictionary of size 8000, 132 atoms for note C4 and 36 atoms for note C6 were selected as their prioritized atoms. There are 13 overlapping atoms between the two sets of prioritized atoms. With the source-specific atoms, we do not consider each individual Gabor atom but view the entire waveform of a note as one object (i.e., one atom per musical note). For instance, given a mixture of trumpet note C4 and trumpet note C6, MP can extract the source component that comes from trumpet note C4 by choosing the source-specific atom shown in Fig. 7(a) as the best atom.

Fig. 8(a) illustrates the whole source-specific dictionary for the clarinet, where 40 individual notes of the clarinet were used to create the corresponding source-specific atoms.


Fig. 8. Spectra of source-specific atoms for (a) the clarinet and (b) the piano, where each column represents an atom.

Fig. 9. Illustration of various source-specific atoms: (a) atoms obtained from clarinet note G4 (top), piano note G4 (middle), and trumpet note G4 (bottom) and (b) spectra corresponding to these atoms.

All time-domain atoms in the dictionary were transformed to the Fourier transform magnitude domain, i.e., |FFT(ψ_i)|, where the ψ_i are the source-specific atoms. Thus, the x-axis in Fig. 8 represents the atom index and the y-axis shows the frequency components of each atom. Fig. 8(b) presents a source-specific dictionary for the piano, which consists of 50 atoms obtained from 50 individual notes of the piano.

To illustrate the difference between source-specific atoms obtained from various instruments, we used one note signal, G4, from the clarinet, the piano, and the trumpet to construct their source-specific atoms as discussed in Section III. The resultant atoms are shown in Fig. 9(a). For comparison, their time-domain atoms were transformed into the Fourier transform magnitude domain, as shown in Fig. 9(b). It is observed that the pitches of these three atomic spectra are almost identical in frequency, but the strengths of their overtone partials differ. The source-specific atoms are constructed by analyzing the pitch and the strength of the overtone partials from isolated notes of a specific instrument, but not the time evolution of the coarse spectral energy distribution of a sound. This information constitutes the timbre, and it is closely related to the recognition of musical instruments [32]. Note that the information on the partials' amplitudes may be used to represent distinct characteristics of an instrument. In the following subsections, we evaluate the potential of the proposed source-specific representation in applications to music signal approximation and separation.

B. Approximation of Music Sounds

The MP process iteratively constructs an approximant by adding, at each iteration, the atom of the source-specific dictionary that best matches the music signal. Starting with R^0 x = x, we select the best atom with its optimal time position with respect to the residual signal at the nth step.

Fig. 10. Approximation capability of the proposed music representation with source-specific dictionaries: (a) a real clarinet sound and (b) a real piano sound. The original signal (top), the reconstructed signal (middle), and the SDR values as a function of the number of approximating atoms (bottom).

It is worthwhile to point out that we compute the optimal time position of each atom with respect to the residual signal. Specifically, for atom ψ_i, its optimal time position is obtained via the FFT-based cross-correlation described in Section V-A, i.e., by taking the inverse FFT of the component-wise product between the FFTs of the residual signal and the atom, which yields the correlation at all time lags and hence the maximum correlation between the atom and the current residual signal. Then, the atom that maximizes the correlation with the residual among all atoms is selected, and we update the residual signal by removing the component along the selected atom. After N iterations, the music signal is approximated by a linear combination of the source-specific atoms as

x̂ = Σ_{n=1}^{N} α_n ψ_{i_n}   (18)

where α_n is the coefficient for ψ_{i_n}. In our experiments, we employed the classic greedy algorithm on a frame-by-frame basis, where the frame size was set to the atom length and a non-overlapping rectangular moving window of unit height was used. The algorithm was iterated until a desired level of accuracy had been achieved, in terms of the ratio between the current maximum correlation of the best atom and the Euclidean norm of the original signal. Here, this relative energy ratio was set to 0.01.

To evaluate the ability of the proposed representation technique to capture the inherent harmonic structures of real music sounds, a solo clarinet audio signal and a polyphonic piano sound were approximated using source-specific atoms learned from individual notes of the instruments. The top subfigure of Fig. 10(a) shows an excerpt of a recording of a solo clarinet piece that consists of seven different notes, which was obtained from monophonic sounds (CD-quality, 16-bit, 44 100 Hz) [20] and downsampled to 11 025 Hz. The original audio signal was approximated using the source-specific dictionary for the clarinet. To evaluate its approximation performance, the SDR value as a function of the cumulative source-specific atoms is plotted in the bottom subfigure of Fig. 10(a), where we accumulate the source-specific atoms that correspond to the notes in the clarinet signal to show that the atoms can approximate the original signal efficiently.
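The frame-by-frame approximation described above can be summarized by the sketch below, which runs greedy atom selection on non-overlapping frames of the atom length and stops once the best atom's correlation falls below the target energy ratio (0.01 here) times the frame norm. The stopping rule and helper names are illustrative assumptions, and the per-atom time-shift search of the paper is omitted for brevity.

```python
import numpy as np

def approximate_framewise(signal, dictionary, frame_len=1024,
                          energy_ratio=0.01, max_atoms=50):
    """Frame-by-frame greedy approximation with source-specific atoms, cf. eq. (18).

    dictionary : (frame_len, N) array of unit-norm source-specific atoms.
    """
    approx = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len].astype(float)
        residual = frame.copy()
        threshold = energy_ratio * np.linalg.norm(frame)
        for _ in range(max_atoms):
            correlations = dictionary.T @ residual
            best = int(np.argmax(np.abs(correlations)))
            if np.abs(correlations[best]) <= threshold:
                break
            residual -= correlations[best] * dictionary[:, best]
        approx[start:start + frame_len] = frame - residual  # accumulated atom contributions
    return approx
```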


TABLE I COMPARISON FOR MUSIC SIGNAL SEPARATION BETWEEN SOURCE-INDEPENDENT AND SOURCE-SPECIFIC REPRESENTATION APPROACHES. 20, 40, AND 60 INDEPENDENT COMPONENTS WERE USED FOR THE SOURCE-INDEPENDENT REPRESENTATION, WHILE THE SOURCE-SPECIFIC REPRESENTATION EMPLOYED 40 SOURCE-SPECIFIC ATOMS

We see that, after using seven atoms from the clarinet source-specific dictionary that correspond to the seven different notes of the real clarinet signal, the SDR of the reconstructed signal saturates at around 19 dB. This means that seven atoms are enough to capture most of the energy of the original clarinet signal. The middle subfigure of Fig. 10(a) shows the reconstructed signal obtained using only seven atoms from the dictionary. Similarly, we used a real piano recording of Bach's Invention BWV 772 [33] to test the approximation performance on a polyphonic sound. The top subfigure of Fig. 10(b) shows an excerpt of the sound that consists of a 12-note melodic recording of piano. We observe in Fig. 10(b) that 12 different source-specific atoms for the piano can approximate the original signal efficiently, resulting in a reconstructed signal with an SDR of 17 dB.

It is well known that the dictionary size in an iterative greedy algorithm such as MP has a great impact on the computational complexity [30]. There are two major factors that affect the dictionary size as shown in (17): the range of scales N_s and the resolution of frequency bins N_f. Note that the source-specific dictionary for a specific musical instrument always has the same number of atoms due to the fixed number of musical notes in the music database (e.g., |Ψ| = 40 for the clarinet, where | · | returns the cardinality of a set, when a total of 40 clarinet notes is adopted as training sounds). On the other hand, a dictionary in MP usually needs a much larger number of Gabor atoms to represent the clarinet sounds. Therefore, the music representation with source-specific dictionaries offers an important complexity reduction with respect to the classical Gabor dictionary [13].

C. Music Signal Separation

We apply the proposed representation technique with source-specific dictionaries to the single-channel music signal separation problem, as discussed in Section IV. In our experiments, the iterative greedy algorithm was employed with source-specific dictionaries on a frame-by-frame basis, where each frame was set to the same length as the atom size (i.e., 1024 samples) without any overlap. The algorithm was iterated until a target energy ratio between the original signal and the maximum correlation of the best atom had been achieved, which was set to 0.01.

Two different approaches for music signal separation were discussed in Section II.

One adopts source-independent atoms, while the other exploits source-specific atoms that are determined by a training process. Two source-independent algorithms were chosen for comparison with our source-specific algorithm: ISA [15], [16] and NMF [34]. As discussed in Section II-B, we used them for spectral decomposition in the STFT magnitude domain to obtain basis spectra and their gains. Clustering was then performed to achieve source separation. Finally, the phase information of the original sources was used to resynthesize time-domain source estimates. NMF was tested with two different cost functions, by minimizing the Euclidean distance (NMFEUC) and the Kullback–Leibler divergence (NMFDIV) [34]. ISA and NMF were tested by factoring the magnitude spectrogram of the mixture signal into various numbers of independent components. The mixture was generated by mixing a solo clarinet recording and a speech utterance. The original SDR of the mixture signal without any processing is 0.062 dB. For the automatic clustering process in ISA, NMFEUC, and NMFDIV, the standard k-means algorithm with the symmetric Kullback–Leibler metric was used, and the best results were selected after repeating the experiments 50 times.

The quality of the reconstructed signals obtained with the different algorithms is compared in Table I. Note that NMFDIV yields slightly better results than NMFEUC and ISA. The poor performance of these algorithms based on source-independent atoms is probably due to poor results of automatic clustering. It is observed that manual clustering after spectral factorization gives better results than automatic clustering, which implies that it is a nontrivial task to cluster the basis spectra into disjoint sets with respect to all underlying source signals. As the number of independent components in the factorization increases, the performance might improve, but manual clustering becomes too troublesome and unreliable (e.g., the case of 60 independent components in the ISA and NMF methods). On the other hand, even though the proposed scheme assumes that the musical instruments performed are known a priori, it needs neither a clustering nor a resynthesis process, because the time-domain source-specific atoms are tailored to the musical harmonic structure. In Table I, we compare the performance of the source-specific dictionary (SSD) with that of the ISA/NMF methods using 40 independent components, since the source-specific dictionary for the clarinet consists of 40 atoms, as illustrated in Fig. 8(a). We see that the source-specific dictionary approach provides better results for both the reconstructed music and background signals in most cases.
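For reference, the source-independent baseline described above can be sketched as a KL-divergence NMF of the mixture magnitude spectrogram followed by clustering of the basis spectra. The multiplicative updates below are the standard Lee–Seung rules and are not the exact configuration used in the experiments.

```python
import numpy as np

def nmf_kl(V, n_components, n_iter=200, eps=1e-9):
    """Factor a magnitude spectrogram V (F x T) as V ~ W @ H by minimizing the
    generalized Kullback-Leibler divergence with multiplicative updates."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# After factorization, the columns of W (basis spectra) would be clustered
# (e.g., with k-means) into per-source groups, and each source spectrogram
# re-synthesized from its group of components using the mixture phase.
```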


TABLE II MUSIC SIGNAL SEPARATION FOR SOLO MUSICAL INSTRUMENT SOUNDS

Next, we tested a wider range of audio sounds, including musical instrument sounds, speech, and environmental sounds. The results are shown in Table II, where each value gives the quality of the reconstructed music sound with respect to the original one in SDR (dB). For example, the separation performance for the clarinet signal was measured as 10.77 dB from a mixture of clarinet and a speech utterance. Environmental sounds are broadband and nonharmonic, and they typically consist of a mixture of ambient sounds. For example, a street sound was collected outdoors on a city street and contains sounds of moving cars and people speaking in the background. Thus, environmental sounds have unstructured characteristics in the time–frequency domain. Due to the very different signal structures of environmental sounds and pitched harmonic ones, the proposed representation technique with SSD can offer good SDR values for the estimated pitched harmonic sources. As shown in Table II, our method has demonstrated good performance in the cases of a woodwind instrument, environmental sounds, and white noise⁵, except for the cases of polyphonic piano and violin with vibrato. The poor performance on piano sounds could be attributed to the rich sound model, including attack, decay, sustain, and release (known as the ADSR model [35]), as compared with woodwind instruments. The violin is usually played with vibrato, which has frequency and amplitude modulation effects, resulting in poor source separation performance.

Fig. 11(a) presents the spectrograms of the real clarinet sound in Fig. 10(a), which consists of seven different notes, and of its reconstructed signal. The original clarinet sound was approximated by only seven source-specific atoms, as discussed in Section V-B. The harmonic structure of the clarinet sound is successfully extracted by these atoms, as shown in the spectrogram of the reconstructed signal. Fig. 11(b) shows the spectrograms of a mixture signal of clarinet and a male speech utterance, and of the separated clarinet signal. Since the speech signal has a different structure from that of the clarinet sound, we can successfully extract the harmonic structure of the clarinet sound from the mixture signal.

Furthermore, we consider a more challenging case with multiple musical instrument recordings. Here, solo recordings of polyphonic piano and violin with vibrato were used for the mixtures. The results are shown in Table III, where two musical instruments are played simultaneously to generate single-channel mixtures. As compared with the results in Table II, the separation performance degrades due to the similar musical harmonic structure of the pitched musical instruments.

⁵The signal-to-noise ratio (SNR) of the music signal in white noise was 5 dB.

Fig. 11. Spectrograms of signals. (a) Approximation: a real clarinet sound (top) and its reconstructed signal (bottom). (b) Signal separation: a mixture of clarinet and male speech (top) and the extracted clarinet signal (bottom).

TABLE III MUSIC SIGNAL SEPARATION FOR MULTIPLE MUSICAL INSTRUMENT SOUNDS

Finally, we provide an example of music signal separation in a room acoustic environment. We conducted experiments with simulated room recordings of several sources using the room configuration shown in Fig. 12, which consists of one omnidirectional microphone and two loudspeakers. Each monaural recording was simulated by convolving the source signals with room impulse responses generated using the Roomsim toolbox [36]. The reverberation time RT60 of the simulated room was 112 ms. Table IV shows the separation performance for music signals in terms of SDR. As compared with the results on anechoic mixtures in Table II, the performance of the representation technique with SSD in music signal separation degrades significantly. In a reverberant environment, the reflected source signals arrive at the microphone as delayed and attenuated copies of the direct-path source signal, which smears the mixture signal at the microphone across time. For this reason, some source-specific atoms are selected incorrectly when the representation technique with SSD computes the best atoms from the residual signal. Note that the amount of smearing is a function of the reverberation time RT60.

VI. FUTURE EXTENSION: FROM GABOR ATOMS TO CHIRP ATOMS

Due to complex time-varying phenomena such as vibrato, it is difficult to derive a compact representation of oscillatory signals using Gabor atoms. Instead, several constant-frequency Gabor atoms of smaller scales are needed to represent time-varying frequency components such as the oscillation of partials in a note. Chirp atoms have been proposed to deal with the nonstationary behavior of signals [37] and have been used specifically for vibrato sounds [38]. The chirplet decomposition provides local information about the structure of a signal in terms of the scale s, time shift u, frequency shift ξ, and chirp rate c [37] as

g_γ(t) = (1/√s) g((t − u)/s) exp(j(ξ(t − u) + (c/2)(t − u)^2))   (19)

where the instantaneous frequency varies linearly with time.


Fig. 12. Configuration of loudspeakers and a microphone in a room, where recordings were simulated with an absorption coefficient of 0.6 for the room's surfaces.

TABLE IV MUSIC SIGNAL SEPARATION FOR SIMULATED ROOM RECORDINGS WITH RT60 = 112 ms

When the chirp rate c = 0, the chirp atom is exactly the same as the Gabor atom. Given a signal with vibrato, the chirplet decomposition can reconstruct the original signal well with just a few chirp atoms, thus providing a compact representation. We can therefore use chirp atoms, instead of Gabor atoms, in the first step of the proposed representation scheme to learn the inherent structure of oscillatory notes. Since the increasing and decreasing parts of oscillatory partials are decomposed based on their temporal position, the position parameter of the chirp atom should be obtained accordingly. Note that we constructed source-specific atoms by setting the time shift to 0 in Section V-A. After learning, the source-specific dictionary constructed using chirp atoms is expected to contain a small number of atoms due to the fixed number of notes. A modified matching pursuit algorithm, called ridge pursuit, was introduced in [38] as an effective tool for signal decomposition with chirp atoms. We are currently investigating the problem of efficient modeling of vibrato sounds using chirp atoms.
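As an illustration of the chirp atoms in (19), the short sketch below generates a real-valued Gaussian chirp atom whose instantaneous frequency varies linearly with time; setting the chirp rate to zero recovers an ordinary Gabor atom. The discretization is an assumption made for illustration only.

```python
import numpy as np

def chirp_atom(length, scale, center, freq, chirp_rate, phase=0.0):
    """Real Gaussian chirp atom following eq. (19).

    Instantaneous (normalized) frequency: freq + chirp_rate * (t - center).
    With chirp_rate = 0 this reduces to a real Gabor atom.
    """
    t = np.arange(length) - center
    envelope = np.exp(-np.pi * (t / scale) ** 2)
    inst_phase = 2 * np.pi * (freq * t + 0.5 * chirp_rate * t ** 2) + phase
    atom = envelope * np.cos(inst_phase)
    return atom / np.linalg.norm(atom)
```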


VII. CONCLUSION AND FUTURE WORK

A systematic way to find an efficient representation of harmonic music signals based on source-specific dictionaries was presented. The proposed approach extracts the essential features of music signals by modeling their basic components, i.e., musical notes, using source-independent atoms. Due to the efficiency of the resulting source-specific atoms, the number of atoms needed to represent music signals is much smaller than that of the Gabor dictionary, resulting in a lower-complexity algorithm. The proposed scheme was applied to the approximation of music sounds and to music signal separation from a single-channel mixture of multiple sounds.

The proposed technique builds source-specific atoms and dictionaries from Gabor atoms. Since the Gabor atoms chosen in our work are harmonic signals, the resultant source-specific atoms are harmonic as well. For this reason, the proposed algorithm will probably not be effective for nonharmonic musical instrument recordings. We need to find other types of atoms as the building elements. Along this line of thought, it appears interesting yet challenging to obtain speech-specific dictionaries because of special properties of speech signals such as inharmonicity and irregular pitch sweeps. Besides, by following the implementation in [30], we set the phase to 0 during the Gabor dictionary construction in Section V-A. Although this simple implementation works well in our experiments, phase optimization for music signals may speed up the convergence of the signal decomposition by subtracting more energy from the residual signal at each iteration. We plan to investigate the phase optimization technique for highly sparse representation of music signals.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their constructive comments and suggestions, which improved the paper significantly.

REFERENCES

[1] M. S. Lewicki, "Efficient coding of natural sounds," Nature Neurosci., vol. 5, pp. 356–363, Apr. 2002.
[2] H. B. Barlow, Possible Principles Underlying the Transformations of Sensory Messages. Cambridge, MA: MIT Press, 1961.
[3] T. Blumensath and M. Davies, "Sparse and shift-invariant representations of music," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 50–57, Jan. 2006.
[4] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music using sparse coding," IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 179–196, Jan. 2006.
[5] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Comput., vol. 13, pp. 863–882, 2001.
[6] M. M. Goodwin and M. Vetterli, "Matching pursuit and atomic signal models based on recursive filter banks," IEEE Trans. Signal Process., vol. 47, no. 7, pp. 1890–1902, Jul. 1999.
[7] S. G. Mallat and Z. Zhang, "Matching pursuit with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, Dec. 1993.
[8] J. A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
[9] T. Zhang and C.-C. J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. Speech Audio Process., vol. 9, no. 4, pp. 441–457, May 2001.
[10] E. Vincent and X. Rodet, "Underdetermined source separation with structured source priors," in Proc. Int. Conf. Ind. Compon. Anal. Blind Signal Separation, Sep. 2004, pp. 327–332.
[11] E. Vincent and X. Rodet, "Music transcription with ISA and HMM," in Proc. Int. Conf. Ind. Compon. Anal. Blind Signal Separation, Sep. 2004, pp. 1197–1204.
[12] N. Cho, Y. Shiu, and C.-C. J. Kuo, "Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD)," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New Paltz, NY, 2007, pp. 287–290.


[13] N. Cho, Y. Shiu, and C.-C. J. Kuo, "Efficient music representation with content-adaptive dictionaries," in Proc. IEEE Int. Symp. Circuits Syst., Seattle, WA, May 2008, pp. 3254–3257.
[14] G. J. Jang and T. W. Lee, "A probabilistic approach to single channel blind signal separation," in Advances in Neural Information Processing Systems, Vancouver, BC, Canada, Dec. 2002, pp. 1178–1180.
[15] M. A. Casey and A. Westner, "Separation of mixed audio sources by independent subspace analysis," in Proc. Int. Comput. Music Conf., Berlin, Germany, 2000, pp. 154–161.
[16] C. Uhle, C. Dittmar, and T. Sporer, "Extraction of drum tracks from polyphonic music using independent subspace analysis," in Proc. 4th Int. Symp. Ind. Compon. Anal. Blind Signal Separation, Nara, Japan, 2003, pp. 843–848.
[17] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007.
[18] D. FitzGerald, "Automatic drum transcription and source separation," Ph.D. dissertation, Dublin Inst. Technol., Dublin, Ireland, 2004.
[19] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in Proc. Int. Conf. Music Inf. Retrieval, Washington, DC, Oct. 2003, pp. 229–230.
[20] P. Leveau, L. Daudet, and G. Richard, "Methodology and tools for the evaluation of automatic onset detection algorithms in music," in Proc. Int. Conf. Music Inf. Retrieval, Barcelona, Spain, 2004, pp. 72–75.
[21] The Acoustic Drum Sequences. [Online]. Available: http://www.cs.tut.fi/~tuomasv
[22] R. Gribonval and E. Bacry, "Harmonic decomposition of audio signals with matching pursuit," IEEE Trans. Signal Process., vol. 51, no. 1, pp. 101–111, Jan. 2003.
[23] P. Leveau, E. Vincent, G. Richard, and L. Daudet, "Instrument-specific harmonic atoms for mid-level music representation," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 116–128, Jan. 2008.
[24] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830–1847, Jul. 2004.
[25] F. Opolko and J. Wapnick, "McGill University master samples," McGill Univ., Montreal, QC, Canada, Tech. Rep., 1987.
[26] The University of Iowa Musical Instrument Samples Database. [Online]. Available: http://theremin.music.uiowa.edu
[27] D. Schobben, K. Torkkola, and P. Smaragdis, "Evaluation of blind signal separation methods," in Proc. Int. Conf. Ind. Compon. Anal. Blind Signal Separation, Sep. 1999, pp. 11–15.
[28] The BBC Sound Effects Library. [Online]. Available: http://www.soundideas.com/bbc.html
[29] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
[30] P. Jost, P. Vandergheynst, and P. Frossard, "Tree-based pursuit: Algorithm and properties," IEEE Trans. Signal Process., vol. 54, no. 12, pp. 4685–4697, Dec. 2006.
[31] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 4, pp. 320–327, Aug. 1976.
[32] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. New York: Springer-Verlag, 2006.

[33] J. S. Bach, Inventions and Sinfonias BWV 772–786, performed by R. Stahlbrand (piano), recorded in 2007.
[34] D. D. Lee and H. S. Seung, "Algorithms for nonnegative matrix factorization," in Neural Inf. Process. Syst., 2001, pp. 556–562.
[35] C. Roads, The Computer Music Tutorial. Cambridge, MA: MIT Press, 1998.
[36] K. P. D. Campbell and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and testing," Comput. Inf. Syst. J., vol. 9, pp. 48–51, 2005.
[37] A. Bultan, "A four-parameter atomic decomposition of chirplets," IEEE Trans. Signal Process., vol. 47, no. 3, pp. 731–745, Mar. 1999.
[38] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Trans. Signal Process., vol. 49, no. 5, pp. 994–1001, May 2001.

Namgook Cho (S’07–M’10) received the Ph.D. degree in electrical engineering from the University of Southern California (USC), Los Angeles, in 2009. Since 2009, he has been with the Digital Media and Communications R&D Center, Samsung Electronics, Suwon, Korea. His research interests include sparse representations of audio signals, audio source separation, and signal processing and machine learning for content analysis.

C.-C. Jay Kuo (S’83–M’86–SM’92–F’99) received the B.S. degree from the National Taiwan University, Taipei, in 1980 and the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, in 1985 and 1987, respectively, all in electrical engineering. He is currently Director of the Signal and Image Processing Institute (SIPI) and a Professor of Electrical Engineering, Computer Science, and Mathematics at the University of Southern California (USC). His research interests are in the areas of digital image/video analysis and modeling, multimedia data compression, communication and networking, and multimedia database management. He has guided about 100 students to their Ph.D. degrees and supervised 20 postdoctoral research fellows. He is a coauthor of about 175 journal papers, 790 conference papers, and ten books. He is co-Editor-in-Chief for the Journal of Visual Communication and Image Representation, and Editor for the Journal of Information Science and Engineering, LNCS Transactions on Data Hiding and Multimedia Security and the EURASIP Journal of Applied Signal Processing. Dr. Kuo is a Fellow of SPIE. He received the National Science Foundation Young Investigator Award (NYI) and Presidential Faculty Fellow (PFF) Award in 1992 and 1993, respectively. He is also the recipient of the Electronic Imaging Scientist of the Year Award in 2010 and the holder of the 2010–2011 Fulbright–Nokia Distinguished Chair in Information and Communications Technologies.
