IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 11, NOVEMBER 2010
Sparse Representation of Musical Signals Using Source-Specific Dictionaries
Namgook Cho, Member, IEEE, and C.-C. Jay Kuo, Fellow, IEEE
Abstract—The sparse representation of music sounds that consist of a single note at a time was examined in [1]. Here, we extend the results to a more generic setting where music sounds may contain multiple notes (or chords) at the same time. The basic idea is to determine a set of elementary functions, called source-specific atoms, that efficiently capture music signal characteristics. We first decompose basic components of musical signals (i.e., musical notes) into a set of Gabor atoms. Then, these Gabor atoms are prioritized according to their approximation capability to music signals of interest, and the prioritized Gabor atoms are used to synthesize source-specific atoms. To find a sparse representation for musical chords, we generate new atoms by regrouping source-specific atoms. This technique is applied to the approximation of real piano recordings, and its effectiveness in terms of good approximation capability and low computational complexity is demonstrated by experiments.
Index Terms—Gabor atoms, music signal processing, musical chords, source-specific atoms, sparse approximation.
I. INTRODUCTION
It has been conjectured for a long while, e.g., [2], that the human auditory system may have a highly efficient coding mechanism that extracts the meaningful structure of audio signals for perception and conveys the information to the brain compactly. Mathematically, we can view the capability of redundancy reduction as finding a sparse (or compact) representation of an audio signal, which corresponds to the combination of a small number of functions taken from a set of elementary functions. We call these elementary functions atoms, and the set formed by these atoms a dictionary.

Methods to represent real-valued audio signals can be classified into two main categories: 1) the complete representation and 2) the overcomplete representation. Orthogonal bases, such as Fourier and wavelet bases, provide a complete representation of signals with finite energy. However, this approach may not offer a compact representation for some signals, e.g., music recordings [3]. With an overcomplete representation, we find a dictionary of atoms to span the signal space, where the dictionary may include more atoms than necessary, leading to a nonorthogonal, linearly dependent set. For instance, Matching Pursuit [4] can find a nearly optimal sparse representation of an arbitrary input signal using an overcomplete dictionary. However, the complexity of the Matching Pursuit algorithm is in general very high.

Efficient music representation with source-specific atoms for musical notes was studied in [1], and it was used to separate music signals from co-existing speech and/or environmental sounds. In this letter, we extend the work to a more generic setting where music sounds may contain multiple notes (or chords) at the same time. Thus, our research objective is to find the sparse representation of musical chords. The generalization is motivated by the applications of the sparse music representation to compact signal approximation for compression as well as multiple fundamental frequency estimation for transcription.

Our approach can be outlined as follows. We first adopt the overcomplete representation to determine a source-specific dictionary tailored to musical note signals. Specifically, we decompose basic components of musical signals (e.g., musical notes) into a set of Gabor atoms. Then, we prioritize these Gabor atoms according to their approximation capability to music signals of interest. The prioritized Gabor atoms are used to synthesize new atoms, called the note-specific atoms, to build a compact dictionary for music notes. The above two steps are borrowed from [1]. The main contribution of this work lies in finding a sparse representation for musical chords, which is the last step. To achieve this goal, we derive new atoms, called the chord-specific atoms, each of which is a weighted sum of note-specific atoms. The number of note-specific and chord-specific atoms needed to represent music signals is much smaller than the size of the Gabor dictionary, resulting in a lower complexity algorithm.

The rest of this letter is organized as follows. The process to construct source-specific dictionaries for musical notes is reviewed in Section II. The generalization to chord-specific atoms is discussed in Section III. Performance evaluation is presented in Section IV. Finally, conclusions are given in Section V.

Manuscript received June 21, 2010; revised August 09, 2010; accepted August 16, 2010. Date of publication September 02, 2010; date of current version September 20, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Constantine L. Kotropoulos. The authors are with the Ming Hsieh Department of Electrical Engineering and Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2010.2071864

II. SPARSE REPRESENTATION OF MUSICAL NOTES
We review the work in [1], which attempts to find a sparse representation of musical note signals. The proposed scheme consists of three major steps as described below.

1) Signal Decomposition with Gabor Atoms
It analyzes the essential characteristics of a specific musical instrument from its audio waveforms. The waveforms of isolated notes can be easily obtained from a music database that contains various musical instruments. It adopts the Matching Pursuit algorithm with a redundant set of Gabor atoms to decompose musical note signals. Gabor atoms are obtained by dilation, translation, and modulation of a mother function of the following form
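$$g_{s,u,\xi}(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right) e^{i \xi t}$$

where $s$, $u$, and $\xi$ are the scale, position, and frequency parameters, respectively, and $g(t)$ is the mother function. The Gabor dictionary $\mathcal{D}_G$ is the set of Gabor atoms.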
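A real-valued Gabor atom of this kind can be sketched in Python as follows. This is only an illustration: the Gaussian window, the cosine modulation, and the function name `gabor_atom` are assumptions, since the text only states that atoms are parameterized by scale, position, and frequency (plus a phase for real atoms).

```python
import numpy as np

def gabor_atom(length, scale, position, frequency, phase=0.0):
    """Return a unit-norm real Gabor atom of `length` samples.

    scale     -- dilation of the window, in samples
    position  -- centre of the window, in samples
    frequency -- normalized modulation frequency in cycles/sample (0..0.5)
    The Gaussian window is an assumed choice of mother function.
    """
    t = np.arange(length)
    window = np.exp(-np.pi * ((t - position) / scale) ** 2)
    atom = window * np.cos(2 * np.pi * frequency * (t - position) + phase)
    return atom / np.linalg.norm(atom)  # atoms are kept at unit norm

# Example: a 1024-sample atom centred in the frame.
atom = gabor_atom(length=1024, scale=256, position=512, frequency=0.1)
```

Larger values of `scale` give atoms that are longer in time and narrower in frequency, which is why a dictionary usually combines several scales with a grid of frequencies.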
1070-9908/$26.00 © 2010 IEEE
The $k$th note signal $x_k$ of a specific musical instrument can be decomposed into a linear combination of $M$ Gabor atoms chosen among the Gabor dictionary $\mathcal{D}_G$ plus the residual term $R^M x_k$, as

$$x_k = \sum_{m=1}^{M} \alpha_{k,m}\, g_{\gamma_m} + R^M x_k \qquad (1)$$

where $\alpha_{k,m} = \langle R^{m-1} x_k, g_{\gamma_m} \rangle$ is the coefficient for $g_{\gamma_m}$, such that $\|g_{\gamma_m}\| = 1$. Depending upon the magnitude correlation $|\alpha_{k,m}|$ in (1), $g_{\gamma_m}$ has a different contribution to the signal representation. Some atoms are highly correlated with the signal so that they capture inherent characteristics of the note signal well. On the other hand, others with low energy correlation rarely reflect signal properties [4].

2) Identification of Coherent Gabor Atoms
The contribution of Gabor atom $g_{\gamma_m}$ to the representation of the single note signal $x_k$ can be measured by the normalized squared magnitude correlation defined as $c_{k,m} = |\alpha_{k,m}|^2 / \|x_k\|^2$. In this step, it computes the accumulation values of $c_{k,m}$ by sorting them in descending order and selects as coherent components the Gabor atoms that satisfy

$$\sum_{m=1}^{\tilde{M}} c_{k,m} \le \epsilon \qquad (2)$$

where $\tilde{M} \le M$, the $c_{k,m}$ are indexed in sorted order, and $\epsilon$ is a threshold value that can be determined empirically. For example, the value of $\epsilon$ used in our experiments yields good performance (the choice of $\epsilon$ is justified in detail in [1]). The atoms determined by (2), which correspond to a small set of Gabor atoms, are called prioritized atoms that represent the characteristics of the note signal.

3) Generation of Source-Specific Atoms for Note Signals
It generates a new atom for each musical note, called the note-specific atom, by reorganizing the prioritized atoms obtained in (2). The note-specific atom is expressed as a linear combination of prioritized atoms in the form of

$$h_k = \sum_{m \in \Gamma_k} w_{k,m}\, g_{\gamma_m} \qquad (3)$$

where $\Gamma_k$ is a finite set of indices of prioritized atoms for the $k$th note signal, and $w_{k,m}$ is the weighting coefficient according to the importance of $g_{\gamma_m}$ in the decomposition of note signal $x_k$. The coefficient $w_{k,m}$ is determined by the correlation value $\alpha_{k,m}$ of prioritized atom $g_{\gamma_m}$. Note that the new atom is normalized such that $\|h_k\| = 1$. By repeating the above process for various note signals, the entire set of atoms $h_k$, $k = 1, \ldots, K$, where $K$ is the number of notes for a specific instrument, forms a source-specific dictionary for notes, denoted by $\mathcal{D}_N$.

III. SPARSE REPRESENTATION OF MUSICAL CHORDS
In this section, we focus on the sparse representation of musical chords. A musical chord consists of multiple harmonic notes that are played simultaneously. Because a chord is built from many such components, it is awkward to derive a compact representation for the musical chord using Gabor atoms directly. Instead, we exploit the harmonic structure in the musical chord and derive its sparse representation using note-specific atoms obtained in (3). That is, it is expressed as a weighted sum of note-specific atoms in the form of
$$h_c = \sum_{k \in \mathcal{N}_c} \beta_k\, h_k \qquad (4)$$

where $\beta_k$ is chosen such that $\|h_c\| = 1$ and $\mathcal{N}_c$ is a finite set of notes. The most frequently encountered chords in music are triads, which consist of three distinct notes. Thus, we can define 36 classes or chord types according to their sonorities in our model; namely, the major, minor, and diminished chords for each pitch class. Here, we ignore the augmented chords since they rarely appear in Western tonal music. With the synthesis process in (4), a new source-specific dictionary, $\mathcal{D}_C$, is built for all chord types under consideration. Note that further notes may be added to the triads to give extended chords, for example, seventh chords. The extended chords can be easily synthesized in (4) by adding more note-specific atoms.

It is worthwhile to comment on the intra-class variation of musical sounds; i.e., the audio waveforms of the same note or chord played on different pianos vary. Compared to Matching Pursuit, which yields a large number of Gabor atoms including many with very small weights, the source-specific atoms can provide a more robust music representation due to the prioritization (or denoising) process that selects only coherent Gabor atoms for a new atom. The denoising process increases the robustness of the source-specific atoms against the intra-class variation.

We can summarize the computational algorithm to obtain the sparse representation of a music signal $x$ consisting of notes and chords as follows. Here, we assume that the musical instrument is known a priori.

Computational algorithm for sparse music representation
Given a source-specific dictionary, $\mathcal{D} = \{h_\lambda\}_{\lambda \in \Lambda}$ for some set of indices $\Lambda$, and a music signal $x$:
Initialization: Set $R^0 x = x$ and $n = 1$.
Iteration: We perform the following computation at the $n$th step.
1) Compute $\langle R^{n-1} x, h_\lambda \rangle$ for all $\lambda \in \Lambda$.
2) Select the best (or the near best) atom of the dictionary:
$$h_{\lambda_n} = \arg\max_{h_\lambda \in \mathcal{D}} \left| \langle R^{n-1} x, h_\lambda \rangle \right|$$
3) Compute the new residual according to
$$R^n x = R^{n-1} x - \langle R^{n-1} x, h_{\lambda_n} \rangle\, h_{\lambda_n}$$
After $N$ iterations, one gets an $N$-term approximation as $\hat{x} = \sum_{n=1}^{N} \langle R^{n-1} x, h_{\lambda_n} \rangle\, h_{\lambda_n}$.

To give an example, we show the time-domain waveforms and their time-frequency representations of three note-specific atoms in Fig. 1(a)–(c), which were synthesized using three isolated notes of the piano (i.e., C5, E5, and G5). Given these atoms, we can synthesize another source-specific atom for a musical chord, C major, using (4), which is shown in Fig. 1(d). We see that the time-frequency representation of the chord captures the inherent harmonic structures of all composing notes reasonably well. Although the source-specific atom can be expressed as a linear combination of Gabor atoms as well, we do not treat each individual Gabor atom but the entire waveform for a note
Fig. 1. The time-domain waveform (top) and its time-frequency representation (bottom) for source-specific atoms that correspond to (a) note C5, (b) note E5, and (c) note G5 of the piano sound, and (d) chord C major.
or chord as the basic unit. That is, one atom represents a single musical note or chord.

IV. PERFORMANCE EVALUATION
A. Experimental Setup
In our experiments, source-specific atoms were synthesized using isolated notes of the piano obtained from the RWC Musical Instrument Sound Database [5]. The acoustic database provides individual notes of the piano over its entire range of tones. All sounds used in the experiments were mono and downsampled to 11 025 Hz.

To measure the quality of the reconstructed sound with respect to the original one, we calculate the source-to-distortion ratio (SDR) in decibels (dB), which is defined as $\mathrm{SDR} = 10 \log_{10} \left( \|x\|^2 / \|x - \hat{x}\|^2 \right)$, where $x$ is the original signal and $\hat{x}$ is the reconstructed one. Given a dictionary, we employ the classic greedy algorithm that iterates until a desired level of accuracy is achieved. The stopping criterion we adopted is described in terms of the energy ratio between the maximum correlation at the $n$th iteration and the input signal: $|\langle R^{n-1} x, h_{\lambda_n} \rangle|^2 / \|x\|^2 < \delta$, where $R^{n-1} x$ is the current residual, $h_{\lambda_n}$ is the atom selected at the $n$th iteration, $\|\cdot\|$ is the Euclidean norm, and $\delta$ is a predefined threshold, which was set to 0.01 in the experiment.

To analyze real-valued music signals, real Gabor atoms were used to construct a discrete Gabor dictionary [1]. They were characterized by their scale, time-position, frequency, and phase. The scale was dyadic and varied between 1 and the atom length (which was set to 1024 samples). For the learning process, 800 different frequencies were used, which were uniformly spread over the interval of normalized frequencies [0, 0.5]. The phase was set to 0. As a result, this yielded a Gabor dictionary of size 8000 without taking all possible time-shifts of Gabor atoms into account. The time-shift parameters can, however, be computed at each Matching Pursuit iteration. We used the fast Fourier transform to compute the best time
position of each individual atom from the correlation between the atom and the residual signal. To analyze music signals on a frame-by-frame basis, the frame size was set to the atom length and a nonoverlapping rectangular moving window of unit height was used.

We evaluate the capability of the source-specific representation technique in capturing the inherent harmonic structure of music sounds. To this end, we used the Kostka-Payne corpus1, which consists of musical excerpts from the textbook Harmony by Stefan Kostka and Dorothy Payne. It comprises 46 classical symbolic music files with corresponding label files that contain annotated note boundaries as well as note names. The MIDI files were converted into audio files in WAVE format using a free software synthesizer, Timidity++.2 It uses a sample-based synthesis technique to create harmonically rich audio as in real recordings (we used the Merlin Vienna Soundfont to synthesize the MIDI files). After that, based upon the annotated note boundaries, we cut short segments out of the synthesized audio files. The music excerpts were categorized according to the number of notes and/or chords within them. There were five classes: 1) the single note, 2) the single chord, 3) one chord and one note, 4) one chord and two notes, and 5) two chords. The numbers of excerpts for these five classes were 174, 114, 352, 114, and 45, respectively. Thus, the total number of excerpts extracted from these audio files was 799.

1 [Online]. Available: http://www.link.cs.cmu.edu/music-analysis.
2 [Online]. Available: http://timidity.sourceforge.net.

B. Comparison of Approximation Performance
We compare the approximation performance of the source-specific representation, Tree-Based Pursuit [6], and Matching Pursuit on extracted music excerpts in Fig. 2(a). For the proposed method, we constructed the source-specific dictionary of size 304, which contains 88 source-specific atoms for notes (i.e., the number of notes for the piano) and 216 source-specific atoms for chords (i.e., three chord types × 12 notes in an octave × 6 octaves in the piano). In contrast, Matching Pursuit and Tree-Based Pursuit used 8000 Gabor atoms to represent the same music excerpts. The mean values of SDRs are presented together with the corresponding standard deviations, which are denoted by smaller bars. As shown in Fig. 2(a), the approximation performance of the source-specific representation and Tree-Based Pursuit is comparable, while Matching Pursuit offers the best results in the mean SDR value. The excellent performance of Matching Pursuit is achieved at the expense of a high computational complexity. Note that the variance of Tree-Based Pursuit is larger than that of the source-specific representation and Matching Pursuit; the results of Tree-Based Pursuit were not as consistent as those of the other methods.

Fig. 2. Comparison of (a) the approximation performance and (b) the computational complexity between the proposed method (SSD), Tree-Based Pursuit (TBP), and Matching Pursuit (MP) for five chord/note classes, where the number below each category denotes the number of music excerpts used. (c) Comparison of the approximation performance for real piano sounds. The number inside the parentheses is the parameter value used for that method.

C. Comparison of Computational Complexity
It is well known that the dictionary size in an iterative greedy algorithm such as Matching Pursuit has a great impact on the computational complexity [6]. The Gabor dictionary demanded by Matching Pursuit usually needs a large number of atoms to represent piano music (8000 in our experiments). In contrast, the proposed source-specific representation technique has a much smaller number of atoms for piano sounds due to a fixed number of musical notes and chords. Thus, the sparse representation of piano sounds using the source-specific dictionary demands a significantly lower complexity than that using the classical Gabor dictionary. This is confirmed by experimental results3 shown in Fig. 2(b), where we compute the average time required by each method to reach 15 dB (SDR).

D. Evaluation on Real Piano Sounds
Finally, we compare the approximation performance of the proposed method, Tree-Based Pursuit, and Matching Pursuit by testing them on two real piano songs [7], [8].
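As a rough illustration of the greedy approximation and the SDR measure used in these comparisons, the following Python sketch runs a matching-pursuit-style loop over a tiny dictionary of three note atoms and one chord atom. Everything here is a toy under stated assumptions: the note atoms are synthetic decaying tones standing in for learned note-specific atoms, the chord atom uses equal weights, the stopping rule is one reading of the energy-ratio criterion with threshold 0.01, and the names `chord_atom`, `greedy_pursuit`, `sdr_db`, and `toy_note` are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def chord_atom(note_atoms, weights=None):
    """Weighted sum of note atoms, renormalized to unit norm."""
    note_atoms = np.asarray(note_atoms, dtype=float)
    if weights is None:
        weights = np.ones(len(note_atoms))  # equal weights: an assumption
    return unit(np.asarray(weights, dtype=float) @ note_atoms)

def greedy_pursuit(x, dictionary, delta=0.01, max_iter=50):
    """Greedy pursuit over unit-norm atoms stored as rows of `dictionary`."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    approx = np.zeros_like(x)
    energy = float(x @ x)
    chosen = []
    for _ in range(max_iter):
        corr = dictionary @ residual           # inner products with all atoms
        best = int(np.argmax(np.abs(corr)))    # most correlated atom
        if corr[best] ** 2 / energy < delta:   # assumed energy-ratio stopping rule
            break
        approx += corr[best] * dictionary[best]
        residual -= corr[best] * dictionary[best]
        chosen.append(best)
    return approx, chosen

def sdr_db(x, x_hat):
    """Source-to-distortion ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

# Toy data: 1024-sample frames at 11 025 Hz, as in the experimental setup.
fs, n = 11025, 1024
t = np.arange(n) / fs

def toy_note(f0):
    """Synthetic stand-in for a note-specific atom (three decaying partials)."""
    partials = sum(np.sin(2 * np.pi * h * f0 * t) / h for h in (1, 2, 3))
    return unit(np.exp(-8.0 * t) * partials)

notes = np.array([toy_note(f) for f in (523.25, 659.25, 783.99)])  # C5, E5, G5
dictionary = np.vstack([notes, chord_atom(notes)])  # rows 0-2: notes, row 3: chord

rng = np.random.default_rng(0)
x = 2.0 * dictionary[3] + 0.001 * rng.standard_normal(n)  # noisy C major chord
x_hat, chosen = greedy_pursuit(x, dictionary)
print("selected atoms:", chosen, "SDR (dB):", round(sdr_db(x, x_hat), 1))
```

Each iteration costs one inner product per dictionary atom, so a dictionary of 88 + 3 × 12 × 6 = 304 atoms needs far fewer correlations per step than one with 8000 Gabor atoms. In this toy case the chord is captured by a single atom rather than by three separate note atoms.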
In the experiments, the whole recording of the real piano was cut into multiple one-second music excerpts, and each individual excerpt was approximated using the different methods separately. The results were then averaged over all the excerpts. Fig. 2(c) illustrates that Matching Pursuit gives the best performance under one of its parameter settings while Tree-Based Pursuit performs the worst. Note that the proposed method, at its reported setting, shows better approximation performance than Matching Pursuit under its other setting (the parameter values appear in parentheses in Fig. 2(c)). The proposed method used only 304 atoms for the approximation, which is approximately 3.8% of the Gabor atoms used in Matching Pursuit. The poor performance of Tree-Based Pursuit can be explained as follows. Although Tree-Based Pursuit can reduce the complexity significantly by searching for the best atoms in a hierarchical tree structure, the atoms chosen in the tree structure were not well correlated with the harmonic structure of real piano sounds. In contrast, since the source-specific atoms are synthesized so as to be well tailored to the harmonic structure of note signals, the atoms can be highly correlated with the real music signals, which yields better performance than Tree-Based Pursuit.

3 The reported experiments were conducted on the Intel Core 2 Quad processor at 2.5 GHz with 3 GB memory.

V. CONCLUSION AND FUTURE WORK
A systematic way to find a low-complexity sparse representation of harmonic music signals based on source-specific dictionaries was proposed in this work. Performance comparison was provided to demonstrate the effectiveness of the proposed method in approximation and computational complexity. Since the source-specific atoms are harmonic, the proposed method may not be effective for nonharmonic musical instrument recordings (e.g., pop, rock, and rap/hiphop music). In addition, for more complex musical signals (e.g., with timbre or vibrato) or those played by a large number of instruments (e.g., the performance of an orchestra), a low-complexity sparse representation of music sounds is still a challenging problem. In those cases, we need to find other types of atoms as the building elements. For example, chirp atoms can be used to model vibrato sounds.

REFERENCES
[1] N. Cho and C.-C. J. Kuo, “Sparse music representation with source-specific dictionaries and its application to signal separation,” IEEE Trans. Audio, Speech, Lang. Process., to be published.
[2] H. B. Barlow, Possible Principles Underlying the Transformations of Sensory Messages. Cambridge, MA: MIT Press, 1961.
[3] M. M. Goodwin and M. Vetterli, “Matching pursuit and atomic signal models based on recursive filter banks,” IEEE Trans. Signal Process., vol. 47, pp. 1890–1902, 1999.
[4] S. G. Mallat and Z. Zhang, “Matching pursuit with time-frequency dictionaries,” IEEE Trans. Signal Process., vol. 41, pp. 3397–3415, 1993.
[5] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Music genre database and musical instrument sound database,” in Int. Conf. Music Inf. Retrieval, Washington, DC, 2003, pp. 229–230.
[6] P. Jost, P. Vandergheynst, and P. Frossard, “Tree-based pursuit: Algorithm and properties,” IEEE Trans. Signal Process., vol. 54, pp. 4685–4697, 2006.
[7] J. S. Bach, Inventions and Sinfonias BWV 772-786 by R. Stahlbrand (piano), recorded in 2007.
[8] J. Pachelbel, Variations on the Kanon From the Album December by G. Winston (piano), 2001.