IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 5, SEPTEMBER 2006
Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics

Mark R. Every and John E. Szymanski
Abstract—This paper discusses the separation of two or more simultaneously excited pitched notes from a mono sound file into separate tracks. In fact, this is an intermediate stage in the longer-term goal of separating out at least two interweaving melodies of different sound sources from a mono file. The approach is essentially to filter the set of harmonics of each note from the mixed spectrum in each time frame of audio. A major consideration has been the separation of overlapping harmonics, and three filter designs are proposed for splitting a spectral peak into its constituent partials given the rough frequency and amplitude estimates of each partial contained within. The overall quality of separation has been good for mixes of up to seven orchestral notes and has been confirmed by measured average signal-to-residual ratios of around 10–20 dB.

Index Terms—Music note separation, partial extraction, separation of overlapping harmonics.
I. INTRODUCTION
This paper presents a data-driven approach, based upon an analysis in the spectral domain, to separating multiple simultaneously excited pitched notes from a mono recording. An attempt has been made to separate a mix of between two and seven notes into the same number of tracks plus a residual. The notes have approximately equal energies, are initially excited simultaneously and have almost the same duration. This research is ultimately directed at separating a longer recording of an instrumental ensemble into its constituent instrumental parts or melodic lines. Potential applications of the aforesaid “mono-to-multitrack” system are numerous. For example, classic recordings only available in mono could be separated into individual instrumental parts, remastered track by track, and remixed again, potentially even with new instruments. Alternatively, one might want to remove a disturbing cough in a live recording, and this might be achieved by separating the recording into harmonic and residual components. Other applications exist in the areas of effects processing, audio spatialization, restoration, structured compression and coding, and music cataloguing and retrieval. The basic idea in this approach to the problem is that if the pitch of each note in the mix is known in a particular time frame,
Manuscript received September 16, 2004; revised June 30, 2005. This work was carried out when the authors were with the Department of Electronics, University of York, York, U.K. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gerald Schuller.
M. R. Every is with CVSSP, SEPS, University of Surrey, Guildford, GU2 7XH, U.K. (e-mail: [email protected]; [email protected]).
J. E. Szymanski is with the Media Engineering Research Group, Department of Electronics, University of York, Heslington, York, YO10 5DD, U.K. (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSA.2005.858528
then it is possible to identify the harmonics of each note in the spectrum, and then to construct comb-like filters to filter the set of harmonics of each note out of the composite spectrum. Pitch-based separation techniques have already been applied to speech separation and enhancement, for example, in [1]–[3], and [4] reviews a wide range of approaches to sound segregation related to auditory scene analysis. In [1], vocalic speech was separated from a mix of two competing talkers. There, crosstalk arising from spectral peaks that were shared by the two speakers was identified as a cause of degradation in the separated waveforms. This issue becomes even more important when one is attempting to separate out more than two pitched sources, since one would expect many more partials to be overlapping. Furthermore, as the occurrence of overlapping partials is far more common in music than in speech due to the tendency to play harmonically related notes together, the treatment of overlapping harmonics is highly relevant to music source separation. The full separation task consists of 1) detecting all salient spectral peaks while the spectrum typically contains some low-level broadband energy due to noise and spectral leakage, 2) estimating note pitch trajectories over all time frames using a multipitch estimator, 3) matching spectral peaks to note harmonics, and 4) constructing filters to remove the individual note spectra from the mixed spectrum. Although accuracy in the former stages is essential to achieving a realistic separation, we prefer here to emphasise the filtering stage and, in particular, the problem of overlapping harmonics, which is felt to be the area that has been least explored.
To clarify what class of sounds these algorithms have been applied to: by the term “pitched,” it is implied that a note has a perceivable pitch and that it contains most of its energy in harmonics that are at roughly integer multiples of the fundamental frequency. The algorithms have been tested on bassoon, cello, B♭ clarinet, E♭ clarinet, flute, French horn, oboe, piano, saxophone, trombone, and violin samples.

II. METHOD

To begin with, the original mix of the individual notes is split into overlapping time frames, and in each time frame a fast Fourier transform (FFT) is computed on the Hamming-windowed data, where w(n) is the window function. Time frames of 186 ms in length (FFT length N = 8192 samples at a sampling rate of 44.1 kHz) have been used with an 87.5% overlap. F(k) indicates the value of the complex FFT spectrum at frequency bin k, and f_k is the corresponding frequency in Hertz.
Fig. 1. Thresholding and peak picking of the amplitude spectrum |F(k)| using a frequency-dependent threshold th · Ê(k)^γ.
A. Spectral Peak Identification

A reliable method for detecting spectral peaks was necessary for multipitch estimation and to locate harmonics in the spectrum. Peak detection was performed successfully at frequencies up to the Nyquist limit by thresholding |F(k)| with a frequency-dependent threshold th · Ê(k)^γ, where Ê(k)^γ is the shape of the threshold and th is a frequency-independent threshold height. Peak picking was then performed on the resulting thresholded spectrum. The reason for using a variable threshold is that the typical rolloff of harmonic amplitudes at higher frequencies often results in higher harmonics being too small to be detected by applying a constant threshold to |F(k)|. These higher harmonics are, however, very helpful for pitch estimation and are perceptually significant.

Ê(k) was arrived at in the following manner. The smoothed amplitude envelope Ê(k) was calculated by convolving |F(k)| with a normalized Hamming window of odd length. An odd-numbered window length was chosen for symmetry reasons, i.e., the calculation of Ê(k) involves a weighted sum of |F| terms at an equal number of bins on either side of bin k. An alternative method for calculating the spectral envelope is the regularized calculation of the cepstrum coefficients [5]. The Hamming windowing method was preferred due to its computational efficiency, its effectiveness, and the fact that the latter method involves the calculation of a matrix inversion, which was sometimes found to be numerically unstable. The threshold th · Ê(k)^γ was then defined up to the Nyquist limit, where a suitable range for γ is [0.5, 1); a value in this range was used for the results given here. Smaller values of γ produce a flatter envelope, and this helps to avoid spurious peaks being detected in regions of low spectral amplitude. To satisfy scaling invariance, th was scaled in proportion to the overall signal amplitude. Fig. 1 shows the amplitude spectrum
and threshold for a mix of two violin notes with pitches A5 (880 Hz) and E6 (1319 Hz). Next, a search was made to find all local maxima in the spectrum above the threshold. A frequency bin k was considered to be a peak maximum if

|F(k)| > c |F(k ± v_i)| for i = 1, …, L   (1)

where c is in the range (0, 1] and L is the length of the vector v of bin offsets. This peak picking algorithm incorporates the simplest case, v = (1) with c = 1, of checking whether the amplitude in each discrete Fourier transform (DFT) bin is larger than only its nearest neighbors, but it can also be adapted to more noisy spectra by assigning a longer vector to v. The algorithm is not computationally expensive and is easy to implement, although a systematic comparison has not yet been made with other methods for peak picking, such as sinusoidal modeling of the DFT spectrum [6]. Once the spectral peaks had been identified, a refinement was made of the center frequency of each DFT maximum to sub-frequency-bin resolution using a DFT frequency interpolator. At the same time, the peak amplitudes were refined using an amplitude interpolator. A number of interpolation methods were examined [7]: Quinn’s first and second interpolators, Grandke’s interpolator, the quadratic interpolator, the barycentric interpolator, and the DFT method implemented in the software package InSpect [8]. The accuracy of the various DFT interpolators is dependent on the type of windowing applied to the data. It was found that the DFT and Grandke’s methods were both suitable frequency and amplitude interpolation methods for Hamming windowed data, although both methods are in fact more accurate for Hanning windowed data. The DFT method involves the calculation of two FFTs in each time frame, and is not, hence, as computationally efficient as
Fig. 2. Estimated pitch trajectories of two synchronous flute notes played with vibrato using the multipitch estimator. The reference lines show the transcribed note pitches (G5 = 784 Hz and A5 = 880 Hz).
Grandke’s method. However, the DFT method performed marginally better than Grandke’s method in preliminary tests measuring the frequency and amplitude interpolation errors for a sinusoid in white noise, in which the SNR was varied up to 20 dB, and it was, hence, chosen as the preferred interpolator. Another method for estimating peak frequencies [9] follows from minimizing, in a least-squares sense, the difference between the observed spectrum and the first-order limited expansion of the Fourier transform of the window function around each peak.
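As a concrete illustration of the peak-identification stage, the following sketch thresholds the amplitude spectrum with th · Ê(k)^γ, picks local maxima, and refines each peak with the quadratic interpolator (one of the interpolators compared above). The function name and the parameter values th, gamma, and env_len are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def detect_peaks(x, fs, env_len=101, th=5.0, gamma=0.8, L=1):
    """Sketch: threshold |F(k)| with th * E(k)**gamma, where E(k) is
    |F(k)| smoothed by a normalized odd-length Hamming window, then
    keep bins that exceed their L nearest neighbours on both sides."""
    F = np.fft.rfft(x * np.hamming(len(x)))
    mag = np.abs(F)
    w = np.hamming(env_len); w /= w.sum()          # odd length -> symmetric smoothing
    E = np.convolve(mag, w, mode="same")           # smoothed amplitude envelope
    thresh = th * E ** gamma                       # frequency-dependent threshold
    peaks = []
    for k in range(L, len(mag) - L):
        if mag[k] > thresh[k] and all(mag[k] > mag[k - i] and mag[k] > mag[k + i]
                                      for i in range(1, L + 1)):
            # quadratic interpolation of sub-bin frequency and amplitude
            a, b, c = mag[k - 1], mag[k], mag[k + 1]
            d = 0.5 * (a - c) / (a - 2 * b + c)    # sub-bin offset in (-0.5, 0.5)
            peaks.append(((k + d) * fs / len(x), b - 0.25 * (a - c) * d))
    return peaks                                   # list of (freq_Hz, amplitude)
```

For a clean 1-kHz sinusoid at 44.1 kHz, the single detected peak lands within a fraction of a bin of the true frequency; for real mixes the envelope smoothing keeps the low-level leakage skirt below threshold.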
B. Multipitch Estimation

Following spectral peak detection, and given a priori the number of notes in the mix, a multipitch estimator was designed and used to estimate the pitch trajectories of all notes in the mix. As there is typically some variation in pitch over the note duration, for example, due to vibrato, and since the filtering stage is sensitive to slight pitch variations, it was necessary to estimate the pitches of all notes in every time frame. Pitch estimates in individual frames were then combined in such a way as to form smooth note pitch trajectories while ignoring isolated and clearly incorrect pitch estimates. Fig. 2 shows the results of using the multipitch estimator to estimate the pitch trajectories of two flute notes played with vibrato. The multipitch estimator, while being moderately reliable for 2–3 synchronous notes, was markedly worse at higher polyphonies, as discussed in Section IV, and its implementation will not be expanded upon since better note error rates have been reported for another multipitch estimator [10].
C. Estimating the Harmonic Frequencies Once the pitch trajectories of all notes were calculated, then, in a particular time frame, each detected spectral peak could potentially be matched with any single note that contained a harmonic within a limited range about the peak center frequency. A match was not made when more than one harmonic from different notes happened to exist within this range. In this case, we
refer to these as overlapping harmonics, and since their amplitude spectra are likely to be significantly overlapping, the resulting spectral content shared by the harmonics will be called an overlapping peak. To clarify: although, in Fig. 3, two peaks were detected in the peak detection stage, the term “overlapping peak” refers to the entire peak shared by both harmonics. On the other hand, a “nonoverlapping peak” is a spectral peak that is within range of at most one harmonic. Overlapping peaks will be discussed in Section II-E. Following the matching process and an adequate treatment of overlapping peaks, a filter was designed for each note whose effect, when multiplied by the DFT spectrum, was to remove from the spectrum the peaks that were matched uniquely to harmonics of that note, and a portion of the energy in any overlapping peaks that the unmatched harmonics of this note may have contributed to. The result of filtering the composite spectrum with this set of filters was a segmentation of the mixed spectrum into several constituent spectra corresponding to each note, and a residual basically containing the low-level noise envelope of the mixed spectrum and any inharmonic partials. For the time being, we consider nonoverlapping spectral peaks. A spectral peak was matched to a note m if its frequency was within a range ±δ f_{m,h} of any harmonic frequency f_{m,h}, where f_{m,h} is the frequency of the hth harmonic of note m. Typically, δ was chosen to be in the range [0.01, 0.1]. If more than one peak was found within this range of f_{m,h}, the largest peak was matched with note m and the others ignored. The identity f_{m,h} = h f_{0,m} was not used in the above expression for the harmonic frequencies for the following reasons. First, the deviation of harmonic frequencies from exact harmonicity can be quite significant, especially in piano notes, which will be discussed shortly. Second, any inaccuracy in a pitch estimate f_{0,m} would be compounded when multiplying by h to find the hth harmonic frequency. Last, separation results were improved by using nonrigid estimates of the harmonic frequencies, and this is believed to be partly due to the better frequency localization of constituent harmonics in overlapping spectral peaks. The procedure for extrapolating estimates of the f_{m,h} was, first, to determine the frequency of the fundamental frequency component of note m. If there existed a nonoverlapping peak within range of f_{0,m}, then f_{m,1} was set
to the frequency of that peak; otherwise, it was set to f_{0,m}. The remaining harmonic frequencies of note m were calculated iteratively according to

f_{m,h} = f_peak, if a nonoverlapping peak lies within range of f̂_{m,h}; f_{m,h} = f̂_{m,h}, otherwise   (2)

where f̂_{m,h} = (h / (h − 1)) f_{m,h−1} is the predicted frequency of the hth harmonic of note m. As (2) relies on knowing whether the peak is nonoverlapping, and this, in turn, depends on whether multiple harmonics of separate notes have been predicted in the local vicinity of this peak, this iterative process had to be performed concurrently for all pitches. The synchronization of these iterative processes was determined by always applying the next iteration of (2) to the note corresponding to the minimum of the set of predicted frequencies {f̂_{m,h_m}}. To clarify, f̂_{m,h_m} means the predicted frequency of the h_m th harmonic of note m. To begin with, h_m = 1, and h_m is incremented with each iteration for note m. We found that, as exact harmonicity was not enforced above, i.e., the predicted harmonic frequencies were allowed to shift slightly to coincide with nonoverlapping peak frequencies, the above method was flexible enough to detect partials slightly detuned from exact harmonicity. However, in the case of the piano, its partial frequencies are stretched rather more substantially than in most Western instruments, according to [11]

f_{m,h} = h f_{0,m} √(1 + B h²).   (3)

This comes about by physical consideration of the piano string stiffness in the equation of motion for transverse waves in a vibrating bar. B is the inharmonicity coefficient and, for a typical value of B in the middle register, the 17th partial would be shifted to about the frequency the 18th partial would have had, had the note been purely harmonic (B = 0). An iterative method was tested for producing better estimates of the predicted frequencies f̂_{m,h} in (2), based upon (3). A prediction can be made by forming a joint estimate of f_{0,m} and B using a least-squares error minimization of (3) over the first few detected partials, and then substituting f_{0,m} and B back into (3) to find f̂_{m,h}. This improved the tracking mainly of higher piano partials, which become progressively further apart in frequency as h increases. The technique was not used, in general, however, because of the increase in computation time arising from the estimation of B. It predictably yielded estimates of B around zero when applied to other pitched orchestral instruments.

Fig. 3. Filtering of a spectral peak arising from two overlapping harmonics: a) the construction of the filters H_m(k) using (5) and (6) is determined by the predicted harmonic frequencies f̂_1 and f̂_2 and predicted harmonic amplitudes Â_1 and Â_2; b) comparison of the filtered and original amplitude spectra of the individual harmonics.
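The stretched-partial model of (3) can be evaluated directly. In the sketch below, the inharmonicity coefficient B is a hypothetical middle-register value chosen only to reproduce the 17th-to-18th partial example; it is not a value given in the paper:

```python
import math

def stretched_partials(f0, n_harm, B=0.0):
    """Partial frequencies under the string-stiffness model of (3):
    f_h = h * f0 * sqrt(1 + B * h**2).  B = 0 gives exact harmonicity."""
    return [h * f0 * math.sqrt(1.0 + B * h * h) for h in range(1, n_harm + 1)]

# Illustrative (hypothetical) middle-register inharmonicity coefficient:
# the 17th partial lands near where the 18th would sit for B = 0.
B = 4.2e-4
f17 = stretched_partials(261.6, 17, B)[-1]
f18_harmonic = 18 * 261.6
```

A least-squares fit of f0 and B to the first few measured partials, as described above, then amounts to minimizing the squared differences between the measured frequencies and this model.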
D. Filtering of Nonoverlapping Harmonics

The width of a nonoverlapping spectral peak, centered at frequency bin k_p, was found by searching at neighboring frequency bins for the first minima in |F(k)|, at bins k_a and k_b, on opposite sides of the main lobe. If the peak was matched with note m, the amplitude of the filter H_m(k) was set to unity across the entire width of the peak, k_a ≤ k ≤ k_b. Appreciably better results were achieved using these variable width filter notches at the peak frequencies rather than using fixed width filter notches. Thus, in the resynthesis stage, when H_m(k) is multiplied by the DFT spectrum, this has the effect of filtering the entire main lobe of the peak from the original mixed spectrum. Since the time signal is real, it follows that the DFT spectrum is complex conjugate symmetric about the Nyquist frequency, i.e., F(N − k) = F*(k). Thus, frequency components above the Nyquist limit are easily removed from the mixed spectrum by using H_m(N − k) = H_m*(k). However, it is only in (10) that an imaginary component in H_m(k) actually exists. An advantage of the approach used here, as opposed to models in which separated harmonics are synthesized using well-behaved sinusoids [10], [12]–[15], is that, since the amplitude of the filter notches is unity across the width of all the nonoverlapping peaks matched to harmonics, the residual contains at most some traces of the detected harmonics due to sidelobes, which do not tend to be noticeable. This also holds for overlapping peaks due to a normalization (5), which will be discussed in Section II-E. In the case of sinusoidal models, if the harmonics are not well modeled by sinusoids with slowly time-varying amplitude and frequency, and the residual is calculated by subtracting the set of sinusoids from the original waveform, then there could be some leakage of the harmonics into the residual.
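A minimal sketch of the variable-width notch construction: walk outward from the peak bin to the first local minimum on each side, then set the filter to unity across that main lobe (function and variable names are illustrative):

```python
import numpy as np

def unity_notch(mag, k_peak):
    """Find the main-lobe extent of a nonoverlapping peak by walking to
    the first local minima on either side of bin k_peak, and return
    (k_lo, k_hi) plus a filter that is 1 across the lobe, 0 elsewhere."""
    k_lo = k_peak
    while k_lo > 0 and mag[k_lo - 1] < mag[k_lo]:
        k_lo -= 1
    k_hi = k_peak
    while k_hi < len(mag) - 1 and mag[k_hi + 1] < mag[k_hi]:
        k_hi += 1
    H = np.zeros(len(mag))
    H[k_lo:k_hi + 1] = 1.0                  # unit amplitude across the lobe
    return k_lo, k_hi, H
```

In use, one such notch would be added into note m's filter H_m(k) for every nonoverlapping peak matched to one of its harmonics, and the note's spectrum obtained as H_m(k)·F(k).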
E. Filtering of Overlapping Harmonics

Previous approaches to separating overlapping partials in both the speech and music fields include those that rely on sinusoidal models [1], [10], [12]–[15], a perceptually motivated smoothing of the amplitude spectrum of each source [12], linear models for overtone amplitudes [15], spatial mixing models [16], and a multistrategy approach [17]. Of the techniques based upon a sinusoidal model, [1] iteratively subtracts larger amplitude partials to reveal partially hidden weaker partials; [12] and [15] iteratively estimate the phases and amplitudes of closely spaced sinusoids alternately with frequency estimates; and [10], [13], and [14] use pre-estimated sinusoidal frequencies to calculate amplitude and phase estimates. In [1], when two overlapping partials were closely spaced and of comparable amplitude, a linear amplitude interpolation between neighboring harmonics was used to share the peak between the two vocal sources. Amplitude modulation, or beating, resulting from closely spaced sinusoids was used in [10] to resolve the amplitude trajectories of closely spaced sinusoids. A multistrategy approach was employed in [17] for separating duet signals. Beating was exploited if two partials were separated by less than 25 Hz and the duration of the overlap was longer than two beat periods. When the overlap was shorter than two periods, an amplitude interpolation method like the one in [1] was used. When the partials were separated by between 25 and 50 Hz, a linear equations method was used to determine the amplitudes of the two partials given the measured composite spectrum, and partials separated by more than 50 Hz were not considered to be overlapping.
In [14], when the frequency difference between intersecting sinusoids was less than 25 Hz, a linear amplitude and cubic phase multiframe interpolation method was used to interpolate the sinusoids between boundary frames at which the amplitudes and phases of the individual sinusoids could be resolved. Finally, a method was developed in [16] for resolving overlapping partials across multiple time frames, which combined spatial demixing techniques with inference based on the fact that neighboring harmonics of a single note usually have common amplitude and frequency modulation characteristics. The method effectively estimates frequency masks, similar to the filters H_m(k) used here, in the frequency regions where overlaps occur. The technique applies to additive mixing models in which multiple microphones in a room record different mixtures of the sources, where, in general, there are fewer microphones than sources. Here, when dealing with overlapping harmonics, filters were designed to split the spectral content shared by overlapping harmonics into parts using overlapping filters. Three filter designs are proposed for this purpose; the first two are alternative methods for partitioning the energy in a shared spectral peak, and the third uses a model of the sum of DFT spectra to recover the DFTs of the individual harmonics. The filter designs were all dependent on the extrapolated harmonic frequencies f̂_{m,h}, and it was also found beneficial to include some dependency on the predicted amplitudes Â_{m,h} of the harmonics. The Â_{m,h} were predicted using a simple linear interpolation between the amplitudes of the nearest harmonics of each pitch that were matched to nonoverlapping peaks.
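The amplitude prediction described above can be sketched as a simple linear interpolation between the nearest matched nonoverlapping harmonics of the same note; the fallback to the nearest matched harmonic at either end of the harmonic series is an assumption:

```python
def predict_amplitude(h, matched):
    """Predict the amplitude of harmonic h of a note by linear
    interpolation between the nearest matched nonoverlapping harmonics,
    given matched = {harmonic_index: measured_amplitude}."""
    lower = [i for i in matched if i < h]
    upper = [i for i in matched if i > h]
    if lower and upper:
        i0, i1 = max(lower), min(upper)
        t = (h - i0) / (i1 - i0)
        return (1 - t) * matched[i0] + t * matched[i1]
    nearest = min(matched, key=lambda i: abs(i - h))   # fall back to nearest
    return matched[nearest]
```

For example, if harmonics 2 and 4 of a note were matched with amplitudes 1.0 and 0.5, the overlapping harmonic 3 would be predicted to have amplitude 0.75.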
The first energy-based filter design for separating overlapping harmonics assigns to each note a weight that grows with its predicted harmonic amplitude and decays with distance from its predicted harmonic frequency:

H′_m(k) = Â_m g(f_k − f̂_m),   f_a ≤ f_k ≤ f_b   (4)

where f_k is the frequency in Hertz of bin k; g is a nonnegative weighting function that is maximal at zero frequency offset and decays as the offset grows, with a decay rate chosen empirically; and f_a and f_b are the first minima in |F(k)| above and below the set of predicted harmonic frequencies. S is the set of notes that contain a particular harmonic within the overlapping peak and, for appearance, the index pair identifying the harmonic of pitch m that constitutes a part of the overlapping peak has been shortened to m. (4) is followed by a normalization to obtain

H_m(k) = H′_m(k) / Σ_{m′∈S} H′_{m′}(k).   (5)

The second energy-based filter design introduces a dependency on the Fourier transform W(f) of the continuous window function

H′_m(k) = Â_m |W(f_k − f̂_m)|   (6)

and this is, again, normalized using (5) to obtain H_m(k). In practice, we approximate the continuous spectrum W(f) by the DFT of the zero-padded window function (a zero-padding factor of 64 has been used), and f is rounded to the nearest equivalent frequency bin. W(f) describes the shape of the window function in the spectral domain, which is a maximum at f = 0 and decreases as |f| increases. Fig. 3(a) illustrates the shape of the filters H_m(k) obtained using (5) and (6) applied to a spectral peak comprised of two harmonics. Fig. 3(b) compares the filtered amplitude spectra using these filters with the amplitude spectra of the original unmixed harmonics. This is, evidently, a good separation of the overlapping peak into its constituent harmonics.

The above two filter designs, (4) and (6), were proposed simply as a way of splitting the energy in an overlapping peak into parts in a way that reflects the predictions of the amplitudes and frequencies of the constituent harmonics. In ideal conditions in which these predictions are exact, and the peak arises from stationary sinusoids of nearly equal frequency and pre-estimated phase offset, it is possible to separate the overlapping peak almost exactly into its constituent parts using complex filters. Suppose we use the following signal model to describe a cluster of frequencies giving rise to an overlapping peak in the DFT

x(t) = Σ_{j=1}^{J} A_j cos(2π f_j t + φ_j)   (7)

where t = 0 at the start of the current frame and the signal is assumed to be continuous in the range 0 ≤ t < T. Then, it can be shown that the Fourier transform of the continuous
Fig. 4. Filtering of an overlapping spectral peak arising from two windowed sinusoids: a) the mixed peak is the sum of the peaks arising from the individual sinusoids; b) the resulting peak corresponding to the first sinusoid after filtering [filters a, b, and c used (4), (6), and (10), respectively]; c) as in b), but for the second sinusoid.
signal, multiplied by the continuous window function w(t), is a convolution of the individual Fourier transforms and results in

F̃(f) = (1/2) Σ_{j=1}^{J} A_j [e^{iφ_j} W(f − f_j) + e^{−iφ_j} W(f + f_j)].   (8)

Assuming the model is accurate, and apart from an arbitrary constant, F(k) is approximately equal to F̃(f) evaluated at the discrete frequency bins. For the moment, we are observing F(k) in a limited frequency range between the minima on opposite sides of the positive frequency peak, on which the second term in brackets in (8) has very little effect. Thus

F̃(f) ≈ (1/2) Σ_{j=1}^{J} A_j e^{iφ_j} W(f − f_j).   (9)

Suppose that all the A_j and φ_j are known. Then, it is possible to design a filter

H_j(k) = A_j e^{iφ_j} W(f_k − f_j) / Σ_{j′=1}^{J} A_{j′} e^{iφ_{j′}} W(f_k − f_{j′})   (10)

that, when multiplied by F(k), results in approximately the DFT of the windowed sinusoid A_j cos(2π f_j t + φ_j). One could correctly argue that, given these parameters, it would be easier to compute the expected shape of the DFT of the windowed sinusoid and simply subtract it from the overlapping peak in F(k). However, any slight error in these parameters will result in an imperfect subtraction, and leakage of this sinusoid into the residual. A similar effect will occur if the original signal model was at all inaccurate, for instance, due to the sinusoid being only approximately stationary during the frame. The filter design of (10) performs a reasonably good separation when the parameter estimates are only approximately accurate, and as Σ_j H_j(k) = 1 across the width of the peak, any leakage of the peak into the residual is avoided.

Equation (10) is, in practice, applied to our set of overlapping harmonics by using the substitution f_j → f̂_{m,h}. However, we still need a way to predict the complex amplitudes c_j = A_j e^{iφ_j}. Suppose we measure F(k) at J different frequency bins k_1, …, k_J, with equivalent frequencies in Hertz f_{k_1}, …, f_{k_J}, which are chosen to be the nearest frequency bins to the f_j under the condition that no bin is used twice. Then, we obtain a set of J independent linear equations

F(k_i) = (1/2) Σ_{j=1}^{J} W(f_{k_i} − f_j) c_j,   i = 1, …, J   (11)

which can be solved by a matrix inversion to find the c_j. Therefore

[c_1, …, c_J]^T = 2 [W(f_{k_i} − f_j)]^{−1} [F(k_1), …, F(k_J)]^T.   (12)

Once again, W(f) is approximated by the DFT of the zero-padded window function. Fig. 4 compares the result of filtering a spectral peak arising from two synthesized overlapping sinusoids, using the filters in
Fig. 5. Filter shapes calculated using (5) and (6) and the amplitude spectrum of three violins with pitches A5 = 880 Hz, D♭6 = 1109 Hz, and E6 = 1319 Hz.
(4), (6), and (10). Whereas, in Fig. 3, the DFT amplitude was shown, in Fig. 4, the real component of the DFT spectrum is shown to illustrate that (10) was much better than (4) and (6) at resolving the phases of the original sinusoids correctly. A similar observation was made for the imaginary spectra. In Fig. 4, small random errors were added to the known sinusoidal amplitudes and frequencies in (4), (6), and (10) to simulate normal conditions of operation in which these quantities would be estimated imperfectly from the data. Finally, Fig. 5 shows the mixed amplitude spectrum of three violin notes in a single time frame, and Fig. 6 illustrates its separation using (5) and (6) into three source spectra plus a residual spectrum.
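A sketch of the complex-filter construction of (10)–(12), assuming the window spectrum W is supplied as a callable (e.g., a zero-padded window DFT sampled at the nearest pad bin) and one measurement bin is chosen per sinusoid. The 1/2 factor of (9) is left to the caller, so the solved amplitudes come out scaled by 1/2; all names are illustrative:

```python
import numpy as np

def complex_overlap_filters(F, freqs_hz, bins, W, bin_hz):
    """Solve the linear system (11)-(12) for the complex amplitudes
    c_j = A_j * exp(i*phi_j) of J overlapping sinusoids, given the mixed
    DFT F, estimated frequencies, one measurement bin per sinusoid, and
    the continuous-window spectrum W(f).  Returns c and the filters (10)."""
    J = len(freqs_hz)
    M = np.array([[W(bins[i] * bin_hz - freqs_hz[j]) for j in range(J)]
                  for i in range(J)], dtype=complex)
    c = np.linalg.solve(M, [F[k] for k in bins])   # (12), up to the 1/2 factor

    def H(j, k):                                   # filter (10) for sinusoid j
        num = c[j] * W(k * bin_hz - freqs_hz[j])
        den = sum(c[jj] * W(k * bin_hz - freqs_hz[jj]) for jj in range(J))
        return num / den

    return c, H
```

By construction the filters sum to one at every bin, so the overlapping peak is fully partitioned and nothing leaks into the residual even when the parameter estimates are imperfect.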
F. Resynthesis of Separated Notes

In any particular frame, the filtered spectrum of note m was obtained by multiplying the mixed DFT spectrum F(k) by the filter H_m(k). The time waveform of each separated note was synthesized by performing an inverse FFT of the corresponding filtered spectrum, then dividing by the original Hamming window used in the analysis, and then using an overlap-add method with triangular windows to interpolate the resulting time segments between frames. The residual time waveform was produced by subtracting the sum of the separated note waveforms from the original time signal.

III. FILTER PERFORMANCE

To evaluate the relative performance of the filters, they were applied in turn to the task of separating two overlapping sinusoids, with results shown in Fig. 7. The measurements were made as follows. Two sinusoids with a random relative frequency difference in the range [0, 4] frequency bins, each having a random amplitude in the range [0, 1] and a random phase offset in the range [0, 2π), were added together to simulate a random overlapping peak. The robustness of the three filter designs as a function of the error in the amplitude estimates Â or the frequency estimates f̂ was then evaluated by substituting into (4), (6), and (10) either the correct amplitude of both sinusoids and a rough estimate of their frequencies, or vice versa. In the former case, the rough estimates of f̂ were produced by adding to each known sinusoidal frequency a random frequency in the range [−r, r] frequency bins. In the latter case, the rough estimates of Â were produced by multiplying each known sinusoidal amplitude by a random number in the range [1 − r, 1 + r]. We used as a measure of the error between the original and separated DFTs of the two sinusoids the quantity

R = Σ_{j=1}^{2} Σ_k |H_j(k) F(k) − F_j(k)|² / Σ_{j=1}^{2} Σ_k |F_j(k)|²   (13)

where F(k) is the DFT of the windowed mixed sinusoids, F_j(k) is the DFT of the windowed unmixed sinusoid j, and H_j(k) is the filter for sinusoid j. Fig. 7 shows the average value of R over many iterations for each value of r, as r is varied from 0 to 1. It is clear that the separation performances of all filters decrease as the frequency and amplitude estimates decrease in accuracy, i.e., as r increases. The results reveal that when the frequency and amplitude estimates are accurate, the complex filter (10) is the most precise; but as these estimates become more inaccurate, (10) eventually becomes misleading and less robust than (4), and (6) and (10) appear to have an advantage only when the errors in f̂ are less than 0.2 frequency bins. Equation (4) demonstrates the most stable behavior and is the most accurate method when the errors in the frequency estimates are large.
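The resynthesis of Section II-F can be sketched as follows; for simplicity this assumes a hop of half the frame length (rather than the paper's 87.5% overlap), so that the triangular cross-fade windows sum to approximately one, and all names are illustrative:

```python
import numpy as np

def resynthesize(frames, hop):
    """Sketch of Section II-F: each filtered frame is inverse-transformed,
    divided by the analysis Hamming window, and the resulting segments are
    cross-faded with triangular (Bartlett) windows in an overlap-add loop.
    Assumes hop = len(frame) / 2."""
    N = len(frames[0])
    ham = np.hamming(N)                         # analysis window to undo
    tri = np.bartlett(N)                        # synthesis cross-fade window
    out = np.zeros(hop * (len(frames) - 1) + N)
    for i, Fk in enumerate(frames):
        seg = np.real(np.fft.ifft(Fk)) / ham    # undo analysis window
        out[i * hop:i * hop + N] += tri * seg   # triangular overlap-add
    return out
```

Dividing by the analysis window restores each segment's original amplitude, and the triangular cross-fade smooths any frame-to-frame discontinuities introduced by the filtering.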
Fig. 6. Segmentation of the spectrum in Fig. 5 using the filters shown, into three harmonic sources and a residual (note different amplitude scales).
Fig. 7. Filter error R(r) when separating two overlapping sinusoids, as a function of the inaccuracy in a) the sinusoidal amplitude estimates Â and b) the sinusoidal frequency estimates f̂ [filters a, b, and c correspond to (4), (6), and (10), respectively].

IV. RESULTS

The real sound examples used were all western orchestral instrument note samples of length 2–8 s, 16 bits, and sampled at 44.1 kHz. All samples aside from the piano were recorded in an anechoic chamber, although it has been observed that adding small amounts of reverb diminishes the separation quality only slightly. These samples were scaled to have equal mean squared amplitude and summed to produce mixed note samples on which the separation algorithms were applied, allowing the direct comparison of the separated sounds and the original recordings. The most meaningful way to evaluate the performance of the separation algorithms would be based upon perceptual judgement, since the key aim of the algorithms is to achieve perceptually acceptable separation of the sources. Hence, some sound examples have been presented on the internet [18] so that the separated sounds can be compared directly with the unmixed and mixed originals. We also use a quantifiable measure, the signal-to-residual ratio (SRR), to evaluate the similarity between the time waveform s̃_m(n) of each separated note and its corresponding original s_m(n):

SRR_m = 10 log10 [Σ_n s_m(n)² / Σ_n (s_m(n) − s̃_m(n))²] dB.   (14)
The correct matching of the set of original notes to the set of separated notes was achieved by swapping the order of the separated notes until the maximum SRR_m was achieved for each m. The mean signal-to-residual ratio (MSRR) over the M sources is defined as

MSRR = (1/M) \sum_{m=1}^{M} SRR_m   (15)

and the average increase in the sum of SRRs is

\Delta\Sigma SRR / M = (1/M) \sum_{m=1}^{M} [ SRR_m - 10 \log_{10} ( \sum_n x_m(n)^2 / \sum_n (x_m(n) - x(n))^2 ) ]   (16)

where x(n) is the mixed original signal. These quantities measure how well the original sound has been separated into individual notes, with larger values indicating better separation performance. No attempt has been made yet to split the residual waveform any further, and so it is expected that for mixes of notes containing large nonharmonic components, the MSRR and \Delta\Sigma SRR / M should decrease.
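The metrics above are straightforward to implement. The sketch below is our own illustrative Python, assuming the definitions of (14) and (15) as stated; the permutation search is a simple stand-in for the note-swapping step, and the reading of (16) as the gain over using the raw mix as each note's estimate is an assumption.

```python
import numpy as np
from itertools import permutations

def srr_db(x, x_hat):
    # Signal-to-residual ratio (14): original energy over residual
    # energy, in dB.
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def match_and_score(originals, separated):
    # Pair originals with separated notes by trying all orderings and
    # keeping the one with the largest total SRR (a proxy for the
    # paper's swapping procedure), then return the per-note SRRs and
    # their mean, the MSRR of (15).
    best = None
    for perm in permutations(range(len(separated))):
        srrs = [srr_db(x, separated[i]) for x, i in zip(originals, perm)]
        if best is None or sum(srrs) > sum(best):
            best = srrs
    return best, float(np.mean(best))

def mean_srr_increase(originals, separated, mix):
    # Average SRR increase over using the raw mix as each note's
    # estimate -- one plausible reading of (16).
    gains = [srr_db(x, s) - srr_db(x, mix)
             for x, s in zip(originals, separated)]
    return float(np.mean(gains))
```

For two well-separated sinusoidal "notes," the permutation search recovers the correct pairing even when the separated notes arrive in shuffled order, and the mean SRR increase is large and positive.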
EVERY AND SZYMANSKI: SEPARATION OF SYNCHRONOUS PITCHED NOTES
TABLE I SRRs AND \Delta\Sigma SRR / M, FOR THE SEPARATION OF 2–7 SYNCHRONOUS VIOLIN NOTES
A. Separating Harmonically Unrelated Notes

Table I shows the calculated SRRs and \Delta\Sigma SRR / M for sample mixes of between two and seven violin notes, using (4) and (5) to separate overlapping harmonics. Although the separation performances shown are very good, one must take into consideration that these examples mostly consist of notes that are not related to each other by harmonic intervals such as major thirds, fourths, fifths, or octaves. The results should thus be interpreted as showing that the removal of nonoverlapping harmonics with unit-amplitude filters across the width of harmonic peaks is capable of producing high SRRs.

B. Separating Harmonically Related Notes

To measure the effectiveness of the three filter designs proposed for separating overlapping harmonics, three real sample mixes were produced consisting of multiple notes from different instruments, in which the pitches were deliberately chosen to have harmonic relationships, thus resulting in a higher proportion of overlapping harmonics. Three synthesized sound examples were also produced, and the results for the six test samples are shown in Table II. For each sample, the MSRR and \Delta\Sigma SRR / M are given for the three filter designs in Section II-E, and also using 1) no treatment of overlapping harmonics, 2) Parsons' method for splitting overlapping harmonics by interpolating between the amplitudes of neighboring harmonics [1], and 3) the nonlinear least-squares (NLS) method for estimating the parameters of a model of closely spaced sinusoids [13]. For Parsons' method, a filter was used, followed by the normalization in (5). In [13], the NLS method is explained in the context of two overlapping harmonics, but the method can easily be generalized to more than two overlapping harmonics. Nonoverlapping harmonics were treated identically for all methods, as described in Section II-D.

The first synthetic sample is a mix of three synthesized notes. The second synthesized sample is a sum of two notes: the first note has a constant pitch of 440 Hz and the second is a linear glissando between 400 and 480 Hz. In the last synthetic sample, the first note has a constant pitch of 300 Hz, and the second note has a frequency-modulated (FM) pitch centered on 450 Hz with FM amplitude 10 Hz and FM frequency 5 Hz to simulate vibrato. In all the synthetic examples, the first 20 harmonics are present with decreasing harmonic amplitudes, and the exact time-varying harmonic frequency and amplitude trajectories were provided to assess the performance of the filters in ideal conditions.

The audio samples corresponding to Tables I and II are available at [18]. Fig. 8 shows an example of the original, separated, and residual spectrograms obtained when separating a mix of three notes (one of the sample mixes in Table II).

C. Average Separation Results
Finally, results are presented in Table III for the MSRR and \Delta\Sigma SRR / M for polyphonies of 2–5 instruments, each averaged over 100 random sample mixes, and using the same set of methods for separating overlapping harmonics as in Table II. Random mixes were constructed by first selecting a random set of unique instruments from a set of 11 orchestral instrument types (bassoon, cello, B♭ clarinet, E♭ clarinet, flute, French horn, oboe, piano, saxophone, trombone, and violin), and then selecting a note for each instrument randomly from within its complete pitch range. The individual notes were drawn from a set of 479 samples extending in pitch from A0 (27.5 Hz) to C8 (4186 Hz). In initial studies, the multipitch estimator was used. It was able to detect all pitches correctly in a random mix 53.2%, 37.7%, 17.4%, and 4.0% of the time for polyphonies of 2, 3, 4, and 5, respectively, where a correct pitch estimate is defined as one that is within 3%, i.e., half a semitone, of the known transcribed pitch. The mixes for which multipitch estimation was unsuccessful tended to occur when there were strong harmonic relationships between the constituent notes, and for these mixes lower SRRs would be expected due to the greater likelihood of overlapping harmonics. It was found that by averaging SRRs over only those sample mixes for which the multipitch estimator was correct, the average results were biased toward "easy" mixtures, with SRRs approximately 5 dB higher than those given in Table III. Thus, it was decided that the rough pitches would instead be provided in advance, so the results in Table III show the average performance over all random mixes. The multipitch estimator was used only to refine and track each pitch over time within a limited pitch range around the provided rough pitch.
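The half-semitone correctness criterion used to score the multipitch estimator is simple to state in code. This is a trivial sketch; the function name is ours, while the 3% tolerance comes from the text (half a semitone, since 2^(0.5/12) ≈ 1.0293).

```python
# An estimate counts as correct if it lies within 3% of the reference
# pitch, i.e., roughly half a semitone (2 ** (0.5 / 12) ~= 1.0293).
def within_half_semitone(f_est: float, f_ref: float) -> bool:
    return abs(f_est - f_ref) / f_ref <= 0.03

correct = within_half_semitone(452.0, 440.0)  # about 2.7% sharp
wrong = within_half_semitone(415.3, 440.0)    # a full semitone flat
```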
V. DISCUSSION

The violin separation in Table I showed that, by multiplying a mixed spectrum by a filter of unit amplitude across the width of each peak containing a single harmonic, these harmonics could be extracted very effectively. This was done with relatively little computational expense in comparison to spectral subtraction methods involving sinusoidal parameter estimation. Table II provides some insight into the performances of the three filter designs in comparison to previous approaches to separating overlapping harmonics. Relative to Table I, one notices an overall decrease in performance, which is not surprising given the larger proportion of overlapping harmonics in Table II.

TABLE II SRRs AND \Delta\Sigma SRR / M, FOR THE SEPARATION OF HARMONICALLY RELATED MIXED NOTES

In Table II, the two energy-based filter designs produced overall the highest SRRs with real samples (samples {4}–{6}). They also performed very well for the synthesized samples (samples {1}–{3}). Equation (10) was the best performer for sample {1}. As the harmonic frequency and amplitude trajectories were provided for the synthesized samples, this indicates that under ideal conditions of stationary sinusoids, (10) performs the best, which is confirmed by observing Fig. 7 at low error values. The relatively low performance of Parsons' method can be explained by the lack of frequency dependence in the filter design. Nevertheless, it provides some advantage over removing overlapping harmonics from the separated notes altogether (the first column of data in Table II).

The low performance of the NLS method is surprising, and upon further examination of the separated spectra some insight was obtained. The low SRRs are mostly due to a few instances in which the amplitudes of the closely spaced sinusoids modeling an overlapping peak were grossly overestimated. The explanation could be that there is very little preventing a least-squares method from interpreting a small overlapping peak as the sum of two sinusoids of very large amplitude but nearly opposite phase, which destructively interfere. Although this interpretation may correspond to a minimum least-squares error, given that the spectrum is unlikely to be composed of stationary sinusoids, the small overlapping peak would more likely be the sum of two
relatively low-amplitude harmonics. Adding a term to the NLS estimation that penalizes joint amplitude estimates with large variance might partially overcome this problem. As the frequency and amplitude estimates of the closely spaced sinusoids must be optimized to find the least-squares fit, the computational cost of the NLS method is relatively large. There is also a risk that this optimization does not converge to the globally best fit.

The worst overall SRRs in Table II were obtained for sample {5}, which corresponds with a general observation that separation performance is worse for lower pitched samples. This is an unavoidable consequence of using a fixed-frequency-resolution transform, the DFT: harmonic frequency estimates are worse relative to the fundamental frequency for lower pitched notes. Also, the ratio of the width of harmonics to the spacing between harmonics becomes larger for lower pitches; hence, there is relatively more overlapping spectral content in lower pitched note mixes.

Fig. 8. Original spectrograms of a cello, soprano saxophone, and flute, and the spectrograms of the corresponding notes after separation (gray scales of all figures are equivalent).

TABLE III AVERAGE MSRR AND \Delta\Sigma SRR / M, FOR POLYPHONIES OF 2–5 INSTRUMENTS

Table III again validates the use of frequency-dependent filters by showing a consistent improvement in SRR over Parsons' method for the three proposed filter designs. As in Table II, the two energy-based filter designs, (4) and (6), achieved the highest SRRs. The NLS method and (10) were both derived by assuming that the sinusoids in an overlapping peak are stationary within each time frame. Given that these methods performed worse than (4) and (6), and that relatively long window lengths (186 ms) were used, it is reasonable to conclude that this assumption is inaccurate for most real samples. Another contributing factor to the lower performance of (10) is demonstrated in Fig. 7; notably, (10) is not very robust to large errors in harmonic frequency and amplitude estimates. Overall, not only is (4) the most predictable with respect to errors in harmonic amplitude and frequency estimates, as shown in Fig. 7, but Table III shows that it also achieves the highest SRRs. The SRRs in Table III for (4) are about 7 dB higher than the average separation results reported in [15]. In [15], a larger selection of 26 different instruments was used, although pitches were restricted to between 65 and 2100 Hz, and the results reported were averaged over clean mixes and mixes with additive pink noise. The cases in which the multipitch estimator failed were not accounted for in the average separation results in [15].

VI. CONCLUSION
Results have been presented for separating mixes of between two and seven synchronous notes from a mono track. Average SRRs of around 10–20 dB have been achieved, and mixes of two notes were separated with almost imperceptible differences between the separated and original notes. In some sample mixes, SRRs of up to 30 dB were obtained. With increasing numbers of notes in the mix, the separation quality predictably decreases, but mixes of up to seven notes have been separated with enough fidelity to easily allow the listener to match the separated and original notes correctly. This work has been extended [19] to separating synchronous note sequences or instrumental parts.
A product of the separation is the residual, which contains the inharmonic partials and noise characteristics of the original mix. This has been used elsewhere to make audible the subtle attack characteristics of piano notes, and may have application in the synthesis of realistic instrument sounds. No attempt has been made to split the residual into separate sources, and, thus, there are audible artifacts when attempting to separate notes with significant noise or nonharmonic content.

A solution has been proposed to the problem of separating overlapping harmonics from different notes, which is an important issue when attempting to separate musical sources. Three filter designs have been developed for separating multiple harmonics from an overlapping spectral peak. The first two designs are alternative methods for splitting the energy in the overlapping peak by using predictions of the frequencies and amplitudes of each harmonic constituting the peak. The third filter design is complex-valued and attempts to recover the DFTs of the individual unmixed harmonics. All of these filter designs resulted in overall improvements to separation performance, and the first energy-based filter design, (4), was overall the best performer. The work has confirmed that, via the use of a priori model-based information, in the form of both prespecified models of the harmonic structures of pitched instruments and a filtering methodology to handle overlapping spectral peaks, it is possible to separate the harmonic structure of multiple instruments from a mono recording to a high fidelity.

ACKNOWLEDGMENT
The authors would like to thank the three anonymous reviewers for their well-informed and helpful suggestions.

REFERENCES

[1] T. W. Parsons, "Separation of speech from interfering speech by means of harmonic selection," J. Acoust. Soc. Amer., vol. 60, no. 4, pp. 911–918, Oct. 1976.
[2] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.
[3] L. Ottaviani and D. Rocchesso, "Separation of speech signal from complex auditory scenes," in Proc. COST G-6 Conf. Digital Audio Effects, Limerick, Ireland, Dec. 2001, pp. 87–90.
[4] M. Cooke and D. P. W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Commun., vol. 35, no. 3–4, pp. 141–177, 2001.
[5] O. Cappé, J. Laroche, and E. Moulines, "Regularized estimation of cepstrum envelope from discrete frequency points," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk, NY, Oct. 1995.
[6] X. Rodet, "Musical sound signal analysis/synthesis: Sinusoidal + residual and elementary waveform models," presented at the IEEE Time-Frequency and Time-Scale Workshop. [Online]. Available: http://mediatheque.ircam.fr/articles/textes/Rodet97e/
[7] M. Donadio, "How to interpolate frequency peaks," dspGuru, Iowegian International Corp., revised May 1999. [Online]. Available: http://www.dspguru.com/howto/tech/peakfft2.htm
[8] M. Desainte-Catherine and S. Marchand, "High precision Fourier analysis of sounds using signal derivatives," J. Audio Eng. Soc., vol. 48, no. 7/8, pp. 654–667, Jul./Aug. 2000.
[9] P. Depalle and T. Hélie, "Extraction of spectral peak parameters using a short-time Fourier transform modeling and no sidelobe windows," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 1997.
[10] A. Klapuri, T. Virtanen, and J.-M. Holm, "Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals," presented at the COST-G6 Conf. Digital Audio Effects, Verona, Italy, Dec. 2000.
[11] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, 2nd ed. New York: Springer-Verlag, 1998.
[12] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 2001.
[13] T. Tolonen, "Methods for separation of harmonic sound sources using sinusoidal modeling," presented at the AES 106th Convention, Munich, Germany, May 1999.
[14] T. F. Quatieri and R. G. Danisewicz, "An approach to co-channel talker interference suppression using a sinusoidal model for speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 1, pp. 56–69, Jan. 1990.
[15] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using linear models for the overtone series," presented at the IEEE Int. Conf. Acoustics, Speech, Signal Processing, Orlando, FL, May 2002.
[16] H. Viste and G. Evangelista, "A method for separation of overlapping partials based on similarity of temporal envelopes in multi-channel mixtures," IEEE Trans. Audio, Speech, Lang. Process., to be published.
[17] R. C. Maher, "Evaluation of a method for separating digitized duet signals," J. Audio Eng. Soc., vol. 38, no. 12, pp. 956–979, Dec. 1990.
[18] M. R. Every and J. E. Szymanski. (2004, Jun.). Note Separation Demonstrations. [Online]. Available: http://www-users.york.ac.uk/~jes1/Separation1.html
[19] M. R. Every and J. E. Szymanski, "A spectral-filtering approach to music signal separation," presented at the 7th Int. Conf. Digital Audio Effects, Naples, Italy, Oct. 2004.

Mark R. Every received an Honours degree in physics from the University of the Witwatersrand, South Africa, in 1999. He is currently pursuing the M.S. degree in music technology at the University of York, York, U.K., on a British Commonwealth Scholarship. He is also an academic fellow at the Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, U.K. His current research interests include music and audio signal processing, content description and extraction, and machine learning techniques.

John E. Szymanski received an Honours degree in mathematics and the Ph.D. degree in theoretical physics from the University of York, York, U.K., in 1980 and 1984, respectively. He joined the Department of Electronics, University of York, in 1986, where he is currently a Senior Lecturer within the Media Engineering Group. His research interests include physical modelling, computational signal processing, inverse problems, and optimization methods.