IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 2, MARCH 2004
A Perceptual Subspace Approach for Modeling of Speech and Audio Signals With Damped Sinusoids

Jesper Jensen, Member, IEEE, Richard Heusdens, and Søren Holdt Jensen, Senior Member, IEEE

Manuscript received November 20, 2002; revised July 14, 2003. This research was conducted within the ARDOR project and was supported by the E.U. under Grant IST-2001-34095. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ravi P. Ramachandran. J. Jensen and R. Heusdens are with the Department of Mediamatics, Delft University of Technology, 2628 CD Delft, The Netherlands (e-mail: [email protected]; [email protected]). S. H. Jensen is with the Department of Communication Technology, Aalborg University, 9220 Aalborg, Denmark (e-mail: [email protected]). Digital Object Identifier 10.1109/TSA.2003.819948
Abstract—The problem of modeling a signal segment as a sum of exponentially damped sinusoidal components arises in many different application areas, including speech and audio processing. Often, model parameters are estimated using subspace-based techniques, which arrange the input signal in a structured matrix and exploit the so-called shift-invariance property related to certain vector spaces of the input matrix. A problem with this class of estimation algorithms, when used for speech and audio processing, is that the perceptual importance of the sinusoidal components is not taken into account. In this work we propose a solution to this problem. In particular, we show how to combine well-known subspace-based estimation techniques with a recently developed perceptual distortion measure in order to obtain an algorithm for extracting perceptually relevant model components. In analysis-synthesis experiments with wideband audio signals, objective and subjective evaluations show that the proposed algorithm improves perceived signal quality considerably over traditional subspace-based analysis methods.

Index Terms—Complex exponentials, perceptually relevant sinusoids, psycho-acoustical distortion measure, sinusoidal modeling, speech and audio processing, subspace-based signal analysis.
I. INTRODUCTION
Sinusoidal models have proven to provide accurate and flexible representations of a large class of acoustic signals, including audio and speech signals. For speech and audio processing, sinusoidal models have been applied in areas such as speech coding (e.g., [1]–[3]) and enhancement (e.g., [4], [5]), speech signal transformations (e.g., [6]–[8]), music synthesis (e.g., [9], [10]), and more recently low bit-rate audio coding (e.g., [11]–[13]). The applications above can be described in an analysis-modification-synthesis framework, where in the analysis stage model parameters are estimated for consecutive signal frames; in this stage it is typically assumed that each signal frame can be well represented as a linear combination of constant-amplitude, constant-frequency sinusoidal functions. In the modification phase, the estimated parameters may be quantized or otherwise modified. Finally, in the synthesis stage, the resulting parameters are used for reconstructing the possibly modified signal using interpolative (e.g., [14], [15]) or overlap/add synthesis (e.g., [3]).

Recently, a number of extended sinusoidal model variants have been proposed, in which the constant-amplitude, constant-frequency assumption has been relaxed (e.g., [11], [16], [17]). An extended model of particular interest is the so-called exponential sinusoidal model (ESM), which aims at representing signal segments as sums of exponentially damped sinusoidal functions. Observing that damped oscillations occur commonly in many natural signals, including speech and audio, the ESM is often a physically reasonable model. Furthermore, since exponentially damped sinusoids play a fundamental role in linear system theory, a vast amount of research supports the treatment of this model. The ESM has been applied to analysis and/or synthesis of audio (e.g., [18]–[21]) as well as speech signals (e.g., [22], [23]).

In many speech and audio applications it is of interest to represent only the perceptually relevant time/frequency regions of the signal in question by exploiting the masking properties of the human auditory system. For example, in most audio coding schemes (e.g., [24]) and in perceptual speech coding algorithms (e.g., [25], [26]), the parameter estimation and/or quantization stages have been tailored to represent the perceptually most significant signal regions.

For the ESM, the parameter estimation schemes can roughly be divided into two main groups: analysis-by-synthesis schemes, such as the matching pursuit (MP) based algorithms described in [16], [27], and subspace-based schemes (e.g., [28]–[30]). While some work has been done on extracting perceptually relevant sinusoids using MP based schemes (e.g., [31], [32]), less effort has been directed toward estimating perceptually relevant ESM parameters using subspace techniques [21]. In [21] an attempt is made to combine psycho-acoustic information with a subspace based ESM parameter estimation scheme. In this scheme, the signal to be modeled is divided into subbands, and an independent (low-order) ESM is used for each subband. The ESM components are estimated in an iterative manner, one at a time, by assigning at each iteration an additional damped sinusoid to the subband with the largest residual noise-to-masking level, in much the same way as bits are assigned to different subbands in MPEG [24]. The approach in [21] operates at a lower computational complexity than a corresponding full-band scheme. However, it is not optimal because subbands are treated independently. Furthermore, no perceptual knowledge is used for estimating the sinusoids within each subband.
This paper describes an alternative approach for determining perceptually relevant ESM parameters. The aim is to minimize a perceptually motivated distortion measure. Furthermore, the presented framework allows for joint estimation of the parameters of interest. The presented algorithms combine well-known subspace based estimation algorithms and a distortion measure derived from a recently developed psycho-acoustical masking model.

The paper is structured as follows. In Section II, we introduce the perceptual distortion measure on which the proposed algorithm relies. Section III briefly describes a traditional scheme for ESM parameter estimation and then moves on to treat the proposed algorithm. Section IV reports results of simulation experiments performed with the proposed algorithm. First, a number of simple simulation experiments are conducted to illustrate differences between the proposed algorithm and traditional (nonperceptual) subspace estimation techniques. Then, it is shown through simulations with audio signals how the dimensions of the input matrix influence objective and subjective modeling performance. Furthermore, an analysis of the distribution of estimated sinusoids for speech and audio signal modeling reveals a simple way of reducing the computational load of the algorithms in the study. Finally, the subjective performance of the proposed algorithm is evaluated in a listening test. Section V summarizes and concludes the paper.

II. PERCEPTUAL DISTORTION MEASURE

In order to account for human auditory perception in the estimation of the ESM parameters, we use the recently developed perceptually relevant distortion measure described in [33]. This distortion measure has a number of desirable properties which make it particularly useful for the application at hand, namely audio and speech representation. First of all, the distortion measure is general in the sense that it does not make any assumptions on the origin of the target signals. Consequently, it is applicable to both speech and audio signals. This is in contrast to many of the well-known objective speech quality measures [34], which rely on an autoregressive signal production model. Secondly, in any practical situation, the distortion measure defines a norm on a Hilbert space. This property is important because it makes the distortion measure mathematically tractable and allows it to be incorporated in optimization algorithms aiming at minimizing norm based criteria.

Ignoring time-domain masking phenomena, signal distortion becomes audible when the log power spectrum of the modeling error exceeds a frequency dependent threshold, called the masking threshold. Many models exist for computing the masking threshold. In this paper we use the model proposed in [35], which differs from traditional spreading-function based models in the sense that it takes into account all auditory filters for computing the distortion, rather than considering only the auditory filter receiving most of the distortion. Furthermore, this psycho-acoustical model avoids the explicit classification of masker signal components as tonal or noise-like, which is often used in traditional models (e.g., [24]).
Although in this work we use the masking threshold derived from the model in [35], it is just as possible to use another psycho-acoustical model, e.g., the one used in MPEG-Audio [24]. It should, however, be noted that the model in [35] provides masking threshold predictions which are in better accordance with experimental psycho-acoustical data than those obtained with conventional masking models [35], [36].

The distortion measure can be written as [33]

\[
D = \frac{1}{2\pi} \int_{-\pi}^{\pi} A(\omega) \, \bigl| \mathcal{F}\{w e\}(\omega) \bigr|^{2} \, d\omega \tag{1}
\]

where \(\mathcal{F}\) indicates the Fourier transform operation, \(A(\omega)\) is a weighting function representing the frequency-dependent sensitivity of the human auditory system, \(w\) is the analysis window, and \(e = x - \hat{x}\) is the modeling error, i.e., the difference between the original signal \(x\) and the modeled signal \(\hat{x}\). The weighting function is usually chosen to be the reciprocal of the masking threshold. In order for (1) to define a norm, the weighting function \(A(\omega)\) must be positive and real for all \(\omega\), and the window sequence must satisfy \(w(n) \neq 0\) for all samples of the frame. In this case, the distortion in (1) can be rewritten as a convolution of two (infinite) discrete-time sequences

\[
D = \| h * (w e) \|_{2}^{2} \tag{2}
\]

where \(h\) is the inverse Fourier transform of \(\sqrt{A(\omega)}\), and \(\| \cdot \|_{2}\) denotes the vector 2-norm.

In the context of the ESM, the modeled signal frame is given by

\[
\hat{x}(n) = \sum_{k=1}^{K/2} \alpha_k \, e^{d_k n} \cos(\omega_k n + \varphi_k) \tag{3}
\]

where \(\alpha_k\), \(d_k\), \(\omega_k\), and \(\varphi_k\) are amplitude, damping, angular frequency, and phase parameters, respectively. The problem at hand is, for a given original signal frame \(x\), to find the set of ESM parameters which minimizes the perceptual distortion measure in (2). Since a convolution operation can be formulated in terms of a matrix-vector multiplication, the minimization problem of interest can be stated as

\[
\min_{\hat{x}} \| H W (x - \hat{x}) \|_{2}^{2} \tag{4}
\]

where \(W\) is a diagonal matrix with the elements of the analysis window \(w\) on the main diagonal, and \(H\) is an (infinite) Toeplitz filtering matrix containing the elements of the, in this case symmetric, filter impulse response \(h\). We treat the specific structure of \(H\) in further detail in Section III-B. The effect of premultiplication with \(H W\) may be interpreted as a transformation from the linear domain, where the \(\ell_2\)-norm does not necessarily correlate well with subjective quality, to a perceptual domain, where the \(\ell_2\)-norm is in better accordance with perceived quality.
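To make the measure concrete, the following is a minimal NumPy sketch of (2) and (4). It is our illustration, not the authors' code: the symmetric filter h below is a made-up placeholder for the weighting filter that would normally be derived from the masking model of [35], and the circular-convolution form anticipates the finite-length implementation discussed in Section III-B.

```python
import numpy as np
from scipy.linalg import circulant

# Sketch of the perceptual distortion (2)/(4): D = ||h * (w . e)||_2^2.
# In practice h would be the inverse Fourier transform of sqrt(A(omega)),
# with A(omega) the reciprocal of a masking threshold; here h is a
# placeholder symmetric impulse response.

N = 1024                                  # frame length (23.2 ms at 44.1 kHz)
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                # original frame (stand-in)
x_hat = 0.9 * x                           # modeled frame (stand-in)

w = np.hanning(N)                         # analysis window
h = np.zeros(N)
h[0] = 1.0
h[1] = h[-1] = 0.25                       # placeholder, circularly symmetric

e = x - x_hat                             # modeling error
# Convolution form of (2), implemented circularly via the FFT:
D_conv = np.sum(np.abs(np.fft.ifft(np.fft.fft(h) * np.fft.fft(w * e)))**2)

# Equivalent matrix form of (4): D = ||H W e||_2^2,
# with H circulant in h and W = diag(w).
H = circulant(h)                          # H[i, j] = h((i - j) mod N)
D_mat = np.linalg.norm(H @ (w * e))**2
assert np.isclose(D_conv, D_mat)
```

The final assertion simply confirms that the filtering and matrix formulations of the distortion coincide.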
III. ESTIMATION OF PERCEPTUALLY RELEVANT ESM PARAMETERS

The algorithms to be presented rely on the observation that the modeled segment in (3) can be expressed as a sum of complex exponentials

\[
\hat{x}(n) = \sum_{k=1}^{K} c_k z_k^{n} \tag{5}
\]

where \(c_k \in \mathbb{C}\) are complex amplitude parameters and \(z_k = e^{d_k + j\omega_k}\) are so-called signal poles. From (5) we see that the signal poles contribute nonlinearly to the objective function in (4). The nonlinearity related to signal pole estimation can be circumvented by using the so-called HTLS algorithm by Van Huffel et al. [29], which is a total least squares (TLS) based variant of Kung et al.'s original state-space algorithm [28]. These algorithms belong to the class of single shift-invariant methods within the set of subspace-based signal analysis algorithms [30]. The HTLS algorithm is not immediately suited for solving the weighted problem in (4). To take the filtering in (4) into account, we instead consider the so-called prefiltered HTLS algorithm described in [37]. Having estimated the signal poles using this algorithm, the complex amplitudes \(c_k\) can be found as the solution to a weighted linear least-squares problem. In the following we give a brief review of the HTLS algorithm and the prefiltered HTLS algorithm in order to support our discussions in the remainder of the paper; for an in-depth treatment of the algorithms, the reader is referred to [29] and [37], respectively.

A. Signal Poles With HTLS

Let us assume initially that the observed signal frame \(x\) can be represented by the ESM in (5) without error, and that the correct model order \(K\) is known. The HTLS algorithm first arranges the observed signal frame in a Hankel data matrix \(X\) as follows:
\[
X = \begin{bmatrix}
x(0) & x(1) & \cdots & x(M-1) \\
x(1) & x(2) & \cdots & x(M) \\
\vdots & \vdots & & \vdots \\
x(L-1) & x(L) & \cdots & x(N-1)
\end{bmatrix} \tag{6}
\]

where \(L + M - 1 = N\). The singular value decomposition (SVD) of \(X\) is given by

\[
X = U \Sigma V^{H} \tag{7}
\]

where \(\Sigma\) is a diagonal matrix containing the nonzero singular values, and the matrices \(U\) and \(V\) contain as columns the corresponding left and right singular vectors, respectively. The shift-invariance property of \(U\) (and \(V\)) ensures that the following matrix equations are satisfied:

\[
U^{\downarrow} Q = U^{\uparrow}, \qquad V^{\downarrow} \tilde{Q} = V^{\uparrow} \tag{8}
\]

where the superscripts \(\uparrow\) (\(\downarrow\)) denote deletion of the top (bottom) row of the matrix in question. The signal poles are found as the eigenvalues of the matrix \(Q\) (or \(\tilde{Q}\)). If the observed signal satisfies (5) and \(K\) is known, the underlying signal poles can be recovered without error. However, in practice, the observed signal frame will not satisfy the ESM exactly, the Hankel matrix \(X\) will typically have a rank larger than \(K\), and the shift-invariance property in (8) will only be approximately valid. In this case, the matrix \(U\) contains the left singular vectors corresponding to the \(K\) largest singular values, and the matrix \(Q\) (or \(\tilde{Q}\)) is estimated as the total least squares solution [38] of the (incompatible) matrix equations in (8).

B. Signal Poles With Prefiltered HTLS

The HTLS algorithm described above is not immediately suited for solving the problem in (4) because HTLS does not take the filtering operation of \(H\) (and \(W\)) into account. A first obvious choice for adapting HTLS to this problem is to filter the observed signal frame with the FIR filter \(h\), i.e., calculate the convolution sequence \(y = h * x\), arrange this sequence in a data matrix \(Y\), and then use \(Y\) as input to the HTLS algorithm. However, this approach is not acceptable because \(Y\) will no longer have rank \(K\); even when the observed signal frame satisfies the ESM and \(K\) is known, it is not possible to retrieve the underlying signal poles without error using this approach. Alternatively, the convolution sequence could be truncated before arranging it in \(Y\). Specifically, by first discarding the leading and trailing filter transients of \(y\) and arranging the remaining middle part in \(Y\), the matrix \(Y\) retains the rank \(K\) of \(X\) and the shift-invariance property; that is, when \(x\) satisfies the ESM and \(K\) is known, the signal poles can be estimated without error. The problem with this approach, however, is that potentially useful data in the edges of \(y\) is wasted during the truncation. Furthermore, in order to have samples left after the truncation, the length of the filter impulse response is limited by the frame length.

The above mentioned drawbacks can be overcome by using the so-called prefiltered HTLS algorithm [37], which retains the rank of the original signal without discarding potentially useful data. In the prefiltered HTLS algorithm, the Hankel data matrix \(X\) is postmultiplied by a full rank filter matrix \(H\), and the HTLS algorithm is applied to the matrix product \(XH\); a similar description can be derived when \(X\) is premultiplied with a filter matrix \(\tilde{H}\) [37]. It is straightforward to show that \(XH\), which generally is not Hankel structured, retains the rank-\(K\) and shift-invariant property. That is, when \(x\) satisfies (5) and \(K\) is known, we have

\[
X H = U \Sigma V^{H}, \qquad U^{\downarrow} Q = U^{\uparrow}
\]

where \(U\) (\(V\)) contains the left (right) singular vectors corresponding to the nonzero singular values of the filtered matrix \(XH\), and the signal poles can be recovered without error as the eigenvalues of \(Q\) (or \(\tilde{Q}\)). In this ideal scenario, the only requirement is that the filter matrix \(H\) have full rank.

The purpose of the filter matrix \(H\) is to implement the convolution in (2). The range of the summation in (2) is infinite, and its frequency-domain counterpart in (1) involves an integral. In practice, however, we work with finite-length time-domain
sequences, and the integral in (1) should be replaced by a summation, i.e., a point-wise multiplication in the frequency domain. This, in turn, means that the convolution in (2) becomes a circular convolution rather than a linear one. Hence, we consider the \(M \times M\) circulant (circular Toeplitz) filter matrix

\[
H = \begin{bmatrix}
h(0) & h(M-1) & \cdots & h(1) \\
h(1) & h(0) & \cdots & h(2) \\
\vdots & \vdots & \ddots & \vdots \\
h(M-1) & h(M-2) & \cdots & h(0)
\end{bmatrix} \tag{9}
\]

with entries \(H_{m,n} = h((m - n) \bmod M)\), where the impulse response \(h\) is zero-padded to length \(M\). Initial experiments indeed show that this filter matrix structure leads to better performance than, e.g., the Toeplitz filter matrix proposed in [37]. Forming the product \(XH\) corresponds to convolving each row in \(X\) circularly with the FIR filter impulse response \(h\).
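As a small illustration (ours, using SciPy's hankel and circulant helpers), the construction of X and the prefiltered product XH can be sketched as follows; the placement of the symmetric filter taps is simplified, as noted in the comments.

```python
import numpy as np
from scipy.linalg import hankel, circulant

def prefiltered_data_matrix(x, L, h):
    """Form the Hankel matrix X of (6) and the product XH used by the
    prefiltered HTLS algorithm, with H the circulant matrix of (9)."""
    N = len(x)
    M = N - L + 1
    assert len(h) <= M
    X = hankel(x[:L], x[L-1:])       # L x M Hankel matrix, X[i, j] = x(i + j)
    h_pad = np.zeros(M)
    h_pad[:len(h)] = h               # zero-pad h to length M
    # Note: for a zero-phase symmetric h, the negative-time taps would be
    # wrapped to the end of h_pad; this linear placement is a simplification.
    H = circulant(h_pad)             # H[m, n] = h((m - n) mod M)
    return X, X @ H
```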
C. Estimation of Complex Amplitudes

Having estimated the signal poles \(z_k\) using the prefiltered HTLS algorithm, the complex amplitudes \(c_k\) (and thus the real amplitudes \(\alpha_k\) and phases \(\varphi_k\)) are found as the solution to the weighted linear least-squares problem

\[
\min_{c} \| H W (x - Z c) \|_{2}^{2} \tag{10}
\]

where \(c = [c_1, \ldots, c_K]^{T}\) is the complex amplitude vector, \(H\) is an \(N \times N\) circular Toeplitz filter matrix whose first column and first row contain the (zero-padded) filter impulse response \(h\) and its time-reverse, respectively, \(W\) is a diagonal matrix with the elements of the analysis window on the main diagonal, and \(Z\) is a Vandermonde matrix constructed from the signal pole estimates

\[
Z = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
z_1 & z_2 & \cdots & z_K \\
\vdots & \vdots & & \vdots \\
z_1^{N-1} & z_2^{N-1} & \cdots & z_K^{N-1}
\end{bmatrix}. \tag{11}
\]
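For completeness, we add a standard observation (not spelled out in the text) on how the complex amplitudes map to the real parameters of (3): for a real signal frame, the estimated poles and amplitudes occur in complex-conjugate pairs, and each pair contributes one real damped sinusoid,

\[
c_k z_k^{n} + \bar{c}_k \bar{z}_k^{n} = 2 |c_k| \, e^{d_k n} \cos(\omega_k n + \arg c_k),
\]

so that, with \(z_k = e^{d_k + j\omega_k}\), the real amplitude and phase are \(\alpha_k = 2|c_k|\) and \(\varphi_k = \arg c_k\).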
D. Algorithm Outline

The proposed scheme, which we call P-ESM, for estimating perceptually relevant ESM parameters can be outlined as follows.

Input: signal frame \(x\), model order \(K\).
Output: signal pole estimates \(\hat{z}_k\) and complex amplitude estimates \(\hat{c}_k\), \(k = 1, \ldots, K\).

1. Compute the perceptual weighting filter \(h\) from a psycho-acoustical masking model (e.g., [35]), and construct the filter matrix \(H\) of (9) and the corresponding \(N \times N\) filter matrix of (10).
2. Construct the Hankel structured data matrix \(X\) from \(x\).
3. Compute the prefiltered data matrix \(XH\).
4. Find perceptually relevant signal pole estimates \(\hat{z}_k\) using the HTLS algorithm applied to \(XH\) [37].
5. Construct the Vandermonde matrix \(Z\) [(11)] from the estimated signal poles, and estimate the complex amplitude vector \(c\) from the weighted linear least-squares problem [(10)].
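As an illustration of steps 1-5, the following NumPy/SciPy sketch wires the pieces together. It is a simplified rendering under stated assumptions: the weighting filter h is supplied by the caller (a placeholder for the masking-model filter), the shift-invariance equations (8) are solved with ordinary least squares rather than the total least squares solve of HTLS proper, and a Hanning analysis window is hard-coded as in Section IV.

```python
import numpy as np
from scipy.linalg import hankel, circulant, lstsq

def p_esm(x, K, h, L):
    """Sketch of the P-ESM outline (steps 1-5). Requires K <= L - 1 and
    len(h) <= M. `h` stands in for the perceptual weighting filter of [35]."""
    N = len(x)
    M = N - L + 1

    # Steps 1-3: filter matrix (9), Hankel matrix (6), prefiltered matrix XH.
    h_pad = np.zeros(M)
    h_pad[:len(h)] = h
    H = circulant(h_pad)                        # H[m, n] = h((m - n) mod M)
    X = hankel(x[:L], x[L-1:])                  # L x M, X[i, j] = x(i + j)
    XH = X @ H

    # Step 4: K-dimensional signal subspace and shift invariance (8).
    U, _, _ = np.linalg.svd(XH)
    Uk = U[:, :K]                               # dominant left singular vectors
    Q, *_ = lstsq(Uk[:-1, :], Uk[1:, :])        # solve U_down Q ~= U_up (LS)
    z = np.linalg.eigvals(Q)                    # signal pole estimates

    # Step 5: Vandermonde matrix (11) and weighted LS amplitudes (10).
    Zmat = np.vander(z, N, increasing=True).T   # N x K, row n is [z_k**n]
    w = np.hanning(N)                           # analysis window (Section IV)
    h_pad_N = np.zeros(N)
    h_pad_N[:len(h)] = h
    Hn = circulant(h_pad_N)                     # N x N filter matrix of (10)
    A = Hn @ (w[:, None] * Zmat)                # H W Z
    b = Hn @ (w * x)                            # H W x
    c, *_ = lstsq(A, b)                         # complex amplitudes
    return z, c
```

For real input frames, the estimated poles occur in complex-conjugate pairs and can be converted to the real parameters of (3) as noted after (11).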
IV. SIMULATION RESULTS

A number of simulation experiments were conducted to study and evaluate the performance of the proposed algorithm, P-ESM, and to demonstrate the differences between this scheme and the traditional scheme without perceptual prefiltering (i.e., with the filter matrix \(H\) replaced by the identity matrix). We denote this latter scheme by \(\ell_2\)-ESM, where "\(\ell_2\)" reflects that processing is done to minimize an unweighted \(\ell_2\)-norm (as opposed to a perceptually weighted \(\ell_2\)-norm). Objective as well as subjective tests were performed.

Seven audio signals, sampled at a frequency of 44.1 kHz, were used in the experiments (see Appendix I). A fixed frame length of \(N = 1024\) samples (23.2 ms) was used, and frames were extracted with an overlap of 50%. The filter impulse response \(h\) had a length of 256 samples (5.8 ms); initial experiments showed slightly better performance with this choice compared with shorter and longer impulse responses. For the P-ESM algorithm, the Hankel data matrix \(X\) (and thus the filtered matrix \(XH\)) was chosen "fat," i.e., with \(L < M\), while for the standard \(\ell_2\)-ESM algorithm the data matrix was chosen "as square as possible," i.e., \(L \approx M\). Simulation studies reported in Section IV-C show that these matrix dimensions lead to the best overall performance of the respective algorithms. The window \(w\) used in the experiments was a Hanning window.
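For concreteness, the frame extraction described above can be sketched as follows (our illustration; parameter values taken from the text).

```python
import numpy as np

def extract_frames(x, N=1024, overlap=0.5):
    """Split a signal into fixed-length frames with 50% overlap,
    as used in the experiments (N = 1024 samples, 23.2 ms at 44.1 kHz)."""
    hop = int(N * (1 - overlap))
    return [x[i:i + N] for i in range(0, len(x) - N + 1, hop)]
```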
Fig. 1. Modeling of a sum of sinusoids using P-ESM (left column) and \(\ell_2\)-ESM (right column). (a)-(b) Power spectrum (solid) of the original signal frame and corresponding masking curve (dashed). (c)-(d) Modeling with P-ESM and \(\ell_2\)-ESM for K = 2. (e)-(f) Modeling with P-ESM and \(\ell_2\)-ESM for K = 4. (g)-(h) Modeling with P-ESM and \(\ell_2\)-ESM for K = 6.
In order to have an objective quality measure, we define the following "perceptual" signal-to-noise ratio (SNR_P) for the original signal frame \(x\) and its modeled counterpart \(\hat{x}\):

\[
\mathrm{SNR}_P = 10 \log_{10} \frac{\| H W x \|_{2}^{2}}{\| H W (x - \hat{x}) \|_{2}^{2}} \tag{12}
\]

where the Toeplitz matrix \(H\) and the diagonal matrix \(W\) are identical to the ones used in (10) for estimating the complex amplitudes in the prefiltered HTLS algorithm. The SNR_P measure aims at reflecting the quality of the modeled frame in a perceptual domain, and is valid to the extent that the perceptual model used for constructing the filter matrix adequately represents the masking properties of the human auditory system. In some cases, it is useful to assign an objective quality measure to the modeling of several consecutive signal frames, e.g., an entire signal. To do so, we use the segmental SNR_P (SNR_P,seg), defined as the average SNR_P value taken across the signal frames in question.
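The definition in (12) translates directly into code; the sketch below (ours) reuses the N x N filter matrix and analysis window from the amplitude estimation step.

```python
import numpy as np

def snr_p(x, x_hat, Hn, w):
    """Perceptual SNR of (12): energy ratio in the perceptually weighted
    domain. `Hn` is the N x N filter matrix and `w` the analysis window,
    as used in (10)."""
    num = np.linalg.norm(Hn @ (w * x))**2
    den = np.linalg.norm(Hn @ (w * (x - x_hat)))**2
    return 10.0 * np.log10(num / den)

def snr_p_seg(frames, modeled, Hn, w):
    """Segmental SNR_P: average SNR_P across consecutive frames."""
    return np.mean([snr_p(x, xh, Hn, w) for x, xh in zip(frames, modeled)])
```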
A. Two Case Studies

We demonstrate the characteristics of the proposed method in two case studies.

Example 1 (Sum of Sinusoids): In this example a synthetic signal frame was generated as a sum of three stationary sinusoids, two of which are closely spaced in frequency, while the third, at 20 kHz, carries the largest energy. The signal frame was modeled using P-ESM and \(\ell_2\)-ESM for model orders K = 2, 4, 6. The result of the modeling procedure is illustrated in Fig. 1. Fig. 1(a) and (b) shows the spectrum of the original signal (solid) with the corresponding masking curve (dashed). Fig. 1(c) and (d) shows the modeled spectra for K = 2 for P-ESM and \(\ell_2\)-ESM, respectively. In Fig. 1(c) we see that P-ESM tries to model the two closely spaced sinusoids, because these are the most important from a perceptual point of view, while as shown in Fig. 1(d), \(\ell_2\)-ESM does not take perceptual information into account and selects the sinusoid at 20 kHz because it contains the largest energy.
In Fig. 1(e), the model order is K = 4 and P-ESM represents the two closely spaced sinusoids almost perfectly, while \(\ell_2\)-ESM in Fig. 1(f) maintains the sinusoid at 20 kHz. For K = 6, both P-ESM [Fig. 1(g)] and \(\ell_2\)-ESM [Fig. 1(h)] retrieve the three sinusoids without error.

Fig. 2. Modeling of a noise-like signal frame using P-ESM (left column) and \(\ell_2\)-ESM (right column). (a) and (b) Spectrum of the original signal frame with corresponding masking curve. (c) and (d) Modeling with P-ESM and \(\ell_2\)-ESM for K = 2. (e) and (f) Modeling with P-ESM and \(\ell_2\)-ESM for K = 4. (g) and (h) Modeling with P-ESM and \(\ell_2\)-ESM for K = 16.
Example 2 (Noise-Like Signal Frame): Here, we consider the case where the observed signal frame contains a noise-like signal. In this example, we have selected a frame from an unvoiced speech sound (/f/ in "Viel," German male speaker). The modeling performance for P-ESM and \(\ell_2\)-ESM is illustrated, respectively, in the left and right columns of Fig. 2 for the frequency range 0-6 kHz, from which it is clear that P-ESM and \(\ell_2\)-ESM attack different regions of the spectrum. The P-ESM scheme aims at representing the lower frequency regions, while \(\ell_2\)-ESM models the high-energy regions around 3 kHz first. Even for a model order of K = 16, \(\ell_2\)-ESM [Fig. 2(h)] does not represent the low-frequency regions around 500 Hz, which were found to be the perceptually most relevant by P-ESM [see Fig. 2(c) and (e)].
B. Performance Versus Model Order

In this section, we compare the performance of P-ESM and \(\ell_2\)-ESM for varying model order K and for two types of signal frames: a quasiperiodic signal frame taken from a steady-voiced region in a female speech signal, and a noise-like signal frame taken from an unvoiced region in a male speech signal. The two signal frames are modeled with the ESM using P-ESM and \(\ell_2\)-ESM for parameter extraction over a range of model orders. The modeling performance for the quasiperiodic and the noise-like signal frames is shown in Figs. 3 and 4, respectively.

From Fig. 3 we see that the performance of P-ESM and \(\ell_2\)-ESM is almost identical for lower model orders. The reason is that with this signal frame, the harmonics with the highest energy also have the highest perceptual relevance, i.e., P-ESM and \(\ell_2\)-ESM extract the same sinusoids for low model orders. For larger model orders, however, the performance gap between the two schemes grows; for the largest model orders considered, the difference is about 2 dB in favor of P-ESM. For noise-like frames such as the one in Fig. 4, the performance gap is approximately 1.5 dB at low model orders and almost 3 dB at higher model orders. The reason for the difference at low model orders is, as illustrated in Fig. 2, that P-ESM tends to represent the perceptually important low-frequency regions first, while \(\ell_2\)-ESM models the high-energy regions.

Fig. 3. Modeling performance versus model order K for the quasiperiodic signal frame. (a) Magnitude spectrum of the signal frame. (b) SNR_P as a function of K for P-ESM and \(\ell_2\)-ESM.

Fig. 4. Modeling performance versus model order K for the noise-like signal frame. (a) Magnitude spectrum of the signal frame. (b) SNR_P as a function of K for P-ESM and \(\ell_2\)-ESM.

C. Performance Versus Matrix Dimensions

In [39] it was argued that data matrices should be constructed "as square as possible" for optimal performance with the standard HTLS algorithm. However, since it is not clear whether square data matrices are optimal with the P-ESM scheme, we study here the impact of the matrix dimensions on modeling performance. The set of audio signals listed in Appendix I was modeled with P-ESM and \(\ell_2\)-ESM using values of L/N in the range 0.08-0.93. A constant signal frame length of N = 1024 samples was used, resulting in taller and narrower data matrices for increasing values of L. Fig. 5 shows the modeling performance in terms of SNR_P,seg as a function of L/N. We see that the performance of \(\ell_2\)-ESM remains essentially constant for values of L/N up to about 2/3, which is consistent with the results reported in [39] (although for signals very different from the ones considered here). For P-ESM, we observe that square data matrices (L/N approximately 1/2) lead to good but not optimal performance in terms of SNR_P,seg. Instead, the optimum is shifted toward smaller values of L/N, where P-ESM gives an SNR_P,seg gain of more than 2 dB over \(\ell_2\)-ESM. Detailed analysis of the results shows that these conclusions hold for each of the signals used in the test, and for both model orders considered.

The computational complexity of P-ESM and \(\ell_2\)-ESM is dominated by the estimation of the column space, which in this work is obtained from an SVD of the (prefiltered) data matrix [see (7)]; consequently, the computational complexity of the algorithms is approximately O(L^2 M) for L <= M. From this we conclude that the increased P-ESM performance is obtained at a reduced computational complexity compared to \(\ell_2\)-ESM, because P-ESM uses "fat" matrices (L < M), whereas \(\ell_2\)-ESM uses square matrices (L approximately equal to M). In our implementation of the algorithms, P-ESM requires 80% of the computations used for \(\ell_2\)-ESM (as measured with the "flops" counter in Matlab).
D. Distribution of Model Components Across Frequency

An analysis of the distribution of model components across frequency is of interest for several reasons. First, it provides a clearer insight into the characteristics of the parameter extraction schemes. Secondly, it may lead to algorithmic advantages, for example a reduction in the computational expense of the parameter estimation schemes.

The audio signals described in Appendix I were represented using the P-ESM and the \(\ell_2\)-ESM schemes with the matrix dimensions found best for each scheme above. A fixed model order of K = 50 was used for all signal frames. The estimated sinusoids (a total of more than 150,000 sinusoids for each estimation scheme) were collected and sorted in bins of width 500 Hz according to their estimated frequency parameter. The histograms resulting from this procedure are shown in Fig. 6. Clearly, most model components are found at lower frequencies. A careful examination of the histogram for P-ESM [Fig. 6(a)] shows that 79.6% of the components are found below 5 kHz, 92.7% below 8 kHz, and 99.6% below 11 kHz. A similar analysis of Fig. 6(b) reveals the corresponding values of 89.7% below 5 kHz, 95.2% below 8 kHz, and 99.1% below 11 kHz. The component distribution for a lower fixed model order is almost identical to the K = 50 case.

In order to demonstrate the distribution of estimated components further, we have included Fig. 7, which shows the location of ESM components in the time-frequency plane for one of the audio excerpts listed in Appendix I. From this figure it is clear that P-ESM distributes the sinusoids more uniformly across the frequency axis [Fig. 7(b)], while \(\ell_2\)-ESM tends to "cluster" sinusoids in high-energy regions [Fig. 7(c)].

The fact that sinusoidal components almost never occur in regions above 11 kHz is significant because it may allow for a decimation of the signal frames under analysis without any noticeable loss in modeling performance. To pursue this idea further, we implemented the following decimated versions of the algorithms.
Fig. 5. Average modeling performance SNR_P,seg across the test signals in Appendix I as a function of the dimension L of the data matrix, for a fixed frame length of N = 1024 samples (M = N - L + 1).

Fig. 6. Distribution of model components across frequency (K = 50). (a) Components extracted with P-ESM. (b) Components extracted with \(\ell_2\)-ESM.
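The binning behind Fig. 6 can be reproduced with a few lines (our sketch): pole angles are converted to Hz and counted in 500-Hz-wide bins.

```python
import numpy as np

def component_histogram(poles_per_frame, fs=44100, bin_hz=500):
    """Bin estimated component frequencies as in Fig. 6: pole angles are
    mapped to Hz and counted in 500-Hz-wide bins up to fs/2."""
    freqs = np.concatenate([
        np.angle(z) / (2 * np.pi) * fs for z in poles_per_frame
    ])
    freqs = np.abs(freqs)                     # fold conjugate-pair components
    edges = np.arange(0, fs / 2 + bin_hz, bin_hz)
    counts, _ = np.histogram(freqs, bins=edges)
    return counts, edges
```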
Fig. 7. Distribution of model components in the time-frequency plane for a region of Signal no. 6. (a) Time-domain signal. (b) Components extracted with P-ESM. (c) Components extracted with \(\ell_2\)-ESM.
Fig. 8. SNR_P,seg values for the full-band estimation schemes \(\ell_2\)-ESM and P-ESM and the decimated schemes \(\ell_2\)-ESM-Dec and P-ESM-Dec, for K = 50.
First, the signal to be modeled was decimated by a factor of two using a 10th-order antialiasing FIR filter, after which the decimated signal was modeled using P-ESM and \(\ell_2\)-ESM for parameter estimation; we denote the algorithms in the decimated domain by P-ESM-Dec and \(\ell_2\)-ESM-Dec, respectively. In the decimated domain, a frame length of 512 samples was used, and the impulse response length of the perceptual filter was 128 samples (5.8 ms). The relative matrix dimensions L/N remained the same as in the full-band case for P-ESM-Dec and \(\ell_2\)-ESM-Dec, respectively (a study of SNR_P,seg vs. L/N for P-ESM-Dec and \(\ell_2\)-ESM-Dec showed performance curves similar to those of Fig. 5 for the full-band algorithms). Finally, for comparison with the original signal, the modeled signal was reconstructed at the original sample rate of 44.1 kHz.

Fig. 8 compares the modeling performance of the decimated algorithms with that of the full-band schemes for K = 50. From this figure we conclude that the SNR_P,seg performance with P-ESM-Dec and \(\ell_2\)-ESM-Dec typically is at least as good as that of their full-band counterparts. In fact, rejecting the upper frequency band before modeling tends to increase SNR_P,seg slightly; a reason for this is that the signal content in the high-frequency band is mainly nonsinusoidal and therefore, in effect, contributes as noise in the estimation process. The main advantage, though, of performing the parameter estimation in the decimated domain is in terms of computational load, because the frame length, and thus the matrix dimensions L and M, have been reduced by a factor of 2. Since computations are dominated by the SVD of the data matrix, having a complexity of approximately O(L^2 M), we would expect a decrease in computational load by a factor of 8 for the SVD (assuming a constant L/M ratio).
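A sketch of the decimated analysis path (ours; `p_esm` refers to the pipeline sketch in Section III-D, and the unit-impulse filter is a placeholder for the perceptual weighting filter):

```python
import numpy as np
from scipy.signal import decimate

# P-ESM-Dec sketch: halve the sample rate with an antialiasing FIR filter,
# model in the decimated domain, then synthesize for comparison.
fs = 44100
x = np.random.default_rng(1).standard_normal(1024)    # stand-in frame
x_dec = decimate(x, q=2, n=10, ftype="fir")           # 10th-order FIR antialiasing

h = np.r_[1.0, np.zeros(127)]                         # placeholder filter, 128 taps
z, c = p_esm(x_dec, K=50, h=h, L=128)                 # analysis at 22.05 kHz
x_hat_dec = (np.vander(z, len(x_dec), increasing=True).T @ c).real

# For comparison with the original, the model would be resynthesized at
# 44.1 kHz, e.g., by evaluating the damped sinusoids on a twice-finer
# time grid (interpolation details omitted in this sketch).
```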
In our algorithm implementations, the P-ESM-Dec algorithm requires 20% of the computations used for P-ESM and 16% of the computations used for \(\ell_2\)-ESM (as measured with the "flops" counter in Matlab). The conclusions drawn from the K = 50 case in Fig. 8 remain valid when a lower fixed model order is used. Informal listening tests support the objective performance measures in Fig. 8: signals generated with the decimated schemes P-ESM-Dec and \(\ell_2\)-ESM-Dec are almost perceptually identical to signals generated with the full-band schemes.

E. Subjective Comparison Test With Audio Signals

In order to determine the possible subjective advantages of the proposed scheme, the seven test signals listed in Appendix I were modeled with P-ESM and \(\ell_2\)-ESM for a fixed model order and compared in a listening test. We restricted ourselves to the full-band algorithms only. The test signals were presented to the listeners as triplets OAB or OBA, where O was the original signal, A was the signal modeled with \(\ell_2\)-ESM, and B was the signal modeled with P-ESM. The task of the listener was to indicate which signal (A or B) was perceptually closest to the original O. Each test signal triplet was presented a total of four times during the test, and the order (OAB or OBA) in which the signals were presented was selected randomly for each presentation. Nine listeners participated in the test; the authors did not participate. The preference for P-ESM averaged across the listeners is shown in Table I. As can be seen, P-ESM performs better than \(\ell_2\)-ESM for all test signals.

A set of additional subjective evaluations was done by the authors and resulted in a number of conclusions on the perceptual quality of the modeled signals. Generally, P-ESM leads to modeled signals of considerably higher subjective quality than \(\ell_2\)-ESM, with the size of the difference depending on the model order. For some types of signals, e.g., Signals no. 1, 4, and 5, the subjective difference is very significant, while for other signals, e.g., Signals no. 2 and 3, the difference is less distinct although still clearly noticeable. We note that these subjective observations are well in line with Figs. 3 and 4, which showed larger improvements for more noise-like signals (such as Signals no. 1, 4, and 5) compared to signals with a larger periodic content (such as Signals no. 2 and 3). With \(\ell_2\)-ESM, regions of the modeled signals sometimes sound "narrowband"; this artifact is eliminated with P-ESM (see also Fig. 7). Further, signals modeled with \(\ell_2\)-ESM occasionally have a background of "musical noise"; this artifact, too, is not present in the P-ESM signals.

In a few cases P-ESM introduces artifacts which are not observed in the \(\ell_2\)-ESM signals. For example, the unvoiced speech sounds /s/ in Signal no. 6 have a more reverberant/tonal quality with P-ESM. One explanation is that \(\ell_2\)-ESM tends to model noise-like sounds by means of clusters of closely spaced sinusoidal components, thereby creating a signal resembling bandpass-filtered noise. P-ESM, on the other hand, tends to spread sinusoidal components more uniformly across frequency, creating a signal that is perceived as tonal. The tonal artifact can be eliminated by increasing the model order for the signal frames in question or, more efficiently, by introducing an additional signal model, e.g., a filtered-noise representation, to model the noise-like signal regions.
TABLE I
PREFERENCE FOR P-ESM OVER \(\ell_2\)-ESM IN SUBJECTIVE COMPARISON TEST
V. CONCLUSION

This paper considered the problem of approximating a signal frame with the exponential sinusoidal model (ESM), which represents the signal frame as a linear combination of exponentially damped, constant-frequency sinusoidal functions. Traditionally, model components have been estimated using the HTLS algorithm [29], which does not take the perceptual relevance of the estimated components into account. In this work we proposed a method to incorporate auditory information in the estimation process in order to extract only perceptually relevant model components. Perception based estimation algorithms are of importance in model based audio and speech processing in cases where the signal receiver is human. The proposed algorithm, named P-ESM, combines the prefiltered HTLS algorithm [37] and a recently developed perceptual distortion measure [35] in an attempt to minimize a signal-dependent, weighted \(\ell_2\)-norm.

The proposed method was studied in a number of simulation experiments. First, the differences between the proposed method and the traditional HTLS based method, denoted \(\ell_2\)-ESM here, were demonstrated in a number of simple modeling examples, using synthetically generated signal frames such as stationary sinusoidal signals, as well as natural signals such as voiced and unvoiced speech sounds. The examples verified that while \(\ell_2\)-ESM focuses on high-energy spectral regions, the perceptually based algorithm, P-ESM, tends, as expected, to model spectral regions which are perceptually important.

Secondly, the influence of the data matrix dimensions on modeling performance was studied. In particular, modeling performance was evaluated as a function of L/N (L is the number of rows in the data matrix, and N is the number of samples in a signal frame, which is assumed constant), where L/N was varied in the range 0.08-0.93, resulting in taller and narrower matrices for increasing L/N. The traditional \(\ell_2\)-ESM method had a broad performance maximum roughly centered around L/N = 1/2, supporting the assumption that square data matrices lead to the best performance. The P-ESM method, however, had a narrower but significantly larger maximum, centered around a smaller value of L/N, i.e., at fat data matrices. Moreover, since the computational load in both P-ESM and \(\ell_2\)-ESM is dominated by an SVD of the data matrix and therefore is approximately O(L^2 M), the P-ESM performance gain can be obtained at computationally lower expense; in practice, P-ESM spent 80% of the computations used for \(\ell_2\)-ESM (as measured by the "flops" counter in Matlab).

Thirdly, an analysis was performed which aimed at determining the distribution of sinusoidal components across frequency for P-ESM as well as the traditional \(\ell_2\)-ESM algorithm. The analysis showed that for a range of different wideband audio signals sampled at 44.1 kHz, most sinusoidal components
occurred at low frequencies. To be more exact, for a model order of K = 50 (approximately 25 real damped sinusoids per signal frame), P-ESM and \(\ell_2\)-ESM placed 99.6% and 99.1%, respectively, of all components below 11 kHz. This observation is significant because it suggests that typical wideband audio signals can be decimated by at least a factor of 2, thereby further reducing the computational demands of the estimation algorithms, without sacrificing any modeling performance. Simulation experiments with decimated versions of the algorithms verified that objective as well as subjective modeling performance remains virtually identical to the performance of the full-band algorithms, while the computational expense of P-ESM-Dec and \(\ell_2\)-ESM-Dec is roughly 16-20% of the computational load of P-ESM and \(\ell_2\)-ESM.

Finally, to determine the potential advantages of incorporating a perceptual distortion measure in the parameter estimation procedure, signals modeled with P-ESM were compared to signals modeled with the traditional \(\ell_2\)-ESM method in a subjective listening test involving nine listeners. This comparison test showed that the signals generated with P-ESM were of considerably higher perceptual quality than those of \(\ell_2\)-ESM (preference in the range 69%-94% for P-ESM). In particular, the "narrowband" quality of the \(\ell_2\)-ESM signals was eliminated with P-ESM. Furthermore, the comparisons showed that the "musical" background noise often present in signals modeled with \(\ell_2\)-ESM was not noticeable in the signals modeled with P-ESM.

APPENDIX
OVERVIEW OF TEST SIGNALS

The following test signals were used in the evaluation of the presented algorithms. All signals were sampled at a frequency of 44.1 kHz.
ACKNOWLEDGMENT

The authors wish to thank O. Niamut for helping prepare the listening test, and they thank the nine participants in the listening test. They would also like to thank the anonymous reviewers for their helpful and constructive remarks.

REFERENCES

[1] P. Hedelin, “A tone-oriented voice-excited vocoder,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1981, pp. 205–208.
[2] L. B. Almeida and J. M. Tribolet, “Harmonic coding: A low bit-rate, good-quality speech coding technique,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1982, pp. 1664–1667.
[3] R. J. McAulay and T. F. Quatieri, “Sinusoidal coding,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. New York: Elsevier, 1995, ch. 4.
[4] T. F. Quatieri and R. J. McAulay, “Noise reduction using a soft-decision sine-wave vector quantizer,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1990, pp. 821–824.
[5] J. Jensen and J. H. L. Hansen, “Speech enhancement using a constrained iterative sinusoidal model,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 731–740, Oct. 2001.
[6] T. F. Quatieri and R. J. McAulay, “Speech transformations based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 1449–1464, 1986.
[7] E. B. George and M. J. T. Smith, “Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model,” IEEE Trans. Speech Audio Processing, vol. 5, no. 5, pp. 389–406, 1997.
[8] Y. Stylianou, “Applying the harmonic plus noise model in concatenative speech synthesis,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 232–239, Mar. 2001.
[9] X. Serra and J. Smith III, “Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition,” Comput. Music J., vol. 14, no. 4, pp. 12–24, 1990.
[10] E. B. George and M. J. T. Smith, “Analysis-by-synthesis overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones,” J. Audio Eng. Soc., vol. 40, no. 6, pp. 497–516, June 1992.
[11] B. Edler, H. Purnhagen, and C. Ferekidis, “ASAC – Analysis/synthesis codec for very low bit rates,” in Preprint 4179 (F-6), 100th AES Convention, 1996.
[12] K. N. Hamdy, M. Ali, and A. H. Tewfik, “Low bit rate high quality audio coding with combined harmonic and wavelet representation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, pp. 1045–1048.
[13] T. S. Verma and T. H. Y. Meng, “A 6 kbps to 85 kbps scalable audio coder,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, pp. 877–880.
[14] R. J. McAulay and T. F. Quatieri, “Magnitude-only reconstruction using a sinusoidal model,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1984, pp. 27.6.1–27.6.4.
[15] L. B. Almeida and F. M. Silva, “Variable-frequency synthesis: An improved harmonic coding scheme,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1984, pp. 27.5.1–27.5.4.
[16] M. Goodwin, “Matching pursuit with damped sinusoids,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1997, pp. 2037–2040.
[17] G. Li, L. Qiu, and L. K. Ng, “Signal representation based on instantaneous amplitude models with application to speech synthesis,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 353–357, May 2000.
[18] J. Laroche, “The use of the matrix pencil method for the spectrum analysis of musical signals,” J. Acoust. Soc. Amer., vol. 94, no. 4, pp. 1958–1965, Oct. 1993.
[19] J. Nieuwenhuijse, R. Heusdens, and E. F. Deprettere, “Robust exponential modeling of audio signals,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 3581–3584.
[20] J. Jensen and R. Heusdens, “A comparison of sinusoidal model variants for speech and audio representation,” in Proc. Eur. Signal Processing Conf., 2002, pp. 479–482.
[21] K. Hermus, W. Verhelst, and P. Wambacq, “Psycho-acoustic modeling of audio with exponentially damped sinusoids,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1821–1824.
[22] P. Lemmerling, I. Dologlou, and S. Van Huffel, “Speech compression based on exact modeling and structured total least norm,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 353–356.
[23] J. Jensen, S. H. Jensen, and E.
Hansen, “Exponential sinusoidal modeling of transitional speech segments,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 473–476.
[24] “Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 MBit/s – Part 3: Audio,” ISO/MPEG Committee, ISO/IEC 11172-3, 1993.
[25] R. D. di Iacovo and R. Montagna, “Some experiments in perceptual masking of quantizing noise in analysis-by-synthesis speech coders,” in Proc. Eurospeech, 1991, pp. 825–828.
[26] D. Sen and W. Holmes, “Perceptual enhancement of CELP speech coders,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994, pp. II-105–II-108.
[27] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. Signal Processing, vol. 41, pp. 3397–3415, Dec. 1993.
[28] S. Y. Kung, K. S. Arun, and D. V. B. Rao, “State-space and singular-value decomposition-based approximation methods for the harmonic retrieval problem,” J. Opt. Soc. Amer., vol. 73, no. 12, pp. 1799–1811, 1983.
[29] S. Van Huffel, H. Chen, C. Decanniere, and P. Van Hecke, “Algorithm for time-domain NMR data fitting based on total least squares,” J. Magn. Reson. A, vol. 110, pp. 228–237, 1994.
[30] A.-J. Van der Veen, E. F. Deprettere, and A. L. Swindlehurst, “Subspace-based signal analysis using singular value decomposition,” Proc. IEEE, vol. 81, no. 9, pp. 1277–1308, 1993.
[31] T. S. Verma and T. H. Y. Meng, “Sinusoidal modeling using frame-based perceptually weighted matching pursuits,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 981–984.
[32] R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,” IEEE Signal Processing Lett., vol. 9, pp. 262–265, Aug. 2002.
[33] R. Heusdens and S. van de Par, “Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1809–1812.
[34] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[35] S. van de Par et al., “A new psychoacoustical masking model for audio coding applications,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 1805–1808.
[36] G. Charestan, R. Heusdens, and S. van de Par, “A gammatone-based psychoacoustical modeling approach for speech and audio coding,” in Proc. ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, 2001, pp. 321–326.
[37] H. Chen, S. Van Huffel, and J. Vandewalle, “Bandpass prefiltering for exponential data fitting with known frequency regions of interest,” Signal Process., vol. 48, pp. 135–154, 1996.
[38] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem – Computational Aspects and Analysis. Philadelphia, PA: SIAM, 1991.
[39] S. Van Huffel, “Enhanced resolution based on minimum variance estimation and exponential data modeling,” Signal Process., vol. 33, no. 3, pp. 333–355, 1993.
Jesper Jensen (S’96–M’00) received the M.Sc. and Ph.D. degrees from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively, both in electrical engineering. From 1996 to 2001, he was with the Center for PersonKommunikation (CPK), Aalborg University, as a Researcher, Ph.D. student, and Assistant Research Professor. In 1999, he was a Visiting Researcher at the Center for Spoken Language Research, University of Colorado at Boulder. Currently, he is a Postdoctoral Researcher at Delft University of Technology, Delft, The Netherlands. His main research interests are digital speech and audio signal processing, including coding, synthesis, and enhancement.
Richard Heusdens received the M.Sc. and Ph.D. degrees from the Delft University of Technology, Delft, The Netherlands, in 1992 and 1997, respectively. In the spring of 1992, he joined the Digital Signal Processing Group at the Philips Research Laboratories, Eindhoven, The Netherlands, where he worked on various topics in the field of signal processing, such as image/video compression, and VLSI architectures for image-processing algorithms. In 1997, he joined the Circuits and Systems Group of the Delft University of Technology, where he was a Postdoctoral Researcher. In 2000, he moved to the Information and Communication Theory (ICT) Group where he became an Assistant Professor, responsible for the audio and speech processing activities within the ICT group. Since 2002, he has been an Associate Professor. His research interests include signal processing, in particular audio and speech processing and information theory.
Søren Holdt Jensen (S’87–M’88–SM’00) was born in Denmark in 1964. He received the M.Sc. degree in electrical engineering from Aalborg University, Aalborg, Denmark, and the Ph.D. degree from the Technical University of Denmark, Lyngby, Denmark. Currently, he is an Associate Professor with the Department of Communication Technology, Aalborg University. Before joining Aalborg University, he was with the Telecommunications Laboratory of Telecom Denmark, the Electronics Institute of the Technical University of Denmark, the Scientific Computing Group of the Danish Computing Center for Research and Education (UNI C), and the Electrical Engineering Department of Katholieke Universiteit Leuven, Belgium. His research activities are in digital signal processing, digital communications, and speech and audio signal processing. He is a member of the editorial board of the Journal on Applied Signal Processing. Dr. Jensen is a former Chairman of the IEEE Denmark Section and the IEEE Denmark Signal Processing Chapter.