IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006


Low-Complexity Feature-Mapped Speech Bandwidth Extension

Harald Gustafsson, Ulf A. Lindgren, and Ingvar Claesson

Manuscript received April 24, 2002; revised November 15, 2004. This work was supported by the Foundation for Knowledge and Competence Development and by Ericsson Mobile Platforms, Sweden. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Li Deng. H. Gustafsson is with the Research Department, Ericsson Mobile Platforms AB, 221 83 Lund, Sweden, and also with the ITS, Blekinge Institute of Technology, 372 25 Ronneby, Sweden (e-mail: [email protected]). U. A. Lindgren is with Ericsson Mobile Platforms AB, 221 83 Lund, Sweden. I. Claesson is with the ITS, Blekinge Institute of Technology, 372 25 Ronneby, Sweden. Digital Object Identifier 10.1109/TSA.2005.855837

Abstract—Today’s telecommunications systems use a limited audio signal bandwidth. A typical bandwidth is 0.3–3.4 kHz, but recently it has been suggested that mobile phone networks should facilitate an audio signal bandwidth of 50 Hz–7 kHz, since an increased bandwidth improves the sound quality of the speech signals. Since initially only a few telephones will have this facility, a method is suggested that extends the conventional narrow-band speech signal into a wide-band speech signal using only the receiving telephone. This gives the impression of a wide-band speech signal. The proposed speech bandwidth extension method is based on models of speech acoustics and fundamentals of human hearing. The extension maps each speech feature separately. Care has been taken to deal with implementation aspects, such as noisy speech signals, speech signal delays, computational complexity, and processing memory usage.

Index Terms—Speech analysis, speech enhancement, speech synthesis.

I. INTRODUCTION

SPEECH bandwidth extension methods denote techniques for generating frequency bands that are not present in the input speech signal. A speech bandwidth extension method uses the received speech signal and a model for extending the frequency bandwidth. The model can include knowledge of how speech is produced and how speech is perceived by the human hearing system. In telephone communications, a sufficiently reliable narrow frequency-band speech signal of 0.3–3.4 kHz is used at a sampling rate of 8 kHz [1]. The limited bandwidth is sufficient, although it affects the perceived sound quality compared to face-to-face communication. In order to improve the sound quality, it has been suggested to increase the speech signal bandwidth in mobile phone systems [2]. A possible way to achieve an extension is to use an improved speech COder/DECoder (CODEC) such as the Adaptive Multi Rate Wide Band (AMR-WB). However, using an AMR-WB CODEC requires that both telephones at the ends of the communication link support it. During a transition period it is likely that mobile phones and systems do not support new CODECs.

In addition, it is unlikely that wire-line telephony will use new wide-band CODECs in the near future. Mobile phones communicating with wire-line phones can therefore not utilize the enhanced features of new CODECs. To overcome this limitation the received speech signal can be modified. The modification investigated in this paper is meant to artificially increase the bandwidth of the speech signal. The artificial bandwidth extension will most likely give less enhancement than an improved speech CODEC. Still, a speech sound with a widened characteristic will be experienced. It is assumed that the extended signal can be synthesized from information obtained in the narrow frequency band. This is justified by the fact that the speech signal at all frequencies is produced by the same articulation configuration. The extended synthesized signal cannot be identical to the originally produced wide-band signal. For example, fricated speech sounds contain noise whose characteristics can be modeled, but the exact same waveform cannot be synthesized. Parameters such as vocal tract shape and excitation source describe the articulation configuration. Hence, speech content over all frequencies can be synthesized from the same parameters. The challenge is to reliably estimate the parameters from the speech signal. This is difficult to accomplish with such accuracy that a synthesis can be obtained without perceptual distortions [3].

The outline of this paper is as follows. Previous Speech Bandwidth Extension (SBE) methods are described in Section II, followed by the presently investigated SBE method in Section III. Simulations necessary to derive parameter settings for the extension are discussed in Section IV. Results are presented in Section V and, finally, conclusions in Section VI.

II. SPEECH BANDWIDTH EXTENSION METHODS

Speech bandwidth extension methods have been suggested for frequency bands both at frequencies higher and lower than the original narrow frequency band. For convenience these frequency bands are henceforth termed low-band, narrow-band, and high-band. The AMR-WB speech CODEC uses a bandwidth of 50 Hz–7 kHz at a sampling rate of 16 kHz [2]. Typical bandwidths used in SBE are 50 Hz–300 Hz, 300 Hz–3.4 kHz, and 3.4 kHz–7 kHz for the low-band, narrow-band, and high-band, respectively.

Early speech bandwidth extension methods date back more than a decade [4]. Similar to speech CODECs, SBE methods often use an excitation signal and a filter. A simple method to extend the speech signal into the higher frequencies is to up-sample by two, neglecting the antialiasing filter [5]. The lack of an antialiasing filter causes the original spectrum to be mirrored at half the new bandwidth. The wide-band extended


signal will have mirrored speech content up to at least 7 kHz [5]. A drawback with this method is the speech-energy gap in the 3.4–4.6 kHz region. The speech-energy gap is the result of telephone-bandwidth signals not having much energy above 3.4 kHz. When the speech spectrum is mirrored, the speech content in the high-band generally becomes nonharmonic even when the narrow-band contains a harmonic spectrum. This is a major disadvantage of the simple mirroring method. Harmonics are perceptually important since sounds having harmonics of the same fundamental frequency are perceptually grouped into a single sound [6]. The perceptual grouping helps in separating a complex sound input into entities, e.g., separating speech signals from background noise signals. Formants excited by the harmonics of the same fundamental frequency are perceptually grouped to one speech sound [6]. The perceptual grouping according to fundamental frequency is so strong that it is more important than grouping according to the direction of arrival of a sound [6].

The extended speech spectrum must be shaped, or at least the energy level must be altered, to suit the narrow-band speech signal. The modeled spectral shape should mimic the spectral shapes of different speech sounds. The simplest methods make use of a fixed shaping filter and a fixed gain to shape the aliased/folded excitation signal [5]. A development of folding techniques, such as [5], exploits a linear predictor as a means to analyze the narrow-band speech signal [7]. The linear prediction is evaluated from short time segments. The narrow-band signal is subsequently inverse-filtered by the predictor filter, resulting in a residual signal. The residual signal is up-sampled in the same manner as in [5] to generate a wide-band excitation signal [8]. The main advantage is that the spectral envelope of the excitation signal is approximately flat. A sinusoidal model can also generate the voiced excitation [9], [10]. In [9] and [10] the sinusoidal model is complemented with a noise source.

More advanced models of spectral envelope extension are based on speech signal statistics. Such models may utilize codebook mapping [9]–[12], mapping by filters [13], [14], or Gaussian mixture models (GMMs) [15], [16], all having in common that a database of wide-band speech is utilized to estimate a statistical connection between the narrow-band signal and the spectral envelope of the high-band signal. In the present context a codebook is a collection of vectors. Each codebook vector models a specific spectral envelope. A matched narrow-band spectral vector corresponds to an index in the codebook. This index also defines a suitable spectral envelope of the high-band, which is stored in a corresponding codebook. In [12] a codebook approach is used. The corresponding codebook contains autoregressive (AR) coefficients modeling the wide-band spectral envelope, not only the high-band spectral envelope. The inverse AR-filter is used to filter the narrow-band signal to obtain the excitation signal. The excitation signal is extended as in [8] and subsequently filtered by the AR-filter. The codebook maps both the spectral envelope and an implied excitation-signal energy ratio between the high-band and the narrow-band. The codebook lookup utilizes a Hidden Markov Model (HMM) to find the best-matched vector.
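As a rough illustration of the codebook mapping just described, the following minimal sketch (an assumed form, not code from any of the cited papers) looks up the nearest narrow-band envelope vector and returns the paired high-band envelope; real systems add an excitation model and, in [12], an HMM-based lookup.

```python
import numpy as np

def codebook_extend(nb_envelope, nb_codebook, hb_codebook):
    """Codebook-mapping sketch: find the nearest narrow-band envelope
    vector and return the paired high-band envelope. Both codebooks are
    assumed to be (size, dim) arrays trained jointly on wide-band speech."""
    idx = np.argmin(np.sum((nb_codebook - nb_envelope) ** 2, axis=1))
    return hb_codebook[idx]
```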

Epps et al. [9] suggested that two codebooks be used, one for voiced speech and another for unvoiced speech. Compared with other codebook techniques using the same total codebook size, a lower spectral distortion is reported for the dual-codebook SBE method [9]. The comparison is not entirely fair since the codebook size omits the voiced/unvoiced classification information. Generally, a codebook divided into subcodebooks results in a suboptimal vector quantization [17]. However, an important benefit is a reduced computational complexity of the index lookup procedure [17].

Statistical mapping between the narrow-band and high-band spectral envelopes can also be performed by a filtering of narrow-band parameters. Trajectories of the narrow-band cepstral coefficients have been used to derive the cepstral coefficients of the high-band using a least squares technique [13]. Unfortunately, the reasons for using cepstral coefficients in the mapping are not stated in [13]. The result reported is an inaccurate estimation of the spectral envelope in the high-band.

Gaussian mixture models (GMMs) can be used to map the narrow-band signal to the high-band spectral envelope. A GMM is a set of weighted multivariate Gaussian distributions. The mixed Gaussian probability density function (pdf) constitutes an approximation of a process pdf. In [15], GMMs are used to describe the distribution of spectral vectors. The spectral vectors describe the spectral envelopes of the narrow-band signal and of the wide-band signal. A minimum mean square error solution is found for the mapping of the narrow-band spectral vector to the high-band spectral vector. The estimated high-band spectral vector is a sum of probability-weighted mean spectral vectors in the GMM. The excitation-signal energy ratio between the high-band and narrow-band is separately derived using a codebook method. The corresponding high-band codebook has only a single parameter for the energy ratio. For a GMM size of 128 and a codebook size of 128, it was reported that the GMM method outperformed conventional codebook mappings [11] in both subjective and objective tests [15].

In [16] Nilsson et al. suggested a GMM-based bandwidth extension method incorporating an estimated energy ratio between the narrow-band and high-band signals. It is observed that an overestimation of the high-band energy is perceptually more disturbing than an underestimation [16]. An asymmetric cost function is used which penalizes overestimation more than underestimation. The asymmetric cost function gives subjectively fewer artifacts, but is also more conservative in the bandwidth extension [16].

III. LOW-COMPLEXITY FEATURE-MAPPED SPEECH BANDWIDTH EXTENSION METHOD

The proposed speech bandwidth extension method maps each speech feature of the narrow-band signal to a similar feature of the high-band and low-band. The method is thus named Feature-Mapped Speech Bandwidth Extension (FM-SBE). A high-band synthesis model based on speech signal features is used. The relation between the narrow-band features and the high-band model is partly obtained from statistical characteristics of speech data containing the original high-band. The remaining part of the mapping is based on speech acoustics.


Fig. 1. Overview of the suggested bandwidth extension method. The method uses an analysis of the narrow-band. The low-band and high-band are synthesized separately from parameters derived from the narrow-band. The analysis results in the parameters: narrow-band excitation $e(n)$, peak frequencies $\omega_p(k)$, spectral vector $\mathbf{Q}$, speech sound classification $CTRL$, speech dynamics $\bar{P}$, first peak's power $P(1)$, and the pitch frequency $\omega_0$.

The low complexity of the FM-SBE method refers to the computational complexity of the mapping from the narrow-band parameters to the wide-band parameters; see Appendix II for a summary of the computational complexity. The FM-SBE method is designed to operate under moderate levels of background noise. Some of the background noise from the other end of the communication link is transmitted to the receiving mobile phone. A moderate or lower noise level can be expected in the received narrow-band signal since the amount of noise rejection for mobile telecommunication systems is specified [1]. Noise reduction methods should be applied prior to SBE in cases where the received noise level from the far end is high.

Previous SBE methods found in the literature use an estimate of a spectral vector describing the entire input signal spectrum. Spectrum estimators based on linear-predictor coefficients or cepstral coefficients have been used. The presently investigated method anticipates that noise may be present in the input signal. Hence, a spectral vector less sensitive to noise is chosen. Here, it is assumed that the peaks in a noisy speech spectrum are more likely to belong to the speech signal than to the noise signal. The FM-SBE method exploits the spectral peaks for estimating the narrow-band spectral vector, neglecting low-energy regions in the narrow-band spectrum. The background noise then has less influence on the spectral vector, giving a more reliable estimate of the narrow-band speech characteristics. This is viable when the new spectral vector is sufficient for describing the important speech characteristics in the narrow-band.

The FM-SBE method is divided into an analysis and a synthesis part; see Fig. 1. The analysis part has the narrow-band signal as input and results in the parameters that control the synthesis. The synthesis generates the extended-bandwidth speech signal. The analysis and synthesis are processed on segments of the input signal. Each segment has a duration of 20 ms. The synthesized low-band speech signal, $s_{lb}(m,n)$, and the synthesized high-band speech signal, $s_{hb}(m,n)$, are added to the up-sampled narrow-band signal, $s_{nb}(m,n)$, which generates the wide-band speech signal

$$s_{wb}(m,n) = s_{nb}(m,n) + s_{lb}(m,n) + s_{hb}(m,n) \quad (1)$$

where $m$ denotes the segment number and $n$ the sample number.

Fig. 2. Overview of some of the parameters used in the proposed FM-SBE method. Narrow-band spectral envelope peaks are given as coordinates.

The individual processing of the three frequency bands can give rise to lags between the signals in the bands. Human hearing is insensitive to phase differences in mono signals, and a fusion to one sound is obtained even if parts of a sound lag slightly behind other parts of the sound [6]. The lag between signals in the different bands should be kept below the duration of a segment, since fusion is important for the perception of a bandwidth-extended signal.

More important is a low total delay of the processed signal. Many telecommunication systems specify a maximum total system delay for a speech signal; for GSM systems see [1]. The delay each operation demands is taken from this total delay budget. Choosing a low-delay speech bandwidth extension method makes it possible to introduce more features, which could otherwise cause an unacceptable total delay. Other aspects that should be considered are filter characteristics and the computational complexity of the method. A more complex method often demands a longer processing time and hence causes a longer delay. Increasing the processor rate at the expense of the battery time of the mobile device can of course shorten the delay.

It is well established in speech processing research that a speech signal segment can be fairly accurately modeled by an excitation signal and an all-pole model [7]. Some of the important features of the articulation configuration are the vocal tract resonances and the excitation sources [18]. The present method relies on the narrow-band signal being partly modeled by these features; see also Figs. 1 and 2. The low-band and high-band are synthesized based on these features. The FM-SBE method derives an amplitude level of the high-band spectrum from logarithmic amplitude peaks in the narrow-band spectrum. Resonances are synthesized in the high-band at frequencies derived from peak frequencies in the narrow-band during voiced speech sounds. When the narrow-band is voiced, the harmonic spectrum is continued into both the high-band and the low-band, by copying the spectrum and by direct sine-tone synthesis, respectively.


Fig. 3. Overview of the narrow-band speech analysis. The analysis derives parameters describing the vocal tract resonances and speech excitation sources.

A. Narrow-Band Speech Signal Analysis The only input at hand for SBE methods is the narrow-band signal. Evidently, the synthesis of the low-band and high-band signals are dependent on an estimation of parameters from this narrow-band signal. The analysis part of the bandwidth extension method consists of a linear predictor, a pitch activity detector (PAD), a pitch frequency estimator, a fricated speech activity detector (FAD), a voice activity detector (VAD), and a formant peaks amplitude and frequency estimator; see Fig. 3. The analysis is made on each short-time segment of the narrow-band signal and the estimates are updated for each segment. A short-time segment consists of samples. The narrow-band speech signal, , is modeled by an , and an excitation signal, all-pole filter,

a varying number of peaks it was observed that the mapping often gave artificial jumps in the high-band amplitude level between segments. The second approach is then used to estimate the rest of the peaks. The narrow-band AR-spectrum, , is divided equally wide subfrequency-bands. The frequency bin into having the maximum amplitude in each subfrequency-band is local maxima considered as a potential peak. Since only exist some of the subfrequency-bands do not contain a peak derived from a local maxima. It is also possible that several local maxima fall within one subband. When only one subband lacks a peak the potential peak in that subband is selected as a peak. When several subbands lack a peak, a peak is selected from the potential peaks repeatedly until a total of peaks are obtained. The selection process choose the potential peak in a subband which have no peaks and is furthest from the subbands containing most peaks, e.g., when most peaks are at low frequencies the potential peak in the highest subband is selected. The result , at the local maxima and the seis power estimates lected peaks, organized according to frequency; see Fig. 2. To improve the robustness against background noise the first , are further processed. This processing is inpeaks, spired by spectral subtraction methods; see for example [19]. of Estimates of the average background noise power the peaks during nonspeech periods are calculated by an exponential averaging window. The average background noise level , are subtracted from the current peak at peak positions, powers to obtain noise reduced peak powers. A compensation for different levels of the speech signal is needed to perform the feature mapping properly. The peak powers are transformed to a logarithm domain were the average speech dynamics is subtracted

(2) (4) is the filter order. The filter parameters and where are estimated using a linear all-zeros prethe signal [7]. The polynomial dictor from the input signal, is monic, i.e., the first coefficient is equal to one. The amplitude function—autoregressive (AR) spectrum—of the all-pole narrow-band filter is

(3) where is the autocorrelation of the narrow-band is a discrete frequency and is the number of fresignal, quency bins; see also Figs. 2 and 3. The spectral vector is calculated from the peaks in the peaks are estimated from each AR-specAR-spectrum. . Local maxima in trum, where the constant give narrow-band peak frequencies , and is expressed in normalized where frequency. The number of local maxima, K, in each segment varies between segments. For segments having less than local maxima a second approach is used to estimate the rest of the peaks since empirically it has been found that the mapping performs best with a constant number of peaks. When using

where the speech signal level compensation utilize the average maximum speech spectral peak VAD VAD (5) Any elements in being minus infinity is set to zero. A spectral feature vector is obtained by setting the first element equal to unity and the following equal to the log-amplitude peaks (6) where denotes transpose. The detectors employed in the analysis part are: a prominent fricated speech sound activity detector (FAD), a pitch activity detector (PAD), and a general voice activity detector (VAD); see Fig. 3. The VAD used is similar to the VAD for discontinuous transmission specified in the GSM AMR vocoder specification [20]. The VAD detects when it is necessary to perform a bandwidth extension. A decision when the current segment contains voiced speech or not is derived from a pitch frequency estimator. In [21] many pitch estimators and detectors are found.
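The fixed-count peak selection described above can be sketched as follows; this is a simplified assumption-level illustration (the paper's distance-based selection order among empty subbands is reduced to a left-to-right scan).

```python
import numpy as np
from scipy.signal import find_peaks

def pick_spectral_peaks(ar_spectrum, num_peaks):
    """Sketch of Section III-A peak picking: local maxima first; if fewer
    than num_peaks exist, per-subband maxima fill the remaining slots.
    If more maxima exist than needed, the lowest-frequency ones are kept."""
    locs, _ = find_peaks(ar_spectrum)
    if len(locs) < num_peaks:
        edges = np.linspace(0, len(ar_spectrum), num_peaks + 1).astype(int)
        occupied = {int(np.searchsorted(edges, k, side="right")) - 1 for k in locs}
        for band in range(num_peaks):
            if len(locs) == num_peaks:
                break
            if band not in occupied:
                lo, hi = edges[band], edges[band + 1]
                locs = np.append(locs, lo + int(np.argmax(ar_spectrum[lo:hi])))
    return np.sort(locs)[:num_peaks]
```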


A fricated speech sound is typical of fricative and affricate consonants [18]. Fricatives and affricates have prominent speech content at high frequencies and only a small amount of perceptually important content at low frequencies. The aim of the FAD is to detect when the high-band shall contain a high level of unvoiced speech content. The boolean frication detection decision is calculated by

$$\text{FAD}(m) = \begin{cases} \text{true}, & \mathbf{W}_{FAD}^T \mathbf{Q}(m) < 0 \\ \text{false}, & \text{otherwise} \end{cases} \quad (7)$$

where $\mathbf{W}_{FAD}$ is the combinator of the peak log-amplitudes, and FAD is the boolean decision set to true when the product is negative; see also Fig. 3. The combinator $\mathbf{W}_{FAD}$ is calculated from a large set of spectral vectors and high-band amplitude-level data; see Section IV.

The high-band speech synthesis is derived differently depending on whether voiced speech, fricated speech, or neither voiced nor fricated speech is detected. The latter situation occurs when the narrow-band segment does not contain speech sounds at all or when the high-band lacks predominant speech content. An example is the situation when the current segment contains the complete closure part of a stop consonant. The three situations can be determined with control logic utilizing the detectors VAD, PAD, and FAD as

$$CTRL(m) = \begin{cases} \text{voiced}, & \text{VAD} \,\&\, \text{PAD} \\ \text{fricated}, & \text{VAD} \,\&\, \neg\text{PAD} \,\&\, \text{FAD} \\ \text{neither}, & \neg\text{VAD} \,|\, \left(\text{VAD} \,\&\, \neg\text{PAD} \,\&\, \neg\text{FAD}\right) \end{cases} \quad (8)$$

where $\neg$, $\&$, and $|$ are the logical not, and, and or operators, respectively; see also Fig. 3.
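Read as a three-way classifier, (8) amounts to the following sketch (a hypothetical helper with the detector outputs as booleans):

```python
def classify_segment(vad, pad, fad):
    """Three-way control decision following (8)."""
    if vad and pad:
        return "voiced"
    if vad and not pad and fad:
        return "fricated"
    return "neither"
```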

B. High-Band Speech Signal Synthesis

The high-band speech synthesis generates a high-frequency spectrum by shaping an extended excitation spectrum. The shape of the high-band spectrum is determined by a gain controlling the amplitude level, by resonance peaks, and by a fixed slope; see Fig. 4.

The excitation signal, $e(n)$, is extended upwards in frequency. A simple method to accomplish this is to copy the spectrum from lower frequencies to higher frequencies. The method is simple since it can be applied in the same manner on any excitation spectrum. As described in Section II, it is essential to continue a harmonic structure during the extension. Most of the higher harmonics cannot be resolved by the human hearing system. However, a large enough deviation from a harmonic structure in the high-band signal could lead to a rougher sound quality [6]. Previously, a pitch-synchronous transposing of the excitation spectrum has been proposed which continues a harmonic spectrum [22]. This transposing does not take into consideration the low energy level at low frequencies of telephone-bandwidth-filtered signals, giving small energy gaps in the extended spectrum. Energy gaps are avoided with the present method since the frequency band utilized in the copying is within the narrow-band. The full complex excitation spectrum, $E(\omega_k)$, is calculated on a grid, $\omega_k$, of frequencies using an FFT of the excitation signal $e(n)$.

Fig. 4. High-band speech synthesis of the suggested bandwidth extension method. The high-band signal is synthesized by applying a gain and filters to the excitation spectrum copied to the high-band. The bandpass filter gives a descending amplitude level in the high-band for increasing frequency. The formant filter used during voiced speech segments accentuates harmonic components in the excitation spectrum.

The spectrum of the excitation signal is divided into zones: the lower match zone, $\Omega_{lo}$, and the higher match zone, $\Omega_{hi}$; see Fig. 2. The frequency limits of the zones constitute the most narrow frequency band possible to utilize in the copying. During voiced speech segments, the power spectrum of the excitation, $|E(\omega_k)|^2$, has peaks regularly at an interdistance equal to the pitch frequency. The power spectrum is searched for the maximum spectrum power in the frequency ranges of the lower and higher match zones

$$\omega_{lo} = \arg\max_{\omega_k \in \Omega_{lo}} |E(\omega_k)|^2, \qquad \omega_{hi} = \arg\max_{\omega_k \in \Omega_{hi}} |E(\omega_k)|^2. \quad (9)$$

The part of the spectrum between the two local maxima is the frequency band utilized in the extension. A harmonic structure is continued since the maxima in the power spectrum likely coincide with harmonics of the pitch frequency. A method that preserves the harmonic structure of the excitation signal in the high-band will more likely give a perceptual fusion to one speech sound in the human hearing system. When the speech segment is unvoiced the method operates in the same manner, although no harmonic structure needs to be continued. To actually extend the excitation spectrum into higher frequencies, the spectrum between the two found maxima is copied repeatedly to the range from $\omega_{hi}$ up to the new Nyquist frequency. The complex-conjugated, mirrored second half of the spectrum, inherent to real-valued time signals, is then calculated. This results in a bandwidth-extended excitation spectrum, $E_{ext}(\omega_k)$, having a doubled sample rate; see Fig. 2.
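The copying step can be sketched as follows, assuming E holds the FFT of the narrow-band excitation and k_lo/k_hi are the bins found in (9); the bin bookkeeping details are assumptions.

```python
import numpy as np

def extend_excitation(E, k_lo, k_hi, n_fft_wb):
    """Match-zone copying sketch: the spectrum between the two harmonic
    maxima is copied repeatedly up to the new Nyquist bin, and conjugate
    symmetry is restored so the extended excitation stays real-valued."""
    E_ext = np.zeros(n_fft_wb, dtype=complex)
    E_ext[:len(E)] = E
    span = k_hi - k_lo
    k = k_hi
    half = n_fft_wb // 2
    while k < half:
        n = min(span, half - k)
        E_ext[k:k + n] = E[k_lo:k_lo + n]
        k += n
    E_ext[half + 1:] = np.conj(E_ext[1:half][::-1])  # real time signal
    return E_ext
```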


The extended excitation spectrum is amplified, giving a suitable amplitude level of the high-band; see Fig. 4. The amplitude level is calculated from the spectral vector, $\mathbf{Q}(m)$, by multiplying with a weighting vector. Since the spectral vector is in the logarithmic domain, an estimate of the logarithmic amplitude level of the high-band is provided. The logarithmic domain is used since the logarithmic amplitude level is approximately linear to the perceived loudness. The two states, voiced speech and fricated speech, are dealt with similarly, although different weighting vectors are employed. All weights used are estimated statistically from a large speech corpus containing wide-band speech signals; see Section IV. During voiced speech segments the logarithmic amplitude level is derived from the spectral vector, $\mathbf{Q}(m)$, by

$$A_v(m) = \mathbf{W}_v^T \mathbf{Q}(m) \quad (10)$$

where $\mathbf{W}_v$ is the combinator of the spectral vector. The speech dynamics $\bar{P}(m)$ is added to renormalize the log-amplitude level. The log-amplitude is transformed to a linear domain and normalized with the narrow-band excitation signal power, resulting in a gain

$$g_v(m) = \frac{\sqrt{e^{A_v(m) + \bar{P}(m)}}}{E_e(m)} \quad (11)$$

where $E_e(m)$ is the square root of the excitation energy in the narrow-band. During unvoiced speech segments with fricated speech the gain is set according to

$$g_f(m) = \frac{\sqrt{e^{A_f(m) + \bar{P}(m)}}}{E_e(m)} \quad (12)$$

where

$$A_f(m) = \mathbf{W}_f^T \mathbf{Q}(m) \quad (13)$$

and $\mathbf{W}_f$ is the weighting of the spectral vector for fricated speech segments. The final gain thus is

$$g(m) = \begin{cases} g_v(m), & \text{voiced} \\ g_f(m), & \text{fricated} \\ g_c, & \text{neither voiced nor fricated} \end{cases} \quad (14)$$

where $g_c$ is a low constant gain factor. The constant gain factor is derived by choosing a level that is lower than the minimum gain observed during voiced or fricated speech sounds. The low gain, $g_c$, is used when no extension is desired.
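Under the reconstruction above, the gain logic of (10)-(14) amounts to the following sketch; the weight vectors and g_const are placeholders for the statistically trained values of Section IV.

```python
import numpy as np

def high_band_gain(Q, p_bar, exc_rms, w_voiced, w_fricated, state, g_const=1e-3):
    """Sketch of (10)-(14): a trained weighting of the log-domain spectral
    vector gives the high-band log level, renormalized by the speech
    dynamics and the narrow-band excitation level."""
    if state == "voiced":
        log_power = w_voiced @ Q + p_bar
    elif state == "fricated":
        log_power = w_fricated @ Q + p_bar
    else:
        return g_const  # low constant gain when no extension is desired
    return np.sqrt(np.exp(log_power)) / exc_rms
```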

To enhance the speech signal further, the high-band speech generation may include a filter which gives spectral peaks at estimated resonance frequencies, $f_r(i)$, $i = 1, \ldots, I$, in the high-band range. Figs. 2 and 4 illustrate the resonance frequencies. The high-band may generally include the 4th–7th formants during voiced speech segments. The purpose of the resonances in the high-band is to shape the excitation signal. Such shaping results in accentuated harmonics. The accentuation will shift in frequency with the resonance frequencies, giving a harmonic signal with varying characteristics even during constant pitch periods.

The minimum-phase filter has one complex-conjugated pole pair and one complex-conjugated zero pair at the same angle for each synthetic resonance frequency. The poles are located closer to the unit circle than the zeros at the same angle. The frequency response of the filter is calculated as

$$H_r(\omega_k) = e^{\mathcal{F}\{c(n)\}} \quad (15)$$

where $\mathcal{F}$ is the FFT operator and

$$c(n) = \log(g_0)\,\delta(n) + \frac{2\,u(n-1)}{n} \sum_{i=1}^{I} \big(\rho_p^n(i) - \rho_z^n(i)\big)\cos\!\big(n\,\omega_r(i)\big) \quad (16)$$

is the cepstrum of the formant filter, where $\rho_z$ is a constant vector of radii of the zeros, $\rho_p$ is a constant vector of radii of the poles, and $g_0$ is a constant normalizing gain. For details of the calculation of the minimum-phase formant filter, see Appendix I. For higher formant frequencies the poles and zeros are positioned closer to the origin, i.e., the radii decrease with increasing $i$. The shorter radii yield an increased bandwidth for the higher formant frequencies.

The placement of synthetic formant frequencies in the high-band is based on estimated formant frequencies in the narrow-band speech signal. The peaks in the narrow-band AR-spectrum are used as estimates of the formant frequencies resulting from the vocal tract resonances. For these high-band formant frequency estimations only the actual AR-spectrum peaks, $\omega_p(k)$, $k \le K$, are used. The segment number is dropped in this section to simplify notation even though the parameters are updated for each segment. A well-established model of the vocal tract is a concatenated tube model with tubes of varying diameters [7]. The peaks at the two highest frequencies, $f_{K-1}$ and $f_K$, in the narrow-band are used in the calculation of the placement of the synthetic formants. This is justified by these estimated resonance frequencies being the most likely to be resonances of the same front-most cavity. When this front-most cavity is considered to be a uniform tube, open in the front and closed at the rear end, the resonance frequencies are

$$f_i = \frac{(2i-1)\,c}{4L} \quad (17)$$

where $c \approx 354$ m/s at body temperature and atmospheric pressure, and $L$ is the length of the tube. The use of a uniform tube is a rough approximation. To calculate the frequencies of the high-band resonances, an estimate of the front-most tube length and the resonance numbers is necessary. Assuming $f_{K-1}$ and $f_K$ are two consecutive resonance frequencies of the front-most cavity, the resonance numbers associated with $f_K$ and $f_{K-1}$ can be estimated by

$$\hat{n}_K = \frac{3 f_K - f_{K-1}}{2\,(f_K - f_{K-1})} \quad (18)$$

and

$$\hat{n}_{K-1} = \frac{f_K + f_{K-1}}{2\,(f_K - f_{K-1})}. \quad (19)$$

Another estimate of the resonance number corresponding to $f_K$ is the average of the previous estimates

$$\bar{n}_K = \frac{1}{2}\left(\hat{n}_K + \hat{n}_{K-1} + 1\right). \quad (20)$$


Since the resonance number is an integer, the estimate $\bar{n}_K$ is rounded to the nearest integer. By inserting it in (17) the tube length can be derived for each segment as

$$L = \frac{(2\bar{n}_K - 1)\,c}{4 f_K}. \quad (21)$$

A shorter distance between the frequencies implies a longer tube. The estimate is limited; a maximum tube length of 20 cm is assumed, which is a reasonable physical limit. The limitation results in a lowest allowed distance between the resonance frequencies of 0.9 kHz. The synthetic formant frequencies are then calculated with (17) for

$$i = \bar{n}_K + 1, \ldots, \bar{n}_K + I. \quad (22)$$

See also Fig. 2.
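Condensed, the formant-placement arithmetic of (17)-(22) as reconstructed here reads as below; since parts of (18)-(20) are reconstructed, the exact estimator in the original may differ.

```python
C_SOUND = 354.0  # m/s in air at body temperature, the paper's value

def synthetic_formants(f_prev, f_last, num_formants=4, max_len=0.20):
    """Quarter-wave tube sketch: estimate the front-cavity resonance
    number and length from the two highest narrow-band peaks (f_prev <
    f_last, in Hz), then place the next resonances in the high-band."""
    n_last = (3 * f_last - f_prev) / (2 * (f_last - f_prev))  # (18)
    n_prev = (f_last + f_prev) / (2 * (f_last - f_prev))      # (19)
    n = round(0.5 * (n_last + n_prev + 1))                    # (20), rounded
    length = min((2 * n - 1) * C_SOUND / (4 * f_last), max_len)  # (21)
    return [(2 * (n + i) - 1) * C_SOUND / (4 * length)        # (17), (22)
            for i in range(1, num_formants + 1)]
```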

Fig. 5. Lower speech synthesis of the suggested bandwidth extension method. Sine tones at the pitch frequency and its harmonics are generated and amplified to a fraction of the first formant amplitude level.

Additionally, a shaping minimum-phase bandpass filter, $H_{bp}(\omega_k)$, is applied to the spectrum. This filter has a lower cutoff frequency of 3.4 kHz and a descending level in the high-band. This reduces any synthesized content in the range of the narrow-band signal and introduces a perceptually pleasant shape of the spectrum. The high-band speech synthesis is

$$S_{hb}(\omega_k) = \begin{cases} g(m)\, H_{bp}(\omega_k)\, H_r(\omega_k)\, E_{ext}(\omega_k), & \text{voiced} \\ g(m)\, H_{bp}(\omega_k)\, E_{ext}(\omega_k), & \text{fricated or neither.} \end{cases} \quad (23)$$

Fig. 4 illustrates how the filters and gain are applied to the extended excitation spectrum. Subsequently, an IFFT transforms the synthesized high-band to the time domain

$$s_{hb}(m,n) = \mathcal{F}^{-1}\{S_{hb}(\omega_k)\}. \quad (24)$$

Consecutive high-band signal segments are concatenated by an overlap-and-add method [7].

C. Low-Band Speech Signal Synthesis

In conjunction with the bandwidth extension upward in frequency, a bandwidth extension downward in frequency is possible; see Fig. 5. The narrow telephone bandwidth has a lower cutoff frequency of 300 Hz. On a perceptual frequency scale, such as the Bark scale, the low-band covers approximately three Bark bands and the high-band covers four Bark bands. Thus the low-band is almost as wide as the high-band on a perceptual scale. During voiced speech segments most of the speech content in the low-band consists of the pitch and its harmonics. During unvoiced speech segments the low-band is not perceptually important. The suggested method of synthesizing speech content in the low-band is to introduce sine tones at the pitch frequency and its harmonics up to 300 Hz. Generally, the number of tones is five or less since the pitch frequency is above 50 Hz [7].

The harmonics generated by the glottal source are shaped by the resonances in the vocal tract. In the low-band the lowest resonance frequency is important. The first formant is in the approximate range of 250–850 Hz during voiced speech [18]. Consequently, the natural amplitude levels of the harmonics in the frequency range 50–300 Hz are either approximately equal or have a descending slope toward lower frequencies. Low-frequency tones can substantially mask higher frequencies when a high amplitude level is used [6]. Masking denotes the phenomenon whereby one sound, the masker, makes another sound, the masked, inaudible. The risk of masking means that caution must be taken when introducing tones in the low-band. The amplitude level of all the sine tones is adaptively updated with a fraction of the amplitude level of the first formant

$$A_{lb}(m) = \gamma\, P(1) \quad (25)$$

where $\gamma$ is a constant fraction substantially less than one, to ensure that only limited masking will occur.

The continuous sine tone frequencies are the pitch frequency and its harmonics. The pitch is estimated for each speech segment. To avoid discontinuities in the sine tones, the tones are changed gradually during the first $N_t$ samples of each segment. The angle of each of the sine tones is calculated as

$$\phi_i(m,n) = \phi_i(m) + n\,\Delta\omega_i(m) \quad (26)$$

where

$$\phi_i(m) = \phi_i(m-1) + N\,\Delta\omega_i(m-1) \quad (27)$$

is the phase compensation to maintain a continuous sinusoid between segments, and

$$\Delta\omega_i(m) = i\,\omega_0(m) \quad (28)$$

is the angle update per sample. The estimated pitch frequency of the current segment is expressed as an angle frequency, $\omega_0(m)$. Similarly, the amplitude is changed during a transition period

$$A_i(m,n) = \begin{cases} A_i(m-1) + \frac{n}{N_t}\big(A_i(m) - A_i(m-1)\big), & n < N_t \\ A_i(m), & n \ge N_t. \end{cases} \quad (29)$$


The low-band speech synthesis then becomes

$$s_{lb}(m,n) = \sum_i A_i(m,n)\,\sin\!\big(\phi_i(m,n)\big) \quad (30)$$

which is also filtered with a low-pass filter having a limit of 300 Hz

$$\tilde{s}_{lb}(m,n) = h_{lp}(n) * s_{lb}(m,n); \quad (31)$$

see also Fig. 5. The precalculated filter is designed to have a short delay in the low-band by using a minimum-phase filter design method [23].
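A minimal sketch of the sine-tone generator of (26)-(30) follows; the phase is carried between segments, the gradual amplitude transition of (29) is omitted for brevity, and the interface is an assumption.

```python
import numpy as np

def low_band_tones(pitch_hz, amp, fs=16000, seg_len=320, phase0=None):
    """Sum sine tones at the pitch and its harmonics below 300 Hz, keeping
    the phase continuous across segments. phase0 carries the per-tone
    phases from the previous segment."""
    num_tones = max(int(300.0 // pitch_hz), 1)
    if phase0 is None:
        phase0 = np.zeros(num_tones)
    n = np.arange(seg_len)
    s = np.zeros(seg_len)
    next_phase = np.empty(num_tones)
    for i in range(num_tones):
        dw = 2 * np.pi * (i + 1) * pitch_hz / fs            # angle update, (28)
        s += amp * np.sin(phase0[i] + n * dw)               # (26), (30)
        next_phase[i] = (phase0[i] + seg_len * dw) % (2 * np.pi)  # (27)
    return s, next_phase
```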

IV. SIMULATIONS

To derive the weighting vectors applied on the spectral vector, for estimation of the FAD and the high-band gain, the extensive speech corpus TIMIT [24] is used. Seven sentences from each of the 630 talkers are selected, comprising over 3 h 40 min of speech signals. The signals are divided into segments of 20 ms duration, corresponding to 320 samples at 16 kHz. Four different types of background signals are imposed on the speech signals: car noise, speech babble noise, street noise, and silence. The background is imposed on the narrow-band speech signal at a randomly selected moderate SNR, compliant with noise rejection specifications of telecommunication systems [1]. All the speech signals are used at three different amplitude levels.

The spectral peaks in the narrow-band are calculated for each segment according to the methods detailed in Section III-A. An estimate of the high-band energy level is calculated. Subsequently, a compensation for speech dynamics is applied, which is equivalent to the compensation of the spectral vector in Section III-B. The PAD and VAD are employed on the narrow-band signal. Thus, for each segment the spectral vector, $\mathbf{Q}(m)$, the logarithm-domain speech-dynamics-normalized high-band amplitude, $A(m)$, VAD$(m)$, and PAD$(m)$ are calculated.

A. Frication Activity Detector

The FAD should detect when the current unvoiced speech segment has prominent fricated speech in the high-band. The unvoiced speech segments (VAD $\&$ $\neg$PAD) are divided into two sets depending on the high-band amplitude level. All but the first element of the vector $\mathbf{Q}(m)$ are used in a $P$-dimensional space. The first element is neglected since it is always one. The spectral vectors of the two sets then correspond to points, $\mathbf{x}$, in the $P$-space. A line is drawn between the centers of inertia of the two sets of points. A point $\mathbf{x}_0$ along this line is chosen with respect taken to the size of each set. This point also belongs to a plane that divides the two sets. The plane has a normal, $\mathbf{N}$, equal to a vector between the centers of inertia of the two sets. It can be decided on which side of the plane a spectral vector lies by examining whether

$$\mathbf{N}^T \left(\mathbf{x} - \mathbf{x}_0\right) \quad (32)$$

is positive or negative. This gives

$$\mathbf{N}^T \mathbf{x} - \mathbf{N}^T \mathbf{x}_0 = \begin{bmatrix} -\mathbf{N}^T\mathbf{x}_0 & \mathbf{N}^T \end{bmatrix} \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix} \quad (33)$$

and since the vector $[1 \;\; \mathbf{x}^T]^T$ is identical to $\mathbf{Q}(m)$, the combinator

$$\mathbf{W}_{FAD} = \begin{bmatrix} -\mathbf{N}^T\mathbf{x}_0 & \mathbf{N}^T \end{bmatrix}^T \quad (34)$$

can be identified from (7).
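The plane construction of (32)-(34) can be sketched as follows, with the set-size weighting and the sign convention as assumptions.

```python
import numpy as np

def train_fad_combinator(Q_frication, Q_other):
    """Separating-plane sketch for the FAD: inputs are the two sets of
    spectral vectors with the leading 1 removed; the plane offset is
    folded into the first element as in (32)-(34), so that a negative
    product in (7) flags frication."""
    c_f, c_o = Q_frication.mean(axis=0), Q_other.mean(axis=0)
    # point on the plane along the line between the centers of inertia,
    # weighted by the relative set sizes
    t = len(Q_frication) / (len(Q_frication) + len(Q_other))
    x0 = c_o + t * (c_f - c_o)
    normal = c_o - c_f  # points away from the frication set
    return np.concatenate(([-(normal @ x0)], normal))
```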

B. High-Band Gain Weighting Vector

The calculation of the high-band amplitude level from the spectral vector is derived by employing a least squares method. The high-band amplitude level is derived with different weighting vectors depending on whether the segment contains voiced speech (VAD $\&$ PAD) or fricated speech (VAD $\&$ $\neg$PAD $\&$ FAD). The weighting vectors are derived with the same method but on different sets of spectral vectors. The weighting vector for voiced speech is calculated by

$$\mathbf{W}_v = \mathbf{Q}_V \backslash \mathbf{A}_V \quad (35)$$

where $\mathbf{Q}_V$ collects the spectral vectors of the voiced speech segment set, $\mathbf{A}_V$ is the desired high-band logarithmic amplitude level, and the $\backslash$-operator is a least squares problem solver in MATLAB. Similarly, for the fricated speech segment set the weighting vector is

$$\mathbf{W}_f = \mathbf{Q}_F \backslash \mathbf{A}_F. \quad (36)$$
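The least squares fit of (35)-(36) translates directly from MATLAB's backslash to, for example, NumPy; the matrix layout is an assumption.

```python
import numpy as np

def train_gain_weights(Q_set, A_set):
    """Least squares weighting vector as in (35)-(36). Q_set is a
    (num_segments, P+1) array of spectral vectors, A_set the desired
    log high-band amplitude per segment."""
    w, *_ = np.linalg.lstsq(Q_set, A_set, rcond=None)
    return w
```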

V. RESULTS

This section presents objective evaluations of the estimated parameters and subjective evaluations of bandwidth-extended speech signals. For these evaluations the TIMIT database has been used [24]. Just as in Section IV, more than 3 h 40 min of speech data has been utilized for the objective evaluations.

A. Frication Activity Detection

The purpose of the FAD is to determine when to use the weighting vector $\mathbf{W}_f$, suitable for segments having much high-band content compared to the low-band. See Table I for a comparison between the actual outcome of the detector and the desired decision. The desired decision is derived from the amplitude level of the original high-band signal: when the level is high the desired decision is FAD, otherwise $\neg$FAD. As can be seen in Table I, most nonfricated frames are correctly detected as $\neg$FAD and approximately half of the fricated segments are detected. This performance is preferred over an increased


TABLE I FRICATION ACTIVITY DETECTION (FAD) OUTCOME COMPARED WITH THE DESIRED DECISION, GIVEN IN PERCENT OF ALL SEGMENTS. THE DESIRED DECISION IS ESTIMATED FROM THE ORIGINAL HIGH-BAND ENERGY LEVEL. A HIGH LEVEL INDICATES A DESIRED DETECTION OF A FRICATION SPEECH SEGMENT. AN OPTIMAL OUTCOME WOULD HAVE ZERO PERCENT MISSED AND FALSE DECISIONS

Fig. 7. Histogram of the calculated (black) and measured (white) distance between two consecutive frequency peaks in the high-band. The calculated resonances are closer to each other compared with the measured ones.

Fig. 8. Histogram of the error between the calculated and measured original peak frequencies in the high-band. Each histogram bin has a width of 200 Hz. The first 200 Hz bin has a share of 19%.

Fig. 6. Histogram of the calculated (black) and desired (white) logarithmic high-band amplitude levels for (a) voiced speech segments and (b) fricated speech segments. The desired log-amplitude levels are derived from wide-band speech data. The levels are below 0 since the speech signals have been normalized with the dynamics, $\bar{P}(m)$, of the speech signal. A similar distribution for the calculated and desired levels can be observed, even though a higher share around the mean levels and the outliers at −40 dB for fricated speech are evident. The outliers are the result of the spectral peaks being below the average noise level; hence, the spectral vector $\mathbf{Q}(m)$ is zero except for the constant in the first element.

uncertainty in the detection of nonfricated, $\neg$FAD, segments, since inclusion of a high level in the high-band when not accurate will give a more severe annoyance for the listener [16].

B. High-Band Amplitude Level

The high-band gain is calculated with the weighting vectors $\mathbf{W}_v$ and $\mathbf{W}_f$. The weighting vectors are applied on the spectral vector $\mathbf{Q}(m)$. Two cases of the gain function are of interest: during voiced speech and during fricated speech. The two cases are detected depending on the actual classifications made by the VAD, PAD, and FAD; see Section III-A. The original high-band amplitude level, $A(m)$, is measured in the wide-band signal just as in Section IV. This amplitude level is compared with the level obtained by applying the weighting vectors on the spectral vectors. The approximate distributions of the logarithmic amplitude levels for fricated and voiced speech segments are presented in Fig. 6. It can be seen that the calculated log-amplitude levels have a distribution approximately

equal to the desired log-amplitude levels, although a concentration around the mean level is observed. Also, for fricated speech an approximately 10% higher share at −40 dB can be seen. Approximately 10% of the fricated speech segments have two or more spectral peaks, $P(k)$, below the average noise peak level, $P_N(k)$. Hence, two or more elements of the spectral vector, $\mathbf{Q}(m)$, are zero. These segments are referred to as outliers. The outlier segments have only a small deviation from the constant gain level derived from the first element of $\mathbf{Q}(m)$. Excluding the outliers, 31% of the fricated segments have an amplitude-level error of less than 6 dB. Voiced speech segments have an error of less than 6 dB in 57% of the segments.

C. Resonance Frequency

The resonances introduced in the high-band are calculated according to a rough acoustic model. The calculated resonance frequencies are derived as in Section III-B. For comparison, the high-band resonances are measured. The measured frequencies of the original high-band resonances are derived similarly to the narrow-band resonances. A linear predictor of order 14 is used and the resonance frequencies are derived from local maxima in the AR-spectrum. An important characteristic of the resonances is the frequency distance between consecutive resonances. This distance is controlled by the tube-length estimate; see (21). Estimates of the measured and calculated distributions of the frequency distances are presented in Fig. 7. In the vicinity of 1 kHz an overrepresentation of frequency distances for the calculated resonances compared with the measured ones is evident. The average frequency distance for the original resonances is 1.6 kHz. A lower distance limit of 0.9 kHz is thus reasonable, as assumed in Section III-B.

Since the uniform tube model is a crude approximation, the precision of the estimated resonance frequencies is modest;


TABLE II SUBJECTIVE EVALUATION OF THE ABSENCE OF DISTORTION ANNOYANCE ON A MEAN OPINION SCORE (MOS) SCALE FROM 1–5, WHERE 5 CORRESPONDS TO NO AUDIBLE DISTORTION AND 1 CORRESPONDS TO VERY ANNOYING DISTORTION

see Fig. 8. However, the main purpose of the synthesized resonances is to shift the accentuation of the harmonics. The errors between the calculated resonance frequencies and the measured original frequencies are derived by calculating the distance between each synthesized resonance and the closest measured resonance. Approximately 19% of the errors are less than 200 Hz; see Fig. 8.

D. Subjective Listening Tests

A resynthesis of missing frequency bands in the received signal is unlikely to be performed without perceptual differences compared to the speech content present at the transmitting side [3]. The quality of the processed speech signals should be evaluated to find out whether the distortion is perceptually disturbing. Subjective listening tests have therefore been performed in conjunction with the objective measures. The goal of the first subjective test is to verify whether the synthesized signals contain distortions and to what extent the distortions are annoying. The goal of the second test is to verify the performance compared with earlier speech bandwidth extension methods. The tests were conducted directly after each other for each test subject.

The first test was made on sentences drawn randomly from the TIMIT database. Each sentence had a randomly picked background signal with a bandwidth of 300 Hz–3.4 kHz at a moderate amplitude level. The background signals used were: car noise, street noise, babble noise, and silence. For each drawn sentence and background, eight different signals were derived. Two “original” signals were derived from the TIMIT speech signals by bandpass filtering with the narrow bandwidth and the wide bandwidth. Three synthesized signals were calculated with either the low-band synthesized signal, the high-band synthesized signal, or both synthesized signals. These signals also contain the up-sampled narrow-band signal. Additionally, three synthesized signals were constructed using parameters derived directly from the original wide-band signal. The parameters derived were the high-band gain and/or the high-band excitation signal.

To avoid listener fatigue, each of the two tests was limited to about 10 min. The subject was placed alone in a quiet room. The test signals were played through Sennheiser HD 600 headphones from a CD player. For the first test, ten sentences were used. For each sentence the two original signals, the three synthesized signals, and one of the synthesized signals using parameters derived from the wide-

TABLE III SUBJECTIVE PREFERENCES FOR THREE DIFFERENT SPEECH BANDWIDTH EXTENSION METHODS AND THE ORIGINAL NARROW-BAND SIGNAL. THE RESULTS ARE SHOWN IN PERCENT PREFERRED-METHOD OF THE PAIR. ON AVERAGE THE NARROW-BAND SIGNAL IS PREFERRED OVER AN SBE-PROCESSED SIGNAL. FOR EXAMPLE, THE FM-SBE-PROCESSED SIGNALS ARE PREFERRED OVER THE NARROW-BAND SIGNALS IN 23% OF THE TEST CASES; SEE LAST COLUMN, FIRST ROW

band signal were used. This gives 60 test signals. Twenty-two subjects rated the absence of distortion annoyance for each signal from 1 to 5, according to a mean opinion score (MOS) [7]. The average opinions are presented in Table II. The results are distinguishable from each other at significance levels of 0.005 (three-star), 0.01 (two-star), or 0.05 (one-star). The results show that the synthesized high-band and wide-band signals have more annoying distortion than the narrow-band signal, where the MOSs are separable with three-star confidence. The low-band has slightly more annoying distortion than the narrow-band signal, with two-star confidence. Subjects were not asked to include any impression of a larger bandwidth in these scores. The subjects have evidently not included a wide-band impression, since no significant difference in annoying distortion between the narrow-band and wide-band could be determined. Eight of the 22 subjects rated the original wide-band signal as having equal or more distortion than the original narrow-band signal.

In the second test the subjects heard 36 pairs of sentences and judged which of the two sentences they preferred. No neutral option was provided, to force the subjects to make a decision. A neutral rating is obtained from the average of random choices when the sentences are equally preferred. When the difference between the sentences is small, an advantage of one of the sentences can still be evident. The test sentences are narrow-band signals or bandwidth-extended speech signals. Three different speech bandwidth extension methods have been applied on the narrow-band signals: Epps et al.'s codebook-derived spectral envelope and model-based excitation method [9]; Nilsson et al.'s GMM-derived spectral envelope and folding-based excitation method [16]; and the FM-SBE method. Six different sentences without background noise have been supplied by Epps. The sentences are used in the test without adding a background noise signal. For each sentence the four different signals are compared to each other in pairs, giving six different pairs. The subjective preferences are presented in Table III. The subjects prefer the original narrow-band signal with two-star confidence, although comments from some subjects after the test indicate that the differences between the signals are small. If the difference is small, then the advantage of any of the tested SBE-processed signals would be small. All the percentages in Table III


are significantly different from each other at least at the one-star level. The suggested FM-SBE method is ranked second, after Nilsson's method and before Epps's method.


TABLE IV COMPUTATIONAL COMPLEXITY OF THE FM-SBE METHOD. PARTS WITH A STAR CONSTITUTE THE MAPPING FROM NARROW-BAND PARAMETERS TO THE HIGH-BAND PARAMETERS. THESE PARTS ONLY USE 0.59 MIPS

VI. DISCUSSION AND CONCLUSION

A speech bandwidth extension method with low computational complexity has been presented. This is important in a mobile phone application since the processing time adds to the delay of the speech signal. Also, increased computational complexity shortens the battery time of the mobile device. The present method is not as simple as some of the folding excitation methods; instead, two shortcomings of the earlier methods are mitigated. Harmonic high-band and low-band signals are synthesized when the narrow-band contains a harmonic excitation. A harmonic excitation strengthens the perceptual fusion to one speech sound, which is important in noisy environments. An adaptive gain of the high-band is applied instead of a fixed gain as in earlier methods. This makes it possible to set a suitable level of the high-band relative to the narrow-band. Additionally, resonances in the high-band frequency region are derived from the narrow-band signal by an acoustic model of the front-most cavity.

Some more advanced SBE methods make use of speech spectrum statistics to derive a bandwidth extension. These methods model the entire narrow-band spectrum, and the mapping to the high-band is derived from clean speech signals. The present method includes noise in the estimation of the mapping and implements techniques to reduce the influence of the noise. The mapping of narrow-band features to high-band features, such as amplitude level and resonances, is performed independently for each feature. The mapping of features results in a wide-band signal. The errors of the estimated high-band features result in audible differences compared to the original wide-band signal. A listening test showed a small increase in distortion annoyance, from slightly annoying (3.4 MOS) for the narrow-band signals to annoying (2.6 MOS) for the bandwidth-extended signals on the 1-5 graded MOS scale.

A second test, a paired preference listening test, was made to compare SBE-processed signals with original narrow-band signals. This test shows a 77% preference for the narrow-band signal. Earlier listening tests in the literature have shown a much lower preference for the narrow-band signal: 35% in [15], 32%-40% in [5], and 12%-13% in [11]. A possible explanation for the difference is that the first MOS test directs the listeners' attention to the distortion, which then continues to influence the preferences in the second test. Further, in any subjective test the instructions given to the subjects influence the result. In the tests performed in this investigation, care has been taken not to instruct the subjects to rate a wide-band signal higher. In contrast, [11] instructs the listener to choose the signal that sounds widest. However, when the FM-SBE method was compared in the second listening test with other, more computationally complex methods, a preference for the FM-SBE method was obtained against half of the other methods. Nilsson's method, which

showed a better subjective result than the FM-SBE method, reports no distortion in 32.1% of the signals in a subjective test [16]. Epps's method is subjectively evaluated in [10]; a MOS rating of 2.78 for the SBE-processed signals is reported, compared with ratings of 4.25 and 2.74 for wide-band and narrow-band signals, respectively. The results indicate that the FM-SBE method has the potential to give a preferred bandwidth-extended signal, although the subjects found the amount of introduced distortion too high for all the tested speech bandwidth extension methods.

APPENDIX I
FORMANT CEPSTRUM CALCULATION

The minimum-phase formant filter is calculated from the poles and zeros by utilizing the cepstrum. Consider a causal minimum-phase transfer function $H(z)$ with zeros $z_i$ and poles $p_i$. The poles and zeros are strictly inside the unit circle and of equal number, $I$. The corresponding cepstrum then is [23]

$$c(n) = \log(g_0)\,\delta(n) + \frac{u(n-1)}{n} \sum_{i} \left(p_i^n - z_i^n\right) \quad (37)$$

where $\delta(n)$ is a Dirac pulse, $u(n)$ is the step function, and $g_0$ is a real constant. The poles are complex conjugated, $p_i = \rho_p(i)\, e^{\pm j\omega_r(i)}$, yielding

$$\frac{p_i^n + (p_i^*)^n}{n} = \frac{2\,\rho_p^n(i)\cos\!\big(n\,\omega_r(i)\big)}{n}. \quad (38)$$


A corresponding equation for the complex conjugated zeros follows. When the poles and zeros are expressed by their radii and angles, (37) then becomes

$$c(n) = \log(g_0)\,\delta(n) + \frac{2\,u(n-1)}{n} \sum_{i=1}^{I} \big(\rho_p^n(i) - \rho_z^n(i)\big)\cos\!\big(n\,\omega_r(i)\big). \quad (39)$$
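A sketch of building the minimum-phase formant filter response of (15)-(16) from the cepstrum (39) follows; the truncation length and interface are assumptions.

```python
import numpy as np

def formant_filter_response(freqs, rho_p, rho_z, n_fft=512, n_cep=64, g=1.0):
    """Truncated real cepstrum of a minimum-phase filter with conjugate
    pole/zero pairs at angles freqs (radians) and radii rho_p, rho_z,
    per (39); the response then follows from (15)."""
    c = np.zeros(n_cep)
    c[0] = np.log(g)
    n = np.arange(1, n_cep)
    for w, rp, rz in zip(freqs, rho_p, rho_z):
        c[1:] += 2.0 * (rp ** n - rz ** n) * np.cos(n * w) / n
    # cepstrum -> log-spectrum -> minimum-phase frequency response
    return np.exp(np.fft.fft(c, n_fft))
```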

APPENDIX II
COMPUTATIONAL COMPLEXITY

The FM-SBE method has been implemented on a digital signal processor in fixed-point C code. The implementation requires 29 million instructions per second (MIPS). The low-band synthesis part of the computational complexity is 3.4 MIPS. The major part of the computational complexity, 15 MIPS, is due to the pitch estimator. The pitch estimator is mainly used for the low-band extension but also for the PAD. Both of the methods, [9] and [16], that are used for comparison in the listening test make use of a pitch estimation/detection and do not have a low-band extension. As can be seen in Table IV, the mapping from narrow-band parameters to high-band parameters requires very few MIPS.

REFERENCES

[1] Digital Cellular Telecommunications System (Phase 2+); Transmission Planning Aspects of the Speech Service in the GSM Public Land Mobile Network (PLMN) System, GSM 03.50 Version 8.1.0, 1999.
[2] Technical Specification Group Services and System Aspects; Speech Codec Speech Processing Functions; AMR Wide-band Speech Codec; Transcoding Functions, TS 26.190 v5.1.0, 2001.
[3] M. Nilsson, H. Gustafsson, S. V. Andersen, and W. B. Kleijn, “Gaussian mixture model based mutual information estimation between frequency bands in speech,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2002.
[4] Y. M. Cheng, D. O’Shaughnessy, and P. Mermelstein, “Statistical recovery of wide-band speech from narrow-band speech,” in Proc. Int. Conf. Speech and Language Processing, 1992, pp. 1577–1580.
[5] H. Yasukawa, “Quality enhancement of band limited speech by filtering and multirate techniques,” in Proc. Int. Conf. Speech and Language Processing, 1994, pp. 1607–1619.
[6] B. C. J. Moore, Ed., Hearing, 2nd ed. New York: Academic, 1995.
[7] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[8] H. Yasukawa, “Restoration of wide band signal from telephone speech using linear prediction residual error filtering,” in Proc. Digital Signal Processing Workshop, 1996, pp. 176–178.
[9] J. Epps and H. Holmes, “A new technique for wide-band enhancement of coded narrow-band speech,” in Proc. IEEE Workshop on Speech Coding, 1999, pp. 174–176.
[10] J. Epps, “Wide-band extension of narrow-band speech for enhancement and coding,” Ph.D. dissertation, School Elect. Eng. Telecom., Univ. of New South Wales, Australia, Sep. 2000.
[11] Y. Yoshida and M. Abe, “An algorithm to reconstruct wide-band speech from narrow-band speech based on codebook mapping,” in Proc. Int. Conf. Speech and Language Processing, Yokohama, Japan, 1994, pp. 1591–1594.
[12] P. Jax and P. Vary, “Wideband extension of telephone speech using a hidden Markov model,” in Proc. IEEE Workshop on Speech Coding, 2000, pp. 133–135.
[13] C. Avendano, H. Hermansky, and E. A. Wan, “Beyond Nyquist: Toward the recovery of broad-bandwidth speech from narrow-bandwidth speech,” in Proc. EUROSPEECH, Madrid, Spain, 1995, pp. 165–168.

[14] Y. Nakatoh, M. Tsushima, and T. Norimatsu, “Generation of broad-band speech from narrow-band speech using piecewise linear mapping,” in Proc. EUROSPEECH, vol. 3, 1997, pp. 1643–1646.
[15] K.-Y. Park and H. S. Kim, “Narrowband to wide-band conversion of speech using GMM based transformation,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2000, pp. 1843–1846.
[16] M. Nilsson and W. B. Kleijn, “Avoiding over-estimation in bandwidth extension of telephony speech,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2001.
[17] W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis. Amsterdam, The Netherlands: Elsevier, 1995.
[18] K. N. Stevens, Acoustic Phonetics. Cambridge, MA: MIT Press, 1999.
[19] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.
[20] Digital Cellular Telecommunications System (Phase 2+); Voice Activity Detector (VAD) for Adaptive MultiRate (AMR) Speech Traffic Channels; General Description, GSM 06.94 Version 7.1.1, 1998.
[21] W. Hess, Pitch Determination of Speech Signals. New York: Springer-Verlag, 1983.
[22] J. Makhoul and M. Berouti, “High-frequency regeneration in speech coding systems,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Washington, DC, 1979, pp. 428–431.
[23] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[24] DARPA-TIMIT, Acoustic-Phonetic Continuous Speech Corpus, 1990.

Harald Gustafsson was born in Sweden in 1972. He received the M.S.E.E. degree from Lund University, Lund, Sweden, in 1995, and the Ph.D. degree from Blekinge Institute of Technology, Ronneby, Sweden, in 2002. Since 1996, he has been a Development/Research Engineer at Ericsson Mobile Platforms AB, Lund. His current research interests are in mobile multimedia system design and audio signal processing.

Ulf A. Lindgren received the Master’s degree in electrical engineering in 1991 and the Ph.D. degree in signal processing in 1997 both from Chalmers University of Technology, Göteborg, Sweden. From 1997 to 2004, he was employed by Ericsson Mobile Platforms AB, Lund, Sweden, where he held a position as Senior Specialist in audio processing. Currently, he is affiliated with the antenna research center at Ericsson AB, Göteborg, Sweden. Dr. Lindgren is a member of COOP, ICA, and OKQ8.

Ingvar Claesson was born in Sweden in 1957. He received the M.S. degree in 1980 and the Ph.D. degree in 1986 in electrical engineering at the University of Lund, Lund, Sweden. Since 1990 he has been building the Telecommunication and Signal Processing Department, Blekinge Institute of Technology, where he is Head of Research and Chair in applied signal processing. His interests are in multimedia signal processing, filter design, adaptive filtering, and noise cancellation.
