SPEECH BANDWIDTH EXTENSION USING GAUSSIAN MIXTURE MODEL-BASED ESTIMATION OF THE HIGHBAND MEL SPECTRUM
Hannu Pulakka¹, Ulpu Remes², Kalle Palomäki², Mikko Kurimo², Paavo Alku¹
¹ Department of Signal Processing and Acoustics, Aalto University, Finland
² Adaptive Informatics Research Centre, Aalto University, Finland
ABSTRACT
The quality and intelligibility of narrowband telephone speech can be enhanced by artificial bandwidth extension. This study combines Gaussian mixture model (GMM)-based mel spectrum extension with a filter bank implementation for generating the missing spectral content in the highband at 4–8 kHz. The narrowband mel spectrum is calculated from input speech and the GMM is used to estimate the mel spectrum in the highband. An excitation signal for the highband is generated as a combination of upsampled linear prediction residual and modulated noise. The excitation is divided into sub-bands that are weighted and summed to realize the estimated mel spectrum. The bandwidth-extended output is obtained as the sum of the artificial highband signal and narrowband speech. Listening tests indicate that this method is preferred over narrowband speech and over a previously presented artificial bandwidth extension method which is implemented in some mobile phone models.

Index Terms: speech processing, speech enhancement, bandwidth extension, Gaussian mixture model, mel spectrum

1. INTRODUCTION
Most telephone systems transmit narrowband (NB) speech with a frequency range limited to the traditional 300–3400 Hz telephone band or a slightly wider bandwidth. Since natural speech contains frequency components beyond the telephone band, the limited bandwidth degrades both the quality and intelligibility of speech. Wideband (WB) speech transmission covers audio frequencies 50 Hz–7 kHz and enables significantly higher speech quality. Services using the wideband codec AMR-WB are currently becoming available to mobile phone customers, but the transition from NB to WB telephony is predicted to take a long time. During the transition period, NB and WB equipment coexist and the users encounter a large quality difference between NB and WB speech.

To improve the quality and intelligibility of NB speech, artificial bandwidth extension (ABE or BWE) techniques have been developed.
They attempt to regenerate the missing spectral content at the receiver based on NB input, which means no modifications are required to the existing encoding practices. This paper addresses ABE to the frequency range 4–8 kHz, which is denoted as the highband. The frequency range below 4 kHz is denoted as the lowband. Typical ABE methods are based on the source-filter model, which divides speech into an excitation signal and a filter. An excitation is generated for the missing highband and filtered to correspond to the estimated spectral envelope of WB speech. The spectral envelope parameters are estimated using features extracted from the NB speech. Features indicating the lowband spectral shape are
978-1-4577-0539-7/11/$26.00 ©2011 IEEE
typically used, and additional frequency-domain and time-domain features are often utilized [1]. Several methods including codebook mapping, neural networks, and Gaussian mixture models (GMMs) have been proposed for estimating the highband parameters. GMM methods have been used, e.g., with spectral vectors [2], mel-frequency cepstral coefficients (MFCC) [3], and line spectral frequencies (LSF) [4]. Further approaches utilizing GMMs in ABE include adjusting the temporal envelope and gain of the highband sub-bands based on GMM-estimated parameters [5] and using a GMM for both highband prediction and denoising of lowband features [6].

A straightforward ABE method was described and evaluated in [7]. The algorithm is robust and computationally inexpensive. Importantly, the method has been deployed in some Nokia mobile phone models since 2007 [8] and therefore serves as a natural reference method. It is referred to as Ref-ABE in this work.

This paper describes a new ABE method that combines a filter bank technique with GMM-based estimation of the highband mel spectrum. The GMM predictor used in this work is based on the reconstruction method proposed in [9] for missing data imputation in automatic speech recognition (ASR). A similar GMM-based approach was proposed in [10] for feature domain bandwidth extension in an ASR application, but has not been evaluated in a speech enhancement task. In ASR applications, the GMM predictor is used in the feature domain with a low temporal resolution, while in the ABE task, it is also necessary to generate the temporal fine structure in the missing frequency band. In this work, highband synthesis techniques described in [11] are used with minor modifications. The proposed GMM-ABE system is evaluated in a listening test where the system output is compared to narrowband speech and Ref-ABE output.
While the GMM prediction and highband synthesis techniques are not novel as such, they have not, to the knowledge of the authors, been combined or evaluated in this setting before.

2. METHODS
A block diagram of the proposed GMM-ABE method is shown in Fig. 1. The method first extracts mel-spectral features from the narrowband input in short time frames as described in Section 2.1. Bandwidth extension is then implemented in two phases. First, the highband mel spectrum is estimated using a GMM (Section 2.2) and then, a time-domain signal is generated so that the estimated highband mel spectrum is approximately realized (Section 2.3).

2.1. Feature extraction
The input signal $s_{nb}$ is narrowband speech sampled at 8 kHz. The input is upsampled to the sampling rate of 16 kHz, prefiltered with
ICASSP 2011
a highpass filter $H_{pf}(z) = 1 - 0.97z^{-1}$, and windowed into 16-ms frames with 8-ms overlap using a Hamming window; see steps A–C in Fig. 1. The frames are converted to the frequency domain using a 256-point FFT and a mel spectrum is calculated from the magnitude spectrum using a mel filter bank that comprises overlapping triangular windows with equal width on the mel scale (D). The filter bank has 21 bands that cover frequencies up to 8 kHz. Sixteen bands are located below 4 kHz and will be used as input features when attempting to estimate the remaining five bands. The feature values are also log-compressed before the highband estimation (E).

2.2. GMM predictor
In the spectral domain, the artificial bandwidth extension task is a prediction problem with a goal to estimate the missing frequencies $y(m)$ in the $m$th frame based on input information derived from the narrowband signal. In this work, the joint distribution of the input and output variables $x$ and $y$ is modeled as a Gaussian mixture model

$$p(z) = \sum_{\nu} P(\nu)\, \mathcal{N}(z; \mu(\nu), \Sigma(\nu)), \tag{1}$$

where $z = [x^T \; y^T]^T$, $P(\nu)$ are the component weights, i.e., prior probabilities, and $\mu(\nu)$ and $\Sigma(\nu)$ the means and covariances. The component or cluster index $\nu$ is assumed a hidden variable, and in this work, the clusters and distribution parameters are jointly estimated using the expectation-maximization (EM) algorithm implemented in the GMMBAYES Matlab toolbox (available at www.it.lut.fi/project/gmmbayes/). Note that the cluster means and covariances are partitioned as

$$\mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}, \tag{2}$$

where $\mu_x$ and $\mu_y$ denote the mean vectors, $\Sigma_{xx}$ and $\Sigma_{yy}$ the covariance matrices, and $\Sigma_{xy} = \Sigma_{yx}^T$ the cross-covariance matrix calculated for the input and output features $x$ and $y$ in the $\nu$th cluster.

The prediction problem may now be formulated as finding the maximum a posteriori (MAP) estimate for the output $y$ given the observed input $x = x(m)$. For GMM-distributed variables, this is calculated as

$$y(m)^* = \arg\max_{y} \sum_{\nu} P(\nu \mid x(m), \Lambda)\, p(y \mid x(m), \Lambda, \nu), \tag{3}$$

where $\Lambda$ denotes the model parameters from Equation (1). In practice, since MAP estimation with mixture models is difficult, this is approximated as

$$y(m)^* = \sum_{\nu} P(\nu \mid x(m), \Lambda)\, \arg\max_{y}\, p(y \mid x(m), \Lambda, \nu), \tag{4}$$

which is a weighted sum of cluster-conditioned MAP estimates for $y(m)$. Maximizing the auxiliary function in Equation (4) corresponds to maximizing the lower bound of the likelihood in Equation (3). The posterior probabilities for clusters $\nu$ are calculated from $P(\nu)$ and the likelihoods $p(x(m) \mid \Lambda, \nu)$, which are calculated with diagonal covariances in this work.

Since the cluster-conditioned predictive distributions are Gaussian, the MAP estimates in Equation (4) correspond to the expected values $E(y \mid x(m), \Lambda, \nu)$. The estimates are calculated as

$$E(y \mid x(m), \Lambda, \nu) = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x(m) - \mu_x), \tag{5}$$
where the means and covariances are as defined in Equation (2). The cluster-dependent linear transformations $R = \Sigma_{yx} \Sigma_{xx}^{-1}$ can be precomputed and stored in a look-up table to save computation time during use.

Fig. 1. Block diagram of the proposed GMM-ABE method. The narrowband input and the bandwidth-extended output are denoted by $s_{nb}$ and $s_{abe}$, respectively. Gray arrows represent frame-based processing and black arrows sample-by-sample processing. Thick arrows indicate multiple signals, i.e., separate signals for the sub-bands. Delays for synchronizing the branches are not shown here.

The GMM used in this work is trained on 500 clean speech utterances (52 minutes) selected from the Finnish SPEECON database [12]. The input features $x(m)$ for the $m$th frame are the mel bands 1–16 from the current and two previous frames. The input features are calculated from narrowband signals filtered with the MSIN filter that approximates the input characteristics of a mobile station [13]. The signals are also scaled in amplitude to −26 dBov and processed with the AMR codec (12.2 kbps). The output features $y(m)$ are the mel bands 17–21 calculated from the original wideband data. The data is modeled with 20 Gaussian components.

2.3. Highband synthesis
The highband signal is generated by synthesizing five sub-band signals and weighting them so that the estimated mel spectrum $y(m)^*$ is approximately realized in each frame $m$. First, however, the estimates are temporally smoothed by using the filter $H(z) = 0.25/(1 - 0.75z^{-1})$ if the estimated value has increased compared to the previous frame. This attenuates occasional peaks in $y(m)^*$.
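The prediction of Equations (4) and (5), followed by the rise-only smoothing, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array layout (one row per cluster, input features first), and the choice to compare each new estimate against the previous smoothed value are all hypothetical. Model training (done with the GMMBAYES toolbox in the paper) is not shown.

```python
import numpy as np

def precompute_regressions(covs, dx):
    """Look-up table of R = Sigma_yx Sigma_xx^{-1}, one matrix per cluster."""
    table = []
    for S in covs:
        Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
        # Sxx is symmetric, so solve(Sxx, Syx^T)^T equals Syx Sxx^{-1}.
        table.append(np.linalg.solve(Sxx, Syx.T).T)
    return np.stack(table)

def cluster_posteriors(x, weights, means, covs, dx):
    """P(nu | x) from the priors and diagonal-covariance likelihoods of x."""
    mu_x = means[:, :dx]
    var_x = np.stack([np.diag(S)[:dx] for S in covs])
    log_lik = -0.5 * np.sum((x - mu_x) ** 2 / var_x
                            + np.log(2.0 * np.pi * var_x), axis=1)
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    return post / post.sum()

def predict_highband(x, weights, means, covs, R, dx):
    """Equation (4): posterior-weighted sum of the Equation (5) estimates."""
    post = cluster_posteriors(x, weights, means, covs, dx)
    cond_means = means[:, dx:] + np.einsum("kij,kj->ki", R, x - means[:, :dx])
    return post @ cond_means

def smooth_rising(y_frames, a=0.75):
    """Apply H(z) = (1 - a) / (1 - a z^-1) only where the estimate rises.

    Assumption: 'increased' is judged against the previous smoothed value,
    separately for each band.
    """
    y_frames = np.asarray(y_frames, dtype=float)
    out = np.empty_like(y_frames)
    out[0] = y_frames[0]
    for m in range(1, len(y_frames)):
        rising = y_frames[m] > out[m - 1]
        smoothed = (1.0 - a) * y_frames[m] + a * out[m - 1]
        out[m] = np.where(rising, smoothed, y_frames[m])
    return out
```

Solving the linear system instead of forming an explicit inverse keeps the precomputation numerically stable, and the table is built once so that each frame costs only one matrix-vector product per cluster.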
[Figure 2 plot: magnitude in dB (0 to −40) versus frequency in kHz (4 to 8).]
Fig. 2. Magnitude responses of the five sub-band filters.

The sub-band signals are obtained by filtering a wideband excitation signal with the 128-tap FIR filters illustrated in Fig. 2. The filter bandwidths and center frequencies correspond to the mel bands 17–21. The weights, i.e., gain coefficients, for the sub-bands are estimated using an iterative technique (step F in Fig. 1). The estimates are initialized by assuming that the passband signals do not affect adjacent mel bands, and in each subsequent iteration, they are corrected by the ratio of the target mel spectrum $y(m)^*$ to the mel spectrum that would result from using the current estimates.

The wideband excitation is primarily constructed from the linear prediction (LP) residual signal of each input frame using spectral folding. The LP residual is computed from the input signal sampled at 8 kHz using Hann windowing, the autocorrelation method, and an LP order of 10 (G). The residual frame is modified with a formant filtering technique (H) similar to the short-term postfiltering used in some speech coders [14]. The method restores some variation in the spectral envelope of the residual and was found to slightly improve the quality of the bandwidth-extended speech in informal listening tests. Each modified residual frame is normalized in amplitude to constant average power.

To reduce the unnatural sound of the harmonic peaks in the residual excitation, it is combined with modulated noise [4]. The time envelope of a passband signal between 2 kHz and 3 kHz is estimated from the input speech (I) and white noise is modulated with this envelope (J) to generate the temporal fine structure within each frame. The energy of the modulated noise signal is normalized in each frame, and the modulated noise is then combined with the LP residual (K).
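The iterative gain estimation (step F) described above can be illustrated with a multiplicative update. The sketch below rests on assumptions that the text does not specify: the band values are taken in the linear (non-log) domain, the mapping from sub-band gains to the resulting mel spectrum is modeled by a hypothetical nonnegative leakage matrix whose entry (i, j) is the contribution of sub-band j at unit gain to mel band i, and the iteration count is arbitrary.

```python
import numpy as np

def estimate_gains(target, leakage, n_iter=10):
    """Iteratively fit sub-band gains so that leakage @ gains matches target.

    target  : desired highband mel-band values (linear domain), shape (B,)
    leakage : hypothetical (B, B) matrix of unit-gain band contributions
    """
    target = np.asarray(target, dtype=float)
    leakage = np.asarray(leakage, dtype=float)
    # Initialization: pretend each sub-band affects only its own mel band.
    gains = target / np.diag(leakage)
    for _ in range(n_iter):
        # Mel spectrum that the current gains would produce.
        realized = leakage @ gains
        # Correct each gain by the target-to-realized ratio in its band.
        gains = gains * target / realized
    return gains
```

With a leakage matrix dominated by its diagonal, as expected for bandpass filters matched to the mel bands, the initial guess is already close and a few iterations suffice in this toy model.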
As a compromise between the metallic, sharp sound of the residual and the noisy sound of the modulated noise, the residual signal was given more weight in the lowest three sub-bands and the modulated noise signal in the highest two sub-bands.

Finally, the excitation signals are multiplied by the estimated gain coefficients (L) and successive frames are combined by overlap-add using the Hann window to construct a continuous excitation signal for each sub-band (M). Zero-insertion is used to obtain the spectral folding effect and a wideband signal with the sampling rate of 16 kHz (N). The excitation signals are filtered (O) with the bandpass filters in Fig. 2 and the passband signals are summed together and added to the upsampled, lowpass-filtered, and appropriately delayed narrowband signal to obtain the bandwidth-extended signal $s_{abe}$ (P).

3. EVALUATION
The proposed GMM-ABE method was compared to the Ref-ABE method [7] and to narrowband telephone speech using subjective listening tests. The test procedure was similar to the comparison category rating test used in [7]. In each test case, the listener was presented with two differently processed instances of the same sentence. The task of the listener was to assess the quality of the second sample compared to the first sample using a seven-point scale. Three processing types were compared in the test:
• NB: Narrowband speech filtered with the MSIN filter, scaled in amplitude to −26 dBov, and processed with the AMR narrowband codec at the bit rate of 12.2 kbps.
• Ref-ABE: The NB sample processed with Ref-ABE.
• GMM-ABE: The NB sample processed with GMM-ABE.

Ten utterances were selected from the Finnish SPEECON database [12] for the test. These are read sentences headset-recorded in quiet office environments and each spoken by a different speaker (5 males and 5 females). None of the test speakers were included in the GMM training data (Section 2.2) or the development data used for informal listening and parameter optimization (Section 2.3). The test utterances were randomly selected among fluently spoken utterances of sufficient length and have durations between 3.5 and 5.5 s. The estimated signal-to-noise ratios are between 21 and 29 dB.

For each sentence, pairwise comparisons between the processing types were included in the listening test in both presentation orders. The test also contained ten null pairs with identical samples for control purposes. Thus, the listening test comprised a total of 70 comparisons. The order of the test cases was randomized for each listener using some balancing constraints on the order.

Twelve native speakers of Finnish between 20 and 29 years of age participated in the test. The test was arranged in a quiet room using a computer interface. The samples were played to both ears through Sennheiser 580 headphones. Each listener had a short practice session before the actual test.

The distributions of the listener responses are shown in Fig. 3. The distributions in the comparisons between Ref-ABE and GMM-ABE show preference for GMM-ABE on average. Comparisons of both ABE methods to the NB reference show that bandwidth extension is preferred in most cases, but there is another lower peak in the distributions in favor of the NB signal. On average, GMM-ABE obtained higher preference scores than Ref-ABE when compared to the NB reference. This is consistent with the distribution of scores in comparisons between Ref-ABE and GMM-ABE.
In all three comparisons, the mean score was significantly different from zero (t-test, p < 0.001), indicating a statistically significant preference.

4. DISCUSSION
Results from the listening tests show that samples generated with the proposed GMM-ABE method are significantly preferred over both the narrowband samples and samples generated with Ref-ABE [7]. Informal evaluation indicated that GMM-ABE yields brighter sound and produces clearer and more consistent fricative sounds than Ref-ABE. However, in some samples generated with GMM-ABE, the overall timbre could be considered too bright and metallic and some sibilants artificial-sounding, and the method also caused audible artifacts. The Ref-ABE method, in contrast, has been designed to be robust in varying conditions [8], and artifacts are specifically avoided. While noise robustness was not assessed in the present study, GMM prediction could be used for joint bandwidth extension and lowband denoising as proposed in [6]. In the proposed GMM-ABE framework, denoising could be based on missing data imputation [9].

Figure 4 shows the long-term average spectra of the ABE samples and the original wideband utterances. The spectrum predicted using GMM-ABE extends to higher frequencies and is on average closer to the original wideband spectrum than the Ref-ABE spectrum. More energy in the highband results in increased loudness, which often correlates with higher preference ratings in listening tests. The bandwidths of the ABE methods were not equalized for the test because the frequency range up to 8 kHz is available and any quality improvement obtained using the entire range is beneficial. The gap in the GMM-ABE spectrum at 4 kHz is due to the excitation generation method and the filtering technique used. Such a gap has only a minor perceptual effect [1, 7].
[Figure 3 plots: three histogram panels titled "Ref-ABE vs. GMM-ABE", "NB vs. Ref-ABE", and "NB vs. GMM-ABE"; horizontal axis: score from −3 to 3; vertical axis: relative frequency in % (0 to 100).]
Fig. 3. Distributions of listener ratings in the pairwise comparisons of Ref-ABE versus GMM-ABE, NB versus Ref-ABE, and NB versus GMM-ABE. In each illustration, the bars indicate relative frequencies of the scores from much worse (−3) to much better (3).
[Figure 4 plot: magnitude in dB (0 to −60) versus frequency in kHz (0 to 8); curves: Wideband, GMM-ABE, Ref-ABE.]

Fig. 4. Long-term average spectra of the bandwidth-extended speech samples used in the listening test. The average spectrum of the original wideband recordings is shown for comparison.

The level of the listening test samples was not modified after processing but the samples had identical level in the lowband and loudness differences were allowed. This choice was motivated by the target application; the output gain of a mobile terminal is mostly limited by the reproduction of the low frequencies, and increasing loudness without increasing the lowband gain or reducing speech quality is a desirable effect.

Finally, the proposed method involves more computation than Ref-ABE, but there are no principal obstacles that would prevent its use in realistic communication applications; the computation is not overly complicated and future frames are not used.

The system evaluated in this work used only mel-spectral features. Additional time or frequency domain features could improve the performance of GMM-ABE without significantly increasing the computational cost.

The proposed method is similar to the one presented in [11], which uses a different feature set and utilizes a neural network instead of a GMM. According to informal comparisons, the GMM-based method yields brighter sound but causes more artifacts. Listening tests should be arranged for a comprehensive comparison.

5. ACKNOWLEDGEMENTS
The work of HP is funded by the GETA graduate school, Nokia Devices, the Academy of Finland (LASTU research programme 135003), and Aalto University (Mide/UI-ART). The co-authors are supported by the Hecse graduate school (UR) and the Academy of Finland in the projects Auditory approaches to automatic speech recognition (UR, KJP) and AIRC (UR, KJP, MK). The authors would like to thank the participants of the listening test.

6. REFERENCES
[1] P. Jax and P. Vary, “On artificial bandwidth extension of telephone speech,” Signal Process., vol. 83, no. 8, pp. 1707–1719, 2003.
[2] K.-Y. Park and H. S. Kim, “Narrowband to wideband conversion of speech using GMM based transformation,” in Proc. ICASSP, 2000, pp. 1843–1846.
[3] A. H. Nour-Eldin and P. Kabal, “Combining frontend-based memory with MFCC features for bandwidth extension of narrowband speech,” in Proc. ICASSP, 2009, pp. 4001–4004.
[4] Y. Qian and P. Kabal, “Dual-mode wideband speech recovery from narrowband speech,” in Proc. Eurospeech, 2003, pp. 1433–1436.
[5] K.-T. Kim, M.-K. Lee, and H.-G. Kang, “Speech bandwidth extension using temporal envelope modeling,” IEEE Signal Process. Lett., vol. 15, pp. 429–432, 2008.
[6] M. L. Seltzer, A. Acero, and J. Droppo, “Robust bandwidth extension of noise-corrupted narrowband speech,” in Proc. Interspeech, 2005, pp. 1509–1512.
[7] H. Pulakka, L. Laaksonen, M. Vainio, J. Pohjalainen, and P. Alku, “Evaluation of an artificial speech bandwidth extension method in three languages,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 6, pp. 1124–1137, 2008.
[8] L. Laaksonen, H. Pulakka, V. Myllylä, and P. Alku, “Development, evaluation and implementation of an artificial bandwidth extension method of telephone speech in mobile terminal,” IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 780–787, 2009.
[9] B. Raj, M. L. Seltzer, and R. M. Stern, “Reconstruction of missing features for robust speech recognition,” Speech Commun., vol. 43, no. 4, pp. 275–296, 2004.
[10] M. L. Seltzer and A. Acero, “Training wideband acoustic models using mixed-bandwidth training data via feature bandwidth extension,” in Proc. ICASSP, 2005, pp. 921–924.
[11] H. Pulakka, V. Myllylä, L. Laaksonen, and P. Alku, “Bandwidth extension of telephone speech using a filter bank implementation for highband mel spectrum,” in Proc. EUSIPCO, 2010, pp. 979–983.
[12] D. Iskra, B. Grosskopf, K. Marasek, H. van den Heuvel, F. Diehl, and A. Kiessling, “SPEECON – speech databases for consumer devices: database specification and validation,” in Proc. LREC, 2002, pp. 329–333.
[13] International Telecommunication Union, ITU-T Recommendation G.191, Software tools for speech and audio coding standardization, 2005.
[14] J.-H. Chen and A. Gersho, “Adaptive postfiltering for quality enhancement of coded speech,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 59–71, 1995.