15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

BANDWIDTH EXTENSION OF NARROWBAND SPEECH BASED ON BLIND MODEL ADAPTATION

Sheng Yao and Cheung-Fat Chan
Department of Electronic Engineering
City University of Hong Kong, Kowloon, Hong Kong
[email protected] and [email protected]

ABSTRACT

The traditional telephone network limits transmitted speech to frequencies below 4 kHz. Narrowband telephone speech (0-4 kHz) sounds muffled compared with the original wideband speech (0-8 kHz). Artificial bandwidth extension is an economical way of enhancing the quality of narrowband speech without modifying the network infrastructure. Existing bandwidth extension methods usually comprise an off-line learning phase and an on-line enhancement phase, and their performance depends largely on the consistency between the wideband training data and the actual narrowband input. In real situations the input speech often mismatches the off-line training speech, leading to serious model errors. To avoid this data mismatch, we propose a method based on blind adaptation of a linear dynamic model. Our method requires no off-line training phase, and experimental results show that it is comparable with data-oriented systems in terms of high-band spectral distortion; when data mismatch occurs, it outperforms those systems.

1. INTRODUCTION

With the gradual deployment of wideband voice terminals such as the adaptive multi-rate wideband codec (AMR-WB) and the variable-rate multi-mode wideband codec (VMR-WB), the current speech transmission network is a mixture of traditional narrowband terminals and new wideband terminals. During this transition period, bandwidth extension (BWE) systems help to enhance the perceived quality of narrowband speech without the cost of replacing the old narrowband infrastructure. The authors of [4] indicate that existing BWE systems perform reasonably well not because they accurately retrieve the original missing high-band information, but because they extend the high band such that the signal sounds perceptually pleasant. (This research is supported by Strategic Research Grant (Project 7001858) of City University of Hong Kong.)

©2007 EURASIP

Reported BWE systems can be classified into two categories: memoryless systems and memory systems. Memoryless methods are the earlier development of BWE, including VQ codebook mapping [1], linear mapping [3], and Gaussian mixture model (GMM) conversion [2]. These methods are usually criticized for disregarding inter-frame correlation, which causes relatively strong hissing artefacts. More recently, attention has shifted to memory systems, among them the hidden Markov model (HMM) method [5], HMM with state mapping [7], and the linear dynamic model [8]. These systems can estimate the missing high-band information based on previous estimates; they focus more on retrieving the trajectory of the spectrum evolution, so the hissing artefact is greatly reduced. However, all the systems mentioned above are data-oriented. They perform well when the input narrowband speech is consistent with the training database, i.e. the same speaker or a similar recording environment, which is not the case in real applications, where data mismatch often occurs. In this paper, we propose a memory system based on a linear dynamic model whose parameters adapt to the input narrowband speech in a blind manner. Off-line model training is not required except for the initial model. Experimental results show that the proposed method is superior to memoryless systems and comparable with memory systems in terms of high-band spectral distortion. The rest of the paper is organized as follows. Section 2 presents the use of the linear dynamic model in bandwidth extension. Section 3 explains how the proposed system adapts to the input narrowband speech in a blind sense. Section 4 compares the objective performance. The last section concludes.

2. LINEAR DYNAMIC MODEL

The model is also termed the state space model. In a linear state space model, the hidden speech state vector x(k) ∈ R^p is assumed to evolve linearly according to equation (1):
x(k + 1) = A x(k) + u + w(k)    (1)


k is the time index (speech frame index). The transformation matrix A and the deterministic control vector u are pre-trained model parameters. w(k) is an uncorrelated zero-mean Gaussian noise vector with covariance E[w(k) w(l)^T] = Q δ_kl. The

observation vector o(k) ∈ R^m is a noisy linear transformation of the state vector x(k) according to equation (2):

o(k) = C x(k) + v(k)    (2)

The equation is static in nature since the vectors o and x share the same index. v(k) is also uncorrelated zero-mean Gaussian noise, with covariance E[v(k) v(l)^T] = R δ_kl.

(a) original wideband female speech from speaker clh
(b) estimation of wideband speech with the mismatched model of male speaker bjm

By assuming o(k) to be the input narrowband speech feature vector and x(k) the unknown wideband feature vector, the linear state space model can be employed in a speech bandwidth extension system. Such an assumption is reasonable because:
1) The human speech process is a non-stationary random process whose hidden state is never static. State equation (1) is one possible representation of the state evolution, chosen for ease of mathematical treatment.
2) The linear transformation relationship between o(k) and x(k) assumed in observation equation (2) has already been applied in memoryless linear mapping systems, where the performance is satisfactory apart from the limitation of the memoryless design.
3) Due to the presence of noise and the possible non-invertibility of the matrix C in equation (2), the state vector x(k) cannot be uniquely estimated from the observation vector o(k), which reflects the one-to-many mapping between narrowband and wideband speech features [6].
We extract 10th-order line spectral frequencies (LSF) as the narrowband feature o. The target wideband feature x is defined as 18th-order LSF.
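As a minimal sketch (with toy, untrained parameter values chosen only for illustration), the generative model of equations (1) and (2) can be simulated in NumPy; the dimensions match the 18th-order wideband and 10th-order narrowband LSF features:

```python
import numpy as np

rng = np.random.default_rng(0)

p, m = 18, 10            # wideband (state) and narrowband (observation) LSF orders
A = 0.9 * np.eye(p)      # toy state transition matrix (illustrative, not trained)
u = 0.01 * np.ones(p)    # deterministic control vector
C = 0.1 * rng.standard_normal((m, p))  # toy observation matrix
Q = 1e-4 * np.eye(p)     # state noise covariance
R = 1e-4 * np.eye(m)     # observation noise covariance

def simulate(n_frames, x0):
    """Draw (x, o) sequences from eqs. (1) and (2)."""
    xs, obs = [], []
    x = x0
    for _ in range(n_frames):
        x = A @ x + u + rng.multivariate_normal(np.zeros(p), Q)  # eq. (1)
        o = C @ x + rng.multivariate_normal(np.zeros(m), R)      # eq. (2)
        xs.append(x)
        obs.append(o)
    return np.array(xs), np.array(obs)

xs, obs = simulate(200, np.zeros(p))
```

In a real system only the o sequence is observed; the x sequence is what the bandwidth extension system must estimate.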

3. BLIND MODEL ADAPTATION

Given a sequence of narrowband feature vectors o and a trained linear state space model θ = {A, u, C, Q, R}, if we assume θ is stationary and well trained for the sequence, a good estimate of the x sequence can be obtained via the Kalman filter algorithm shown below.

For k = 1, 2, ..., L:

Kalman prediction:
x̂_{k|k-1} = A x̂_{k-1|k-1} + u
Σ_{k|k-1} = A Σ_{k-1|k-1} A^T + Q

Kalman gain:
K_k = Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + R)^{-1}

Kalman correction:
x̂_{k|k} = x̂_{k|k-1} + K_k (o(k) - C x̂_{k|k-1})
Σ_{k|k} = Σ_{k|k-1} - K_k (C Σ_{k|k-1} C^T + R) K_k^T

The recursion is usually initialized by x̂_{0|0} = E[x(0)] = μ(0) and Σ_{0|0} = E[x(0) x(0)^T] = Σ(0).

Figure (1): illustration of model mismatch
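The recursion above can be sketched in NumPy as follows. This is our own minimal illustration of the standard Kalman filter, not the authors' implementation, and the function and variable names are ours:

```python
import numpy as np

def kalman_filter(o_seq, A, u, C, Q, R, mu0, Sigma0):
    """Sequentially estimate the hidden wideband vectors x(k) from the
    narrowband observations o(k) with the prediction / gain / correction
    recursion given in the text."""
    x_hat, Sigma = mu0, Sigma0
    estimates = []
    for o in o_seq:
        # Prediction
        x_pred = A @ x_hat + u
        S_pred = A @ Sigma @ A.T + Q
        # Gain (S_inn is the innovation covariance C S_pred C^T + R)
        S_inn = C @ S_pred @ C.T + R
        K = S_pred @ C.T @ np.linalg.inv(S_inn)
        # Correction
        x_hat = x_pred + K @ (o - C @ x_pred)
        Sigma = S_pred - K @ S_inn @ K.T
        estimates.append(x_hat)
    return np.array(estimates)
```

Each pass through the loop consumes one narrowband frame and emits one MMSE estimate of the corresponding wideband feature vector.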

The formulation of the Kalman gain matrix K_k minimizes the trace of the state error covariance Σ_{k|k}; the Kalman filter is therefore an MMSE estimator of the hidden wideband speech vector x(k). Note that the Kalman filter algorithm is sequential. However, θ = {A, u, C, Q, R} should not be stationary. The method described in [8] provides the state space model with several different modes; block by block, the system chooses the best-fitting modes for the input narrowband sequence via clustering techniques. Although such a treatment gives the state space model a certain degree of dynamics, and the subjective and objective performance of the system is satisfactory, a relatively large amount of training data is required, as with other data-oriented methods. Moreover, when the input narrowband speech differs markedly from the training database (e.g. a different speaker or recording environment), severe model error occurs, leading to an unacceptable level of hissing. For example, a false formant trajectory may appear in the high-band spectrogram when a male model is applied to a female input (see figure (1)), and vice versa. We propose a model updating mechanism that does not require off-line training. The basic assumption is that the system is confident about the previously estimated wideband features and uses those results to update the model parameters. The concept is illustrated in figure (2). For an arbitrary input narrowband vector sequence of length N, o(k), o(k+1), ..., o(k+N-1), the linear


state space model is assumed stationary, θ = θ_k. The corresponding wideband estimates x̂(k), x̂(k+1), ..., x̂(k+N-1) are obtained from the previous estimation. Consider the narrowband input vector o(k+N). First, the sequential Kalman filter algorithm continues to estimate x̂(k+N) with the model θ = θ_k. Given the narrowband observation sequence o(k), o(k+1), ..., o(k+N) and the wideband state estimates up to x̂(k+N), i.e. x̂(k), x̂(k+1), ..., x̂(k+N), the linear state space model θ = θ_{k+1} for the sequence from k+1 to k+N is updated in the maximum likelihood sense:

Figure (2): the sequence length is fixed to N frames and the updating block moves forward frame by frame

[Â û]_{k+1} = [Φ_1(k)  Φ_2(k)] · [ Φ_3(k)  Φ_4(k) ; Φ_4(k)^T  N ]^{-1}

Ĉ_{k+1} = Φ_5(k) Φ_6(k)^{-1}

Q̂_{k+1} = (1/N) Σ_{i=k+1}^{k+N} { x_i x_i^T - [Â û]_{k+1} [x_i x_{i-1}^T   x_i]^T }

R̂_{k+1} = (1/(N+1)) [Φ_7(k) - C_{k+1} Φ_8(k)]

where

Φ_1(k) = Σ_{i=k+1}^{k+N} x_i x_{i-1}^T,   Φ_2(k) = Σ_{i=k+1}^{k+N} x_i,
Φ_3(k) = Σ_{i=k+1}^{k+N} x_{i-1} x_{i-1}^T,   Φ_4(k) = Σ_{i=k+1}^{k+N} x_{i-1},
Φ_5(k) = Σ_{i=k+1}^{k+N} o_i x_i^T,   Φ_6(k) = Σ_{i=k}^{k+N} x_i x_i^T,
Φ_7(k) = Σ_{i=k}^{k+N} o_i o_i^T,   Φ_8(k) = Σ_{i=k}^{k+N} x_i o_i^T.

With θ = θ_{k+1} and the next narrowband input o(k+N+1), x̂(k+N+1) can be estimated via the Kalman filter algorithm. Then θ = θ_{k+2} is updated with o(k+1), o(k+2), ..., o(k+N+1) and x̂(k+1), x̂(k+2), ..., x̂(k+N+1). The procedure continues until the end of the input is reached, which in effect performs an on-line training of the linear state space model. The computation is not as burdensome as the model training in [8], since the quantities Φ_1 to Φ_8 require full calculation only once (for the very first update); each subsequent update needs only one addition and one subtraction per quantity. The value of N is set to 160 frames (each frame is about 20 ms in our codec configuration). If N is too small (say, less than 50), the matrices [ Φ_3(k) Φ_4(k) ; Φ_4(k)^T N ] and Φ_6(k) may become singular. The initial linear state space model is trained off-line with environmental signals collected when the speakers are not talking.

The method is named 'blind' because the updating of the state space model is localized within N consecutive speech frames; the model parameters are therefore optimized merely for these N frames. Moreover, the wideband training data are the previously estimated features rather than the true data. One may wonder whether the updating is 'correct', given that there are estimation errors. We performed the following experiment and found that such blind adaptation is reliable. In the first part of the experiment, the narrowband speech is enhanced with the initial model and no adaptation; in this case the model error is quite large. The second part is the normal operation (blind adaptation allowed).

(a) with model parameter updating
(b) without model parameter updating
(c) original wideband LSF trajectory
Figure (3): system capability of tracking wideband features

As depicted in figure (3), under normal operation the proposed system retrieves the general shape of the high-band feature trajectories. The average distortion of the line spectral frequencies is listed in table (1). For reference, the third column gives the result of the well-trained, source-matched memory system presented in [8]. Note that small distortions in the high-order LSFs are the more relevant indicator of quality.
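The maximum likelihood update of θ over one N-frame window can be sketched as follows. This is our own NumPy illustration: the Q update uses the standard residual-covariance form (equivalent, at the ML solution, to the Φ-based expression in the text), and the incremental add/subtract bookkeeping of the Φ sums is omitted for clarity:

```python
import numpy as np

def update_model(x_win, o_win):
    """Re-estimate {A, u, C, Q, R} from one window of estimated states
    x_win[0..N] and observations o_win[0..N] (index 0 plays the role of
    frame k), following the Phi-sum update equations in the text."""
    x_prev, x_curr = x_win[:-1], x_win[1:]   # x_{i-1}, x_i for i = k+1 .. k+N
    N, p = x_curr.shape
    Phi1 = x_curr.T @ x_prev
    Phi2 = x_curr.sum(axis=0)
    Phi3 = x_prev.T @ x_prev
    Phi4 = x_prev.sum(axis=0)
    Phi5 = o_win[1:].T @ x_curr
    Phi6 = x_win.T @ x_win
    Phi7 = o_win.T @ o_win
    Phi8 = x_win.T @ o_win
    # [A u] = [Phi1 Phi2] [[Phi3 Phi4], [Phi4^T N]]^{-1}
    top = np.hstack([Phi1, Phi2[:, None]])
    G = np.block([[Phi3, Phi4[:, None]], [Phi4[None, :], np.array([[N]])]])
    Au = top @ np.linalg.inv(G)
    A_new, u_new = Au[:, :p], Au[:, p]
    C_new = Phi5 @ np.linalg.inv(Phi6)
    # Q as the residual covariance of the fitted state equation
    resid = x_curr - (x_prev @ A_new.T + u_new)
    Q_new = resid.T @ resid / N
    R_new = (Phi7 - C_new @ Phi8) / (N + 1)
    return A_new, u_new, C_new, Q_new, R_new
```

Sliding the window forward one frame only changes one leading and one trailing term in each Φ sum, which is what makes the frame-by-frame update cheap.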

Order     With adaptation   Without adaptation   LDS reference [8]
lsf(1)    0.0056            0.0076               0.0051
lsf(2)    0.0143            0.0165               0.0132
lsf(3)    0.0342            0.0404               0.0290
lsf(4)    0.0425            0.0452               0.0361
lsf(5)    0.0461            0.0481               0.0369
lsf(6)    0.0566            0.0616               0.0471
lsf(7)    0.0587            0.0569               0.0463
lsf(8)    0.0731            0.0739               0.0551
lsf(9)    0.0884            0.1184               0.0770
lsf(10)   0.0608            0.0619               0.0552
lsf(11)   0.0512            0.0663               0.0493
lsf(12)   0.0591            0.0840               0.0578
lsf(13)   0.0741            0.1084               0.0731
lsf(14)   0.0699            0.0764               0.0695
lsf(15)   0.0618            0.0745               0.0620
lsf(16)   0.0567            0.0629               0.0566
lsf(17)   0.0320            0.0402               0.0317
lsf(18)   0.0253            0.0307               0.0252
Table (1): LSF distortions along orders
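Per-order averages of this kind can be computed from feature sequences with a helper along these lines; we are guessing that the measure is the mean absolute LSF deviation per order, since the paper does not spell out the exact formula:

```python
import numpy as np

def lsf_distortion_per_order(lsf_true, lsf_est):
    """Average absolute deviation for each LSF order.

    lsf_true, lsf_est: arrays of shape (n_frames, order). The
    per-order averaging (not necessarily the paper's exact distortion
    formula) is an assumption on our part."""
    lsf_true = np.asarray(lsf_true)
    lsf_est = np.asarray(lsf_est)
    return np.mean(np.abs(lsf_true - lsf_est), axis=0)
```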

Figure (4): spectral envelope comparison. Dotted curve: proposed method (case (a)); dashed curve: memoryless VQ method (case (b)); solid curve: original.

(a) original wideband female speech
(b) estimated wideband speech by the linear state space model with the correct speaker model
(c) estimated wideband speech by blind model adaptation
Figure (5): performance illustration of blind model adaptation

Conceptually, at the beginning the proposed system enhances silent narrowband input, producing spectrally flat, noise-like high-band signals, just as an ordinary linear state space system does. When speech content arrives, the high-band spectral distortion will be large if the initial model does not change accordingly. With adaptation, the model parameters are promptly and locally optimized for the current narrowband input, driving the underlying model toward a voiced model frame by frame. Since the required wideband features are the previous estimates, which are spectrally flat, the new model parameters are in effect optimized for wideband output speech whose narrowband part is similar to the input and whose high band is spectrally flat. Recall that the speech sounds that suffer most under bandwidth limitation are fricatives and plosives; these sounds have a relatively flat high band and little voicing content in the high-band portion. The objective of bandwidth extension is to extend the bandwidth artificially so that the speech becomes perceptually better, as illustrated in figure (4). In case (a), the speech quality is still enhanced even though the high-band formant structure is not recovered; but when case (b) occurs (due to model error or a memoryless design limitation), the human ear is quite sensitive to the resulting noise. Finally, figure (5) shows the performance difference between the proposed method and the conventional LDS system [8].

4. PERFORMANCE EVALUATION

The objective measurement is the high-band spectral distortion defined as follows:


D = ( (2/π) ∫_{π/2}^{π} ( 10 log_{10} S_org(ω) - 10 log_{10} S_ext(ω) )^2 dω )^{1/2}    (3)

The BWE systems of [1][2][3][5][7][8] are implemented and trained in a speaker-dependent manner. The training data are from the phonetically balanced IViE corpus (http://www.phon.ox.ac.uk/IViE). Eighteen minutes of speaker-dependent paragraph-reading speech (about 100,000 frames according to our speech analyzer) are used to train all six systems. The silence segments are collected for training the initial model of the proposed system, which is the system's only off-line training requirement. The performance is listed in tables (2) and (3); the mismatched test data are collected from another speaker of a different gender. As tables (2) and (3) show, the proposed method performs similarly under the two circumstances. When the test data are consistent with the training database, its performance is better than the memoryless systems and comparable with the memory systems. When model mismatch occurs, it outperforms all the data-oriented methods.

Method                         D(dB)   Outlier (>5dB)   Outlier (>7.5dB)
VQ [1]                         3.013   4.88%            1.40%
Linear mapping [3]             2.963   4.51%            1.11%
GMM [2]                        2.697   3.64%            0.98%
HMM [5]                        2.406   2.10%            0.17%
HMM state mapping [7]          2.357   2.21%            0.20%
Linear state space model [8]   2.331   2.05%            0.16%
Proposed                       2.551   3.21%            0.77%
Table (2): Test A (test data matches training data)

Method                         D(dB)   Outlier (>5dB)   Outlier (>7.5dB)
VQ [1]                         5.167   8.60%            3.49%
Linear mapping [3]             4.882   7.16%            2.07%
GMM [2]                        3.698   6.49%            1.78%
HMM [5]                        2.860   3.45%            0.99%
HMM state mapping [7]          3.019   3.61%            1.01%
Linear state space model [8]   2.813   3.37%            0.84%
Proposed                       2.570   3.25%            0.80%
Table (3): Test B (test data mismatches training data)

5. CONCLUSION

In this paper we have presented a bandwidth extension system based on blind adaptation of a linear state space model. Measured by high-band spectral distortion, the proposed system is comparable with data-oriented memory systems and better than memoryless systems. When data mismatch occurs, its performance is better than that of all the data-oriented systems, provided the background environment does not change dramatically. Moreover, no off-line training is required, and the efficient computation of the on-line model adaptation keeps the system delay small.

REFERENCES


[1] N. Enbom and W.B. Kleijn, "Bandwidth Expansion of Speech Based on Vector Quantization of the Mel Frequency Cepstral Coefficients", Proc. Speech Coding, pp. 171-173, 1999.
[2] K.Y. Park and H.S. Kim, "Narrowband to Wideband Conversion of Speech Using GMM Based Transformation", Proc. ICASSP, pp. 1843-1846, 2000.
[3] Y. Nakatoh, M. Tsushima, and T. Norimatsu, "Generation of Broadband Speech from Narrowband Speech Based on Linear Mapping", Electronics and Communications in Japan, Part 2, Vol. 85, No. 8, pp. 44-53, 2002.
[4] M. Nilsson, H. Gustafsson, S.V. Anderson, and W.B. Kleijn, "Gaussian Mixture Model Based Mutual Information Estimation between Frequency Bands in Speech", Proc. ICASSP, pp. I-525-I-528, 2002.
[5] P. Jax and P. Vary, "On Artificial Bandwidth Extension of Telephone Speech", Signal Processing, pp. 1707-1719, 2003.
[6] Y. Agiomyrgiannakis and Y. Stylianou, "Combined Estimation/Coding of Highband Spectral Envelopes for Speech Spectrum Expansion", Proc. ICASSP, pp. 469-472, 2004.
[7] S. Yao and C.F. Chan, "Block-based Bandwidth Extension of Narrowband Speech Signal by using CDHMM", Proc. ICASSP, pp. I-793-I-796, 2005.
[8] S. Yao and C.F. Chan, "Speech Bandwidth Enhancement using State Space Speech Dynamics", Proc. ICASSP, pp. I-489-I-492, 2006.
