ROBUST MULTIBAND EXCITATION CODING OF SPEECH BASED ON VARIABLE ANALYSIS FRAME SIZES

Eric W.M. Yu and Cheung-Fat Chan
Department of Electronic Engineering
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
Email: [email protected] [email protected]

ABSTRACT

A robust technique for coding the multiband excitation (MBE) model parameters of a non-stationary speech segment is proposed in this paper. A non-stationary speech segment that exhibits an abrupt increase in signal energy over time is divided into two quasi-stationary speech segments. A variable analysis frame size technique is proposed to analyze the lower-energy portion and the higher-energy portion separately. A high quality fixed-rate 1.6 kbps variable frame size MBE linear predictive (MBELP) speech coder was developed.

Robust speech analysis and synthesis procedures are developed to improve the speech quality of a 1.6 kbps MBELP speech coder.


1. INTRODUCTION

In MBE speech coding, the MBE model parameters are conventionally obtained by computing the errors between the original short-term speech spectrum and a pitch-dependent periodic spectrum [1]. The extracted model parameters are accurate provided that the speech segment contains a stationary signal. When a fixed-size analysis window slides along the time axis, some of the captured speech segments will contain an abrupt change of signal energy over time due to the non-stationary nature of speech. An example of an abrupt increase in speech signal energy is shown in Figure 1. As shown in Figure 2, no explicit harmonic structure can be observed in the short-term magnitude spectrum of a speech segment captured by the window w_k(m) over the abrupt transition. With the conventional MBE analysis procedure, which relies on the pitch-dependent periodic spectrum for error computation, the accuracy of the MBE model is decreased, which eventually degrades the speech quality. The effect is more obvious in fixed analysis frame size low bit rate speech coding, since the frame shift is longer and the temporal resolution is consequently reduced. In this paper, our attention is focused on improving the robustness of the speech coder to the type of abrupt transition shown in Figure 1. A variable analysis frame size technique is proposed to obtain reliable and accurate MBE model parameters when the windowed speech signal transits abruptly from low energy to high energy over time.


Figure 1. Speech waveform (95 ms) and the windowing schemes.


Figure 2. Spectra of (a) the kth speech segment in Figure 1, (b) the signal s_l(p) and (c) the signal s_h(q).

2. A REVIEW OF FIXED FRAME SIZE MBELP ANALYSIS AND SYNTHESIS

In fixed-rate block coding of a speech signal, a short speech segment of about 20 to 40 ms is conventionally windowed for analysis, and speech model parameters are extracted to represent the segment. The fixed-size analysis window slides along the time axis in a fixed interval of, say, N samples, proceeding in the order shown in Figure 1 by the windows w_{k−1}(m), w_k(m), w_{k+1}(m) and so on, where k is the frame index and −∞ < k < ∞. The windows overlap by (L − N) samples, where L is the window length. For a discrete-time speech signal s(n), −∞ < n < ∞, the segment {s(Nk + m)}, −L/2 ≤ m < L/2, is captured by the window w_k(m). The spectrum of the windowed speech signal s(Nk + m) w_k(m) is then obtained by an L-point discrete Fourier transform. The MBE analysis procedure is then applied to extract the MBE model parameter set Γ_k. The MBE model parameters are the pitch, the V/UV information and the band magnitudes. The band magnitudes are quantized through the corresponding 10th-order LPC spectrum. A fast embedded two-dimensional differential line spectrum pair (2DDLSP) quantization scheme is applied to convert the PARCOR coefficients to LSP parameters and quantize the prediction errors simultaneously [2].

The MBE speech synthesis procedure can be divided into voiced speech synthesis and unvoiced speech synthesis. For voiced speech synthesis, the information associated with the parameter sets Γ_k and Γ_{k+1} is used for voiced band magnitude interpolation between the time samples x_k = Nk + n_o and x_{k+1} = Nk + N + n_o, so that a bank of harmonic oscillators can be applied to synthesize the voiced speech signal between x_k and x_{k+1}. The value n_o denotes a delay in samples; note that n_o is set to zero in Figure 1 only for the sake of illustration. For unvoiced speech synthesis, a windowed white noise whose length equals the analysis window size L is filtered according to the unvoiced band magnitudes obtained from the corresponding set of model parameters. For the model parameter set Γ_k, the filtered noise segment is centred on the time sample x_k. A continuous sequence of unvoiced speech is reconstructed by an overlap-add procedure between adjacent segments of filtered noise. With proper alignment of the time samples, the reproduced speech signal is obtained as the sum of the reconstructed voiced speech and the unvoiced speech.

Due to the non-stationary characteristics of the speech signal, it is possible for a fixed-size analysis window to capture a speech segment that is non-stationary, and the accuracy of the extracted model parameters will consequently be decreased.
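To make the framing concrete, the sketch below extracts the kth windowed segment and its L-point spectrum as described above. The values L = 256 and N = 200 correspond to the 32 ms window and 25 ms shift of Section 4 at 8 kHz; the Hamming window shape and the function name are illustrative assumptions, not specified by the paper.

```python
import numpy as np

def windowed_spectrum(s, k, L=256, N=200):
    """Fixed frame size analysis front end (sketch).

    s : 1-D speech signal (numpy array, assumed long enough around frame k)
    k : frame index; the frame covers samples Nk - L/2 .. Nk + L/2 - 1
    L : analysis window length (32 ms at 8 kHz -> 256 samples)
    N : frame shift (25 ms at 8 kHz -> 200 samples)
    """
    w = np.hamming(L)                      # window shape is an assumption
    start = N * k - L // 2                 # sample index of m = -L/2
    segment = s[start:start + L]           # {s(Nk + m)}, -L/2 <= m < L/2
    spectrum = np.fft.fft(segment * w, L)  # L-point DFT of s(Nk + m) w_k(m)
    return segment, spectrum
```

MBE pitch, V/UV and band magnitude estimation would then operate on the magnitude of this spectrum, as in [1].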

3. VARIABLE FRAME SIZE MBE ANALYSIS AND SYNTHESIS

The proposed variable analysis frame size technique consists of a method for detecting an abrupt increase in signal energy and the corresponding analysis and synthesis procedures.

3.1 Detection of Abrupt Transition

For each speech frame that follows a low-energy speech frame, we detect whether an abrupt increase in signal energy occurs and determine the time instant of the occurrence. An abrupt transition in the speech segment {s(Nk + m)}, −L/2 ≤ m < L/2, can be detected by tracking the energy ratio α(k, j) with respect to the time sample j, where j = 0, 1, ..., L − 1. The energy ratio is defined as

α(k, j) = [ (1/(L − j)) Σ_{i=j}^{L−1} s²(Nk − L/2 + i) ] / [ (1/j) Σ_{i=0}^{j−1} s²(Nk − L/2 + i) ]    (1)

The transition occurs at the time index

τ_k = argmax_j α(k, j)    (2)

provided that α(k, τ_k) is greater than a pre-defined threshold.
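The following is a minimal sketch of the detector implied by (1) and (2). The threshold value, the starting index j = 1 (to keep the denominator non-empty) and the silence guard are illustrative assumptions; the paper only states that a pre-defined threshold is used.

```python
import numpy as np

def detect_abrupt_transition(s, k, L=256, N=200, threshold=8.0):
    """Return tau_k if frame k contains an abrupt energy increase, else None.

    Implements the energy ratio alpha(k, j) of (1) and the argmax of (2).
    threshold = 8.0 is an assumed value, not taken from the paper.
    """
    start = N * k - L // 2
    energy = s[start:start + L].astype(float) ** 2

    best_j, best_ratio = None, -np.inf
    for j in range(1, L):                  # j = 0 would leave the denominator empty
        tail = energy[j:].mean()           # (1/(L-j)) * sum of s^2 over i = j .. L-1
        head = energy[:j].mean()           # (1/j) * sum of s^2 over i = 0 .. j-1
        if head <= 0.0:
            continue                       # guard against leading silence (assumption)
        ratio = tail / head                # alpha(k, j), equation (1)
        if ratio > best_ratio:
            best_ratio, best_j = ratio, j  # argmax over j, equation (2)

    return best_j if best_ratio > threshold else None
```

A returned index is the τ_k used to split the frame in Section 3.2; as stated above, the check is only run on frames that follow a low-energy frame.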

3.2 Analysis

Consider the speech segment {s(Nk + m)}, −L/2 ≤ m < L/2, captured by the window w_k(m). If no abrupt increase in speech energy is detected, the parameter set Γ_k is extracted from the windowed speech segment s(Nk + m) w_k(m) as described in Section 2. If an abrupt transition is detected, then with the aid of the transition time index τ_k we split the original fixed-size analysis window w_k(m) into a window w_l(p), −L/2 ≤ p < −L/2 + τ_k, for the lower-energy portion of the speech segment and a window w_h(q), −L/2 + τ_k ≤ q < L/2 + τ_k, for the higher-energy portion, as shown in Figure 1. The window w_l(p) is applied to the low-energy portion to form the speech segment

s_l(p) = s(Nk + p) w_l(p),    −L/2 ≤ p < −L/2 + τ_k    (3)

while the window w_h(q) is applied to the high-energy portion to form the speech segment

s_h(q) = s(Nk + q) w_h(q),    −L/2 + τ_k ≤ q < L/2 + τ_k    (4)
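As a sketch of (3) and (4), the detected index τ_k splits the L-sample frame into a τ_k-sample low-energy segment and an L-sample high-energy segment that starts at the transition. The Hamming window shapes and the function name are assumptions for illustration; the paper does not specify the shapes of w_l(p) and w_h(q).

```python
import numpy as np

def split_transient_frame(s, k, tau_k, L=256, N=200):
    """Split the frame at the transition index tau_k, per (3) and (4).

    Returns (s_l, s_h): the windowed low-energy segment (tau_k samples) and
    the windowed high-energy segment (L samples beginning at the transition).
    Window shapes are illustrative assumptions.
    """
    start = N * k - L // 2                           # sample index of m = -L/2
    w_l = np.hamming(tau_k)                          # w_l(p), -L/2 <= p < -L/2 + tau_k
    w_h = np.hamming(L)                              # w_h(q), -L/2 + tau_k <= q < L/2 + tau_k
    s_l = s[start:start + tau_k] * w_l               # (3): s_l(p) = s(Nk + p) w_l(p)
    s_h = s[start + tau_k:start + tau_k + L] * w_h   # (4): s_h(q) = s(Nk + q) w_h(q)
    return s_l, s_h
```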

As shown in Figure 2, the short-term magnitude spectra obtained individually from the windowed speech segments s_l(p) and s_h(q) provide explicit information about the harmonic structure. Instead of performing the discrete Fourier transforms and the subsequent MBE analysis individually on s_l(p) and s_h(q), we propose a more efficient approach for coding the parameters of these two speech segments through some reasonable assumptions. For the speech signal s_l(p), the excitation is assumed to be the same as that of the previous speech segment s(Nk − N + m) w_{k−1}(m), −L/2 ≤ m < L/2. Therefore, the V/UV information and pitch of the parameter set Γ_{k−1} can be used to represent the excitation of s_l(p). The band magnitudes of s_l(p) are quantized through the corresponding 10th-order LPC spectrum. We utilize the autocorrelation sequence of s_l(p) to compute the corresponding 10th-order PARCOR coefficients through the LeRoux-Gueguen method [3]. The PARCOR coefficients are quantized by the embedded 2DDLSP scheme applied in [2]. For the speech signal s_h(q), the whole set of model parameters is assumed to be the same as Γ_{k+1}, the parameter set of the following speech segment s(Nk + N + m) w_{k+1}(m), −L/2 ≤ m < L/2.
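The paper computes the PARCOR coefficients with the fixed-point LeRoux-Gueguen method [3]; as a stand-in, the sketch below uses the floating-point Levinson-Durbin recursion, which yields the same reflection (PARCOR) coefficients from the autocorrelation sequence of s_l(p). The function name and the degenerate-segment guard are assumptions.

```python
import numpy as np

def parcor_from_segment(s_l, order=10):
    """10th-order PARCOR (reflection) coefficients of the windowed segment s_l(p).

    Stand-in for the LeRoux-Gueguen method [3]: the Levinson-Durbin recursion
    below produces the same reflection coefficients in floating point.
    """
    # Autocorrelation sequence r[0..order] of the segment
    r = np.array([np.dot(s_l[:len(s_l) - i], s_l[i:]) for i in range(order + 1)])

    a = np.zeros(order + 1)    # prediction coefficients, a[0] = 1 is implicit
    parcor = np.zeros(order)   # reflection (PARCOR) coefficients
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break              # degenerate (e.g. all-zero) segment; guard is an assumption
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k_i = acc / err
        parcor[i - 1] = k_i
        a_new = a.copy()
        a_new[i] = k_i
        a_new[1:i] = a[1:i] - k_i * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k_i * k_i)
    return parcor
```

These coefficients are what the embedded 2DDLSP scheme of [2] would then quantize.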

If we denote the parameter sets of the speech segments s_l(p) and s_h(q) by Γ_{k,l} and Γ_{k,h}, respectively, then the pitch and V/UV information of the parameter set Γ_{k−1} together with the 10th-order LPC spectrum of s_l(p) constitute the parameter set Γ_{k,l}, while the parameter set Γ_{k,h} is equivalent to Γ_{k+1}. Consequently, we only have to encode the time index of the abrupt transition and the LPC spectrum of s_l(p) from the parameter set Γ_{k,l}; there is no need to encode Γ_{k,h} since Γ_{k,h} = Γ_{k+1}. The time index of the abrupt transition and the LPC spectrum of s_l(p) are grouped into a parameter set Γ̃_k. The parameter set Γ̃_k occupies the time slot that would be used by Γ_k if no abrupt transition had been detected. For fixed-rate speech coding, the number of bits required by Γ̃_k is adjusted to be the same as for Γ_k. Details of the bit allocation are provided in Section 4.

3.3 Synthesis

The synthesis procedure is the same as that described in Section 2 if no abrupt transition was detected. Otherwise, the varied window sizes and positions have to be catered for in the speech synthesis. Prior to the synthesis of voiced and unvoiced speech, the parameter sets Γ_{k,l} and Γ_{k,h} have to be recovered. The parameter set Γ_{k,l} is recovered by combining the pitch and V/UV information obtained from the parameter set Γ_{k−1} with the band magnitudes and the abrupt transition time index τ_k obtained from the parameter set Γ̃_k. The parameter set Γ_{k,h} is recovered using the relationship Γ_{k,h} = Γ_{k+1}. For voiced speech synthesis, the parameter sets Γ_{k−1} and Γ_{k,l} are used for voiced band magnitude interpolation between the time samples x_{k−1} = Nk − N + n_o and

y_{k,l} = Nk − L(1 + γ)/2 + τ_k + n_o    (5)

as shown in Figure 1. The information associated with Γ_{k,l} and Γ_{k,h} is used for voiced band magnitude interpolation between the time samples y_{k,l} and

y_{k,h} = Nk − L(1 − γ)/2 + τ_k + n_o    (6)

Subsequently, the information associated with Γ_{k,h} and Γ_{k+1} is used for voiced band magnitude interpolation between the time samples y_{k,h} and x_{k+1}. The parameter γ in (5) and (6) controls the gradient of the linear interpolation of the voiced band magnitudes from the low-energy portion to the high-energy portion. In fixed window size analysis, γ = N/L; in (5) and (6) a lower value of γ = 0.25 is adopted in order to reproduce the abrupt increase in signal energy closely. For unvoiced speech synthesis, a windowed white noise whose length equals that of the window w_l(p) is filtered according to the unvoiced band magnitudes obtained from the parameter set Γ_{k,l}. The filtered noise segment is centred on the time sample

x_{k,l} = Nk − L/2 + τ_k/2 + n_o    (7)

as shown in Figure 1, and is joined with the noise segment associated with Γ_{k−1} by the overlap-add method. Afterwards, a windowed white noise whose length equals that of the window w_h(q) is filtered according to the information obtained from the parameter set Γ_{k,h}. The filtered noise segment is centred on the time sample

x_{k,h} = Nk + τ_k + n_o    (8)

and is joined with the noise segment associated with Γ_{k,l}. Similarly, a windowed white noise whose length equals that of the window w_{k+1}(m) is filtered according to the information obtained from the parameter set Γ_{k+1}. The filtered noise segment is centred on the time sample x_{k+1} and is joined with the noise segment associated with Γ_{k,h}.
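For clarity, a small sketch of the transient-mode synthesis time samples defined in (5)-(8) follows. The values L = 256 and N = 200 correspond to the 32 ms window and 25 ms shift of Section 4 at 8 kHz, and γ = 0.25 is the value stated above; the function name is an assumption.

```python
def transient_synthesis_boundaries(k, tau_k, L=256, N=200, gamma=0.25, n_o=0):
    """Interpolation and noise-centering time samples of (5)-(8)."""
    y_kl = N * k - L * (1 + gamma) / 2 + tau_k + n_o  # (5): start of low-to-high interpolation
    y_kh = N * k - L * (1 - gamma) / 2 + tau_k + n_o  # (6): end of low-to-high interpolation
    x_kl = N * k - L / 2 + tau_k / 2 + n_o            # (7): centre of the noise segment for Gamma_{k,l}
    x_kh = N * k + tau_k + n_o                        # (8): centre of the noise segment for Gamma_{k,h}
    return y_kl, y_kh, x_kl, x_kh
```

Note that y_{k,h} − y_{k,l} = γL, so the voiced band magnitudes ramp from the low-energy to the high-energy values over γL samples centred on the transition sample Nk − L/2 + τ_k + n_o.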

4. IMPLEMENTATION OF A FIXED 1.6 KBPS VARIABLE FRAME SIZE MBELP SPEECH CODER

The proposed variable analysis frame size technique was applied to develop a 1.6 kbps MBELP speech coder which produces intelligible speech with good quality. The proposed speech coder operates in either the steady mode or the transient mode according to the result of the abrupt transition detection described in Section 3.1. The continuous-time speech signal is sampled at 8 kHz. In the steady mode, the proposed speech coder has a fixed analysis window length of 32 ms and a fixed analysis window shift of 25 ms. For a fixed 1.6 kbps speech coder, 40 bits per frame are available to quantize the model parameters in both the steady and transient modes. The bit allocation per frame of the 1.6 kbps variable analysis frame size MBELP speech coder is shown in Table 1.

We allocate 8 bits for both pitch quantization and operation mode indication. The first (2^8 − 1) quantization levels are used for pitch quantization, while the last quantization level is used to indicate the transient mode. In the steady mode, the pitch is quantized with a search range from 76.4 Hz to 400 Hz. The band magnitudes are quantized with 30 bits: the corresponding 10th-order spectral envelope is quantized with 24 bits and the gain with 6 bits. In order to quantize the V/UV information with only 2 bits, the conventional binary sequence for V/UV information quantization is replaced by a 2-bit V/UV transition frequency index [2]. The V/UV transition frequency is obtained by an analysis-by-synthesis procedure in which the error between the original spectrum and the synthetic spectrum associated with the simplified V/UV mixture function is minimized in a closed loop.

If the speech segment is found to have an abrupt increase in signal energy, the speech analysis procedure follows Section 3.2 and the pitch is not quantized. As an indication of the transient mode, all 8 bits originally used for pitch quantization are set to binary 1. In the transient mode, only the parameter set Γ̃_k described in Section 3.2 has to be quantized. The 2 bits originally used for quantization of the V/UV transition frequency index are used for quantization of the abrupt transition time index τ_k. The 10th-order spectral envelope and the gain of the windowed signal s_l(p) are quantized with 24 bits and 6 bits, respectively.

Table 1. Bit Allocation

                                   NO. OF BITS
PARAMETERS                 Steady Mode      Transient Mode
Pitch                      8                8 (11111111b)
V/UV or Abrupt
Transition Index           2 (V/UV)         2 (Abrupt Transition Index)
Gain                       6                6
Spectral Envelope          24               24
TOTAL                      40               40
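A sketch of how one 40-bit frame of Table 1 might be assembled is shown below. The field order and the packing into a single integer are assumptions for illustration; the paper specifies only the field widths and that an all-ones pitch field signals the transient mode.

```python
def pack_frame(pitch_idx=None, vuv_or_tau_idx=0, gain_idx=0, envelope_idx=0):
    """Assemble one 40-bit frame per Table 1 (field order is an assumption).

    Steady mode:    pitch_idx in 0..254 (255 pitch levels), vuv_or_tau_idx is
                    the 2-bit V/UV transition frequency index.
    Transient mode: pitch_idx is None, the 8-bit pitch field is set to all
                    ones (11111111b), and vuv_or_tau_idx carries the 2-bit
                    quantized abrupt transition time index tau_k.
    """
    pitch_field = 255 if pitch_idx is None else pitch_idx  # all ones flags transient mode
    frame = pitch_field                                    # 8 bits: pitch / mode
    frame = (frame << 2) | (vuv_or_tau_idx & 0x3)          # 2 bits: V/UV or transition index
    frame = (frame << 6) | (gain_idx & 0x3F)               # 6 bits: gain
    frame = (frame << 24) | (envelope_idx & 0xFFFFFF)      # 24 bits: spectral envelope (2DDLSP)
    return frame                                           # 8 + 2 + 6 + 24 = 40 bits
```

At a 25 ms frame shift, 40 bits per frame gives exactly the fixed rate of 1600 bit/s.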

5. RESULTS

Waveforms and spectra of the original speech signal windowed by w_k(m) and of the reproduced speech signal are shown in Figures 3 and 4, respectively. The waveform and spectrum of the speech signal reproduced by a fixed analysis frame size MBELP speech coder operating at the same bit rate are also shown for comparison. It is obvious that both the synthetic spectrum and the event localization of the speech segment with an abrupt energy increase are improved by the proposed technique. Informal listening indicates that the quality of the proposed 1.6 kbps MBELP speech coder is higher than that of its fixed analysis frame size counterpart and is comparable with that of the 2.4 kbps MBELP speech coder.

Figure 3. Waveforms of (a) the kth speech segment in Figure 1, (b) the speech reproduced by the proposed 1.6 kbps MBELP coder and (c) the speech reproduced by a fixed frame size 1.6 kbps MBELP coder.


Figure 4. Spectra of (a) the kth speech segment in Figure 1, (b) the speech reproduced by the proposed 1.6 kbps MBELP coder and (c) the speech reproduced by a fixed frame size 1.6 kbps MBELP coder.

6. CONCLUSION

A robust variable analysis frame size technique was developed for the MBE analysis and synthesis of non-stationary speech segments that exhibit an abrupt increase in energy over time. The proposed technique was applied successfully in the development of a high quality 1.6 kbps MBELP speech coder.


REFERENCES

[1] D.W. Griffin and J.S. Lim, "Multi-band excitation vocoder," IEEE Trans. on ASSP, vol. 36, pp. 1223-1235, 1988.
[2] W.M.E. Yu and C.F. Chan, "Efficient multiband excitation linear predictive coding of speech at 1.6 kbps," Proc. Eurospeech, pp. 685-688, Sept. 1995.
[3] J. LeRoux and C. Gueguen, "A fixed point computation of partial correlation coefficients," IEEE Trans. on ASSP, vol. ASSP-25, pp. 257-259, 1977.
