MIXED EXCITATION LINEAR PREDICTION CODING OF WIDEBAND SPEECH AT 8 KBPS

*Weiran Lin, *Soo Ngee Koh, #Xiao Lin

*School of EEE, #Center for Signal Processing, School of EEE
Nanyang Technological University, Nanyang Avenue, Singapore 639798

ABSTRACT

This paper presents our study on the feasibility and effectiveness of using the MELP (Mixed Excitation Linear Prediction) model for coding wideband (7 kHz) speech signals at a transmission bit rate of 8 kbps. To achieve reasonably good subjective quality for the decoded speech while keeping the operating bit rate low, modifications to the pitch estimation, LP analysis/synthesis and post-filtering stages of the original MELP model are discussed. Informal listening tests show that the subjective quality of the decoded speech of the proposed coder is rated slightly better than that of the MPEG4 CELP coder operating at 14.4 kbps for both male and female utterances. The subjective quality of the decoded female utterances from the proposed coder operating at 8.4 kbps is rated comparable to that produced by the ITU G.722 coder operating at 48 kbps.

1. INTRODUCTION

Wideband speech signals are band-limited to 50-7000 Hz and sampled at 16000 Hz. Compared with narrowband telephone speech, wideband speech is much more natural and intelligible, and it is essential for high-quality voice communication with almost 100% intelligibility. In recent years, with the introduction of ISDN, wideband speech coding has advanced rapidly. Its development has been accelerated by the commercial success of consumer and professional digital products, such as video teleconferencing, high-quality digital AM radio broadcasting and high-fidelity telephony, in which high-quality wideband speech signals are indispensable.

In 1986, the International Telegraph and Telephone Consultative Committee (CCITT, now ITU-T) recommended the G.722 standard for wideband speech and audio coding. This codec provides good speech quality at 64, 56 and 48 kbps. In September 1993, the International Telecommunication Union (ITU-T) started the process of standardizing wideband speech coding at 16, 24 and 32 kbps, with the G.722 standard serving as a reference for the development of the new coding schemes. In addition to the ITU efforts, MPEG4 version 1 was standardized by the Moving Picture Experts Group (MPEG) in October 1998 for multimedia applications, which also include wideband speech coding at 12-24 kbps.

Most previously reported wideband speech coders are CELP coders [1], [2], [3], [4], [5], [6], transform coders [7], or combinations of transform and predictive coding techniques [8], [9]. All of these coders operate at around 16 kbps, and the minimum bit rate obtained from them is 13 kbps. Few studies [10] on wideband speech coding at 8 kbps or lower have been reported so far.

In 1995, the Mixed Excitation Linear Predictive (MELP) [11][12][13] vocoder was proposed for low bit rate narrowband speech coding at 2.4 kbps. The MELP coder solves the problems associated with the traditional LPC coder, and its performance is better than that of the Federal Standard CELP at 4.8 kbps. In 1996, MELP was chosen by the DoD Digital Voice Processing Consortium (DDVPC) to replace the 2.4 kbps Federal Standard FS 1015 (LPC-10) [14]. Since MELP achieves good-quality narrowband speech at 2.4 kbps, it should be possible to achieve good-quality wideband speech at 0.5 bit/sample (8 kbps at a 16 kHz sampling rate) or lower. The feasibility of using MELP for wideband speech coding is investigated in this paper.

In the following sections, the basic MELP scheme is first reviewed. The proposed wideband speech coding scheme based on MELP is then described. The quality of the decoded speech obtained from the proposed coder is subjectively compared with that of the MPEG4 CELP coder and the G.722 coder. Finally, conclusions are drawn and future work is suggested.

2. BASIC MELP CODER

MELP is a speech coding scheme based on the traditional LPC vocoder. It offers a fully parametric representation of speech signals and can produce communication-quality speech at a bit rate of 2.4 kbps.

In the analysis stage of the MELP coder, the voicing strengths are determined from the maximum autocorrelation of the signals filtered by 5 fixed bandpass filters and of the time envelopes of the bandpassed signals. The periodic/aperiodic flag is set according to the voicing strength. Linear prediction (LP) analysis is performed on the windowed input speech to compute the coefficients of the all-pole LP filter, from which the residual signal is obtained. The pitch is estimated from the maximum of the autocorrelation function of the lowpass-filtered residual signal or the lowpass-filtered speech signal. A pitch doubling check corrects estimates that are actually multiples of the true pitch. Fractional pitch refinement is performed directly on the input speech using an adaptive window based on the initial pitch estimate. The first 10 amplitudes of the pitch harmonics are computed by applying the FFT to the windowed LP residual signal. The final quantities transmitted by the MELP transmitter are the pitch value, the v/uv strengths, the gain, the periodic/aperiodic flag and the LP coefficients.
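The autocorrelation-based pitch search described above can be illustrated with a minimal sketch. It is not the standard's implementation: the lowpass filtering, the doubling check and the fractional refinement are omitted, and the function name `estimate_integer_pitch`, the lag bounds and the test signal are assumptions made for illustration only.

```python
import numpy as np

def estimate_integer_pitch(signal, min_lag, max_lag):
    """Return (lag, r): the lag with the largest normalized autocorrelation r."""
    x = np.asarray(signal, dtype=float)
    best_lag, best_r = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        head, tail = x[:-lag], x[lag:]            # samples that are `lag` apart
        energy = np.sqrt(np.dot(head, head) * np.dot(tail, tail))
        r = np.dot(head, tail) / energy if energy > 0.0 else 0.0
        if r > best_r:                            # strict '>' keeps the shortest lag on ties
            best_lag, best_r = lag, r
    return best_lag, best_r

# Hypothetical test signal: an impulse train with a period of 100 samples.
frame = np.zeros(640)
frame[::100] = 1.0
lag, r = estimate_integer_pitch(frame, min_lag=40, max_lag=320)
print(lag, r)
```

For this idealized signal the lags 100, 200 and 300 all correlate perfectly, and only the strict comparison keeps the shortest one. On real speech a multiple of the true period can win the search, which is the kind of error the pitch doubling check is meant to catch.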

[Figure 1: block diagram. Block and signal labels: Pitch, Fourier Coef., Pulse Generator, Noise Generator, Shaping Filter (two instances), v/uv strengths, Periodic/Aperiodic, +, Spectral Enhancement Filter, LP Analysis, scale, Pulse Dispersion Filter, Synthesized speech.]

Figure 1. Standard MELP decoder

At the decoder (Figure 1), mixed excitation is used to reduce the distortions experienced in the conventional LPC vocoder. The periodic excitation is generated by an inverse DFT (IDFT) whose length is the interpolated pitch period, with the interpolated Fourier coefficients placed at the harmonic frequencies derived from the pitch. The pitch pulse positions are varied by up to 25% according to the periodic/aperiodic flag in order to reproduce the erratic glottal pulses during transitions and to reduce tonal noise. The noise excitation is generated by a normalized uniform random number generator. These 2 excitations are passed through 2 separate time-varying shaping filters whose coefficients are derived from the v/uv strengths, and the mixed excitation is obtained by adding the 2 filtered excitations together. So that the synthesized speech matches the original spectral shape in the formant regions, an adaptive spectral enhancement filter is applied to the mixed excitation. LP synthesis is then performed on the filtered signal. Finally, the output of the LP synthesizer is scaled by the gain and filtered by the pulse dispersion filter.

3. MELP CODING OF WIDEBAND SPEECH

The MELP standard is modified to encode wideband speech signals. The modifications affect the LP analysis/synthesis stage, pitch estimation, voiced/unvoiced strength determination and the post-filtering stage.

3.1 Modification to the MELP Model

In the proposed wideband MELP coder, the frame size is set to 180 samples, which is 11.25 ms at the 16 kHz sampling rate. This is half the frame duration used in the MELP standard. Since the frame size controls the update rate of all the coding parameters, a shorter frame interval leads to a faster update rate, which directly improves the overall quality of the decoded speech.

The integer pitch is estimated from the maximum autocorrelation of the lowpass-filtered speech signal. This pitch value influences the accuracy of the v/uv strength estimates as well as the gain and Fourier coefficient analysis, so it is very important to obtain an accurate pitch at this stage. The lowpass filter used in the MELP standard is changed from a 6th-order filter with a 1 kHz cutoff to a 10th-order filter with an 800 Hz cutoff. This change leads to a larger variation in the autocorrelation of the lowpass-filtered signal, especially in transition regions, so that a more accurate integer pitch is selected from the maximum autocorrelation value.

In bandpass voicing analysis, the speech signal is filtered into five frequency bands with passbands of 0-1000, 1000-2000, 2000-4000, 4000-6000 and 6000-8000 Hz in order to fit the wideband speech frequency range. In addition, the energy zero-crossing rate [15] is used to adjust the v/uv decision. The energy zero-crossing rate is defined as:

$$R_{ezc} = \frac{E_{rms}}{ZC} \qquad (1)$$

where $ZC$ is the number of zero crossings in each frame and $E_{rms}$ is the energy of the frame. Based on our experiments, the $R_{ezc}$ threshold is set to 2.5. Our simulation results showed that this method improves the v/uv decisions, especially for voiced frames whose energy is very small.

In LP analysis, we find that the linear prediction (LP) order needs to be increased to 20 or above in order to maintain reasonably good quality for the synthesized speech. The length of the window used for the LP analysis and the Fourier analysis is increased from 200 to 300 samples, as this results in smoother synthesized speech.

The final pitch estimate in the MELP standard is based on the integer pitch obtained in the initial stage. In our coder, the final estimation is instead carried out over the full range of 40-320 samples rather than around the integer pitch, which reduces the number of wrong final pitch estimates. In addition, to preserve the continuity of the pitch estimates between neighboring speech frames and to reduce sudden pitch changes that lead to obvious distortions, a pitch tracking method is used that considers the pitch estimates from the previous and future frames. If the current pitch is less than 0.8 times or more than 1.2 times the previous or future pitch, it is judged to be a sudden pitch change and is replaced by the average of the previous and future pitch estimates. With this method, the occasional wrong pitch estimate, especially in transition regions, is corrected. The quality of the synthesized speech is improved at the cost of one more frame of delay.

In the decoder, since the coefficients of the adaptive spectral enhancement filter are calculated by bandwidth expansion of the interpolated LPC filter coefficients, the order of this enhancement filter is increased to 20 to match the increased order of the LP filter. The pulse dispersion filter is eliminated, since the 20th-order LP synthesizer can adequately represent the speech spectral envelope.
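The pitch tracking rule of Section 3.1 can be sketched as follows. This is a literal reading of the rule under stated assumptions: the function name `track_pitch` is hypothetical, the check fires when the current pitch is out of range relative to either neighbor, and unvoiced or missing neighbor frames are not handled.

```python
def track_pitch(prev_pitch, cur_pitch, next_pitch, low=0.8, high=1.2):
    """Replace a sudden pitch change by the average of the neighboring estimates."""

    def out_of_range(reference):
        # True when the current pitch is below 0.8x or above 1.2x the reference pitch.
        ratio = cur_pitch / reference
        return ratio < low or ratio > high

    # Assumption: a jump relative to either neighbor triggers the correction.
    if out_of_range(prev_pitch) or out_of_range(next_pitch):
        return 0.5 * (prev_pitch + next_pitch)
    return cur_pitch

# A doubled estimate between two consistent neighbors is pulled back in line,
# while a pitch that stays within the 0.8-1.2 band is left untouched.
print(track_pitch(100, 200, 102))
print(track_pitch(100, 104, 102))
```

Because the rule needs the estimate from the following frame, the encoder has to wait for that frame, which is the extra frame of delay mentioned in the text.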

3.2 Parameter Quantization and Encoding

• Pitch Quantization

The pitch and the lowest band voicing strength are quantized jointly, as in the standard MELP coder, and a uniform quantizer is still used. The logarithm of the pitch is quantized with a 400-level (9-bit) uniform quantizer ranging from log 40 to log 320, instead of the 99 levels used in the standard MELP coder. Since the sampling rate is 16 kHz, the pitch resolution should be finer than in the narrowband case. The resulting index is mapped to a 9-bit codeword using a lookup table.
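As an illustration of the log-domain uniform quantizer described above, here is a minimal sketch. The function names are hypothetical, the placement of the first and last levels exactly at 40 and 320 samples is an assumption, and the joint coding with the lowest band voicing strength and the codeword lookup table are not reproduced. The base of the logarithm does not affect the result, since levels that are uniform in one base are uniform in any other.

```python
import math

PITCH_MIN = 40.0    # lower end of the pitch range, in samples
PITCH_MAX = 320.0   # upper end of the pitch range, in samples
LEVELS = 400        # number of quantizer levels (fits in a 9-bit index)

_LOG_MIN = math.log10(PITCH_MIN)
_STEP = (math.log10(PITCH_MAX) - _LOG_MIN) / (LEVELS - 1)

def quantize_pitch(pitch):
    """Map a pitch period in samples to a level index in 0..LEVELS-1."""
    clipped = min(max(pitch, PITCH_MIN), PITCH_MAX)
    return round((math.log10(clipped) - _LOG_MIN) / _STEP)

def dequantize_pitch(index):
    """Map a level index back to a pitch period in samples."""
    return 10.0 ** (_LOG_MIN + index * _STEP)

index = quantize_pitch(157.3)
print(index, dequantize_pitch(index))
```

Under these assumptions adjacent levels differ by a constant ratio of about 0.5%, so the relative pitch error is the same at short and long pitch periods.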

• Bandpass Voicing and Gain Quantization

The lowest bandpass voicing strength is already quantized together with the pitch. The other 4 bandpass voicing strengths and the gain values are quantized in the same way as in the standard MELP coder, except that the second gain value is quantized with a 256-level uniform quantizer in the range of 6.0 to 77.0 dB. The first gain value is quantized with 4 bits instead of 3, using the same adaptive algorithm recommended in the MELP standard. Although the ranges of the gain values in narrowband and wideband speech signals are almost the same, more bits are required to represent the gain values more accurately.

• LPC Coefficient Coding

In the proposed wideband speech coder, 20 LSFs are used to model the speech spectral envelope, and the LSFs are quantized using split VQ (SVQ) [16]. To optimize the performance of SVQ, the Dynamic Resolution Analysis (DRA) [17] scheme is used to decide the split pattern and the bit allocation for each partition. In the DRA scheme, a quantitative measure called Dynamic Resolution (DR) is used to roughly predict the performance of a specific combination of split pattern and bit allocation. The optimal combination is likely to be found among the top 10% of the combinations with the largest DR values. According to the result of DRA, the 20 LSFs are split into 5 subvectors of dimensions 3, 3, 3, 4 and 7, with bit allocations of 10, 10, 10, 11 and 9 bits respectively. This allocation results in a total bit rate of 7.55 kbps.

Based on this partition pattern, further investigation was carried out to obtain the optimum bit allocation. For each partition, we generated codebooks of various sizes and used the average spectral distortion (ASD) as the objective distortion measure for each codebook. 151125 training vectors obtained from the training speech in the TIMIT database were used. All VQ codebooks were designed using the LBG algorithm with LPCW [16] weights. The results are shown in Table 1.

Using the results in Table 1, we selected (14,13,11,11,11) as the bit allocation for the 5 partitions, which corresponds to an overall bit rate of 8.4 kbps. These values are chosen such that, for each partition, the percentage of outliers over 4 dB is zero. With this pattern, the overall ASD of the 20 LSFs due to quantization is 1.2186 dB, the percentage of outliers between 2 and 4 dB is 5.15% and the percentage of outliers over 4 dB is 0.089%. Although this result does not meet the condition for transparent quantization [16] (an average spectral distortion of about 1 dB, less than 2% of outliers between 2 and 4 dB and no outliers over 4 dB), the performance is still acceptable according to our perceptual listening tests. Further analysis showed that, of the 5.15% of outliers between 2 and 4 dB, 3.97% are induced by the quantization of the last partition. This indicates that in wideband speech the lower part of the spectrum is much more important than the higher part. As long as the quantization of the lower-frequency part satisfies the transparent quantization condition, the perceptual quality is acceptable even though the overall distortion does not meet the condition.

                                   Codebook size (bits)
Partition    Measure               9       10      11      12      13      14
LSF 1-3      ASD (dB)              0.420   0.35    0.288   0.239   0.194   0.154
             Outliers 2-4 dB (%)   1.87    1.83    1.11    1.06    0.67    0.33
             Outliers >4 dB (%)    0.33    0.18    0.13    0.03    0.01    0.00
LSF 4-6      ASD (dB)              0.465   0.383   0.314   0.253   0.208   0.170
             Outliers 2-4 dB (%)   0.81    0.62    0.42    0.32    0.22    0.15
             Outliers >4 dB (%)    0.076   0.063   0.076   0.025   0.00    0.00
LSF 7-9      ASD (dB)              0.416   0.336   0.270   0.218   0.175   0.143
             Outliers 2-4 dB (%)   0.22    0.10    0.013   0.013   0.013   0.013
             Outliers >4 dB (%)    0.025   0.013   0.00    0.00    0.00    0.00
LSF 10-13    ASD (dB)              0.656   0.561   0.478   0.409   0.350   0.299
             Outliers 2-4 dB (%)   0.55    0.20    0.15    0.11    0.11    0.089
             Outliers >4 dB (%)    0.076   0.013   0.00    0.00    0.00    0.00
LSF 14-20    ASD (dB)              1.173   1.082   0.985   0.912   0.837   -
             Outliers 2-4 dB (%)   6.07    3.55    2.06    1.17    0.70    -
             Outliers >4 dB (%)    0.089   0.064   0.00    0.00    0.00    -

Table 1. ASD and percentage of outliers for quantizers of various sizes for the 5 partitions

To keep the bit rate at 8 kbps, the above bit allocation is adjusted to (14,10,10,11,10). With this pattern, the overall ASD due to the quantization of the 20 LSFs is 1.3069 dB, and the percentages of outliers between 2 and 4 dB and over 4 dB are 8.35% and 0.12% respectively. If the distortion of the higher-frequency part is excluded, the ASD due to the quantization of the lowest 13 LSFs is 0.7296 dB, and the percentages of outliers between 2 and 4 dB and over 4 dB are 1.62% and 0.08% respectively. As the quantization of the lowest 13 LSFs almost satisfies the criterion of transparent quantization, this pattern can be used to achieve the desired operating bit rate of 8 kbps.
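The ASD and outlier figures quoted in this section can be computed along the following lines. This is a sketch under assumptions rather than the evaluation code used for Table 1: it takes LP coefficient vectors directly (the LSF-to-LPC conversion is not shown), evaluates the distortion over the full band on a 512-point frequency grid, and treats the 2-4 dB interval as inclusive. The function names are hypothetical.

```python
import numpy as np

def lpc_log_spectrum(a, n_fft=512):
    """Log power spectrum in dB of the all-pole filter 1/A(z), with a = [1, a1, ..., ap]."""
    A = np.fft.rfft(a, n_fft)
    return -20.0 * np.log10(np.abs(A))

def spectral_distortion(a_ref, a_quant, n_fft=512):
    """RMS difference in dB between the two LP log spectra for one frame."""
    diff = lpc_log_spectrum(a_ref, n_fft) - lpc_log_spectrum(a_quant, n_fft)
    return float(np.sqrt(np.mean(diff ** 2)))

def summarize(sd_values):
    """Average spectral distortion and outlier percentages over many frames."""
    sd = np.asarray(sd_values, dtype=float)
    return {
        "asd_db": float(sd.mean()),
        "pct_2_to_4": 100.0 * float(np.mean((sd >= 2.0) & (sd <= 4.0))),
        "pct_over_4": 100.0 * float(np.mean(sd > 4.0)),
    }

# Hypothetical first-order example: a small coefficient error gives a small distortion.
a_ref = np.array([1.0, -0.90])
a_quant = np.array([1.0, -0.88])
print(spectral_distortion(a_ref, a_quant))
print(summarize([0.5, 1.0, 3.0, 5.0]))
```

In an actual evaluation, `spectral_distortion` would be called once per frame with the unquantized and quantized 20th-order filters, and `summarize` would then produce the three numbers reported for each allocation pattern.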

• Fourier Magnitude Quantization

The ten Fourier magnitudes are quantized by a 10-bit vector quantizer. The distance measure used in the codebook search is the same as that in the MELP standard.

The bit allocation for a wideband MELP frame is shown in Table 2.

Parameter                          Bits
LSFs                               60 or 55
Fourier magnitudes                 10
Gain                               10
Pitch and lowest band voicing      9
Bandpass voicing                   4
Aperiodic flag and sync bit        2
Total bits per 11.25 ms frame      95 or 90
Bit rate                           8.4 or 8 kbps

Table 2. Bit allocation for wideband MELP
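The bit rates quoted for the three LSF allocation patterns follow directly from Table 2 and the 180-sample frame at 16 kHz. The short sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
SAMPLE_RATE_HZ = 16000
FRAME_SAMPLES = 180

# Bits per frame for everything except the LSFs, taken from Table 2.
SIDE_BITS = {
    "fourier_magnitudes": 10,
    "gain": 10,
    "pitch_and_lowest_band_voicing": 9,
    "bandpass_voicing": 4,
    "aperiodic_flag_and_sync": 2,
}

def bit_rate_kbps(lsf_bits):
    """Return (bits per frame, bit rate in kbps) for a given LSF bit allocation."""
    frame_bits = sum(lsf_bits) + sum(SIDE_BITS.values())
    frame_ms = 1000.0 * FRAME_SAMPLES / SAMPLE_RATE_HZ   # 11.25 ms
    return frame_bits, frame_bits / frame_ms              # bits per ms equals kbps

for allocation in [(10, 10, 10, 11, 9), (14, 13, 11, 11, 11), (14, 10, 10, 11, 10)]:
    bits, rate = bit_rate_kbps(allocation)
    print(allocation, bits, round(rate, 2))
```

The three patterns give 85, 95 and 90 bits per frame, which correspond to the 7.55, 8.4 and 8 kbps figures in the text.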

4. PERFORMANCE

We conducted subjective listening tests using the 90 and 95 bits/frame allocations. The tests were performed with 10 listeners, each of whom judged 16 sentence pairs from 8 male and 8 female speakers. The test sentences were selected from the test speech in the TIMIT database. The results are presented in Table 3.

Test                           Female    Male
G.722 at 48 kbps               48.75%    51.25%
Wideband MELP at 8.4 kbps      51.25%    48.75%

Test                           Female    Male
MPEG4 at 14.4 kbps             40%       42.5%
Wideband MELP at 8.4 kbps      60%       57.5%

Test                           Female    Male
MPEG4 at 14.4 kbps             43.75%    46.25%
Wideband MELP at 8 kbps        56.25%    53.75%

Table 3. Results of informal listening tests for wideband speech coding based on MELP

The test results show that the subjective quality of the synthesized speech obtained from the proposed 8.4 kbps coder is comparable to that of G.722 at 48 kbps. Compared with the MPEG4 CELP coder at 14.4 kbps, the subjective quality of both the male and the female speech obtained from our coding scheme is slightly better. However, as the pitch tracking method we applied is not yet optimized to correct pitch errors, some distortions can still be detected in a few male utterances. In addition, occasional v/uv decision errors during transitions also result in distortions.

5. CONCLUSION

A wideband speech coding scheme is proposed and its performance at bit rates of 8 and 8.4 kbps is presented. The subjective test results show that very good quality of decoded speech can be obtained at 0.5 bit/sample. The proposed 8 kbps coder performs slightly better than the 14.4 kbps MPEG4 CELP coder. When operating at the slightly higher bit rate of 8.4 kbps, it is also close in performance to G.722 at 48 kbps. Future work will focus on pitch estimation and the v/uv decision to achieve even better quality for the decoded speech.

6. REFERENCES

[1] E. Ordentlich and Y. Shoham, "Low-delay code-excited linear-predictive coding of wideband speech at 32 kbps", Proc. IEEE ICASSP, pp. 9-12, Toronto, Canada, May 1991.
[2] G. Roy and Peter Kabal, "Wideband CELP speech coding at 16 kbps", Proc. IEEE ICASSP, pp. 17-20, Toronto, Canada, May 1991.
[3] C. Laflamme, J-P. Adoul, R. Salami, S. Morissette and P. Mabilleau, "16 kbps wideband speech coding technique based on algebraic CELP", Proc. IEEE ICASSP, vol. 1, pp. 13-16, 1991.
[4] Juergen Paulus and Juergen Schnitzler, "16 kbit/s wideband speech coding based on unequal subbands", Proc. IEEE ICASSP, vol. 1, pp. 255-258, 1996.
[5] Anil Ubale and Allen Gersho, "A low-delay wideband speech coder at 24 kbps", Proc. IEEE ICASSP, vol. 1, pp. 165-168, 1998.
[6] Juergen Schnitzler, "13.0 kbit/s wideband speech codec based on SB-ACELP", Proc. IEEE ICASSP, vol. 1, pp. 157-160, 1998.
[7] Schuyler Quackenbush, "A 7 kHz bandwidth, 32 kbps speech coder for ISDN", Proc. IEEE ICASSP, vol. 1, pp. 1-4, 1991.
[8] Udaya Bhaskar, "Adaptive prediction with transform domain quantization for low rate audio coding", Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1991.
[9] R. Lefebvre, R. Salami, C. Laflamme and J-P. Adoul, "High quality coding of wideband audio signals using transform coded excitation (TCX)", Proc. IEEE ICASSP, vol. 1, pp. 193-196, Adelaide, Australia, April 1994.
[10] C. McElroy, B. Murray and A. D. Fagan, "Wideband speech coding in 7.2 kbit/s", Proc. IEEE ICASSP, vol. 2, pp. 620-623, 1993.
[11] Alan V. McCree and Thomas P. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, July 1995.
[12] Alan McCree, Kwan Truong, E. Bryan George, Thomas P. Barnwell and Vishu Viswanathan, "2.4 kbit/s MELP coder candidate for the new U.S. federal standard", Proc. IEEE ICASSP, vol. 1, pp. 200-203, 1996.
[13] Lynn M. Supplee, Ronald P. Cohn, John S. Collura and Alan V. McCree, "MELP: The new federal standard at 2400 bps", Proc. IEEE ICASSP, vol. 2, pp. 1591-1594, 1997.
[14] T. E. Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC-10", Speech Technology, pp. 40-49, April 1982.
[15] N. Abu-Shikhah and M. Deriche, "A novel pitch estimation technique using the Teager energy function", Proc. ISSPA, vol. 1, pp. 135-138, Brisbane, August 1999.
[16] Kuldip K. Paliwal and Bishnu S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame", IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 3-14, Jan. 1993.
[17] Chang-Qian Chen, Soo-Ngee Koh and Pratab Sivaprakasapillai, "A novel scheme for optimising partitioned VQ using a modified resolution measure", Signal Processing, vol. 44, pp. 233-241, 1995.
