IEICE TRANS. COMMUN., VOL.E83–B, NO.2 FEBRUARY 2000


LETTER

A Voice Activity Detection Algorithm for Wireless Communication Systems with Dynamically Varying Background Noise

Jae Won KIM†, Min Sik SEO†, Byung Sik YOON†, Song In CHOI†, and Young Gap YOU††, Nonmembers

SUMMARY Speech can be modeled as short bursts of vocal energy separated by silence gaps. During typical conversation, talkspurts comprise only 40% of each party's speech and the remaining 60% is silence. Communication systems can achieve spectral gain by disconnecting users from the spectral resource during silence periods. This letter develops a simple and efficient Voice Activity Detection (VAD) algorithm that works in a mobile environment exhibiting dynamically varying background noise. The VAD uses a classification method involving the full-band energy, the ratio of low-band energy to full-band energy, the zero-crossing rate, and a peakiness measure.

key words: voice activity detection, wireless communications

1. Introduction

All modern speech compression schemes for wireless communication use a VAD algorithm for further bit rate reduction. Since fewer bits are required to represent silence, the communication bit rate can be lowered, which frees capacity for other speech links in mobile telephony or for higher data rates in simultaneous voice/data multiplexing schemes. The role of the VAD is to distinguish between active speech and background noise. This is a well-known classification problem, but the variety and varying nature of both the active speech and the background noise make it quite complicated in practice. Whatever classification parameters are chosen, discriminant functions have to be selected and devised.

The objective of this letter is to design a simple and efficient, but robust, VAD algorithm for mobile and portable wireless communication systems, in order to achieve spectral efficiency through advanced multiplexing techniques that allow multiple sources to use the same radio channel at the same time. The VAD algorithm also provides power efficiency and a reduction in radiated emission through discontinuous transmission. The characteristics of speech signals and the problems of speech detection in dynamically changing background noise are studied to provide a better understanding of the proposed VAD. In this letter we design a VAD algorithm based on adaptive noise estimation and multi-boundary discrimination. The performance of the proposed VAD is verified in terms of detected activity and clipping rate, compared with those of the G.729B VAD [1].

Manuscript received May 29, 1999. Manuscript revised August 30, 1999.
† The authors are with the ETRI, 161 Kajong-Dong, Yusong-Gu, Taejon, 305-350 Korea.
†† The author is with Computer and Communication Engineering Department, Chungpook National University, Korea.

2. Characterizing Speech and Background Noise

A set of distinctive sounds called phonemes is used to describe a language. There are various methods for classifying phonemes. Phonemes can be grouped based on properties related to the time waveform, frequency characteristics, manner of articulation, and type of excitation. According to their mechanism of production, speech sounds can be classified into three distinct classes: voiced sounds, unvoiced sounds, and silences.

Voiced sounds are generated by forcing air through the glottis, the opening between the vocal folds. The tension of the vocal cords is adjusted so that they vibrate in an oscillatory fashion. Thus, voiced sounds show periodicity. All the vowels, including the semivowels and diphthongs, are voiced sounds. Unvoiced sounds are produced by forming a constriction at some point in the vocal tract and forcing air through the constriction at a high velocity to generate turbulence. Unlike voiced sounds, unvoiced sounds do not have any prominent periodic components. Silences are the parts of the speech sequence during which no sound is produced by the human speech production mechanism. In many ordinary environments, silences are literally silent periods, but in mobile and portable environments, silences in the speech sequence are dominated by background noise that varies dynamically in energy level and spectral characteristics.

In wireless telecommunications, the speech signal is corrupted by additive background acoustical noise, which can be divided into two categories based on the environment. The first category is background noise from an office-type environment. The office-type environment contains some, but not a lot of, background noise, and this noise varies slowly in both volume and statistical characteristics. The second category is dynamic background noise from portable or mobile environments. Noise of this category occurs frequently and is significantly less controlled; it exhibits a wide dynamic range and varies rapidly and unpredictably [2].

Fig. 1 The flow of the proposed VAD.

3. The Overview of the VAD Algorithm

The proposed VAD algorithm shown in Fig. 1 works on a frame-by-frame basis. The VAD accepts 10 msec frames of the input speech signal and analyzes each frame to indicate whether it actually contains speech. The digitized and quantized input signal is divided into 80-sample frames (corresponding to 10 msec) for processing. Each frame is first pre-processed to produce an offset-free signal. The full-band energy, the ratio of the energy below 1000 Hz to that of the entire frequency band (dc to 4000 Hz), the zero-crossing rate, and the peakiness measure of the LPC residual are then measured from the offset-free signal. A VAD decision is then made based on the values of the four parameters, and the frame is classified as speech or silence. The significance of each of the four parameters is discussed below.

The speech signal has high energy content in its voiced sounds and in certain of its unvoiced sounds, so measuring the energy level is a very basic and efficient tool for detecting silence gaps. However, in noisy uncontrolled environments, such as those encountered in mobile and portable communication systems, the energy level of the input signal alone does not give a perfect solution for speech classification.

The speech production system produces a set of formants determined primarily by vocal tract and nasal tract characteristics. The first formant frequencies of voiced sounds are located below 1000 Hz, and more energy is concentrated at the first formant than at any other. The majority of unvoiced sounds, however, display strong spectral concentration in the higher frequency range. The background noises display a uniform spectral distribution. It is therefore possible to distinguish between active speech and background noise by examining the distribution of energy along the frequency axis.

Detecting the zero-crossing rate from the offset-free speech samples is an efficient method to discriminate unvoiced sounds from voiced sounds and silence. The zero-crossing rate of a speech signal is detected in the time domain by multiplying the sign values of adjacent speech samples.

If a frame contains a few pulses that are considerably larger in absolute value than the remaining samples, the peakiness measure of the LPC residual is high; otherwise, it is low. Hence a large value of the peakiness measure occurs (1) for voiced speech, where the periodic pitch pulse can dominate the waveform, (2) at the start or end of a voiced segment, where a portion of the higher-energy voiced signal is in the same frame as the lower-energy unvoiced signal or silence, and (3) for unvoiced plosives, which are characterized by a burst of energy followed by a short silence. The general characteristics of speech and background noise used in this letter are summarized in Table 1.
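The framing and pre-processing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the 8 kHz sampling rate is inferred from 80 samples per 10 msec, and the offset is removed by subtracting the frame mean, which is an assumption because the letter does not specify the pre-processing filter.

```python
FRAME_SIZE = 80  # 80 samples = 10 msec, which implies 8 kHz sampling


def split_frames(samples, frame_size=FRAME_SIZE):
    """Split a sample sequence into consecutive non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size]
            for i in range(n_frames)]


def remove_offset(frame):
    """Return an offset-free copy of the frame.

    Assumption: the dc offset is estimated as the frame mean; the letter
    only states that each frame is pre-processed to be offset-free.
    """
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]
```

Each offset-free frame would then be passed to the four parameter measurements before the per-frame decision is taken.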

Table 1 General characteristics of speech/background noise.

4. A Detailed Description of the VAD Algorithm

The background noise can change considerably between different conversations as well as during a conversation, from a quiet room to a noisy street or a fast-moving car. Hence, an estimate of the varying characteristics of the background noise is required. A set of statistical parameters, namely the frame full-band energy $E_f(n)$, the ratio of low-band energy to full-band energy $R_{lb}(n)$, the zero-crossing rate $Z_x(n)$, and the peakiness measure $PM(n)$, is estimated and used by the VAD algorithm.

4.1 Full-Band Energy

The full-band energy, $E_f(n)$, is the logarithm of the normalized first autocorrelation coefficient $R(0)$:

$$E_f(n) = 10 \log_{10}\left[\frac{1}{10} R(0)\right] \tag{1}$$

According to an adaptive threshold $T_e$ based on the background noise level $B$, a silence flag, $f_{e-sil}$, is set as follows:

$$f_{e-sil} = \begin{cases} 1, & \text{if } E_f(n) < T_e(B) \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

4.2 Ratio of Low-Band Energy to Full-Band Energy

The low-band energy, measured on the band below 1 kHz, is computed as

$$E_l(n) = 10 \log_{10}\left[\frac{1}{N} h^{T} R h\right] \tag{3}$$

where $h$ is the impulse response of a 13-tap FIR filter with cutoff frequency at 1 kHz and $R$ is the $13 \times 13$ Toeplitz autocorrelation matrix with the autocorrelation coefficients on each diagonal. The low-band energy ratio for each frame is calculated as

$$R_{lb}(n) = E_l(n) / E_f(n) \tag{4}$$

A flag based on the low-band energy, $f_{lb}$, is set according to

$$f_{lb} = \begin{cases} 0, & \text{if } T_{lb1} \le R_{lb}(n) \le T_{lb2} \\ 1, & \text{otherwise} \end{cases} \tag{5}$$

where $T_{lb1}$ and $T_{lb2}$ are the constant thresholds for unvoiced and voiced speech, respectively.

4.3 Zero-Crossing Rate

The zero-crossing rate, $Z_x(n)$, can be found in the time domain by comparing the signs of adjacent speech samples. The $Z_x(n)$ of a sampled speech signal $X(n)$ is defined as

$$Z_x(n) = \frac{1}{2} \sum_{k=1}^{N} \left| \operatorname{sgn}[X(n-k)] - \operatorname{sgn}[X(n-k-1)] \right| \tag{6}$$

where $\operatorname{sgn}(X)$ is $1$ for $X > 0$ and $-1$ for $X < 0$. The two flags ($f_{z-vce}$, $f_{z-unv}$) for voiced and unvoiced speech are set according to

$$f_{z-vce} = \begin{cases} 1, & \text{if } Z_x(n) < T_{z1} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

$$f_{z-unv} = \begin{cases} 1, & \text{if } Z_x(n) < T_{z2} \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

where $T_{z1}$ and $T_{z2}$ are constant thresholds for voiced and unvoiced speech, respectively.

4.4 Peakiness Measure

The peakiness measure (PM) can be found from the LPC residual signal. The PM is given by

$$PM(n) = \frac{\sqrt{\frac{1}{N} \sum_{k=1}^{N} r^{2}(n+k)}}{\frac{1}{N} \sum_{k=1}^{N} |r(n+k)|} \tag{9}$$

where $r(n)$ is the LPC residual and $N$ is the frame size [3], [4]. The two flags, $f_{pm-l}$ and $f_{pm-u}$, are set according to

$$f_{pm-l} = \begin{cases} 1, & \text{if } T_{pm1} < PM(n) < T_{pm2} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

$$f_{pm-u} = \begin{cases} 1, & \text{if } PM(n) > T_{pm2} \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where $T_{pm1}$ and $T_{pm2}$ are the thresholds for unvoiced speech and for voiced speech (including plosive unvoiced speech), respectively.
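As an illustration of the time-domain parameters, the zero-crossing rate of Eq. (6) and the peakiness measure of Eq. (9) can be sketched in Python. Several points are assumptions rather than part of the letter: Eq. (9) is read here as the ratio of the RMS value to the mean absolute value of the LPC residual, as in [4]; the sign of a zero-valued sample is taken as positive; the sum in Eq. (6) is restricted to sample pairs inside one frame; and the function names and any threshold values passed in are hypothetical.

```python
import math


def sgn(x):
    # The letter defines sgn only for x > 0 and x < 0; treating zero as
    # positive is an assumption of this sketch.
    return 1 if x >= 0 else -1


def zero_crossing_rate(frame):
    """Eq. (6): half the summed absolute sign differences of adjacent samples.

    Only pairs inside the frame are used, so N - 1 pairs contribute.
    """
    return 0.5 * sum(abs(sgn(b) - sgn(a)) for a, b in zip(frame, frame[1:]))


def peakiness(residual):
    """Eq. (9), read as RMS value over mean absolute value of the residual."""
    n = len(residual)
    rms = math.sqrt(sum(r * r for r in residual) / n)
    mean_abs = sum(abs(r) for r in residual) / n
    if mean_abs == 0.0:
        return 0.0  # all-zero frame; returning 0 is an assumption
    return rms / mean_abs


def peakiness_flags(pm, t_pm1, t_pm2):
    """Eqs. (10) and (11): return (f_pm_l, f_pm_u) for given thresholds."""
    f_pm_l = 1 if t_pm1 < pm < t_pm2 else 0
    f_pm_u = 1 if pm > t_pm2 else 0
    return f_pm_l, f_pm_u
```

Under this reading a constant-magnitude residual gives a peakiness of 1, while a frame dominated by a single pulse gives a larger value, which matches the behaviour described for voiced frames and plosives.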

Fig. 2 Average activity dependent on SNR for three types of noise.

Fig. 3 Average clipping rate dependent on SNR for three types of noise.

4.5 VAD Decision

Using Eqs. (2), (5), (7), (8), (10), and (11), the VAD result of the proposed method is given by

$$VAD(n) = (!f_{e-sil}) \mid (f_{z-unv}) \mid (f_{pm-u}) \mid (f_{e-sil} * f_{lb} * f_{z-vce} * f_{pm-l}) \tag{12}$$

where '!', '|', and '*' denote the logical operators NOT, OR, and AND, respectively. A zero value of VAD(n) indicates an inactive speech frame and a non-zero value indicates an active speech frame.

4.6 Experiments and Discussion

To test the overall performance of the VAD algorithm, a telephone-bandwidth speech data base consisting of four Korean sentences with 1,145 frames and −12 dBov input speech level, spoken by 2 male and 2 female speakers, was used. Three different types of background noise (vehicular (car) noise, street noise, and babble noise) were added to produce a set of noisy data bases with SNRs ranging from 50 to 10 dB. To evaluate the performance of the VAD method, two objective measures are defined: 1) the activity is the percentage of frames that are declared as active speech by the VAD (about 47% for the clean-speech data base); 2) the clipping rate is the percentage of frames that are classified as inactive by the VAD but had previously been rated as active for clean speech.

Figures 2 and 3 show the detected activity and clipping rate of the proposed method and G.729B as a function of SNR, averaged over the three types of noise. Above 30 dB, both algorithms have similar performance. However, G.729B has higher detected activity at 20 dB and a higher clipping rate at 10 dB than the proposed algorithm.

Fig. 4 Activity dependent on SNR for vehicular noise.

Fig. 5 Clipping dependent on SNR for vehicular noise.

Figures 4 and 5 show the results as a function of SNR for vehicular noise. In the high-SNR region, both algorithms have similar and good performance, but the proposed method shows better results in activity and clipping rate than G.729B at 20 dB. At 10 dB, G.729B has lower activity but a higher clipping rate than the proposed method. The proposed method also shows lower fluctuation in activity and clipping rate than G.729B in the vehicular noise environment.

Table 2 Activity/clipping for three types of noise (SNR = 20 dB).

Table 3 Activity/clipping for three types of noise (SNR = 10 dB).

Table 2 shows the comparative results for the three types of noise at 20 dB SNR. G.729B has 10% higher detected activity than the proposed method at the same clipping rate. Both algorithms exhibit a moderate clipping rate in the three noise environments. Table 3 shows the results for each noise at 10 dB SNR. G.729B has a clipping rate 1.6% higher than that of the proposed method at nearly the same activity. The worst clipping rate occurred in street noise for the proposed method and in babble noise for G.729B. Whereas the clipping rate of G.729B increases strongly in highly noisy environments, that of the proposed method increases only moderately at nearly the same activity. While both algorithms perform well on high-SNR speech samples, the performance of the G.729B VAD degrades more severely than that of the proposed method at low SNRs.

The complexity of the algorithm is also an essential factor. Because of the division and squared-sum operations in the peakiness measure, the proposed algorithm has slightly higher complexity than the G.729B VAD. The increase is not significant, however, so the proposed method can work in a mobile or portable environment exhibiting dynamically varying background noise.
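The decision rule of Eq. (12) and the two objective measures used in this section can be sketched together. This sketch reads Eq. (12) as the logical OR of four terms, the last being the AND of four flags. It also assumes that the clipping rate is expressed as a percentage of all frames, since the letter does not state the denominator explicitly. The function names are hypothetical.

```python
def vad_decision(f_e_sil, f_lb, f_z_vce, f_z_unv, f_pm_l, f_pm_u):
    """Eq. (12): combine the six flags into one per-frame decision (1 = active)."""
    active = ((not f_e_sil) or f_z_unv or f_pm_u
              or (f_e_sil and f_lb and f_z_vce and f_pm_l))
    return 1 if active else 0


def activity(decisions):
    """Percentage of frames declared active by the VAD."""
    return 100.0 * sum(1 for d in decisions if d) / len(decisions)


def clipping_rate(decisions, clean_reference):
    """Percentage of frames declared inactive although the clean-speech
    reference marks them active.

    Assumption: the percentage is taken over all frames.
    """
    clipped = sum(1 for d, ref in zip(decisions, clean_reference)
                  if ref and not d)
    return 100.0 * clipped / len(decisions)
```

Here `clean_reference` stands for the per-frame decisions obtained on the clean-speech data base, against which the decisions for each noisy condition would be compared.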

5. Conclusions

This letter presents a voice activity detection algorithm based on a decision function involving four parameters: the full-band energy, the ratio of low-band energy to full-band energy, the zero-crossing rate, and a new peakiness measure of the LPC residual. The performance of the proposed method is measured objectively by the percentage of active speech duration and the degree of clipping, and compared with the results for clean speech. The proposed method performs better than G.729B at low SNRs below 20 dB. Despite its small additional complexity, the proposed VAD algorithm can be practical for mobile and portable wireless systems because it performs well at low SNR and is simple enough to be implemented using a portion of a single DSP.

References

[1] ITU-T: Draft Recommendation G.729, Annex B: Voice Activity Detection, 1996.
[2] R.A. Goubran and H.M. Hafez, "Background acoustic noise reduction in mobile telephony," Proc. 36th IEEE. Soc. Conf. On Vehicular Technology, pp.72–76, May 1986.
[3] D.L. Thomson and D.P. Prezas, "Selective modeling of the LPC residual during unvoiced frames: White noise or pulse excitation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3087–3090, Tokyo, 1986.
[4] A.V. McCree and T.P. Barnwell III, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech & Audio Processing, vol.3, no.4, pp.242–250, July 1995.
