A Fundamental Frequency Extraction Method ... - Semantic Scholar

Applied Mathematics in Electrical and Computer Engineering

A Fundamental Frequency Extraction Method Based on Windowless and Normalized Autocorrelation Functions M. A. F. M. RASHIDUL HASAN Saitama University Graduate School of Science and Engineering 255 Shimo-okubo,Sakura-ku,Saitama, 338-8570 JAPAN [email protected]

TETSUYA SHIMAMURA Saitama University Graduate School of Science and Engineering 255 Shimo-okubo,Sakura-ku,Saitama, 338-8570 JAPAN [email protected]

Abstract: This paper presents a fundamental frequency estimation algorithm of noisy speech signal using the combination of windowless and normalized autocorrelation functions. Instead of the input speech signal, we employ its windowless autocorrelation function for obtaining the normalized autocorrelation function. The windowless autocorrelation function is a noise-reduced version of the input speech signal where the periodicity is more apparent with enhanced pitch peak. Experimental results on male and female voices in white noise indicate that the proposed method sufficiently outperforms existing methods in terms of gross pitch error. Key–Words: Pitch extraction, Normalized autocorrelation function, Windowless autocorrelation function, White noise

1

Introduction

an accurate estimation method for fundamental frequency is, thus, desired to handle this environment. However, the ACF produces extraction errors in the fundamental frequency and the error rate is greatly influenced by the vocal tract characteristics [5].

Fundamental frequency or pitch estimation of speech signal is used in various important application areas such as automatic speech recognition, speaker identification, low-bit rate audio coding, speech enhancement using harmonic model etc. Recently many fundamental frequency estimation algorithms have been proposed but accurate and efficient fundamental frequency estimation is still a challenging task [1] [2]. The speech signal is not always strongly periodic and the instantaneous frequency varies within each frame. Also, the presence of noise generates a degraded performance of fundamental frequency extraction algorithms. Numerous methods have been proposed in the literature to address this problem. In general, they can be categorized into three classes: time-domain, frequency-domain, and time-frequency domain algorithms. Due to the extreme importance of the problem, the strength of different methods has been explored [3]. Time-domain methods operate directly on the signal temporal structure. These include, but are not limited to, zero-crossing rate, peak and valley positions, and autocorrelation. The autocorrelation model appears to be one of the most popular methods for its simplicity and explanatory power. The autocorrelation function (ACF) method [4] is tunable in random noise and is the most powerful method particularly in a white noise environment. A white noisy environment is often seen in communication systems, and

ISBN: 978-1-61804-064-0

Among many other improvements reported on the ACF method, Markel [6] and Itakura et al. [7] utilized auto-regressive (AR) inverse filtering to flatten the signal spectrum. This AR preprocessing step has effects on emphasizing the true period peaks in ACF. However, for high-pitched speech or in white Gaussian noise, the process of AR estimation is itself erroneous. Shimamura et al. [8] weighted the ACF, resulting in the weighted autocorrelation (WAC) method, which can estimate fundamental frequency at a low signal to noise ratio (SNR). However, the major shortcoming of the WAC is that the possibility of nonexistence of spurious peaks cannot be guaranteed in strong noisy conditions. Talkin [9] proposed a normalized cross correlation based method that produces better results in fundamental frequency detection than the ACF as the peaks are more prominent and less affected by rapid variations in the signal amplitude. Normalized ACF (NACF) based technique is introduced in [10] with higher fundamental frequency estimation accuracy than the simple ACF. A noticeable improvement of the NACF based method is achieved by a signal reshaping technique in which the enhancement of specific harmonic is performed [11]. The dominant harmonic of the noisy speech signal is determined by us-

305


ing discrete Fourier transform and boosting the amplitude of dominant harmonic in the analyzing signal. The method is termed here as dominant harmonic enhancement (DHE). In the DHE method, there may occur the shifting of fundamental frequency peak due to the noise effects, and the presence of higher frequency harmonics introduces some errors. In this paper we propose an efficient fundamental frequency estimation technique that utilizes the windowless ACF of the speech signal instead of the speech signal itself for computing the NACF. The windowless ACF of the speech signal is a noise compensated equivalent of the speech signal in terms of periodicity which improves SNR greater than 10 dB [12]. Then, application of the NACF method on the SNR improved speech signal provides better fundamental frequency determination. Experimental results on male and female voices in white noise show that the occurence probability of fundamental frequency errors becomes lower using the proposed windowless autocorrelation based NACF method when compared with other methods.

2

4000 2000

−2000

Amplitude

−4000

ai cos(2πif0 n + θi )

200

(1) N ACF (τ ) = √

400 600 Time (samples)

800

1000

1 e0 eτ

N −τ X−1

s(n)s(n + τ )

(4)

s2 (n), 0 ≤ τ ≤ L − 1.

(5)

n=0

where

(2)

τ +N −L−1 X n=τ

As reported in [9], this method is better suited for pitch period estimation than the standard ACF as the peaks are more prominent and less affected by rapid variations in the signal amplitude. Nevertheless, the largest peak in ACF still occurs at double or half the correct lag value or some other incorrect values, giving rise to some errors. In this paper, we propose a modified NACF method that utilizes the windowless ACF instead of the speech signal itself. Experimental results suggest that the proposed method can be effective against the presence of additive noise.

(3)

The Rss (τ ) exhibits local maxima at nT0 and provides pitch period candidates. The main advantage of this method is its noise immunity. However, effect of formant structure can result in the loss of a clear peak in Rss (τ ) at the true pitch period. The second difficulty is that the peak estimates vary as a function of the lag index τ , since the summation interval shrinks as τ increases. This compromises its noise immunity and estimation accuracy when the peak is at a longer

ISBN: 978-1-61804-064-0

0

lag (that corresponds to a lower pitch (higher fundamental frequency) case). Methods have been proposed to improve the pitch period extraction by emphasizing the true peak in ACF [4-11]. A modification to the basic autocorrelation is the normalized ACF [9] of the signal s(n), 0 ≤ n ≤ N − 1, that is computed as

n=0

∞ 1X a2 cos(2πf0 nτ ) 2 n=0 n

1000

Figure 1: The windowless ACF of speech signal: (a) Speech signal (b) Windowless ACF.

for s(m), m = 0, 1, 2,..., N − 1. By using Eq. (1), Eq. (2) can be expressed for a very long data segment approximately as Rss (τ ) =

800

x 10

eτ = s(n)s(n + τ )

600

5

10

−5

where f0 = 1/T0 is the fundamental frequency and T0 is the pitch period. The ACF is a popular measure for pitch period that can be expressed as 1 Rss (τ ) = N

400

(b)

i=0

N −1 X

200

0

The voiced speech can be expressed as a periodic signal s(n) as follows: ∞ X

0

5

Background Information

s(n) =

(a)

0

3

Proposed Method

According to the signal in Eq.(1) and the ACF in Eq.(3), clearly the periodicity of s(n) and that of

306


Noisy signal 5000 (a)

0 −5000

0

50

100

150

5

2

x 10

200 250 300 Time (samples)

350

400

450

500

Failure of peak detection

WAC

(b)

1

Amplitude

0

0

50

100

0.5

150

200


NACF

can arise for a speech segment with relatively weaker periodicity. The Rxx (τ ) can be enhanced in terms of periodicity by defining it in a windowless condition as exploited in [12], where the signal outside the window is not considered as zero as shown in Fig. 1(a). Thus the number of additions in the averaging process is always common. This results in almost similar amplitude correlation peaks even as the lag number increases. The windowless ACF can be defined for the noisy signal x(n) as

(c) 0

0

50

100

150

1


DHE

0.5 0

0

1.5 1

50

100

Rxw (τ ) =

200

150

(d) 200

Accurate peak detection

PROPOSED

(e)

0.5 0

0

50

100 Lag (samples)

150

200

Figure 2: Pitch peak using different methods for a voice frame of a female speaker at SNR= −5 dB: (a) Noisy signal (b) WAC (c) NACF (d) DHE (e) PROPOSED. The true pitch period is 44 in samples Rss (τ ) are similar. Since the autocorrelation of a signal is obtained by an averaging process, it can be treated as a noise-compensated version of the speech segment in terms of periodicity. This can be shown as follows. When s(n) is corrupted by additive noise v(n), the noisy signal is given by x(n) = s(n) + v(n)

(6)

When v(n) is white Gaussian uncorrelated with s(n), Eq. (3) can be written as Rxx (τ ) =

(

Rss (τ ) + σv2 for τ = 0, Rss (τ ) for τ 6= 0

(8)

for x(m), m = 0, 1, 2, . . . , 2N -1. In this case, an N length sequence of Rxw (τ ), τ = 0, 1, 2, . . . , N -1 is obtained. For the ACF in Eq. (2), when (n + τ ) > N, s(n + τ ) becomes zero. However, in Eq. (8), x(n + τ ) is not zero outside N. This modification makes Rxw (τ ) more stronger in periodicity with emphasized peaks as seen in Fig. 1(b). Suzuki [12] demostrated that the use of autocorrelation domain signal (as expressed in Eq. (7)) improves the SNR greater than 10 dB. The main concern in [12] was the distortion introduced due to the change of amplitude (i.e., a2n /2 instead of an ). This is, however, completely irrelevant in pitch period estimation. The second concern in [12] was the exclussion of zero-lag since it includes the noise component. This exclusion might be useful for spectral estimation as described in [13]. However, for pitch period estimation, the exclusion of zero-lag or lower lags somewhat hampers the periodicity. Thus, Rxw (τ ), τ = 0, 1, 2, ..., N − 1, results in a noisecompensated version of the speech signal with strong periodic waveform. By using Eq. (8), Eq. (4) can be expressed as N ACFw (τ ) = √

1 e0w eτ w

N −τ X−1

Rxw (n)Rxw (n + τ )

n=0

(9)

where

(7)

eτ w =

σv2

where is the noise variance of v(n). According to Eq. (7), only the first lag is affected by the noise presence. In this paper, we aim to utilize Rxx (τ ) as the input signal with modification for pitch period estimation. The modification is performed because Rxx (τ ) is computed using a finite length of speech segment. As the lag number increases, there is less data involved in the computation, leading to reduction in amplitude of the correlation peaks. As mentioned in Section 2, it compromises the accuracy when the true peak occurs at a long lag. The similar problem

ISBN: 978-1-61804-064-0

−1 1 NX x(n)x(n + τ ) N n=0

τ +N −L−1 X n=τ

2 Rxw (n), 0 ≤ τ ≤ L − 1.

(10)

To demonstrate that the use of the windowless ACF signal enhances the pitch peak, we present a female voiced frame and its WAC and NACF (positive part only) in Fig. 2. It can be observed that using the WAC and NACF of s(n) pitch period can be estimated only with double pitch error. In both WAC and NACF, the amplitude of the pitch peaks are smaller than the peaks at double pitch location. It is assumed that the application of the DHE emphasize only the amplitude of

307


Table 1: Conditions of experiments

Average no. of GPE (%)

20

15

WAC NACF DHE PROPOSED

Sampling frequency Band limitation Window function Window size Frame shift Number of FFT points SNRs (dB)

10

5

0 Clean

20

15

10 SNR (dB)

5

0

−5

4 Figure 3: Average performance results in terms of percentage of gross pitch error for male speakers in white noise at various SNR conditions.

Average no. of GPE (%)

Experimental Results

To assess the proposed method, natural speeches spoken by three Japanese female and three male speakers are examined. Speech materials are 11 sec-long sentences spoken by every speaker sampled at 10 kHz rate, which are taken from NTT database [14]. The reference file of the fundamental frequency is constructed by computing the fundamental frequency every 10 ms using a semi-automatic technique based on visual inspection. The simulations were performed after adding additive noise to these speech signals. The evaluation of accuracy of the extracted fundamental frequency is carried out by using

20

15

10 kHz 3.4 kHz Rectangular 51.2 ms (N=512) 10 ms 2048 ∞, 20, 15, 10, 5, 0, −5

WAC NACF DHE PROPOSED

10

e(l) = Ftrue (l) − Fˆ (l) 5

0 Clean

20

15

10 SNR (dB)

5

0

where Ftrue (l) is the true fundamental frequency, Fˆ (l) is the extracted fundamental frequency by each method, and e(l) is the extraction error for the l-th frame. If |e(l)| > 20%, we recognized the error as a gross pitch error (GPE) [11]. The possible sources of the GPE are pitch doubling, halving and inadequate suppression of formants to affect the estimation. For the GPE, proportion of the errors was calculated. The experimental conditions are tabulated in Table 1. We attempt to extract the pitch information of clean and noisy speech signals. All the candidate algorithms are applied in additive white Gaussian noise. The noise is taken from the Japanese Electronic Industry Development Association (JEIDA) Japanese Common Speech Corp. The performance of the proposed method is compared with a well-known weighted autocorrelation method, WAC [8], normalized ACF based method, NACF [9] (according to Eq. (4)), and dominant harmonic enhancement based method, DHE [11]. For the implementation of the DHE, the parameter α in [11] is set to 0.5. For the NACF, DHE, and the proposed method, the setting of L = 200 is commonly used. Pitch estimation error in percentage, which is the average of GPEs for male and female speakers, are shown in Figs. 3 and 4, respectively. The performance

−5

Figure 4: Average performance results in terms of percentage of gross pitch error for female speakers in white noise at various SNR conditions.

the dominant harmonic of the prefiltered speech signal [11]. However, the amplitude of the other harmonics may also be emphasized based on their relative phases. That is why the performance of fundamental frequency detection using the DHE method often degrades especially for low SNR speech signals. In Fig. 2(d), a pitch error occured in the DHE. On the contrary, in the NACF of Rxw (τ ) (Eq. (9)), the amplitude of the true pitch peak is enhanced enabling accurate estimation of pitch period (Fig. 2(e)). It is, therefore, worth using the windowless ACF signal for reducing the pitch errors.

ISBN: 978-1-61804-064-0

(11)

308


of the WAC and NACF methods provides slightly better results than the other two methods upto SNR = 5 dB for male case, but in all other SNR conditions for both male and female cases their performances are not satisfactory. For male and female in higher noisy cases the DHE method provides better results compared with the WAC and NACF methods. On the contrary, the proposed method gives far better results for both male and female cases in different types of SNR conditions. These experimental results show that the proposed method is superior to the three other methods in almost all cases. Particularly, in lower SNR cases (0 dB, −5 dB), the proposed method performs more robustly compared with the other methods.

5

[7] F. Itakura and S. Saito, Speech information compression based on the maximum likelihood spectral estimation, Journal of Acoustical Society of Japan, Vol. 27, No. 9, pp. 463 –472, 1971. [8] T. Shimamura and H. Kobayashi, Weighted autocorrelation for pitch extraction of noisy speech, IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 7, pp. 727 –730, 2001. [9] D. Talkin, A robust algorithm for pitch tracking (RAPT), In: Speech Coding and Synthesis edited by W. B. Kleijn and K. K. Paliwal, Elsevier, 1995. [10] K. Kasi and S. A. Zahorian, Yet another algorithm for pitch tracking, Proc. ICASSP, pp. 361 –364, 2002. [11] M. K. Hasan, S. Hussain, M. T. Hossain and M. N. Nazrul, Signal reshaping using dominant harmonic for pitch estimation of noisy speech, Signal Processing, Vol. 86, pp. 1010 –1018, 2006. [12] J. Suzuki, Speech processing by splicing of autocorrelation function, Proc. ICASSP, pp. 713 – 716, 1976. [13] B. J. Shannon and K. K. Paliwal, Feature extraction from higher-lag autocorrelation coefficients for robust speech recognition, Speech Communication, Vol. 48, pp. 1458 –1485, 2006. [14] Multilingual Speech Database for Telephometry, NTT Advance Technology Corp., Japan, 1994.

Conclusion

In this paper, a modified NACF method utilizing the windowless ACF has been introduced, which leads to robustness against additive noise. Simulation results indicate that the proposed method provides better performance in terms of GPE (in percentage) compared with the existing methods such as WAC, NACF and DHE for a wide range of SNR varying from −5 dB to ∞ dB. Especially the performance of the proposed method in low SNR cases is noticeable higher than that of the WAC, NACF and DHE based methods. These results suggest that the proposed method can be a suitable candidate for fundamental frequency extraction from noisy speech. References: [1] W. Hess, Pitch Determination of Speech Signals, Springer - Verlag, 1983. [2] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, First Edition, Prentice Hall, 2010. [3] P. Veprek and M. S. Scordilis, Analysis, enhancement and evaluation of five pitch determination techniques, Speech Communication, Vol. 37, pp. 249 –270, 2002. [4] L. R. Rabiner, On the use of autocorrelation analysis for pitch detection, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 1, pp. 24 –33, 1977. [5] W. J. Hess, Pitch and voicing determination, In: Advances in Speech Signal Processing edited by S. Furui and M. M. Sondhi, Marcel Dekker, 1972. [6] J. D. Markel, The SIFT algorithm for fundamental frequency estimation, IEEE Trans. on Audio and Electroacoustics, Vol. AU-20, No. 5, pp. 367 –377, 1972. ISBN: 978-1-61804-064-0

309