Proceedings of the 38th Southeastern Symposium on System Theory Tennessee Technological University Cookeville, TN, USA, March 5-7, 2006
Wavelet Processing for Pitch Period Estimation

Shonda L. Bernadin
Electrical and Computer Engineering, Georgia Southern University, Statesboro, GA 30460 USA
[email protected]

Simon Y. Foo
Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering, Tallahassee, Florida 32310 USA
[email protected]
ABSTRACT
This paper studies different approaches to pitch period estimation using wavelet processing. Pitch period estimation, the detection of the fundamental frequency in a speech segment, is an integral part of most speech processing applications and can be used in speech feature extraction to analyze and characterize speech. Much of our previous work uses autocorrelation methods for frame-based pitch period estimation. The results are adequate for signal analysis and speech recognition; however, they are not always reliable. It is essential, especially in speech synthesis applications, that high-precision pitch period estimation algorithms be designed in order to achieve high-quality synthesized speech. The application of wavelets to traditional pitch period estimation algorithms may provide insight into the behavior of speech and thus provide a more robust foundation for speech synthesis applications.

Keywords: wavelet transforms, pitch detection, pitch period estimation, pitch-marking

INTRODUCTION
Conventional approaches to pitch period estimation (PPE) fall into two categories: short-term algorithms, which find the average fundamental frequency per frame using autocorrelation or linear prediction techniques, and pitch-marking algorithms, which focus on time-domain analysis and detection of the glottal closure instant (GCI). Short-term algorithms must assume that the pitch is relatively constant over a frame of speech. This assumption does not always hold, since speech is usually classified as a nonstationary signal; with short-term pitch detection, speech segments must therefore be small enough to capture a relatively stationary portion of speech. These algorithms are less susceptible to noise variations and more robust, but they produce a less accurate estimate of the pitch. Pitch-marking, or pitch-synchronous, algorithms use the glottal closure instant to detect pitch periods. The GCI is an important moment in the biological process of human speech production. As air moves through the vocal tract, a tubular structure with the lips at one end and the larynx at the other, different types of sounds are produced, such as the voiced, unvoiced, and nasal components of speech. The vocal cords are two folds of tissue in the larynx that vibrate as air is forced through the slit, or glottis, between them; voiced sounds are produced as air passes through the glottis, and the instant at which the glottis closes is the glottal closure instant.

Pitch-marking algorithms are based on this natural biological process and can therefore produce more accurate representations of the fundamental frequency. However, the GCI can be masked in complex speech, which can make its detection difficult. Although these algorithms are accurate because they model a human process, they are usually less robust than short-term pitch period estimation algorithms. Both short-term and pitch-marking algorithms are amenable to wavelet processing [1,2]. The nonstationary nature of speech requires small frame lengths for short-term pitch detection algorithms; wavelets compensate for this nonstationarity by providing variable-length techniques for segmentation and analysis. Pitch-marking algorithms can use wavelet analysis to detect local maxima in the speech signal, which helps determine the glottal closure instant. In addition, some researchers suggest that optimization techniques can be applied to pitch period algorithms if pitch determination is treated as an optimization problem [3,4]. Estimating pitch with optimization methods allows a variety of optimization techniques to be used, which may be explored in future research. This paper is divided into pitch period estimation algorithm development and performance evaluations. In addition,
several conclusions will be drawn based on these observations.
PITCH PERIOD ESTIMATION ALGORITHMS
Short-term pitch period estimation algorithms are typically developed using autocorrelation methods, whereas pitch-marking algorithms use probabilistic methods to determine the optimal pitch. The mathematical description of the autocorrelation function is given by equation (1),

R_n(k) = \sum_{m=0}^{N-1-k} [w(m)\,x(n+m)]\,[w(k+m)\,x(n+k+m)]    (1)
where N is the length of a rectangular window w(n), k is the lag, and x(n) is the input speech. The optimal pitch is determined by the maximum value of the energy components within each frame. By applying wavelet techniques to conventional methods, variations of the autocorrelation PPE algorithm were developed. In these simulations the Morlet, Haar, and Mexican Hat wavelets were used as the mother wavelets, and their accuracy in representing speech signals was compared. Previous work has suggested that the Morlet wavelet is best suited for speech [5]. Its mathematical description is given in equation (2), and its effectively compact shape is illustrated in Figure 1 with D = 5.

W(x) = \cos(Dx)\,\exp(-x^2/2)    (2)
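For illustration, equation (2) can be evaluated directly. The short Python sketch below reproduces the shape plotted in Figure 1 for D = 5; the evaluation grid and plotting choices are ours and are not taken from the original simulations.

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the Morlet-like wavelet of equation (2) on an arbitrary grid.
D = 5.0
x = np.linspace(-4.0, 4.0, 1000)   # display range only; the wavelet decays rapidly
w = np.cos(D * x) * np.exp(-x ** 2 / 2.0)

plt.plot(x, w)
plt.title('Morlet-like wavelet of equation (2), D = 5')
plt.xlabel('x')
plt.show()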
Figure 1 - Plot of the Morlet wavelet (D = 5)
It is expected that the Haar wavelet will provide the least accurate representation due to its relative simplicity. The Mexican Hat wavelet was chosen for comparison purposes.
The algorithms presented in Figure 2 illustrate how wavelets were incorporated into autocorrelation-based pitch estimation. In Figure 2(a), the conventional autocorrelation method is used to determine the pitch, or fundamental frequency, in each frame. Segmentation and frame analysis are common steps in all three algorithms. Segmentation isolates the voiced/unvoiced components of the input by neglecting the silence durations. For example, the pre-recorded input file 'a.wav' is illustrated in the top plot in Figure 4, and the second plot illustrates the segmented signal. The frame analysis phase divides the segmented signal into 10-millisecond frames for further analysis, and the autocorrelation function and pitch period were then determined for each frame. The algorithm in Figure 2(b) implements the crosscorrelation between the input speech and the Morlet wavelet basis function. Figure 2(c) illustrates an algorithm that applies the wavelet transform to the segmented speech, reconstructs the transformed signal, and then performs autocorrelation to identify the pitch.
Figure 2 - Short-term PPE algorithms: (a) Autocorrelation method, (b) Crosscorrelation with wavelet basis function, and (c) Autocorrelation of wavelet-transformed input speech
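As a concrete illustration of the short-term approach of Figure 2(a), the following Python/NumPy sketch segments a signal, splits it into 10 ms rectangular-windowed frames, and reads the pitch from the autocorrelation peak of each frame, in the spirit of equation (1). The silence threshold, pitch search range, and the 22050 Hz sampling-rate default are illustrative assumptions and are not taken from the paper.

import numpy as np

def remove_silence(x, frame_len, threshold_ratio=0.01):
    # Crude segmentation: keep only frames whose energy exceeds a fraction of the
    # maximum frame energy (the threshold is illustrative, not from the paper).
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]
    if not frames:
        return x
    energies = np.array([np.sum(f ** 2) for f in frames])
    keep = energies > threshold_ratio * np.max(energies)
    return np.concatenate([f for f, k in zip(frames, keep) if k]) if keep.any() else x

def autocorr_pitch(frame, fs, fmin=75.0, fmax=400.0):
    # Short-time autocorrelation of one rectangular-windowed frame (cf. equation (1));
    # the pitch is taken from the strongest peak inside a plausible lag range.
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(r) - 1)
    if lag_max <= lag_min or r[0] <= 0:
        return 0.0                      # treat as unvoiced or degenerate
    lag = lag_min + int(np.argmax(r[lag_min:lag_max]))
    return fs / lag                     # fundamental frequency estimate in Hz

def pitch_contour(x, fs=22050, frame_ms=10.0):
    # Segment, split into 10 ms frames, and estimate one F0 value per frame.
    n = int(fs * frame_ms / 1000.0)
    voiced = remove_silence(np.asarray(x, dtype=float), n)
    frames = [voiced[i:i + n] for i in range(0, len(voiced) - n + 1, n)]
    return np.array([autocorr_pitch(f, fs) for f in frames])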
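The wavelet-based variants of Figures 2(b) and 2(c) can be sketched in the same style. The Morlet basis below is sampled directly from equation (2); PyWavelets is assumed for the transform and reconstruction step of Figure 2(c), since the paper does not name its wavelet implementation, and the Haar wavelet stands in there because the Morlet wavelet has no discrete filter bank. The peak-spacing rule in crosscorr_pitch is one plausible reading of the block diagram rather than the authors' exact procedure, and autocorr_pitch is reused from the previous sketch.

import numpy as np
import pywt                         # assumed toolbox, not named in the paper
from scipy.signal import find_peaks

def morlet_basis(D=5.0, fs=22050, duration_ms=10.0):
    # Sampled Morlet-like basis from equation (2) stretched over one analysis frame;
    # the support of +/-4 and the frame-length duration are assumptions.
    n = int(fs * duration_ms / 1000.0)
    x = np.linspace(-4.0, 4.0, n)
    return np.cos(D * x) * np.exp(-x ** 2 / 2.0)

def crosscorr_pitch(frame, fs, basis, fmin=75.0, fmax=400.0):
    # Figure 2(b): crosscorrelate the frame with the wavelet basis and read the
    # pitch period from the median spacing of the dominant peaks.
    xc = np.correlate(frame - np.mean(frame), basis, mode='same')
    peaks, _ = find_peaks(xc, distance=int(fs / fmax))
    if len(peaks) < 2:
        return 0.0
    f0 = fs / float(np.median(np.diff(peaks)))
    return f0 if fmin <= f0 <= fmax else 0.0

def wavelet_autocorr_pitch(frame, fs, wavelet='haar', level=3):
    # Figure 2(c): discrete wavelet transform, reconstruction, then the same
    # autocorrelation peak picking (autocorr_pitch as defined in the previous sketch).
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    rec = pywt.waverec(coeffs, wavelet)[:len(frame)]
    return autocorr_pitch(rec, fs)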
Pitch-marking algorithms for pitch period estimation were also considered in this investigation. Pitch-marking algorithms seek to identify the glottal closure instant (GCI) of speech sounds. The maximum energy in a speech signal is analyzed, and the instant at which it occurs (the GCI) marks the presence of the fundamental frequency in the signal; the pitch can be identified from this information. Figure 3 illustrates the pitch-marking algorithms that were developed. As illustrated, segmentation and frame analysis are still part of these algorithms. The local maximum was determined for each frame, and the GCI pitch was calculated from this instant. For the wavelet version, the input speech was wavelet-transformed using different mother wavelets before GCI pitch detection was performed.

Figure 3 - (a) Pitch-marking algorithm to identify the fundamental frequency using the GCI; (b) Wavelet-transformed pitch-marking algorithm
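A rough sketch of the pitch-marking idea in Figure 3 is given below. It treats prominent local maxima of the squared signal as glottal closure instants and derives the fundamental frequency from the spacing between consecutive marks; the energy threshold and admissible pitch range are illustrative assumptions rather than the authors' exact procedure. For the wavelet version of Figure 3(b), the input would first be replaced by its wavelet-transformed and reconstructed counterpart, as in the earlier sketch.

import numpy as np
from scipy.signal import find_peaks

def gci_pitch(x, fs, fmin=75.0, fmax=400.0):
    # Mark candidate glottal closure instants at prominent local maxima of the
    # energy-like squared signal, then convert GCI spacings into F0 estimates.
    energy = np.asarray(x, dtype=float) ** 2
    min_distance = int(fs / fmax)            # closest admissible spacing of GCIs
    height = 0.1 * np.max(energy)            # illustrative energy threshold
    marks, _ = find_peaks(energy, distance=min_distance, height=height)
    periods = np.diff(marks).astype(float)
    valid = (periods >= fs / fmax) & (periods <= fs / fmin)
    f0 = fs / periods[valid]                 # one F0 estimate per valid GCI pair
    return marks, f0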
PERFORMANCE EVALUATIONS
The database chosen for these simulations consisted of vowel sounds, spoken by a female speaker and sampled at 22.1 kHz. Figure 4 illustrates the wavelet-transformed input speech segment for the vowel sound 'a'. The speech sound is segmented to isolate only the voiced components in the signal. For the wavelet transform algorithms, the segmented wavelet-transformed speech is used.

Figure 3 - Segmented input speech

Figure 4 - Segmented wavelet-transformed input speech
Using the algorithms described above, the pitch was estimated for each frame. The resulting pitch contours were compared to determine the relative accuracy of each algorithm in identifying the pitch. The autocorrelation algorithm was used as the benchmark because it is the conventional approach to pitch period estimation. Figure 5 shows the tabular error calculations between the autocorrelation algorithm and the other four PPE algorithms, and Figure 6 illustrates the pitch contours for the five algorithms using the Mexican Hat wavelet.
Figure 5 - Maximum error calculations for the 'e' sound, where E1 denotes the autocorrelation method, E2 the crosscorrelation method, E3 the wavelet-transform autocorrelation method, E4 GCI pitch-marking, and E5 GCI wavelet-transformed pitch-marking
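The error values in Figure 5 can be thought of as per-frame deviations from the autocorrelation benchmark. A minimal sketch of such a comparison is given below; the paper reports only the resulting values, so the use of the absolute difference in Hz and of the maximum as the summary statistic is an assumption.

import numpy as np

def contour_errors(benchmark_f0, test_f0):
    # Per-frame absolute difference (Hz) against the autocorrelation benchmark,
    # summarized by its maximum over all frames.
    n = min(len(benchmark_f0), len(test_f0))
    err = np.abs(np.asarray(benchmark_f0[:n], dtype=float) -
                 np.asarray(test_f0[:n], dtype=float))
    return err, float(np.max(err))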
Figure 7 shows the individual pitch contours for the five algorithms. Based on these observations and the analysis of the fundamental frequencies within each frame, the results revealed that the crosscorrelation of speech with the wavelet basis produced the largest error overall for the vowel sounds. This result implies that the comparison between the traditional autocorrelation approach and the crosscorrelation does not provide a very accurate measure of pitch. In general, the wavelet-transformed autocorrelation pitch detection provided the most accurate calculations when compared with the autocorrelation algorithm. The pitch-marking algorithms performed relatively well on average; their error calculations were very similar to each other, which implies that the glottal closure instants identified in each frame were numerically very close, so their pitch calculations would be similar as well.

Figure 6 - Graphical plot of pitch contours for the 'e' sound

Figure 7 - Pitch contours for the PPE algorithms: (a) Short-term autocorrelation, (b) Short-term crosscorrelation with wavelet transform, (c) Autocorrelation with wavelet-transformed speech, (d) Pitch-marking GCI algorithm, and (e) Pitch-marking wavelet transform algorithm
CONCLUSIONS
Based on the results, the wavelet autocorrelation algorithms for short-term pitch detection are accurate when compared to the conventional autocorrelation method. The pitch contours generated for each frame produced minimal error. The crosscorrelation method clearly produces higher error than the wavelet-transformed method, likely due to the differences between the characteristics of the mother wavelets and the input speech. The pitch-marking algorithms provide less error than the crosscorrelation method, with better accuracy and scalability characteristics. Based on the calculations, there is little difference between the wavelet-based GCI approach and the conventional GCI approach. Thus, in both short-term and pitch-marking PPE algorithms, wavelets provide an alternative approach with comparatively good results for pitch detection. Future work will investigate other characteristics of speech that are amenable to wavelet processing.

REFERENCES
[1] L. Janer, J. Bonet, and E. Lleida-Solano, "Pitch Detection and Voiced/Unvoiced Decision Algorithm Based on Wavelet Transforms", Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, PA, October 3-6, 1996.
[2] Gavat, M. Zirra, and V. Enescu, "Pitch Detection of Speech by Dyadic Wavelet Transform", Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT), 1996.
[3] V. Goncharoff and P. Gries, "An Algorithm for Accurately Marking Pitch Pulses in Speech Signals", Proceedings of the IASTED International Conference on Signal and Image Processing (SIP'98), Las Vegas, NV, October 28-31, 1998.
[4] M. Sakamoto and T. Saitoh, "An Automatic Pitch-Marking Method Using the Wavelet Transform", Proceedings of the ICSLP, Beijing, China, October 15-21, 2000.
[5] S. Walker and S. Foo, "Optimal Wavelets for Speech Signal Representations", Journal of Systemics, Cybernetics and Informatics, Volume 1, June 2003.