International Symposium on Signal Processing and its Applications, ISSPA, Gold Coast, Australia, 25-30 August, 1996. Organized by the Signal Processing Research Centre, QUT, Brisbane, Australia.

FORMANT DETECTION THROUGH INSTANTANEOUS-FREQUENCY ESTIMATION USING RECURSIVE LEAST SQUARE ALGORITHM

S. Ghaemmaghami, M. Deriche, and B. Boashash
Signal Processing Research Centre, Queensland University of Technology
GPO Box 2434, Brisbane, Q 4001, Australia
shahrokh@markov.eese.qut.edu.au, m.deriche@qut.edu.au

Abstract

Formant frequencies, represented by major peaks in the spectrum, convey important information about speech. "Instantaneous Frequency" (IF) estimation is one method of tracking formants. This paper proposes a method to detect the formants of voiced speech using a Recursive Least Squares (RLS) algorithm. The accuracy of the method is assessed by comparing it with conventional formant detection techniques. The method is also analysed from the viewpoint of phonetic conformity using "Temporal Decomposition".

1. Introduction

Formant frequencies play an important role in most speech applications. In fact, a large amount of phonetic information is conveyed by the corresponding parts of speech signals [1]. Accordingly, formant detection and tracking are very useful in extracting speech features and in tracing their evolutionary behavior. Typically, speech formants are detected by searching for peaks in spectral representations obtained with the short-time Fourier Transform, filter-bank analysis, or Linear Prediction [1]. Such methods yield an accuracy of about 90% for the first three formants [10]. In short-time Fourier Transform based methods, the peaks in the spectrum of successive windowed speech segments are detected. The frequency resolution increases with a longer window, which in turn reduces the time resolution, so there is a trade-off between formant detection capability (frequency resolution) and formant tracking accuracy (time resolution) [2]. Formant estimation through filter-bank analysis is essentially the same as the Fourier Transform based method, with the spectrum peaks detected from the outputs of a filter bank applied to a given segment of speech [1]. In the Linear Prediction method, two techniques are typically used: root finding, and peak picking of the reciprocal of the inverse filter in the LPC model. Both procedures yield estimates of the formants on the basis of an all-pole approximation of speech, which stems from speech predictability in the time domain [2]. The poles of the model, or the roots of the denominator of the LPC system function, approximate the spectrum peaks, or formants. The inverse filter analysis generally gives more accurate formant estimates than the root finding procedure, but the accuracy is still typically less than 90% [2].

Such formant trackers work well when the formants change slowly and gradually, but in the case of abrupt changes in formant trajectories, methods based on the short-time Fourier Transform give poor results [4]. Another difficulty with these methods arises when two adjacent formants are close enough to produce a single peak in the spectrum during certain intervals, so that one of the existing formants is missed. To resolve these problems, firstly, a high-resolution time-frequency method is required to follow sudden formant changes and, secondly, the method must have a memory of the past in order to discriminate two close formants on the basis of formant continuity.

Instantaneous Frequency (IF) estimation is one of the effective methods for detecting and tracking the frequency changes of a mono-component signal, but in the case of multi-component signals the result can be meaningless [4]. If s(t) is the signal of interest, it can be defined in terms of the amplitudes and phases of its individual components [5]:

    s(t) = A_1(t) e^{j φ_1(t)} + A_2(t) e^{j φ_2(t)} + ...        (1)

where the frequency and bandwidth of each component are obtained from the corresponding φ_i(t) and A_i(t) (i = 1, 2, ...).

In the case of speech, which can be seen as a multi-component signal, each component represents a specific formant, or a path in the time-frequency domain [1]. Although there is in principle an infinite number of formants in speech signals, 4 or 5 formants are sufficient to represent the vocal tract characteristics [2]. Consequently, there is no need to detect and follow all speech frequency components in most applications.
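As a concrete illustration of the model in Eq. (1), the short sketch below (not part of the original paper; all parameter values are arbitrary assumptions) builds a two-component AM-FM signal in Python and recovers the instantaneous frequency of each component as the derivative of its phase.

```python
# Illustrative sketch (not from the paper): a two-component AM-FM signal in the
# form of Eq. (1). The instantaneous frequency of each component is the
# derivative of its phase; all parameter values below are arbitrary assumptions.
import numpy as np

fs = 8000.0                       # sampling frequency (Hz)
t = np.arange(0, 0.2, 1.0 / fs)   # 200 ms time axis

phi1 = 2 * np.pi * (500 * t + 100 * t ** 2)   # slowly rising track near 500 Hz
A1 = np.exp(-5 * t)                           # decaying amplitude
phi2 = 2 * np.pi * 1500 * t                   # fixed track at 1500 Hz
A2 = 0.5 * np.ones_like(t)

s = A1 * np.exp(1j * phi1) + A2 * np.exp(1j * phi2)   # Eq. (1) with two components

if1 = np.gradient(phi1, t) / (2 * np.pi)   # IF of component 1 (Hz)
if2 = np.gradient(phi2, t) / (2 * np.pi)   # IF of component 2 (Hz)
```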



A number of methods have been proposed in the literature to detect and track the individual components of multi-component signals through IF estimation [4]. Most of these methods give high instantaneous resolution in both the time and frequency domains, which is primarily what is required to cope with the above problems in formant tracking. Their major drawback, on the other hand, is a tendency to follow spurious peaks in the speech spectrum, leading to confusion among the main formants at some instants [9]. To alleviate this problem, we use the Recursive Least Squares (RLS) algorithm, which has been shown to be one of the most efficient methods for the analysis of quasi-stationary signals [7]. RLS adaptively models the data as a linear prediction sequence, through an exponentially weighted approximation to the inverse of the covariance matrix [4],[7]. The algorithm parameters are extracted recursively [4] as:

    e_{n+1} = x(n+1) + a_n^T x_n
    a_{n+1} = a_n - e_{n+1} P_n x_n                                   (2)
    P_n = [ α P_{n-1}^{-1} + x_n x_n^T ]^{-1}

where n is the time index, P_n is an approximation to the inverse of the covariance matrix, a_n is the vector of prediction filter coefficients at time n, e_{n+1} is the prediction error at time n+1, and α is the forgetting factor. x_n is the signal vector at time n, given as:

    x_n = [ x(n)  x(n-1)  ...  x(n-L+1) ]^T

where L is the length of the prediction filter and T denotes transposition. The RLS algorithm provides a tracking ability that relies on the past of the signal. Using this algorithm, the frequency of the predominant component of speech, that is, the major formant, can be obtained. To find all the desired formants, an iterative procedure is proposed in this paper. The accuracy in formant determination achieved with this method is compared to that of traditional techniques. We also evaluate the proposed method by phonetic relevance analysis via temporal decomposition (TD) [8] of the matrix of formants.
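The following is a minimal sketch (not the authors' implementation; the predictor order L, forgetting factor alpha, and initialization delta are assumptions) of how the recursion in Eq. (2) can drive a dominant-frequency tracker: an order-L forward linear predictor is updated sample by sample, and the instantaneous estimate of the predominant component is read off as the angle of the strongest root of the prediction polynomial.

```python
# Hedged sketch of RLS-based IF tracking per Eq. (2); parameter values are assumptions.
import numpy as np

def rls_if_track(x, fs, L=4, alpha=0.98, delta=100.0):
    """Return one dominant-frequency estimate (Hz) per sample of x."""
    a = np.zeros(L)               # prediction coefficients a_n
    P = np.eye(L) / delta         # approximation to the inverse covariance, P_n
    f_hat = np.zeros(len(x))
    for n in range(L, len(x) - 1):
        xn = x[n:n - L:-1]                      # x_n = [x(n), ..., x(n-L+1)]^T
        e = x[n + 1] + a @ xn                   # prediction error e_{n+1}
        k = P @ xn / (alpha + xn @ P @ xn)      # gain vector (matrix-inversion-lemma form of P_n x_n)
        a = a - k * e                           # coefficient update, cf. Eq. (2)
        P = (P - np.outer(k, xn) @ P) / alpha   # P_n update, cf. Eq. (2)
        # roots of A(z) = 1 + a1 z^-1 + ... + aL z^-L; keep the strongest resonance
        r = np.roots(np.concatenate(([1.0], a)))
        r = r[np.imag(r) > 0]                   # positive-frequency roots only
        if r.size:
            f_hat[n] = np.angle(r[np.argmax(np.abs(r))]) * fs / (2 * np.pi)
    return f_hat
```

Applied to the de-emphasized, pitch-filtered signal produced by Steps 1 and 2 below, f_hat would approximate the trajectory of the first formant.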

2. The method

Using the RLS technique to estimate the IF of voiced speech, only the frequency of the major component, which is either the fundamental frequency or the first formant, can be detected and tracked. To find the other formants, therefore, we need to modify the spectral characteristics of the signal so that the predominant component becomes the desired formant, detectable by the RLS algorithm. To do this, we apply the algorithm to the speech signal in consecutive steps, each associated with an appropriate preprocessing stage, in order to shift the frequency predominance from one formant to the next. Accordingly, in each step one formant is detected and then tracked. In the following, the method is described in detail.

Step 1. De-emphasis

We first de-emphasize the speech signal by H_d(z) to attenuate the high-frequency components:

    H_d(z) = 1 + β z^{-1}                                             (3)

where β = 1 in the experiments.
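Step 1 amounts to a two-tap FIR filter. The sketch below is an assumed implementation with β = 1 as in the experiments; the placeholder signal is random noise standing in for sampled speech.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling frequency used in the experiments (Hz)
s = np.random.randn(fs // 5)   # placeholder for 200 ms of sampled speech
beta = 1.0                     # beta = 1 in the paper's experiments
s_de = lfilter([1.0, beta], [1.0], s)   # H_d(z) = 1 + beta * z^{-1}, Eq. (3)
```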

Step 2. Pitch removal

To remove the fundamental frequency F_0, which is the reciprocal of the pitch period and could be troublesome in finding the first formant, the signal is passed through an adaptive high-pass filter based on pitch information. To detect the pitch, we use cepstral analysis combined with an adaptive thresholding technique to increase the reliability of the process. Given s(n) as the discrete speech signal, the cepstrum coefficients are calculated as:

    c_m(k) = Re{ F^{-1}{ log|S_m(k)| } }                              (4)

where c_m(k) is the k-th cepstrum coefficient at frame m, Re denotes the real part, F^{-1} represents the inverse Fourier Transform, and S_m(k) is the Fourier Transform of the windowed s(n) at frame m, using a Hanning window.

A frame is taken as voiced, and the pitch value is determined, if there is a detected peak in the cepstrum larger than a designated threshold within a certain interval in which a reasonable pitch is expected (corresponding to pitch periods in the range of 2-15 msec, which covers almost all male and female pitch values [1]). Taking T_1 and T_2 as the smallest and largest pitch periods (2 and 15 msec, respectively), the pitch period is calculated as P = k_max / F_s if

    c_m(k_max) > t_c (1 / (k_2 - k_1)) Σ_{k=k_1}^{k_2} c_m(k),        k_1 ≤ k_max ≤ k_2

where P is the pitch period, k_max is the index of the first cepstrum coefficient that exceeds the threshold, F_s is the sampling frequency, k_1 = T_1 F_s, k_2 = T_2 F_s, and t_c is a predefined threshold factor. In this method, as is evident, the threshold is not fixed across frames but is determined on the basis of the sum of the cepstrum coefficients in the search interval. This pitch detector has a low sensitivity to t_c, which can therefore be kept fixed (typically, t_c = 5) for a large number of speech utterances spoken by different speakers.

The first two harmonics of the pitch are usually troublesome in detecting the first formant [10], but the other pitch harmonics do not really affect the formant structure, for two reasons: the low-pass characteristic of the vocal tract, which attenuates the high-frequency harmonics of the pitch, and the power spectrum of the glottal pulse train (during voiced speech), which typically decreases at higher frequencies. The procedure used to remove the first two harmonics of the pitch is similar to that described in Step 4 (filtering), using an appropriate time-varying high-pass filter whose cutoff frequency is set to remove the second pitch harmonic (as well as the first one).
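A compact sketch of the voicing/pitch decision in Step 2 is given below. It is an assumed implementation: the frame length and the exact normalization of the adaptive threshold (here, t_c times the average cepstral value over the search interval) are not specified at this level of detail in the text.

```python
import numpy as np

def cepstral_pitch(frame, fs, T1=0.002, T2=0.015, t_c=5.0):
    """Return the pitch period P in seconds, or None if the frame is taken as unvoiced."""
    w = np.hanning(len(frame))
    S = np.fft.rfft(frame * w)                    # S_m(k), Hanning-windowed spectrum
    c = np.fft.irfft(np.log(np.abs(S) + 1e-12))   # c_m(k), real cepstrum, Eq. (4)
    k1, k2 = int(T1 * fs), int(T2 * fs)           # lag range for 2-15 ms pitch periods
    seg = c[k1:k2]
    threshold = t_c * np.mean(seg)                # adaptive threshold over the search interval
    above = np.nonzero(seg > threshold)[0]
    if above.size == 0:
        return None                               # no qualifying cepstral peak: unvoiced frame
    k_max = k1 + above[0]                         # first coefficient exceeding the threshold
    return k_max / fs                             # pitch period P = k_max / F_s
```

In use, cepstral_pitch would be applied frame by frame to the de-emphasized signal, and the resulting pitch track would steer the high-pass filter that strips F_0 and its first two harmonics.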



Step 3. IF estimation

The RLS method is then used to estimate the first formant, which is usually evident as the major peak in the spectrogram.

Step 4. Filtering

To remove the spectral components associated with the first formant and to continue the procedure, an adaptive time-varying filter is needed. We first partition the speech signal into 50% overlapped segments. Then we remove the part of the spectrum corresponding to the first formant, on the basis of the IF values associated with each segment, in an adaptive way, using a sharp high-pass filter. The cutoff frequency of the filter is set on the basis of the average formant bandwidths (50, 80, 120, 200, and 200 Hz for the first five formants, respectively [1],[2]).
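One plausible realization of Step 4 is sketched below (an assumption, not the paper's code): each 50%-overlapped, Hanning-weighted segment is high-pass filtered with a cutoff placed just above the formant frequency tracked in that segment, using the nominal bandwidths quoted above, and the segments are recombined by overlap-add.

```python
import numpy as np
from scipy.signal import butter, filtfilt

NOMINAL_BW = [50.0, 80.0, 120.0, 200.0, 200.0]   # average bandwidths of F1-F5 (Hz) [1],[2]

def remove_formant(x, f_track, fs, formant_idx, seg_len=240, order=8):
    """High-pass each segment above the locally tracked formant frequency."""
    hop = seg_len // 2                       # 50% overlap
    win = np.hanning(seg_len)
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - seg_len, hop):
        seg = x[start:start + seg_len]
        # cutoff = local formant estimate + half its nominal bandwidth
        fc = np.median(f_track[start:start + seg_len]) + NOMINAL_BW[formant_idx] / 2.0
        fc = min(max(fc, 50.0), 0.45 * fs)   # keep the cutoff in a usable range
        b, a = butter(order, fc / (fs / 2.0), btype='highpass')
        y[start:start + seg_len] += win * filtfilt(b, a, seg)
        norm[start:start + seg_len] += win
    return y / np.maximum(norm, 1e-6)        # overlap-add with window normalization
```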

Step 5. Estimation of the next formants

After removing the first formant, the procedure continues as described in Steps 3 and 4 to estimate the next formants.

For formant tracking, the formant trajectory alone is not sufficient, as it carries no information about the amount of energy. We therefore need to separate the detected formant regions using appropriate time-varying filters in the time-frequency plane. This is done in the same way as in Step 4.
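Putting the five steps together, the loop below is a hedged sketch of the overall procedure, built on the hypothetical helpers rls_if_track and remove_formant defined earlier. The fixed 250 Hz high-pass used here in place of the pitch-adaptive filter of Step 2 is a simplification for illustration only.

```python
import numpy as np
from scipy.signal import lfilter, butter, filtfilt

def track_formants(s, fs, n_formants=5):
    x = lfilter([1.0, 1.0], [1.0], s)            # Step 1: de-emphasis, H_d(z) = 1 + z^-1
    b, a = butter(6, 250.0 / (fs / 2.0), btype='highpass')
    x = filtfilt(b, a, x)                        # Step 2 (simplified): suppress the F0 region
    tracks = []
    for i in range(n_formants):                  # Steps 3-5: detect, strip, and repeat
        f_track = rls_if_track(x, fs)            # IF of the currently dominant formant
        tracks.append(f_track)
        x = remove_formant(x, f_track, fs, i)    # remove that formant region (Step 4)
    return np.array(tracks)                      # one trajectory per formant, per sample
```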

3. Experimental results

The initial results obtained using the proposed formant detection method, for a number of voiced utterances spoken by different speakers, are summarized in Table 1.

Table 1. Overall error in formant detection and tracking.

    Formant          F1    F2    F3    F4     F5
    Accuracy (Hz)    48    51    92    440    680

The results represented in Figure 1 are obtained from processing a composite voiced speech signal composed of /o/, /i/, and /a/, uttered by a male speaker. It has been low-pass filtered at about 4 kHz and sampled at 8 kHz. The total duration is 200 msec, which can approximately be divided into three parts of 87, 48, and 65 msec duration for the three aforementioned sounds, respectively. Figure 1 shows the speech waveform, its spectrogram, and the detected formant trajectories superposed on the spectrogram (white lines).

Figure 1. Temporal and spectral characteristics of the speech segment: (a) waveform; (b) spectrogram; (c) formant trajectories. The detected formant trajectories are shown by white lines.

4. Discussion

The overall accuracy in formant estimation using the proposed method is presented in Table 1. In comparison with the conventional formant trackers described in Section 1, for which the overall accuracy is 60 Hz for the first two formants and 110 Hz for the third one [10], our method performs better. The relatively larger error in the fourth and fifth formants arises from the fact that such formants are intrinsically ambiguous because of the non-stationary behavior of voiced speech. This non-stationarity is the result of the mixed characteristics of the speech source, even during voiced sounds [10], and of the overlapping phonetic structure of speech caused by the coarticulation effect [8], which produces interference with adjacent unvoiced sounds. These non-stationary components appear in the high-frequency region of the spectrum, where the stationary components of voiced sounds are often very weak. This is why most formant vocoders use only three formants and take the fourth formant as fixed [11].

Some other important features of the proposed method can be deduced from the illustrations. For instance, it can be seen that close formants, namely the first two formants at the end of the utterance and the last two at the beginning of the signal, are well detected. Indeed, the RLS algorithm uses the preceding information to predict the new parameters of the signal in the least-squares sense. This behavior, on the other hand, prevents fast tracking of abrupt changes in the formant structure, mostly in the high-frequency region of the spectrum. As shown in the figures, the algorithm smoothes the formant trajectories, especially at higher frequencies, so the error in estimating the higher formants comes mostly from the higher variance and dispersion of the spectrum peaks.



From the viewpoint of speech quality, however, this smoothing does not deteriorate the speech, for two reasons: the above-mentioned low sensitivity to high-frequency distortion, and the auditory preference for speech with a smoothed power spectrum envelope [12]. Nevertheless, the algorithm has been able to detect low-energy formants (the 4th and 5th formants during the first half of the utterance) as well as unclear ones (for instance, the 3rd formant during the second half of the signal). This results from the tracking ability of the RLS algorithm, which is based on the signal statistics [7].

To perform an evaluation from the viewpoint of phonetic conformity, we formed a matrix of formants and took it as the set of spectral parameters of the speech signal. We then decomposed this matrix using temporal decomposition [8]. The result is shown in Figure 2, where the extracted speech events are indicated along the utterance. As can be seen, the three vowel sounds have been detected at nearly the correct locations.

Figure 2. Temporal decomposition of the resulting formant matrix used as the spectral parameters. The extracted events are shown at the top.

It should be noted that although formants are meaningful mostly for voiced speech, the concept of formant estimation can be extended to unvoiced speech as well [11]. On the other hand, the RLS algorithm would not yield such good results with non-stationary signals [7]. Therefore, this method would not be suitable in applications involving formant estimation during unvoiced sounds (as in formant vocoding). In such cases, Least Mean Square (LMS) IF estimation algorithms can give better results during unvoiced sounds than the RLS method (see [4] and [7]). Additionally, we have noticed that even in the case of voiced speech, some degradation can occur at the end-points of the analysed speech segment, due to the coarticulation effect of adjacent unvoiced sounds.

5. Conclusion

We have proposed a formant detection method that uses an RLS algorithm to find and track the IF of a voiced speech signal. The method has been evaluated by comparing the achieved accuracy in formant estimation with that of conventional formant detection techniques. It has also been analysed from the viewpoint of phonetic conformity using temporal decomposition. The major shortcoming of the method appears in formant detection and tracking of speech associated with high-power non-stationary components, such as speech during highly overlapped voiced-unvoiced instants, speech in noisy environments, and female speech. We plan to combine this method with a robust adaptive formant bandwidth estimation technique to increase the system accuracy.

6. References

[1] J. L. Flanagan, "Speech Analysis, Synthesis, and Perception", Springer-Verlag, 1972.
[2] J. R. Deller, Jr., J. G. Proakis, J. H. L. Hansen, "Discrete-Time Processing of Speech Signals", Macmillan Pub. Co., 1993.
[3] H. M. Hanson, P. Maragos, A. Potamianos, "Finding Speech Formants and Modulations Via Energy Separation: With Application to a Vocoder", Proc. ICASSP 93, Vol. 2, pp. 716-719, 1993.
[4] B. Boashash, "Estimating and Interpreting the Instantaneous Frequency of a Signal - Part 2: Algorithms and Applications", Proc. IEEE, Vol. 80, No. 4, pp. 540-568, Apr. 1992.
[5] L. Cohen, K. T. Assaleh, A. Fineberg, "Instantaneous Bandwidth and Formant Bandwidth", Proc. ICASSP 93, pp. 13-17, 1993.
[6] L. Cohen, "What is a Multicomponent Signal?", Proc. ICASSP 92, pp. 113-116, 1992.
[7] S. Haykin, "Adaptive Filter Theory", Prentice-Hall, 1991.
[8] B. S. Atal, "Efficient Coding of LPC Parameters by Temporal Decomposition", Proc. ICASSP 83, pp. 81-84, 1983.
[9] B. Boashash, Ed., "Time-Frequency Signal Analysis", Longman Cheshire, 1992.
[10] D. O'Shaughnessy, "Speech Communication: Human and Machine", Addison-Wesley Pub. Co., 1987.
[11] P. E. Papamichalis, "Practical Approaches to Speech Coding", Prentice-Hall Inc., 1987.
[12] H. P. Knagenhjelm, W. B. Kleijn, "Spectral Dynamics is More Important Than Spectral Distortion", Proc. ICASSP 95, Vol. 1, pp. 732-735, 1995.

