© 2015 The Acoustical Society of Japan
Acoust. Sci. & Tech. 36, 6 (2015)
PAPER
Real-time robust formant estimation system using a phase equalization-based autoregressive exogenous model

Hiroki Oohashi†, Sadao Hiroya and Takemi Mochida

Human Information Science Laboratory, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, 3-1, Morinosato Wakamiya, Atsugi, 243-0198 Japan

(Received 22 January 2015, Accepted for publication 5 June 2015)

Abstract: This paper presents a real-time robust formant tracking system for speech using a real-time phase equalization-based autoregressive exogenous model (PEAR) with electroglottography (EGG). Although linear predictive coding (LPC) analysis is a popular method for estimating formant frequencies, its estimation accuracy degrades for speech with a high fundamental frequency F0, since the harmonic structure of the glottal source spectrum deviates more from the Gaussian noise assumption in LPC as F0 increases. In contrast, PEAR, which employs phase equalization and LPC with an impulse train as the glottal source signal, estimates formant frequencies robustly even for speech with high F0. However, PEAR requires higher computational complexity than LPC. In this study, to reduce this computational complexity, a novel formulation of PEAR was derived, which enabled us to implement PEAR in a real-time robust formant tracking system. In addition, since PEAR requires the timings of glottal closures, a stable detection method using EGG was devised. We developed the real-time system on a digital signal processor and showed that, for both synthesized and natural vowels, the proposed method can estimate formant frequencies more robustly than LPC over a wider range of F0.

Keywords: Formant estimation, Online, Linear predictive coding, Phase equalization

PACS number: 43.72.Ar
[doi:10.1250/ast.36.478]

1. INTRODUCTION
Estimating the resonances of the vocal tract, called formants, plays an important role in speech science and technology. Over the past decades, linear predictive coding (LPC) has been widely used for estimating formant frequencies from speech signals because of its computational simplicity and reasonable estimation accuracy. However, it is also well known that the estimation accuracy degrades for speech with a high fundamental frequency F0. This is because the Gaussian noise assumed as the excitation signal in the LPC model deviates from the actual excitation, especially for high F0 [1]. To overcome this problem, methods based on modeling the excitation signals of voiced speech have been proposed. One such method is discrete all-pole modeling [2], which assumes a periodic impulse excitation in LPC for voiced speech. Others combine LPC with a glottal source hidden Markov model [3] or with the Rosenberg
† e-mail: [email protected] (Currently, Haskins Laboratories)
glottal model [4]. These methods are robust to F0 but have high computational complexity and need around ten iterations. One of the reasons for this high computational complexity is that the phase characteristics of natural speech signals are not minimum phase, as assumed by the speech production model. To reduce the computational complexity of robust formant estimation using LPC, some previous studies [5,6] employ a process that modifies the speech signals so that they fit a simple periodic impulse excitation model. Hiroya and Mochida [6] proposed a phase equalization-based autoregressive exogenous model (PEAR) of speech signals, which applies phase equalization to the speech signals. Phase equalization is a way to modify the phase characteristics of speech signals using a matched filter [7]. Both the spectral envelope and the subjective quality of phase-equalized speech are almost equivalent to those of the original speech, since human auditory perception is less sensitive to the short-term phase characteristics of speech signals [7]. Although iteration is hardly necessary for PEAR thanks to the phase equalization, PEAR is incompatible with a real-time formant tracking system in that the
H. OOHASHI et al.: REAL-TIME FORMANT ESTIMATION BY PEAR
computational complexity of PEAR is several times as large as that of the conventional LPC analysis. A real-time formant tracking system would be an important technology for investigating human speech-production mechanisms [8-10] and for speech-language therapy. The speech transformation and representation by adaptive interpolation of weighted spectrogram (STRAIGHT) [11] can robustly estimate the spectral envelope of speech signals using pitch-synchronous analysis, but it was originally not suitable for real-time processing because of its heavy computation. Recently, several studies have attempted to develop a real-time STRAIGHT [12-14]. One of them [12] achieves a processing delay within 100 ms on a tablet PC, but this delay is still too long for real-time applications. Thus, there have been few studies on a real-time robust formant tracking system. In the present study, we developed real-time PEAR (RT-PEAR) to reduce the computational complexity of PEAR. Furthermore, we evaluated the performance of the proposed system using synthesized and natural vowels in terms of the errors, variances, and the ratio of inter- to intra-vowel variances of the estimated formant frequencies, and their biases toward harmonics.
2. PHASE EQUALIZATION

The idea of phase equalization is to convert the phase characteristics of the original speech signals to minimum phase. This is done by converting the LPC residual signals e to nearly zero phase [7]:

  e(t) = s(t) - \sum_{p=1}^{P} a(p) s(t-p),

where s represents the original speech signals, a denotes the LPC coefficients, and P is the order of the LPC coefficients. As shown in Fig. 1D, the LPC residual signals for natural speech are not zero phase. Phase equalization aims to convert the LPC residual signals in voiced speech into an impulse train spaced at pitch periods, obtained as the output of an (M+1)-tap FIR filter h. Provided that one pulse exists at a known position t_0 in the frame for the sake of simplicity, the aim is achieved by deriving the optimum filter h satisfying the following equation:

  \delta(t - t_0) = \sum_{\tau=-M/2}^{M/2} h(\tau) e(t - \tau),    (1)

where \delta is a delta function representing an impulse of the excitation signals. The filter h is derived by minimizing the mean squared error between the left and right sides of Eq. (1) over a frame:

  \operatorname*{argmin}_h \sum_t \Bigl( \sum_{\tau=-M/2}^{M/2} h(\tau) e(t-\tau) - \delta(t - t_0) \Bigr)^2.

Note that increasing the number of taps M+1 of the FIR filter h decreases the mean squared error. If the autocorrelation function of e is a delta function for time delays up to M+1, then

  h(t) = e(t_0 - t) \Big/ \sqrt{ \sum_{\tau=-M/2}^{M/2} e(t_0 + \tau)^2 }.    (2)

That is, the LPC residual signals are converted into a positive impulse train through the FIR filter whose coefficients are the values of the LPC residual signals themselves, time-reversed about a reference position. For the obtained h, the phase-equalized speech signals x are computed by

  x(t) = \sum_{\tau=-M/2}^{M/2} h(\tau) s(t - \tau).    (3)

Fig. 1 An instance of waveforms for the Japanese vowel /i/ produced by a female. (A) Original speech signals. (B) The excitation signal model assumed by PEAR, composed of Gaussian noise and an impulse train spaced at pitch marks t_0, ..., t_I. (C) Phase-equalized LPC residual signals and the pitch marks detected from them. (D) LPC residual signals. Shaded areas, which range around points delayed from the glottal closures derived from the EGG signals, are the search scopes for a typical pitch mark; the delay was calculated by cross-correlation between the LPC residuals and the EGG. (E) EGG signals. (F) Differentiated EGG signals.
Figure 1C shows an example of the results of phase equalization. The phase-equalized LPC residual signals show very sharp spikes at the instants corresponding to the timings of glottal closure, which are referred to as pitch marks.
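As a concrete illustration, the matched filter of Eqs. (2) and (3) can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' implementation; the toy residual burst and all variable names are assumptions:

```python
import numpy as np

def phase_equalization_filter(e, t0, M):
    """Matched filter h of Eq. (2): the LPC residual around the reference
    pitch mark t0, time-reversed and normalized to unit energy."""
    half = M // 2
    seg = e[t0 - half : t0 + half + 1]            # e(t0 + tau), tau in [-M/2, M/2]
    return seg[::-1] / np.sqrt(np.sum(seg ** 2))  # h(tau) = e(t0 - tau) / norm

def phase_equalize(sig, h):
    """Eq. (3): filter a signal with the (M+1)-tap FIR filter h.
    mode='same' keeps the zero-delay alignment for an odd-length h."""
    return np.convolve(sig, h, mode="same")

# Toy example: a residual with a single non-zero-phase burst at t0 = 100.
rng = np.random.default_rng(0)
e = np.zeros(200)
t0 = 100
e[t0 : t0 + 8] = rng.standard_normal(8)

h = phase_equalization_filter(e, t0, M=16)
y = phase_equalize(e, h)   # phase-equalized residual
# y peaks sharply (and positively) at the pitch mark t0.
```

Applying the same h to the speech signal s rather than to e yields the phase-equalized speech x of Eq. (3).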
3. PROPOSED METHOD
Phase equalization has been used to optimize the excitation signals of voiced speech for low-bit-rate speech coding [7], but not for estimating the spectral envelope of voiced speech. In this section, we first describe a method for estimating LPC coefficients from phase-equalized speech signals in accordance with the original PEAR [6]. The original PEAR enables us to estimate formant frequencies robustly even from speech signals with high F0; however, a reduction in computational complexity is required for real-time processing. To achieve this reduction, we next present a novel formulation of LPC with an impulse train, together with a formulation using the TANDEM method [15]. Taking practical application into consideration, stable pitch mark detection is also necessary. For this, we finally propose a method that uses electroglottography (EGG) signals in addition to the LPC residual signals.

3.1. Original PEAR
Let the phase-equalized speech signals be the output of the LPC filter whose input comprises the impulse train corresponding to pitch marks t_0, ..., t_I and Gaussian noise elsewhere in the frame (Fig. 1B). Thus, we consider minimizing the following function:

  \sum_{t \ne t_0,...,t_I} \Bigl( x_w(t) - \sum_{p=1}^{P} \hat{a}(p) x_w(t-p) \Bigr)^2
  + \sum_{t = t_0,...,t_I} \Bigl( x_w(t) - \sum_{p=1}^{P} \hat{a}(p) x_w(t-p) - G_w(t) \Bigr)^2,    (4)

where G_w(t_i) for i = 0, ..., I is the windowed impulse amplitude and I+1 is the number of impulses in the frame. The LPC coefficients \hat{a} are calculated by solving the following simultaneous equations:

  \begin{pmatrix}
    R_{xx}(0)   & \cdots & R_{xx}(P-1) \\
    \vdots      & \ddots & \vdots      \\
    R_{xx}(P-1) & \cdots & R_{xx}(0)
  \end{pmatrix}
  \begin{pmatrix} \hat{a}(1) \\ \vdots \\ \hat{a}(P) \end{pmatrix}
  =
  \begin{pmatrix}
    R_{xx}(1) - \sum_{i=0}^{I} x_w(t_i - 1) G_w(t_i) \\
    \vdots \\
    R_{xx}(P) - \sum_{i=0}^{I} x_w(t_i - P) G_w(t_i)
  \end{pmatrix},    (5)

where R_{xx} is the autocorrelation function of the windowed phase-equalized speech signals x_w:

  R_{xx}(q) = \sum_{t=0}^{L-1} x_w(t) x_w(t+q),    (6)

where L is the window length. As the matrix in Eq. (5) is Toeplitz, we can use the Levinson algorithm to solve it efficiently [16]. The impulse amplitude is obtained so that Eq. (4) is minimized: G_w(t_i) = x_w(t_i) - \sum_{p=1}^{P} \hat{a}(p) x_w(t_i - p) for i = 0, ..., I, where w is the window function. Therefore, we determine the LPC coefficients and the impulse amplitudes iteratively, although we find that iteration is hardly necessary. In unvoiced speech, since no pitch mark exists, the number of pitch marks I+1 is zero; in this case, Eq. (5) is equivalent to the Toeplitz system of the conventional LPC analysis, i.e., the autocorrelation method.

3.2. Real-time PEAR
Equation (5) requires calculating the phase-equalized speech signals x, their autocorrelation function R_{xx}, and the impulse amplitudes G_w. To reduce the computational complexity, we introduce the following assumptions. By substituting Eqs. (2) and (3) into Eq. (6) under the assumption that the autocorrelation function of e is a delta function for time delays up to M+1, R_{xx}(q) corresponds to the autocorrelation function of the windowed original speech signals s:

  R_{ss}(q) = \sum_{t=0}^{L-1} s_w(t) s_w(t+q).

Moreover, approximating w(t_i - p) by w(t_i), we obtain G_w(t_i) \simeq w(t_i) \sqrt{ \sum_{\tau=-M/2}^{M/2} e(t_i + \tau)^2 }. Therefore,

  V(p) = \sum_{i=0}^{I} x_w(t_i - p) G_w(t_i)
       \simeq \sum_{i=0}^{I} w(t_i - p) w(t_i) \sum_{\tau=-M/2}^{M/2} e(t_i - \tau) s(t_i - p - \tau).

The LPC coefficients \hat{a} are then obtained by solving the following equation:

  \begin{pmatrix}
    R_{ss}(0)   & \cdots & R_{ss}(P-1) \\
    \vdots      & \ddots & \vdots      \\
    R_{ss}(P-1) & \cdots & R_{ss}(0)
  \end{pmatrix}
  \begin{pmatrix} \hat{a}(1) \\ \vdots \\ \hat{a}(P) \end{pmatrix}
  =
  \begin{pmatrix}
    R_{ss}(1) - V(1) \\
    \vdots \\
    R_{ss}(P) - V(P)
  \end{pmatrix}.    (7)

Note that the phase-equalized speech signals x, their autocorrelation function R_{xx}, and the impulse amplitudes G_w no longer appear in Eq. (7). Moreover, the left-hand-side matrix is the same one already decomposed by the Levinson-Durbin algorithm [16] in conventional LPC. Thus, the computational complexity of RT-PEAR is smaller than that of the original PEAR.
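Under these approximations, the normal equations (7) can be assembled and solved directly from the original speech signals. The following NumPy sketch is our own illustration of that formulation, not the authors' code; the variable names and the AR test signal are assumptions:

```python
import numpy as np

def rt_pear(s, w, e, pitch_marks, P=8, M=16):
    """Sketch of Eq. (7): solve for LPC coefficients a_hat from the
    windowed-speech autocorrelation R_ss minus the impulse-train term V."""
    half = M // 2
    sw = s * w
    L = len(s)
    # R_ss(q) = sum_t s_w(t) s_w(t+q)
    R = np.array([np.dot(sw[: L - q], sw[q:]) for q in range(P + 1)])
    # V(p) ~= sum_i w(t_i-p) w(t_i) sum_tau e(t_i-tau) s(t_i-p-tau)
    taus = np.arange(-half, half + 1)
    V = np.zeros(P + 1)
    for ti in pitch_marks:                 # each t_i must lie > half+P from the edges
        for p in range(1, P + 1):
            V[p] += w[ti - p] * w[ti] * np.sum(e[ti - taus] * s[ti - p - taus])
    # Toeplitz system of Eq. (7); Levinson-Durbin would solve this in O(P^2).
    A = R[np.abs(np.subtract.outer(np.arange(P), np.arange(P)))]
    return np.linalg.solve(A, R[1 : P + 1] - V[1:])
```

With no pitch marks (unvoiced frames), V vanishes and the function reduces to the conventional autocorrelation method; in the real-time system the Toeplitz solve would reuse the Levinson-Durbin recursion already run for conventional LPC.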
3.3. TANDEM Window
Even when RT-PEAR is applied to estimate a spectrum, the obtained spectrum is not temporally stable. Kawahara et al. [15] found that a temporally stable power spectrum of a periodic signal can be calculated as the average of two power spectra obtained with a pair of time windows temporally separated by half of the fundamental period, called a TANDEM window. According to the Wiener-Khinchin theorem, the power spectrum is the Fourier transform of the corresponding autocorrelation function. Thus, to apply the TANDEM window to RT-PEAR, we use the average of the two autocorrelation functions R_ss and the average of the two terms V in Eq. (7) computed for the temporally separated windows.

3.4. Pitch Mark Detection
In previous studies [6,7], the positions of pitch marks t_0, ..., t_I were detected on the basis of the LPC residual signals. However, pitch mark detection is difficult for speech with high F0 and under environmental noise, because the peaks of the LPC residual signals corresponding to pitch marks sometimes cannot be distinguished from other local peaks (Fig. 1D). We preliminarily confirmed that misdetection of pitch marks sometimes degraded the accuracy of formant estimation by RT-PEAR. Thus, we used EGG signals in addition to the LPC residual signals. Concretely, after the timings of glottal closures were obtained by selecting peaks of the derivative of the EGG signals (ΔEGG), an impulse train spaced at those timings was constructed. Then, the optimal delay between this impulse train and the LPC residual signals was calculated by cross-correlation between them. Next, the position of a typical pitch mark was obtained by seeking the sharpest peak of the LPC residual signals e around each sample delayed from the EGG-derived timings of glottal closure (shaded areas in Figs. 1C and 1D). The sharpness of a peak was quantified by the sum of squared differences between a reference point and the samples around it: \sum_i \{ e(t) - e(t+i) \}^2. Using the typical pitch mark, the phase-equalized LPC residual signals were calculated by Eqs. (2) and (3). Finally, the positions of the pitch marks were obtained by selecting peaks of the phase-equalized LPC residual signals (Fig. 1C).

3.5. Algorithm
In summary, the LPC coefficients were calculated as follows. Conventional LPC analysis calculated the autocorrelation function R_ss, the LPC coefficients a, and the LPC residual signals e from the speech signals s. Then, for voiced speech, the pitch marks t_0, ..., t_I were obtained using the LPC residual and EGG signals. Finally, the LPC (RT-PEAR) coefficients \hat{a} were calculated.
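The pitch-mark search of Sect. 3.4 can be sketched as follows. The defaults mirror the parameters stated later in Sect. 4 (5-sample search scopes and 17-sample sharpness windows); the function names and the toy residual are our own assumptions:

```python
import numpy as np

def sharpness(e, t, width=8):
    """Peak sharpness: sum of squared differences between the reference
    sample e(t) and the 2*width samples around it (17 samples in total)."""
    offs = np.arange(-width, width + 1)
    offs = offs[offs != 0]
    return float(np.sum((e[t] - e[t + offs]) ** 2))

def typical_pitch_marks(e, closures, delay, search=2):
    """For each EGG-derived glottal closure, shift by the cross-correlation
    delay and pick the sharpest residual peak among 2*search+1 candidates."""
    marks = []
    for gc in closures:
        cand = np.arange(gc + delay - search, gc + delay + search + 1)
        marks.append(int(cand[np.argmax([sharpness(e, int(t)) for t in cand])]))
    return marks
```

In the full system these typical pitch marks seed the phase equalization of Eqs. (2) and (3), and the remaining pitch marks are then picked as peaks of the phase-equalized residual.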
4. EXPERIMENTS
In the present study, we implemented TANDEM RT-PEAR and the conventional LPC analysis with the TANDEM window on a digital signal processor (DSP) equipped with a Renesas SH7785, together with an EGG device developed by Glottal Enterprises. This microprocessor uses an SH-4A CPU core with a maximum operating frequency of 600 MHz and realizes a processing performance of 1,080 MIPS. Although the TANDEM window was applied to estimate a temporally stable spectrum, we simply refer to the methods as RT-PEAR and conventional LPC analysis. The TANDEM window makes it possible to decrease the variances of the estimated formant frequencies, especially for speech with low F0, while the average frequencies remain unchanged. Offline analysis revealed that the errors in formant estimation by RT-PEAR were similar to those by the original PEAR.

On the DSP side, speech signals were digitized at an 8-kHz sampling rate, and the digital signals were sent to a receive buffer whose length was 4 ms. The digital signals included in four adjacent buffers were pre-emphasized by a first-order high-pass filter whose transfer function was H(z) = 1 - 0.97 z^{-1}. Then, a 16-ms Blackman window was applied to the pre-emphasized signals, and eight LPC coefficients were obtained every 4 ms. PEAR, i.e., an autoregressive exogenous model, does not guarantee filter stability, unlike the conventional LPC analysis, but no unstable filters occurred in these experiments, mainly because of the pre-emphasis. For the pitch mark detection, we searched for typical pitch marks among five samples around the points delayed from the glottal closures derived from the EGG signals, and evaluated the sharpness over 17 samples around each candidate for a typical pitch mark.

Table 1 Correct formant frequencies F1,2 of the five synthesized Japanese vowels in Hertz.

          /i/     /e/     /a/     /u/     /o/     Mean
  F1      310     470     780     330     420     462
  F2    2,300   2,040   1,200   1,120     710   1,474

4.1. Synthesized Vowels
Since it is difficult to determine the correct formant frequencies of natural speech signals for the evaluation of formant estimation errors, we synthesized and used the five Japanese vowels /i, e, a, u, o/. The duration of each vowel was two seconds. The steady-state vowels were synthesized from the first four formant frequencies F1,...,4, their bandwidths, and F0 using the Klatt formant synthesizer [17]. F1,2 of these vowels are shown in Table 1. Values of F0 ranged from 100 to 300 Hz in increments of 20 Hz. A time-reversed, low-pass-filtered impulse train was used for the excitation signals.
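The per-frame front end described above (4-ms buffers, a 16-ms Blackman window, pre-emphasis 1 - 0.97z^{-1}, order-8 LPC at 8 kHz) can be sketched as follows. This is our illustration of the stated parameters, not the DSP code itself:

```python
import numpy as np

FS = 8000                 # sampling rate [Hz]
HOP = int(0.004 * FS)     # one 4-ms receive buffer = 32 samples
FRAME = 4 * HOP           # four adjacent buffers = 16-ms frame = 128 samples

def pre_emphasis(x, coef=0.97):
    """First-order high-pass filter H(z) = 1 - coef * z^{-1}."""
    y = np.copy(x)
    y[1:] -= coef * x[:-1]
    return y

def analysis_frames(signal):
    """Yield 16-ms Blackman-windowed, pre-emphasized frames every 4 ms."""
    win = np.blackman(FRAME)
    for start in range(0, len(signal) - FRAME + 1, HOP):
        yield pre_emphasis(signal[start : start + FRAME]) * win
```

Each yielded frame would then feed the autocorrelation/RT-PEAR analysis to produce eight LPC coefficients every 4 ms.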
It is considered that errors in formant estimation by PEAR are largely related to the performance of phase equalization for the input speech signals. In general, the phase characteristics of natural speech signals are not minimum. In order to approximate them with the synthesized signals, the impulse train was passed through a second-order all-pass filter, in line with the idea in a previous study [18]:

  H(z) = \frac{ (\alpha - z^{-1})(\beta - z^{-1}) }{ (1 - \alpha z^{-1})(1 - \beta z^{-1}) },    (8)

where the coefficients \alpha and \beta, which determine the phase characteristics of the synthesized signals, were set to 0.0, 0.5, or 0.9 in the present study. As shown in Fig. 2, when (\alpha, \beta) = (0.0, 0.0), the filtered impulse train has zero phase. A previous study [18] reports that, when (\alpha, \beta) = (0.9, 0.9), the phase deviations from natural speech signals are minimal.

Fig. 2 The all-pass filtered impulse trains used for synthesizing vowels in the present study. The upper-left numbers in each panel indicate the filter coefficients \alpha and \beta in Eq. (8), which determine the phase characteristics of the signals.

For the pitch mark detection of the synthesized vowels, we simply selected peaks of the impulse trains spaced at pitch periods that were input to the DSP, instead of EGG signals, because of the difficulty of creating quasi-EGG signals corresponding to the synthesized vowels.

To quantify the performance, we used three indices: the percent error of the formant estimation, the bias of the estimation toward harmonics, and the ratio of inter- to intra-vowel variance. For the percent errors, a representative value for each synthesized vowel was calculated by averaging the estimated formant frequencies from the first to the third quarter of the total duration. We then obtained the ratio of the errors in the representatives with respect to the correct formant frequencies as the index:

  100 \, | \hat{F}_i - F_i^{correct} | / F_i^{correct},

where \hat{F}_i and F_i^{correct} denote the estimated and correct i-th formant frequencies.

The bias of the formant frequencies toward harmonics was defined by the following equation:

  b_k = \sqrt{ \sum_{i \in k} ( \hat{F}_i - n F_0 )^2 / N_k } \Big/ F_0,
  \quad n = \operatorname*{argmin}_n | \hat{F}_i^{LPC} - n F_0 |,

for k = \{1\}, \{2\}, or \{1, 2\}, where \hat{F}_i^{LPC} denotes the i-th formant frequency estimated by the conventional LPC analysis and N_k is the size of the set k. A smaller value of this index means that the estimated formant frequencies are more biased toward harmonics. When the correct formant frequency is located around a harmonic for some vowels, misinterpretation may occur because of the resulting small value, but this problem can be solved by averaging the index values over all vowels [3].

Moreover, we calculated the ratio of inter- to intra-vowel variance by the following equation:

  D = \sqrt{ \frac{ \sum_{v \in \{ /i/, /e/, /a/, /u/, /o/ \}} N_v (\mu_v - \mu)^T (\mu_v - \mu) }
             { \sum_{v \in \{ /i/, /e/, /a/, /u/, /o/ \}} \sum_{j=1}^{N_v} (x_j^v - \mu_v)^T (x_j^v - \mu_v) } },

where x_j^v represents the vector of representative formant frequencies (F_1, F_2) of the j-th utterance of vowel category v, \mu_v denotes the mean vector of the x_j^v, \mu is the grand mean vector of the \mu_v, and N_v is the number of utterances of vowel category v. A larger value of this index means that the estimated F1,2 are better clustered in terms of the vowel categories [19].
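These three indices are straightforward to compute. The sketch below is our own illustration of the formulas; the symbol b_k and all function names are our assumptions, not the authors' code:

```python
import numpy as np

def percent_error(f_est, f_correct):
    """Percent error: 100 |F_est - F_correct| / F_correct."""
    return 100.0 * abs(f_est - f_correct) / f_correct

def bias_index(F_hat, F_hat_lpc, f0):
    """b_k: RMS distance of the estimated formants to the harmonic nearest
    the LPC estimate, normalized by F0 (smaller = more biased to harmonics)."""
    n = np.rint(np.asarray(F_hat_lpc, float) / f0)   # argmin_n |F_lpc - n*F0|
    d = np.asarray(F_hat, float) - n * f0
    return float(np.sqrt(np.mean(d ** 2)) / f0)

def vowel_variance_ratio(samples):
    """D of the text: sqrt(inter-vowel variance / intra-vowel variance),
    over per-vowel arrays of (F1, F2) rows."""
    mus = {v: x.mean(axis=0) for v, x in samples.items()}
    grand = np.mean(list(mus.values()), axis=0)      # grand mean of the mu_v
    inter = sum(len(x) * np.sum((mus[v] - grand) ** 2) for v, x in samples.items())
    intra = sum(np.sum((x - mus[v]) ** 2) for v, x in samples.items())
    return float(np.sqrt(inter / intra))
```

For example, two vowel clusters that are far apart relative to their internal scatter produce a large D, matching the interpretation given in the text.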
4.2. Natural Vowels
For natural vowels, because correct formant frequencies are unavailable, evaluation by the percent error in formant frequencies is difficult. Thus, we evaluated the performance of the proposed system in two ways.
4.2.1. Vowels with neutral F0
In the first approach, we recorded natural vowels with neutral F0. Six adults (five females and one male) aged from 25 to 40 participated in the vowel recordings. All the participants were native Japanese speakers and exhibited no obvious difficulties in speech production. They gave written informed consent to participate in the present study, which was approved by the NTT Communication Science Laboratories Research Ethics Committee. The participants sat on a chair in front of a microphone and were asked to produce each of the isolated Japanese vowels ten times with their neutral F0. EGG signals were recorded synchronously. The audio and EGG signals were low-pass filtered with a cutoff frequency of 8 kHz and digitized at a sampling frequency of 16 kHz.
4.2.2. Vowels with stepwise ascending and descending F0
In the second approach, to further investigate the efficiency of the proposed method, we compared the robustness of formant estimation between RT-PEAR and the conventional LPC analysis over a wider range of F0. That is, we used the five natural vowels of a male speaker, each of which was produced with F0 ascending and descending in five steps, while changing the articulatory posture as little as possible. The duration of each F0 section was 500 ms. A variance index was calculated from the formant frequencies over a 240-ms interval in the middle of each of the nine F0 sections. A smaller value of this index means that the estimated F1,2 are more robust against changes in F0.
Fig. 3 Mean percent errors with respect to the correct first formant frequencies for the five synthesized Japanese vowels with each set of phase parameters, against fundamental frequency. The upper-left numbers in each panel indicate the filter coefficients α and β in Eq. (8), which determine the phase characteristics of the signals.
In this experiment, we measured articulatory postures using an electro-magnetic articulographic system [20] to confirm whether the participant maintained them during speaking. The articulatory parameters were represented by the vertical and horizontal positions of seven receiver coils, which were placed on the lower incisor, the upper and lower lips, three tongue positions, and the position of the Adam’s apple as larynx height [21].
5. RESULTS
5.1. Synthesized Vowels
Figure 3 shows the mean percent errors in F1 with respect to the correct values against F0. We confirmed that the differences in the percent errors in F2 between the conventional LPC analysis and RT-PEAR were minor: the grand means of the percent errors in F2 were 1.2% and 0.7% for the conventional LPC analysis and RT-PEAR, respectively. Thus, we show the results of the percent errors only for F1.

For the conventional LPC analysis, the errors in F1 were larger for high F0. In particular, for the vowels whose F0 was higher than 180 Hz, excluding 220 and 300 Hz, the errors reached 7%. The results of the conventional LPC analysis did not depend on the phase characteristics of the signals. In contrast, the results for RT-PEAR did depend on the phase characteristics of the signals. For the vowels synthesized with (α, β) = (0.0, 0.0), the errors of RT-PEAR (1 tap) were less than about 2% regardless of F0. Although the errors for RT-PEAR tended to increase with the number of taps, the errors for RT-PEAR with the largest number of taps still remained lower than those of the conventional LPC analysis. On the other hand, for the vowels synthesized with non-minimum phases such as (α, β) = (0.5, 0.5) and (0.9, 0.9), the errors for RT-PEAR (1 tap) showed tendencies similar to those for the conventional LPC analysis, and the errors for RT-PEAR with a larger number of taps became smaller than those for the conventional LPC analysis. As in the conventional LPC analysis, the errors for RT-PEAR were also larger for high F0.

The mean errors over the phase characteristics are shown in Fig. 4. The grand mean of the errors indicates that RT-PEAR (17 taps) showed the best performance. A statistical analysis revealed that the errors for RT-PEAR (17 taps) were significantly lower than those of the conventional LPC analysis for all F0 values except 100, 140, 220, and 300 Hz (paired t-test: p < 0.01 for 120, 160, 180, 200, 260, and 280 Hz, and p < 0.05 for 240 Hz).

Fig. 4 Mean percent errors with respect to the correct first formant frequencies for the five synthesized Japanese vowels against fundamental frequency. The rightmost set of markers shows the grand mean values for LPC and TANDEM RT-PEAR.

In addition to the error analysis, we also measured the bias of the estimated F1 toward harmonics. As shown in Fig. 5, F1 estimated by the conventional LPC analysis was more biased toward harmonics than that estimated by RT-PEAR. For RT-PEAR, when (α, β) = (0.0, 0.0), the index values of RT-PEAR (1 tap) were similar to those of RT-PEAR with a larger number of taps. However, for the vowels with non-minimum phases, the index values of RT-PEAR (1 tap) were lower than those of RT-PEAR with a larger number of taps and similar to those of the conventional LPC analysis.

Table 2 shows the values of the inter- to intra-vowel variance index for the conventional LPC analysis and RT-PEAR with 1 and 17 taps. Regardless of the number of taps and the phase characteristics of the speech signals, the index values for RT-PEAR were larger. For the vowels with (α, β) = (0.0, 0.0), the index values of RT-PEAR (1 tap) were larger than those of RT-PEAR (17 taps). On the other hand, for the vowels with (α, β) = (0.5, 0.5) and (0.9, 0.9), the index values of RT-PEAR (17 taps) were larger than those of RT-PEAR (1 tap). RT-PEAR (1 tap) thus seems more sensitive to the phase characteristics of the speech signals than RT-PEAR (17 taps): 69.7 for (α, β) = (0.0, 0.0) but 26.5 for (0.9, 0.9).

Table 2 The ratio of inter- to intra-vowel variance evaluated for the vowels synthesized with (α, β) = (0.0, 0.0), (0.5, 0.5), and (0.9, 0.9).

  (α, β)              (0.0, 0.0)   (0.5, 0.5)   (0.9, 0.9)
  LPC                 23.2         23.3         23.2
  RT-PEAR (1 tap)     69.7         21.9         26.5
  RT-PEAR (17 taps)   37.5         38.7         36.9

The results of both the bias index and the inter- to intra-vowel variance index evaluations seem consistent with those observed in the percent errors, indicating that both indices reflect the percent errors.

5.2. Natural Vowels
In the following analysis, we adopted for RT-PEAR the optimal number of taps (17) suggested by the analysis of the synthesized vowels.
5.2.1. Vowels with neutral F0
Figure 6 shows the values of the bias index for each vowel. A statistical analysis of the pooled data of the representatives b_1, b_2, and b_{1,2} revealed that RT-PEAR was significantly
Fig. 5 Mean values of the bias index toward harmonics b_1 for the five synthesized Japanese vowels against fundamental frequency F0, for the conventional LPC analysis and RT-PEAR with 1, 5, 9, 13, and 17 taps. The index is defined as the distance from the estimated first formant frequency F1 to the harmonics, normalized by F0. The upper-right numbers in each panel indicate the filter coefficients α and β in Eq. (8), which determine the phase characteristics of the signals.
Table 3 The ratio of inter- to intra-vowel variance evaluated for the natural vowels with neutral fundamental frequencies F0 . The number of taps for TANDEM RT-PEAR was 17.
Fig. 6 Mean differences (±1 S.D.) between the values of the bias indices toward harmonics of TANDEM RT-PEAR (17 taps) and those of the conventional LPC analysis, for the representatives b_1, b_2, and b_{1,2}. Horizontal solid lines represent the mean values of the index over vowel categories. The indices are defined as the distance from the estimated formant frequencies F1,2 to the harmonics. A positive value means that F1,2 estimated by TANDEM RT-PEAR were less biased toward harmonics than those estimated by the conventional LPC analysis. The mean fundamental frequency for each participant is shown in Table 3. Single and double asterisks denote statistically significant differences between the methods at p < 0.05 and p < 0.01, respectively (Wilcoxon signed-rank test).
Fig. 7 Distributions of the representative formant frequencies estimated from the natural vowels with neutral F0 of each participant. The number of taps for TANDEM RT-PEAR was 17. The mean fundamental frequency for each participant is shown in Table 3.
less biased toward harmonics than the conventional LPC analysis (paired t-test: t(299) = 6.40, p < 0.01 for b_1; t(299) = 13.06, p < 0.01 for b_2; and t(299) = 12.48, p < 0.01 for b_{1,2}). Figure 7 and Table 3 show the distributions of the representative F1,2 and the inter- to intra-vowel variance index for each participant. The mean values of the inter- to intra-vowel variance index show that F1,2 estimated by RT-PEAR were better clustered than those estimated by the conventional LPC analysis. These results, considering the discussion in Sect. 5.1, may indicate that the errors in F1,2 estimated by RT-PEAR were lower than those estimated by the conventional LPC analysis.

Table 3 (data; see the caption above):

  Speaker   Mean F0 [Hz]   LPC    RT-PEAR
  A         124.2          19.5   21.6
  B         214.5          14.0   14.8
  C         214.6          26.3   25.2
  D         229.0          28.7   30.1
  E         240.8          17.7   20.0
  F         242.9          21.4   24.1
  Mean      211.0          21.3   22.6

5.2.2. Vowels with stepwise ascending and descending F0
For the analysis of the natural vowels with stepwise ascending and descending F0, we measured the variances among the representative F1,2 of the stable sections. The averaged minimal and maximal values of F0 over the vowels were 128.2 and 192.1 Hz, respectively. The average standard deviations of the articulatory postures were 0.11, 0.11, 0.29, and 0.49 mm for the jaw, lips, tongue, and larynx, respectively, indicating that changes in articulatory posture were sufficiently small. Table 4 shows that the variances of F1,2 estimated by RT-PEAR were lower than those of the conventional LPC analysis, except for a minor difference in F2 of the vowel /u/. The mean values of the bias indices b_1, b_2, and b_{1,2} of the pooled data for RT-PEAR (0.241, 0.263, and 0.275) were larger than those for the conventional LPC analysis (0.212, 0.238, and 0.247). The inter- to intra-vowel variance index of RT-PEAR was also larger than that of the conventional LPC analysis (10.5 for RT-PEAR; 8.4 for the conventional LPC analysis).

Table 4 The variance among the representative formant frequencies F1,2 of the stable sections in the vowels with stepwise ascending and descending fundamental frequency F0. The number of taps for TANDEM RT-PEAR was 17. The averaged minimal and maximal F0 values over the vowels were 128.2 and 192.1 Hz, respectively.

              F1                    F2
          LPC       RT-PEAR     LPC       RT-PEAR
  /i/       550.5     324.3     2,180.3   2,053.9
  /e/       183.2      96.7       677.4     468.9
  /a/     1,008.2     759.2     1,027.0     722.0
  /u/       169.3     109.9       910.4     911.0
  /o/       508.9     398.2     7,188.3   3,379.1
Table 5  Mean percent errors with respect to the correct first formant frequencies for the five synthesized Japanese vowels when F0 is 180 Hz. The number of taps for TANDEM RT-PEAR was 17.

            /i/   /e/   /a/   /u/   /o/
LPC        11.0   3.9   2.8   8.0   5.6
RT-PEAR     2.9   0.6   1.3   2.8   2.2
6. DISCUSSION
We proposed a novel formulation of PEAR, which was expected to achieve more robust formant estimation than the conventional LPC analysis with less computational complexity than our original PEAR. In the evaluation using synthesized vowels, the estimation errors for the conventional LPC analysis tended to be larger for vowels whose F0 was higher than 180 Hz, while the errors for RT-PEAR (17 taps) were significantly lower than those for the conventional LPC analysis, except for the vowels whose F0 was 100, 140, 220, or 300 Hz. One possible reason for the lack of significance at these F0 values is that the average F1 of the synthesized vowels was 462 Hz (Table 1). That is, if the correct F1 is an integer multiple of F0, i.e., F0 is around 115.5, 154.0, or 231.0 Hz in these experiments, the bias of F1 toward harmonics in the conventional LPC analysis would not increase the errors in F1. As with the conventional LPC analysis, the errors for RT-PEAR also tended to be larger for high F0. However, one of the most important advantages of RT-PEAR is that its error range across F0 was much smaller than that of the conventional LPC analysis. That is, according to the previous discussion, the errors for the conventional LPC analysis were affected by the relationship between F0 and the correct F1, whereas RT-PEAR was much less affected by this relationship. Thus, RT-PEAR can be applied to vowels with a wider range of F0 than the conventional LPC analysis. Because of this relationship between F0 and the correct F1, it is well known that the errors for the conventional LPC analysis are large for high vowels, such as /i, u/. Table 5 shows the estimation errors of the five Japanese vowels for the conventional LPC analysis and RT-PEAR when F0 was 180 Hz. For the vowels /i, u/, both the errors for the conventional LPC analysis and the error reductions achieved by RT-PEAR were larger than for the other vowels.
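The harmonic bias discussed above can be reproduced with a plain autocorrelation LPC analysis. The sketch below is standard textbook LPC (Levinson-Durbin plus root-finding), not the paper's RT-PEAR implementation; the sampling rate, formant frequencies and bandwidths are illustrative values chosen for the demonstration.

```python
import numpy as np

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / e
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        e *= 1.0 - k * k
    return a

def formants_from_lpc(a, fs):
    """Formant candidates: angles of the complex roots of A(z) above the real axis."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0.01]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

def synth_vowel(f0, fs, formants_hz, bws_hz, n=2048):
    """Impulse train driving a cascade of second-order resonators (all-pole)."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bws_hz):
        r = np.exp(-np.pi * bw / fs)
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(2.0 * np.pi * f / fs), r * r])
    src = np.zeros(n)
    src[::int(round(fs / f0))] = 1.0  # glottal pulses every pitch period
    y = np.zeros(n)                   # direct-form all-pole filtering
    for t in range(n):
        acc = src[t]
        for k in range(1, min(t, len(a) - 1) + 1):
            acc -= a[k] * y[t - k]
        y[t] = acc
    return y

fs = 10000.0
x = synth_vowel(100.0, fs, [700.0, 1200.0], [80.0, 90.0]) * np.hamming(2048)
est = formants_from_lpc(lpc_coeffs(x, 4), fs)  # two formant candidates
```

Re-running the last two lines with a higher `f0` illustrates the bias: as F0 rises, the sparser harmonic sampling of the spectral envelope pulls the estimated peaks toward the nearest harmonics.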
The error range among the vowels for RT-PEAR was smaller than that for the conventional LPC analysis. The results for the synthesized vowels also showed that the optimal number of taps for RT-PEAR is related to the phase characteristics of the input speech signals. For the vowels synthesized with (·, ·) = (0.0, 0.0), RT-PEAR (1 tap)
Fig. 8 An instance of the spectrum (dashed line) and spectral envelopes (solid lines) of the Japanese vowel /i/ produced by participant D. The inset and main panels show the whole spectrum and the region around the first and second harmonics, respectively. The gray solid lines were obtained by the conventional LPC analysis, and the black ones by TANDEM RT-PEAR (17 taps).
showed better performance than RT-PEAR with a larger number of taps. On the other hand, for the vowels synthesized with (·, ·) = (0.5, 0.5) and (0.9, 0.9), the errors for RT-PEAR with a larger number of taps were lower than those of RT-PEAR (1 tap). One tap in PEAR means that no phase equalization is conducted on the LPC residual signals. The LPC residual signals for the vowels synthesized with (·, ·) = (0.0, 0.0) are similar to an impulse train. Thus, LPC with the impulse train model assumed by PEAR could represent minimum-phase speech signals well without phase equalization, and the best formant estimation accuracy was achieved by RT-PEAR (1 tap). However, LPC with an impulse train model would be insufficient for representing natural speech signals, whose phase characteristics are generally not minimum phase. Therefore, in order to improve the estimation accuracy of formant frequencies for natural speech signals, it is important to apply phase equalization with more than one tap, in accordance with the phase characteristics of the input speech signals. The error analysis for the synthesized vowels suggested that the optimal number of taps for RT-PEAR was 17. As Fig. 4 shows, the larger the number of taps, the lower the errors. This is because the phase-equalized LPC residual signals more closely resemble the impulse train assumed by the speech production model as the number of taps increases. Note, however, that under the assumption of Eq. (1) that phase equalization is conducted for each pulse, the number of taps should not be larger than the interval between pitch marks. Moreover, the phase characteristics of
input speech signals are important for determining an adequate number of taps. Thus, considering the phase characteristics, a number of taps larger than 17 would not work appropriately for vowels with high F0 in the present study. Taken together, when applied to synthesized vowels, RT-PEAR can robustly estimate formant frequencies even for speech signals with high F0. In order to evaluate more practical performance, we also assessed F1,2 estimation from two kinds of natural speech signals: vowels with neutral F0 and vowels with stepwise ascending and descending F0. To quantify the performance of the conventional LPC analysis and RT-PEAR for the natural vowels, we measured the bias index and the ratio of inter- to intra-vowel variance. First, we analyzed the natural vowels with neutral F0. Figure 6 shows that almost all the bias index values of individual vowels were above zero, indicating that RT-PEAR was less biased toward harmonics than the conventional LPC analysis was. Statistical significance was observed for more than half of the vowels (53.3% for the F1 bias index, 56.8% for F2 and 66.7% for F1,2; Wilcoxon signed-rank test: p < 0.05). Figure 8 shows an example of the spectral envelopes estimated by RT-PEAR and the conventional LPC analysis. The first peak of the spectral envelope (F1) estimated by the conventional LPC analysis was indeed closer to the first harmonic than that estimated by RT-PEAR, supporting the results of the bias index. The figure also shows that the envelope estimated by RT-PEAR fitted the harmonic peaks more closely, as in another robust estimation method [2]. However, for /i/ of participant B and /i, e, u/ of participant C, the results of the bias index did not show the effectiveness of RT-PEAR. This would be because phase equalization with 17 taps did not work effectively for these vowels. Figure 9 shows that RT-PEAR would outperform the conventional LPC analysis for these vowels if the optimal number of taps could be selected.
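The idea behind the per-pulse phase equalization discussed above can be sketched as a small least-squares problem: fit an M-tap FIR filter that maps the LPC residual around a pitch mark onto a unit impulse. This is only an illustration of the principle; the paper's actual formulation (and that of Refs. [6,7]) differs in detail, and the function below is a hypothetical helper.

```python
import numpy as np

def phase_equalizer(residual, pitch_mark, n_taps):
    """Least-squares FIR filter h (length n_taps, centered on zero lag) such
    that (h * residual) approximates a unit impulse at the pitch mark.
    Circular shifts are used for simplicity at the segment edges.
    """
    half = n_taps // 2
    # Convolution matrix: one circularly shifted copy of the residual per tap.
    C = np.column_stack([np.roll(residual, k - half) for k in range(n_taps)])
    d = np.zeros(len(residual))
    d[pitch_mark] = 1.0  # desired equalized residual: a single impulse
    h, *_ = np.linalg.lstsq(C, d, rcond=None)
    return h
```

With `n_taps=1` the filter degenerates to a scalar gain, which matches the observation in the text that one tap means no phase equalization; more taps let the filter concentrate a dispersed residual back into an impulse-like pulse.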
For participant C, the inter- to intra-vowel variance index also did not show the effectiveness of RT-PEAR (17 taps), but the index for all the other participants suggested the superiority of RT-PEAR over the conventional LPC analysis. Second, to analyze the natural vowels with stepwise ascending and descending F0, we calculated the variance index in addition to the bias index and the ratio of inter- to intra-vowel variance. These indices showed that RT-PEAR has a stronger tolerance to changes in F0 than the conventional LPC analysis. However, the results would need to be reconsidered in terms of speech production. To summarize the results for the natural vowels, although RT-PEAR would be able to robustly estimate
Fig. 9 Mean differences (±1 S.D.) between the values of the bias index toward harmonics for F1 of TANDEM RT-PEAR and those of the conventional LPC analysis for speakers B and C. Black and gray filled markers represent the mean differences for the optimal numbers of taps and for the number of taps used in Fig. 6, respectively. The index is defined as the distance from the estimated formant frequency F1 to the harmonics. A positive value means that F1 estimated by TANDEM RT-PEAR was less biased toward harmonics than that estimated by the conventional LPC analysis.
Table 6  Computational complexity in terms of the number of products.

Method             Number of products
LPC                O(LP + P²)
LPC with TANDEM    O(2LP + P²)
PEAR               O(2LP + 3P² + SP + IM + LM + PI)
RT-PEAR            O(LP + 2P² + SP + PIM + 2PI + PM)
TANDEM RT-PEAR     O(2LP + 2P² + SP + 2PIM + 4PI + 2PM)
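The relative costs implied by these product counts can be checked numerically. The paper reports ratios of 4.3, 1.9 and 3.5 for its own analysis parameters; since L, P and S are not restated in this section, the values below are illustrative assumptions, and only the ordering (RT-PEAR the cheapest PEAR variant, at less than half the cost of the original PEAR) is checked.

```python
def product_counts(L, P, S, I, M):
    """Leading product counts from Table 6 (constants dropped, as in the O(.) rows).
    L: window length, P: LPC order, S: frame shift, I: iterations, M: taps.
    """
    return {
        "LPC": L * P + P**2,
        "PEAR": 2 * L * P + 3 * P**2 + S * P + I * M + L * M + P * I,
        "RT-PEAR": L * P + 2 * P**2 + S * P + P * I * M + 2 * P * I + P * M,
        "TANDEM RT-PEAR": 2 * L * P + 2 * P**2 + S * P
                          + 2 * P * I * M + 4 * P * I + 2 * P * M,
    }

# Assumed analysis parameters; I and M follow the text (I = 3, M = 16).
counts = product_counts(L=400, P=12, S=80, I=3, M=16)
ratios = {k: v / counts["LPC"] for k, v in counts.items()}
```

Under these assumptions the ratios differ from the paper's figures, but the qualitative conclusion is the same: the dominant L·M term of PEAR is replaced by the much smaller P·I·M term in RT-PEAR.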
formant frequencies, adequate phase equalization in accordance with the phase characteristics of natural speech signals is required. Aside from the accuracy of formant estimation, one of the important features of the proposed system is its computational complexity. Table 6 shows the computational complexity evaluated in terms of the number of products in each algorithm; the number of quotients is negligibly small. Here, S denotes the frame shift size. For I = 3 and M = 16, the computational complexities of PEAR, RT-PEAR without TANDEM, and TANDEM RT-PEAR are 4.3, 1.9, and 3.5 times as large as that of the conventional LPC analysis without TANDEM, respectively. Thus, the computational complexity of RT-PEAR without TANDEM is less than half that of the original PEAR. In the present study, we implemented TANDEM RT-PEAR on a DSP whose processing delay was 12 ms, although this value depends on the operating system, the
audio interface, and other factors. The computation time for one analysis window, including pitch mark detection and formant estimation, was within 4 ms, which is the length of one buffer. One possible application of a real-time formant tracking system is in transformed auditory feedback experiments [8–10]. In these experiments, which used transformed auditory feedback systems based on the conventional LPC analysis [9,10], the processing delays are at most 11 ms, indicating that the delay of the proposed method is adequate for such experiments. Another important feature is the stability of pitch mark detection. We preliminarily confirmed that, when pitch marks were detected from LPC residual signals alone, misdetection of pitch marks, especially for speech signals with high F0, sometimes degraded the performance of formant estimation by PEAR. Thus, the stable performance of RT-PEAR in the present study was due to the improved reliability of pitch mark detection using EGG. Even though the system requires EGG signals, it may still be effective for experimental and medical use. However, a method for extracting pitch marks from speech signals without EGG signals, e.g., [22], is necessary in order for our system to be widely used in the future.
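EGG-based pitch marking is attractive because glottal closures appear as sharp transients in the differentiated EGG (dEGG). The sketch below picks dEGG peaks above a relative threshold with a minimum-period spacing constraint; it is a common approach to glottal-closure detection, not necessarily the paper's exact method, and the threshold and polarity convention are assumptions (depending on electrode polarity, closures may instead appear as dEGG minima).

```python
import numpy as np

def pitch_marks_from_egg(egg, fs, f0_max=400.0, thresh=0.3):
    """Glottal-closure candidates from an EGG signal.

    Marks local peaks of the differentiated EGG above `thresh` times its
    maximum magnitude, enforcing a spacing of at least one minimal pitch
    period (fs / f0_max samples) between consecutive marks.
    """
    degg = np.diff(egg)
    min_dist = int(fs / f0_max)
    level = thresh * np.max(np.abs(degg))
    marks = []
    i = 1
    while i < len(degg) - 1:
        if degg[i] > level and degg[i] >= degg[i - 1] and degg[i] > degg[i + 1]:
            marks.append(i)
            i += min_dist  # skip ahead: at most one mark per pitch period
        else:
            i += 1
    return np.array(marks)
```

The minimum-distance constraint is what gives the method its stability: spurious secondary peaks within one pitch period, a typical failure mode of residual-based detection at high F0, cannot produce extra marks.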
7. CONCLUSIONS
In the current study, we presented a real-time (12-ms delay) robust formant estimation system using RT-PEAR and evaluated its performance for synthesized and natural vowels. Statistical results suggested that RT-PEAR with more than one tap is superior to the conventional LPC analysis in terms of robustness to both F0 and the phase characteristics of speech signals, indicating that RT-PEAR is less biased toward harmonics. These results indicate that RT-PEAR can be applied to vowels with a wider range of F0 than the conventional LPC analysis can.
ACKNOWLEDGEMENTS
The authors thank H. Uchida of The University of Tokyo for help with programs and Dr. H. Gomi for many useful discussions.

REFERENCES
[1] S. Hiroya, "Formant analysis of vowels: Process and hypotheses," J. Acoust. Soc. Jpn. (J), 70, 538–544 (2014) (in Japanese).
[2] A. El-Jaroudi and J. Makhoul, "Discrete all-pole modeling," IEEE Trans. Signal Process., 39, 411–423 (1991).
[3] A. Sasou and K. Tanaka, "Glottal source modeling using HMM and robust analysis of high fundamental frequency speech," IEICE Trans. Inf. Syst., 84, 1960–1969 (2001) (in Japanese).
[4] T. Ohtsuka and H. Kasuya, "Robust ARX speech analysis method taking voicing source pulse train into account," J. Acoust. Soc. Jpn. (J), 58, 386–397 (2002) (in Japanese).
[5] P. Alku, J. Pohjalainen, M. Vainio, A.-M. Laukkanen and B. H. Story, "Formant frequency estimation of high-pitched vowels using weighted linear prediction," J. Acoust. Soc. Am., 134, 1295–1313 (2013).
[6] S. Hiroya and T. Mochida, "Phase equalization-based autoregressive model of speech signals," Proc. Interspeech 2010, pp. 42–45 (2010).
[7] M. Honda, "Speech coding using waveform matching based on LPC residual phase equalization," Proc. IEEE ICASSP, pp. 213–216 (1990).
[8] D. W. Purcell and K. G. Munhall, "Compensation following real-time manipulation of formants in isolated vowels," J. Acoust. Soc. Am., 119, 2288–2297 (2006).
[9] V. M. Villacorta, J. S. Perkell and F. H. Guenther, "Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception," J. Acoust. Soc. Am., 122, 2306–2319 (2007).
[10] S. Cai, S. S. Ghosh, F. H. Guenther and J. S. Perkell, "Focal manipulations of formant trajectories reveal a role of auditory feedback in the online control of both within-syllable and between-syllable speech timing," J. Neurosci., 31, 16483–16490 (2011).
[11] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., 27, 187–207 (1999).
[12] H. Banno, H. Hata, M. Morise, T. Takahashi, T. Irino and H. Kawahara, "Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation," Acoust. Sci. & Tech., 28, 140–146 (2007).
[13] M. Morise, T. Matsubara, K. Nakano and T. Nishiura, "A rapid spectrum envelope estimation technique of vowel for high-quality speech synthesis," IEICE Trans. Inf. Syst., 94, 1079–1087 (2011) (in Japanese).
[14] M. Morise, "CheapTrick, a spectral envelope estimator for high-quality speech synthesis," Speech Commun., 67, 1–7 (2015).
[15] H. Kawahara, M. Morise, T. Takahashi, R. Nishimura, T. Irino and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," Proc. IEEE ICASSP, pp. 3933–3936 (2008).
[16] G. H. Golub and C. F. van Loan, Matrix Computations, 3rd ed. (The Johns Hopkins University Press, Baltimore, MD, 1996).
[17] D. H. Klatt, "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am., 67, 971–995 (1980).
[18] X. Sun, F. Plante, B. M. G. Cheetham and K. W. T. Wong, "Phase modelling of speech excitation for low bit-rate sinusoidal transform coding," Proc. IEEE ICASSP, pp. 1691–1694 (1997).
[19] Y. Miyoshi, K. Yamato, M. Yanagida and O. Kakusho, "Analysis of speech signals of short pitch period by a two-stage sample-selective linear prediction," IEICE Trans. Fundam. Electron., 70, 1146–1156 (1987) (in Japanese).
[20] T. Kaburagi and M. Honda, "Calibration methods of voltage-to-distance function for an electro-magnetic articulometer (EMA) system," J. Acoust. Soc. Am., 101, 2391–2394 (1997).
[21] S. Hiroya, T. Mochida and M. Honda, "A relationship between articulatory positions and formant information by human articulatory-acoustic data," Proc. Autumn Meet. Acoust. Soc. Jpn., pp. 297–298 (2003) (in Japanese).
[22] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio Speech Lang. Process., 16, 1602–1613 (2008).