IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 2, MARCH 2006
A New Method for Obtaining Accurate Estimates of Vocal-Tract Filters and Glottal Waves From Vowel Sounds
Huiqun Deng, Student Member, IEEE, Rabab Kreidieh Ward, Fellow, IEEE, Michael Peter Beddoes, Life Senior Member, IEEE, and Murray Hodgson
Abstract—Previously, estimating vocal-tract filters and glottal waves from vowel sounds imposed either the invalid assumption that glottal waves over closed glottal intervals are zero, or parametric models for glottal waves, resulting in biased vocal-tract-filter estimates and glottal-wave estimates lacking information over closed glottal intervals. We obtain unbiased vocal-tract-filter estimates from sustained vowel sounds, for which the glottal waveforms are periodically stationary random processes. Two equations are constructed, each relating the vocal-tract filter to the sound signal and the glottal wave over one of two closed glottal intervals. By subtracting one equation from the other, the periodic components of the glottal wave are eliminated from the vocal-tract-filter estimation, and an unbiased vocal-tract-filter estimate is obtained. The average of many such estimates from different closed glottal intervals of the sound is the final estimate, which is used to obtain the glottal wave by inverse filtering the sound. The results obtained from vowel sounds /a/ produced by some subjects are presented. Over closed glottal phases, the glottal waves obtained are nonzero. During vocal-fold colliding, they increase; during vocal-fold parting, they decrease or even increase. The vocal-tract filters obtained yield vocal-tract area functions similar to that measured from an unknown subject's magnetic resonance image.
Index Terms—Glottal wave, parameter estimation, speech analysis, vocal-tract filter.
Manuscript received September 6, 2004; revised April 2, 2005. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Li Deng.
H. Deng was with the Electrical and Computer Engineering Department, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. She is now with INRS-EMT, Montreal, QC H5A 1K6, Canada (e-mail: [email protected]).
R. K. Ward and M. P. Beddoes are with the Electrical and Computer Engineering Department, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]; [email protected]).
M. Hodgson is with the Mechanical Engineering Department, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSA.2005.857811

I. INTRODUCTION
OBTAINING accurate estimates of glottal waves and vocal-tract filters (VTFs) from vowel-sound signals has long been of interest to many researchers. Glottal-wave estimates are needed in vocal-fold function analysis, speaker identification, and natural-sounding speech synthesis. VTF estimates are needed in speech recognition, speech synthesis, and vocal-tract shape detection. A vowel-sound signal is the convolution of the glottal wave and the vocal-tract filter. Estimating the glottal wave and the VTF from a vowel-sound signal without knowing either of them is an ill-defined inverse problem. In solving this problem, previous approaches impose some assumptions about the unknown glottal wave.
Some methods obtain the parameters of the VTF over the closed glottal phase, assuming that the glottal wave is zero over that phase; the glottal-wave signal is then obtained by integrating the sound signal inverse filtered with the VTF estimate [1]–[3]. Errors in these methods arise from the assumption that glottal waves over closed glottal phases are always zero. This assumption is invalid because, in reality, the glottis is rarely completely closed during the closed glottal phase, so the glottal wave over that phase is not always zero. In addition, during the closed glottal phase there may be acoustic disturbances caused by the movements of the vocal-fold surface. Although the airflow caused by these effects may be small compared to the glottal airflow over open glottal phases, the vocal-fold surface movements may be fairly rapid [8]. In this case, the derivative of the glottal wave should not be ignored, since it is the time derivative of the glottal wave (not the total glottal airflow) that contributes to the speech sound. As a result, the VTF parameters obtained under this assumption are biased by the nonzero glottal wave over closed glottal phases, and the glottal-wave signal cannot be correctly recovered by inverse filtering the speech signal using the VTF estimate.
Other previous methods jointly estimate the parameters of a glottal-wave model and of the VTF from one cycle of the vowel-sound signal [4]–[7]. Errors in these methods arise from two sources. The first is the simplified parametric model of the unknown glottal wave. Parametric models, e.g., the LF model [9], are approximations of the unknown glottal wave and are unable to capture all details of the glottal waveform. Also, parametric models ignore the influence of the vocal-tract resonance on the glottal wave.
Consequently, the glottal waveform obtained lacks details, and the resulting VTF estimate is biased by the residual glottal-wave components. The second error source is the time-varying glottis. Since the VTF parameters are estimated over one glottal cycle of the vowel sound, the effect of the time-varying glottal impedance corresponding to the open glottis is included in the VTF estimate. It has been shown that the glottal impedance increases the formant bandwidths and frequencies of the VTF estimate [10].
Although previous methods may yield estimates that are good enough for some applications, applications such as speaker identification, voice analysis, and vocal-tract shape detection need more accurate estimates of glottal waves and of VTFs. Accurate VTF estimates should contain as little as possible of the effects of open glottises and of glottal waves. To overcome the difficulty of not knowing the glottal-wave signals, this study proposes to estimate VTFs from sustained vowel sounds. Since the pitch, the loudness, and the vocal-tract shape are kept unchanged for such a sound, the glottal wave can be characterized as a periodically stationary random process in the VTF estimation, and its periodic components can be eliminated from that estimation. It is shown that unbiased VTF estimates can be obtained from sustained nonnasalized vowel-sound signals. The glottal waves can then be obtained by inverse filtering the vowel sounds. The main contribution of this study is to transform the ill-defined inverse problem into an over-determined parameter-estimation problem, based on knowledge about the acoustic source-tract system.
In Section II, we clarify the concepts related to VTFs and glottal waves, and their relationships with speech signals. Based on these concepts and relationships, in Section III, we provide a method for detecting glottal phases using sustained vowel-sound signals. In Section IV, we present how to estimate the VTF over closed glottal phases without using existing assumptions about glottal waves. In Section V, we present a method for determining the subsegments of a recorded speech signal required for the VTF estimation. In Section VI, we discuss the factors that affect the estimate of the glottal wave obtained by inverse filtering the speech signal. Section VII presents the results and discussion.
1558-7916/$20.00 © 2006 IEEE
II. GLOTTAL WAVE AND VOCAL-TRACT FILTER
The glottal wave is the total volume velocity at the back end of the vocal tract. When the glottis is open, the glottal wave is the airflow coming from the trachea and passing through the glottis, and is determined by the trans-glottal pressure and the glottal area [11]. When the glottis is closed, the glottal wave may not be zero, since the vocal-fold movements may cause acoustic disturbances. The production of the glottal wave can be modeled using an equivalent volume-velocity source and an equivalent source impedance, as shown in Fig. 1 [11], [12]. The source impedance is the acoustic impedance looking into the trachea from the back end of the vocal tract; it becomes infinite when the glottis is closed. The vocal-tract driving-point impedance is the impedance looking from the back end of the vocal tract into the vocal tract. The lip radiation impedance converts the volume velocity to the sound pressure in space.

Fig. 1. Acoustic model of the vowel production system.

The vocal tract acts as an acoustic filter. It modulates its input to produce the volume velocity at the lip opening. The transfer function of the vocal tract should be determined by its shape, and should not include the effect of the glottal impedance. Thus, we define the VTF transfer function as the ratio of the Fourier transform of the lip volume velocity to the Fourier transform of the glottal wave. The VTF is different from the glottis-vocal-tract filter (GVTF), whose transfer function we define as the ratio of the Fourier transform of the lip volume velocity to the Fourier transform of the equivalent glottal source. The GVTF contains the effect of the time-varying and nonlinear glottal impedance, and is therefore itself time-varying and nonlinear. From Fig. 1, when the glottis is closed (i.e., when the source impedance is infinite), the GVTF transfer function becomes equal to the VTF transfer function.
In discrete-time signal processing, the vocal tract is modeled using a cylindrical tube with M sections, each having the same length but a different cross-sectional area, as shown in Fig. 2 [13], [14]. Each section carries a positive-going and a negative-going volume velocity, defined at the left end of the section, and the two components are linked by the continuity of volume velocity. If the number of sections of the tube model is related to the sampling rate of the speech signal, the length of the vocal tract, and the sound speed by (1) [13], [14], then the discrete-time volume-velocity signal flow diagram of the VTF can be represented using Fig. 3 [15]. In Fig. 3, the convention that positive-going and negative-going volume velocities have the same reference direction is used [16].

Fig. 2. Acoustic tube model of the vocal tract and the volume velocities in the tube model.

Fig. 3. Signal flow diagram from the glottal source to the lip volume velocity.

We define the transfer function of the VTF in the discrete-time domain by (2), as the ratio of the z transforms of the discrete-time lip volume velocity and glottal-wave signals. In contrast, we define the transfer function of the GVTF in the discrete-time domain by (3), with the z transform of the discrete-time glottal source signal in place of that of the glottal wave. It is shown in [15] that the GVTF can be written as (4) in terms of three kinds of reflection coefficients: the glottal reflection coefficient (5), the reflection coefficient at the boundary between each pair of adjacent sections (6), and the reflection coefficient at the lip opening (7).
The lip reflection coefficient is frequency-independent only if the speech signal is limited to a very low frequency range. Over a wider frequency range, the lip reflection coefficient acts as a low-pass filter. In the discrete-time domain, we model the frequency-dependent lip reflection coefficient as (8), whose parameters are constrained and are functions of the lip opening area. The glottal reflection coefficient is a time-varying high-pass filter. In the z domain, it can be approximated using a first-order FIR filter (9).
Substituting the lip and glottal reflection coefficients of (8) and (9) into (4), we obtain the GVTF transfer function in (10), whose coefficients are functions of the reflection coefficients. Since the glottal reflection coefficient is time-varying, the GVTF is also time-varying. The VTF transfer function can be obtained from (4) by setting the glottal reflection coefficient to its closed-glottis value, which gives (11), whose coefficients are functions of the remaining reflection coefficients.
The sound pressure at a microphone placed at a distance from the lips is related to the lip volume velocity by a derivative factor [10], [16], as given in (12), which involves the air density, the distance from the lips to the microphone, and the sound speed. From (2), (11), and (12), we can relate the sound pressure to the glottal-wave signal as shown in (13). From (3), (10), and (12), we can relate the sound-pressure signal to the glottal source signal as shown in (14).
According to (13), the glottal-wave signal can be obtained from the speech signal, given the VTF estimate. To estimate the VTF, we first need to detect the closed glottal phases. In Section III, we provide a method for estimating glottal phases using the glottal source signal obtained from the speech signal according to (14).
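As a rough numerical illustration of the tube model described above, the following sketch computes the number of equal-length sections from the vocal-tract length and the sampling rate. It assumes that (1) is the standard lossless-tube relation M = 2·L·Fs/c; the symbols L, Fs, and c, the rounding, the 44.1-kHz sampling rate, and the 350-m/s sound speed are all assumptions made for this example and are not taken from the paper. The junction reflection coefficients use one common sign convention, which may differ from the paper's (6).

```python
def num_tube_sections(tract_length_m, fs_hz, sound_speed_m_s=350.0):
    """Number of equal-length sections, assuming M = 2 * L * Fs / c.

    Each section then contributes half a sampling period of one-way delay.
    """
    return round(2.0 * tract_length_m * fs_hz / sound_speed_m_s)


def junction_reflection_coeffs(areas):
    """Reflection coefficient at each junction between adjacent sections.

    Assumed convention: r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k).
    """
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


if __name__ == "__main__":
    # Hypothetical values: a 17.5-cm tract sampled at 44.1 kHz.
    print(num_tube_sections(0.175, 44100.0))
    # Hypothetical normalized areas for a four-section tube.
    print(junction_reflection_coeffs([1.0, 2.0, 2.0, 0.5]))
```

A junction between two sections of equal area yields a zero reflection coefficient, which is a quick sanity check on whichever sign convention is adopted.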
III. ESTIMATING GLOTTAL PHASES
Glottal phases are usually detected using electroglottographic (EGG) signals or high-speed cameras. However, in many situations, these are not available. In this section, we present a method for detecting the glottal phases using vowel sounds.
It is known that when the glottis is open, the equivalent glottal source is a linear function of the time-varying trans-glottal area [11], [12]. Therefore, the glottal closure instant can be detected as the instant at which the time derivative of the glottal source reaches its negative maximum peak in a cycle. The opening glottal phase (when the glottal area is increasing) can be detected as the interval over which this derivative remains positive; the closing glottal phase (when the glottal area is decreasing) can be detected as the interval over which it remains negative. Within a glottal cycle, the interval after the glottal closure instant and before the glottal opening is identified as the closed glottal phase. In the following, we estimate the time derivative of the glottal source from the vowel signal.
Taking the time-domain equivalent of (14), the speech signal can be represented as (15). We obtain the derivative of the glottal source signal by inverse filtering the speech signal using the time-averaged version of the GVTF, i.e., using the time-averaged versions of its coefficients. Consider one glottal cycle, whose length in samples is the pitch period. According to (15), the averaged coefficients should satisfy (16a) or, in matrix form, (16b). In (16b), the averaged coefficient vector and the glottal source vector are unknowns, the remaining vector and matrix consist of speech-signal measurements over the cycle, and an error vector accounts for the approximation made by representing the time-varying coefficients by their averaged versions. Equation (16b) represents an under-determined system of linear equations, since the number of unknowns is larger than the number of equations.
To overcome this difficulty, we use a "sustained" vowel sound since, for such a sound, the glottal source and the vocal-tract shape remain nearly unchanged. We use the glottal source signal of the first cycle to represent the glottal source signal in a second cycle. It is better to select a second cycle that is as near as possible to the first, so that the change in the glottal wave is minimal. The starting point in the second cycle is chosen to have the same relative position as the starting point in the first cycle. Then, more equations in the same unknowns can be obtained, as shown in (17a) or, in matrix form, (17b), where the measurement vector and matrix consist of speech samples over the second cycle and an error vector again accounts for the approximation. The measurements in (17b) differ from those in (16b), since the speech signal is not exactly periodic, due to the randomness of the turbulence noise contained in the glottal wave. Combining (16b) and (17b), we get (18).
In (18), the stacked measurements form a matrix and the remaining quantities are vectors. The unknowns can be obtained by taking the least-squares error solution of (18). Equation (18) represents an over-determined system of linear equations, since the number of equations is larger than the number of unknowns. Also, the column vectors of its matrix are linearly independent. Thus, according to [17], the least-squares error solution is given by (19). The estimate of the derivative of the glottal source signal is then obtained by filtering the speech signal using the filter obtained from (19). The obtained filter may be unstable. We solve this problem by choosing two other nearby cycles and using different signal segments to construct the system in (19). Simulations show that the filtering effect of the numerator in (14) is not significant. Therefore, the result of inverse filtering is viewed as the estimate of the delayed derivative of the glottal source signal. Using its waveform, the glottal phases can be detected, as mentioned at the beginning of this section.
It should be noted that, since the actual GVTF is time-varying, the glottal source signal cannot be "accurately" recovered from the speech signal by using a time-invariant filter. Nevertheless, as validated in Section VII, the glottal phases detected by using this estimate are correct. After the glottal phases are detected, we estimate the VTF parameters using the speech-signal subsegments corresponding to closed glottal phases.

IV. ESTIMATING THE VOCAL-TRACT FILTER
As mentioned in Section I, existing methods for obtaining VTF estimates from vowel sounds involve over-simplified assumptions about unknown glottal waves. In this section, without using these simplifications, we obtain unbiased VTF estimates.
From (13), we obtain (20), which relates a delayed and scaled version of the microphone signal to the glottal wave. For convenience, the time-domain equivalent of this signal is referred to as the sound pressure at the lips. According to (20), the sound pressure at the lips can be expressed in the time domain as (21).
In estimating the coefficients of the VTF, only the samples that do not contain the influence of the open glottis should be used. Let these samples lie in a given interval within one glottal cycle. Then, according to (21), the relationship (22a) or, in matrix form, (22b) must be satisfied. In (22), the unknowns are the VTF coefficient vector and the vector of glottal-wave derivative samples, whereas the entries of the measurement vector and matrix are known. Equation (22) is an under-determined system of linear equations, since the number of unknowns is larger than the number of equations.
To solve for the VTF coefficients, we construct more equations, using signal samples over the adjacent cycle, and take advantage of the fact that for a sustained vowel sound, the glottal wave is a periodically stationary random process. Let the interval in the adjacent cycle have the same relative position as the interval in the first cycle. The additional equations in the same unknowns are then given by (23a) or, in matrix form, (23b), in which the measurement vector and matrix are taken over the second interval and an extra vector represents the difference of the derivative of the glottal wave over the two intervals. Subtracting (22b) from (23b), we get (24), which involves a difference matrix and two difference vectors. The periodic components of the glottal wave are now removed from the system (24). The remaining difference term originates from the randomness of the turbulence noise in the glottal wave. If the number of equations exceeds the number of unknowns, then (24) is an over-determined system. The least-squares error solution of (24) is then taken as the estimate of the VTF coefficients (25). From (24) and (25), it is clear that the estimate given by (25) contains the influence of the unknown difference term, as shown in (26). It is shown in the Appendix that, for a sustained vowel sound, the estimator given by (25) is unbiased. The accuracy of the VTF estimate can be improved by averaging many such estimates obtained from different cycles of the sound.
For some voices, the duration of the closed glottal phase is very short, and the number of equations in (24) is less than the number of unknowns. In such a case, more equations can be constructed from samples in other closed glottal phases of the sound, such as those in another pair of cycles (27). Combining (24) and (27), we get (28), and the least-squares error solution is taken as the estimator of the VTF coefficients (29).
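The cycle-difference idea behind (22)–(25) can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a time-invariant all-pole filter of order 4 with made-up pole locations, an arbitrary periodic excitation plus white noise, and it uses whole-cycle segments rather than detected closed glottal phases. Following the stacking idea of (28) and (29), it places the difference equations from many pairs of adjacent cycles into one least-squares problem. All function names and numerical values are hypothetical.

```python
import numpy as np


def simulate_vowel(a, period, n_cycles, noise_std, rng):
    """s[n] = sum_k a[k-1] * s[n-k] + g[n], where g is periodic plus white noise."""
    n = np.arange(period)
    pulse = np.sin(2 * np.pi * n / period) - 2.0 * (n == 0)  # arbitrary periodic source
    g = np.tile(pulse, n_cycles) + noise_std * rng.standard_normal(period * n_cycles)
    s = np.zeros_like(g)
    p = len(a)
    for i in range(len(g)):
        past = s[max(i - p, 0):i][::-1]  # s[i-1], s[i-2], ...
        s[i] = np.dot(a[:len(past)], past) + g[i]
    return s


def difference_equations(s, start, length, period, order):
    """Rows of the cycle-difference system for one pair of adjacent cycles."""
    d = s[period:] - s[:-period]  # d[n] = s[n + period] - s[n]
    idx = np.arange(start, start + length)
    lhs = np.column_stack([d[idx - k] for k in range(1, order + 1)])
    return lhs, d[idx]


def estimate_all_pole(s, starts, length, period, order):
    """Stack the difference equations of all cycle pairs and solve by least squares."""
    blocks = [difference_equations(s, st, length, period, order) for st in starts]
    lhs = np.vstack([b[0] for b in blocks])
    rhs = np.concatenate([b[1] for b in blocks])
    a_hat, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return a_hat


def demo():
    rng = np.random.default_rng(0)
    poles = np.array([0.9 * np.exp(0.3j * np.pi), 0.85 * np.exp(0.7j * np.pi)])
    # Denominator 1 - sum_k a_k z^-k built from two conjugate pole pairs.
    a_true = -np.real(np.poly(np.concatenate([poles, poles.conj()])))[1:]
    period, order = 100, 4
    s = simulate_vowel(a_true, period, 60, 0.05, rng)
    starts = [c * period + 20 for c in range(2, 58)]  # skip the start-up transient
    return a_true, estimate_all_pole(s, starts, 60, period, order)


if __name__ == "__main__":
    a_true, a_hat = demo()
    print(np.round(a_true, 3), np.round(a_hat, 3))
```

Because the periodic part of the excitation cancels in the difference signal, the least-squares fit is driven only by the noise difference, which is the property the unbiasedness argument relies on.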
In Section V, we provide a method for locating the subsegments corresponding to the closed glottal phases.

V. LOCATING SIGNAL SEGMENTS CORRESPONDING TO CLOSED GLOTTAL PHASES
As shown in Section III, we can estimate the intervals during which the glottis is closed using the estimate of the delayed derivative of the glottal source signal obtained from the vowel-sound signal. Assume that, within a given cycle, this signal reaches its maximal negative peak at one instant, then returns to zero gradually or with fluctuations, and finally remains positive after crossing zero at a later instant. The time between these two instants is the duration of the closed glottal phase. The derivative of the glottal source undergoes the same process over the correspondingly earlier interval as its delayed estimate does over the observed interval. That earlier interval is identified as the closed glottal phase interval in the cycle, and it begins at the glottal closure instant.
Next, we determine the subsegments of the sound pressure at the lips required in (22)–(25) for the VTF estimation. According to (21), the sound pressure at the lips at a fixed number of sampling periods after the glottal closure instant is a combination of earlier lip-pressure samples. In this combination, the earliest sample contains the glottal reflection sound that is reflected from the glottis at the closure instant and then arrives at the lips (recall that the sound wave takes a fixed number of sampling periods to travel from the glottis to the lips), and all other samples contain glottal reflection sounds that are reflected from the glottis after the glottis closes. Likewise, for the last usable instant, the latest sample in the combination contains the glottal reflection at the last instant of the closed glottal phase, and the other samples contain glottal reflections that occur after the glottis closes and before the glottis opens. Therefore, the lip-pressure samples in the interval between these two instants do not contain the influence of the open glottis, and thus are used in constructing the measurement vector and matrix in (22).
Then, we translate the required lip-pressure samples into samples of the microphone signal, using the delay between the two signals, and use the resulting samples in constructing the measurement vector and matrix in (22). Similarly, the samples for the second interval in (23) are located relative to the instant at which the estimate of the derivative of the glottal source reaches its negative peak in the second cycle. Having constructed the difference matrix and vector, the all-pole parameters of the VTF can be obtained from (25).
As shown above, the speech subsegments that do not contain the influence of the open glottis can be located using the waveform obtained in Section III, without knowing the distance from the lips to the microphone or using other signals, such as EGG.
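The peak-and-zero-crossing rule described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function name, the cycle window passed in by the caller, and the `min_open_run` parameter (how many consecutive positive samples count as "remains positive") are hypothetical choices, not part of the paper.

```python
import numpy as np


def closed_phase(d_source, cycle_start, period, min_open_run=5):
    """Return (closure_index, opening_index) for one cycle of the estimated
    derivative of the glottal source.

    The closure instant is the most negative sample in the cycle. The opening
    instant is the first later sample from which the signal stays positive for
    at least `min_open_run` samples (an assumed stand-in for "remains positive").
    The closed phase is the half-open interval [closure_index, opening_index).
    """
    seg = np.asarray(d_source[cycle_start:cycle_start + period], dtype=float)
    closure = int(np.argmin(seg))
    positive = seg > 0
    for n in range(closure + 1, len(seg) - min_open_run + 1):
        if positive[n:n + min_open_run].all():
            return cycle_start + closure, cycle_start + n
    return cycle_start + closure, None  # no sustained positive run found


if __name__ == "__main__":
    # Hypothetical cycle: negative peak, small fluctuations, then a positive run.
    cycle = np.array([0.5, 0.5, -1.0, -3.0, -0.5, 0.1, -0.1, 0.05, -0.02,
                      0.3, 0.4, 0.5, 0.6, 0.5, 0.4, -0.2, -0.4, -0.6, -0.8, -1.0])
    print(closed_phase(np.concatenate([np.zeros(10), cycle]), 10, 20))
```

Requiring a run of positive samples keeps the small fluctuations around zero that follow the closure instant from being mistaken for the glottal opening.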
VI. OBTAINING THE GLOTTAL WAVEFORM
As shown in (13), an estimate of the numerator can be obtained by inverse filtering the speech signal using the VTF estimate. Since the filtering effect of the remaining numerator factor is not significant, the result of the inverse filtering is viewed as the scaled and delayed derivative of the glottal-wave signal. The glottal waveform is then obtained by integrating its derivative using an integrating filter. The zero line of the glottal wave cannot be recovered from the sound-pressure signal, because of the derivative factor in the transfer function from the lip volume velocity to the sound pressure at the microphone, as shown in (12). Since the glottal wave is never negative, its zero line is set to its minimum value in this study. A method for estimating the zero line is under research.
It is noted that a glottis may never completely close during closed glottal phases. Thus, the VTF estimate may contain the effect of finite glottal impedance, which increases the formant bandwidths and frequencies of the VTF estimate. The differences between the estimate of the derivative of the glottal wave and the actual one can be analyzed as follows. Such a VTF estimate is in fact equal to a GVTF corresponding to the incomplete glottal closure. Recalling (13), the estimate of the delayed derivative of the glottal wave, obtained by filtering the speech signal using the inverse filter of the VTF estimate, can be expressed in the frequency domain by (30), which relates the Fourier transforms of the estimate of the delayed derivative of the glottal wave, the speech signal, and the actual derivative of the glottal wave. Equation (30) means that, if the glottal impedance is not much greater than the vocal-tract driving-point impedance, then the estimate of the derivative of the glottal wave contains extra components. It is known that the driving-point impedance becomes large at the resonant frequencies of the VTF. Thus, the estimate of the glottal wave (derivative) may contain extra components near the VTF resonance frequencies if the incomplete glottal closure is large. A method for eliminating the effects of large incomplete glottal closures in the VTF estimates is under research.
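The two steps of this section, inverse filtering followed by integration with the zero line set to the minimum, can be sketched as below. The sketch assumes an all-pole filter with denominator 1 − Σ a_k z^(−k) and a plain running sum as the integrator; both the function names and these specific forms are assumptions for illustration, and the scaling and delay of the real signals are ignored.

```python
import numpy as np


def inverse_filter(s, a):
    """Residual e[n] = s[n] - sum_k a[k-1] * s[n-k] of an all-pole model."""
    s = np.asarray(s, dtype=float)
    e = s.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * s[:-k]
    return e


def glottal_wave_from_derivative(d_glottal):
    """Integrate with a running sum, then shift so that the minimum is zero."""
    wave = np.cumsum(np.asarray(d_glottal, dtype=float))
    return wave - wave.min()


if __name__ == "__main__":
    # Hypothetical first-order example: s was generated from e = [1, 0, 0, 2].
    print(inverse_filter([1.0, 0.5, 0.25, 2.125], [0.5]))
    print(glottal_wave_from_derivative([1, 1, -3, 1]))
```

Shifting by the minimum reflects the choice stated above that the glottal wave is never negative; it does not recover the true zero line.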
VII. RESULTS AND DISCUSSION
Five female and six male normal adult subjects produced the sustained vowel sound /a/, keeping the pitch, the loudness, and the vocal-tract shape fixed for 3 seconds, in a sound-controlled booth in the UBC Interdisciplinary Speech Research Lab. The subjects sat at a distance of 30.5 cm (12 in) from the microphone. The speech sound and the synchronized EGG signals were digitized using a Kay Elemetrics CSL 4400 and recorded using a computer. Both signals were sampled at the same rate, so the time delay from the lips to the microphone corresponds to a fixed number of sampling periods. The recorded speech and EGG signals were first de-noised using a method based on the wavelet transform: the wavelet coefficients below the fundamental frequency of the produced speech signal were set to zero. Due to limited space, we present the results obtained for two female and two male subjects.
The order of the GVTF is determined according to the average length of the vocal tract. For the sound /a/, the length of the vocal tract is approximately 14.5 and 17.5 cm for female and male adult subjects, respectively. Thus, the order of the GVTFs is 39 for the female subjects and 46 for the male subjects. The GVTF frequency responses obtained are shown using broken lines in plots (e) of Figs. 4–7. The vocal-tract length for each subject is then adjusted according to the obtained GVTF formant frequencies. According to [18], high-frequency formants are less affected by the vocal-tract shape, so the vocal-tract length can be approximated from a given formant frequency and the sound speed. We take the average of the lengths estimated from the 3rd to 17th formant frequencies of the GVTF as the vocal-tract length. The order of the VTF is then determined from this length. If the estimated vocal-tract length is unreasonable, we use the average vocal-tract length.
For each subject, the estimate of the derivative of the glottal source signal is first obtained, as shown by the waveforms (solid lines) in plots (b) of Figs. 4–7. Using this signal, the glottal phases are then identified. Within a cycle, the glottal closure instant was identified as the instant at which the estimate reaches its negative peak; the opening glottal phase was identified as the interval over which it remains positive; the closing glottal phase was identified as the interval over which it remains negative; and the closed glottal phase was the interval after the glottal closure instant and before the opening glottal phase.
The glottal closure instants and glottal phases identified using these waveforms are validated using the synchronized EGG signals. In this paper, decreasing EGG signals display increasing vocal-fold contact area. The EGG signals were delayed by a fixed number of samples so that they could be compared with the estimated waveforms. The glottal closure instants identified using the estimated waveform are marked using circles. The instants at which the vocal-fold contact area increases at the highest speed (i.e., at which the derivative of the delayed EGG signal reaches its negative peaks) are marked using stars. From each delayed EGG signal (dashed line) in plots (b), one can see that 1) in each glottal cycle, the identified glottal closure instant slightly precedes the instant when the vocal-fold contact area increases at the highest speed; 2) during the interval corresponding to the identified closed glottal phase, the amplitude of the EGG signal first decreases rapidly, then remains at its lowest level for a short time, and then increases from its lowest level; 3) during the interval corresponding to the opening glottal phase, the EGG signal first keeps increasing and then remains at its highest level; and 4) during the interval corresponding to the closing glottal phase, the EGG signal first remains at its highest level and then decreases. This relationship between the identified glottal phases and the EGG signal agrees with the Rothenberg model relating the glottal phase and the EGG signal [19]. This confirms that the glottal closure instants and phases are correctly identified using our method.

Fig. 4. (a) Speech signal p(n) of /a/ produced by female subject 1; (b) the estimate of the delayed derivative of the glottal source (solid line), and the delayed EGG signal (dashed line); (c) the estimate of the delayed derivative of the glottal wave; (d) the estimate of the delayed glottal waveform; (e) the frequency responses of the all-pole parts of the VTF (solid line) and of the average GVTF (dotted line); and (f) the VTAF (solid line) derived from the VTF estimate and that (dotted line) measured using MRI for an unknown male subject.

The speech subsegments corresponding to the closed glottal intervals are then identified, and are marked by solid zero lines in plots (a), (b), and (c) of Figs. 4–7. From only two or four of these subsegments, one may not obtain a good VTF estimate. For each sound, we obtain an accurate VTF estimate, which is stable and has clear resonant formants, by averaging many stable VTF estimates obtained over many closed glottal intervals (60–200), depending on the length of the sustained vowel sound produced. The frequency responses of the VTF estimates for /a/ produced by the four subjects are shown by the solid lines in plots (e) of Figs. 4–7.
To confirm the quality of the VTF estimates obtained, we compared the vocal-tract area functions (VTAFs) derived from these VTF estimates using the method in [20] with the vocal-tract area function measured from an unknown male subject's magnetic resonance image (MRI) [21]. To make the comparison, each VTAF is normalized relative to its maximum cross-sectional area. In addition, the VTAF measured from the MRI is normalized to have the same vocal-tract length as that of each subject. One can see from plots (f) of Figs. 4–7 that the VTAFs (solid lines) derived from the VTF estimates of /a/ produced by the subjects are very similar to that (dotted lines) measured from the MRI, although there are individual differences between vocal-tract shapes. Thus, the VTF estimates obtained are quite accurate.
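The vocal-tract length adjustment and the lip-to-microphone delay described earlier in this section can be illustrated numerically. The sketch below assumes the uniform-tube relation F_n = (2n − 1)·c/(4·L), which may not be the exact expression used in the paper, together with a sound speed of 350 m/s and a 44.1-kHz sampling rate; these values and the function names are assumptions for the example, since the paper's own values are not given here.

```python
def tract_length_from_formants(formants_hz, first_index=3, last_index=17, c=350.0):
    """Average vocal-tract length (m) over formants first_index..last_index.

    Assumes a uniform tube closed at the glottis: L = (2n - 1) * c / (4 * F_n).
    `formants_hz` lists F_1, F_2, ... in hertz.
    """
    lengths = [(2 * n - 1) * c / (4.0 * f)
               for n, f in enumerate(formants_hz, start=1)
               if first_index <= n <= last_index]
    return sum(lengths) / len(lengths)


def delay_in_samples(distance_m, fs_hz, c=350.0):
    """Propagation delay from the lips to the microphone, in sampling periods."""
    return round(distance_m * fs_hz / c)


if __name__ == "__main__":
    # Hypothetical formants of an ideal 17-cm uniform tube.
    ideal = [(2 * n - 1) * 350.0 / (4.0 * 0.17) for n in range(1, 18)]
    print(tract_length_from_formants(ideal))
    # 30.5 cm is the microphone distance stated above; the sampling rate is assumed.
    print(delay_in_samples(0.305, 44100.0))
```

Averaging over the higher formants follows the reasoning cited from [18] that they depend less on the vocal-tract shape than the lower ones.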
Fig. 5. (a) Speech signal p(n) of /a/ produced by female subject 2; (b) the estimate of the delayed derivative of the glottal source (solid line), and the delayed EGG signal (dashed line); (c) the estimate of the delayed derivative of the glottal wave; (d) the estimate of the delayed glottal waveform; (e) the frequency responses of the all-pole parts of the VTF (solid line) and of the average GVTF (dotted line); and (f) the VTAF (solid line) derived from the VTF estimate and that (dotted line) measured using MRI for an unknown male subject.

Fig. 6. (a) Speech signal p(n) of /a/ produced by male subject 1; (b) the estimate of the delayed derivative of the glottal source (solid line), and the delayed EGG signal (dashed line); (c) the estimate of the delayed derivative of the glottal wave; (d) the estimate of the delayed glottal waveform; (e) the frequency responses of the all-pole parts of the VTF (solid line) and of the average GVTF (dotted line); and (f) the VTAF (solid line) derived from the VTF estimate and that (dotted line) measured using MRI for an unknown male subject.

For each sound, the derivative of the glottal wave is then obtained by filtering the speech signal using the inverse filter of the VTF estimate. We observed that each derivative glottal waveform obtained displays single or multiple positive peaks over the short interval of the vocal-fold collision (when the vocal-fold contact area was increasing, as indicated by the steeply decreasing EGG signal), as shown in plots (c) of Figs. 4–7. These positive peaks could be due to the glottal chink (an opening in the posterior glottis) as well as to the vocal-fold collision. It is known from simulations [22] that a moderate glottal chink leads to source-tract interaction, resulting in ripples in the glottal waveform right after the glottal closure instant. It is also known that during phonation, compression and rarefaction tissue waves propagate along the vibrating membranous vocal folds, and that when the lower margins of the vocal folds are in contact, their upper margins are still apart [23]. As the rarefaction of the tissue wave travels the vertical extent of the contacting vocal folds, the vocal-fold contact area increases rapidly (see the steeply decreasing EGG signals), and the air between the folds is squeezed into the vocal tract [24]. As mentioned in Section I, although the squeezed airflow may be very limited, its derivative can have large positive values during the rapid fold collision.
One might also be concerned that the VTF estimates may contain the effects of incomplete glottal closures, and thus that the derivative glottal waveforms contain resonances of the VTFs, according to (30). However, our simulations show that such incomplete inverse filtering results in stable ripples over the whole glottal cycle, not just in the interval of the vocal-fold collision. Therefore, for the cases shown in plots (c) of Figs. 4–7, the positive peaks during the vocal-fold collision cannot be the residual resonance of the VTFs.
We also observe that after the vocal-fold contact areas become maximal, as the vocal folds part from their lower margins toward their upper margins (over the earlier half of each increasing EGG signal segment), some derivatives of the glottal waves remain negative, as shown in plots (c) of Figs. 4–7, and some (for some female subjects) remain nearly zero or even positive (not shown here, since their estimates contain more effects of incomplete glottal closures).
The glottal waveforms obtained for the four subjects are shown in plots (d) of Figs. 4–7. As can be deduced from their derivatives, over closed glottal phases, the glottal waves are nonzero. During vocal-fold colliding, they increase; during vocal-fold parting, they decrease, or even increase (not shown). This means that when the vocal-fold contact area becomes maximal, a glottis may not be closed maximally: as the folds
eliminating the effects of incomplete glottal closures in the VTF estimates is under research.
APPENDIX
We prove that the estimator given by (25) is an unbiased estimator of , i.e., the mathematical expectation of the estimate equals the true parameters of the VTF (A1) From (26), we get (A2) Therefore, we want to prove (A3) Denote matrix consisting of be
, where is a samples. Let the , then
th entry of
(A4) and the expectation of
Fig. 7. (a) Speech signal p(n) of /a/ produced by male subject 2; (b) the estimate of the delayed derivative of the glottal source (solid line), and the delayed EGG signal (dashed line); (c) the estimate of the delayed derivative of the glottal wave; (d) the estimate of the delayed glottal waveform; (e) the frequency responses of the all-pole parts of the VTF (solid line) and of the average GVTF (dotted line); and (f) the VTAF (solid line) derived from the VTF estimate and that (dotted line) measured using MRI for an unknown male subject.
part from their lower margins toward their upper margins, a glottis may become more and more closed.
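The inverse-filtering step used to obtain the derivative of the glottal wave and the glottal waveform can be illustrated with a minimal sketch. It assumes a purely all-pole VTF 1/A(z), so that the inverse filter is the FIR filter A(z); the coefficients `a_hyp`, the toy excitation, and the function names are hypothetical and are not taken from the paper, whose VTF estimate and delay handling are not reproduced here.

```python
import numpy as np

def synthesize(source, a):
    """Pass a source through an all-pole filter 1/A(z), with a[0] == 1.

    Implements y[n] = x[n] - sum_{k>=1} a[k] * y[n-k] with zero initial state.
    """
    y = np.zeros(len(source))
    for n in range(len(source)):
        acc = source[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def inverse_filter(speech, a):
    """Apply the FIR inverse filter A(z) to the speech signal.

    Truncating the full convolution to the input length is equivalent to
    causal FIR filtering with zero initial state.
    """
    return np.convolve(speech, np.asarray(a, dtype=float))[: len(speech)]

def integrate(derivative):
    """Running sum, a discrete-time stand-in for integrating the derivative."""
    return np.cumsum(derivative)

# Hypothetical, stable all-pole coefficients (pole radius about 0.89).
a_hyp = np.array([1.0, -1.2, 0.8])

# Toy periodic excitation standing in for the derivative of the glottal wave:
# 30 zero samples followed by one sine cycle of 20 samples, repeated 4 times.
pulse = np.sin(np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False))
source = np.tile(np.concatenate([np.zeros(30), pulse]), 4)

speech = synthesize(source, a_hyp)         # simulated vowel-like signal
recovered = inverse_filter(speech, a_hyp)  # estimate of the source derivative
glottal = integrate(recovered)             # estimate of the glottal waveform
```

In this idealized setting the inverse filter recovers the excitation exactly because the same coefficients are used for synthesis and analysis; with a real recording, the quality of the recovered waveform depends on the accuracy of the VTF estimate.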
is
(A5) Since the glottal-wave signal is a periodically stationary process, and since equals one or multiple pitch periods, then the joint probability density function of is identical to that of . Then
VIII. CONCLUSION
Accurate estimates of glottal waves and vocal-tract filters are important in natural sounding speech synthesis, speech recognition, speech pathology, speaker and/or emotion identification, and phonetic acoustics. Estimating the glottal wave and the VTF from a vowel sound signal without knowing either of them is an ill-defined inverse problem. Previous approaches impose over-simplified assumptions about the unknown glottal wave, resulting in biased VTF estimates and in glottal-wave estimates that lack detailed information over closed glottal phases. Without imposing these simplifications about unknown glottal waveforms, we obtain unbiased VTF estimates from sustained vowel sounds. The glottal waves estimated using our method contain detailed information over closed glottal phases, and the VTFs obtained yield vocal-tract area functions similar to that measured from an unknown subject’s magnetic resonance image. We also provide a new method for detecting the glottal phases using such sound signals. This method requires neither knowledge of the distance from the microphone to the subject nor other signals, such as EGG. A method for
(A6) Thus (A7) and (A8) Therefore, the estimator given by (25) is an unbiased estimator of .
ACKNOWLEDGMENT
The authors would like to thank the two anonymous reviewers and the editor for their detailed comments and suggestions on the earlier version of the paper. They are also grateful to Prof. B. Gick, Department of Linguistics, UBC, for his instructive comments about the vocal-fold vibrations. Finally, they thank
S. Rahemtulla, the technical director of UBC Interdisciplinary Speech Lab, and the volunteer students for helping us record speech and EGG signals.
REFERENCES
[1] R. L. Miller, “Nature of the vocal cord wave,” J. Acoust. Soc. Amer., vol. 31, pp. 667–677, Jun. 1959.
[2] D. Y. Wang and J. D. Mark, “Least squares glottal inverse filtering from the acoustic speech waveform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, no. 4, pp. 350–355, Aug. 1979.
[3] D. E. Veenman and S. Bement, “Automatic glottal inverse filtering from speech and electroglottographic signals,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, no. 2, pp. 369–377, Apr. 1985.
[4] H. Fujisaki and M. Ljungqvist, “Estimation of voice source and vocal tract parameters based on ARMA analysis and a model for the glottal source waveform,” in IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 12, 1987, pp. 637–640.
[5] H. Lu, “Joint estimation of vocal tract filter and glottal source waveform via convex optimization,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1999, pp. 79–82.
[6] H. Kasuya, K. Maekawa, and S. Kiritani, “Joint estimation of voice source and vocal tract parameters as applied to the study of voice source dynamics,” in Proc. ICPhS 99, San Francisco, CA, 1999, pp. 2505–2512.
[7] M. Fröhlich, D. Michaelis, and H. W. Strube, “SIM-Simultaneous inverse filtering and matching of a glottal flow model for acoustic speech signals,” J. Acoust. Soc. Amer., vol. 110, no. 1, pp. 479–488, Jul. 2001.
[8] J. Holmes and W. Holmes, Speech Synthesis and Recognition, 2nd ed. New York: Taylor & Francis, 2001, pp. 13–14.
[9] G. Fant, “Glottal flow: Models and interaction,” J. Phon., vol. 14, no. 3/4, pp. 393–399, 1986.
[10] J. L. Flanagan, Speech Analysis, Synthesis and Perception. New York: Springer-Verlag, 1972, pp. 39–65.
[11] T. V. Ananthapadmanabha and G. Fant, “Calculation of true glottal flow and its components,” Speech Commun., vol. 1, pp. 167–184, 1982.
[12] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Englewood Cliffs, NJ: Prentice-Hall, 2001, pp. 154–157.
[13] B. S. Atal and L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Amer., pt. 2, vol. 50, no. 2, pp. 637–655, 1971.
[14] H. Wakita, “Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms,” IEEE Trans. Audio Electroacoust., vol. AU-21, no. 5, pp. 417–427, Oct. 1973.
[15] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978, pp. 92–95.
[16] H. Deng, M. Beddoes, R. Ward, and M. Hodgson, “Estimating the glottal waveform and the vocal-tract filter from a vowel sound,” in Proc. IEEE PacRim Conf. Comm., Comp., Sig. Proc., Aug. 2003, pp. 297–300.
[17] L. Hogben, Elementary Linear Algebra. Minneapolis, MN: West, 1987, pp. 336–339.
[18] H. Wakita, “Normalization of vowels by vocal-tract length and its application to vowel identification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 2, pp. 183–192, Apr. 1977.
[19] J. S. Rubin, Diagnosis and Treatment of Voice Disorders. New York: Igaku-Shoin, 1995, pp. 290–311.
[20] H. Deng, R. Ward, M. Beddoes, and M. Hodgson, “Effects of glottal and lip boundary conditions on vocal-tract area function estimates from speech signals,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 2005, pp. 901–904.
[21] B. H. Story, I. R. Titze, and E. A. Hoffman, “Vocal tract area functions from magnetic resonance imaging,” J. Acoust. Soc. Amer., vol. 100, no. 1, pp. 537–554, Jul. 1996.
[22] B. Cranen and J. Schroeter, “Physiologically motivated modeling of the voice source in articulatory analysis/synthesis,” Speech Commun., vol. 19, pp. 1–19, 1996.
[23] G. S. Berke and B. R. Gerratt, “Laryngeal biomechanics: An overview of mucosal wave mechanics,” J. Voice, vol. 7, no. 2, pp. 123–128, 1993.
[24] I. R. Titze, “Glottal flow models,” J. Phon., vol. 14, pp. 405–406, 1986.
Huiqun Deng (S’03) received the B.Eng. degree in electrical engineering from Tsinghua University, China, in 1985, the Master's degree in audio engineering from Beijing Broadcasting Institute, China, in 1988, and the Ph.D. degree in electrical engineering from the University of British Columbia, Vancouver, BC, Canada, in 2005. She was a full-time lecturer on acoustics, audio measurements, circuit analysis, and digital audio at Beijing Broadcasting Institute from 1988 to 1999. Her current research interests are analysis and processing of speech, acoustical, and audio signals and systems. She is currently a Postdoctoral Fellow in speech processing with INRS-EMT, Montreal, QC, Canada.

Rabab Kreidieh Ward (F’99) was born in Beirut, Lebanon. She received the B.Eng. degree from the University of Cairo, Egypt, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Berkeley, in 1969 and 1972, respectively. She is a Professor in the Electrical and Computer Engineering Department and the Director of the Institute for Computing, Information and Cognitive Systems at the University of British Columbia. Her expertise lies in digital signal processing and applications to cable and high-definition TV, baby cry signals, brain–computer interfaces, and medical images, including mammography, microscopy and cell images. She holds six patents and many of her research ideas have been transferred to industry. Dr. Ward is a fellow of the Royal Society of Canada, the Engineering Institute of Canada, and the Canadian Academy of Engineers. She is a recipient of a UBC Killam Research Prize.

Michael Peter Beddoes (LS’96) received the B.E.E. degree in 1945 from Glasgow University, Glasgow, U.K., and the Ph.D. degree in electrical engineering from the Imperial College of London University, London, U.K., in 1956. He has had a long and interesting career in electrical engineering. He is presently Emeritus Professor in electrical and computer engineering. He joined the University of British Columbia and has published widely on aids for the blind (a seminal sabbatical study leave in 1966–1967 at the Cognitive Information Group with Dr. S. Mason and Dr. M. Eden is acknowledged), advances in electronic circuits, Gabor filters, speech coding, video compression, and Bayesian belief networks and neuromuscular diagnosis. His abiding interest is research into developing machines to help people. He holds support for the next three years to work on the topics “bruxism detector” and “speech analyzer.” Dr. Beddoes is a Life Professional Engineer of British Columbia.

Murray Hodgson received the B.Sc. (Honors) in physics and mathematics from Queen’s University, Kingston, ON, Canada, in 1974, and the M.Sc. degree in sound and vibration studies in 1978 and Ph.D. degree in acoustical engineering in 1983, both from the University of Southampton, Southampton, U.K. Since then he has worked as a Post-Doctoral Fellow in mechanical engineering, Sherbrooke University, Sherbrooke, QC, Canada, and as a Research Associate in the Institute for Research in Construction at the National Research Council, Ottawa, ON, Canada. He is currently Professor of acoustics in the School of Occupational and Environmental Hygiene and the Department of Mechanical Engineering at the University of British Columbia. He is the Director of the Acoustics and Noise Research Group. He is also a member of the UBC Institute for Computing, Information and Cognitive Systems (ICICS) and the UBC Institute for Hearing Accessibility Research (IHEAR). His major professional expertise and research interests are in the measurement, characterization, prediction and control of sound fields in rooms, especially workrooms, such as industrial workshops and classrooms. Dr. Hodgson is a Chartered Engineer (C.Eng.) in the Institute of Mechanical Engineers, U.K.