IEICE TRANS. INF. & SYST., VOL.E87–D, NO.5 MAY 2004
1079
PAPER
Special Section on Speech Dynamics by Ear, Eye, Mouth and Machine
Automatic Extraction of Tone Command Parameters for the Model of F0 Contour Generation for Standard Chinese Wentao GU†∗a) , Nonmember, Keikichi HIROSE† , Member, and Hiroya FUJISAKI† , Fellow
SUMMARY The model for the process of F0 contour generation, first proposed by Fujisaki and his coworkers, has been successfully applied to Standard Chinese, which is a typical tone language with a distinct feature that both positive and negative tone commands are required. However, the inverse problem, viz., automatic derivation of the model parameters from an observed F0 contour of speech, cannot be solved analytically. Moreover, the extraction of model parameters for Standard Chinese is more difficult than for Japanese and English, because the polarity of tone commands cannot be inferred directly from the F0 contour itself. In this paper, an efficient method is proposed to solve the problem by using information on syllable timing and tone labels. With the same framework as for the successive approximation method proposed for Japanese and English, the method presented here for Standard Chinese is focused on the first-order estimation of tone command parameters. A set of intra-syllable and inter-syllable rules are constructed to recognize the tone command patterns within each syllable. The experiment shows that the method works effectively and gives results comparable to those obtained by manual analysis. key words: model for F0 contour generation, automatic parameter extraction, Standard Chinese, tone command pattern
1. Introduction The contour of fundamental frequency (F0 ) of speech plays an important role in both text-to-speech synthesis and automatic speech recognition systems. On the one hand, accurate modeling of F0 contours is critically important for the intelligibility and naturalness of synthetic speech. On the other hand, there have been increasing interests in incorporating an F0 model into automatic speech recognition and understanding. The model for the process of F0 contour generation (henceforth the F0 contour model), first proposed by Fujisaki and his coworkers [1], has been successfully applied to many languages [2] including Standard Chinese, which is a typical tone language [3]–[5]. The model is based on the formulation of the underlying physiological and physical mechanisms [6], which makes it distinct from all other approaches for intonation modeling. It can generate very close approximations to observed F0 contours from a relatively small number of linguistically meaningful parameters. While it is straightforward to generate an F0 contour from a set of model parameters, the inverse problem, viz., derivation of command parameters from a given F0 contour, cannot be solved analytically. This inverse problem, howManuscript received September 10, 2003. Manuscript revised November 19, 2003. † The authors are with The University of Tokyo, Tokyo, 113– 0033 Japan. ∗ Presently, with Shanghai Jiaotong University, China. a) E-mail:
[email protected]
ever, is of great importance both for prosody analysis and for construction of prosody corpora for speech synthesis purpose. A successive approximation method with multi-stage first-order estimation has been proposed for both Japanese and English [7], [8]. With a reliable first-order estimation, the method ensures convergence to the globally optimum solution. In this paper, we present a modified method for Standard Chinese. The major problem is how to deal with both positive and negative tone commands, which have been shown to be necessary for modeling the F0 contours of Standard Chinese [3]. 2. The F0 Contour Model for Standard Chinese The F0 contour model is a command-response model that describes F0 contours in the logarithmic scale as the sum of phrase components, accent components (or tone components for tone languages) and a baseline level ln Fb . It has been proved that the two kinds of input commands, i.e., phrase commands and accent/tone commands, have good correspondence with various linguistic, para-linguistic and non-linguistic information of speech [9]–[11]. The model diagram for Standard Chinese [3] is shown in Fig. 1, where the phrase commands (impulses) produce phrase components through the phrase control mechanism, giving the global shape of the F0 contour, while the tone commands (pedestals) generate tone components through the tone control mechanism, characterizing the local F0 changes. Both mechanisms are assumed to be criticallydamped second-order linear systems. The model can be given by the following equations: ln F0 (t) = ln Fb +
I
A piG p (t − T 0i )
i=1
+ G p (t) = Gt (t) =
J
At j {Gt (t − T 1 j ) − Gt (t − T 2 j )},
(1)
j=1
α2 t exp(−αt), t ≥ 0, 0, t < 0,
(2)
min[1 − (1 + βt) exp(−βt), γ], t ≥ 0, 0, t < 0,
(3)
where G p (t) represents the impulse response function of the phrase control mechanism, and Gt (t) represents the step response function of the tone control mechanism. The pa-
IEICE TRANS. INF. & SYST., VOL.E87–D, NO.5 MAY 2004
1080
Fig. 1
The F0 contour generation model for Standard Chinese. Table 1
Fig. 2 The F0 contours and underlying tone command patterns for the isolated syllable “yi” of four lexical tones.
rameters A pi and T 0i denote the magnitude and time of the ith phrase command respectively, while At j , T 1 j and T 2 j denote the amplitude, onset time and offset time of the jth tone command respectively. I and J indicate the total number of phrase commands and tone commands in the utterance. The constants α, β and γ are set at their respective default values 3.0 (1/s), 20.0 (1/s) and 0.9 respectively in the current study. Unlike most non-tone languages, which have only positive accent commands, Standard Chinese as a tone language requires both positive and negative tone commands, as have been discussed in [3]. In Standard Chinese there are four lexical tones: T1 (high tone), T2 (rising tone), T3 (low tone) and T4 (falling tone). These tones are attached to each syllable. An example of F0 contours for the isolated syllable “yi” of each lexical tone is shown in the upper part of Fig. 2. Most works on the tones of Standard Chinese have been based on the symbolized patterns proposed by Chao [12], who characterized the F0 contour shapes of the four lexical tones by a set of piecewise linear functions connecting several target levels (namely 1 ∼ 5, from low to high), as given in Table 1. This simplified representation for isolated syllables, however, cannot characterize tone variations and the phrasal intonation in an utterance. Although many later works (e.g. [13]) introduced various rules to model tonal coarticulation, they are essentially qualitative. On the contrary, the F0 contour model proposed by Fujisaki et al. not only takes both phrase components and tone components into consideration, but also provides a fully quantitative interpretation for F0 contours of Standard Chinese. In this model, each lexical tone is assumed to correspond to a specific tone command pattern (henceforth in-
Notation of four lexical tones in Standard Chinese.
trinsic pattern): T1 (positive), T2 (negative followed by positive), T3 (negative) and T4 (positive followed by negative), as shown in the lower part of Fig. 2. For T2 and T4, the offset of the first tone command is assumed to coincide with the onset of the second tone command. In some works on Standard Chinese, certain singlesyllable particles are considered to belong to the so-called neutral tone (T0), which does not possess its original tone but of which the F0 contour varies with the preceding tone. We have observed, however, that the tone commands of any lexical tone can be greatly reduced or become null in a specific context and is perceived as the neutral tone. For example, in some bi-syllabic words, the neutral tone will fall on the second syllable of the word. In the F0 contour model, T0 is assumed to have no intrinsic tone command. 3. Parameter Extraction Method As mentioned in the Introduction, the model parameters can only be derived by successive approximation (i.e., Analysisby-Synthesis) with a good initial estimation. In [7], a multistage approximation method was proposed to derive the first-order estimation by approximating an observed F0 contour by a set of piecewise 3rd-order polynomials which are continuous and differentiable everywhere. The method was also successfully extended to English in [8]. For Standard Chinese, we use the same framework, but modify the second stage, namely first-order estimation of tone commands (instead of accent commands). The framework of the automatic parameter extraction method is as follows [7]: (1) Pre-processing of an observed F0 contour The observed F0 contour is first processed to remove those factors that cannot be captured by the F0 contour model. After gross error correction, microprosody removal, interpolation and smoothing, the observed F0 contour is ap-
GU et al.: EXTRACTION OF MANDARIN F 0 MODEL PARAMETERS
1081
proximated by a set of piecewise 3rd-order polynomials. (2) First-order estimation of tone command parameters If we neglect the effect of phrase components which is much more gradual than that of tone components, the positive maxima (p-maxima) and negative minima (n-minima) of the first derivative (hence both are inflection points) of the smoothed F0 contour should correspond to the onsets and offsets of tone commands with a constant delay of 1/β. Although there can be other types of inflection points, in the current study we only pay attention to p-maxima and nminima, for they convey the most important information on local F0 changes. (3) First-order estimation of phrase command parameters After removing the estimated tone components from the smoothed F0 contour, each phrase command is detected successively by a left-to-right procedure from the residual contour. (4) Optimization of parameters by Analysis-by-Synthesis Parameters of tone and phrase commands are optimized by successive approximation to obtain a set of parameters leading to the least mean square error in ln F0 (t) over the entire utterance. In stages (1), (3) and (4), some thresholds and detailed processes are also modified to adapt to Standard Chinese, which as a tone language usually shows faster local F0 changes than Japanese and English. For example, in stage (1) the smoothing interval is set at 150 ms instead of 200 ms, which has been found to be appropriate for Japanese utterances. 4. First-Order Estimation of Tone Command Parameters 4.1 Problem Analysis Unlike Japanese and English, Standard Chinese involves both positive and negative tone commands, which makes correct detection of tone commands much more difficult, because the polarity of tone commands cannot be inferred directly from the F0 contour itself. In Mixdorff et al.’s work [14], an automatic parameter extraction method for Standard Chinese F0 contour model was proposed by introducing a high-pass filter through which the output contour (HFC) was supposed to correspond only to tone components. The underlying assumption there is that the contour of tone components has the summation of zero (viz., the positive and negative tone components occupy equal proportions). This assumption, however, is not necessarily valid. For example, it is not valid when an utterance is composed mostly of T1 (or mostly of T3) syllables. The task is difficult for Standard Chinese for two reasons: (1) there is no way to determine from the F0 contour itself whether a p-maximum (or n-minimum) of the first derivative corresponds to the onset of a positive (or negative) command or the offset of a negative (or positive) command or both, (2) The p-maxima and n-minima of the first derivative do not necessarily occur in pairs. It is possible that two
consecutive commands share one inflection point between them (offset of the first command and onset of the second command), and it is also possible that two consecutive inflection points have first derivatives of the same polarity. To overcome this difficulty, it is necessary to use some additional information. Here we utilize syllable timing and tone labels available in Standard Chinese speech corpus, by which the tone command pattern within each syllable can be recognized on the basis of a certain set of rules. After tone command pattern recognition, it becomes straightforward to obtain the first-order estimation of tone command parameters. 4.2 Tone Command Pattern Recognition As assumed in the F0 contour model for Standard Chinese, an intrinsic tone command pattern should be associated with the syllable of each lexical tone. In practice, however, the command pattern within a syllable undergoes variations for three reasons: (1) tonal variations in continuous speech due to tone sandhi and tone coarticulation, (2) inaccurate F0 contour due to both noises introduced in pitch extraction and errors introduced in contour interpolation and smoothing, (3) inaccurate syllable segmentation especially where severe coarticulation exists. Therefore, a set of rules needs to be constructed for tone command pattern recognition. The basic requirement is that a series of tone commands can be generated from the inflection points of the F0 contour. As stated above, each inflection point corresponds to either onset or offset of a tone command as shown in Fig. 2, or in other words, to either ascent or descent edge of a pedestal. The starting and ending points for each edge of pedestals can be reasonably defined in three levels: 1 (positive), −1 (negative) and 0 (baseline). Therefore the recognition task is to decide the starting and ending levels of the pedestal edge corresponding to each inflection point according to the polarity of derivative. For example, for a p-maximum, we need to find out whether it is an onset of a positive command (0→1) or an offset of a negative command (−1 → 0) or both (−1 → 1). 4.2.1 Intra-Syllable Command Pattern Recognition Firstly, intra-syllable tone command pattern recognition is carried out, based on the criterion of best match with the intrinsic command patterns, by the aid of some heuristics. Figure 3 shows some most frequently (not all) observed patterns associated with different series of inflection points. Each column in the figure depicts the command patterns associated with four different tonal syllables when a specific series of inflection points (+: p-maximum, −: n-minimum) are observed within the syllable. In the figure, the thin solid lines depict the recognized onsets or offsets of tone commands, while the dotted lines present a reference of the intrinsic tone command patterns. Patterns recognized as identical to the intrinsic tone command patterns are depicted by the thick solid lines.
IEICE TRANS. INF. & SYST., VOL.E87–D, NO.5 MAY 2004
1082
Fig. 4 Supplement of tone commands for syllables within which no inflection points are detected.
Fig. 3 Tone command patterns associated with different series of inflection points for each tonal syllable.
It is observed that some cases are associated with multiple candidates. For such cases, the command patterns are determined on the basis of timing. For example, when there is only one p-maximum in a T4 syllable (as shown in the first column for T4), if it occurs within the first half of the syllable, it will be recognized as an onset of positive command, otherwise it will be recognized as an offset of negative command. 4.2.2 Inter-Syllable Command Pattern Refinement After intra-syllable tone command pattern recognition, the onset and offset of each tone command are still not necessarily uniquely determined. First, the ending level of the last inflection point in the current syllable may not necessarily match the starting level of the first inflection point in the next syllable. Second, the onset of utterance-initial tone command or the offset of utterance-final tone command may not be observed from the F0 contour, because the onset of utterance-initial (or offset of utterance-final) tone command may occur before (or after) voicing, and the offset of utterance-final positive tone command may be confounded with the utterance-final F0 reset and hence cannot be accurately detected. Third, no command patterns have been specified for T0 syllables (within which both levels of all inflection points are set temporarily as 0). Therefore, we apply the following set of inter-syllable rules to further refine the tone command patterns: (1) Process for utterance-initial and utterance-final tone commands (1.a) If the utterance-initial syllable is T1 within which no p-maximum is detected, a dummy point ended with level 1 is added as the onset of positive tone command ahead of the initial T1 syllable. (1.b) If the starting level of utterance-initial inflection point or the ending level of utterance-final inflection point is not 0, a dummy point should be added (with a constant time shift) before the initial one or after the final one as the onset of initial tone command or the offset of final tone command. (2) Connect the ending level of the last inflection point in the current syllable and the starting level of the first inflection point following the current syllable. (2.a) If both levels are 0 and the two inflection
Fig. 5 Insertion of dummy points (onsets or offsets of tone commands) between two inflection points of the same polarity.
points are opposite in polarity, when there is another syllable of T1 (or T3) between them (no inflection points detected within this syllable), change both levels into 1 (or −1). As illustrated in Fig. 4, this process is introduced to avoid missing the tone commands (depicted by the dashed lines) for syllables within which no inflection points are detected. (2.b) If only one level is 0 and the two inflection points are opposite in polarity, simply change the level 0 to coincide with the other. (2.c) Otherwise, if the two levels differ and the two inflection points are of the same polarity, as illustrated in Fig. 5, a dummy point needs to be inserted (as depicted by the dotted lines) in the middle of the two inflection points. This process is adopted to recover the offset of the current tone command (when the current syllable does not end at level 0, as shown in the first row) or the onset of the next tone command (when the next syllable does not start at level 0, as shown in the second row) or both (as shown in the third row). (3) Determine the tone command patterns in T0 syllables. If both the starting and ending levels of an inflection point remain 0, a set of heuristic rules will be used to determine the levels based on the context. For instance, as shown in the first two rows of patterns in Fig. 6, when the preceding and following tone commands are of the same polarity (as depicted by the solid lines), the current inflection point (as indicated in the middle of the two tone commands, with the polarity indicated in the top row) will be connected with the neighbors to compose a new tone command with the opposite polarity, as depicted by the dashed line. However, there is an exception as shown in the bottom row of patterns in Fig. 6, where the offset of the preceding tone command is not an inflection point but a dummy point as depicted by the dotted lines. In such a case, a new tone command will be composed with the current inflection point as the onset, while another dummy point will be inserted afterwards as the offset.
GU et al.: EXTRACTION OF MANDARIN F 0 MODEL PARAMETERS
1083
(only possible when s(T 1 j ) 0), the correction is adopted, and a back-tracing process is conducted to re-examine other alternatives for the preceding tone command patterns. 5. Experiment
Fig. 6 Some example cases for specifying tone command patterns in T0 syllables.
The purpose of introducing these specific processes in the above two cases is to produce reasonable and natural connections between inflection points. 4.3 Estimation of Timing and Amplitudes of Tone Commands After command pattern recognition based on inflection points, a series of tone commands need to be generated. The onset and offset time of these tone commands can be easily determined, with only a constant 1/β ahead of the corresponding inflection points. The amplitudes of tone commands can be derived from the derivatives of ln F0 (t) at the inflection points. Since for Standard Chinese the p-maxima and n-minima of the derivative do not necessarily occur in pairs, the amplitudes of tone commands should be determined in connection with their neighbors. In the proposed method, we set the amplitude of the current tone command At j as follows: (a) If T 1 j corresponds to an inflection point, F0 (T 1 j ) · e/β, if s(T 1 j ) = 0 and e(T 2 j ) 0, [F0 (T 1 j ) − F0 (T 2 j )] · e/2β, At j = (4) if s(T 1 j ) = 0 and e(T 2 j ) = 0, F0 (T 1 j ) · e/β + At, j−1 , if s(T 1 j ) 0, At j = sgn(F0 (T 1 j )) · At min , if sgn(At j ) sgn(F0 (T 1 j )), (5) (b) If T 1 j corresponds to an inserted dummy point, −F0 (T 2 j ) · e/β, if e(T 2 j ) = 0, (6) At j = −F0 (T 2 j ) · e/2β, if e(T 2 j ) 0. Here F0 (t) denotes the first derivative of ln F0 (t), while s(t) and e(t) denote the starting and ending level of the inflection point at t respectively. At min represents a threshold value 0.08 set as the minimum absolute amplitude of tone commands, and sgn(•) represents the sign function. The equation (5) is introduced because the derivatives at a consecutive set of inflection points will not sum up exactly to 0 due to various noises. In the case where an obvious error occurs, viz., the polarity of tone command reverses
The speech material used in the current study is the same as that used by Wang et al. in [5], which consists of 80 utterances read by two male native speakers of Standard Chinese. The speech signals are digitized at 10 kHz with 16-bit precision, and the F0 values are extracted by a modified autocorrelation analysis of the LPC residual signal [15]. In the current experiment the initial value of the baseline Fb is set at 80 Hz. In [5] a manual analysis on F0 contours of these utterances was conducted by an expert to detect the commands manually by Analysis-by-Synthesis. Although the study there indicated that some constraints/rules could be introduced (especially for tone commands) to ease the manual analysis and at the same time result in little degradation, we still use the original results of Analysis-by-Synthesis without any constraints as the reference here. The performance of the first-order estimation of tone command parameters is evaluated by comparison with the results of manual extraction, and is expressed in terms of miss and false alarm rates. The miss and false alarm rates are defined as below, with the manual extraction result regarded as the standard solution: deletion correct + deletion insertion false alarm = correct + insertion
miss =
(7) (8)
In these equations, ‘correct’ represents the number of correctly detected tone commands, while ‘deletion’ and ‘insertion’ represent the numbers of tone command deletion and insertion respectively. For simplicity, only command detection is examined here, while the accuracy of command amplitudes has not been taken into account. An automatically extracted tone command is regarded to be correct when both the onset and offset are within 0.15 s distance from the corresponding manually extracted ones. Besides, when consecutive tone commands of the same polarity in the result of manual extraction are found to be merged in the result of automatic extraction, it is not considered as an error since such a merger reflects the effect of tonal coarticulation. In the current experiment, the miss and false alarm rates for tone commands are 8.8% and 13.4% respectively. Considering that this is the result before Analysisby-Synthesis, the result is quite promising, since successive approximation is expected to further reduce the errors. Although the current study is focused only on the firstorder estimation of tone commands while the first-order phrase commands estimation and Analysis-by-Synthesis processes have not been optimized, the preliminary approximation still shows good performance. The average approximation error, defined as the average of root mean square
IEICE TRANS. INF. & SYST., VOL.E87–D, NO.5 MAY 2004
1084
(r.m.s.) error between observed and approximated values of ln F0 within voiced intervals for all the utterances, is equal to 0.077. This is equivalent to approximately 8% for the r.m.s. value of the relative error in F0 . The example results for two utterances of Standard Chinese read by the two speakers are shown in Figs. 7 and 8 respectively, where Figs. 7 (a) and 8 (a) give the manual extraction results, while Figs. 7 (b) and 8 (b) give the automatic extraction result after Analysis-by-Synthesis. The crossed symbols depict the observed F0 values, while the solid lines and dashed lines depict the approximated F0 contours and the contribution of phrase components respectively. It is shown that the automatically approximated F0 matches the observed F0 quite well, though slightly inferior to the manual approximation. The major difference observed between the results of the two methods is that some consecutive tone commands of the same polarity obtained in manual analysis are found to be merged in the result of automatic extraction. Such mergers reflect the effect of tonal coarticulation, and hence are not regarded as errors. In fact, as indicated by previous experiments, the F0 differences resulting from these mergers are almost imperceptible. Besides, some differences as observed in the figures may be due to imperfect pre-processing of F0 contours. Fig. 7 An example of the F0 contour model parameter extraction for an utterance of Standard Chinese read by Speaker A.
6. Conclusion A method was proposed for automatic extraction of parameters of the F0 contour model for Standard Chinese. Adopting the same framework as in the method already developed for Japanese and English, the current study was focused on the first-order estimation of tone commands having both positive and negative polarities. The information on syllable timing and tone labels was employed, and a set of rules was constructed to help recognize the tone command patterns. The results of preliminary experiments showed that the method worked effectively. In general, the proposed method for first-order estimation of tone command parameters is expected to be applicable to all the tone languages whose basic tonal units are associated with their respective tone command patterns. For different tone languages, however, a language-specific set of heuristic rules needs to be constructed for tone command pattern recognition. Moreover, the set of rules can be constructed in a quite similar way as proposed for Standard Chinese: intra-syllable command pattern recognition followed by adjustment of inter-syllable interaction. References
Fig. 8 An example of the F0 contour model parameter extraction for an utterance of Standard Chinese read by Speaker B.
[1] H. Fujisaki and S. Nagashima, “A model for synthesis of pitch contours of connected speech,” Annual Report, Engg. Res. Inst., University of Tokyo, vol.28, pp.53–60, 1969. [2] H. Fujisaki, “The fundamental frequency contours of speech — Its modeling, underlying mechanisms, and application to multilingual speech synthesis,” Proc. 1999 Int’l Conference on Speech Processing, vol.1, pp.19–26, Seoul, Korea, 1999. [3] H. Fujisaki, P. Hall´e, and H. Lei, “Application of F0 contour
GU et al.: EXTRACTION OF MANDARIN F 0 MODEL PARAMETERS
1085
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12] [13]
[14]
[15]
command-response model to Chinese tones,” Reports of Acoust. Soc. Jpn. Autumn Meeting, pp.197–198, 1987. K. Hirose, H. Lei, and H. Fujisaki, “Analysis and formulation of prosodic features of speech in standard Chinese based on a model of generating fundamental frequency contours,” J. Acoust. Soc. Jpn., vol.50, no.3, pp.177–187, 1994. C. Wang, H. Fujisaki, S. Ohno, and T. Kodama, “Analysis and synthesis of the four tones in connected speech of the Standard Chinese based on a command-response model,” Proc. Eurospeech ’99, vol.4, pp.1655–1658, Budapest, Hungary, 1999. H. Fujisaki, “A note on the physiological basis for the phrase and accent components in the voice fundamental frequency contour,” in Vocal Physiology: Voice Production, Mechanisms and Functions, ed. O. Fujimura, pp.347–355, Raven Press, 1988. S. Narusawa, N. Minematsu, K. Hirose, and H. Fujisaki, “A method for automatic extraction of parameters of the fundamental frequency contour generation model,” J. Inf. Process. Soc. Jpn., vol.43, no.7, pp.2155–2168, 2002. S. Narusawa, N. Minematsu, K. Hirose, and H. Fujisaki, “Automatic extraction of model parameters from fundamental frequency contours of English utterances,” Proc. ICSLP ’02, vol.3, pp.1725–1728, Denver, USA, 2002. H. Fujisaki and K. Hirose, “Analysis of voice fundamental frequency contours for declarative sentences of Japanese,” J. Acoust. Soc. Jpn. (E), vol.5, no.4, pp.233–242, 1984. H. Fujisaki and K. Hirose, “Analysis and perception of intonation expressing paralinguistic information in spoken Japanese,” Proc. ESCA Workshop on Prosody, pp.254–257, Lund, Sweden, 1993. K. Hirose, N. Minematsu, and H. Kawanami, “Analytical and perceptual study on the role of acoustic features in realizing emotional speech,” Proc. ICSLP2000, vol.2, pp.369–372, Beijing, China, 2000. Y.-R. Chao, A Grammar of Spoken Chinese, University of California Press, Berkeley, 1968. C. Shih and R. Sproat, “Issues in text-to-speech conversion for Mandarin,” Computational Linguistics and Chinese Language Processing, vol.1, no.1, pp.37–86, 1996. H. Mixdorff, H. Fujisaki, G. Chen, and Y. Hu, “Towards the automatic extraction of Fujisaki model parameters for Mandarin,” Proc. Eurospeech ’03, vol.2, pp.873–876, Geneva, Switzerland, 2003. K. Hirose, H. Fujisaki, and S. Seto, “A scheme for pitch extraction of speech using autocorrelation function with frame length proportional to the time lag,” Proc. ICSLP ’92, vol.1, pp.149–152, Banff, Canada, 1992.
Wentao Gu received the B.E., M.E. and Ph.D. degrees in electronic engineering from Shanghai Jiaotong University, Shanghai, China, in 1993, 1996 and 1999 respectively. From September 1997 to March 1998, he was with the Language Modeling Department, Bell Labs Research, Lucent Technologies, Murray Hill, NJ, as a visiting researcher working on speaker adaptation method for duration model in text-tospeech synthesis. Since November 1999, he has been Lecturer at the Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai, China. From October 2001 to September 2003, he stayed at the University of Tokyo as a visiting researcher, working on Mandarin F0 modeling and HMM-based Mandarin speech synthesis. His interests include prosody modeling, speech synthesis and speech recognition. He is a member of the Acoustical Society of Japan.
Keikichi Hirose received the B.E. and Ph.D. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1972 and in 1977 respectively. In 1977, he joined the University of Tokyo as Lecturer at the Department of Electrical Engineering, and in 1994, became Professor at the Department of Electronic Engineering. From 1996, he was Professor at the Department of Information and Communication Engineering, Graduate School of Engineering, the University of Tokyo. Since April 1999, he has been Professor at the Department of Frontier Informatics, Graduate School of Frontier Sciences, the University of Tokyo. From March 1987 until January 1988, he was a visiting scientist at the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, USA. His research interests widely cover spoken language information processing, especially the topics related to prosody. He is a member of the Institute of Electrical and Electronics Engineers, the Acoustical Society of America, the International Speech Communication Association, the Acoustical Society of Japan, the Information Processing Society of Japan, and other professional organizations.
Hiroya Fujisaki received the B.S., M.S. and Dr. Eng. degrees in electrical engineering from the University of Tokyo in 1954, 1956 and 1962, respectively. During 1958 to 1961 he was Fulbright Scholar at the Massachusetts Institute of Technology, and in 1960 he was guest researcher at the Royal Institute of Technology, Stockholm. In 1962 he joined the Faculty of Engineering of the University of Tokyo as Assistant Professor and became Professor of Electrical Engineering in 1973. At the University of Tokyo he held a joint appointment at the Department of Linguistics of the Graduate School of Humanities, and also held the position of Professor of Speech Science at the Research Institute of Logopedics and Phoniatrics of the Faculty of Medicine. He was also Visiting Professor at the University of Texas at Austin, Royal Institute of Technology, Stockholm, University of Goettingen and Nanjing University, and was Guest Professor at the University of Science and Technology of China. He retired from the University of Tokyo in March 1991 and became Professor Emeritus. Since April 1991, he has been Professor at the Tokyo University of Science. During these years, he has been engaged in a wide range of research on speech and language, including production, perception, acquisition, impairment, analysis, synthesis, coding, recognition, and published a number of technical papers and books. For his academic achievements and international contributions he received a number of awards including distinguished paper awards both from the Institute of Electrical Communication Engineers of Japan and the Institute of Electrical Engineers of Japan, the distinguished achievement award from the Electronics, Information and Communication Engineers, the meritorious service award of the IEEE Acoustics, Speech, and Signal Processing Society, the special distinguished service award of the Acoustical Society of America, and the Third Millennium Medal from IEEE. In 1989 he was honored by the Mayor of Tokyo as Person of Merit in Science and Technology. He is an Honorary Member of the Acoustical Society of Japan, a Fellow of the Acoustical Society of America, a Fellow of the Institute of Electronics, Information, and Communication Engineers, a Life Member of the Institute of Electrical and Electronics Engineers, a member of the Engineering Academy of Japan, and of many other academic societies. He is also a member of Tau Beta Pi and Sigma Xi.