Voice Conversion Using Duration-Embedded Bi-HMMs for Expressive Speech Synthesis
Chung-Hsien Wu, Senior Member, IEEE, Chi-Chun Hsia, Te-Hsien Liu, and Jhing-Fa Wang, Fellow, IEEE
Abstract—This paper presents an expressive voice conversion model (DeBi-HMM) as a post-processing module of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for the duration-embedded characteristic of its two HMMs, which model the source and target speech signals, respectively. Joint estimation of the source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. A gamma distribution is embedded as the duration model for each state in the source and target HMMs. Expressive style-dependent decision trees achieve prosodic conversion. The STRAIGHT algorithm is adopted for the analysis and synthesis process. A set of small-sized speech databases for each expressive style was designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing were conducted to evaluate the quality of the synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method exhibits encouraging potential for expressive speech synthesis.

Index Terms—Bi-HMM voice conversion, embedded duration model, expressive speech synthesis, prosody conversion.
I. INTRODUCTION
Concatenative text-to-speech (TTS) systems have recently been presented for high-quality emotional speech synthesis [1], [2]. However, without a set of large speech databases, these systems have difficulty synthesizing utterances in different speaking styles, by different speakers, or with different emotions [3]. Voice conversion methods have previously been proposed to convert the speech signals uttered by one speaker into those of another speaker using limited speech data. Kawanami et al. [4] adopted the voice conversion method to convert an utterance from neutral to expressive speech, which is generally characterized by two primary features: spectrum and prosody. Murray and Arnott [5] concluded that although prosodic features characterize the main expression in speech, spectral features are also indispensable in emotional speech expression. With a limited-size speech database, this study proposes a spectral and a prosodic voice conversion model as a post-processing module of the TTS system for expressive speech synthesis.
[email protected];
[email protected];
[email protected]). J.-F. Wang is with the Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan, R.O.C. (e-mail:
[email protected]). Digital Object Identifier 10.1109/TASL.2006.876112
Abe et al. [6] applied a codebook mapping method from source to target feature vectors for spectral conversion and attained acceptable performance. However, stochastic approaches have dominated the development of voice conversion systems in the past decade. Stylianou et al. [7] and Kain and Macon [8] presented a Gaussian mixture model (GMM) with a conversion function trained under the minimum mean square error (MMSE) criterion for each Gaussian component. GMM-based voice conversion is performed frame by frame under a time-independence assumption, disregarding the evolution of the spectral envelope. Toda et al. [9] introduced a GMM-based framework that considers dynamic features. Hidden Markov model (HMM)-based methods have recently been proposed [10], [11]. The state transition property in HMM-based methods provides a good approximation of the spectral envelope evolution along the time axis. The HMM is trained with the source and target speech data simultaneously, and it models the probability distribution of the feature vector sequence according to its state sequence, capturing the evolution of speech through the transition probabilities between states. However, modeling all the source and target speech in a joint HMM may cause confusion not only in the mixture densities but also in the transition probabilities. Furthermore, in a standard HMM, the state duration probability decreases exponentially with time, and this exponential state duration density is inappropriate for most signals [12].

Prosody helps listeners interpret utterances by grouping words into larger information units and drawing attention to specific words. Moreover, prosody conveys the speaker's attitude, confidence, and mood in a conversation. Silverman [13] indicated that a domain-specific prosody model can significantly improve the comprehension of synthesized speech. Kawanami et al. [4] and Murray and Arnott [5] showed that prosodic features provide useful cues to emotional speech expression. Rule-based prosody modeling has been used for prosody modification [14]–[16]; rules are invoked to imitate human pronunciation. However, designing these rules is labor-intensive, and collecting an appropriate and complete set of rules to describe prosodic diversity is difficult. Stochastic methods with sufficient training data can achieve a good approximation of utterance prosody and are appropriate for automatic learning of prosodic information. A dynamical system model has been introduced to model the pitch and energy contours of an utterance [17]. Conversely, the template tree [15] and decision tree [18] methods, which use a small training data set, have been proposed for pitch and duration modeling.
Fig. 1. Architecture of the proposed expressive speech synthesis system.
With significant features, the decision tree method derived from a small-sized training data set can achieve satisfactory performance.

Since voice conversion in this paper is performed by an analysis-by-synthesis scheme, the quality of the synthesized speech is highly dependent on the analysis-by-synthesis model. This paper adopts the Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT) algorithm, proposed by Kawahara et al. [19], [20], to estimate the spectrum and pitch contour of the neutral utterance synthesized by the TTS system. The STRAIGHT algorithm is a high-quality analysis and synthesis method, which uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region to remove signal periodicity. The algorithm extracts F0 (the fundamental frequency) with the Time-domain Excitation extractor using Minimum Perturbation Operator (TEMPO), and designs an excitation source using phase manipulation.

This paper presents a novel voice conversion model comprising source and target HMMs, called the Bi-HMM, for post-processing in the TTS system to achieve expressive speech synthesis. A gamma distribution is embedded as the duration model for each state in the Bi-HMM, which is hereinafter called the duration-embedded Bi-HMM (DeBi-HMM). Most TTS systems in Mandarin use about 1400 tonal syllables as basic synthesis units [14], [15], [21]. Considering the coarticulation between syllables and the expressive variation of tonal syllables in different textual contexts, a large speech database for each expressive style would be needed to build an expressive concatenative TTS system. In Mandarin speech, a syllable can be divided into two subsyllables comprising an initial part followed by a final part, producing a total of 58 context-independent subsyllables. A small speech database was designed and collected to cover all 58 context-independent subsyllables in Mandarin [15] for each expressive style. Conversion models are trained for each context-independent subsyllable. A large speech database with neutral style was collected as the unit inventory of the TTS system, and the synthetic neutral utterance is used as the input speech of the expressive voice conversion model for expressive speech synthesis.

Fig. 1 illustrates the proposed framework. The STRAIGHT algorithm is adopted to estimate the spectrum and pitch contour of the neutral speech synthesized by the TTS system. The estimated STRAIGHT spectrum is converted using the DeBi-HMM conversion model. A style-dependent decision tree is employed to convert the prosodic features using the context information from the TTS system.
Fig. 2. State alignment between the source and target HMMs which compose the Bi-HMM.
Finally, the expressive speech is synthesized by the STRAIGHT algorithm, using the converted spectrum and prosodic features.

The rest of this paper is organized as follows. Section II describes the probabilistic framework of the proposed duration-embedded Bi-HMM model and introduces the prosody conversion based on the expressive style-dependent decision tree. The experiments and results are described in Section III. Finally, the conclusion is drawn in Section IV.

II. PROPOSED CONVERSION METHOD

This paper presents a joint estimation of source and target HMMs, called the Bi-HMM, for spectrum conversion. The Bi-HMM is adopted to model the spectral envelope evolution of the source and target speech separately but synchronously. The conversion function for each state is estimated with the MMSE criterion under a conditional normal assumption. For state duration modeling, a gamma distribution is utilized to refine the state duration model [12], [22]. A style-dependent decision tree is exploited for prosody conversion.

A. Bi-HMM Spectral Conversion Model

Fig. 2 illustrates the state alignment between the source and target HMMs which compose the Bi-HMM. A round node denotes a random variable, and an arrow denotes the conditional dependence between two variables. The upper portion depicts the relationship between the source feature vector sequence $\mathbf{x} = x_1 x_2 \cdots x_{T_x}$ with length $T_x$ and its corresponding state sequence $\mathbf{q} = q_1 q_2 \cdots q_{T_x}$.
The current observation $x_t$ is conditionally independent of the previous observations given the current state $q_t$. Similarly, the state $q_t$ is conditionally independent of the past states given the immediately preceding state $q_{t-1}$. The lower portion depicts the relationship between the target feature vector sequence $\mathbf{y} = y_1 y_2 \cdots y_{T_y}$ and its corresponding state sequence $\mathbf{p} = p_1 p_2 \cdots p_{T_y}$.

In the training phase, the expectation-maximization (EM) algorithm [23] is used to estimate the Bi-HMM parameter set $\lambda$. The source and target speech feature vector sequences $\mathbf{x}$ and $\mathbf{y}$ of each corresponding speech segment pair are aligned by the dynamic time warping (DTW) algorithm. The continuous conversion functions for each Gaussian component in a given state are estimated with the MMSE criterion [24]. For a given state $p$ and an input source feature vector $x_t$, the converted target vector $\hat{y}_t$ is predicted using the following conversion function:

$$\hat{y}_t = F(x_t \mid p) = \sum_{m=1}^{M} h_m(x_t \mid p)\left[\mu^{y}_{p,m} + \Sigma^{yx}_{p,m}\left(\Sigma^{xx}_{p,m}\right)^{-1}\left(x_t - \mu^{x}_{p,m}\right)\right] \qquad (1)$$

where $M$ denotes the number of mixture components; $\mu^{x}_{p,m}$ and $\mu^{y}_{p,m}$ denote the mean vectors of the source and target feature vectors in mixture $m$ of state $p$, respectively; $\Sigma^{xx}_{p,m}$ is the covariance matrix of the source feature vectors; $\Sigma^{yx}_{p,m}$ is the cross-covariance matrix of the target and source feature vectors; and $h_m(x_t \mid p)$ represents the conditional probability that $x_t$ belongs to mixture $m$ in state $p$, estimated as

$$h_m(x_t \mid p) = \frac{w_{p,m}\, N\!\left(x_t; \mu^{x}_{p,m}, \Sigma^{xx}_{p,m}\right)}{\sum_{k=1}^{M} w_{p,k}\, N\!\left(x_t; \mu^{x}_{p,k}, \Sigma^{xx}_{p,k}\right)} \qquad (2)$$

where $w_{p,m}$ denotes the weight of mixture $m$ in state $p$.
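For illustration only, the following minimal Python sketch shows how the per-state MMSE mapping in (1) and (2) could be evaluated for one source frame. It is not the authors' implementation; the parameter arrays (`weights`, `means_x`, `means_y`, `cov_xx`, `cov_yx`) are hypothetical names for the mixture parameters of a single state.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x_t, weights, means_x, means_y, cov_xx, cov_yx):
    """MMSE conversion of one source frame for a single state, cf. (1)-(2).

    weights : (M,)       mixture weights w_{p,m}
    means_x : (M, D)     source mean vectors
    means_y : (M, D)     target mean vectors
    cov_xx  : (M, D, D)  source covariance matrices
    cov_yx  : (M, D, D)  target-source cross-covariance matrices
    """
    M = len(weights)
    # Mixture posterior h_m(x_t | p), eq. (2)
    likes = np.array([weights[m] * multivariate_normal.pdf(x_t, means_x[m], cov_xx[m])
                      for m in range(M)])
    post = likes / likes.sum()
    # MMSE estimate of the target frame, eq. (1)
    y_hat = np.zeros(means_y.shape[1])
    for m in range(M):
        shift = cov_yx[m] @ np.linalg.solve(cov_xx[m], x_t - means_x[m])
        y_hat += post[m] * (means_y[m] + shift)
    return y_hat
```

In the experiments reported below, the covariance and cross-covariance matrices are diagonal, so the matrix solve reduces to an element-wise division.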
Given the Bi-HMM parameter set $\lambda$ and the source and target feature vector sequences $\mathbf{x}$ and $\mathbf{y}$, the state sequences $\mathbf{q}$ and $\mathbf{p}$ with maximum joint probability are obtained as follows:

$$(\mathbf{q}^{*}, \mathbf{p}^{*}) = \arg\max_{\mathbf{q},\mathbf{p}} \prod_{t=1}^{T_x} a_{q_{t-1}q_t}\, b_{q_t}(x_t) \prod_{t=1}^{T_y} a'_{p_{t-1}p_t}\, b_{p_t}(y_t) \qquad (3)$$

where $a_{q_{t-1}q_t}$ and $a'_{p_{t-1}p_t}$ denote the state transition probabilities for the source and target feature vector sequences $\mathbf{x}$ and $\mathbf{y}$ in the Bi-HMM, respectively. There is a corresponding converted feature sequence for each state sequence of the source HMM. The target HMM is used to score each converted feature sequence. $(\mathbf{q}^{*}, \mathbf{p}^{*})$ denotes the best state sequence pair with maximum joint probability, obtained from the Bi-HMM using the Viterbi algorithm [25]. Given the input $\mathbf{x}$, a candidate target $\hat{\mathbf{y}}^{(\mathbf{q},\mathbf{p})}$ is predicted for a specific state sequence pair $(\mathbf{q},\mathbf{p})$. The optimal state sequence pair $(\hat{\mathbf{q}}, \hat{\mathbf{p}})$ in the conversion phase is obtained by

$$(\hat{\mathbf{q}}, \hat{\mathbf{p}}) = \arg\max_{\mathbf{q},\mathbf{p}} \prod_{t=1}^{T_x} a_{q_{t-1}q_t}\, b_{q_t}(x_t)\, a'_{p_{t-1}p_t}\, b_{p_t}\!\left(\hat{y}^{(\mathbf{q},\mathbf{p})}_t\right) \qquad (4)$$

The target vector sequence with maximum probability in the Bi-HMM is obtained by the following conversion function:

$$\hat{y}_t = F\!\left(x_t \mid \hat{p}_t\right), \quad t = 1, \ldots, T_x \qquad (5)$$

B. Embedded Duration Model

In a standard HMM, the state duration probability decreases exponentially with time. When the speech signal stays in state $i$, with self-transition probability $a_{ii}$, for $d$ frames, the implicit duration probability density is the geometric distribution $P_i(d) = a_{ii}^{\,d-1}(1 - a_{ii})$, which means that a duration of one frame always has the highest probability, regardless of the self-transition probability estimated by the EM algorithm. This exponential state duration density is inappropriate for most speech signals [12], [22]. In [12], the divergences between the empirical distribution and different parametric distributions are compared, and the gamma distribution yields the lowest divergence compared with the Gaussian and Poisson distributions. This paper utilizes the following gamma duration model:

$$P(d \mid \nu, \eta) = \frac{\eta^{\nu} d^{\nu-1} e^{-\eta d}}{\Gamma(\nu)} \qquad (6)$$

where $\Gamma(\cdot)$ represents the gamma function, and $\nu$ and $\eta$ are the parameters of the gamma distribution. The EM algorithm is adopted to estimate the parameters of the gamma distribution. The E-step calculates the expectation of the log likelihood of the complete data given the new estimates $(\hat{\nu}, \hat{\eta})$, after having the current estimates $(\nu, \eta)$, given by

$$Q(\hat{\nu}, \hat{\eta} \mid \nu, \eta) = \sum_{\tau=1}^{T} \sum_{d \geq 1} \gamma_{\tau}(d)\, \log P(d \mid \hat{\nu}, \hat{\eta}) \qquad (7)$$

where $\tau$ is the starting frame of a state, $T$ represents the total number of frames of a training utterance, and $\gamma_{\tau}(d)$ is defined as

$$\gamma_{\tau}(d) = \sum_{\mathbf{q}} P(\mathbf{q} \mid \mathbf{x}, \lambda)\, \delta\!\left(d, d_{\tau}(\mathbf{q})\right) \qquad (8)$$

where $\delta(\cdot,\cdot)$ is the Kronecker delta function and $d_{\tau}(\mathbf{q})$ denotes the duration of the state run starting at frame $\tau$ under state sequence $\mathbf{q}$. Unfortunately, no closed-form solution for the new estimates $\hat{\nu}$ and $\hat{\eta}$ can be obtained from (7). Burshtein [26] derived the maximum likelihood estimates of the gamma parameters $\nu$ and $\eta$ by an empirical method. Since the gamma distribution in (6) has mean $\nu/\eta$ and variance $\nu/\eta^{2}$, the parameters are derived empirically from the sample mean and variance. The state duration models of the source and target HMMs in the Bi-HMM are used to score each state sequence and not to predict the state duration. The Viterbi algorithm is adopted to decode the best state sequence pair of the source and target HMMs [25].
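As an illustration of the embedded duration model, the sketch below computes the gamma duration log-density in (6) and estimates its parameters from per-state duration samples using the moment relations stated above (mean $\nu/\eta$, variance $\nu/\eta^2$). It is a schematic example under the assumption that the duration samples come from forced alignment; it is not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def fit_gamma_duration(durations):
    """Moment-based estimates of the gamma parameters (nu, eta) from
    state-duration samples measured in frames: mean = nu/eta, var = nu/eta**2."""
    d = np.asarray(durations, dtype=float)
    mean, var = d.mean(), d.var()
    eta = mean / var        # rate parameter
    nu = mean * eta         # shape parameter
    return nu, eta

def log_gamma_duration(d, nu, eta):
    """Log of the gamma duration density in (6)."""
    return nu * np.log(eta) + (nu - 1.0) * np.log(d) - eta * d - gammaln(nu)

# Example: score a candidate duration of 12 frames for one state.
nu, eta = fit_gamma_duration([8, 10, 11, 13, 14, 9, 12])
print(log_gamma_duration(12, nu, eta))
```

During decoding, such a log-score can be added to the path score of the corresponding state run in place of the implicit geometric duration penalty of a standard HMM.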
TABLE I LINGUISTIC QUESTION SET USED TO TRAIN THE DECISION TREE FOR PROSODY CONVERSION
C. Prosody Conversion by Decision Tree

The prosody conversion decision tree (PDT) is derived from the prosodic features of a syllable. All features are recorded in the decision tree as ratios with respect to the neutral style. Pitch, energy, and duration are the most widely used prosodic features in speech synthesis. This work adopted the ratios of syllable duration, pitch mean, pitch dynamic range, energy mean, and energy dynamic range for prosody conversion. The ratios were estimated from the emotionally parallel speech database. Table I shows the linguistic question set used to train the decision tree, which includes linguistic features at the word and sentence levels. At the word level, the tones of a syllable in Mandarin fall into five categories, tone 1 to tone 5. The word length ranges from one to three syllables, and only the eight major parts of speech (verb, adjective, noun, adverb, preposition, conjunction, particle, and interjection) [27] were considered. At the sentence level, the position of a word in a phrase is divided into three sections, namely front, middle, and end, each of which spans one-third of the number of syllables in the phrase. The same definition is adopted for the position of a word in a sentence.

III. EXPERIMENTS AND RESULTS

Several subjective and objective evaluations were conducted using statistical hypothesis testing. A set of small-sized speech databases for six expressive styles was designed and collected to train the voice conversion models. Natural speech, rather than synthetic speech, was collected and used as the training data. For feature extraction, mel-frequency cepstral coefficients (MFCCs) were calculated from the smoothed spectrum extracted by the STRAIGHT algorithm. The analysis window was 23 ms, with a window shift of 8 ms. The cepstral order was set to 45. Since the proposed method was adopted as a post-processing module of a TTS system, the speech synthesized by the TTS system was used as the input of the voice conversion model. A total of 1410 distinct tonal syllables were used as the basic units to synthesize neutral speech. The experiment was performed with a speech database consisting of 5613 sentences, pronounced by a female native speaker with a neutral expression in a professional recording studio. The speaker was a radio announcer and was familiar with our study. All utterances were recorded at a sampling rate of 22.05 kHz with 16-bit resolution. The duration of the collected speech database was 5.46 h. An HMM-based speaker-dependent speech recognizer was applied to align the speech segments of each syllable [28].
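Referring back to the ratio-based prosody conversion of Section II-C, the following sketch illustrates how decision-tree-predicted ratios might be applied to the neutral prosody of one syllable. The feature encoder, the regressor `pdt`, and the way the mean and dynamic-range ratios are combined are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def convert_prosody(syllable, pdt, encode_linguistic_features):
    """Apply style-dependent ratios (duration, pitch mean, pitch range,
    energy mean, energy range) to one syllable's neutral prosody.

    syllable : dict with 'duration' (s), 'f0' (Hz contour), 'energy' (rms contour)
    pdt      : trained regressor mapping the Table I features to the five ratios
    """
    feats = np.asarray(encode_linguistic_features(syllable)).reshape(1, -1)
    dur_r, f0_mean_r, f0_range_r, en_mean_r, en_range_r = pdt.predict(feats)[0]

    f0 = np.asarray(syllable['f0'], dtype=float)
    en = np.asarray(syllable['energy'], dtype=float)
    # Scale the contour mean and its dynamic range separately (one possible scheme).
    new_f0 = f0.mean() * f0_mean_r + (f0 - f0.mean()) * f0_range_r
    new_en = en.mean() * en_mean_r + (en - en.mean()) * en_range_r
    return {'duration': syllable['duration'] * dur_r, 'f0': new_f0, 'energy': new_en}
```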
A. Speech Database Design and Data Collection

The expressive styles were chosen first when compiling the set of speech databases for expressive voice conversion. Six expressive styles based on Eide's illustration [29] were adopted, namely happiness, sadness, anger, confusion, apology, and question. These six expressive styles are commonly defined and used in expressive TTS systems. A context-independent subsyllable-balanced script was designed for each expressive style. The script was selected from a large sentence pool for each expressive style. Table II summarizes the details of the selected text scripts. All of the sentences were pronounced by the same speaker as for the TTS system. All utterances were recorded at a sampling rate of 22.05 kHz and a resolution of 16 bits. Table III summarizes the collected speech data in detail. The sad speech had the lowest mean F0 and the smallest dynamic range. This result supports findings in previous studies [1], [2]. Furthermore, happiness had the shortest syllable duration, and anger had the largest dynamic range of root-mean-square (rms) energy. For each expressive style, fifteen sentences other than those in the collected speech database were also randomly chosen as the test set. These sentences were also pronounced by the same speaker and recorded in the same environment as the target utterances for comparison with the converted utterances. All the utterances were aligned by an HMM-based speaker-dependent speech recognizer and refined manually to train the conversion models.

B. Objective Test

Initially, each sentence in the test set was synthesized by the TTS system. The synthesized utterances were then further converted using four spectral conversion methods: 1) GMM-based; 2) HMM-based; 3) Bi-HMM-based; and 4) DeBi-HMM-based voice conversion. The conversion function in (1) was simplified such that both the covariance and the cross-covariance matrices were diagonal. The HMMs are left-to-right models, including 21 three-state context-independent initial-part models and 37 five-state final-part models. The same numbers of states were applied for the source and target HMMs in the Bi-HMM and DeBi-HMM. The EM algorithm was adopted to train the conversion models for each subsyllable, and the DTW process was performed on each subsyllable segment pair according to the segment boundaries. To determine the number of Gaussian components, Fig. 3 illustrates the average rms log-spectral distortion between the target and converted utterances as a function of the number of Gaussian components in each state. All 300 parallel utterances were used as the training data for each expressive style. The rms log-spectral distortion was calculated using the MFCCs derived from the smoothed spectrum extracted by the STRAIGHT algorithm:

$$D = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(y_t(k) - \hat{y}_t(k)\right)^{2}} \qquad (9)$$

where $y_t$ and $\hat{y}_t$ denote the target and the spectrum-converted feature vectors at time $t$, respectively, and $K$ is the cepstral order. The number of Gaussian components was set to 128 in the following experiments.
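As a concrete illustration of (9), the average rms distortion between time-aligned target and converted MFCC sequences could be computed as in this minimal sketch; the alignment and feature extraction steps are assumed to have been done already, and this is not the exact implementation used in the paper.

```python
import numpy as np

def rms_log_spectral_distortion(target_mfcc, converted_mfcc):
    """Average per-frame RMS distance between target and converted MFCC
    vectors (both of shape (T, K)), corresponding to (9)."""
    diff = np.asarray(target_mfcc) - np.asarray(converted_mfcc)
    per_frame = np.sqrt(np.mean(diff ** 2, axis=1))  # RMS distance per frame
    return float(per_frame.mean())                   # average over the utterance
```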
TABLE II STATISTICS OF EXPRESSIVE SPEECH DATABASES
TABLE III ACOUSTIC PROPERTIES OF EXPRESSIVE SPEECH DATABASES (NUMBER OF SENTENCES: 300; UNIT: SYLLABLE)
Fig. 3. Objective experimental results for the number of Gaussian components.
Fig. 4 shows the rms log-spectral distortion between the target and converted speech as a function of the number of training sentences. The HMM-based methods (HMM, Bi-HMM, and DeBi-HMM), which exploit the dynamic characteristics of speech signals, yield lower distortion than the GMM-based method.
Fig. 4. Objective experimental results for the number of training sentences.
Fig. 5 shows the distortion for each spectral conversion method in each expressive style, indicating that DeBi-HMM produces less distortion than Bi-HMM, since it models the state duration more accurately.
TABLE IV AVERAGE STATE OCCUPATION TIME (IN MILLISECONDS) OF THE SECOND STATE OF 15 FINAL PARTS IN NEUTRAL STYLE FOR HMM AND DeBi-HMM, WITH t-TEST USING A SIGNIFICANCE LEVEL OF p < 0.05 (*: p < 0.05; +: p > 0.05)
TABLE V AVERAGE REAL TIME TO CONVERT A SENTENCE FOR DIFFERENT SPECTRAL CONVERSION METHODS
Fig. 5. Distortion for different approaches in each expressive style. (Number of sentences: 300)
Fig. 6. Distortion contour for a syllable with 158 consecutive frames.
Fig. 6 shows the frame rms log-spectral distortions measured for a syllable converted from neutral to angry expression by Bi-HMM and DeBi-HMM. Fig. 7 shows the state duration distributions of the second state of the final part of the Mandarin word for "zero" (/ling/) for HMM and DeBi-HMM. In this figure, the empirical state duration distribution of each model was calculated from the results of forced state alignment on the training data. The estimated state duration distribution of the HMM was calculated from the estimated state self-transition probability.
Fig. 7. State duration distributions for HMM and DeBi-HMM.
The estimated state duration distribution of the DeBi-HMM was plotted according to the estimated parameters of the embedded gamma distribution. Analytical results indicate that the estimated state duration distribution of the DeBi-HMM is similar to its empirical distribution and different from the geometric distribution. Table IV lists the average state occupation times of the second state of 15 final parts aligned by HMM and DeBi-HMM, with the t-test using a significance level of p < 0.05 [30].

Table V lists the average real time for each sentence, measuring the computational cost of the four spectral conversion methods. The real time was calculated by dividing the duration of computation by the duration of the sentence. Twenty sentences with a mean length of utterance (MLU) [31] of 14.2 ± 1.7 syllables were selected at random as the test data. These sentences were further synthesized by the TTS system, and the average duration of each utterance was 2.87 ± 0.31 s. As revealed in Table V, DeBi-HMM had the highest computational cost. The system was run on a personal computer with a 1.5-GHz CPU and 512 MB of RAM.

C. Subjective Test

Each test sentence in the test set was synthesized by the TTS system and further converted using the following conversion methods for each expressive style: 1) GMM-based spectral conversion; 2) HMM-based spectral conversion; 3) Bi-HMM-based spectral conversion;
4) DeBi-HMM-based spectral conversion; 5) PDT-based prosody conversion; and 6) DeBi-HMM-based spectral conversion combined with PDT-based prosody conversion. The numbers of Gaussian components and states were the same as those applied in the objective tests. All 300 sentence pairs were used to train the spectral conversion models and the prosody conversion decision tree (PDT) for each expressive style. The total number of utterances presented to each listener was 540. A double-blind experiment was conducted in the subjective study [30]. For each test sentence randomly selected from the test set, the 36 converted utterances processed by each conversion method for each expressive style were presented to the human subjects in random order. Twenty adult subjects, aged 22–31 years, were asked to classify each utterance as one of the six expressive styles. The subjects were familiar with our study. Fig. 8 shows the classification results and indicates that the Bi-HMM-based method is better than the HMM-based method. DeBi-HMM, with duration modeling, gave the highest identification rate. Although prosody controls most of the perceptual effect of the expressive style, combining spectral conversion and prosody conversion achieved the highest identification rate of 78.3% on average in this experiment. Table III lists the number of leaves of each decision tree, indicating that the number of leaves in the PDT was largest for the happiness style. This result reveals the variety of prosodic features in happiness.

Fig. 8. Listening test results on expressive style identification for different approaches.

The naturalness of the converted utterances was also evaluated according to a five-point scale (from 1, very poor, to 5, excellent). Fig. 9 compares the various conversion methods in terms of the mean opinion score (MOS) and its standard deviation. The results reveal a significant difference between any two methods using the t-test with a significance level of p < 0.05 [30]. The results indicate that prosodic features characterize the main expression in speech and that spectral conversion strengthens the expressive effect in expressive voice conversion.

Fig. 9. Mean opinion scores for different approaches.
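The significance statements above can be checked with a standard t-test at the 5% level. The sketch below uses a paired test from SciPy on hypothetical per-listener scores (placeholders, not data from the paper), since the exact test variant is not specified here.

```python
from scipy import stats

# Hypothetical per-listener MOS values for two methods (illustrative placeholders).
scores_bihmm   = [3.4, 3.1, 3.6, 3.3, 3.5, 3.2, 3.4, 3.0]
scores_debihmm = [3.8, 3.5, 3.9, 3.6, 3.7, 3.4, 3.8, 3.5]

t_stat, p_value = stats.ttest_rel(scores_debihmm, scores_bihmm)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant at 0.05: {p_value < 0.05}")
```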
IV. CONCLUSION

This paper presents a novel DeBi-HMM framework as a post-processing module of a TTS system for spectral conversion. In this framework, the source and target speech signals are modeled separately but synchronously with two HMMs. The state duration model is embedded into the stochastic spectral conversion model. For prosody conversion, the prosodic features of the spectrum-converted speech are modified based on the prosody ratios modeled by the expressive style-dependent decision tree. Results from the objective experiments confirm the reduction of distortion between the converted and target expressive speech. The embedded state duration model improves the stability of the state sequence modeling and reduces the spectral distortion. Subjective tests reveal that prosody carries most of the expression cues, but spectral conversion is also important in emotion expression. Although the identification rates of Bi-HMM and DeBi-HMM are similar, the proposed DeBi-HMM yields higher converted speech quality than Bi-HMM. Moreover, a series of comparisons with expressive-style TTS systems is needed to further demonstrate the effectiveness of the proposed voice conversion framework. This paper presents a voice conversion framework for expressive speech synthesis with small-sized speech data; since system performance depends on the size of the training data, collecting more training data is expected to further improve performance.

REFERENCES
[1] M. Schröder, “Emotional speech synthesis—A review,” in Proc. EUROSPEECH, vol. 1, Aalborg, Denmark, 2001, pp. 561–564.
[2] A. Iida, F. Higuchi, N. Campbell, and M. Yasumura, “A corpus-based speech synthesis system with emotion,” Speech Commun., vol. 40, no. 1–2, pp. 161–187, 2003.
[3] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, “Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E88-D, no. 3, pp. 502–509, 2005.
[4] H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, and K. Shikano, “GMM-based voice conversion applied to emotional speech synthesis,” in Proc. EUROSPEECH, Geneva, Switzerland, 2003, pp. 2401–2404.
[5] I. R. Murray and J. L. Arnott, “Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion,” J. Acoust. Soc. Amer., vol. 93, no. 2, pp. 1097–1108, 1993.
[6] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” in Proc. ICASSP, Tokyo, Japan, 1988, pp. 655–658.
[7] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, Mar. 1998.
[8] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, Seattle, WA, 1998, pp. 285–288.
[9] T. Toda, A. W. Black, and K. Tokuda, “Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter,” in Proc. ICASSP, vol. 1, Philadelphia, PA, Mar. 2005, pp. 9–12.
[10] H. Duxans, A. Bonafonte, A. Kain, and J. van Santen, “Including dynamic and phonetic information in voice conversion systems,” in Proc. ICSLP, Jeju Island, South Korea, 2004, pp. 5–8.
[11] E. K. Kim, S. Lee, and Y. H. Oh, “Hidden Markov model-based voice conversion using dynamic characteristics of speaker,” in Proc. EUROSPEECH, vol. 5, Rhodes, Greece, Sep. 1997, pp. 2519–2522.
[12] J. T. Chien and C. H. Huang, “Bayesian learning of speech duration models,” IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 558–567, Nov. 2003.
[13] K. Silverman, “On customizing prosody in speech synthesis: Names and addresses as a case in point,” in Proc. ARPA Workshop Human Lang. Technol., Princeton, NJ, 1993, pp. 317–322.
[14] L. S. Lee, C. Y. Tseng, and C. J. Hsieh, “Improved tone concatenation rules in a formant-based Chinese text-to-speech system,” IEEE Trans. Speech Audio Process., vol. 1, no. 3, pp. 287–294, May 1993.
[15] C. H. Wu and J. H. Chen, “Automatic generation of synthesis units and prosodic information for Chinese concatenative synthesis,” Speech Commun., vol. 35, no. 3–4, pp. 219–237, 2001.
[16] D. H. Klatt, “Review of text-to-speech conversion for English,” J. Acoust. Soc. Amer., vol. 82, no. 3, pp. 737–793, 1987.
[17] K. N. Ross and M. Ostendorf, “A dynamical system model for generating fundamental frequency for speech synthesis,” IEEE Trans. Speech Audio Process., vol. 7, no. 3, pp. 295–309, May 1999.
[18] Z. W. Shuang, Z. X. Wang, Z. H. Ling, and R. H. Wang, “A novel voice conversion system based on codebook mapping with phoneme-tied weighting,” in Proc. ICSLP, Jeju Island, South Korea, 2004, pp. 1197–1200.
[19] H. Kawahara, “Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited,” in Proc. ICASSP, vol. 2, Munich, Germany, 1997, pp. 1303–1306.
[20] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun., vol. 27, no. 3–4, pp. 187–207, 1999.
[21] S. H. Hwang, S. H. Chen, and J. R. Wang, “A Mandarin text-to-speech system,” Int. J. Comput. Ling. Chinese Lang. Process., vol. 1, no. 1, pp. 87–100, 1996.
[22] S. E. Levinson, “Continuously variable duration hidden Markov models for automatic speech recognition,” Comput. Speech Lang., vol. 1, pp. 29–45, 1986.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[24] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[25] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[26] D. Burshtein, “Robust parametric modeling of duration in hidden Markov models,” IEEE Trans. Speech Audio Process., vol. 4, no. 3, pp. 240–242, May 1996.
[27] “The CKIP categorical classification of Mandarin Chinese (in Chinese),” Chinese Knowledge Information Processing Group, Academia Sinica, Taipei, Taiwan, CKIP Tech. Rep. 93–05, 1993.
[28] C. H. Wu and Y. J. Chen, “Recovery of false rejection using statistical partial pattern trees for sentence verification,” Speech Commun., vol. 43, pp. 71–88, 2004.
[29] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, “A corpus-based approach to expressive speech synthesis,” in Proc. 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA, 2004, pp. 79–84.
[30] S. Shott, Statistics for Health Professionals. Philadelphia, PA: Saunders, 1990.
[31] R. Brown, A First Language: The Early Stages. Cambridge, MA: Harvard Univ. Press, 1973.
Chung-Hsien Wu (SM’03) received the B.S. degree in electronics engineering from National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1981 and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University (NCKU), Tainan, Taiwan, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, NCKU, Tainan. He became a Professor in August 1997. From 1999 to 2002, he served as the Chairman of the Department. He also worked at the Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, in summer 2003 as a Visiting Scientist. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing-impaired. Dr. Wu is a member of the International Speech Communication Association (ISCA) and ROCLING.
Chi-Chun Hsia received the B.S. degree in computer science from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 2001. His research interests include digital signal processing, text-to-speech synthesis, natural language processing, and speech recognition.
Te-Hsien Liu received the B.S. degree in electronic engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C., in 2000, and the M.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2005. His research interests include digital signal processing, natural language processing, and text-to-speech synthesis.
Jhing-Fa Wang (F’99) received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University (NCKU), Tainan, Taiwan, R.O.C., in 1973 and 1979, respectively, and the Ph.D. degree in computer science and electrical engineering from the Stevens Institute of Technology, Hoboken, NJ, in 1983. He is now a Chair Professor at NCKU. He developed a Mandarin speech recognition system called Venus-Dictate, known as a pioneering system in Taiwan. He is currently leading a research group of different disciplines for the development of “advanced ubiquitous media for created cyberspace.” He has published about 91 journal papers, 217 conference papers, and obtained five patents since 1983. His research areas include ubiquitous content-based media processing, speech recognition, and natural language understanding. Dr. Wang is now the Chairman of IEEE Tainan Section. He received outstanding awards from the Institute of Information Industry in 1991 and the National Science Council of Taiwan in 1990, 1995, and 1997, respectively. He has been invited to give the keynote speech at the Pacific Asia Conference on Language, Information, and Computation (PACLIC 12), Singapore, and served as the General Chairman of the International Symposium on Communication (ISCOM) 2001, Taiwan. He was an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS.